Just Throw It Into Postgres
When you think of databases... you probably picture a neatly organized schema of multiple tables, connected by unique ids, listing names, addresses and other well-defined entities. It's either something corporate, or... maybe a PHP webapp? A forum?
Obviously, if your data does not look like this, you shouldn't use a database. Right...?
Or at least you should try making your data look like this. Anyone remember normal forms? There is actual literature on how to do this the right way.
Meanwhile also: once upon a time I read a blog post (that I, sadly, cannot find anymore) about how, if you were to archive all your social media posts, you should aim for just saving the results of the curl requests you make in their original form, as boring blobs. It is very tempting to turn them into a neat, well-organized database structure that you can search efficiently right away. However, this assumes you have already thought of everything you might ever want out of this data. What if you find some exciting new piece of metadata in them, something you didn't care about before? What if your original conversion script had a bug that you'd now like to fix? What if the site you're archiving suddenly changes its format? (Your scraper scripts are now broken...? Or maybe just in subtle ways?)
Meanwhile, if you just save the original bytes, you can fix everything later. Plus, given the small amount of data your own social media posts add up to (... which is what that post was primarily concerned with), a few Python scripts can go through all of it every single time you're looking for something.
So... now we have two roughly independent arguments against using relational databases!
(and thus this post is clearly going to end up recommending databases for, um, about everything. Anything else would be way too reasonable. One shouldn't expect any better from people who also put obnoxiously rotating webgl cubes on their blog front page.)
The Middle Ground
On one hand, there is the temptation to turn everything into neat database tables. That way lies... well, IBM i a.k.a. AS/400 definitely deserves credit for going all the way with this!
The other way is UNIX, with its many files, organized into a neat, hierarchical directory structure, and saying nothing whatsoever about what goes into those files. Which then... turn out to be trees themselves, too. Meanwhile... you really need to name them properly, so that you can find them. Does your directory structure look like year/month/day/topic for your photos? Or is it topic/year? And then you name your files IMG_YYYYMMDD_HHMMSS.jpg. But if you want your videos to sort well next to your photos, you should probably start prefixing them with IMG_ too...?
(... and while you might have outsourced your photo storage into something that is one long linear, one-dimensional stream of stuff on someone else's computer, you likely still need to organize other things.)
Instead, what if you took what would have been, say, text files or JSON documents that you planned on organizing into a neat directory structure... and just threw them into a relational database, without much concern about what the right way of doing so is?
Or whether the kind of data you're storing fits into the general concept of a... "database"?
Example 1: a Java IDE
Java is the language that is both well organized enough... and tedious enough to write otherwise... that people tend to reach for an actual IDE as soon as possible. You're obviously not going to write out all those import statements by hand, right?
This was enough to keep me using IntelliJ derivatives at work for years. Until it decided to re-index everything for the 23rd time, taking an excessively long time (there might have been custom company add-ons involved), making me think... could we just... do this better?
Emacs is great, as an editor. What it cannot do well (in itself): "jump to definition". Or find all references. Or, for that matter, find the most likely import for you & add it to the top of the file. So we'll need to build something extra.
(This was all before LSP, by the way.)
Obviously, in order to build up a fancy database of all the cross-references, you need to parse all the source code. As it happens, there is an excellent tool that can understand source code as well as the Java compiler does... and which also works well with your build system.
It is... the Java compiler.
Unlike native binaries, Java class files aren't especially far from the original source code. You can still find all the symbols in them and, with the help of libraries like ASM, just list every single function call your code makes, using a tool a couple hundred lines long, of complexity comparable to a medium-ugly shell script. What you end up with is maybe hundreds of thousands of records, each stating things like "function A calls function B" or "function A is a member of class C".
At this point you just throw the entire thing into Postgres. Using "copy", all it takes is a few tens of seconds. In the process, you can also nuke everything out of your old database table. After all, we do not really care about "persistence" here.
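To make it concrete, here is roughly what that load could look like. This is a minimal sketch in Python with psycopg2; the table layout, the column names and the calls.csv file full of "A calls B" records are made-up examples, not the exact tool described above.

```python
# Sketch: bulk-load "A calls B" style records into Postgres with COPY.
# The table layout and the calls.csv file are made-up examples.
import psycopg2

conn = psycopg2.connect("dbname=codeindex")
with conn, conn.cursor() as cur:
    # No need to care about persistence: nuke the old table and rebuild it.
    cur.execute("DROP TABLE IF EXISTS calls")
    cur.execute("""
        CREATE TABLE calls (
            caller_class  text,
            caller_method text,
            callee_class  text,
            callee_method text
        )
    """)
    # COPY is what makes hundreds of thousands of rows load in seconds.
    with open("calls.csv") as f:
        cur.copy_expert("COPY calls FROM STDIN WITH (FORMAT csv)", f)
conn.close()
```

The drop-and-recreate is deliberate: the class files remain the source of truth, and the table is just an index you can rebuild in those few tens of seconds.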
What you now have is an excellent indexing engine that most IDE developers have to work hard for, right at your fingertips. It was fast enough to call psql (the command-line Postgres tool) multiple times a second as part of an Emacs autocomplete, returning candidate class names! (... and then some more elisp injects the right import at the top of the file.)
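The query behind such a completion can stay almost embarrassingly simple. A sketch against the same made-up schema as above; in the real setup this was a psql one-liner driven from elisp, but the idea is the same:

```python
# Sketch: return candidate class names for an autocomplete prefix.
# Queries the made-up "calls" table from the previous snippet.
import psycopg2

def candidate_classes(prefix: str) -> list[str]:
    with psycopg2.connect("dbname=codeindex") as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT DISTINCT callee_class
            FROM calls
            WHERE callee_class LIKE %s
            ORDER BY callee_class
            LIMIT 50
            """,
            (prefix + "%",),
        )
        return [row[0] for row in cur.fetchall()]

print(candidate_classes("ArrayL"))  # e.g. ["java.util.ArrayList", ...]
```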
Example 2: Chinese characters
Make Me a Hanzi is an excellent project containing SVG outlines of tens of thousands of characters one might want to peruse / study. As it happens, it has so many SVG files that GitHub has trouble showing all of them. Your file system might or might not have similar reactions.
As with before, we do not have to make any decisions whatsoever about how to store this data. We can just, well... throw it into Postgres!
You can create an extremely stupid and not-well-thought-through Postgres table, with a JSONB column, for all the characters; the project also comes with a dictionary, which can end up in another such table. Postgres has great support for JSON data; it can pick specific subfields and add them to your query results, without you having to pretend that it's text & parse it yourself. With the right indexes (which you can add afterwards), and possibly by turning some important detail (which character?) into a column of its own, you now have a quick and comprehensive Chinese character database, with fast response times and the ability to join characters with their meanings.
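A sketch of what this could look like, again in Python with psycopg2. The table names, the graphics.txt / dictionary.txt file names and the JSON fields queried here are assumptions for illustration, not necessarily the project's exact layout:

```python
# Sketch: dump JSON lines into JSONB tables, then join them.
# Table names, file names and JSON fields are illustrative assumptions.
import json
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=hanzi")
with conn, conn.cursor() as cur:
    for table, path in [("graphics", "graphics.txt"), ("dictionary", "dictionary.txt")]:
        cur.execute(f"CREATE TABLE IF NOT EXISTS {table} (data jsonb)")
        # One JSON object per line, straight into the JSONB column.
        with open(path) as f:
            for line in f:
                cur.execute(f"INSERT INTO {table} (data) VALUES (%s)",
                            (Json(json.loads(line)),))

    # The index (and maybe a real column for the character) can come afterwards.
    cur.execute("CREATE INDEX IF NOT EXISTS graphics_char_idx "
                "ON graphics ((data->>'character'))")

    # Pick subfields out of the JSON and join characters with their meanings.
    cur.execute("""
        SELECT g.data->>'character', d.data->>'definition'
        FROM graphics g
        JOIN dictionary d ON d.data->>'character' = g.data->>'character'
        WHERE g.data->>'character' = %s
    """, ("你",))
    print(cur.fetchall())
conn.close()
```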
Example 3: when the room temperature changes...
... the sensor talks to an MQTT server. With some JSON payload.
Wouldn't it be nice to draw a neat graph of it? Using Grafana?
Except... you need to log the data somewhere? (since the MQTT server only stores the latest records.) Plus aggregate and query it somehow? Despite this being a slightly disorganized set of various JSON payloads?
There are... solutions for this.
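In the spirit of this post, one of them can be as small as a script that subscribes to the topic and throws every payload into a JSONB column. A sketch using paho-mqtt and psycopg2; the broker address, topic and table name are made up:

```python
# Sketch: log every MQTT message into a JSONB column, to be graphed later.
# Broker host, topic and table name are made-up examples.
import json
import paho.mqtt.client as mqtt
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=sensors")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            ts      timestamptz NOT NULL DEFAULT now(),
            topic   text,
            payload jsonb
        )
    """)

def on_message(client, userdata, msg):
    # Whatever JSON the sensor sends goes in as-is; we can pick fields out later.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO readings (topic, payload) VALUES (%s, %s)",
            (msg.topic, Json(json.loads(msg.payload))),
        )

client = mqtt.Client()  # paho-mqtt 1.x style constructor
client.on_message = on_message
client.connect("mqtt.example.local")
client.subscribe("home/+/temperature")
client.loop_forever()
```

Grafana's Postgres data source can then draw the graph from an ordinary SQL query, roughly SELECT ts, (payload->>'temperature')::float FROM readings ORDER BY ts.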