What database should you use?

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

Someone on Twitter asked for advice about databases, so I wrote this reply.

This is my current take, using rough categories and rough rules of thumbs:

1. If I’m working on a game, I often want a single abstraction for most of the items that the user can own. That is, I don’t want a separate database table for swords, magic wands, shields, or spears, instead I want one “Items” table and I’ll have a “type” field that will let me know if the Item is a sword or a magic wand or a shield or a spear. To keep the programming simple I will often take the EAV style to the extreme, and rely on polymorphism. For such situations, a document store such as MongoDB is a great database.

2. If I need to track financial data or real time data, where time is a crucial factor to the system, then I typically want to use a temporal or bi-temporal database such as Datomic or XTDB. You can also evaluate InfluxDB and TimescaleDB though I’ve been avoiding them, based on an analysis that I did in 2018. Of course, things might have changed since 2018. Timescale, in their own marketing, offers this observation which is true of this whole category: “In particular, time-series high-cardinality is a common problem in industrial IoT (e.g., manufacturing, oil & gas, utilities), as well as some monitoring and event data workloads.” Timescale is built on top of PostgreSQL, it’s basically a plugin that adds temporality to PostgreSQL.

3. If I’m building a project in PHP, for some reason, I might chose MySQL simply because PHP and MySQL grew up together and the entire culture is built around their relationship. This extends to tutorials and examples in documentation and answers found on StackOverflow — the connection between PHP and MySQL is deep and goes back 25 years. Back in the year 2000 Tim Perdue posted a great article, much discussed on Slashdot, that showed that Postgres was better than MySQL, and at that point it seemed likely that PHP programmers would give up on MySQL and switch to Postgres but that never happened. MySQL remains at the center of PHP culture, and acceptance of a culture matters if you want your project to maximize its chances of success.

4. If the customer or industry demands that I distribute a binary (this used to be common in all medical settings) then I’ll use a database that I can embed with the binary, such SQLite or DuckDB. Many hospitals still have IT teams that ask for binaries they can vet and then run inside of all of their firewalls, so in these situations one needs a way to embed the database with the binary.

5. For a huge company with many divisions, Kafka can function as the central source of truth. Jay Kreps invented Kafka at LinkedIn. In his essay, which he wrote in 2013, he said LinkedIn was storing 900 terabytes of data in Kafka. That was a lot of data back in 2013, and I can hardly imagine how big their Kafka system is now. There are some pain points with Kafka — when you try to resize or redesign the topics, sometimes the whole system freezes up. My friend Chris Clarke, the best devops guy I know, generally just copies all the data from one system to another, new system and reboots, when redoing the topic partitions.

6. Some datastores, such as ElasticSearch, are mostly used as cache (that is, normalized data is taken from your canonical datasource and denormalized for ease of query and then stored in ElasticSearch) but if your backend is simple while your frontend queries are complex, then you can use these systems as your primary datastore. I’ve only seen this done with phone apps where all the complexity was on the frontend.

7. I’ve never seen a key-value database used as a canonical database. I have seen systems such as DynamoDB used to speed lookup over huge datasets, but typically DynamoDB is being used as a kind of fast-lookup cache. Typically, it is storing data that has already gone through some processing, and so the canonical data is further upstream.

8. When you have a unique situation, sometimes you have to invent your own database. Nubank did that recently when it needed constant time lookup over truly enormous datasets, so it implemented “ideal hash keys” for a hard drive, and built its database as a hashing of the AWS NFS service, which is called EFS. Details are here: https://www.youtube.com/watch?v=8XgY1j1etOI

9. For understanding distributed databases, and their flaws, the most important, and comprehensive, research was done by Kyle Kingsbury, and most of his research, over the last 10 years, is published here: https://aphyr.com/tags/jepsen . He eventually built Jespen as the test suite for the failures of ACID that tend to plague distributed databases. If you decide to use a distributed database, always read his review of it first.

10. The problem of systems that had very high levels of write throughput first emerged back in the era of 2005 or 2010 when people first started talking about “Web Scale.” But the problem of heavy writes is even more intense if you’re dealing with real-time data from machines, such as sensors on a farm or an oil rig. At first, companies dealt with the problem by putting a queue in-front of the database, but that didn’t solve the main problem, so then we saw the first of the column-based NoSQL databases, such as Cassandra. At that time, Cassandra was the category leader, and it remains very good, details here: https://medium.com/@kajol_singh/cassandra-the-scalable-nosql-database-for-modern-applications-7e61cfd831e3 . But it is also true that in 2024 we are blessed with an abundance of datastores that can handle very high levels of writes. Still, both Cassandra and MongoDB have the advantage that they’ve been around a long time and they’ve been tested and they’ve matured. Chosing between the two of them mostly comes down to whether you want to enforce a particular schema, in which case chose MongoDB. Otherwise, chose Cassandra.

11. Finally, if you are building a typical website, and maybe using a framework such as Ruby on Rails or Django or Quarkus or Laravel then you should use PostgresSQL. It is the best, standard, open-source SQL database. It is endlessly flexible. It can be adapted to almost anything. As you build your company, and add more apps, and more complexity, you can go far with just PostgresSQL. It has the ability to mimic many of the features found in other databases. As a general rule, when you are in doubt about which database to use, you should always default to PostgresSQL.

I could offer some other comments about datastores that are mostly caches, such as Redis, but that is such a huge topic I would have to write a book to cover all of it.