Why do some companies use a database that is not an RDBMS?

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

I saw this question on Reddit. This would be my answer:

I’ve worked at many companies where they used an RDBMS but cheated massively, so in a sense it is a bit more honest to switch to something else.

For instance, when I was at PrivCo.com, they were using MySQL, but they had many tables that they were using as a cache. Worse, they had no clear sense of which of their tables were truly canonical. Inside the tables that were clearly supposed to be the starting point of the rest of the system, they had certain fields that were mere calculated aggregates; in other words, those fields were a bit of a cache, not truly canonical.

A well designed database is normalized to fourth or fifth normal form. But I’ve seen violations of this at the majority of companies where I’ve worked.

Some might feel this is controversial, but I also think that the use of the entity-attribute-value (EAV) pattern is already a violation of the idea of a relational database: you gain flexibility, but you can no longer determine the relationships that exist among the data strictly by looking at the database schema. As soon as you use the EAV pattern, you need to look in the code to figure out the meaning of the various “types”, and only then can you deduce what relationships are being kept in the EAV tables. The relationships are moved into your code; they no longer exist in the database. And yet the EAV pattern is very common.
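To make this concrete, here is a minimal sketch in Python using an in-memory SQLite database. The table, the attribute names, and the REFERENCE_ATTRIBUTES set are all hypothetical; the point is that nothing in the schema distinguishes an attribute that is really a foreign key from one that is a plain value. Only the application code knows.

```python
import sqlite3

# Minimal sketch of the entity-attribute-value (EAV) pattern.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entity_attributes (
        entity_id INTEGER,
        attribute TEXT,
        value     TEXT
    )
""")

# Two rows. Nothing in the schema says that 'parent_company' points at
# another entity while 'ticker' is a plain string -- that knowledge
# lives only in application code.
conn.executemany(
    "INSERT INTO entity_attributes VALUES (?, ?, ?)",
    [
        (1, "ticker", "ACME"),
        (1, "parent_company", "2"),  # a foreign key, but only by convention
    ],
)

# The application has to know which attributes are really references.
REFERENCE_ATTRIBUTES = {"parent_company"}

for entity_id, attribute, value in conn.execute(
    "SELECT entity_id, attribute, value FROM entity_attributes"
):
    kind = "relationship" if attribute in REFERENCE_ATTRIBUTES else "plain value"
    print(entity_id, attribute, value, "->", kind)
```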

At PrivCo.com the starting point of the database was supposed to be the “companies” table, in which data about corporations was maintained. Some early programmers added more and more fields to this one table. It ended up with 265 fields, many of which were unused, or had been used for only a few months and then abandoned, yet the fields were still there, holding some data from 2017. Much of the data in those 265 fields should have been moved to separate tables and established as one-to-many relationships. For instance, there was:

address_a
address_b
city
state
zip
address_a_2
address_b_2
city_2
state_2
zip_2
address_a_3
address_b_3
city_3
state_3
zip_3

In other words, the whole database was badly denormalized, in the manner you’d expect of a cache. There was a poor understanding of the need to maintain the distinction between canonical data and cache data.
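For illustration, here is a sketch, again in Python with SQLite, of what the normalized alternative might look like: the repeated address_a_2, city_3, etc. fields become rows in a child table with a one-to-many relationship back to companies. The table and column names here are my assumptions, not PrivCo’s actual schema.

```python
import sqlite3

# Sketch of the normalized alternative: addresses become rows in a
# child table instead of numbered column groups on the companies table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );

    CREATE TABLE company_addresses (
        id         INTEGER PRIMARY KEY,
        company_id INTEGER NOT NULL REFERENCES companies(id),
        address_a  TEXT,
        address_b  TEXT,
        city       TEXT,
        state      TEXT,
        zip        TEXT
    );
""")

conn.execute("INSERT INTO companies (id, name) VALUES (1, 'Acme Corp')")

# A company can now have any number of addresses, not just three.
conn.executemany(
    "INSERT INTO company_addresses (company_id, address_a, city, state, zip)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        (1, "1 Main St", "New York", "NY", "10001"),
        (1, "9 Dock Rd", "Newark", "NJ", "07101"),
    ],
)

for row in conn.execute(
    "SELECT c.name, a.address_a, a.city FROM companies c"
    " JOIN company_addresses a ON a.company_id = c.id"
):
    print(row)
```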

Even worse, when they decided to add foreign companies, they simply copied the “companies” table, and created “companies_foreign” which repeated the 265 fields.

I convinced them that they would eventually need to reform the database, but that was impossible for as long as their PHP Symfony monolith was making direct calls to the database. So I built a system to move data from MySQL to ElasticSearch, which we then used as an API, with the long-term goal of having the website make all of its calls to ElasticSearch. Once we had masked the reads, it would be easier to reform the MySQL database schema. (I left, so I have no idea what they eventually implemented.)
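A minimal sketch of that kind of sync job is below, assuming the mysql-connector-python and elasticsearch client libraries; the connection details, table name, and index name are hypothetical, and the real system was considerably more involved than this.

```python
# Hedged sketch of a MySQL -> ElasticSearch sync job.
import mysql.connector                             # pip install mysql-connector-python
from elasticsearch import Elasticsearch, helpers   # pip install elasticsearch

db = mysql.connector.connect(
    host="localhost", user="reader", password="secret", database="privco"
)
es = Elasticsearch("http://localhost:9200")

cursor = db.cursor(dictionary=True)
cursor.execute("SELECT id, name, state FROM companies")

# Denormalize each row into a document and bulk-index it, so the
# frontend can read from ElasticSearch instead of hitting MySQL.
actions = (
    {"_index": "companies", "_id": row["id"], "_source": row}
    for row in cursor
)
helpers.bulk(es, actions)
```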

However, it is worth noting that these companies survive, year after year, and make a profit. So in some sense, having a badly misshapen database schema is not fatal for a company.

The rise of Kafka, and the ideas around having an immutable log, have reversed how people think about this. If the historical log is the source of all truth, then the RDBMS is simply another kind of cache. And once the RDBMS is nothing but a cache, you have to wonder: do you need an RDBMS at all?

The pattern I saw a lot 10 years ago was that the RDBMS was the canonical source of truth and then, for the sake of speed, convenience, and decoupling, the normalized data from the RDBMS would be denormalized into something like ElasticSearch or MongoDB, which would serve an API to the frontend. But at companies that moved to immutable logs, something like Kafka, I’ve seen teams give up on the RDBMS entirely and simply move data from Kafka to ElasticSearch. This means all the relationships exist in your code, but apparently companies can handle this without much of a problem. Often there is a central schema file that offers as much clarity as one might previously have gotten by looking at the database schema.
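Here is a minimal sketch of that pattern, assuming the kafka-python and elasticsearch client libraries; the topic name, index name, and event shape are my assumptions. The point is that ElasticSearch holds only a projection of the log, and replaying the topic from offset zero rebuilds it from scratch.

```python
# Hedged sketch of a Kafka -> ElasticSearch pipeline: the immutable log
# is the source of truth, ElasticSearch is just a cache of its state.
import json

from kafka import KafkaConsumer          # pip install kafka-python
from elasticsearch import Elasticsearch  # pip install elasticsearch

consumer = KafkaConsumer(
    "company-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
es = Elasticsearch("http://localhost:9200")

for message in consumer:
    event = message.value
    # Upsert the latest state of each company; replaying the topic
    # rebuilds the index, so losing the index is not losing data.
    es.index(index="companies", id=event["company_id"], document=event)
```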

By the way, treating MySQL or Postgres or Oracle like a dumping ground for cache data is a mistake, but since so many companies do it, there must be a real need for somewhere to dump raw data. I’ve seen many companies use AWS S3 for this, and flexible document DBs like MongoDB are also useful in this respect. Companies keep using MySQL or Postgres or Oracle as dumping grounds, but when you simply need somewhere to put raw data, there are better tools for the job, so this is an excellent use case for something other than an RDBMS.
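For instance, a hedged sketch of dumping raw records into S3 with boto3; the bucket name and key layout are assumptions, not a recommendation of any particular scheme.

```python
# Sketch of using AWS S3 as the raw dumping ground instead of a spare
# MySQL table.
import json
from datetime import datetime, timezone

import boto3  # pip install boto3

s3 = boto3.client("s3")

raw_record = {"source": "vendor-feed", "payload": {"name": "Acme Corp"}}
timestamp = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S")

# Cheap, schemaless, append-only storage: no table to bloat, and the
# canonical database never sees this data until it has been cleaned up.
s3.put_object(
    Bucket="example-raw-dumps",
    Key=f"vendor-feed/{timestamp}.json",
    Body=json.dumps(raw_record).encode("utf-8"),
)
```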

So, to answer the question of why use a non-RDBMS: the obvious answer is as a cache for denormalized data. But also, if you adopt the immutable log pattern, then there is no need for an RDBMS at all.

I’ve also worked at companies where we used MongoDB for everything. This can work, but it takes discipline: it is very important that some collections are understood to be canonical and others to be a cache. One should never mix the two. But as I said above, I’ve seen canonical and cache data mixed together even when using an RDBMS, so the problem is very much with the software developers, not the technologies.
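One way to enforce that discipline, sketched here with pymongo and hypothetical collection names, is a hard naming convention that the code refuses to violate:

```python
# Sketch of keeping canonical MongoDB collections separate from caches.
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
db = client["app"]

CANONICAL = {"companies", "users"}
CACHES = {"cache_company_search", "cache_dashboard_totals"}

def write(collection_name, doc):
    """Refuse writes to collections that are neither canonical nor cache,
    so the distinction stays explicit instead of drifting over time."""
    if collection_name not in CANONICAL | CACHES:
        raise ValueError(f"unclassified collection: {collection_name}")
    db[collection_name].insert_one(doc)

# Canonical write; a separate job would rebuild the cache_* collections
# from canonical data, and anything in cache_* is always safe to drop.
write("companies", {"name": "Acme Corp", "state": "NY"})
```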

Post external references

  1. https://privco.com/