A defense of MongoDB

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact me at lawrence@krubner.com, or follow me on Twitter.

I posted this on Hacker News and now re-post it here.

MongoDB offers the greatest benefit to those who have an evolving concept of their schema, and that tends to be startups, though I have worked in large firms that entirely re-invented their schemas. I worry that I would seem tedious if I listed the places that I have worked, and yet, on Hacker News, when I speak in abstract terms, I tend to get downvoted, so I will name a few specifics.

I worked at Wine Spectator for a year (2010-2011), http://www.winespectator.com/ . They had built their first web site circa 2000 using Oracle, Sun Solaris, and Vignette with Java templates. Circa 2009 they decided to scrap the old, expensive system and move to PHP/MySQL and the Symfony framework. They could not decide what their new schema should be, so, in the name of keeping things flexible, they decided that all data would be in a single table. This table had 240 fields, most of them with generic names such as “modifier_01” and “modifier_02”. This was an organization that was looking for the flexibility offered by MongoDB, but they tried to cram that flexibility into a relational database, and they did so by ignoring all the relational features offered by MySQL. This “one size fits all” database table did not work for the last assignment I was given: import all the old FileMaker Pro databases to a system running MySQL/PHP/Symfony. So I built an entirely separate project, with its own database and schema. Lord knows who is maintaining it now.

Then I worked a year (2011-2012) at Shermans Travel, http://www.shermanstravel.com/ , which also tore apart its database. When I first arrived they were trying to save a system built with MySQL and PHP and later forced to conform to the CakePHP framework. The database had over 300 tables, many of which were no longer in use. Of the tables that were in use, many had fields that were no longer in use. The code and the database were a sprawling mess that had evolved chaotically. (I am emphasizing this chaos because this criticism is often made of MongoDB: without a schema, how do you keep your data organized? Well, most of the places I have worked have had relational databases where the data was completely disorganized.) After a few months, the CTO and the tech team decided on a complete re-write of the code. The tech team was allowed to vote for Java, Python, or Ruby (no one wanted to use PHP). We voted for Ruby. We rebuilt the site as 6 apps, using MySQL for some of the apps and MongoDB for others. My last big project there was a rescue effort for a broken group of 4 database tables in MySQL. There was a “users_history” table that was supposed to track whether a user had subscribed to or unsubscribed from the various newsletters we offered, but there had been a bug, apparently for years, such that many of the “unsubscribe” attempts were not recorded. There were 3 other tables with somewhat redundant data, and I wrote a script that scanned those other 3 tables and attempted to funnel the correct data to the 4th table.
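A reconciliation of that kind can be sketched roughly as follows. This is a hypothetical illustration in Python, not the actual script: the three source record sets, their fields, and the sample rows are invented, and the rule "the most recent event for each user and newsletter wins" is an assumption made only to show the shape of the problem.

```python
from datetime import date

# Hypothetical rows from three redundant sources:
# (email, newsletter, status, day the event was recorded).
source_a = [("ann@example.com", "deals", "subscribed", date(2011, 1, 5))]
source_b = [
    ("ann@example.com", "deals", "unsubscribed", date(2011, 6, 1)),
    ("bob@example.com", "deals", "subscribed", date(2011, 3, 2)),
]
source_c = [("bob@example.com", "cruises", "subscribed", date(2011, 4, 9))]


def reconcile(*sources):
    """Return {(email, newsletter): (status, day)}, keeping the latest event per key.

    "Latest event wins" is an assumed rule for this sketch.
    """
    latest = {}
    for rows in sources:
        for email, newsletter, status, day in rows:
            key = (email, newsletter)
            if key not in latest or day > latest[key][1]:
                latest[key] = (status, day)
    return latest


# The merged result is what would be written back to the history table.
history = reconcile(source_a, source_b, source_c)
```

In this made-up data, the later "unsubscribed" event in the second source overrides the earlier "subscribed" event in the first, which is the kind of lost unsubscribe the repair was meant to recover.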

I have many more stories like this. I could write a whole book about places where Oracle, MySQL or PostgreSQL was in use, but the data was badly organized. Unused tables and unused fields are extremely common.

Why do I emphasize the chaos I have encountered? Because the charge of badly organized data gets thrown at MongoDB a lot. If you would like to read a scathing attack against MongoDB, read this:

http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/

But to me, this line of argument compares the platonic ideal of relational data against the actual use of MongoDB. Maybe if Edgar F. Codd designed your schema to the 7th Normal Form then your schema really is well organized, but I have not seen anything like this in real life.

What I have seen, in real life, convinces me that every organization has an informal schema that is constantly evolving, and which cannot be maintained with anything like regularity. Most of the organizations that have hired me default to rebuilding everything every 5 or 6 years, because by that point the old system has grown chaotic. Sarah Mei’s description of the dangers of MongoDB matches my own experience with relational databases: “we figured out that we had accidentally chosen a cache for our database.”

What I like about MongoDB is that it openly, boldly declares that the chaos I’ve seen is typical, and it facilitates the evolution of the schema, which is going to happen no matter what you do or say. Things evolve, often chaotically. A programmer has a great idea and works on it in 2007, another programmer takes over in 2008, the project starts as raw PHP, later is imported into Symfony, then is re-written in Ruby, then it is broken up into several small apps. Someone quits. The CTO is fired. Someone new starts working and, for the sake of simplicity, prefers doing as much as possible as a background task, using cron scripts. A year later someone joins and is disgusted with the profusion of cron scripts; they want everything organized around a message queue. A very good sysadmin joins the team at a time when most of the programmers are weak. The sysadmin re-writes many of the background scripts, but he prefers Perl for everything and he implements some data caching strategies that no one understands.

You may think that I am exaggerating the level of chaos I have seen. I have not worked for Facebook or Google or Apple and if you tell me that in those companies everything is well run and well organized, then I will believe you, as I have no reason to doubt you. But I have worked at a lot of older media companies in New York City, and what I have seen is constant churn, churn at every level, churn in the team, churn in the technologies, and churn in the database.

I know I will be misunderstood, so let me try to clarify this:

I am not saying that chaos is good.

I am not saying that MongoDB is good because it encourages chaos.

I am saying that chaos is a symptom of the fact that most businesses do not know what their schema should be, and even if they did know what their schema should be, their needs would be different a year from now. The real schema needs of the organization (that is, what sets of data should be acquired and what the relations should be between those sets) are undergoing constant evolution, and this evolution is necessary, healthy, and unstoppable.

What is the strength of a relational system? Consider Wikipedia’s explanation of Codd’s Theorem:

http://en.wikipedia.org/wiki/Edgar_F._Codd

“The domain independent relational calculus queries are precisely those relational calculus queries that are invariant under choosing domains of values beyond those appearing in the database itself. That is, queries that may return different results for different domains are excluded. An example of such a forbidden query is the query “select all tuples other than those occurring in relation R”, where R is a relation in the database. Assuming different domains, i.e., sets of atomic data items from which tuples can be constructed, this query returns different results and thus is clearly not domain independent.”

Clearly, this assumes that the relations among the data are known. The organizations that I work with have no real idea about what relations they want to establish among their data. They are in a permanent exploratory phase. I believe these organizations could be described as “pre Codd”, but most of them have been “pre Codd” for decades, and they will always be “pre Codd”. If you force them to specify relations among their data, you will get answers exactly as useful as these:

http://blog.jimmyr.com/Funny_student_Exam_Answers_13_2008.php

MongoDB is useful in this context. Start acquiring data. Don’t pretend you know what your schema is. You do not know what your schema is. The schema is changing all the time anyway.
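To make "start acquiring data" concrete, here is a minimal sketch in Python. It uses a plain list of dicts as a stand-in for a MongoDB collection so that it runs without a database, and the event types and field names are invented for illustration. The point is only that documents with different shapes can sit side by side, and that queries have to tolerate fields that some documents lack.

```python
import json

# A list of dicts stands in for a MongoDB collection; no database is needed.
events = []


def record(document):
    """Store a document as-is, whatever fields it happens to have."""
    # Round-trip through JSON to mimic serialisation and to store a copy.
    events.append(json.loads(json.dumps(document)))


# Hypothetical documents: each one carries only the fields known at the time.
record({"type": "signup", "email": "ann@example.com"})
record({"type": "signup", "email": "bob@example.com", "referrer": "newsletter"})
record({"type": "review", "wine": "Example Red", "score": 91, "tags": ["red"]})


def find(field, value):
    """Return documents that have `field` set to `value`.

    Documents that lack the field are skipped rather than treated as errors.
    """
    return [doc for doc in events if field in doc and doc[field] == value]


signups = find("type", "signup")
with_referrer = [doc for doc in events if "referrer" in doc]
```

The second signup gained a "referrer" field without any migration of the first. The cost is that every reader of the data, like `find` here, must decide what a missing field means.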

Is there a place for relational databases? Yes, because sometimes some parts of the business become steady for some length of time, and for that part of the business, capturing fixed sets of data, with fixed relations, is very useful. But we should not pretend that this situation holds where it does not. I am not convinced that this is even the general case, though there is an overwhelming tendency in computer science, and in business, to pretend that fixed-sets-with-fixed-relations is the general case. If you feel it is, then you have been working at places facing conditions far more steady than what I have seen, or perhaps you are simply considering a shorter time frame than I am.

Relational SQL databases offer their greatest strength when consistency is needed. The apps I am currently working on only need eventual consistency, and can tolerate some data loss. When you have strict consistency needs, I would say use a relational SQL database. Otherwise, look carefully at MongoDB, as it might be perfect for you.

By the way, when you manage your schema from your app, then your schema is always under version control (assuming you use version control). In theory you can keep your relational SQL schema under version control, but I have only seen that done when I was at Timeout.com. Implicitly, in something like a Rails app, you are keeping the schema for that app under version control, but you have to run a command, so the relationship is not as exact as something like my schema.edn file, which actually creates the schema.
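Managing the schema from the app can be sketched like this. The example is Python, and it does not reproduce the schema.edn file mentioned above: the `SCHEMA` layout, the `conform` function, and the field names are all hypothetical. It only shows the general idea that the schema is ordinary data in the source tree, so any change to it is a commit like any other.

```python
import copy

# Hypothetical schema kept as data in the application's source tree,
# so every change to it shows up in version control alongside the code.
SCHEMA = {
    "user": {
        "email": {"type": str, "required": True},
        "newsletters": {"type": list, "required": False, "default": []},
    },
}


def conform(kind, document):
    """Check a document against SCHEMA[kind] and fill in defaults.

    Raises ValueError on a missing required field or a wrong type.
    Fields not named in the schema pass through untouched.
    """
    result = dict(document)
    for field, rule in SCHEMA[kind].items():
        if field not in result:
            if rule["required"]:
                raise ValueError(f"{kind}: missing required field {field!r}")
            # Copy the default so documents never share one mutable value.
            result[field] = copy.deepcopy(rule["default"])
        elif not isinstance(result[field], rule["type"]):
            raise ValueError(
                f"{kind}: field {field!r} should be {rule['type'].__name__}"
            )
    return result


user = conform("user", {"email": "ann@example.com", "nickname": "ann"})
```

Here `user` comes back with an empty "newsletters" list filled in and the unlisted "nickname" field left alone, which keeps the schema advisory rather than restrictive.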

Post external references

  1. https://news.ycombinator.com/item?id=7449637
  2. http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
  3. http://en.wikipedia.org/wiki/Edgar_F._Codd
  4. http://blog.jimmyr.com/Funny_student_Exam_Answers_13_2008.php
  5. https://github.com/lkrubner/humorus-mg/blob/master/resources/config/schema.edn