Fixing Bad Data in Datomic

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact Lawrence at lawrence@krubner.com, or follow him on Twitter.

I am intrigued:

A Motivating Example

ACME Co. buys, sells, and processes things. Unfortunately, their circa-2003 web interface is not a shining example of UI/UX design. Befuddled by all the modal screens, managers regularly put bad data into the system.

In fact, manager Steve just accidentally created an inventory record showing that ACME now has 999 Tribbles. This is ridiculous, since everyone knows that the CEO refuses to deal in Tribbles, citing “a bad experience”. In a rather excited voice, Steve says “Quick, please delete the last entry added to the system!”

As is so often the case, job one is to carefully interpret a stakeholder’s request. In particular, the words “delete”, “last”, and “entry” all warrant careful consideration.

What “Delete” Means

Let’s start with “delete”. At first glance, one might think that Steve wants us to do our best to “unhappen” his mistake. But upon reflection, that isn’t such a good idea. The database is a live system, and someone may have used the bad data to make decisions. If so, then simply excising the mistake will just lead to more confusion later. What if, during the few minutes Tribbles were in the inventory, we sold a Tribble? Or, more subtly, what if we moved our widget inventory to a different warehouse to make room for the nonexistent Tribbles?

Databases need to remember the history of their data, even (and perhaps especially) when the data is later discovered to be bad. An easy analogy to drive home the point is source control. Source control systems act as a simple database for code. When a bug happens, you care very much about which versions of the code manifest the bug.

So rather than deleting the Tribbles, we want something more like a “reverting commit” in source control; that is, to record that we no longer believe that we have (or in fact ever had) Tribbles, but that during a specific time window we mistakenly believed we had 999 Tribbles.
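
To make the “reverting commit” idea concrete, here is a minimal sketch in Python. It is a toy model of an append-only fact log, not Datomic's API: the Datom tuple, the view_as_of function, and the attribute name "inventory/count" are all made up for illustration. The point is only that a correction is recorded as a new retraction, so the mistaken belief stays visible in history.

```python
from typing import NamedTuple

class Datom(NamedTuple):
    """One recorded fact (hypothetical shape, for illustration only)."""
    entity: str
    attribute: str
    value: object
    tx: int        # transaction number, increasing over time
    added: bool    # True = assertion, False = retraction

def view_as_of(log, tx):
    """Return the set of facts believed true once transaction `tx` was applied."""
    facts = set()
    for d in log:
        if d.tx > tx:
            continue  # ignore anything recorded after the requested point
        fact = (d.entity, d.attribute, d.value)
        if d.added:
            facts.add(fact)
        else:
            facts.discard(fact)
    return facts

# The log is only ever appended to; nothing is erased.
log = [
    Datom("widget", "inventory/count", 40, tx=1, added=True),
    Datom("tribble", "inventory/count", 999, tx=2, added=True),   # Steve's mistake
    Datom("tribble", "inventory/count", 999, tx=3, added=False),  # the correction
]

current = view_as_of(log, 3)          # what we believe now
during_mistake = view_as_of(log, 2)   # what we believed during the bad window
```

Here current no longer contains the Tribbles, while during_mistake still does, so anyone auditing a decision made in that window can see what the system claimed at the time.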

What “Last” Means

Next, let us consider the temporal word “last”. Happily, ACID databases have a unit of time ordering: transactions. Datomic goes a step further with reified transactions, i.e. transactions that you can manipulate as first-class objects in the system. In such a system, you might indeed be able to say “Treat the last transaction as a data entry error and record a reverting transaction.”

We still need to be careful, though, about what we mean by “last”. Again, the inventory system is a live system, so the last transaction is a moving target. It would be dangerous and incorrect to blindly revert the last transaction, without making sure that Steve’s erroneous transaction is still the last one. Generally, we will want a way to review recent transactions, and then narrow down by some other criteria, e.g. the id for Tribbles.
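
The danger of “last” can be sketched with the same kind of toy model. Again, this is Python with made-up names (Datom, transactions_touching, reverting_datoms), not Datomic's API. It assumes we identify the bad transaction by the entity it touched, and it builds the reverting transaction by inverting that transaction's facts.

```python
from typing import NamedTuple

class Datom(NamedTuple):
    """One recorded fact (hypothetical shape, for illustration only)."""
    entity: str
    attribute: str
    value: object
    tx: int        # transaction number, increasing over time
    added: bool    # True = assertion, False = retraction

def transactions_touching(log, entity, since_tx=0):
    """Review step: list transactions after `since_tx` that mention `entity`."""
    return sorted({d.tx for d in log if d.entity == entity and d.tx > since_tx})

def reverting_datoms(log, bad_tx, new_tx):
    """Invert every datom of `bad_tx`: assertions become retractions and vice versa."""
    return [d._replace(tx=new_tx, added=not d.added) for d in log if d.tx == bad_tx]

log = [
    Datom("widget", "inventory/count", 40, 1, True),
    Datom("tribble", "inventory/count", 999, 2, True),  # Steve's mistake
    Datom("widget", "inventory/count", 40, 3, False),   # a later, valid change
    Datom("widget", "inventory/count", 35, 3, True),    # made by someone else
]

last_tx = max(d.tx for d in log)                     # the system has moved on
candidates = transactions_touching(log, "tribble")   # narrow down by entity
bad_tx = candidates[-1]                              # the one we actually mean

# Record the correction as a new transaction instead of editing history.
log.extend(reverting_datoms(log, bad_tx, new_tx=last_tx + 1))
```

In this example the last transaction is the widget update, so blindly reverting it would have undone a valid change and left the Tribbles in place. Note that simple inversion is only safe if nothing else has changed the same facts in the meantime; a real system would need to check for that before applying the correction.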

Source: http://blog.datomic.com/2014/08/stuff-happens-fixing-bad-data-in-datomic.html