Can MongoDB keep its promises?

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com

See “Stale reads with WriteConcern Majority and ReadPreference Primary

Kyle Kingsbury starts the fight without meaning to:

In this test, MongoDB returns the value “0″ for the document, even though the only possible values for the document at that time were 1, 2, 3, or 4. The value 0 was the proper state at some time close to the partition’s beginning, but successful reads just after the partition was fully established indicated that at least one of the indeterminate (:info) CaS operations changing the value away from 0 had to have executed.

You can see this visually in the attached image, where I’ve drawn the acknowledged (:ok) operations as green and indeterminate (:info) operations as yellow bars; omitting :fail ops which are known to have not taken place. Time moves from left to right; each process is a numbered horizontal track. The value must be zero just prior to the partition, but in order to read 4 and 3 we must execute process 1′s CAS from 0->4; all possible paths from that point on cannot result in a value of 0 in time for process 5′s final read.

Since the MongoDB docs for Read Preferences (http://docs.mongodb.org/manual/core/read-preference/) say “reading from the primary guarantees that read operations reflect the latest version of a document”, I suspect this behavior conflicts with Mongo’s intended behavior.

Asya Kamsky dismisses the concern:

I believe that what your test framework is not taking into account is that reading from a primary does not guarantee that the read data will survive a network partition. This is because MongoDB read isolation semantics are similar to “read uncommitted” in a traditional database system when you take into account the full replica set.

As the docs mention, data written with majority writeConcern that has been acknowledged will survive any replica set event that allows a new primary to be elected. However, after the write is made on the primary, but before it has successfully replicated to majority of the cluster, it is visible to any other connection that’s reading from the primary.

This allows the following sequence of events:

T1 network partition happens

T2 write A happens, waits for write concern

T3 read of A happens on the primary

T4 primary steps down due to not seeing majority

T5 new primary is elected

When write A has not propagated to the majority of the replica set, it may not be present on the newly elected primary (in fact, if write A has replicated to none of the secondaries, it is guaranteed to be absent from the newly elected primary).

I believe such a sequence of events was observed in your case, where the majority write concern is not yet satisfied, the unacknowledged data have been written on the primary and were visible to other connections (process 1 in your case), but the value was not present on the newly elected primary (which is the node that process 5 finally successfully read from).

Kingsbury replies:

Note that you don’t have to go full read-committed to fix this anomaly: you can prevent stale and dirty reads for single documents without supporting RC for multi-doc operations (just a difference in lock granularity), so if you want to support reading the latest version, you can have it in both read-uncommitted and read-committed modes.

The read isolation docs (http://docs.mongodb.org/manual/core/write-concern/#read-isolation) are technically correct, I think, but sorta misleading: “For all inserts and updates, MongoDB modifies each document in isolation: clients never see documents in intermediate states” kinda suggests that the read uncommitted problem refers to multiple-document updates—which is also true—but it doesn’t mention that even read operations on a single document may see invalid states that are not causally connected to the final history.

The read preference docs (http://docs.mongodb.org/manual/core/read-preference/) make some pretty explicit claims that Mongo supports linearizable reads, saying “Reading from the primary guarantees that read operations reflect the latest version of a document”, and “All read preference modes except primary may return stale data”.

With this in mind, it might be a good idea to let users know all read modes may return stale data, and that the difference in ReadPreference just changes the probabilities. For instance, “Ensure that your application can tolerate stale data if you choose to use a non-primary mode,” could read “Always ensure that your application can tolerate stale data.”

And there is much more there, and a lot of interesting details, plus a suggestion about how things can be fixed.

Eventually Andy Schwerin writes:

Your diagram indicates that by the end of the period in question, none of the writes have finished being confirmed by the replication system. If SERVER-18022 were resolved all the reads in the diagram would have returned 0, because none of the writes had committed. As such, there would be a legal linearizable schedule, whose prefix includes the completion of all the read operations, and whose suffix includes the completion of all the write operations. I think in this case that the fact that the writes had in fact started is not relevant. In this case, write operation completion means replication to a majority of voting nodes and confirmation of that fact to the primary that accepted the write from the client.

Anyhow, the behavior you did observe certainly doesn’t have a linearizable schedule. As you point out, even with SERVER-18022 you don’t get reads into a linear schedule for free. The problem is that there is a period of time during a network partition when two nodes may believe themselves to be primary. As soon as those two nodes communicate, one will step down. If the partition lasts long enough, the node in the minority partition will step down, but there is an inevitable window for stale reads when a client reads from a primary that will inevitably step down. As an aside, improvements to the consensus protocol can be used to bring that period down to a few network roundtrip periods (hundreds of milliseconds), and that is the subject of the somewhat ill-described SERVER-12385.

You suggested (approximately) transforming reads into atomic read-modify-writes in order to achieve linearizable reads. You didn’t propose it exactly that way, and your description leaves more room for optimization, but “coupling reads to oplog acknowledgement” pretty much degrades to converting reads to read-modify-writes in periods of low write volume. The behavior can be achieved today, albeit somewhat clumsily and only with some client drivers, by using the “findAndModify” command to issue your reads and then issuing a getLastError command to wait for write concern satisfaction. Your findAndModify command will need to make some change to the document being read, such as incrementing an otherwise ignored field, in order to force an entry into the oplog, and you cannot observe the value until the getLastError command returns successfully, indicating that your read-modify-write replicated successfully.

But then:

Finally, as you indicated above, there is a clear documentation issue. The documentation you reference needs to be updated.

What do I make of this? That MongoDB can not be trusted, and I should try to use it in simple ways. The MongoDB engineers make their case, but Kingsbury seems to be reporting a real condition that I personally would have stumbled into, and then worried that my own code had a bug. I am sure there are other engineers who feel as I do. Sometimes, with engineering, you can win the battle and lose the war.

Kingsbury has an exhaustive study on their blog:

How bad are dirty reads?
Read uncommitted allows all kinds of terrible anomalies we probably don’t want as MongoDB users.

For instance, suppose we have a user registration service keyed by a unique username. Now imagine a partition occurs, and two users–Alice and Bob–try to claim the same username–one on each side of the partition. Alice’s request is routed to the majority primary, and she successfully registers the account. Bob, talking to the minority primary, will see his account creation request time out. The minority primary will eventually roll his account back to a nonexistent state, and when the partition heals, accept Alice’s version.

But until the minority primary detects the failure, Bob’s invalid user registration will still be visible for reads. After registration, the web server redirects Alice to /my/account to show her the freshly created account. However, this HTTP request winds up talking to a server whose client still thinks the minority primary is valid–and that primary happily responds to a read request for the account with Bob’s information.

Alice’s page loads, and in place of her own data, she sees Bob’s name, his home address, his photograph, and other things that never should have been leaked between users.

You can probably imagine other weird anomalies. Temporarily visible duplicate values for unique indexes. Locks that appear to be owned by two processes at once. Clients seeing purchase orders that never went through.

Or consider a reconciliation process that scans the list of all transactions a client has made to make sure their account balance is correct. It sees an attempted but invalid transaction that never took place, and happily sets the user’s balance to reflect that impossible transaction. The mischievous transaction subsequently disappears on rollback, leaving customer support to puzzle over why the numbers don’t add up.

Or, worse, an admin goes in to fix the rollback, assumes the invalid transaction should have taken place, and applies it to the new primary. The user sensibly retried their failed purchase, so they wind up paying twice for the same item. Their account balance goes negative. They get hit with an overdraft fine and have to spend hours untangling the problem with support.

Read-uncommitted is scary.

Kingsbury also offers this map of consistency models:

Source