Rapid intake and output of data, with control over the scale of the failures

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

Two new and interesting libraries:

s3-journal:

At first, we used the Hadoop S3 client, which we had used in some lightweight use cases before. Unfortunately, it ended up coping poorly. Each server had a sustained throughput of about 10 megabytes of compressed data per second, and every so often S3 would hiccup and stop accepting data for a few minutes. When this happened, Hadoop’s client would simply buffer the data in-memory until all the memory was exhausted, at which point the process would become non-responsive. That server’s traffic would then spill over onto the other servers, and the process would be repeated.

…So in addition to minimizing overhead, s3-journal writes entries to disk in batches constrained by the number of entries, the maximum time allowed to elapse between the first and last entry, or both. These entries are written and then read out as a single binary blob, allowing us a high effective throughput measured in entries/sec without saturating the throughput of the disk controller. The downside to this, obviously, is the same as with the previous client: data buffered in memory may be lost forever. However, the issue with the other client was less that any data could be lost, but rather that the amount of data that could be lost was unbounded. By exposing parameters for how often s3-journal flushes to disk, people using the library can find their own golden mean between throughput and durability.
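
To make the batching idea concrete, here is a minimal sketch in Java. This is my own illustration of the pattern described above, not s3-journal's actual API (s3-journal is a Clojure library); the class and parameter names are invented. Entries accumulate in memory and are flushed to an append-only file once either a count threshold or a time threshold is hit:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch of the batching idea, not s3-journal's API.
public class BatchingJournal implements Closeable {
    private final Path file;
    private final int maxEntries;        // flush when this many entries are buffered...
    private final long maxDelayMillis;   // ...or when the first buffered entry is this old
    private final List<byte[]> buffer = new ArrayList<>();
    private long firstEntryAt = -1;

    public BatchingJournal(Path file, int maxEntries, long maxDelayMillis) {
        this.file = file;
        this.maxEntries = maxEntries;
        this.maxDelayMillis = maxDelayMillis;
    }

    public synchronized void append(byte[] entry) throws IOException {
        if (buffer.isEmpty()) firstEntryAt = System.currentTimeMillis();
        buffer.add(entry);
        boolean full = buffer.size() >= maxEntries;
        boolean stale = System.currentTimeMillis() - firstEntryAt >= maxDelayMillis;
        // These two knobs are the throughput-vs-durability trade-off:
        // bigger/older batches mean fewer, larger writes, but more data
        // at risk if the process dies before the flush.
        if (full || stale) flush();
    }

    // Write the whole batch as one contiguous blob, so the disk sees one
    // large sequential write instead of one small write per entry.
    public synchronized void flush() throws IOException {
        if (buffer.isEmpty()) return;
        ByteArrayOutputStream blob = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(blob);
        for (byte[] entry : buffer) {
            out.writeInt(entry.length);  // length-prefix each entry for read-out
            out.write(entry);
        }
        Files.write(file, blob.toByteArray(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        buffer.clear();
        firstEntryAt = -1;
    }

    @Override
    public synchronized void close() throws IOException { flush(); }
}
```

A real implementation would also flush from a background timer, so that a stream that goes quiet still gets its last partial batch to disk; that is omitted here for brevity.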

riffle:

There exist today a surprising number of in-process key/value stores, including LevelDB, LMDB, RocksDB, and many others. Each of these provides some variation on the “database as library” use case, where a process needs to persist and look up certain values by key. Our first version took LevelDB, which had decent Java bindings, added some background syncing logic, and wrapped an HTTP server around it.
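
The shape of that first version is simple enough to sketch. Here is a toy illustration in Java, using the JDK's built-in HttpServer, with a ConcurrentHashMap standing in for the embedded store (their real version used the actual LevelDB bindings, plus background syncing logic that is omitted here):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy "database as library, HTTP on top" sketch -- not Factual's code.
public class KvHttpServer {
    public static void main(String[] args) throws Exception {
        // Stand-in for an embedded key/value store like LevelDB.
        ConcurrentMap<String, byte[]> store = new ConcurrentHashMap<>();
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", exchange -> {
            // The request path (minus the leading slash) is the key.
            String key = exchange.getRequestURI().getPath().substring(1);
            if ("PUT".equals(exchange.getRequestMethod())) {
                store.put(key, exchange.getRequestBody().readAllBytes());
                exchange.sendResponseHeaders(204, -1);  // stored, no body
            } else {
                byte[] value = store.get(key);
                if (value == null) {
                    exchange.sendResponseHeaders(404, -1);
                } else {
                    exchange.sendResponseHeaders(200, value.length);
                    try (OutputStream out = exchange.getResponseBody()) {
                        out.write(value);
                    }
                }
            }
            exchange.close();
        });
        server.setExecutor(null);  // default executor
        server.start();
    }
}
```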

At first, this worked quite well. Starting from an empty initial state, we were able to quickly populate the database, all the while maintaining an impressively low latency on reads. But then the size of the database exceeded the available memory.

LevelDB uses memory-mapping to keep its index files in memory regions that can be quickly accessed by the process. However, once the size of the memory-mapped regions exceeds the available memory, some will be evicted from memory, and only refetched if the region is accessed again. This works well if only some of the regions are “hot” – they can stay in memory and the others can be lazily loaded on demand. Unfortunately, our data had a uniform access pattern, which meant that regions were being continuously evicted and reloaded.
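
If memory-mapping is unfamiliar, the mechanism looks like this in Java. The file is mapped into the process's address space, and the operating system decides which pages stay resident; the same read is fast when the page is in memory and quietly becomes disk I/O when it has been evicted:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal illustration of memory-mapped reads ("index.db" is a placeholder).
public class MmapExample {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("index.db"),
                                               StandardOpenOption.READ)) {
            MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Fast if the page is resident; a page fault (disk read) if not.
            byte first = region.get(0);
            System.out.println("first byte: " + first);
        }
    }
}
```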

Even worse, once the database grew past a threshold, write throughput plummeted. This is because of write amplification, which is a measure of how many bytes need to be written to disk for each byte written to the database. Since most databases need to keep their entries ordered, a write to the middle of an index will require at least half the index to be rewritten. Most databases will offset this cost by keeping a write-ahead log, or WAL, which persists the recent updates in order of update, and then periodically merges these changes into the main index. This amortizes the write amplification, but the overhead can still be significant. In our case, with a 100GB database comprising 100 million entries, it appeared to be as high as 50x.
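
To put that 50x figure in perspective: 100GB across 100 million entries works out to roughly 1KB per entry, so at 50x amplification each 1KB logical write costs something like 50KB of physical disk I/O. At that ratio, a mere 10MB/sec of incoming updates would demand on the order of 500MB/sec from the disk, which is enough to saturate a typical SATA SSD all by itself.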

…Our implementation cherry-picked the elements we wanted from each of these sources: fixed memory overhead per key, linear-time merging, and block compression. While memory-mapping is used for the hashtable, values are read directly from disk, decoupling our I/O throughput from how much memory is available. The keys are consistently ordered to allow for linear merges, but not lexicographically, so, unlike SSTables, range queries are not possible. The resulting library, Riffle, is far from novel, but it is only ~600 lines of code and is something we can understand inside and out. That allowed us to write a simple set of Hadoop jobs that, given an arbitrary set of updated keys and values, would construct a set of sharded Riffle indices, which could then be downloaded by our on-premise database servers and efficiently merged into the current set of values. A server which is new or has fallen behind can simultaneously download many such updates, merging them all together in constant space and linear time.
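
The merge they describe is essentially a k-way merge over streams that are already sorted by the same key ordering. A rough Java sketch of the idea (my own illustration, not Riffle's code; all names are invented) shows why it runs in linear time and constant space:

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Hypothetical sketch of a constant-space, linear-time index merge.
public class IndexMerge {

    // One entry in a pre-sorted index shard. sortKey is the shared ordering --
    // in Riffle's case a hash of the key rather than lexicographic order,
    // which is why linear merges work but range queries do not.
    record Entry(long sortKey, byte[] key, byte[] value, int generation) {}

    private record Head(Entry entry, Iterator<Entry> source) {}

    // K-way merge of pre-sorted inputs: linear in the total number of
    // entries, and constant space -- one buffered entry per input shard,
    // no matter how large the shards are. When the same key appears in
    // several shards, the highest generation (the newest update) wins.
    // For simplicity this treats equal sort keys as equal keys; a real
    // implementation would compare the key bytes on hash collisions.
    static void merge(List<Iterator<Entry>> inputs, Consumer<Entry> output) {
        PriorityQueue<Head> heads =
                new PriorityQueue<>(Comparator.comparingLong((Head h) -> h.entry().sortKey()));
        for (Iterator<Entry> in : inputs) {
            if (in.hasNext()) heads.add(new Head(in.next(), in));
        }
        while (!heads.isEmpty()) {
            Head head = heads.poll();
            Entry winner = head.entry();
            if (head.source().hasNext())
                heads.add(new Head(head.source().next(), head.source()));
            // Drain every other occurrence of the same key, keeping the newest.
            while (!heads.isEmpty() && heads.peek().entry().sortKey() == winner.sortKey()) {
                Head dup = heads.poll();
                if (dup.entry().generation() > winner.generation()) winner = dup.entry();
                if (dup.source().hasNext())
                    heads.add(new Head(dup.source().next(), dup.source()));
            }
            output.accept(winner);  // entries stream out already in sorted order
        }
    }
}
```

Because the output is itself sorted, the merged result can be streamed straight to disk, which is what lets a new or lagging server fold many downloaded updates together in one pass.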

Post external references

  1. http://blog.factual.com/how-factual-uses-persistent-storage-for-its-real-time-services