March 17th, 2016
(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: firstname.lastname@example.org
Good lord, this makes me happy. I think Michael Drogalis is a very smart guy and everyone knows that Kyle Kingsbury is a very, very smart guy. Kingsbury’s work on Jepsen is the finest work that anyone has ever done on the problems of distributed data. Onyx is exciting as a Clojure answer to Apache Storm. Sad to say, Storm is written in Scala. The idea of testing Onyx with Jepsen is one of the finest ideas I’ve heard this year.
Distributed systems are incredibly powerful for dealing with massive amounts of load and providing high availability. Ensuring that your system behaves correctly under stress, however, is a notoriously difficult problem. All of this power is useless if you can’t trust your system to handle network partitions, connection loss, killed nodes, consistency anomalies, and other nasty issues.
From the beginning, Onyx has had a variety of unit and integration tests. Over time we have also added numerous property tests to the mix. Our property tests stress our peer coordination code paths and cluster scheduler, and we found numerous bugs that would have been hard to pick up with other testing methods. These techniques have allowed us to add complex features quickly.
While we have users happily using Onyx in production, it is likely that there are bugs waiting for the right set of scenarios to occur. When they do, reproducing these scenarios can be incredibly time consuming. We would much prefer to find these issues early and to have a way to test every release against grueling conditions that may only occasionally occur in a production environment.
Many forms of distributed tests can be both difficult to formulate and time consuming for developers to build. Luckily, the paper Simple Testing Can Prevent Most Critical Failures (Yuan et al.) found that almost all distributed systems failures can be reproduced with 3 or fewer nodes. However, we were in need of a better way to test for these forms of faults.
Kyle Kingsbury’s Jepsen library and Call Me Maybe series have been blazing a path to better testing of distributed systems. In Kingsbury’s own words, a Jepsen test is “a Clojure program which uses the Jepsen library to set up a distributed system, run a bunch of operations against that system, and verify that the history of those operations makes sense”. Kyle has been dragging the distributed systems world into a more consistent (and pager-friendly) future. Did we mention that he’s now available for Jepsen consulting?
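To make the "verify that the history of those operations makes sense" step a little more concrete, here is a toy sketch in Python. It is not Jepsen's API and it is not how Onyx is tested; the names `Op` and `check_set_history` are made up for illustration. It assumes the simplest possible scenario: clients try to add values to a set, each attempt is recorded as acknowledged ("ok"), failed ("fail"), or indeterminate ("info"), and a final read is compared against that record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One recorded operation in a test history (hypothetical shape)."""
    kind: str       # "add" or "read"
    value: object   # the element for an add, a frozenset for a read
    outcome: str    # "ok" (acknowledged), "fail", or "info" (unknown)


def check_set_history(history):
    """Compare the last successful read against what was acknowledged.

    lost       = acknowledged adds that are missing from the final read
    unexpected = values in the final read that nobody ever tried to add
    """
    attempted = {op.value for op in history if op.kind == "add"}
    acknowledged = {
        op.value for op in history
        if op.kind == "add" and op.outcome == "ok"
    }
    reads = [op for op in history if op.kind == "read" and op.outcome == "ok"]
    if not reads:
        # Without a final read there is nothing to verify against.
        return {"valid": False, "lost": set(), "unexpected": set()}

    final = set(reads[-1].value)
    lost = acknowledged - final
    unexpected = final - attempted
    return {
        "valid": not lost and not unexpected,
        "lost": lost,
        "unexpected": unexpected,
    }


# An indeterminate add (3) may or may not show up; either is acceptable.
healthy = [
    Op("add", 1, "ok"),
    Op("add", 2, "ok"),
    Op("add", 3, "info"),
    Op("read", frozenset({1, 2, 3}), "ok"),
]

# An acknowledged add (2) is missing from the final read: data was lost.
lossy = [
    Op("add", 1, "ok"),
    Op("add", 2, "ok"),
    Op("read", frozenset({1}), "ok"),
]

print(check_set_history(healthy))
print(check_set_history(lossy))
```

The interesting part in a real test is everything this sketch leaves out: running the operations concurrently against a real cluster while partitions and node failures are injected, so that the recorded history reflects what the system actually did under stress.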