New data structures in Clojure 1.8

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

Apparently there are a lot of new data structures which may arrive in Clojure 1.8, thanks to Zach Tellman:

So, at the end of this exercise we have more than 5000 lines of Java, and we want to add them to the core implementation of Clojure. Ideally, we won’t introduce any bugs in the process. But the same unrolling that makes the code faster makes it significantly harder to simply read the code and verify it’s “correct”. The Clojure code which generates the Java, while more compact, is mostly concerned with string concatenating its way to proper syntax. The semantics of both codebases are a bit opaque.

But even if the code were perfectly clear, data structures are easy to get wrong and difficult to test. On a map, boundary conditions that need to be tested include key collisions, hash collisions, removing non-existent keys, conversions to and from transient collisions, and so on. The exact boundary conditions depend on the underlying implementation, so a test suite that covers one implementation of a vector or map won’t necessarily cover another, even if they behave identically.

Given this, it seems like writing exhaustive unit tests is a poor way for us to make sure our implementation is sound. A better approach would be to use property-based testing, which instead of defining both inputs and expected behavior, allows us to just define invariants which must hold true across all inputs. The testing framework will then search through the space of inputs for one which breaks the invariant, and then reduce the input down to the simplest possible reproducing case.

This is an especially nice approach for us, since both the inputs and invariants are straightforward. The inputs are all possible actions that can be performed on the data structure (conj, disj, assoc, dissoc, etc.), and the invariant is that it must behave just like the existing implementation, no matter what actions we take. Luckily there’s a readymade library for this purpose: collection-check. This library has been used to validate most (possibly all) of the alternate data structures in the Clojure ecosystem. It also uncovered an bug in Clojure’s own implementation of transient maps, which is discussed in more detail in this talk.

But while collection-check is a useful tool for validating the unrolled collections, we still need to map it effectively onto the underlying implementation. The initial tests for the collections only checked maps of integers onto integers, which skipped over the special code paths dedicated to keywords. When an additional test was run for maps of keywords onto integers, an off-by-one error in the Duff’s device snippet discussed above was found.

And it might arrive in 1.8:

The results of this work can be found in the cambrian-collections library, the output of which has been submitted as a patch for Clojure, and is currently under review to be merged into the Clojure 1.8.0 release. Initial performance tests are promising: the cost of building collections when deserializing JSON in Cheshire is halved with unrolled collections, giving us a 33% overall improvement in performance. Since at Factual we spend quite a bit of time deserializing data from disk, this will have a meaningful and immediate impact on our daily operations. We hope it will have a similar benefit for others.

Post external references

  1. 1
    http://blog.factual.com/using-clojure-to-generate-java-to-reimplement-clojure
Source