The difference between git and mercurial

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact Lawrence at lawrence@krubner.com, or follow him on Twitter.

Interesting:

Both git and mercurial were developed to solve a large problem that arose in 2005: the Linux kernel was no longer allowed to use BitKeeper for free as its version control system. Having used BitKeeper for three years, the kernel developers had become accustomed to their new distributed workflow. No longer were patches emailed between people, lost, resubmitted, and managed personally by a series of shell scripts. Now patches and features were recorded, pulled, and merged by a fancy piece of software that made it possible to track history over long periods and hunt down regressions.

It also strengthened the kernel development workflow where there was one Dictator and multiple Lieutenants, each responsible for their subsystems. Each Lieutenant vetted and accepted patches to their subsystems, and Linus pulled their changes and made the official Linux repository available. Anything that replaced BitKeeper would have to enable this workflow.

Not only did any replacement need to support a distributed workflow, it also had to be fast for a large number of changes and files. The Linux kernel is a very large project that has thousands of changes each day contributed by thousands of people.

Lots of tools were evaluated and none quite passed muster. Matt Mackall decided to create mercurial to solve the problem around the same time that Linus decided to create git. Both borrowed some ideas from the monotone project. I will try to identify those where I recognize them.

Both git and mercurial identify versions of files with hashes. File hashes are combined in manifests (git calls them trees and git trees can also point to other trees). Manifests are pointed to by revisions/commits/changelogs (commits from now on). The key to how the various tools differ is how they represent these concepts.
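The layering described above (file hashes, then a manifest, then a commit) can be sketched in a few lines of Python. This is a deliberately simplified model and not the on-disk format of either tool: the function names `object_id`, `make_manifest`, and `make_commit`, and the use of JSON to serialize records, are assumptions made only for illustration.

```python
import hashlib
import json


def object_id(payload: bytes) -> str:
    """Name a piece of content by the hash of its bytes."""
    return hashlib.sha1(payload).hexdigest()


def make_manifest(files: dict) -> dict:
    """Map each path to the hash of that file's content."""
    return {path: object_id(content) for path, content in sorted(files.items())}


def make_commit(manifest: dict, parent, message: str):
    """Build a commit record that points at a manifest by its hash."""
    # JSON is an illustrative serialization, not what git or mercurial use.
    manifest_id = object_id(json.dumps(manifest, sort_keys=True).encode())
    commit = {"manifest": manifest_id, "parent": parent, "message": message}
    commit_id = object_id(json.dumps(commit, sort_keys=True).encode())
    return commit_id, commit


files = {"README": b"hello\n", "main.c": b"int main(void) { return 0; }\n"}
first_id, first = make_commit(make_manifest(files), None, "initial import")

files["README"] = b"hello, world\n"  # change one file only
second_id, second = make_commit(make_manifest(files), first_id, "edit README")
```

Changing a single file changes that file's hash, which changes the manifest hash, which in turn changes the commit id, while the hash of the untouched file stays the same.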

Mercurial decided to solve the performance problem by developing a specialized storage format: Revlog. Every file is made up of an index and a data file. Data files contain snapshots and deltas; snapshots are only created if the number of deltas to represent a file goes over a threshold. The index is key to efficient access to the data file. Changes to files are only ever appended to the data file. Because files aren’t always changed sequentially, the index is used to group parts of the data file into coherent chunks that represent a particular file version.
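To make the index/data split concrete, here is a toy, in-memory sketch of an append-only store with snapshots and deltas. It is not Mercurial's actual revlog format: `ToyRevlog`, `make_delta`, `apply_delta`, and the `max_chain` parameter are hypothetical names, the delta is a simple common-prefix/common-suffix encoding swapped in for brevity, and only a linear history is handled.

```python
import struct


def make_delta(old: bytes, new: bytes) -> bytes:
    """Encode new as (shared prefix length, shared suffix length, middle bytes)."""
    limit = min(len(old), len(new))
    p = 0
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[len(old) - 1 - s] == new[len(new) - 1 - s]:
        s += 1
    return struct.pack(">II", p, s) + new[p:len(new) - s]


def apply_delta(old: bytes, delta: bytes) -> bytes:
    """Rebuild the new text from the old text and a delta."""
    p, s = struct.unpack(">II", delta[:8])
    return old[:p] + delta[8:] + old[len(old) - s:]


class ToyRevlog:
    """History of one file: an append-only data buffer plus an index."""

    def __init__(self, max_chain: int = 3):
        self.data = bytearray()  # the "data file": only ever appended to
        self.index = []          # per revision: (offset, length, base revision or -1)
        self.max_chain = max_chain

    def _chain_length(self, rev: int) -> int:
        """Count the deltas between rev and its nearest snapshot."""
        n = 0
        while self.index[rev][2] != -1:
            rev = self.index[rev][2]
            n += 1
        return n

    def add(self, text: bytes) -> int:
        rev = len(self.index)
        if rev == 0 or self._chain_length(rev - 1) >= self.max_chain:
            payload, base = text, -1  # store a full snapshot
        else:
            payload, base = make_delta(self.read(rev - 1), text), rev - 1
        self.index.append((len(self.data), len(payload), base))
        self.data += payload
        return rev

    def read(self, rev: int) -> bytes:
        offset, length, base = self.index[rev]
        chunk = bytes(self.data[offset:offset + length])
        if base == -1:
            return chunk
        return apply_delta(self.read(base), chunk)
```

Reading a revision walks back through the index to the nearest snapshot and replays the deltas forward, which is why capping the chain length keeps reads cheap.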

From file revisions, manifests are created and from manifests, commits are created. Creating, finding, and calculating differences to files are very efficient given this method. It takes a relatively small amount of space on the disk to represent these changes. The network protocol to transfer changes is similarly efficient.

Git takes the opposite approach: file blobs. To store revisions quickly, each new file revision is a complete copy of the file. These copies are compressed, but there is a lot of duplication. The developers of git have created methods to reduce the storage requirements by packing data, essentially creating something like a revlog at a given point in time. These packs are not the same thing as a revlog, but serve a similar purpose of storing data in a space efficient format.
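The blob approach can be illustrated with a short Python sketch. In git's SHA-1 object format, a blob's id is the SHA-1 of the header `blob <size>\0` followed by the content, and a loose object is that same byte string compressed with zlib. The in-memory `store` dictionary and the helper names `git_blob_id`, `store_blob`, and `load_blob` are assumptions for illustration; real git writes loose objects to files under `.git/objects`.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Return the id git assigns to a blob (SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def store_blob(store: dict, content: bytes) -> str:
    """Store a full, zlib-compressed copy of the content under its id."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    blob_id = git_blob_id(content)
    store[blob_id] = zlib.compress(header + content)
    return blob_id


def load_blob(store: dict, blob_id: str) -> bytes:
    """Decompress an object and strip its header."""
    raw = zlib.decompress(store[blob_id])
    _header, _, content = raw.partition(b"\0")
    return content


store = {}
v1 = b"".join(b"line %d\n" % i for i in range(1000))
v2 = v1 + b"one more line\n"  # a one-line change to the file

id1 = store_blob(store, v1)
id2 = store_blob(store, v2)
```

Although `v2` differs from `v1` by a single line, the store now holds two complete compressed copies; this is the duplication that packing later reduces.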

Because git stores everything in files, its history is a lot more fluid. Object files can be copied in from anywhere using any method (e.g. rsync). Commits can be created or destroyed. Just as history isn’t linear in the distributed version control world, git’s data model doesn’t depend on linear files. Mercurial’s file format is to git as compressed files are to sparse files.

Source: http://xentac.net/2012/01/19/the-real-difference-between-git-and-mercurial.html