Gini Impurity is quick to measure and easy to adapt to

(written by Lawrence Krubner, however indented passages are often quotes).

I’m trying to convince a client that we need to match records using a Random Forest approach. They have hit the limits of what can be done with simple string matching. I like basic articles like this, for explaining things clearly, and making it obvious how easy Random Forests can be (especially compared to Neural Nets).

Which is the better split? This is a subjective question. In practice, people use different metrics for evaluating splits. The most commonly used metric is Information Gain. Another commonly used metric is Gini Impurity, and since it’s easier to explain, let’s use it.
Gini Impurity measures the disorder of a set of elements. It is calculated as the probability of mislabeling an element, assuming that the element is randomly labeled according to the distribution of all the classes in the set.
Example: Suppose we have a set with 6 elements: {red, red, blue, blue, blue, blue}. (The classes of this set are red and blue.) We select an element at random. Then we randomly label it according to the distribution of classes in the set. This would be equivalent to labeling the selected element by rolling a 6-sided die with 4 blue sides and 2 red sides. The probability that we misclassify the element is equal to the probability that we select a red element times the probability that we label it blue, plus the probability that we select a blue element times the probability that we label it red. This is
2/6 * 4/6 + 4/6 * 2/6 = 16/36.
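The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than anything from the quoted article: the function name `gini_impurity` is made up for this example, and it uses exact fractions so the result can be compared directly with 16/36.

```python
from collections import Counter
from fractions import Fraction


def gini_impurity(labels):
    """Probability of mislabeling a random element when its label is
    drawn from the set's own class distribution."""
    n = len(labels)
    counts = Counter(labels)
    # Sum P(select class a) * P(label it as a different class b).
    return sum(
        Fraction(counts[a], n) * Fraction(counts[b], n)
        for a in counts
        for b in counts
        if a != b
    )


example = ["red", "red", "blue", "blue", "blue", "blue"]
print(gini_impurity(example))         # 4/9, the reduced form of 16/36
print(float(gini_impurity(example)))  # roughly 0.444
```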
A few points before we move on. If all the elements in a set are identical, the Gini Impurity is 0. If we have two classes of elements, our max Gini Impurity occurs when there is an equal number of elements from each class in the set. In this case the Gini Impurity is 1/2. Gini Impurity generalizes for more than 2 classes, and the max Gini Impurity approaches 1 as the number of classes approaches infinity.
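These points follow from the closed form of Gini Impurity. As a short derivation, write p_i for the fraction of elements belonging to class i, out of k classes (this notation is introduced here and is not in the quoted article):

```latex
\begin{align*}
G &= \sum_{i=1}^{k} p_i (1 - p_i) = 1 - \sum_{i=1}^{k} p_i^2 \\
\text{one class } (p_1 = 1):\quad G &= 1 - 1^2 = 0 \\
\text{equal classes } (p_i = 1/k):\quad G &= 1 - k \cdot \frac{1}{k^2} = 1 - \frac{1}{k} \\
k = 2:\quad G &= \frac{1}{2}, \qquad \lim_{k \to \infty} \left(1 - \frac{1}{k}\right) = 1
\end{align*}
```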
Now suppose we have a set {red, red, blue, blue, blue, blue}. We consider a split {red, blue, blue, blue} {red, blue}. We can measure the goodness of this split by averaging the Gini Impurity of the leaves, weighted by the number of elements in each leaf, and comparing that number to the Gini Impurity in the root.
The Gini Impurity in the root was 16/36 = 0.444. The leaves have Gini Impurity 0.375 and 0.5, so their weighted average is 4/6 * 0.375 + 2/6 * 0.5, roughly 0.417. So this split reduced our Impurity by about 0.028 (exactly 1/36).
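The split evaluation can be checked with a small script. This is a sketch under assumptions: the helper names `gini` and `weighted_leaf_impurity` are hypothetical, and `gini` uses the equivalent form 1 minus the sum of squared class fractions.

```python
from collections import Counter


def gini(labels):
    """Gini Impurity as 1 minus the sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_leaf_impurity(leaves):
    """Average the leaf impurities, weighting each leaf by its size."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * gini(leaf) for leaf in leaves)


root = ["red", "red", "blue", "blue", "blue", "blue"]
leaves = [["red", "blue", "blue", "blue"], ["red", "blue"]]

root_impurity = gini(root)                     # 16/36
leaf_impurity = weighted_leaf_impurity(leaves) # 4/6 * 0.375 + 2/6 * 0.5
reduction = root_impurity - leaf_impurity      # exactly 1/36 before rounding

print(round(root_impurity, 3), round(leaf_impurity, 3), round(reduction, 3))
# prints: 0.444 0.417 0.028
```

A larger reduction means a better split under this metric, so a tree builder would typically compute this number for each candidate split and keep the one with the largest value.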
