What does statistical over-fitting look like?

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact Lawrence at lawrence@krubner.com, or follow him on Twitter.

I like how clear this makes the mistake of over-fitting:

The model explains over 99% of the variance in the data. Like I said, not a typical data set.

View the estimates of the coefficients and the p-values of their t-tests:

(:coefs lm)
(:t-probs lm)
The values for coefficients b0, … b10 are (0.878 0.065 -0.066 -0.016 0.037 0.003 -0.009 -2.8273E-4 9.895E-4 1.050E-5 -4.029E-5), and the p-values are (0 0 0 1.28E-5 0 0.083 1.35E-12 0.379 3.74E-8 0.614 2.651E-5).

All the coefficients are significant except b5, b7, and b9.
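
The significance claim can be checked directly from the p-values listed above. The sketch below uses plain Clojure and assumes the conventional 0.05 cutoff, which the quoted post does not state; the names `p-values`, `alpha`, and `not-significant` are made up for illustration.

```clojure
;; p-values for b0 ... b10, copied from the output quoted above
(def p-values [0 0 0 1.28E-5 0 0.083 1.35E-12 0.379 3.74E-8 0.614 2.651E-5])

;; Assumed significance level; the quoted post does not name one
(def alpha 0.05)

;; Names of the coefficients whose p-value exceeds the cutoff
(def not-significant
  (keep-indexed (fn [i p] (when (> p alpha) (str "b" i))) p-values))

(println not-significant)
```

With that cutoff, the three coefficients flagged are b5, b7, and b9, which matches the statement above.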

Finally, overlay the fitted values on the original scatter-plot:

(add-lines plot x (:fitted lm))

That’s the kind of fit rarely seen on real data! In fact, on real data this would be an example of over-fitting. This model likely wouldn’t generalize to new data from the same process that created this sample.
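
To make the generalization point concrete, here is a minimal sketch in plain Clojure (no Incanter). It is not the model from the quoted post: the data are invented, and an interpolating polynomial stands in for the tenth-order regression as the extreme case of a model that matches its sample exactly. It is compared against an ordinary least-squares line on held-out points. All names (`lagrange`, `fit-line`, `mse`, `train-x`, and so on) are hypothetical.

```clojure
;; Invented data: y = 2x + 1 plus a fixed, made-up noise term
(def noise   [0.3 -0.2 0.25 -0.3 0.2 -0.25])
(def train-x [0.0 1.0 2.0 3.0 4.0 5.0])
(def train-y (mapv (fn [x e] (+ (* 2 x) 1 e)) train-x noise))

;; New points from the same underlying line (left noise-free for simplicity)
(def test-x [0.5 1.5 2.5 3.5 4.5 5.5])
(def test-y (mapv #(+ (* 2 %) 1) test-x))

(defn lagrange
  "Value at x of the unique degree n-1 polynomial through all n points."
  [xs ys x]
  (reduce +
          (map-indexed
           (fn [i yi]
             (* yi
                (reduce *
                        (for [j (range (count xs)) :when (not= i j)]
                          (/ (- x (xs j)) (- (xs i) (xs j)))))))
           ys)))

(defn fit-line
  "Ordinary least-squares line; returns [intercept slope]."
  [xs ys]
  (let [n     (count xs)
        mx    (/ (reduce + xs) n)
        my    (/ (reduce + ys) n)
        sxy   (reduce + (map #(* (- %1 mx) (- %2 my)) xs ys))
        sxx   (reduce + (map #(* (- % mx) (- % mx)) xs))
        slope (/ sxy sxx)]
    [(- my (* slope mx)) slope]))

(defn mse
  "Mean squared error of the one-argument function predict on xs and ys."
  [predict xs ys]
  (/ (reduce + (map #(let [d (- (predict %1) %2)] (* d d)) xs ys))
     (count xs)))

;; The over-fitted model passes through every training point
(def poly #(lagrange train-x train-y %))

;; The simple model is a straight line
(def line (let [[b0 b1] (fit-line train-x train-y)]
            #(+ b0 (* b1 %))))

(println "train MSE, polynomial:" (mse poly train-x train-y))
(println "train MSE, line:      " (mse line train-x train-y))
(println "test MSE, polynomial: " (mse poly test-x test-y))
(println "test MSE, line:       " (mse line test-x test-y))
```

On this invented sample the polynomial has zero training error yet a larger error than the line on the held-out points. That is the signature of over-fitting: the extra flexibility went into reproducing the noise rather than the underlying process.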

Source: http://data-sorcery.org/2009/06/04/linear-regression-with-higher-order-terms/