Thursday, January 12, 2017

How to Avoid Overfitting

Regularization is the fix. Cross-validation is the diagnostic tool.

In short, the general strategies are to
  1. collect more data
  2. use ensembling methods that “average” models
  3. choose simpler models / penalize complexity
For the first point, it may help to plot learning curves, i.e., training performance vs. validation (or cross-validation) performance as a function of the training set size. If you see a trend that more data helps with closing the gap between the two, and if you can afford to collect more data, then this would probably be the best choice.
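As a rough illustration (a sketch assuming scikit-learn; the synthetic dataset and the logistic regression estimator are just placeholders for your own data and model), such a learning curve could be plotted like this:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Placeholder data; substitute your own feature matrix X and labels y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Training vs. cross-validation accuracy for increasing training set sizes.
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, valid_scores.mean(axis=1), label="cross-validation score")
plt.xlabel("number of training examples")
plt.ylabel("accuracy")
plt.legend()
plt.show()

If the two curves are still converging as the training set grows, collecting more data is likely to pay off.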
In my experience, ensembling is probably the most convenient way to build robust predictive models on somewhat small-sized datasets. As in real life, consulting a bunch of “experts” is usually not a bad idea before making a decision ;).
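As a hedged sketch of this "averaging" of experts (again assuming scikit-learn; the three base estimators are arbitrary picks), a soft-voting ensemble could look like this:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Three different "experts" whose predicted probabilities are averaged.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)

print(cross_val_score(ensemble, X, y, cv=5).mean())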

Regarding the third point, I usually start a predictive modeling task with the simplest model as a benchmark: usually logistic regression. Overfitting can be a real problem if our model has too much capacity: too many model parameters to fit and too many hyperparameters to tune. If the dataset is small, a simple model is usually a good option to prevent overfitting, and it also serves as a good benchmark for comparison against more "complex" alternatives.
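For instance (a sketch under the same assumptions as above, with arbitrary model choices), the simple benchmark can be compared directly against a higher-capacity model via cross-validation:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = [
    ("logistic regression (benchmark)", LogisticRegression(max_iter=1000)),
    ("gradient boosting", GradientBoostingClassifier(random_state=0)),
]

# If the complex model does not clearly beat the simple benchmark,
# the extra capacity is probably not worth the overfitting risk.
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")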


As others have said, the best ways to address under-fitting and over-fitting are to use regularization and/or cross-validation. Regularization controls the penalty for complexity, which (when successful) will prevent under- and over-fitting. Cross-validation separates model selection from testing, resulting in a more conservative estimate of generalization. For the case of over-fitting only, obtaining more training data will also help.
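To make this concrete (a minimal sketch assuming scikit-learn; the grid of penalty strengths is arbitrary), the regularization strength itself can be chosen by cross-validation:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Smaller C means a stronger L2 penalty on the model weights.
search = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)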

To "test" for under- and over-fitting, it can be helpful to plot your regression/classification error by some measure of complexity. For under-fitting degrees of complexity, training and testing errors will both be high. For over-fitting degrees of complexity, training errors will be low and testing errors will be high.

[Figure: (A) an example of overfitting in black versus a good fit in red; (B) underfitting on the left of the graph, overfitting on the right]

An alternative graph to use is a "learning curve" which plots regression/classification error by the number of training examples used. For under-fitting, training and testing errors will converge with larger training sets (and both be high). For over-fitting, the gap between training and testing errors will remain large (with the training error being smaller).

[Figure: (A) an example of a learning curve with over-fitting]
