Wednesday, January 4, 2017

How to Tackle Overfitting?

I can see why the interviewers would be unsatisfied with your answer to this question. This is an introductory-level interview question which is usually covered in the first or second lecture of a machine learning course (similar to sorting algorithms or reversing lists in software engineering interviews). Overfitting is usually illustrated with either K-Nearest Neighbors where K = 1, or polynomial regression where you use a high-order polynomial. The key is that if you use a 14th order polynomial to train a model using only 14 observations, you fit the training set perfectly while generalizing poorly, e.g., getting poor accuracy on the test set. This naturally leads to the concept of regularization, using either Ridge Regression (L2 penalization) or LASSO (L1 penalization). My guess is that this is what the interviewers are looking for.
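As a quick sketch of that illustration (synthetic data, assuming numpy and scikit-learn are installed; this isn't tied to any particular interview answer), a high-order polynomial fit drives the training error to essentially zero while the test error blows up, and an L2 (Ridge) penalty trades a little training error for much better test error:

```python
# Minimal sketch: unpenalized high-order polynomial vs. the same model with an
# L2 (Ridge) penalty. The data and degree are made up for illustration only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 14))                      # 14 training observations
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=14)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for name, reg in [("unpenalized", LinearRegression()),
                  ("ridge (L2)", Ridge(alpha=1e-3))]:
    model = make_pipeline(PolynomialFeatures(degree=14), reg)
    model.fit(x[:, None], y)
    print(name,
          "| train MSE:", round(mean_squared_error(y, model.predict(x[:, None])), 4),
          "| test MSE:", round(mean_squared_error(y_test, model.predict(x_test[:, None])), 4))
```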

Your answer is likely unsatisfactory here because it says nothing about how to tackle the overfitting problem. While cross validation lets you observe the existence of overfitting (high train accuracy + low test/validation accuracy), it doesn't tell you how to solve it. If you were using random forests or xgboost, all you have to do to "reduce" overfitting is to decrease max_depth or increase the min_child_weight parameter. However, the general term used in machine learning is regularization, and it applies to regression, SVM, and neural networks. You can then go from there to discuss how to achieve similar results by decreasing max_depth in tree-based methods, increasing the dropout rate in neural networks, or increasing K in KNN.
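As a rough sketch of what those knobs do in practice (synthetic data, assuming xgboost and scikit-learn are installed; the parameter values are arbitrary), a deep, unconstrained model shows a much larger train/validation gap than a shallower, more constrained one:

```python
# Sketch only: lower max_depth and higher min_child_weight act as regularizers,
# shrinking the gap between training and validation accuracy.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

configs = {
    "overfit-prone": dict(max_depth=12, min_child_weight=1),
    "regularized":   dict(max_depth=3,  min_child_weight=10),
}

for name, params in configs.items():
    model = xgb.XGBClassifier(n_estimators=200, learning_rate=0.1, **params)
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    print(name,
          "| train accuracy:", round(scores["train_score"].mean(), 3),
          "| validation accuracy:", round(scores["test_score"].mean(), 3))
```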

I hope this answers your question. For further understanding, you should look up "regularization" in a machine learning textbook or on Wikipedia.

First of all, I don't do automated grid search with xgboost; instead, I search manually with parameters that seem to make sense. When I do use some automated grid search, I just implement my own for loop over a set of different parameters. The reason I do this is to get a good feel for how the algorithm behaves under different parameters, and also to get more flexibility in the implementation (e.g., getting the CV scores). My method isn't ideal, since the scripts others post are more elegant and clean, but those require some additional effort to understand.
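Roughly, the kind of loop I mean looks like the sketch below. This is not my actual script; the parameter names and ranges are placeholders, the data is synthetic, and it assumes xgboost and scikit-learn are installed.

```python
# Hand-rolled "grid search": loop over a few hand-picked parameter combinations
# and keep the CV scores so you can inspect how each setting behaves.
import itertools
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

# stand-in data; in practice X, y would be your real training set
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

param_grid = {
    "max_depth":        [3, 5, 7],
    "min_child_weight": [1, 5, 10],
    "subsample":        [0.8, 1.0],
}

results = []
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.05, **params)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    results.append((params, scores.mean()))
    print(params, "| mean CV MSE:", -scores.mean())

best_params, best_score = max(results, key=lambda r: r[1])  # least negative = lowest MSE
print("best parameters:", best_params, "| CV MSE:", -best_score)
```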

That aside, overfitting is a much bigger issue than cross validation. The primary purpose of cross validation isn't to reduce overfitting; rather, it's to let you validate on all of your training data instead of a single train/test split. The reasoning is as follows: 1) you can train an algorithm on all of your training data and then deploy it. This naturally leads to overfitting because, by definition, the maximum likelihood estimator finds the best parameters for your training set only. 2) You can split your training data into a training and validation set, e.g., a 3:1 split. This lets you see whether you've overfit on the training set, because you can choose hyperparameters that maximize performance on the validation set rather than on the training fit. The drawback is that you're only training on 75% of the data and validating on 25% of the data; if there are outliers in either part, you might end up with biased predictions. 3) Cross validation cycles through different 3:1 partitions so that ultimately you've trained on all of the data and validated on all of the data. This is better than option 2 because it covers more of the data, but it also takes more time to run (e.g., 5-fold CV takes 5 times as long).
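To make options 2) and 3) concrete, here is a small sketch with scikit-learn on synthetic data (not part of the original argument): a single 3:1 split validates on one fixed quarter of the data, while 4-fold CV rotates through all four quarters.

```python
# Option 2: one fixed 3:1 train/validation split.
# Option 3: 4-fold CV, i.e. four rotating 3:1 splits that together cover all of the data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, KFold

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=1)

# Option 2: single hold-out split (3:1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=1)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("hold-out MSE:", mean_squared_error(y_val, model.predict(X_val)))

# Option 3: 4-fold CV, each quarter takes a turn as the validation set
for fold, (tr_idx, val_idx) in enumerate(
        KFold(n_splits=4, shuffle=True, random_state=1).split(X)):
    model = Ridge(alpha=1.0).fit(X[tr_idx], y[tr_idx])
    print("fold", fold, "MSE:", mean_squared_error(y[val_idx], model.predict(X[val_idx])))
```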

Ultimately though, these methods help you see whether you've overfit, and optimizing your algorithm based on validation error makes overfitting less likely. However, a learning algorithm can only minimize the training error, not the prediction error, so you need methods that make this tradeoff explicit (e.g., regularization): they accept a slight increase in training error in exchange for better generalizability, and therefore a lower validation or test error.
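Written out in generic notation (my notation, not tied to any particular method), the regularized objective that makes this tradeoff explicit is

$$ \hat{\theta} \;=\; \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f(x_i;\theta)\big) \;+\; \lambda\,\Omega(\theta), $$

where the first term is the training error and the second is the penalty. The penalty weight λ ≥ 0 is a hyperparameter chosen on validation data: λ = 0 recovers the pure training-error (maximum likelihood style) fit, and larger values accept more training error in exchange for a simpler, better-generalizing model.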

Yes, your process does take care of the overfitting issue. However, simply citing that process doesn't demonstrate that you understand what overfitting is, why it exists, and how to deal with it in general. Furthermore, even with cross validation you can still see significant overfitting, and that isn't necessarily a reason to reject a model: even if there is a large gap between the train and validation error, it's usually best to select the model with the best validation error.

So what exactly is overfitting then? By definition, all machine learning methods optimize something on a training set. For linear regression, you're minimizing the mean squared error. For support vector machines, you're maximizing the margin, i.e., the distance between the class boundary and the closest points from each class. For decision trees, you're optimizing a split criterion (e.g., the mean squared error for regression trees) at each split. Traditionally, statistical and machine learning methods have focused on only this bit. Often, minimizing the mean squared error is also mathematically equivalent to maximum likelihood estimation, which is to say you find the parameters that maximize the probability of observing the training set. However, maximizing the probability of observing the training set may not generalize well, because you may inadvertently capture noise in your data. This is illustrated by high-order polynomial regression, where the noise gets fitted into the model; when you make predictions with such a model, you get poor performance, hence your model has overfit. Take xgboost for example: if you set max_depth to infinity, then each individual observation in the training set can be fit exactly, and in essence you assume zero noise in your training data. That would be a textbook example of overfitting.

To avoid this problem, researchers have developed different methods which prevent the algorithm from going all the way to the maximum likelihood estimator by including some form of "penalty". However, how strong the penalty should be cannot be estimated together with the original parameters, so the penalty strength is called a hyperparameter, and you tune it through validation, whether that's cross validation or some other scheme. The "parameters" that you're referring to in xgboost are actually hyperparameters. In practice, all machine learning methods have hyperparameters that can be tuned by the user, and you'll almost never see that the hyperparameter settings corresponding to pure maximum likelihood estimation lead to the best validation or test performance.
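To make the parameter/hyperparameter distinction concrete, here is a small sketch with scikit-learn's RidgeCV on synthetic data (not from the original text): the coefficients are parameters estimated from the training set once the penalty is fixed, while the penalty strength alpha is a hyperparameter picked by cross validation.

```python
# Parameters vs hyperparameters: RidgeCV fits the coefficients (parameters) for
# each candidate alpha and uses cross validation to pick alpha (the hyperparameter).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=2)

model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
print("chosen hyperparameter alpha:", model.alpha_)
print("first few fitted parameters (coefficients):", model.coef_[:5])
```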




For most purposes, I think using only a hold-out set for validation is sufficient. It may even be better than cross validation in situations where the test set differs in a structured way from the train set (e.g., time series data). There's no rule of thumb for how large the hold-out set needs to be, because it's completely dependent on the problem and the data in each individual case. Therefore it's important to understand the data before settling on a train/validation process.
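For the time-series case, the structured hold-out I have in mind is just a cutoff in time rather than a random split. A sketch, where the DataFrame, date column, and cutoff are all hypothetical:

```python
# Sketch: validate on the most recent period instead of a random 25%, so the
# validation setup mimics how the model will actually be used (predicting the future).
# 'date' and the cutoff value are hypothetical, for illustration only.
import pandas as pd

def time_based_split(df, date_col="date", cutoff="2016-10-01"):
    """Train on everything before the cutoff, validate on everything on/after it."""
    train = df[df[date_col] < pd.Timestamp(cutoff)]
    valid = df[df[date_col] >= pd.Timestamp(cutoff)]
    return train, valid

# hypothetical usage: train_df, valid_df = time_based_split(sales_df, date_col="order_date")
```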

As for evaluation measures, different measures usually do lead to different resulting models. By definition, machine learning methods optimize a specific measure. This is commonly illustrated with MSE vs. MAE: the MAE is more robust than the MSE to outliers in the dependent variable. Imagine a dataset where the dependent variable values are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000. If you use MSE as your measure, the naive constant prediction that minimizes the error is 95 (the mean); if you use MAE, the naive prediction is 5 (the median). If the 1000 was indeed an outlier (e.g., a data entry error) and it was actually intended to be 10, then the MSE-based prediction would have been very wrong and the MAE-based prediction roughly correct. When you see such a difference in the naive prediction, you can bet that the actual estimated models would be very different as well. This is an extreme example of course, but extreme cases occur far more often than one may think. Unfortunately, for mathematical reasons most methods internally optimize MSE anyway, but when you evaluate your feature engineering and tune hyperparameters you are using a user-specified measure, so there will still be a significant difference in the resulting models if the measures differ enough for the dataset.
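The arithmetic behind those two naive predictions is just the mean versus the median (a quick check with numpy, nothing beyond the numbers already given above):

```python
# The constant prediction that minimizes MSE is the mean; the one that minimizes
# MAE is the median. With one extreme value they disagree wildly.
import numpy as np

y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])
print("MSE-optimal constant (the mean):  ", y.mean())       # 95.0
print("MAE-optimal constant (the median):", np.median(y))   # 5.0

y_fixed = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])      # if 1000 was really a typo for 10
print("mean after fixing the outlier:    ", y_fixed.mean()) # 5.0
```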




