Thursday, January 12, 2017

How to Build a Model Given Data with 1000 Observations and 1000 Variables?

When the number of features >= the number of observations, we have an "ill-posed problem" in ML terminology, or an underdetermined (under-identified) model in statistics and econometrics terminology. This comes from the linear regression framework: when the number of features (m) >= the number of observations (n), there are solutions with zero training error, and infinitely many of them once m > n. Depending on the exact algorithm used, you might not even be able to compute the linear regression result, as the matrix (X'X) won't be invertible. Basically, this case is the extreme of overfitting, where essentially any model you come up with will have zero training error.
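To make the singular (X'X) point concrete, here is a minimal sketch using NumPy on made-up random data (the sizes and seed are arbitrary, chosen purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 1000                      # far fewer observations than features
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X                         # p x p, but its rank is at most n
print(np.linalg.matrix_rank(XtX))     # 100, far below p = 1000, so not invertible

# The minimum-norm least-squares solution still exists (via the pseudoinverse)
# and drives the training error to essentially zero -- a symptom of the many
# interpolating solutions, not of a good model.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(y - X @ beta))   # ~0: "perfect" fit on the training data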

How to proceed? Here are some things to consider:
1. Can we increase the number of observations? If we want to use all 1000 features then something like 10000 observations might be a minimum requirement (although this is arbitrary).
2. Can we remove some features? If a priori we know that some features aren't reasonable to use, then they could be left out. If we keep to 1000 observations, then we should probably use 100 features at most.
3. If we want to keep 1000 observations and 1000 features, then there are methods that can deal with this. Adding a ridge or LASSO penalty gets around the underdetermination problem (see the sketch after this list). Alternatively, you can ensemble ML methods that each use a random subset of the features (e.g., the random subspace idea used in random forests).
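As a sketch of option 3, here is how penalized regression might look with scikit-learn on a synthetic dataset of 1000 observations and 1000 features (the data, penalty grid, and scores are illustrative assumptions, not a recipe):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.model_selection import cross_val_score

# Synthetic p = n = 1000 problem; only 20 features actually matter.
X, y = make_regression(n_samples=1000, n_features=1000,
                       n_informative=20, noise=5.0, random_state=0)

# Penalty strength is chosen by cross-validation inside each model.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))
lasso = LassoCV(cv=5, random_state=0)

for name, model in [("ridge", ridge), ("lasso", lasso)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")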


The general rule of thumb (based on Frank Harrell's book, Regression Modeling Strategies) is that if you expect to be able to detect reasonable-size effects with reasonable power, you need 10-20 observations per parameter (covariate) estimated. Harrell discusses a lot of options for "dimension reduction" (getting your number of covariates down to a more reasonable size), such as PCA, but the most important thing is that, in order to have any confidence in the results, dimension reduction must be done without looking at the response variable. Re-running the regression with just the variables that came out significant is in almost every case a bad idea.
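As a sketch of that idea, assuming scikit-learn and a synthetic dataset: PCA is fit on the predictors alone (the response is never consulted), and a plain regression is then run on a small number of components, which keeps the observations-per-parameter ratio in the range the rule of thumb asks for:

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=1000, n_features=1000,
                       n_informative=20, noise=5.0, random_state=0)

# 1000 observations / 20 components = 50 observations per estimated parameter.
# PCA sees only X, never y, so the reduction is "blind" to the response.
pcr = make_pipeline(PCA(n_components=20), LinearRegression())
scores = cross_val_score(pcr, X, y, cv=5, scoring="r2")
print(f"principal-components regression: mean CV R^2 = {scores.mean():.3f}")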



What you've hit on here is the curse of dimensionality, or the p >> n problem (where p is the number of predictors and n the number of observations). Many techniques have been developed over the years to deal with it. You can use AIC or BIC to penalize models with more predictors. You can choose random sets of variables and assess their importance using cross-validation. You can use ridge regression, the lasso, or the elastic net for regularization. Or you can choose a technique, such as a support vector machine or random forest, that deals well with a large number of predictors.
Honestly, the solution depends on the specific nature of the problem you are trying to solve.
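As a rough illustration of the last two options, here is a sketch (again assuming scikit-learn and synthetic data) comparing the elastic net with a random forest, both scored by cross-validation:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=1000,
                       n_informative=20, noise=5.0, random_state=0)

models = {
    "elastic net": ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")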
