Monday, February 13, 2017

Random Forest Limitation

Random forest for bag-of-words? No.

Random forest is a very good, robust and versatile method; however, it's no secret that it is not the best choice for high-dimensional sparse data. And the bag-of-words (BoW) representation is a perfect example of sparse, high-dimensional data.
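To see why, here is a minimal sketch using scikit-learn's CountVectorizer on a made-up toy corpus (not the post's actual data): the resulting matrix has one column per vocabulary term, so it gets wide quickly, and almost all entries are zero.

    from sklearn.feature_extraction.text import CountVectorizer

    # hypothetical toy corpus, for illustration only
    texts = ["a bag of words", "words in a bag", "a nice little network"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)   # SciPy sparse matrix, one column per term

    print(X.shape)                                  # (n_documents, vocabulary_size)
    print(X.nnz / (X.shape[0] * X.shape[1]))        # fraction of non-zero entries

On a real corpus the vocabulary runs into tens or hundreds of thousands of columns, with only a handful of non-zeros per row.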
We have covered bag of words a few times before, for example in A bag of words and a nice little network. In that post we used a neural network for classification, but the truth is that a linear model, in all its glorious simplicity, is usually the first choice. We'll use logistic regression, for now leaving the hyperparams at their default values.
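A minimal sketch of that setup, assuming raw texts and binary labels sit in hypothetical variables train_docs, y_train, val_docs, y_val (placeholder names, not the post's code); scikit-learn defaults are used throughout.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # vectorize: fit the vocabulary on the training texts only
    vectorizer = CountVectorizer()
    x_train = vectorizer.fit_transform(train_docs)
    x_val = vectorizer.transform(val_docs)

    # logistic regression with default hyperparameters
    clf = LogisticRegression()
    clf.fit(x_train, y_train)

    # score the validation set with the probability of the positive class
    p = clf.predict_proba(x_val)[:, 1]
    print('validation AUC:', roc_auc_score(y_val, p))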
Validation AUC for logistic regression is 92.8%, and it trains much faster than a random forest. If you're going to remember only one thing from this article, remember to use a linear model for sparse, high-dimensional data such as text represented as bag-of-words.
