https://www.kaggle.com/general/4092
Random Forest is just a bagged version of decision trees, except that at each split we only consider 'm' randomly chosen attributes.
Random forest achieves a lower test error solely through variance reduction, so increasing the number of trees in the ensemble won't have any effect on the bias of your model; a higher number of trees will only reduce its variance. Moreover, you can achieve a greater variance reduction by reducing the correlation between trees in the ensemble. This is why we randomly select 'm' attributes at each split: it introduces randomness into the ensemble and reduces the correlation between trees. Hence 'm' is the main parameter to tune in a random forest ensemble.
In general the best 'm' is obtained by cross validation. Some of the factors affecting 'm': 1) a small value of m will reduce the variance of the ensemble but will also increase the bias of each individual tree in the ensemble; 2) the value of m also depends on the ratio of noisy variables to important variables in your data set. If you have a lot of noisy variables, a small 'm' will decrease the probability of choosing an important variable at a split, hurting your model.
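A minimal sketch of picking 'm' (max_features) by cross-validation with scikit-learn, assuming scikit-learn is the library in use; the dataset and candidate grid values are placeholders, not from the original post:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# toy data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    param_grid={"max_features": [2, 4, 6, 8, 12, 20]},  # candidate values of m
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_)  # e.g. {'max_features': 4}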
Hope this helps
========================================================
max_features = None/auto (i.e. consider all features at every split) reduces the forest to plain bagged trees.
Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.
Empirically good default values are max_features=n_features for regression problems and max_features=sqrt(n_features) for classification tasks.
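A quick illustration of these max_features settings, assuming scikit-learn; the constructor values shown here are my own, not part of the quoted text:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# max_features=None: every split sees all features, i.e. plain bagged trees.
bagged_like = RandomForestClassifier(n_estimators=100, max_features=None)

# Common defaults: sqrt(n_features) for classification, all features for regression.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
reg = RandomForestRegressor(n_estimators=100, max_features=1.0)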
For the number of attributes tried at each split, the default is the square root of the total number of attributes, yet usually the forest is not very sensitive to the value of this parameter -- in fact it is rarely optimized, especially because the stochastic nature of RF may introduce larger variations.
Increasing max_features generally improves the performance of the model, since at each node there are more candidate features to consider. However, this is not guaranteed, because it also decreases the diversity of the individual trees, which is the unique selling point of random forest.
If you have built a decision tree before, you can appreciate the importance of minimum sample leaf size. A leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in the training data. Generally I prefer a minimum leaf size of more than 50. However, you should try multiple leaf sizes to find the best one for your use case.
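A rough sketch of comparing leaf sizes as suggested above, assuming scikit-learn; the dataset and the leaf-size values are illustrative placeholders:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)

for leaf in (1, 10, 50, 100):
    rf = RandomForestRegressor(n_estimators=200, min_samples_leaf=leaf,
                               random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()  # default scoring: R^2
    print(f"min_samples_leaf={leaf}: R^2={score:.3f}")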
=====================================================
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#overview
Each tree is grown as follows:
- If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.
- If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
- Each tree is grown to the largest extent possible. There is no pruning.
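A minimal sketch of this procedure built from scikit-learn and NumPy pieces, assuming integer class labels; the function names (grow_forest, predict) are my own, not Breiman's:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, m=None, random_state=0):
    rng = np.random.RandomState(random_state)
    N, M = X.shape
    m = m or int(np.sqrt(M))          # m << M variables tried at each node
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, N, size=N)   # sample N cases with replacement
        tree = DecisionTreeClassifier(
            max_features=m,               # random subset of variables per split
            max_depth=None,               # grow to the largest extent, no pruning
            random_state=rng.randint(2**31 - 1),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict(trees, X):
    # majority vote across trees (assumes non-negative integer labels)
    votes = np.stack([t.predict(X) for t in trees])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

# Usage: forest = grow_forest(X, y); preds = predict(forest, X)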
In the original paper on random forests, it was shown that the forest error rate depends on two things:
- The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
- The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.
Reducing m reduces both the correlation and the strength. Increasing it increases both. Somewhere in between is an "optimal" range of m - usually quite wide. Using the oob error rate (see below) a value of m in the range can quickly be found. This is the only adjustable parameter to which random forests is somewhat sensitive.
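A sketch of that OOB-based search for m, assuming scikit-learn; the dataset and the candidate values of m are placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=25, n_informative=6,
                           random_state=0)

for m in (2, 5, 10, 15, 25):
    rf = RandomForestClassifier(n_estimators=500, max_features=m,
                                bootstrap=True, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"m={m}: OOB accuracy={rf.oob_score_:.3f}")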
=====================================================
You are doing it wrong -- the essential part of RF is that it basically only requires making the number of trees large enough to converge, and that's it (this becomes obvious once one starts doing proper tuning, i.e. nested cross-validation to check how robust the selection of parameters really is). If the performance is bad, it is better to fix the features or look for another method.
Pruning works nicely for single decision trees because it removes noise, but doing it within RF undermines bagging, which relies on having uncorrelated members during voting.
Max depth is usually only a technical parameter to avoid recursion overflows, while min samples per leaf is mainly for smoothing votes in regression -- the spirit of the method is that "each tree is grown to the largest extent possible."
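A sketch of the "just make the forest big enough" advice, assuming scikit-learn: OOB accuracy flattens out as trees are added while each tree is grown fully. The dataset and the n_estimators values are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for n in (25, 50, 100, 300, 1000):
    rf = RandomForestClassifier(n_estimators=n,
                                max_depth=None,        # grow each tree fully
                                min_samples_leaf=1,    # no smoothing of leaves
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"n_estimators={n}: OOB accuracy={rf.oob_score_:.3f}")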