Thursday, August 25, 2016

Gaussian Mixture Model Simple Explanation

http://www.computerrobotvision.org/2010/tutorial_day/GMM_said_crv10_tutorial.pdf

https://www.quora.com/What-is-an-intuitive-explanation-of-Gaussian-mixture-models

http://stackoverflow.com/questions/26019584/understanding-concept-of-gaussian-mixture-models
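A minimal sketch of fitting a mixture with scikit-learn's GaussianMixture (the toy data and the number of components are made up for illustration):

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data drawn from two Gaussian clusters
rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(-2, 0.5, 200),
                    rng.normal(3, 1.0, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)        # estimated component means
print(gmm.weights_)      # estimated mixing proportions
labels = gmm.predict(X)  # most likely component for each point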


Wednesday, August 10, 2016

Statistics synonyms


Depending on the context, an independent variable is sometimes called a "predictor variable", "regressor", "controlled variable", "manipulated variable", "explanatory variable", "exposure variable" (see reliability theory), "risk factor" (see medical statistics), "feature" (in machine learning and pattern recognition) or "input variable."[10][11]
Depending on the context, a dependent variable is sometimes called a "response variable", "regressand", "predicted variable", "measured variable", "explained variable", "experimental variable", "responding variable", "outcome variable", or "output variable".[11]
"Explanatory variable" is preferred by some authors over "independent variable" when the quantities treated as independent variables may not be statistically independent or independently manipulable by the researcher.[12][13] If the independent variable is referred to as an "explanatory variable" then the term "response variable" is preferred by some authors for the dependent variable.[11][12][13]
"Explained variable" is preferred by some authors over "dependent variable" when the quantities treated as "dependent variables" may not be statistically dependent.[14] If the dependent variable is referred to as an "explained variable" then the term "predictor variable" is preferred by some authors for the independent variable.[14]
Variables may also be referred to by their form: continuous, binary/dichotomous, nominal categorical, and ordinal categorical, among others.

Friday, August 5, 2016

How to Speed Up SVR?

PCA

small sample: doesn't work for RF
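A minimal sketch of the PCA idea, assuming a scikit-learn pipeline that projects onto a few principal components before fitting SVR (the component count, kernel, and X_train/y_train are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVR

# Scale, reduce dimensionality, then fit SVR on the reduced features;
# fewer input dimensions generally means faster SVR training and prediction.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=10),
                      SVR(kernel='rbf', C=1.0))
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)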

How is Out-of-Bag Error Calculated in Random Forest?




http://stats.stackexchange.com/questions/70704/interpreting-out-of-bag-error-estimate-for-randomforestregressor

http://www.stat.berkeley.edu/~breiman/OOBestimation.pdf
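In scikit-learn the OOB estimate can be requested directly; a minimal sketch (X and y stand in for your training data):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
# Each tree is fit on a bootstrap sample; the rows it never saw (roughly 1/3)
# are used to score it. For the regressor, oob_score_ is the R^2 of the
# out-of-bag predictions.
print(rf.oob_score_)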

http://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_oob.html

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=how+to+estimate+number+of+iterations+for+gradient+boosting



https://yanirseroussi.com/2014/12/29/stochastic-gradient-boosting-choosing-the-best-number-of-iterations/
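Following the scikit-learn OOB example linked above, one way to pick the number of boosting iterations is to train with subsample < 1.0 and take the iteration where the cumulative OOB improvement peaks; a rough sketch (X and y stand in for your data):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(n_estimators=1000, subsample=0.8, random_state=0)
gbr.fit(X, y)

# oob_improvement_[i] is the OOB loss improvement at iteration i
# (only available when subsample < 1.0)
cumulative = np.cumsum(gbr.oob_improvement_)
best_n_estimators = int(np.argmax(cumulative)) + 1
print(best_n_estimators)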


Tuesday, August 2, 2016

Understanding RandomForest Parameters

https://www.kaggle.com/general/4092
Random Forest is just a bagged version of decision trees except that at each split we only select 'm' randomly chosen attributes.
Random forest achieves a lower test error solely by variance reduction. Therefore increasing the number of trees in the ensemble won't have any effect on the bias of your model; a higher number of trees will only reduce its variance. Moreover, you can achieve a greater variance reduction by reducing the correlation between trees in the ensemble. This is why we randomly select 'm' attributes at each split: it introduces some randomness into the ensemble and reduces the correlation between trees. Hence 'm' is the major attribute to be tuned in a random forest ensemble.
In general the best 'm' is obtained by cross-validation (see the sketch below). Some of the factors affecting 'm' are: 1) a small value of m will reduce the variance of the ensemble but will also increase the bias of an individual tree in the ensemble; 2) the value of m also depends on the ratio of noisy variables to important variables in your data set. If you have a lot of noisy variables, then a small 'm' will decrease the probability of choosing an important variable at a split, thus hurting your model.
Hope this helps
========================================================
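A minimal sketch of choosing 'm' (max_features in scikit-learn) by cross-validation, as the answer above suggests (the candidate grid and X, y are placeholders):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Candidate values of m: fractions of the features, sqrt(n_features), or all features
param_grid = {'max_features': [0.2, 0.33, 'sqrt', None]}
search = GridSearchCV(RandomForestRegressor(n_estimators=300, random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)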


max_features = None / 'auto' (i.e. all features) makes the forest equivalent to bagged trees

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

Empirically good default values are max_features=n_features for regression problems and max_features=sqrt(n_features) for classification tasks.

For the number of attributes tried at each split, the default is the square root of the total number of attributes, yet usually the forest is not very sensitive to the value of this parameter; in fact it is rarely optimized, especially because the stochastic aspect of RF may introduce larger variations.

Increasing max_features generally improves the performance of the model, since at each node we have a higher number of options to consider. However, this is not necessarily true, because it also decreases the diversity of the individual trees, which is the USP of a random forest.

If you have built a decision tree before, you can appreciate the importance of minimum leaf size. A leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in the training data. Generally I prefer a minimum leaf size of more than 50. However, you should try multiple leaf sizes to find the optimal one for your use case.
=====================================================
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#overview
Each tree is grown as follows:
  1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
  3. Each tree is grown to the largest extent possible. There is no pruning.
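A toy numpy sketch of the two sampling steps above (bootstrap the N cases, then pick m of the M variables at a node); it only illustrates where the randomness comes from, not a full tree implementation, and X, y, m are placeholders:

import numpy as np

rng = np.random.RandomState(0)
N, M = X.shape          # X is the training feature matrix
m = 3                   # number of variables tried at each node, m << M

# 1. Bootstrap sample: N cases drawn with replacement
boot_idx = rng.randint(0, N, size=N)
X_boot, y_boot = X[boot_idx], y[boot_idx]

# 2. At a node, only m randomly chosen variables are candidates for the split
split_candidates = rng.choice(M, size=m, replace=False)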
In the original paper on random forests, it was shown that the forest error rate depends on two things:
  • The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
  • The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.

Reducing m reduces both the correlation and the strength. Increasing it increases both. Somewhere in between is an "optimal" range of m - usually quite wide. Using the oob error rate (see below) a value of m in the range can quickly be found. This is the only adjustable parameter to which random forests is somewhat sensitive.
=====================================================
You are doing it wrong -- the essential part of RF is that it basically only requires making the number of trees large enough to converge, and that's it (this becomes obvious once one starts doing proper tuning, i.e. nested cross-validation, to check how robust the selection of parameters really is). If the performance is bad it is better to fix the features or look for another method.
Pruning trees works nicely for decision trees because it removes noise, but doing this within RF kills bagging, which relies on it for having uncorrelated members during voting. Max depth is usually only a technical parameter to avoid recursion overflows, while min samples per leaf is mainly for smoothing votes in regression -- the spirit of the method is that
Each tree is grown to the largest extent possible.


Sunday, July 31, 2016

Feature Engineering

You can take different combinations of features, such as sums of features (feat_1 + feat_2 + feat_3, ...) or products of them. Or you can transform features with log, exponential, or sigmoid functions, or even discretize a numeric feature into a categorical one. It's an infinite space to explore.
Whatever combination or transformation increases your cross-validation or test-set performance, you should use it.
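A few of those combinations sketched in pandas (the feat_1/feat_2/feat_3 column names are placeholders):

import numpy as np
import pandas as pd

# df is assumed to have numeric columns feat_1, feat_2, feat_3
df['sum_123']   = df['feat_1'] + df['feat_2'] + df['feat_3']
df['prod_12']   = df['feat_1'] * df['feat_2']
df['log_feat1'] = np.log1p(df['feat_1'])        # log transform, log(1 + x)
df['feat2_bin'] = pd.cut(df['feat_2'], bins=4)  # discretize into 4 buckets
# Keep a new feature only if it improves cross-validation / test-set performance.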

Saturday, July 30, 2016

How to Tune RandomForestRegressor

min_samples_leaf = 50 to avoid capturing noise
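In scikit-learn this is the min_samples_leaf parameter; a one-line sketch (the value 50 is just the rule of thumb quoted earlier):

from sklearn.ensemble import RandomForestRegressor

# Larger leaves smooth the predictions and make individual trees less prone to noise
rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=50, random_state=0)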

How to write a Python Module

__init__

print(sys.path)
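A minimal sketch of a package layout and the sys.path check, with placeholder names (mypackage and C:\my_code are made up):

# mypackage/
#     __init__.py     <- makes the directory importable as a package
#     helpers.py      <- a module inside the package

import sys
print(sys.path)                 # directories Python searches for imports
sys.path.append(r'C:\my_code')  # hypothetical folder containing mypackage/
# from mypackage import helpers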


Thursday, June 9, 2016

Dataframe vs. Nested List vs. Dictionary for Storing info in Python

DataFrame

DecisionTree
Param
Score
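A sketch of that layout, assuming the three items above are meant as DataFrame columns (one row per fitted model; values are made up):

import pandas as pd

results = pd.DataFrame([
    {'DecisionTree': 'tree_1', 'Param': 'max_depth=5', 'Score': 0.81},
    {'DecisionTree': 'tree_2', 'Param': 'max_depth=8', 'Score': 0.84},
])
print(results[['DecisionTree', 'Param', 'Score']])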

Saturday, June 4, 2016

use a list of values to select rows from a pandas dataframe

In [5]: df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})

In [6]: df
Out[6]:
   A  B
0  5  1
1  6  2
2  3  3
3  4  5

In [7]: df[df['A'].isin([3, 6])]
Out[7]:
   A  B
1  6  2
2  3  3

http://www.unknownerror.org/opensource/pydata/pandas/q/stackoverflow/12096252/use-a-list-of-values-to-select-rows-from-a-pandas-dataframe

Difference Between Groupby and Pivot_table for Pandas

Both pivot_table and groupby are used to aggregate your dataframe. The difference is only with regard to the shape of the result.

If you want SQL-style aggregation, groupby is the way to go.
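A small sketch of the shape difference on toy data:

import pandas as pd

df = pd.DataFrame({'city':  ['NY', 'NY', 'LA', 'LA'],
                   'year':  [2015, 2016, 2015, 2016],
                   'sales': [1, 2, 3, 4]})

# groupby: long, SQL-style result with one row per group
print(df.groupby(['city', 'year'])['sales'].sum())

# pivot_table: the same aggregation reshaped into a city x year grid
print(df.pivot_table(index='city', columns='year', values='sales', aggfunc='sum'))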

Friday, June 3, 2016

An Example of Converting SQL Aggregate Function into Python

select mykey, sum(Field1) as Field1, avg(Field1) as avg_field1, min(field2) as min_field2
from df
group by mykey

# Map each column to the aggregation function(s) to apply, then group by the key
f = {'Field1': 'sum',
     'Field2': ['max', 'mean'],
     'Field3': ['min', 'mean', 'count'],
     'Field4': 'count'}

grouped = df.groupby('mykey').agg(f)

Thursday, June 2, 2016

Passing Query Parameters in Pandas


import pandas as pd
import pyodbc

conn = pyodbc.connect(dsn="hive", autocommit=True)

beg_dt = '2016-05-01'
end_dt = '2016-06-01'

mq = """
select
  local_date,
  hotel_id        as exp_id,
  sum(xclick)     as xclick,
  sum(pclick)     as pclick,
  sum(pcost_usd)  as pcost,
  sum(trx)        as trx,
  sum(gp_usd)     as gp,
  sum(bid_gp_usd) as bid_gp
                     
 from embid.bid_unit_kpi_agg
where local_date between ? and ?
  and etl_processed_type = 'HOTEL'
  and partner_org in ('TRIPADVISOR')  
  and partner_pos = 'US'
  and brand = 'ORBITZ'
  and device_type = 'MOBILE'
                     
group by local_date, hotel_id
order by local_date, hotel_id
"""
ta_orb_mob_perf = pd.read_sql(mq, conn, params=[beg_dt, end_dt])

Access a Column of Pandas Dataframe


The Python and NumPy indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. 
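The two most common spellings for a single column, sketched on a toy frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df['A'])         # indexing operator: always works, even for names with spaces
print(df.A)            # attribute access: shorthand when the name is a valid identifier
print(df[['A', 'B']])  # a list of names selects several columns as a DataFrame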

Tuesday, May 31, 2016

Change IPython Working Directory

Properties

Folder

How to Find Out Version Number of A Package in iPython Notebook

import nltk
import sklearn

print('The nltk version is {}.'.format(nltk.__version__))
print('The scikit-learn version is {}.'.format(sklearn.__version__))

How To Install Development Version of Scikit-Learn?

Use the dev packages built by its continuous integration system:


http://windows-wheels.scikit-learn.org/

Thursday, May 26, 2016

Dummy Variables

The representation above is redundant, because to encode three values you only need two indicator columns. In general, one needs d - 1 columns for d values. This is not a big deal, but apparently some methods will complain about collinearity. The solution is to drop one of the columns. It won't result in information loss, because in the redundant scheme with d columns one of the indicators must be non-zero, so if two out of three are zeros then the third must be 1, and if one of the two is positive then the third must be zero.
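In pandas, dropping the redundant column is one keyword argument; a sketch with a made-up color column:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
# drop_first=True keeps d - 1 indicator columns for d levels, avoiding the collinearity
dummies = pd.get_dummies(df['color'], prefix='color', drop_first=True)
print(dummies)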

Factor Level Limit for R

The random forest implementation in R has a hard limit of 32 levels for a categorical variable. If you want to use randomForest in R, then you need to think about how to reduce the number of levels in categorical variables with more than 32 levels. For example, you could create dummy variables out of such categorical variables and/or get rid of infrequently occurring levels.
Alternatively, you could switch to scikit-learn in Python, which (I think?) does not have such a limit.
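One way to lump infrequent levels before handing the data to randomForest, sketched in pandas (df and the 'category' column are placeholders; the same recode can be done in R):

# df is an existing pandas DataFrame with a high-cardinality 'category' column.
# Keep the most frequent levels and collapse the rest into 'OTHER'
# so that fewer than 32 distinct values remain.
counts = df['category'].value_counts()
keep = counts.index[:31]
df['category_reduced'] = df['category'].where(df['category'].isin(keep), 'OTHER')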

Use Numpy Arrays for Sklearn

Also, don't put pandas DataFrames into scikit-learn estimators; use numpy arrays instead (by calling np.array(dataframe) or dataframe.values).

How to encode a categorical variable in sklearn?

In most well-established machine learning systems categorical variables are handled naturally. For example, in R you would use factors, and in Weka you would use nominal variables. This is not the case in scikit-learn. The decision trees implemented in sklearn use only numerical features, and these features are always interpreted as continuous numeric variables.
Thus, simply replacing the strings with a hash code should be avoided, because, being treated as a continuous numerical feature, any coding you use will induce an order which simply does not exist in your data.
One example is to code ['red','green','blue'] with [1,2,3]: this would produce weird things like 'red' being lower than 'blue', and if you average a 'red' and a 'blue' you get a 'green'. Another more subtle example might happen when you code ['low', 'medium', 'high'] with [1,2,3]. In the latter case the ordering might make sense, but subtle inconsistencies can still arise when 'medium' is not in the middle of 'low' and 'high'.
Finally, the answer to your question lies in coding the categorical feature into multiple binary features. For example you might code ['red','green','blue'] with 3 columns, one for each category, having 1 when the category matches and 0 otherwise. This is called one-hot encoding, binary encoding, one-of-k encoding or whatever. You can check the documentation for encoding categorical features and feature extraction (hashing and dicts). Obviously one-hot encoding will expand your space requirements and sometimes it hurts the performance as well.
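A sketch with scikit-learn's OneHotEncoder (recent versions accept string categories directly; the color values are illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['red'], ['green'], ['blue'], ['green']])
enc = OneHotEncoder(handle_unknown='ignore')
onehot = enc.fit_transform(colors)   # sparse matrix with one column per category
print(enc.categories_)               # the levels found for each input column
print(onehot.toarray())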

Tuesday, May 17, 2016

Get Current Work Directory

import os
os.getcwd()
os.chdir(r'C:\_HUANGT')  # raw string avoids backslash-escape issues in Windows paths

Pandas Merge (pd.merge) How to set the index and join

In [10]: dfL
Out[10]: 
           cuspin  factorL
date                      
2012-01-03   XXXX      4.5
2012-01-03   YYYY      6.2

In [11]: dfL1 = dfLeft.set_index('cuspin', append=True)

In [12]: dfR1 = dfRight.set_index('idc_id', append=True)

In [13]: dfL1
Out[13]: 
                   factorL
date       cuspin         
2012-01-03 XXXX        4.5
           YYYY        6.2

In [14]: dfL1.join(dfR1)
Out[14]: 
                   factorL  factorR
date       cuspin                  
2012-01-03 XXXX        4.5        5
           YYYY        6.2        6

Reset the indices and then merge on multiple (column-)keys:
dfLeft.reset_index(inplace=True)
dfRight.reset_index(inplace=True)
dfMerged = pd.merge(dfLeft, dfRight,
              left_on=['date', 'cusip'],
              right_on=['date', 'idc__id'],
              how='inner')
You can then reset 'date' as an index:
dfMerged.set_index('date', inplace=True)
Here's an example:
raw1 = '''
2012-01-03    XXXX      4.5
2012-01-03    YYYY      6.2
2012-01-04    XXXX      4.7
2012-01-04    YYYY      6.1
'''

raw2 = '''
2012-01-03    XYXX      45.
2012-01-03    YYYY      62.
2012-01-04    XXXX      -47.
2012-01-05    YYYY      61.
'''

import pandas as pd
from StringIO import StringIO


df1 = pd.read_table(StringIO(raw1), header=None,
                    delim_whitespace=True, parse_dates=[0], skiprows=1)
df2 = pd.read_table(StringIO(raw2), header=None,
                    delim_whitespace=True, parse_dates=[0], skiprows=1)

df1.columns = ['date', 'cusip', 'factorL']
df2.columns = ['date', 'idc__id', 'factorL']

print pd.merge(df1, df2,
         left_on=['date', 'cusip'],
         right_on=['date', 'idc__id'],
         how='inner')
which gives
                  date cusip  factorL_x idc__id  factorL_y
0  2012-01-03 00:00:00  YYYY        6.2    YYYY         62
1  2012-01-04 00:00:00  XXXX        4.7    XXXX        -47

Sunday, May 15, 2016

Why Does Matplotlib Print matplotlib.lines.Line2D object at ... Instead of Showing the Plot in an IPython Notebook?

Use %matplotlib inline

Example:

from numpy.random import randn
import matplotlib.pyplot as plt

%matplotlib inline

plt.plot(randn(50).cumsum(),'k--')

Friday, May 13, 2016

What are some good resources for learning text mining online?

If you want to learn text mining, it basically has two components: machine learning and natural language processing. I will tell you what I have used to learn it online.

Natural language processing

1. Stanford NLP (Christopher Manning): Coursera
2. Stanford: Foundations of Statistical NLP: Foundations of Statistical Natural Language Processing
3. NLP: Information Retrieval Page on washington.edu
4. Stanford: Information Retrieval and Web Search (Good): Information Retrieval and Web Search

Machine learning: While there are many good sources, I would rather tell you to be specific about algorithms. If you are going to do text mining, it is better to stick to SVM, logistic regression and random forest for supervised learning, and k-means for unsupervised learning. Some of the good sources are:

1. Machine Learning Videos by Yaser: Machine Learning Video Library
2. Mining Massive Datasets: Coursera
3. Machine Learning (Andrew Ng): Coursera

But I think you should use platform-specific resources to learn.

For Python: Natural Language Processing with Python (NLTK Book)

Others:
- Natural Language Toolkit for Python (NLTK): Natural Language Toolkit
- Natural Language Processing with Python (book): Natural Language Processing with Python (free online version: NLTK Book)
- Python Text Processing with NLTK 2.0 Cookbook (book): Python Text Processing with NLTK 2.0 Cookbook
- Python wrapper for the Stanford CoreNLP Java library: Python Package Index
- guess_language (Python library for language identification): https://bitbucket.org/spirit/gue...
- MITIE (new C/C++-based NER library from MIT with a Python API): mit-nlp/MITIE
- gensim (topic modeling library for Python): gensim: topic modelling for humans

For R: there are the tm and NLP packages, which can be used.