Tuesday, May 31, 2016

Change IPython Working Directory

On Windows, right-click the IPython Notebook shortcut, open Properties, and change the "Start in" folder to the directory you want the notebook to start in.
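
Alternatively, the startup directory can be set in the notebook's config file. A minimal sketch, assuming the Jupyter/IPython Notebook config (c.NotebookApp.notebook_dir is the standard option; the path is just an example):

# Generate the config first with: jupyter notebook --generate-config
# then edit jupyter_notebook_config.py
# (on older IPython it is ipython_notebook_config.py)
c.NotebookApp.notebook_dir = r'C:\_HUANGT'  # example path; use your own folder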

How to Find Out the Version Number of a Package in IPython Notebook

import nltk
import sklearn

print('The nltk version is {}.'.format(nltk.__version__))
print('The scikit-learn version is {}.'.format(sklearn.__version__))

How To Install Development Version of Scikit-Learn?

Use the dev packages built by its continuous integration system:


http://windows-wheels.scikit-learn.org/
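
A hedged example of installing from that index with pip (--pre and --extra-index-url are standard pip flags; the exact index layout may have changed since):

pip install --pre --extra-index-url http://windows-wheels.scikit-learn.org/ scikit-learn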

Thursday, May 26, 2016

Dummy Variables

The representation above is redundant: to encode three values you need only two indicator columns, and in general d - 1 columns for d values. This is not a big deal in itself, but some methods will complain about collinearity. The solution is to drop one of the columns. No information is lost, because in the redundant scheme with d columns exactly one indicator is non-zero: if two out of three are zero, the third must be 1, and if one of the remaining two is 1, the dropped one must be 0.
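
A minimal sketch with pandas (get_dummies is the standard helper; drop_first=True drops one indicator per variable to remove the redundancy):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
print(pd.get_dummies(df['color']))                   # 3 indicator columns (redundant)
print(pd.get_dummies(df['color'], drop_first=True))  # 2 columns, no information loss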

Factor Level Limit for R

The Random Forest implementation in R (the randomForest package) has a hard limit of 32 levels for a categorical variable. If you want to use randomForest in R, you need to think about how to reduce the number of levels in categorical variables with more than 32 levels. For example, you could create dummy variables out of such categorical variables and/or get rid of infrequently occurring levels, as sketched below.
Alternatively, you could switch to scikit-learn in Python, which (I think) does not have such a limit.
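
A hedged pandas sketch of the second idea, collapsing infrequent levels into an 'Other' bucket (the column name and threshold are made up for illustration):

import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'NYC', 'LA', 'SF', 'Austin', 'Boise']})
counts = df['city'].value_counts()
rare = counts[counts < 2].index                        # levels seen fewer than 2 times
df['city'] = df['city'].where(~df['city'].isin(rare), 'Other')
print(df['city'].value_counts())                       # NYC kept; rare cities -> 'Other'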

Use Numpy Arrays for Sklearn

Don't put pandas DataFrames into scikit-learn estimators; use NumPy arrays instead (by calling np.array(dataframe) or dataframe.values).
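
For example (a minimal sketch with made-up column names):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [1.0, 2.0], 'x2': [3.0, 4.0], 'y': [0, 1]})
X = df[['x1', 'x2']].values  # plain 2-D NumPy array for the features
y = np.array(df['y'])        # plain 1-D NumPy array for the target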

How to encode a categorical variable in sklearn?

In most well-established machine learning systems, categorical variables are handled naturally. For example, in R you would use factors, and in Weka you would use nominal variables. This is not the case in scikit-learn. The decision trees implemented in sklearn use only numerical features, and these features are always interpreted as continuous numeric variables.
Thus, simply replacing the strings with a hash code should be avoided, because, being treated as a continuous numerical feature, any coding you use will induce an order that simply does not exist in your data.
For example, coding ['red','green','blue'] as [1,2,3] would produce weird artifacts: 'red' is lower than 'blue', and if you average a 'red' and a 'blue' you get a 'green'. A subtler example can arise when you code ['low', 'medium', 'high'] as [1,2,3]. In the latter case the ordering may well make sense, yet subtle inconsistencies can still appear when 'medium' is not actually midway between 'low' and 'high'.
Finally, the answer to your question lies in coding the categorical feature into multiple binary features. For example, you might code ['red','green','blue'] with 3 columns, one for each category, set to 1 when the category matches and 0 otherwise. This is called one-hot encoding, binary encoding, or one-of-k encoding. You can check the documentation on encoding categorical features and feature extraction - hashing and dicts. Obviously, one-hot encoding will expand your space requirements, and sometimes it hurts performance as well.
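
A minimal sketch with scikit-learn's DictVectorizer, which one-hot encodes string-valued features supplied as dicts (the feature name 'color' is made up):

from sklearn.feature_extraction import DictVectorizer

data = [{'color': 'red'}, {'color': 'green'}, {'color': 'blue'}]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(data)     # one 0/1 column per category
print(vec.get_feature_names())  # ['color=blue', 'color=green', 'color=red']
print(X)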

Tuesday, May 17, 2016

Get Current Working Directory

import os
os.getcwd()               # show the current working directory (in IPython)
os.chdir(r'C:\_HUANGT')   # use a raw string so '\' is not treated as an escape

Pandas Merge (pd.merge): How to Set the Index and Join

In [10]: dfLeft
Out[10]: 
            cusip  factorL
date                      
2012-01-03   XXXX      4.5
2012-01-03   YYYY      6.2

In [11]: dfL1 = dfLeft.set_index('cusip', append=True)

In [12]: dfR1 = dfRight.set_index('idc_id', append=True)

In [13]: dfL1
Out[13]: 
                  factorL
date       cusip         
2012-01-03 XXXX       4.5
           YYYY       6.2

In [14]: dfL1.join(dfR1)
Out[14]: 
                  factorL  factorR
date       cusip                  
2012-01-03 XXXX       4.5        5
           YYYY       6.2        6

Reset the indices and then merge on multiple (column-)keys:
dfLeft.reset_index(inplace=True)
dfRight.reset_index(inplace=True)
dfMerged = pd.merge(dfLeft, dfRight,
              left_on=['date', 'cusip'],
              right_on=['date', 'idc__id'],
              how='inner')
You can then set 'date' as the index again:
dfMerged.set_index('date', inplace=True)
Here's an example:
raw1 = '''
2012-01-03    XXXX      4.5
2012-01-03    YYYY      6.2
2012-01-04    XXXX      4.7
2012-01-04    YYYY      6.1
'''

raw2 = '''
2012-01-03    XYXX      45.
2012-01-03    YYYY      62.
2012-01-04    XXXX      -47.
2012-01-05    YYYY      61.
'''

import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO


df1 = pd.read_table(StringIO(raw1), header=None,
                    delim_whitespace=True, parse_dates=[0], skiprows=1)
df2 = pd.read_table(StringIO(raw2), header=None,
                    delim_whitespace=True, parse_dates=[0], skiprows=1)

df1.columns = ['date', 'cusip', 'factorL']
df2.columns = ['date', 'idc__id', 'factorL']

print(pd.merge(df1, df2,
               left_on=['date', 'cusip'],
               right_on=['date', 'idc__id'],
               how='inner'))
which gives
                  date cusip  factorL_x idc__id  factorL_y
0  2012-01-03 00:00:00  YYYY        6.2    YYYY         62
1  2012-01-04 00:00:00  XXXX        4.7    XXXX        -47

Sunday, May 15, 2016

Why Does Matplotlib Show matplotlib.lines.Line2D object at ... in IPython Notebook?

That output is not an error; it is just the printed repr of the Line2D objects that plt.plot returns. Use %matplotlib inline so the notebook renders the figure itself.

Example:

from numpy.random import randn
import matplotlib.pyplot as plt

%matplotlib inline

plt.plot(randn(50).cumsum(),'k--')

Friday, May 13, 2016

What are some good resources for learning text mining online?

If you want to learn text mining, it basically has two components: machine learning and natural language processing. I will tell you what I have used to learn it online.

Natural language processing

1. Stanford NLP by Christopher Manning: Coursera
2. Stanford: Foundations of Statistical NLP: Foundations of Statistical Natural Language Processing
3. NLP: Information Retrieval: Page on washington.edu
4. Stanford: Information Retrieval and Web Search (good): Information Retrieval and Web Search

Machine Learning: While there are many good sources, I would rather tell you to be specific about algorithms. If you are going to do text mining, it is better to stick to SVM, logistic regression, and random forests for supervised learning, and k-means for unsupervised learning. Some good sources are:

1. Machine Learning videos by Yaser Abu-Mostafa: Machine Learning Video Library
2. Mining Massive Datasets: Coursera
3. Machine Learning (Andrew Ng): Coursera

But I think you should use platform-specific resources to learn.

For Python: Natural Language Processing with Python (the NLTK Book)

Others:
- Natural Language Toolkit for Python (NLTK): Natural Language Toolkit
- Natural Language Processing with Python (book): Natural Language Processing with Python (free online version: NLTK Book)
- Python Text Processing with NLTK 2.0 Cookbook (book): Python Text Processing with NLTK 2.0 Cookbook
- Python wrapper for the Stanford CoreNLP Java library: Python Package Index
- guess_language (Python library for language identification): https://bitbucket.org/spirit/gue...
- MITIE (new C/C++-based NER library from MIT with a Python API): mit-nlp/MITIE
- gensim (topic modeling library for Python): gensim: topic modelling for humans

For R: the tm and NLP packages can be used.

Tuesday, May 10, 2016

Why does the regularization term prevent overfitting in machine learning?

Is there any statistical interpretation for the regularization term?
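
A sketch of the standard statistical reading, for the L2 (ridge) case: the penalty keeps the weights small, so the fitted function cannot bend sharply to chase noise in the training data. Equivalently, L2 regularization is MAP estimation under a Gaussian prior on the weights:

\min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2
\quad\Longleftrightarrow\quad
\max_w \; p(y \mid X, w)\, p(w), \qquad w \sim \mathcal{N}(0, \tau^2 I), \quad \lambda = \sigma^2 / \tau^2

where \sigma^2 is the noise variance. A larger \lambda corresponds to a tighter prior, i.e. a stronger prior belief that the true weights are small.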