Friday, December 4, 2015

How to Detect Overfitting


How would you normally tell that the model is overfitting?
One useful rule of thumb is that you may be overfitting when your model's performance on its own training set is much better than on its held-out validation set or in a cross-validation setting. That's not all there is to it, though.
The blog entry I linked to describes a procedure for testing for overfitting: plot the training-set and validation-set error as a function of training-set size. If they show a stable gap at the right end of the plot, you're probably overfitting.
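A minimal sketch of that diagnostic with scikit-learn's learning_curve (the logistic-regression estimator and the synthetic dataset below are placeholders, not from the original post):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Training and validation scores for increasing training-set sizes (5-fold CV).
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(sizes, 1 - train_scores.mean(axis=1), label='training error')
plt.plot(sizes, 1 - val_scores.mean(axis=1), label='validation error')
plt.xlabel('training set size')
plt.ylabel('error')
plt.legend()
plt.show()   # a persistent gap at the right end of the plot suggests overfitting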
What is the best way to prevent overfitting (in this situation), so as to be sure whether the prediction results are good or not?
Use a held-out test set. Only do evaluation on this set when you're completely done with model selection (hyperparameter tuning); don't train on it, don't use it in (cross-)validation. The score you get on the test set is the model's final evaluation. This should show whether you've accidentally overfit the validation set(s).
[Machine learning conferences are sometimes set up like a competition, where the test set is not given to the researchers until after they've delivered their final model to the organisers. In the meantime, they can use the training set as they please, e.g. by testing models using cross-validation. Kaggle does something similar.]
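As a rough sketch of that workflow in scikit-learn (the SVC estimator and the parameter grid below are placeholders):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Tune hyperparameters with cross-validation on the training portion only.
search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# One final evaluation on the untouched test set.
print(search.score(X_test, y_test))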
If LeaveOneOut cross-validation is used, how can the model possibly overfit with good results?
Because you can tune the model as much as you want in this cross-validation setting, until it performs nearly perfectly in CV.
As an extreme example, suppose that you've implemented an estimator that is essentially a random number generator. You can keep trying random seeds until you hit a "model" that produces very low error in cross-validation, but that doesn't mean you've hit the right model. It means you've overfit to the cross-validation.

Thursday, December 3, 2015

Tackling overfitting via regularization

Overfitting is a common problem in machine learning, where a model performs well on training data but does not generalize well to unseen data (test data). If a model suffers from overfitting, we also say that the model has a high variance, which can be caused by having too many parameters that lead to a model that is too complex given the underlying data. Similarly, our model can also suffer from underfitting (high bias), which means that our model is not complex enough to capture the pattern in the training data well and therefore also suffers from low performance
on unseen data.
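As a small illustration of the regularization referred to in the title (this sketch is mine, not from the quoted text): scikit-learn's LogisticRegression applies an L2 penalty whose strength is controlled by the inverse-regularization parameter C, so smaller C means stronger regularization. The dataset below is synthetic.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare a few regularization strengths; watch the gap between train and test accuracy.
for C in [0.01, 0.1, 1.0, 100.0]:
    model = LogisticRegression(penalty='l2', C=C, max_iter=1000)
    model.fit(X_train, y_train)
    print(C, model.score(X_train, y_train), model.score(X_test, y_test))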

Why do I get different regression outputs in SAS and in Python?


How to Get the Row Count of a Pandas DataFrame?

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame(np.arange(9).reshape(3, 3))

In [4]: df
Out[4]: 
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8

In [5]: df.shape
Out[5]: (3, 3)
In [6]: len(df.index)
Out[6]: 3
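If you only want the row count, it can also be read directly (continuing the same session; both of these return plain integers):

In [7]: df.shape[0]
Out[7]: 3

In [8]: len(df)
Out[8]: 3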

Wednesday, November 25, 2015

MapReduce

A basic version of the MapReduce algorithm consists of the following steps:
1. Use a mapper function to turn each item into zero or more key-value pairs.
(Often this is called the map function, but there is already a Python function
called map and we don’t need to confuse the two.)
2. Collect together all the pairs with identical keys.
3. Use a reducer function on each collection of grouped values to produce output
values for the corresponding key.
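A minimal pure-Python sketch of those three steps, using word count as the running example (the function names here are only for illustration):

from collections import defaultdict

def mapper(document):
    # Step 1: emit a (word, 1) pair for every word in the document.
    for word in document.lower().split():
        yield (word, 1)

def reducer(key, values):
    # Step 3: collapse all values for one key into a single output pair.
    yield (key, sum(values))

def map_reduce(documents):
    # Step 2: collect the mapper output, grouping values by key.
    grouped = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            grouped[key].append(value)
    return [output
            for key, values in grouped.items()
            for output in reducer(key, values)]

print(map_reduce(["data science is fun", "data is everywhere"]))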

The most widely used MapReduce system is Hadoop, which itself merits many
books. There are various commercial and noncommercial distributions and a
huge ecosystem of Hadoop-related tools.
In order to use it, you have to set up your own cluster (or find someone to let you
use theirs), which is not necessarily a task for the faint-hearted. Hadoop mappers
and reducers are commonly written in Java, although there is a facility known as
“Hadoop streaming” that allows you to write them in other languages (including
Python).

Spark

In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100 times faster for certain applications.[1] By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.[2]
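A tiny PySpark sketch of that idea, caching an RDD in memory and querying it more than once (the file path is a placeholder):

from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

# Load the data once and keep the RDD in cluster memory.
lines = sc.textFile("hdfs:///data/logs.txt").cache()

# Repeated queries over the cached data avoid re-reading from disk.
print(lines.count())
print(lines.filter(lambda line: "error" in line).count())

sc.stop()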

Hadoop

Hadoop jobs are typically high-latency, which makes them a poor choice for
“real-time” analytics. There are various “real-time” tools built on top of Hadoop,
but there are also several alternative frameworks that are growing in popularity.
Two of the most popular are Spark and Storm.

Monday, November 23, 2015

Measuring Impurity for Decision Trees

Decision trees aim at minimizing the impurity in the data. To do that, one needs a measure that quantifies impurity. Various measures have been introduced in the literature; the most popular are listed below.

Entropy
Gini
Chi-squared
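Since the post gives only the names, here are the usual textbook definitions, with p_i the proportion of class i in a node and O_k, E_k the observed and expected counts for child node k of a candidate split:

\begin{align*}
\text{Entropy} &= -\sum_i p_i \log_2 p_i \\
\text{Gini} &= 1 - \sum_i p_i^2 \\
\chi^2 &= \sum_k \frac{(O_k - E_k)^2}{E_k}
\end{align*}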

How to Aggregate Data

Create table dclk_summ as
Select oneg_id,
       Sum(clicks_cnt) as clk,
       Sum(cost_amt)   as cost,
       Sum(hotel_margin + air_margin) as ha_margin,
       Sum(cost_amt) / Sum(tot_margin) as SR
From dclk_data
Group by oneg_id;
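A rough pandas equivalent, assuming the same data is available as a DataFrame named dclk_data with the column names used above:

import pandas as pd

# dclk_data is assumed to already be loaded as a DataFrame.
# Sum the needed columns per oneg_id, then derive the combined metrics.
sums = dclk_data.groupby('oneg_id')[
    ['clicks_cnt', 'cost_amt', 'hotel_margin', 'air_margin', 'tot_margin']].sum()

dclk_summ = pd.DataFrame({
    'clk': sums['clicks_cnt'],
    'cost': sums['cost_amt'],
    'ha_margin': sums['hotel_margin'] + sums['air_margin'],
    'SR': sums['cost_amt'] / sums['tot_margin'],
}).reset_index()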

How to Delete a Cell in IPython

In the newer IPython notebook, a cell has two states: when it has a green selection rectangle around it, you can edit what's inside; when it has a grey rectangle around it, you operate on the cell itself (copy/paste/delete). Enter/Return makes it go green; Esc makes it go grey. When it is grey, pressing 'dd' deletes the cell.

How to Save a DataFrame to Disk

import pandas as pd

a = pd.DataFrame({'A':[0,1,0,1,0],'B':[True, True, False, False, False]})
print(a)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

a.to_pickle('my_file.pkl')

b = pd.read_pickle('my_file.pkl')
print(b)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False


Saturday, November 21, 2015

IPython Notebook Can't Display Graphs

If your IPython notebook can't display graphs inline, all you need to do is add the following magic command:

%matplotlib inline
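For example, in a notebook cell (the plotted values are arbitrary):

%matplotlib inline
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [2, 4, 9])   # the figure now renders inline in the notebook
plt.show()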

Thursday, September 24, 2015

Free eBooks On Data Mining, Statistics and Predictive Modeling

1. Theory and Applications for Advanced Text Mining
by Shigeaki Sakurai (ed.) - InTech, 2012
Text mining techniques are actively studied in order to extract knowledge from data. This book introduces advanced text mining techniques, ranging from relation extraction to methods for under-resourced or less-resourced languages.

Wednesday, August 26, 2015

Data Frame vs Data Table

While this is a broad question, the distinction can be confusing for someone new to R and can easily get lost.
All data.tables are also data.frames. Loosely speaking, you can think of data.tables as data.frames with extra features.
data.frame is part of base R.
data.table is a package that extends data.frames. Two of its most notable features are speed and cleaner syntax.
However, that syntactic sugar differs from the standard R syntax for data.frame, while being hard for the untrained eye to distinguish at a glance. Therefore, if you read a code snippet with no other context indicating that you are working with data.tables and try to apply the code to a data.frame, it may fail or produce unexpected results. (A clear giveaway that you are working with data.tables, besides the library/require call, is the presence of the assignment operator :=, which is unique to data.table.)
With all that being said, I think it is hard to fully appreciate the beauty of data.table without experiencing the shortcomings of data.frame (for example, see the first 3 bullet points of @eddi's answer). In other words, I would very much suggest learning how to work with and manipulate data.frames first, then moving on to data.tables.