Data Science Guy: November 2015

Wednesday, November 25, 2015

MapReduce

A basic version of the MapReduce algorithm consists of the following steps:
1. Use a mapper function to turn each item into zero or more key-value pairs.
(Often this is called the map function, but there is already a Python function
called map and we don’t need to confuse the two.)
2. Collect together all the pairs with identical keys.
3. Use a reducer function on each collection of grouped values to produce output
values for the corresponding key.
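
The three steps above can be sketched in plain Python with a word-count example. This is a minimal single-machine illustration, not how a distributed system does it; the names wc_mapper, wc_reducer, map_reduce and the sample documents are made up for this sketch.

```python
from collections import defaultdict

def wc_mapper(document):
    """Step 1: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)

def wc_reducer(word, counts):
    """Step 3: sum the counts collected for one word."""
    yield (word, sum(counts))

def map_reduce(inputs, mapper, reducer):
    """Run the mapper over the inputs, group values by key, reduce each group."""
    collector = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            collector[key].append(value)   # step 2: collect identical keys
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]

# Hypothetical sample input
documents = ["data science", "big data", "science fiction"]
word_counts = dict(map_reduce(documents, wc_mapper, wc_reducer))
print(word_counts)
```

Because map_reduce takes the mapper and reducer as arguments, the same grouping step can be reused for other jobs by swapping in different functions.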

The most widely used MapReduce system is Hadoop, which itself merits many
books. There are various commercial and noncommercial distributions and a
huge ecosystem of Hadoop-related tools.
In order to use it, you have to set up your own cluster (or find someone to let you
use theirs), which is not necessarily a task for the faint-hearted. Hadoop mappers
and reducers are commonly written in Java, although there is a facility known as
“Hadoop streaming” that allows you to write them in other languages (including
Python).
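
As a rough sketch of what Hadoop streaming code looks like in Python: the mapper and the reducer exchange tab-separated "key<TAB>value" text lines, and Hadoop sorts the mapper output by key before the reducer sees it. In a real job these would be two separate scripts reading sys.stdin and writing to standard output; here they are written as functions over lists of lines (map_lines and reduce_lines are names invented for this example) and the sort is simulated with sorted() so the sketch runs locally.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.lower().split():
            yield "%s\t1" % word

def reduce_lines(sorted_lines):
    """Reducer: input lines arrive sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield "%s\t%d" % (word, total)

# Hypothetical input; sorted() stands in for Hadoop's shuffle-and-sort phase
lines = ["big data", "data science"]
mapped = sorted(map_lines(lines))
reduced = list(reduce_lines(mapped))
print(reduced)
```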

Spark

In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100 times faster for certain applications.[1] By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.[2]

Hadoop

Hadoop jobs are typically high-latency, which makes them a poor choice for
“real-time” analytics. There are various “real-time” tools built on top of Hadoop,
but there are also several alternative frameworks that are growing in popularity.
Two of the most popular are Spark and Storm.

Monday, November 23, 2015

Measuring Impurity for Decision Trees

Decision trees aim to minimize the impurity in the data. To do so, one needs a measure that quantifies impurity. Various measures have been introduced in the literature; the most popular are:

Entropy:
Gini:
Chi-squared:
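
The formulas for these measures are not shown above. As a hedged sketch, the snippet below uses the standard textbook definitions of the first two, which are assumed here rather than taken from this post: entropy as the sum of -p * log2(p) over the class proportions p, and Gini impurity as 1 minus the sum of p squared. Chi-squared is left out because it is computed from the contingency table of a candidate split rather than from a single node's class proportions.

```python
import math

def entropy(class_probabilities):
    """Entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in class_probabilities if p > 0)

def gini(class_probabilities):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1 - sum(p * p for p in class_probabilities)

# A 50/50 node is maximally impure for two classes
print(entropy([0.5, 0.5]))  # 1.0
print(gini([0.5, 0.5]))     # 0.5
```

Both measures are 0 for a pure node (all observations in one class) and reach their maximum when the classes are evenly mixed.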

How to Aggregate Data

Create table dclk_summ as
Select oneg_id,
       Sum(clicks_cnt) as clk,
       Sum(cost_amt) as cost,
       Sum(hotel_margin + air_margin) as ha_margin,
       Sum(cost_amt) / Sum(tot_margin) as SR
From dclk_data
Group by oneg_id;
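
For comparison, roughly the same aggregation can be done in pandas. This is a sketch under assumptions: the column names are taken from the query above, but the rows in dclk_data are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the query
dclk_data = pd.DataFrame({
    'oneg_id': [1, 1, 2],
    'clicks_cnt': [10, 5, 7],
    'cost_amt': [2.0, 1.0, 3.0],
    'hotel_margin': [4.0, 1.0, 2.0],
    'air_margin': [1.0, 0.0, 4.0],
    'tot_margin': [5.0, 1.0, 6.0],
})

# Add the margins row by row first, mirroring Sum(hotel_margin + air_margin)
data = dclk_data.assign(
    ha_margin=dclk_data['hotel_margin'] + dclk_data['air_margin'])

# Group by oneg_id and sum the columns the query uses
cols = ['clicks_cnt', 'cost_amt', 'ha_margin', 'tot_margin']
sums = data.groupby('oneg_id')[cols].sum()

dclk_summ = pd.DataFrame({
    'clk': sums['clicks_cnt'],
    'cost': sums['cost_amt'],
    'ha_margin': sums['ha_margin'],
    'SR': sums['cost_amt'] / sums['tot_margin'],  # ratio of the two sums
}).reset_index()
print(dclk_summ)
```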

How to Delete A Cell in IPython

In the newer IPython notebook, a cell has two states. When it has a green selection rectangle around it, you can edit what's inside (edit mode); when it has a grey rectangle around it, you act on the cell itself (copy/paste/delete), which is command mode. Enter/Return makes it go green, and Esc makes it go grey. When it is grey, pressing 'dd' deletes it.

How to Save a DataFrame to Disk

import pandas as pd

a = pd.DataFrame({'A':[0,1,0,1,0],'B':[True, True, False, False, False]})
print(a)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

a.to_pickle('my_file.pkl')

b = pd.read_pickle('my_file.pkl')
print(b)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

Sunday, November 22, 2015

How To Read CSV File into Python

import pandas as pd

ctd = pd.read_csv('rpc_mod/ctix_mod_data5.csv')

How to Read Query Output Into A Data Frame
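
One possible way to do this, as a minimal sketch: pandas can run a query over an open database connection with pd.read_sql_query and return the result as a DataFrame. The example assumes an in-memory SQLite database, and the table name clicks and its rows are invented for illustration; for another database, the connection object would come from that database's own driver.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database with a small sample table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE clicks (oneg_id INTEGER, clicks_cnt INTEGER)')
conn.executemany('INSERT INTO clicks VALUES (?, ?)',
                 [(1, 10), (1, 5), (2, 7)])

query = ('SELECT oneg_id, SUM(clicks_cnt) AS clk '
         'FROM clicks GROUP BY oneg_id ORDER BY oneg_id')

# Run the query and load its output into a DataFrame
df = pd.read_sql_query(query, conn)
conn.close()
print(df)
```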

Saturday, November 21, 2015

IPython Notebook Can't Display Graphs

If your IPython notebook can't display graphs, run the following magic command in a cell before plotting:

%matplotlib inline