Thursday, April 28, 2016

Terminology For Air Business

Solution – A collection of Slices

Slice – A collection of segments making up one directional journey
Round Trip is 2 Slices
One way is 1 Slice

Segment
The smallest unit with a single carrier and flight number

Leg
Even smaller than a segment: a single takeoff and landing

AA-123 ORD-MCI – leg 1, segment 1, slice 1
AA-123 MCI-LAS – leg 2, segment 1, slice 1
AA-456 LAS-SFO – leg 3, segment 2, slice 1
AA-789 SFO-ORD – leg 4, segment 3, slice 2
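A hypothetical sketch of how the example itinerary above could be represented as nested R lists (solution -> slices -> segments -> legs); the names and structure are just for illustration:

solution <- list(
  slice1 = list(
    segment1 = list(carrier = "AA", flight = 123, legs = c("ORD-MCI", "MCI-LAS")),
    segment2 = list(carrier = "AA", flight = 456, legs = c("LAS-SFO"))
  ),
  slice2 = list(
    segment3 = list(carrier = "AA", flight = 789, legs = c("SFO-ORD"))
  )
)
length(solution)   # 2 slices, so this solution is a round trip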


Online – Single carrier throughout the solution
AA-123
AA-456

Interline – 2 or more carriers in a solution
AA-123
UA-111

Codeshare – one carrier “markets” a flight while a different carrier “operates” it

Plating carrier – the carrier that pays us commission. Only one per solution.


Concatenate Two Data Frame Columns In R

paste(df$c1, df$c2, sep = " ")

Wednesday, April 27, 2016

When Random Forest Will Not Perform Better

When there are only a few significant variables, a random forest will not necessarily perform better than a linear model.
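A minimal sketch of that claim, assuming the randomForest package is installed and using simulated data where only one of ten predictors matters:

set.seed(1)
n <- 1000
x <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
y <- 2 * x$V1 + rnorm(n)                    # only V1 is significant
train <- 1:800
test  <- 801:n

library(randomForest)
lm_fit <- lm(y ~ ., data = cbind(y = y[train], x[train, ]))
rf_fit <- randomForest(x[train, ], y[train])

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
rmse(predict(lm_fit, x[test, ]), y[test])   # linear model: close to the noise level
rmse(predict(rf_fit, x[test, ]), y[test])   # random forest: typically no better here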

Monday, April 25, 2016

Get The Length Of Each Value In A Column

nchar takes a character vector as an argument and returns a vector whose elements contain the sizes of the corresponding elements of x.
nzchar is a fast way to find out if elements of a character vector are non-empty strings.
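
For example, on a hypothetical character column:

df <- data.frame(code = c("FLT.ABC123", "XY", ""), stringsAsFactors = FALSE)
nchar(df$code)    # 10  2  0
nzchar(df$code)   # TRUE TRUE FALSE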

Why Does Ifelse Return Numeric Values?

ock$ref_id2 <- ifelse(substr(ock$ref_id, 1, 4) == "FLT.", substr(ock$ref_id, 5, 11), ock$ref_id)

Because ref_id is a factor: ifelse() fills the "no" branch with the factor's underlying integer codes rather than its labels. Convert the column to character first.
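A minimal illustration of the quirk with a made-up ref vector:

ref <- factor(c("FLT.1234567", "ABC"))
ifelse(substr(ref, 1, 4) == "FLT.", substr(ref, 5, 11), ref)
# "1234567" "1"   -- the "1" is ABC's integer code, not the label

# Converting to character first keeps the original values:
ref2 <- as.character(ref)
ifelse(substr(ref2, 1, 4) == "FLT.", substr(ref2, 5, 11), ref2)
# "1234567" "ABC"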

Saturday, April 23, 2016

How To Read Variable Length String Correctly In SAS


data oc;
infile "&file_dir./orb_air_clk_2016.csv" dlm=',' firstobs=2;
input local_date $10. mktg_code : $100. ref_id : $50. clicks spend;
run;


The : (colon) format modifier enables you to use list input but also to specify an informat after a variable name, whether character or numeric. SAS reads until it encounters a blank column, the defined length of the variable (character only), or the end of the data line, whichever comes first.

Write A SAS Dataset to A SQL Script


data _null_;
*file "/dfs/public/AA/thuang/sql/ta_orb_dsk_rpc.sql";
file "/dfs/marketing/thuang/sql/ta_orb_dsk_rpc.sql";
set rpc.ta_orb_dsk_rpc_&fid. end=last;
length sql_stmt $1000.;

if _n_ = 1 then do;

put 'DROP TABLE IF EXISTS sandbox.tch_ta_orb_dsk_rpc10;';
put 'CREATE TABLE sandbox.tch_ta_orb_dsk_rpc10 (';
put 'oneg_id INT,rpc FLOAT,new_ind INT, update_dt DATE)';
put 'DISTRIBUTED BY (oneg_id, rpc);';
put 'GRANT ALL PRIVILEGES ON sandbox.tch_ta_orb_dsk_rpc10 TO PUBLIC;';
put 'INSERT INTO sandbox.tch_ta_orb_dsk_rpc10 VALUES';

end;


if last then
sql_stmt=  compress("("||oneg_id||","||adj_rpc||","||new_ind||",'"||&update_dt.||"');");
else sql_stmt=  compress("("||oneg_id||","||adj_rpc||","||new_ind||",'"||&update_dt.||"'),");

put sql_stmt;

run;

Friday, April 22, 2016

How to trim leading and trailing whitespace in R?

Probably the best way is to handle the trailing whitespaces when you read your data file. If you use read.csv or read.table you can set the parameter strip.white=TRUE.

If you want to clean strings afterwards you could use one of these functions:

# returns string w/o leading whitespace
trim.leading <- function (x)  sub("^\\s+", "", x)

# returns string w/o trailing whitespace
trim.trailing <- function (x) sub("\\s+$", "", x)

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
To use one of these functions on myDummy$country:

 myDummy$country <- trim(myDummy$country)
To 'show' the whitespace you could use:

 paste(myDummy$country)
which will show you the strings surrounded by quotation marks (") making whitespaces easier to spot.


trim {gdata} R Documentation

Remove leading and trailing spaces from character strings

Description

Remove leading and trailing spaces from character strings and other related objects.

Usage

trim(s, recode.factor=TRUE, ...)

Convert every column of a data frame to character (two ways):

bob <- data.frame(lapply(bob, as.character), stringsAsFactors = FALSE)

bob[] <- lapply(bob, as.character)

Level Sets Of Factors Are Different

 a<-ad[trim(ad$airpt_code)!=trim(ad$airpt_metro_code),]
Error in Ops.factor(trim(ad$airpt_code), trim(ad$airpt_metro_code)) :
  level sets of factors are different
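
A small sketch of the failure and the fix, using made-up columns and the base-R trim() defined above: gdata::trim() keeps factors as factors, so comparing two factors with different level sets errors, while converting to character first works.

ad <- data.frame(airpt_code       = c("ORD ", " MCI"),
                 airpt_metro_code = c("CHI", "MCI"))   # character columns become factors by default in this era of R
a  <- ad[trim(as.character(ad$airpt_code)) !=
         trim(as.character(ad$airpt_metro_code)), ]
a   # only the ORD/CHI row survives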

Find Columns Which Have A Missing Value

sapply(oa, function(x) sum(is.na(x)))

apply(is.na(oa),2,sum)


sapply(oa,function(x) all(is.na(x)))

colnames(oa)[colSums(is.na(oa))>0]
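
A quick check of those one-liners on a hypothetical data frame:

oa <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"), c = NA)
sapply(oa, function(x) sum(is.na(x)))   # a: 1, b: 0, c: 3
colnames(oa)[colSums(is.na(oa)) > 0]    # "a" "c"
sapply(oa, function(x) all(is.na(x)))   # only c is entirely missing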


Tuesday, April 19, 2016

Concatenate Two Data Frames

cbind(df1, df2)

The cbind() function combines vectors, matrices, or data frames by columns.


cbind(x1,x2,...)
x1,x2:vector, matrix, data frames


Read in the data from the file:
>x <- read.csv("data1.csv",header=T,sep=",")
>x2 <- read.csv("data2.csv",header=T,sep=",")

>x3 <- cbind(x,x2)
>x3
  Subtype Gender Expression Age     City
1       A      m      -0.54  32 New York
2       A      f      -0.80  21  Houston
3       B      f      -1.03  34  Seattle
4       C      m      -0.41  67  Houston

Monday, April 18, 2016

Aggregate Multiple Columns in R

aggregate(cbind(x1, x2)~year+month, data=df1, sum, na.rm=TRUE)
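
For instance, with a hypothetical df1 of monthly metrics:

df1 <- data.frame(year  = 2016,
                  month = c(1, 1, 2, 2),
                  x1    = c(10, 3, 5, 7),
                  x2    = c(1, 2, 3, 4))
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum)
#   year month x1 x2
# 1 2016     1 13  3
# 2 2016     2 12  7

Note that the formula interface drops rows with an NA in any of the listed variables before the function is applied, so na.rm=TRUE mainly matters with the non-formula interface (or with na.action = na.pass).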

Saturday, April 16, 2016

Machine Learning Course

http://pages.uoregon.edu/aarong/teaching/G4075_Outline/node1.html

Friday, April 15, 2016

SQL Interview Questions

I ask this: if you have a table with city and population columns, how do you find the 7th most populated city?
Did you get a chance to use HAVING anywhere? If so, how did you use it?

SELECT * FROM (
  SELECT EmployeeID, Salary, RANK() OVER (ORDER BY Salary DESC) AS ranking
  FROM Employee
) ranked
WHERE ranking = N;   -- e.g. N = 7 for the 7th highest

SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;

SELECT depname, empno, salary, rank() OVER (PARTITION BY depname ORDER BY salary DESC) FROM empsalary;



CASE WHEN

CAST


Pairs of photos, each saved more than 100 times:


select a.pi as pi1,
       b.pi as pi2
from
  (select pi, count(*) as cnt
   from save
   group by pi
   having count(*) > 100) a,
  (select pi, count(*) as cnt
   from save
   group by pi
   having count(*) > 100) b
where a.pi < b.pi







Tuesday, April 12, 2016

Study Note For ISLR


Data Scientist Interview Skill List

Overfitting
Modeling rare events
VIF
Lasso & Ridge Regression
Multiple Imputation
Cross validation
Bootstrapping
ROC Curve
Pruning
Multinomial Logistic Regression
k-NN
k-means
GBM
glmnet
inner join
HAVING clause
DoE (Design of Experiments)
Spark
Central Limit Theorem
Probit
Scala
Hypothesis testing
ggplot2
Java
NumPy, pandas, SciPy
Developing APIs, web services
Developing UIs (HTML, PMP, web2py)
Hadoop, Hive
PMML
XML

Wednesday, April 6, 2016

What are the advantages of logistic regression over decision trees?


Are there any cases where it's better to use logistic regression instead of decision trees?
Jack Rae, Google DeepMind Research Engineer:
The answer to "Should I ever use learning algorithm (a) over learning algorithm (b)" will pretty much always be yes. Different learning algorithms make different assumptions about the data and have different rates of convergence. The one which works best, i.e. minimizes some cost function of interest (cross validation for example) will be the one that makes assumptions that are consistent with the data and has sufficiently converged to its error rate.

Put in the context of decision trees vs. logistic regression, what are the assumptions made?

Decision trees assume that our decision boundaries are parallel to the axes: for example, if we have two features (x1, x2), a tree can only create rules such as x1 >= 4.5 or x2 >= 6.5, which we can visualize as lines parallel to the axes. (The answer illustrates this with a diagram, not reproduced here.)
So decision trees chop up the feature space into rectangles (or in higher dimensions, hyper-rectangles). There can be many partitions made and so decision trees naturally scale up to creating more complex (say, higher VC) functions - which can be a problem with over-fitting.

What assumptions does logistic regression make? Despite its probabilistic framework, all that logistic regression assumes is that there is one smooth linear decision boundary. It finds that boundary by assuming that P(Y|X) has some particular form, such as the inverse logit function applied to a weighted sum of our features, and then finds the weights by a maximum-likelihood approach.

However, people get too caught up on that... The decision boundary it creates is a linear* decision boundary that can point in any direction. So if you have data where the decision boundary is not parallel to the axes, logistic regression picks it out pretty well, whereas a decision tree will have problems.
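
A rough sketch of this point on simulated data (the data and package choice are illustrative; it assumes the rpart package is available). The true boundary is the diagonal x1 + x2 > 0, which logistic regression captures with two coefficients while a tree must approximate it with axis-parallel splits.

library(rpart)

set.seed(42)
n <- 2000
d <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
d$y <- factor(as.integer(d$x1 + d$x2 > 0))
train <- 1:1000
test  <- 1001:n

glm_fit  <- glm(y ~ x1 + x2, data = d[train, ], family = binomial)
tree_fit <- rpart(y ~ x1 + x2, data = d[train, ], method = "class")

glm_pred <- as.integer(predict(glm_fit, d[test, ], type = "response") > 0.5)
mean(glm_pred == (d$y[test] == "1"))                              # essentially perfect
mean(predict(tree_fit, d[test, ], type = "class") == d$y[test])   # typically a bit lower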

So in conclusion,
  • Both algorithms are really fast. There isn't much to distinguish them in terms of run-time.
  • Logistic regression will work better if there's a single decision boundary, not necessarily parallel to the axis.
  • Decision trees can be applied to situations where there's not just one underlying decision boundary, but many, and will work best if the class labels roughly lie in hyper-rectangular regions.
  • Logistic regression is intrinsically simple, it has low variance and so is less prone to over-fitting. Decision trees can be scaled up to be very complex, and are more liable to over-fit. Pruning is applied to avoid this.

Maybe you'll be left thinking, "I wish decision trees didn't have to create rules that are parallel to the axis." This motivates support vector machines.

Footnotes:
* linear in your covariates. If you include non-linear transformations or interactions then it will be non-linear in the space of those original covariates.
Murthy Kolluru, President at INSOFE:
Both logistic regression and decision trees are used for classification purposes, such as:
1. Predicting whether a particular user will click an ad shown on a webpage.
2. Whether a customer will take a loan from a bank or not.
3. Identifying whether a document was written by Author A or Author B.
Decision trees generate their output as rules along with metrics such as Support, Confidence and Lift, while logistic regression is based on calculating the odds of the outcome as the ratio of the probability of having the outcome to the probability of not having it.
To see this, look at the outputs the two algorithms would generate for case study 1 above. A decision tree outputs rules like: "If the ad is shown on the right side of the first page, the user will click the ad" (Support 0.9, Confidence 0.95, Lift 3.345), while logistic regression generates an odds ratio, e.g. the odds of the user clicking the ad is 0.785.
1. Assumptions
Decision trees assume the splits are axis-parallel; the tree becomes more complex as the number of features increases, and multiple decision boundaries are possible. Logistic regression, on the other hand, assumes there is only one smooth decision boundary.
2. How are the decision boundaries constructed?
Below are the two basic steps for decision trees and logistic regression.
A. Decision tree
a. Selecting the best attribute/feature on which to split the set at each branch, and
b. Deciding whether each branch is adequately justified. Different decision-tree programs differ in how these are accomplished.
B. Logistic regression
a. Stepwise selection of the variables, with the corresponding coefficients computed.
b. The maximum-likelihood ratio is used to determine the statistical significance of the variables that will be part of the logistic regression equation.

3. Limitations
Complex decision trees may overfit the data and become unstable; you can prune the tree to address this. For logistic regression, L1 regularization can address the problem of unreasonable coefficients on the independent variables.
Vijay Krishnan, Founder & CTO, RoverApp.com:
Let me add to Jack's points.
Logistic regression is better than decision trees particularly when you are dealing with very high-dimensional data. Text classification is a classic example: you might have 100,000 training documents and see around 500,000 distinct words (features).

In such a case, a simple rule like learning a linear hyperplane is strictly better because of the curse of dimensionality, since decision trees have far too many degrees of freedom and you will almost certainly overfit. One could still try to use a decision tree on text data by doing feature selection. However, you will lose a lot of valuable information for text classification by merely picking a small, reduced subset of features. When learning models with high-dimensional data, it is very easy for variance-based errors to get out of hand, and simple models with higher bias are a better bet.

Decision trees are likely to be a better fit for problems where you have a small number of features (say < 100) and plenty of training examples (say > 100,000). In such a case, your data permits you to learn more complex decision rules. Here variance is a smaller concern, and one would likely be better off opting for decision trees with their high expressive power and low bias.

I have written in more detail about what learning algorithms would be suitable for different kinds of data here:
Vijay Krishnan's answer to Big Data: What algorithms do data scientists actually use at work?
Fred Richardson, Applied Machine Learning:
With logistic regression it's also important to consider regularization. With very high-dimensional (and possibly sparse) features, L1 or L2 regularization is critical to avoid overfitting. L1 regularization leads to a sparse model, which has been shown to perform as well as an SVM on some text-processing tasks.

Straying off topic here, but as far as equivalence goes, kernel methods like the SVM can't always be realized with a traditional logistic regression.  Some kernel functions like the radial basis kernels lead to an infinite expansion of the original feature space.  Polynomial kernel functions are finite but will lead to practical problems very easily.

Having said all that, logistic regression really shines when you want to analyze things like the relative importance of each feature. The "glmnet" package in R demonstrates this very well. For anyone who's interested, it's really worthwhile looking at Hastie and Tibshirani's work on regularization paths.
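
A small sketch of that idea, assuming the glmnet package is installed and using made-up high-dimensional data: an L1-penalized (lasso) logistic regression leaves only a handful of non-zero coefficients, which is one way to read off the relative importance of features.

library(glmnet)

set.seed(7)
x <- matrix(rnorm(200 * 500), nrow = 200)               # 200 rows, 500 features
y <- rbinom(200, 1, plogis(2 * x[, 1] - 1.5 * x[, 2]))  # only the first two features matter

cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 -> lasso
coefs <- coef(cvfit, s = "lambda.min")
coefs[coefs[, 1] != 0, , drop = FALSE]                    # the sparse set of surviving features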

Lastly, logistic regression can optimize multi-class (multinomial) problems directly. With some techniques, like an SVM, you have to use something like one-vs-rest training to build a multi-class classifier, which may be less optimal.
Haoyan Cai:
As others have said, for most questions like "Should I use algorithm A instead of algorithm B?", the answer is: it depends. To my knowledge, logistic regression and decision trees have the following pros and cons.
1. Logistic regression is easier to interpret, and there are many techniques for subset selection or stepwise selection; decision trees, by contrast, become almost impossible to interpret when the feature dimension gets large.
2. As Vijay pointed out, logistic regression is faster and more reliable when the dimension gets large. Also note that the decision boundaries of decision trees are parallel to the axes.
3. One disadvantage of logistic regression is that it is unstable when a single predictor can almost perfectly explain the response variable, because the coefficient of that variable gets pushed as high as possible; this is a case where people turn to discriminant analysis.
4. Personally, I favor logistic regression when there are well over 20 predictors at hand because it gives me much more flexibility in modeling. I can do subset selection, backward or forward selection, combine shrinkage methods like the lasso with logistic regression, and so on, and it is much easier to interpret the results (coefficients and so on).
Vijay Venkatesh, Data analytics and Machine learning:
It kind of depends on the data set size. I found some unpredictable results and values for trees at smaller volumes of data compared with logistic regression. Also, for larger amounts of data, run time and compute start to matter, and trees take a long time to complete.
Alex Zolotovitski:
If we modify logistic regression to get better accuracy, we end up at an SVM or a neural network.

If we modify a decision tree to get better accuracy, we end up at a random forest or BART.

In both cases we gain accuracy (MSE, AUC, etc.) but lose explainability and simplicity.
Norm Matloff (matloff.wordpress.com):
Some of the distinctions made in the previous answers are rather artificial.  

If one is worried about decision boundaries being restricted to be linear or parallel to the axes, simply add squared and product terms, etc.; SVMs do a pre-transformation of the data, and there is no reason not to do so in other models.

A parametric model such as the logistic has a major advantage in that it gives you insight into the impact of each predictor variable on the response variable. None of the nonparametric methods really provide this.  CART is nice in that it is easy to implement and easy to explain to nontechnical people.

I'd recommend being wary of general statements about predictive ability along the lines of "Method X works well in Setting A, while Method Y works well in Setting B."  As the saying goes, "Your mileage may vary."
Ted Lippe-Corstjens:
Advantages:
Logistic Regression is slightly faster.
Can be turned into an online algorithm.
Logistic Regression is less complex.
Logistic Regression is easier to inspect.

Use logistic regression over decision trees when:
- Their performance is equal (Occam's razor).
- The additional accuracy of decision trees does not outweigh the increased complexity of implementing the model.
- The user/owner of the model demands a logical, easy-to-follow explanation for its predictions (regulatory institutions and some project managers).