Sunday, February 19, 2017

Why SVD?

Singular value decomposition (SVD) is not the same as reducing the dimensionality of the data. It is a method of decomposing a matrix into other matrices that have lots of wonderful properties, which I won't go into here. For more on SVD, see the Wikipedia page.
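As a minimal sketch (using NumPy on a small made-up matrix), the factorization A = U * S * V^T can be computed and checked like this:

import numpy as np

# A small made-up matrix, just for illustration.
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# SVD factors A into U, the singular values s, and V transposed.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstructing A from the factors verifies the decomposition.
A_rec = np.dot(U, np.dot(np.diag(s), Vt))
print(np.allclose(A, A_rec))   # True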

Reducing the dimensionality of your data is sometimes very useful. It may be that you have a lot more variables than observations; this is not uncommon in genomic work. It may be that you have several variables that are very highly correlated, e.g., when they are heavily influenced by a small number of underlying factors, and you wish to recover some approximation to those underlying factors. Dimensionality-reducing techniques such as principal component analysis, multidimensional scaling, and canonical variate analysis give us insights into the relationships between observations and/or variables that we might not be able to get any other way.

A concrete example: some years ago I was analyzing an employee satisfaction survey that had over 100 questions on it. Well, no manager is ever going to be able to look at 100+ questions worth of answers, even summarized, and do more than guess at what it all means, because who can tell how the answers are related and what is driving them, really? I performed a factor analysis on the data, for which I had over 10,000 observations, and came up with five very clear and readily interpretable factors which could be used to develop manager-specific scores (one for each factor) that would summarize the entirety of the 100+ question survey. A much better solution than the Excel spreadsheet dump that had been the prior method of reporting results!

Tuesday, February 14, 2017

Bag of Words

Conceptually, we can view the bag-of-words model as a special case of the n-gram model, with n=1.


TF & TF-IDF belong to the Vector Space Model

Bag-of-words: For a given document, you extract only the unigram words (aka terms) to create an unordered list of words. No POS tag, no syntax, no semantics, no position, no bigrams, no trigrams. Only the unigram words themselves, making for a bunch of words to represent the document. Thus: Bag-of-words.


We do not consider the order of words in a document. "John is quicker than Mary" and "Mary is quicker than John" are represented the same way. This is called a bag-of-words model.

The Vector Space Model doesn't consider ordering either.
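A minimal sketch of this point with scikit-learn's CountVectorizer (my own toy sentences): both orderings produce exactly the same count vector.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["John is quicker than Mary",
        "Mary is quicker than John"]

# Unordered unigram counts: no positions, no n-grams, just the terms themselves.
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))   # ['is', 'john', 'mary', 'quicker', 'than']
print(X.toarray())               # both rows are [1 1 1 1 1], i.e. identical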

Monday, February 13, 2017

Random Forest Limitation

Random forest for bag-of-words? No.

Random forest is a very good, robust, and versatile method; however, it's no mystery that for high-dimensional sparse data it's not the best choice. And a bag-of-words representation is a perfect example of sparse, high-dimensional data.
We covered bag of words a few times before, for example in A bag of words and a nice little network. In that post, we used a neural network for classification, but the truth is that a linear model in all its glorious simplicity is usually the first choice. We’ll use logistic regression, for now leaving hyperparams at their default values.
Validation AUC for logistic regression is 92.8%, and it trains much faster than a random forest. If you're going to remember only one thing from this article, remember to use a linear model for sparse high-dimensional data such as text represented as bag-of-words.
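A hedged sketch of that comparison; the tiny made-up corpus below is only there to make the snippet runnable, and the 92.8% AUC quoted above came from a real dataset, not from this toy.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy texts and binary labels; substitute your own data here.
texts = ["great movie", "terrible film", "loved it", "awful acting",
         "what a great film", "really terrible stuff"]
labels = [1, 0, 1, 0, 1, 0]

# Sparse, high-dimensional bag-of-words representation.
X = TfidfVectorizer().fit_transform(texts)

# Linear model vs. random forest, both at default hyperparameters.
for model in (LogisticRegression(), RandomForestClassifier()):
    scores = cross_val_score(model, X, labels, cv=3, scoring='roc_auc')
    print(type(model).__name__, scores.mean())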

Sunday, February 12, 2017

NLP Terms

Bag of Words Model
TF-IDF
Vector Space Model
Standard Boolean Model

Why does mean normalization help in gradient descent?

Essentially, scaling the inputs (through mean normalization, or z-score) gives the error surface a more spherical shape, where it would otherwise be a very high curvature ellipse. Since gradient descent is curvature-ignorant, having an error surface with high curvature will mean that we take many steps which aren't necessarily in the optimal direction. When we scale the inputs, we reduce the curvature, which makes methods that ignore curvature (like gradient descent) work much better. When the error surface is circular (spherical), the gradient points right at the minimum, so learning is easy.

Feature Scaling

Three Types:


http://stats.stackexchange.com/questions/70553/how-to-verify-a-distribution-is-normalized/70555#70555


Centered:  X - mean

Standardized:  (X - mean) / sd
Standardize features by removing the mean and scaling to unit variance

Normalized: (X - mean) / (max - min)
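A small sketch of the three transforms above with NumPy (x is a made-up vector):

import numpy as np

x = np.array([1.0, 2.0, 5.0, 10.0])   # made-up values

centered = x - x.mean()                               # Centered: X - mean
standardized = (x - x.mean()) / x.std()               # Standardized: (X - mean) / sd
normalized = (x - x.mean()) / (x.max() - x.min())     # Normalized: (X - mean) / (max - min)

print(standardized.mean(), standardized.std())        # ~0 and 1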


http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

https://en.wikipedia.org/wiki/Feature_scaling
Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
http://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia

http://stats.stackexchange.com/questions/86434/is-standardisation-before-lasso-really-necessary
Lasso regression puts constraints on the size of the coefficients associated to each variable. However, this value will depend on the magnitude of each variable. It is therefore necessary to center and reduce, or standardize, the variables.
The result of centering the variables means that there is no longer an intercept. This applies equally to ridge regression, by the way.
Another good explanation is this post: Need for centering and standardizing data in regression
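Following the quotes above, a hedged sketch of standardizing before Lasso with a scikit-learn pipeline (the synthetic data and alpha value are only for illustration):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

# Synthetic data: two informative features on very different scales.
rng = np.random.RandomState(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, 200)

# Standardizing inside the pipeline lets the L1 penalty treat features comparably.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)

print(model.named_steps['lasso'].coef_)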


Need Normalization Before PCA

Normalization is important in PCA since it is a variance maximizing exercise. It projects your original data onto directions which maximize the variance. The first plot below shows the amount of total variance explained by the different principal components where we have not normalized the data. As you can see, it seems like component one alone explains all the variance in the data.
[Plot: variance explained by each principal component, without normalization]
If you look at the second plot, we have normalized the data first. Here it is clear that the other components contribute as well. The reason is that PCA seeks to maximize the variance of each component, and the covariance matrix of this particular dataset is:
             Murder   Assault   UrbanPop      Rape
Murder    18.970465  291.0624   4.386204  22.99141
Assault  291.062367 6945.1657 312.275102 519.26906
UrbanPop   4.386204  312.2751 209.518776  55.76808
Rape      22.991412  519.2691  55.768082  87.72916
Given this structure, PCA will of course choose to project as much as possible in the direction of Assault, since that variance is much greater. So for finding features usable for any kind of model, a PCA without normalization would perform badly.
[Plot: variance explained by each principal component, with normalization]
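A sketch of the same effect with scikit-learn; since the covariance matrix above comes from an R dataset (USArrests), synthetic features with very different variances stand in for it here:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Four independent synthetic features, one with much larger variance.
rng = np.random.RandomState(0)
X = np.column_stack([rng.normal(0, 4, 500),
                     rng.normal(0, 80, 500),
                     rng.normal(0, 15, 500),
                     rng.normal(0, 9, 500)])

# Without normalization, the first component mostly follows the high-variance column.
print(PCA().fit(X).explained_variance_ratio_)

# After standardizing, the explained variance is spread across the components.
X_scaled = StandardScaler().fit_transform(X)
print(PCA().fit(X_scaled).explained_variance_ratio_)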

Saturday, February 11, 2017

Log & Exp function

[Plot: the exp and log functions]

e ≈ 2.71828

Log Loss

Log-loss is an appropriate performance measure when your model's output is the probability of a binary outcome.
The log-loss measure considers the confidence of the prediction when assessing how to penalize incorrect classification. For instance, consider two predictions of an outcome P(Y=1|X), where the predictions are 0.51 and 0.99 respectively. In the former case the model is only slightly confident of the class prediction (assuming a 0.5 cutoff), while in the latter it is extremely confident. Since in our case both are wrong (the true class is 0), the log-loss penalty will be harsher for the more confident, but incorrect, prediction.
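A quick check of that example: with the true class being 0, the per-example log loss is -log(1 - p), which is far larger for the over-confident prediction.

import numpy as np

# True class is 0 in both cases; predicted P(Y=1|X) is 0.51 and 0.99.
for p in (0.51, 0.99):
    loss = -np.log(1.0 - p)          # per-example log loss when y = 0
    print(p, round(loss, 3))         # 0.51 -> ~0.713, 0.99 -> ~4.605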
I just installed a GTX 970 in my Ubuntu box. The following command shows the card is connected to the system:
~$ lspci -nnk | grep NVIDIA
60:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2] (rev a1)
60:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
But when I ran my Keras script using TensorFlow as the backend, it didn't seem to speed up the training - each epoch still takes a long time to run.
Is there any way to check if Keras is using GPUs?


You need to install the CUDA libraries for TensorFlow to use the GPU: https://www.tensorflow.org/get_started/os_setup#test_the_tensorflow_installation
If it is installed correctly, every time you run TensorFlow, or Keras with the TensorFlow backend, it will show something like "using Cuda library...".

Key Steps to Enable Keras with the TensorFlow Backend to Use GPUs

1. Install driver

2. Pip install tensorflow-gpu
https://www.tensorflow.org/get_started/os_setup#test_the_tensorflow_installation
3. Install CUDA
https://www.tensorflow.org/get_started/os_setup#test_the_tensorflow_installation
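One way to check what TensorFlow can actually see (assuming a TensorFlow 1.x install from around this time) is to list the local devices; a working GPU setup should show a GPU entry in addition to the CPU:

from tensorflow.python.client import device_lib

# Lists the CPU and GPU devices visible to TensorFlow.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)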

Friday, February 10, 2017

PCA

Principal component transformation is a dimensionality reduction method: it reshapes your space so that a large part of the total information can be explained with the first few axes, and those axes are linear combinations of the initial axes (variables). So yes, PCA can catch the interaction between variables, but only a little, because expressing the interaction between 2, 3, 4, ... variables calls for more advanced constructions (aggregation, for example) than a simple linear combination, if you get my idea.

Thursday, February 9, 2017

Gradient Boosting & Imbalanced Classes

Insensitive to small classes

Some Notes about Scikit-learn Logistic Regression

1. It doesn't use SGD, so there is no learning rate. It uses solvers such as liblinear, newton-cg, etc.
2. Use SGDClassifier for SGD training
3. C is used to control the regularization effect; setting C very large reduces regularization to nearly zero
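A small sketch of points 1-3 (toy data from scikit-learn; parameter values are only illustrative, and the loss='log' spelling is the one used in scikit-learn versions around this time):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1 & 3: solver-based logistic regression; a very large C means almost no regularization.
clf = LogisticRegression(solver='liblinear', C=1e6)
clf.fit(X, y)

# 2: SGD training of the same logistic loss, with a learning rate schedule.
sgd = SGDClassifier(loss='log', alpha=1e-4)
sgd.fit(X, y)

print(clf.score(X, y))
print(sgd.score(X, y))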

Sunday, February 5, 2017

Ensemble - Blending & Stacking

Blending means the same thing as Stacking

Stacking, Blending, and Stacked Generalization are all the same thing with different names. It is a kind of ensemble learning.
In traditional ensemble learning, we have multiple classifiers trying to fit a training set to approximate the target function. Since each classifier has its own output, we need a combining mechanism to combine the results. This can be voting (majority wins), weighted voting (some classifiers have more authority than others), averaging the results, etc. This is the traditional way of ensemble learning.
In stacking, the combining mechanism is that the output of the classifiers (Level 0 classifiers) is used as training data for another classifier (Level 1 classifier) to approximate the same target function. Basically, you let the Level 1 classifier figure out the combining mechanism.
In practice, this works very well. In fact, it was most famously used in the Netflix Prize to achieve a very good score.

Ensemble - Stacking




Stacking is a way to ensemble multiple classification or regression models. There are many ways to ensemble models; among the most widely known are bagging and boosting. Bagging averages multiple similar high-variance models to decrease variance. Boosting builds multiple incremental models to decrease bias, while keeping variance small.
Stacking is a different paradigm, however. The point of stacking is to explore a space of different models for the same problem. The idea is that you can attack a learning problem with different types of models, each capable of learning some part of the problem, but not the whole problem space. So you build multiple different learners and use them to produce an intermediate prediction, one prediction per learned model. Then you add a new model which learns the same target from those intermediate predictions. This final model is said to be stacked on top of the others, hence the name. Thus you might improve your overall performance, and often you end up with a model which is better than any individual intermediate model. Notice, however, that it does not give you any guarantee, as is often the case with any machine learning technique.
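A hedged sketch of that idea with scikit-learn (toy data): out-of-fold predictions from the level-0 models become the training features for the level-1 model.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)

# Level 0: different model types, each predicting the same target.
level0 = [RandomForestClassifier(n_estimators=100, random_state=0),
          GradientBoostingClassifier(random_state=0)]

# Out-of-fold predicted probabilities, so the level-1 model never sees leaked fits.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method='predict_proba')[:, 1]
    for m in level0])

# Level 1: a simple model stacked on top of the level-0 predictions.
meta_model = LogisticRegression()
meta_model.fit(meta_features, y)
print(meta_model.coef_)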

Variance of Sum of Two Correlated Variables

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

If X & Y uncorrelated, then Cov(X, Y) = 0.

Var(X + Y) = Var(X) + Var(Y)


If X & Y are positively correlated, then Cov(X, Y) > 0, so

Var(X + Y) > Var(X) + Var(Y)

so a random forest has higher variance if its trees are correlated.
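A quick simulation of that inequality, with positively correlated X and Y generated in NumPy:

import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=100000)
y = 0.7 * x + rng.normal(size=100000)   # positively correlated with x

# Var(X + Y) exceeds Var(X) + Var(Y) by roughly 2 * Cov(X, Y).
print(np.var(x + y))
print(np.var(x) + np.var(y))
print(2 * np.cov(x, y)[0, 1])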




GB & RF interpretable?

The main problem with neural networks has always been the difficulty of interpreting the trained model. Unlike gradient boosting or random forests, neural networks rely on combinations of mathematical functions that cannot be easily translated into intuitive human understanding.

"Do you mean GB & RF are interpretable because they can generate important variable list?"

No, the generated variable importance means nothing, as it's built through a heuristic that doesn't guarantee that its output is aligned with the actual GB or RF algorithm. What I mean is that the underlying algorithm that GB and RF are based on, which is decision trees, actually has a non-mathematical meaning. For example, you can say that if x1 > 700 and x2 == False then y = 4.73. That has an exactly interpretable meaning. While GB and RF ensemble many trees together in some mathematical fashion, fundamentally each tree itself is easy to interpret and the process of building them is intuitive as well. NNs, on the other hand, are just layers of more or less linear combinations, where the outputs from one layer are combined linearly to feed into the next layer. If you had only one layer, the model would be more or less a linear regression and easy to interpret, but with multiple layers it's not intuitive to interpret the meaning of the lower layers and weights.

Tuning Hyper-parameters of XGBoost & Random Forest

Tuning sequence for XGBoost:

1. Get n_estimators given a learning rate & some default parameters
2. Tune max_depth, min_child_weight (see the sketch after the parameter notes below)
3. Tune gamma
4. Tune subsample & colsample_bytree
5. Tune reg_alpha
6. Reduce learning rate

  • lambda [default=1, alias: reg_lambda]
    • L2 regularization term on weights; increasing this value will make the model more conservative.
  • alpha [default=0, alias: reg_alpha]
    • L1 regularization term on weights; increasing this value will make the model more conservative.
lambda [default=1]
  • L2 regularization term on weights (analogous to Ridge regression)
  • This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
alpha [default=0]
  • L1 regularization term on weights (analogous to Lasso regression)
  • Can be used in case of very high dimensionality so that the algorithm runs faster when implemented
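A hedged sketch of step 2 of the tuning sequence, using XGBoost's scikit-learn wrapper and a grid search (toy data; the fixed learning rate, n_estimators, and the grids are only illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1 assumed done: learning rate and n_estimators are fixed for now.
base = XGBClassifier(learning_rate=0.1, n_estimators=200)

# Step 2: tune max_depth and min_child_weight together.
grid = GridSearchCV(base,
                    param_grid={'max_depth': [3, 5, 7],
                                'min_child_weight': [1, 3, 5]},
                    scoring='roc_auc', cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)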




Tuning sequence for Random Forests:

1. max_features
2. n_estimators
3. min_samples_leaf

Data Scientist Interview Preparation

HackerRank
LeetCode


Gradient Descent Made Simple

Take the partial derivative of the loss function with respect to each parameter, and step the parameter in the opposite direction of that derivative.
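A minimal sketch of that rule for simple linear regression with a squared-error loss (made-up data; learning rate and iteration count are arbitrary):

import numpy as np

# Made-up data roughly following y = 2x + 1.
rng = np.random.RandomState(0)
x = rng.uniform(0, 1, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)

w, b = 0.0, 0.0
lr = 0.5                                 # learning rate
for _ in range(1000):
    y_hat = w * x + b
    # Partial derivatives of the mean squared error loss w.r.t. w and b.
    dw = np.mean(2 * (y_hat - y) * x)
    db = np.mean(2 * (y_hat - y))
    # Step each parameter against its partial derivative.
    w -= lr * dw
    b -= lr * db

print(w)   # ~2.0
print(b)   # ~1.0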


MAE vs MSE

Basically, MAE is more robust to outliers than MSE. MAE assigns equal weight to the data whereas MSE emphasizes the extremes: the square of a very small number (smaller than 1) is even smaller, and the square of a big number is even bigger.
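A small numeric illustration of that point, with made-up residuals containing one outlier: the outlier inflates the MSE far more than the MAE.

import numpy as np

errors = np.array([1.0, 1.0, 1.0, 1.0, 20.0])   # one large outlier

print(np.mean(np.abs(errors)))    # MAE = 4.8
print(np.mean(errors ** 2))       # MSE = 80.8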

https://www.kaggle.com/c/allstate-claims-severity/discussion/24293

Same here. MAE is non-differentiable (at zero) and solving for it is therefore less efficient (a lot less in this case!).

It also uses more memory. In my case, memory usage continues to increase and at some point the system runs out of free memory and the process crashes.


Saturday, February 4, 2017

Application of Central Limit Theorem

The sample mean approximately follows a normal distribution with mean equal to the population mean and variance equal to the population variance / n.
The sample sum approximately follows a normal distribution with mean equal to the population mean × n and variance equal to the population variance × n.
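A quick simulation of the first statement, using an exponential population (which is far from normal); the means of repeated samples still look normal, with variance close to population variance / n.

import numpy as np

n = 50                                   # sample size
population_var = 1.0                     # exponential(scale=1) has variance 1

# Draw 10,000 samples of size n and take each sample's mean.
rng = np.random.RandomState(0)
sample_means = rng.exponential(scale=1.0, size=(10000, n)).mean(axis=1)

print(sample_means.mean())               # ~1.0, the population mean
print(sample_means.var())                # ~0.02
print(population_var / n)                # 0.02, population variance / n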




Normal Distribution Examples

1. Light Bulb life
2. Human IQ
3. Human Height
4. Size of Product Produced By A Machine
5. Measurement Error

Why Does More Data Reduce Overfitting?

Practical concerns like memory and processor time aside, I can't imagine any situation where having more representative training data leads to a worse outcome. Overfitting is essentially learning spurious correlations that occur in your training data, but not the real world. For example, if you considered only my colleagues, you might learn to associate "named Matt" with "has a beard." It's 100% valid (n=4, even!), but it's obviously not true in general. Increasing the size of your data set (e.g., to the entire building or city) should reduce these spurious correlations and improve the performance of your learner.
That said, one situation where more data does not help---and may even hurt---is if your additional training data is noisy or doesn't match whatever you are trying to predict. I once did an experiment where I plugged different language models[*] into a voice-activated restaurant reservation system.
I varied the amount of training data as well as its relevance: at one extreme, I had a small, carefully curated collection of people booking tables, a perfect match for my application. At the other, I had a model estimated from a huge collection of classic literature, a more accurate language model, but a much worse match to the application. To my surprise, the small-but-relevant model vastly outperformed the big-but-less-relevant model.
Unfortunately, I don't think there are any hard and fast rules for this sort of trade-off. You'll have to try it and see how it works.
[*] A language model is just the probability of seeing a given sequence of words, e.g. P(w_n='quick', w_{n+1}='brown', w_{n+2}='fox'). They're vital to building halfway decent speech/character recognizers.

Thursday, February 2, 2017

Fractional Factorial Design

Use of a fractional factorial (FF) design instead of a full factorial design is usually done for economic reasons. Since there is no free lunch, what price do we pay? See next.

Wednesday, February 1, 2017

LSA & LDA

LSA: Latent Semantic Analysis
LDA: Latent Dirichlet Allocation
LSI:  Latent Semantic Indexing

Simple Explanation of Singular Value Decomposition vs PCA

Use plain language to explain

What is an intuitive explanation of Gradient Boosting?

describe the basic algorithm of AdaBoost

Flatten Nested List in Python

Here's a recursive approach that is string friendly:
nests = [1, 2, [3, 4, [5],['hi']], [6, [[[7, 'hello']]]]]

def flatten(container):
    # Recursively yield the atoms of arbitrarily nested lists/tuples.
    for i in container:
        if isinstance(i, (list, tuple)):
            # Recurse into nested containers and re-yield their items.
            for j in flatten(i):
                yield j
        else:
            # Strings (and other non-list/tuple items) are yielded as-is.
            yield i

print(list(flatten(nests)))
returns:
[1, 2, 3, 4, 5, 'hi', 6, 7, 'hello']
Note, this doesn't make any guarantees about speed or memory overhead, but it illustrates a recursive solution that will hopefully be helpful.