Tuesday, January 31, 2017

Steps to Create Text Clusters

Create a TF-IDF matrix: rows are documents, columns are normalized text tokens (N x T)
Apply Singular Value Decomposition (SVD) to reduce dimensions (N x V)
Use a Gaussian Mixture Model (GMM) to assign clusters (N x S)
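
A minimal sketch of this pipeline in scikit-learn; the corpus, component count, and cluster count below are placeholder assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

docs = ["the cat sat", "dogs bark loudly", "cats and dogs", "stock prices fell"]  # placeholder corpus

# N x T: TF-IDF matrix (rows = documents, columns = normalized tokens)
tfidf = TfidfVectorizer().fit_transform(docs)

# N x V: reduce dimensions with SVD (TruncatedSVD works directly on sparse TF-IDF)
reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)

# N x S: GMM cluster assignments
gmm = GaussianMixture(n_components=2, random_state=0).fit(reduced)
labels = gmm.predict(reduced)
print(labels)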

Difference between SVD and PCA


How Performance Metrics Affect Model Fitting

Here is a nice article:

http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/

L1 vs L2

Objective Function = Loss Function + Regularization Penalty/Term

L1 loss = least absolute deviations = Sum(|y - f(x)|)
L2 loss = least squares error = Sum((y - f(x))^2)

L1 vs L2 properties (loss function)

A regularization term is added to prevent the coefficients from fitting the training data so perfectly that the model overfits.

L1 reg = the sum of the absolute values of the weights
L2 reg = the sum of the squares of the weights

L1 vs L2 properties (regularization)
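
A minimal sketch contrasting the two penalties with scikit-learn's standard estimators (Lasso uses the L1 penalty, Ridge the L2 penalty); the data is a placeholder assumption.

from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=0)

# L1 penalty (Lasso) tends to drive many coefficients exactly to zero (sparse solutions)
l1 = Lasso(alpha=1.0).fit(X, y)
# L2 penalty (Ridge) shrinks coefficients toward zero but rarely to exactly zero
l2 = Ridge(alpha=1.0).fit(X, y)

print("zero coefficients with L1:", (l1.coef_ == 0).sum())
print("zero coefficients with L2:", (l2.coef_ == 0).sum())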

Monday, January 30, 2017

Imbalanced Classes

How to combat Imbalanced Training Data

1. Collect more data
2. Change performance metric: Precision & Recall
3. Resample data (see the sketch after this list)
4. Generate synthetic samples
5. Try different algorithms
6. Try penalized models
7. Try a different perspective
8. Try getting creative
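
A minimal resampling sketch for item 3, assuming scikit-learn's resample utility and placeholder data; here the minority class is upsampled with replacement (downsampling the majority class works the same way).

import numpy as np
from sklearn.utils import resample

# placeholder imbalanced data: 950 negatives, 50 positives
X = np.random.randn(1000, 5)
y = np.array([0] * 950 + [1] * 50)

X_min, X_maj = X[y == 1], X[y == 0]

# upsample the minority class with replacement until the classes are balanced
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))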


What is Bagging?

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms.

Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n′, by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each D_i. If n′ = n, then for large n the set D_i is expected to contain the fraction (1 - 1/e) (≈63.2%) of the unique examples of D, the rest being duplicates.[1] This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).

Bagging leads to "improvements for unstable procedures" (Breiman, 1996), which include, for example, artificial neural networks, classification and regression trees, and subset selection in linear regression (Breiman, 1994). An interesting application of bagging showing improvement in preimage learning is provided in [2][3]. On the other hand, it can mildly degrade the performance of stable methods such as k-nearest neighbors (Breiman, 1996).
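
A minimal sketch of bagging an unstable learner (a decision tree) with scikit-learn; the data and the number of models m are placeholder assumptions.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)

# m=50 trees, each fit on a bootstrap sample of size n (drawn with replacement),
# combined by majority vote for classification
bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                        n_estimators=50, max_samples=1.0,
                        bootstrap=True, random_state=0).fit(X, y)
print(bag.score(X, y))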

Sunday, January 29, 2017

Why Does Dropout Regularization Work?

Dropout is a regularization technique for neural network models proposed by Srivastava, et al. in their 2014 paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting".
Dropout is a technique where randomly selected neurons are ignored during training. They are "dropped out" randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass.
As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptation.
You can imagine that if neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.
The effect is that the network becomes less sensitive to the specific weights of neurons. This in turn results in a network that is capable of better generalization and is less likely to overfit the training data.

The original paper on dropout provides experimental results on a suite of standard machine learning problems. From these, the authors derive a number of useful heuristics to consider when using dropout in practice.
  • Generally use a small dropout value of 20%-50% of neurons with 20% providing a good starting point. A probability too low has minimal effect and a value too high results in under-learning by the network.
  • Use a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
  • Use dropout on incoming (visible) as well as hidden units. Application of dropout at each layer of the network has shown good results.
  • Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99.
  • Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights such as max-norm regularization with a size of 4 or 5 has been shown to improve results.
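
A minimal Keras 1.x sketch applying these heuristics: 20% dropout on the visible layer, 50% on hidden layers, a max-norm constraint of 4, and high-momentum SGD with decay. The layer sizes and input dimension are placeholder assumptions.

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import maxnorm
from keras.optimizers import SGD

model = Sequential()
model.add(Dropout(0.2, input_shape=(60,)))  # dropout on the visible (input) layer
model.add(Dense(60, activation='relu', W_constraint=maxnorm(4)))
model.add(Dropout(0.5))                     # dropout between hidden layers
model.add(Dense(30, activation='relu', W_constraint=maxnorm(4)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# large learning rate with decay and high momentum, per the heuristics above
sgd = SGD(lr=0.1, momentum=0.9, decay=0.002)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])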

Keras Neural Network Hyperparameter Tuning List

batch_size, nb_epoch
Optimization algorithm: ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
Learning Rate & Momentum for SGD
Weight Init: ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
Activation Function:['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
Dropout Rate (Keras's regularization): tune the dropout rate to limit overfitting and improve the model's ability to generalize.
[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# of Neurons: [1, 5, 10, 15, 20, 25, 30]
Also, generally, a large enough single layer network can approximate any other neural network, at least in theory.

  • Reproducibility is a Problem. Although we set the seed for the random number generator in NumPy, the results are not 100% reproducible. There is more to reproducibility when grid searching wrapped Keras models than is presented in this post.
Start with a single hidden layer and add more layers as needed.
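
A minimal grid-search sketch over two of the knobs above, assuming the 2017-era KerasClassifier scikit-learn wrapper; the model-building function, data, and parameter values are placeholder assumptions.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def create_model(dropout_rate=0.0, optimizer='adam'):
    model = Sequential()
    model.add(Dense(10, input_dim=8, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=optimizer)
    return model

X = np.random.randn(100, 8)           # placeholder data
y = np.random.randint(2, size=100)

model = KerasClassifier(build_fn=create_model, nb_epoch=10, batch_size=16, verbose=0)
param_grid = {'dropout_rate': [0.0, 0.2, 0.5], 'optimizer': ['SGD', 'Adam']}
grid = GridSearchCV(estimator=model, param_grid=param_grid).fit(X, y)
print(grid.best_params_, grid.best_score_)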


Specify number of cores to be used with tensorflow

import tensorflow as tf
from keras import backend as K

# limit TensorFlow to a single CPU device, with one thread per op
# and one thread for independent ops
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1,
                        allow_soft_placement=True,
                        device_count={'CPU': 1})
session = tf.Session(config=config)

# make Keras run on this restricted session
K.set_session(session)


Saturday, January 28, 2017

xgboost cv

xgboost's cv function is the equivalent of scikit-learn's cross_val_score.
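
A minimal xgb.cv sketch with placeholder data and parameters:

import numpy as np
import xgboost as xgb

X = np.random.randn(200, 5)            # placeholder data
y = np.random.randint(2, size=200)
dtrain = xgb.DMatrix(X, label=y)

params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1}
# 5-fold cross-validation; returns per-round train/test metrics
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5,
                    metrics='error', early_stopping_rounds=10, seed=0)
print(cv_results.tail())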

Early Stopping for Keras

from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    EarlyStopping(monitor='val_loss', patience=2, verbose=0),
    # kfold_weights_path: file path where the best weights are saved
    ModelCheckpoint(kfold_weights_path, monitor='val_loss', save_best_only=True, verbose=0),
]
model.fit(X_train.astype('float32'), Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
      shuffle=True, verbose=1, validation_data=(X_valid, Y_valid),
      callbacks=callbacks)

earlyStopping = EarlyStopping(monitor='val_loss', patience=0, verbose=0, mode='auto')
model.fit(X, y, batch_size=128, nb_epoch=100, verbose=1, callbacks=[earlyStopping],
          validation_split=0.1, shuffle=True)

What is DNN?

"Neural networks" is a term usually used to refer to feedforward neural networks. Deep Neural Networks are feedforward Neural Networks with many layers.
A Deep belief network is not the same as a Deep Neural Network.
As you have pointed out, a deep belief network has undirected connections between some layers. This means that the topology of the DNN and the DBN is different by definition.
The undirected layers in the DBN are called Restricted Boltzmann Machines. These layers can be trained using a very fast unsupervised learning algorithm, Contrastive Divergence.
Some more comments:
The solutions obtained with deeper neural networks can perform worse than the solutions obtained for networks with 1 or 2 hidden layers: as the architecture gets deeper, it becomes more difficult to obtain good generalization using a deep NN trained with backpropagation alone.
In 2006 Hinton discovered that much better results could be achieved in deeper architectures when each layer (RBM) is pre-trained with an unsupervised learning algorithm (Contrastive Divergence). Then the Network can be trained in a supervised way using backpropagation in order to "fine-tune" the weights.

Friday, January 27, 2017

K-Means Algorithm

Input: K, and a set of points x1...xn
Place centroids c1...cK at random locations
Repeat until convergence:
    for each point xi:
        assign xi to its nearest centroid: argmin over j of D(xi, cj)
    for each cluster j:
        compute the new centroid: cj = (1/|Cj|) * Sum(xi for all xi assigned to cluster j)
Stop when none of the cluster assignments change

K-Means is very fast: its complexity is O(#iterations * #clusters * #instances * #dimensions), using Euclidean distance.
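
A minimal NumPy sketch of the loop above; K and the data are placeholder assumptions.

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.RandomState(seed)
    # place centroids at K randomly chosen data points
    centroids = X[rng.choice(len(X), K, replace=False)].copy()
    assignments = None
    for _ in range(max_iters):
        # assignment step: each point goes to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # stop when none of the cluster assignments change
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # update step: each centroid becomes the mean of its assigned points
        for j in range(K):
            if (assignments == j).any():
                centroids[j] = X[assignments == j].mean(axis=0)
    return centroids, assignments

X = np.random.randn(100, 2)  # placeholder data
centroids, labels = kmeans(X, K=3)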

Hierarchical clustering and k-means clustering complement each other. In hierarchical clustering, the researcher does not need to know the number of clusters in advance, whereas in k-means clustering the number of clusters must be specified beforehand.

Advice: if unsure about the number of clusters to form, use hierarchical clustering to determine the number, then use k-means clustering to make more stable clusters, since hierarchical clustering is a single-pass exercise whereas k-means is an iterative process.


Explain what a local optimum is and why it is important in a specific context, such as K-means clustering. What are specific ways of determining if you have a local optimum problem? What can be done to avoid local optima?

  • A solution that is optimal within a neighboring set of candidate solutions
  • In contrast with global optimum: the optimal solution among all others
  • K-means clustering context:
    It’s proven that the objective cost function will always decrease until a local optimum is reached.
    Results will depend on the initial random cluster assignment
  • Determining if you have a local optimum problem:
    Tendency of premature convergence
    Different initialization induces different optima
  • Avoid local optima in a K-means context: repeat K-means with different random initializations and take the solution that has the lowest cost (see the sketch below)
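
A minimal sketch of that strategy with scikit-learn's KMeans, which supports it directly through n_init; the data is a placeholder assumption.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(200, 2)  # placeholder data

# run K-means 20 times with different random initializations and keep
# the solution with the lowest cost (inertia = within-cluster sum of squares)
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.inertia_)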

What is batch size in neural network?

Batch size defines the number of samples that will be propagated through the network.
For instance, let's say you have 1050 training samples and you want to set batch_size equal to 100. The algorithm takes the first 100 samples (1st to 100th) from the training dataset and trains the network. Next, it takes the second 100 samples (101st to 200th) and trains the network again. We can keep doing this until all samples have been propagated through the network. A problem usually arises with the last set of samples: in our example, 1050 is not divisible by 100 without remainder. The simplest solution is just to take the final 50 samples and train the network on them.
Advantages:
  • It requires less memory. Since you train the network on fewer samples at a time, the overall training procedure requires less memory. This is especially important if you cannot fit the whole dataset in memory.
  • Networks typically train faster with mini-batches, because we update the weights after each propagation. In our example we propagated 11 batches (10 with 100 samples and 1 with 50) and updated the network's parameters after each of them. If we used all samples in a single propagation, we would make only 1 update of the network's parameters.
Disadvantages:
  • The smaller the batch, the less accurate the estimate of the gradient: a mini-batch gradient's direction fluctuates much more than the full-batch gradient's. (The original post illustrates this with a figure comparing the two.)
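
A minimal sketch of the batching arithmetic described above, with placeholder data:

import numpy as np

X = np.random.randn(1050, 10)  # 1050 placeholder training samples
batch_size = 100

n_batches = int(np.ceil(len(X) / float(batch_size)))  # 11 batches: 10 of 100, 1 of 50
for i in range(n_batches):
    batch = X[i * batch_size:(i + 1) * batch_size]
    # one weight update would happen here, once per batch
    print(i, batch.shape)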

Keras IndexError: indices are out-of-bounds

Solution: 
Answer from the comment: trainx and trainy should be numpy arrays. You can convert a data frame to a numpy array using the as_matrix() method. I also faced this issue; it's odd that Keras does not give a meaningful error message.

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(64, input_dim=20, init='uniform', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
          optimizer='rmsprop')
model.fit(trainx, trainy, nb_epoch=20, batch_size=16) # THROWS INDICES ERROR
Error:
model.fit(trainx, trainy, nb_epoch=20, batch_size=16)

Epoch 1/20
Traceback (most recent call last):

  File "<ipython-input-6-c81bd7606eb0>", line 1, in <module>
model.fit(trainx, trainy, nb_epoch=20, batch_size=16)

  File "C:\Users\Thiru\Anaconda3\lib\site-packages\keras\models.py", line 646, in fit
shuffle=shuffle, metrics=metrics)

  File "C:\Users\Thiru\Anaconda3\lib\site-packages\keras\models.py", line 271, in _fit
ins_batch = slice_X(ins, batch_ids)

  File "C:\Users\Thiru\Anaconda3\lib\site-packages\keras\models.py", line 65, in slice_X
return [x[start] for x in X]

  File "C:\Users\Thiru\Anaconda3\lib\site-packages\keras\models.py", line 65, in <listcomp>
return [x[start] for x in X]

  File "C:\Users\Thiru\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1963, in __getitem__
return self._getitem_array(key)

  File "C:\Users\Thiru\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2008, in _getitem_array
return self.take(indexer, axis=1, convert=True)

  File "C:\Users\Thiru\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1371, in take
convert=True, verify=True)

  File "C:\Users\Thiru\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3619, in take
indexer = maybe_convert_indices(indexer, n)

  File "C:\Users\Thiru\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1750, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")

IndexError: indices are out-of-bounds
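
A minimal sketch of the fix described in the answer above: convert the DataFrames to numpy arrays before calling fit (as_matrix() was the pandas method of the era; .values does the same).

model.fit(trainx.as_matrix(), trainy.as_matrix(), nb_epoch=20, batch_size=16)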

Tuesday, January 24, 2017

Transpose a Column in Pandas

import pandas as pd

d = pd.DataFrame({'state':['CA','MA','NV','CO','CA'], 'id':[100,101,102,103,104]})

# add one indicator column per distinct state: 1 where the row's state matches, else 0
for s in d.state.unique():
  d.loc[d.state == s, s] = 1
  d.loc[~(d.state == s), s] = 0

print(d)
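
For reference, pandas' built-in get_dummies produces the same 0/1 indicator columns in one call:

print(pd.get_dummies(d['state']))  # one 0/1 column per distinct state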

Sunday, January 22, 2017

AdaBoost vs Gradient Boosting

These are what I've understood so far:
  • These are both boosting algorithms, which learn from the previous models' errors and finally take a weighted sum of the models.
  • GBM and AdaBoost are pretty similar except for their loss functions.
But it is still difficult for me to grasp the difference between them. Can someone give me an intuitive explanation?

  • In Gradient Boosting, ‘shortcomings’ (of existing weak learners) are identified by gradients.
  • In Adaboost, ‘shortcomings’ are identified by high-weight data points.

Performance Measures for Binary Target


The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. 

For a highly imbalanced class problem, precision and recall are much better performance metrics than accuracy.
Examples of highly imbalanced class problems:
1. Cancer detection rates
2. Insurance fraud: e.g., 150 fraudulent vs 9850 legitimate claims
3. Information retrieval


Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, while precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.



Accuracy Paradox




Example 1 (total: 10000; rows = actual class, columns = prediction):

             Pred P   Pred N
Actual T        100       50
Actual F        150     9700

Accuracy = 9800/10000 = 98%
Precision = 100 / 250 = 0.4
Recall = 100 / 150 = 0.667
F1 = 2 * 0.4 * 0.667 / 1.067 = 0.5


Example 2 (predict everything negative):

             Pred P   Pred N
Actual T          0      150
Actual F          0     9850

Accuracy = 9850/10000 = 98.5%
Precision = 0 / 0 = undefined (conventionally 0)
Recall = 0 / 150 = 0
F1 = 0


Example 3 (perfect classifier; total: 10000):

             Pred P   Pred N
Actual T        150        0
Actual F          0     9850

Accuracy = 10000/10000 = 100%
Precision = 150 / 150 = 1
Recall = 150 / 150 = 1
F1 = 2 * 1 * 1 / (1 + 1) = 1




F1 = 2 * precision * recall / (precision + recall)


Accuracy:
Weighted Accuracy:
Lift: typical application is marketing
Precision/Recall: document retrieval
ROC Area: medicine & biology, false positive & false negative

The measure you optimize makes a difference.
The measure you report makes a difference.
Use the measure appropriate for the problem/community.
Accuracy often is not sufficient/appropriate.
Only accuracy generalizes to >2 classes.


Confusion Matrix

                Pred True         Pred False
Actual True     True Positive     False Negative
Actual False    False Positive    True Negative


Precision (how many of the returned documents are correct) = TP / (TP + FP)
Recall (how many of the correct documents does the model return) = TP / (TP + FN)

True: 100
False: 10000
TP: 90, FN: 10
FP: 10, TN: 9990

Precision = 90 / 100 = 0.9
Recall = 90 / 100 = 0.9

F = 2 * 0.9 * 0.9 / (1.8) = 0.9
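
A minimal check of these numbers with scikit-learn's metrics, reconstructing label arrays from the counts above:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# TP=90, FN=10, FP=10, TN=9990
y_true = np.array([1] * 100 + [0] * 10000)
y_pred = np.array([1] * 90 + [0] * 10 + [1] * 10 + [0] * 9990)

print(precision_score(y_true, y_pred))  # 0.9
print(recall_score(y_true, y_pred))     # 0.9
print(f1_score(y_true, y_pred))         # 0.9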

The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets