As such, linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables, but it has been borrowed by machine learning. It is both a statistical algorithm and a machine learning algorithm.
Wednesday, March 22, 2017
Monday, March 20, 2017
Regression: Correlation vs Causality
Multiple regression, like all statistical techniques based on correlation, has a severe limitation: correlation doesn't prove causation.
Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable;[2] for example, correlation does not imply causation.
https://www.ma.utexas.edu/users/mks/statmistakes/causality.html
Age -> Shoe size
Age -> Reading score
Shoe size and reading score correlated
Shoe size is not a causal variable, but it can be an actionable variable.
For marketing, we can't obtain age due to privacy, but we can use shoe size to find kids with higher reading scores.
============================================
Conversion rate study
Thursday, March 9, 2017
Search Engine vs Recommendation Engine
Recommendation engine is a search engine without user intent.
With a search engine, customers find products through an active search, assuming customers know what they want and how to describe it when forming their search query.
Recommendation engines proactively identify products that have a high probability of being something the consumer might want to buy. Each time customers
Sunday, February 19, 2017
Why SVD?
Singular value decomposition (SVD) is not the same as reducing the dimensionality of the data. It is a method of decomposing a matrix into other matrices that have lots of wonderful properties which I won't go into here. For more on SVD, see the Wikipedia page.
Reducing the dimensionality of your data is sometimes very useful. It may be that you have a lot more variables than observations; this is not uncommon in genomic work. It may be that we have several variables that are very highly correlated, e.g., when they are heavily influenced by a small number of underlying factors, and we wish to recover some approximation to the underlying factors. Dimensionality-reducing techniques such as principal component analysis, multidimensional scaling, and canonical variate analysis give us insights into the relationships between observations and/or variables that we might not be able to get any other way.
A concrete example: some years ago I was analyzing an employee satisfaction survey that had over 100 questions on it. Well, no manager is ever going to be able to look at 100+ questions worth of answers, even summarized, and do more than guess at what it all means, because who can tell how the answers are related and what is driving them, really? I performed a factor analysis on the data, for which I had over 10,000 observations, and came up with five very clear and readily interpretable factors which could be used to develop manager-specific scores (one for each factor) that would summarize the entirety of the 100+ question survey. A much better solution than the Excel spreadsheet dump that had been the prior method of reporting results!
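Beyond the survey anecdote, here is a minimal NumPy sketch (my own illustration, not from the original post) of how a truncated SVD yields a low-dimensional representation of a data matrix:

# Truncated SVD as dimensionality reduction: keep only the top-k singular values.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 20)                       # 100 observations, 20 variables

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5                                        # keep the 5 largest singular values
X_reduced = U[:, :k] * s[:k]                 # 100 x 5 low-dimensional representation
X_approx = X_reduced @ Vt[:k, :]             # best rank-5 approximation of X

print(X_reduced.shape, X_approx.shape)       # (100, 5) (100, 20)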
Tuesday, February 14, 2017
Bag of Words
Conceptually, we can view the bag-of-words model as a special case of the n-gram model, with n=1.
TF & TF-IDF belong to Vector Space Model
Bag-of-words: For a given document, you extract only the unigram words (aka terms) to create an unordered list of words. No POS tag, no syntax, no semantics, no position, no bigrams, no trigrams. Only the unigram words themselves, making for a bunch of words to represent the document. Thus: Bag-of-words.
We do not consider the order of words in a document. "John is quicker than Mary" and "Mary is quicker than John" are represented the same way. This is called a bag-of-words model.
The Vector Space Model doesn't consider ordering either.
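A small sketch (my own illustration, using scikit-learn's CountVectorizer) showing that the two sentences above receive identical bag-of-words vectors:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["John is quicker than Mary", "Mary is quicker than John"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))   # ['is', 'john', 'mary', 'quicker', 'than']
print(X.toarray())               # both rows identical: word order is discarded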
Monday, February 13, 2017
Random Forest Limitation
Random forest for bag-of-words? No.
Random forest is a very good, robust and versatile method; however, it's no mystery that it is not the best choice for high-dimensional sparse data. And the BoW representation is a perfect example of sparse and high-dimensional data.
We covered bag of words a few times before, for example in A bag of words and a nice little network. In that post, we used a neural network for classification, but the truth is that a linear model in all its glorious simplicity is usually the first choice. We’ll use logistic regression, for now leaving hyperparams at their default values.
Validation AUC for logistic regression is 92.8%, and it trains much faster than a random forest. If you’re going to remember only one thing from this article, remember to use a linear model for sparse high-dimensional data such as text as bag-of-words.
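A hedged sketch of that comparison (the toy texts and labels below are placeholders, not the article's data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

texts = ["good movie", "bad movie", "great film", "terrible film"] * 50
labels = [1, 0, 1, 0] * 50

X = TfidfVectorizer().fit_transform(texts)   # high-dimensional, sparse matrix

for clf in (LogisticRegression(), RandomForestClassifier(n_estimators=100)):
    score = cross_val_score(clf, X, labels, cv=5, scoring="roc_auc").mean()
    print(type(clf).__name__, round(score, 3))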
Sunday, February 12, 2017
Why does mean normalization help in gradient descent?
Essentially, scaling the inputs (through mean normalization, or z-score) gives the error surface a more spherical shape, where it would otherwise be a very high curvature ellipse. Since gradient descent is curvature-ignorant, having an error surface with high curvature will mean that we take many steps which aren't necessarily in the optimal direction. When we scale the inputs, we reduce the curvature, which makes methods that ignore curvature (like gradient descent) work much better. When the error surface is circular (spherical), the gradient points right at the minimum, so learning is easy.
Feature Scaling
Three Types:
http://stats.stackexchange.com/questions/70553/how-to-verify-a-distribution-is-normalized/70555#70555
Centered: X - mean
Standardized: (X - mean) / sd
Standardize features by removing the mean and scaling to unit variance
Normalized: (X - mean) / (max - min)
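A minimal NumPy sketch of the three scalings above (the example vector is arbitrary):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

centered     = x - x.mean()                         # X - mean
standardized = (x - x.mean()) / x.std()             # (X - mean) / sd
normalized   = (x - x.mean()) / (x.max() - x.min()) # (X - mean) / (max - min)

print(centered, standardized, normalized, sep="\n")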
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
https://en.wikipedia.org/wiki/Feature_scaling
http://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia
http://stats.stackexchange.com/questions/86434/is-standardisation-before-lasso-really-necessary
Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
http://stats.stackexchange.com/questions/86434/is-standardisation-before-lasso-really-necessary
Lasso regression puts constraints on the size of the coefficients associated to each variable. However, this value will depend on the magnitude of each variable. It is therefore necessary to center and reduce, or standardize, the variables.
The result of centering the variables means that there is no longer an intercept. This applies equally to ridge regression, by the way.
Another good explanation is this post: Need for centering and standardizing data in regression
Need Normalization Before PCA
Normalization is important in PCA since it is a variance maximizing exercise. It projects your original data onto directions which maximize the variance. The first plot below shows the amount of total variance explained in the different principal components where we have not normalized the data. As you can see, it seems like only component one explains all the variance in the data.

If you look at the second picture we have normalized the data first. Here it is clear that the other components contribute as well. The reason for this is because PCA seeks to maximize the variance of each component. And since the covariance matrix of this particular dataset is:
Murder Assault UrbanPop Rape
Murder 18.970465 291.0624 4.386204 22.99141
Assault 291.062367 6945.1657 312.275102 519.26906
UrbanPop 4.386204 312.2751 209.518776 55.76808
Rape 22.991412 519.2691 55.768082 87.72916
From this structure, PCA will of course choose to project as much as possible in the direction of Assault, since that variance is much greater. So for finding features usable for any kind of model, a PCA without normalization would perform badly.
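A hedged sketch of this effect (synthetic data standing in for the covariance structure above):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Toy stand-in for the data above: one column with a much larger scale.
X = np.column_stack([rng.normal(0, 4, 200),    # "Murder"-like scale
                     rng.normal(0, 80, 200),   # "Assault"-like scale
                     rng.normal(0, 14, 200),
                     rng.normal(0, 9, 200)])

print(PCA().fit(X).explained_variance_ratio_)  # PC1 dominates without scaling
print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)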

Saturday, February 11, 2017
Log Loss
Log-loss is an appropriate performance measure when your model's output is the probability of a binary outcome.
The log-loss measure considers confidence of the prediction when assessing how to penalize incorrect classification. For instance consider two predictions of an outcome P(Y=1|X), where the predictions are 0.51 and 0.99 respectively. In the former case the model is only slightly confident of the class prediction (assuming a 0.5 cutoff), while in the latter it is extremely confident. Since in our case both are wrong, the penalty will be more harsh for the more confident (but incorrect) prediction by employing a log-loss penalty.
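A quick check of that example, assuming the true label is 0 so both predictions are wrong:

import numpy as np

def log_loss(y_true, p):
    # Per-example binary log loss for predicted probability p of class 1.
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(log_loss(0, 0.51))   # ~0.71  (slightly confident, wrong)
print(log_loss(0, 0.99))   # ~4.61  (very confident, wrong -> much larger penalty)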
I just installed GTX 970 into my Ubuntu box. Using the following command shows the card is connected to the system.
~$ lspci -nnk | grep NVIDIA
60:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 970] [10de:13c2] (rev a1)
60:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
But when I ran my Keras script using TensorFlow as the backend, it didn't seem to speed up the training - each epoch still takes a long time to run.
Is there any way to check if Keras is using GPUs?
You need to install the CUDA libraries for TensorFlow to use the GPU: https://www.tensorflow.org/get_started/os_setup#test_the_tensorflow_installation
If it is installed correctly, every time you run TF or Keras with the TF backend, it will show something like "using CUDA library...".
Key Steps to Enable Keras with the TensorFlow Backend to Use GPUs
1. Install driver
2. Pip install tensorflow-gpu
https://www.tensorflow.org/get_started/os_setup#test_the_tensorflow_installation
3. Install CUDA
https://www.tensorflow.org/get_started/os_setup#test_the_tensorflow_installation
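Once the driver, tensorflow-gpu, and CUDA are in place, one quick way to confirm that TensorFlow actually sees the GPU (valid for the TensorFlow 1.x releases of that time) is to list the local devices; a working GPU shows up as a /gpu:0 entry:

# List the compute devices TensorFlow can see (CPU and, if set up correctly, GPU).
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())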
Friday, February 10, 2017
PCA
Principal component transformation is a dimensionality reduction method: it tries to reshape your feature space so that a great part of the total information (variance) can be explained by the first few axes, and those axes are linear combinations of the original axes (variables). So yes, PCA can catch interactions between variables, but only a little, because expressing the interaction between 2, 3, 4, ... variables requires more advanced operations (aggregation, for example) than a simple linear combination of them.
Thursday, February 9, 2017
Some Notes about Scikit-learn Logistic Regression
1. It doesn't use SGD, so there is no learning rate. It uses liblinear, newton-cg, etc. solvers to fit the model.
2. Use SGDClassifier for SGD training.
3. C controls the regularization effect; setting C to a very large value reduces the regularization to nearly zero.
4.
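A minimal sketch of notes 1-3 above (parameter values are illustrative only):

from sklearn.linear_model import LogisticRegression, SGDClassifier

# liblinear / newton-cg / lbfgs solvers, no learning rate to tune:
clf = LogisticRegression(solver="liblinear", C=1e6)   # huge C ~= almost no regularization

# SGD-based logistic regression, with an explicit learning-rate schedule
# (loss="log" is the name used at the time; newer scikit-learn calls it "log_loss"):
sgd = SGDClassifier(loss="log", alpha=1e-4, learning_rate="optimal")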
Sunday, February 5, 2017
Ensemble - Blending & Stacking
Blending means the same thing as Stacking
Stacking, Blending, and Stacked Generalization are all the same thing with different names. It is a kind of ensemble learning.
In traditional ensemble learning, we have multiple classifiers trying to fit to a training set to approximate the target function. Since each classifier will have its own output, we will need to find a combining mechanism to combine the results. This can be through voting (majority wins), weighted voting (some classifier has more authority than the others), averaging the results, etc. This is the traditional way of ensemble learning.
In stacking, the combining mechanism is that the output of the classifiers (Level 0 classifiers) is used as training data for another classifier (Level 1 classifier) to approximate the same target function. Basically, you let the Level 1 classifier figure out the combining mechanism.
In practice, this works very well. In fact, it was most famously used in the Netflix Prize to achieve a very good score.
Ensemble - Stacking
Stacking is a way to ensemble multiple classification or regression models. There are many ways to ensemble models; among the most widely known are bagging and boosting. Bagging averages multiple similar models with high variance to decrease variance. Boosting builds multiple incremental models to decrease the bias, while keeping variance small.
Stacking is a different paradigm, however. The point of stacking is to explore a space of different models for the same problem. The idea is that you can attack a learning problem with different types of models which are capable of learning some part of the problem, but not the whole space of the problem. So you build multiple different learners and use them to build an intermediate prediction, one prediction for each learned model. Then you add a new model which learns the same target from the intermediate predictions. This final model is said to be stacked on top of the others, hence the name. Thus you might improve your overall performance, and often you end up with a model which is better than any individual intermediate model. Notice, however, that it does not give you any guarantee, as is often the case with any machine learning technique.
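A hedged sketch of stacking with out-of-fold predictions (the dataset and level-0 models are arbitrary examples, not from the original posts):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)

level0 = [RandomForestClassifier(n_estimators=100, random_state=0),
          GradientBoostingClassifier(random_state=0)]

# Level 0: out-of-fold predicted probabilities become features for level 1.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in level0
])

# Level 1: a simple model learns how to combine the level-0 predictions.
stacker = LogisticRegression().fit(meta_features, y)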
Variance of Sum of Two Correlated Variables
Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
If X & Y are uncorrelated, then Cov(X, Y) = 0 and
Var(X + Y) = Var(X) + Var(Y)
If X & Y are positively correlated, then Cov(X, Y) > 0 and
Var(X + Y) > Var(X) + Var(Y)
so a Random Forest has higher variance if its trees are correlated.
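A quick simulation of this identity (coefficients chosen only for illustration):

import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=100000)
y = 0.8 * x + 0.6 * rng.normal(size=100000)   # positively correlated with x

print(np.var(x + y))              # ~3.6, i.e. Var(X) + Var(Y) + 2 Cov(X, Y)
print(np.var(x) + np.var(y))      # ~2.0, smaller because Cov(X, Y) > 0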
GB & RF interpretable?
The main problem with neural networks has always been the difficulty of interpreting the trained model. Unlike gradient boosting or random forests, neural networks rely on combinations of mathematical functions that cannot be easily translated to intuitive human understanding.
"Do you mean GB & RF are interpretable because they can generate important variable list?"
No, the generated variable importance means nothing, as it's built through a heuristic that doesn't guarantee that its output is aligned with the actual GB or RF algorithm. What I mean is that the underlying algorithm that GB and RF are based on, which is decision trees, actually has a non-mathematical meaning. For example, you can say that if x1 > 700 and x2 == False then y = 4.73. That has an exactly interpretable meaning. While GB and RF ensemble many trees together in some mathematical fashion, fundamentally each tree itself is easy to interpret and the process of building them is intuitive as well. NNs on the other hand are just layers of more or less linear combinations, where the outputs from one layer are combined linearly to feed into the next layer. If you only had one layer, then the model becomes more or less linear regression and it is easy to interpret, but with multiple layers it's not intuitive to interpret the meaning of the lower layers and weights.
"Do you mean GB & RF are interpretable because they can generate important variable list?"
No, the generated variable importance means nothing as it's built through a heuristic that doesn't guarantee that its output is aligned with the actual GB or RF algorithm. What I mean is that the underlying algorithm that GB and RF are based on, which is decision trees, actually has a non-mathematical meaning. For example, you can say that if x1 > 700 and x2 == False then y = 4.73. That has an exactly interpretable meaning. While GB and RF ensemble many trees together in some mathematical fashion, fundamentally each tree itself is easy to interpret and the process of building them is intuitive as well. NNs on the other hand are just layers of more or less linear combinations, where the output from one layer are combined linearly to feed into the next layer. If you only had one layer, then the model becomes more or less linear regression and it is easy to interpret, but with multiple layers it's not intuitive to interpret the meaning of the lower layers and weights.
Tuning Hyper-parameters of XGBoost & Random Forest
Tuning sequence for XGBoost:
1. Get n_estimators given a learning rate & some default parameters
2. Tune max_depth, min_child_weight
3. Tune gamma
4. Tune subsample & colsample_bytree
5. Tune reg_alpha
6. Reduce learning rate
Tuning sequence for Random Forests:
1. max_features
2. n_estimators
3. min_sample_leaf
- lambda [default=1, alias: reg_lambda]
- L2 regularization term on weights; increasing this value will make the model more conservative.
- alpha [default=0, alias: reg_alpha]
- L1 regularization term on weights; increasing this value will make the model more conservative.
lambda [default=1]
- L2 regularization term on weights (analogous to Ridge regression)
- This is used to handle the regularization part of XGBoost. Though many data scientists don't use it often, it should be explored to reduce overfitting.
alpha [default=0]
- L1 regularization term on weights (analogous to Lasso regression)
- Can be used in case of very high dimensionality so that the algorithm runs faster when implemented.
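A hedged sketch of step 2 of the XGBoost tuning sequence above (the parameter grid, scoring choice, and X_train/y_train names are placeholders):

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {"max_depth": [3, 5, 7],
              "min_child_weight": [1, 3, 5]}

model = XGBRegressor(learning_rate=0.1, n_estimators=200)
search = GridSearchCV(model, param_grid, scoring="neg_mean_absolute_error", cv=5)
# search.fit(X_train, y_train)   # X_train / y_train are assumed to exist
# print(search.best_params_)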
MAE vs MSE
Basically MAE is more robust to outlier than is MSE. MAE assigns equal weight to the data whereas MSE emphasizes the extremes - the square of a very small number (smaller than 1) is even smaller, and the square of a big number is even bigger.
https://www.kaggle.com/c/allstate-claims-severity/discussion/24293
Same here. MAE is non-differentiable and solving for it is therefore less efficient (a lot less in this case!).
It also uses more memory. In my case, memory usage continues to increase and at some point the system runs out of free memory and the process crashes.
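A tiny illustration of the robustness point above: a single outlier barely moves MAE but inflates MSE.

import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.1, 3.1, 14.0])    # last prediction is an outlier

print(np.mean(np.abs(y_true - y_pred)))     # MAE  = 2.575
print(np.mean((y_true - y_pred) ** 2))      # MSE  = 25.0075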
Saturday, February 4, 2017
Application of Central Limit Theorem
The sample mean approximately follows a normal distribution with mean equal to the population mean and variance equal to the population variance divided by n.
The sample sum approximately follows a normal distribution with mean n x the population mean and variance n x the population variance.
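A quick simulation of the first statement (a skewed exponential population, chosen only for illustration):

import numpy as np

rng = np.random.RandomState(0)
n = 50
sample_means = rng.exponential(scale=1.0, size=(10000, n)).mean(axis=1)

print(sample_means.mean())   # ~1.0  (the population mean)
print(sample_means.var())    # ~0.02 (population variance 1.0 divided by n = 50)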
Normal Distribution Examples
1. Light Bulb life
2. Human IQ
3. Human Height
4. Size of Product Produced By A Machine
5. Measurement Error
Why Does More Data Reduce Overfitting?
Practical concerns like memory and processor time aside, I can't imagine any situation where having more representative training data leads to a worse outcome. Overfitting is essentially learning spurious correlations that occur in your training data, but not the real world. For example, if you considered only my colleagues, you might learn to associate "named Matt" with "has a beard." It's 100% valid (n=4 , even!), but it's obviously not true in general. Increasing the size of your data set (e.g., to the entire building or city) should reduce these spurious correlations and improve the performance of your learner.
That said, one situation where more data does not help---and may even hurt---is if your additional training data is noisy or doesn't match whatever you are trying to predict. I once did an experiment where I plugged different language models[*] into a voice-activated restaurant reservation system.
I varied the amount of training data as well as its relevance: at one extreme, I had a small, carefully curated collection of people booking tables, a perfect match for my application. At the other, I had a model estimated from a huge collection of classic literature, a more accurate language model, but a much worse match to the application. To my surprise, the small-but-relevant model vastly outperformed the big-but-less-relevant model.
Unfortunately, I don't think there are any hard and fast rules for this sort of trade-off. You'll have to try it and see how it works.
[*] A language model is just the probability of seeing a given sequence of words, e.g. P(w_n='quick', w_{n+1}='brown', w_{n+2}='fox'). They're vital to building halfway decent speech/character recognizers.
Thursday, February 2, 2017
Fractional Factorial Design
Use of a fractional factorial (FF) design instead of a full factorial design is usually done for economic reasons. Since there is no free lunch, what price do you pay? See next.
Wednesday, February 1, 2017
LSA & LDA
LSA: Latent Semantic Analysis
LDA: Latent Dirichlet Allocation
LSI: Latent Semantic Indexing
Flatten Nested List in Python
Here's a recursive approach that is string friendly:
nests = [1, 2, [3, 4, [5], ['hi']], [6, [[[7, 'hello']]]]]

def flatten(container):
    # Recursively yield items from arbitrarily nested lists/tuples.
    for i in container:
        if isinstance(i, (list, tuple)):
            for j in flatten(i):
                yield j
        else:
            yield i

print(list(flatten(nests)))
returns:
[1, 2, 3, 4, 5, 'hi', 6, 7, 'hello']
Note, this doesn't make any guarantees for speed or overhead use, but illustrates a recursive solution that hopefully will be helpful.
Tuesday, January 31, 2017
Steps to Create Text Clusters
Create TFIDF Matrix: rows are documents, columns are normalized text tokens, N x T
Apply Singular Value Decomposition (SVD) to reduce dimensions, N x V
Use a Gaussian Mixture Model (GMM) to create clusters: N x S
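A hedged sketch of this pipeline in scikit-learn (the documents, component count, and cluster count are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

docs = ["cheap flights to paris", "paris hotel deals",
        "python machine learning", "deep learning with python"] * 25

tfidf = TfidfVectorizer().fit_transform(docs)                # N x T sparse matrix
reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)  # N x V dense matrix
gmm = GaussianMixture(n_components=2, random_state=0).fit(reduced)
clusters = gmm.predict(reduced)                              # N cluster labels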
How Performance Metrics Affect Model Fitting
Here is a nice article:
http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/
L1 vs L2
Objective Function = Loss Function + Regularization Penalty/Term
L1 loss = least absolute error = Sum(| y - f(x) |)
L2 loss = least square error = Sum((y - f(x))^2)

A regularization term is added in order to prevent the coefficients from fitting so perfectly that the model overfits.
L1 reg = the sum of the absolute values of the weights.
L2 reg = the sum of the squares of the weights.
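A small sketch (not from the linked article) of the practical consequence: L1 (Lasso) drives some coefficients exactly to zero, while L2 (Ridge) only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.randn(200)   # only 2 informative features

print(Lasso(alpha=0.5).fit(X, y).coef_.round(2))  # many exact zeros
print(Ridge(alpha=0.5).fit(X, y).coef_.round(2))  # small but nonzero values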

Monday, January 30, 2017
Imbalanced Classes
How to combat Imbalanced Training Data
1. Collect more data
2. Change performance metric: Precision & Recall
3. Resample the data
4. Generate synthetic samples
5. Try different algorithms
6. Try penalized models
7. Try different perspective
8. Try getting creative
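A hedged sketch of options 3 and 6 from the list above (the synthetic data is only an illustration):

import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = np.r_[np.ones(50), np.zeros(950)]            # 5% positive class

# Option 3: upsample the minority class to balance the training set.
X_min_up, y_min_up = resample(X[y == 1], y[y == 1], replace=True,
                              n_samples=950, random_state=0)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.r_[y[y == 0], y_min_up]

# Option 6: or keep the data as-is and penalize mistakes on the rare class more.
clf = LogisticRegression(class_weight="balanced").fit(X, y)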
What is Bagging?
Bootstrap aggregating, also called bagging
Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n′, by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each D_i. If n′ = n, then for large n the set D_i is expected to have the fraction (1 - 1/e) (≈63.2%) of the unique examples of D, the rest being duplicates.[1] This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
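A quick simulation of the (1 - 1/e) ≈ 63.2% figure:

import numpy as np

rng = np.random.RandomState(0)
n = 100000
bootstrap_sample = rng.randint(0, n, size=n)     # sample n indices with replacement
print(len(np.unique(bootstrap_sample)) / n)      # ~0.632, i.e. 1 - 1/e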
Bagging leads to "improvements for unstable procedures" (Breiman, 1996), which include, for example, artificial neural networks, classification and regression trees, and subset selection in linear regression (Breiman, 1994). An interesting application of bagging showing improvement in preimage learning is provided here.[2][3] On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).
Sunday, January 29, 2017
Why Dropout Regularization Works?
Dropout is a regularization technique for neural network models proposed by Srivastava, et al. in their 2014 paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting (download the PDF).
Dropout is a technique where randomly selected neurons are ignored during training. They are "dropped out" randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any weight updates are not applied to the neuron on the backward pass.
As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which if taken too far can result in a fragile model too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptation.
You can imagine that if neurons are randomly dropped out of the network during training, that other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.
The effect is that the network becomes less sensitive to the specific weights of neurons. This in turn results in a network that is capable of better generalization and is less likely to overfit the training data.
The original paper on Dropout provides experimental results on a suite of standard machine learning problems. As a result they provide a number of useful heuristics to consider when using dropout in practice.
- Generally use a small dropout value of 20%-50% of neurons with 20% providing a good starting point. A probability too low has minimal effect and a value too high results in under-learning by the network.
- Use a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
- Use dropout on incoming (visible) as well as hidden units. Application of dropout at each layer of the network has shown good results.
- Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99.
- Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights such as max-norm regularization with a size of 4 or 5 has been shown to improve results.
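A minimal Keras sketch following these heuristics (written against the Keras 2 API of the time; layer sizes, rates, and the input shape are arbitrary choices):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import max_norm
from keras.optimizers import SGD

model = Sequential()
model.add(Dropout(0.2, input_shape=(60,)))                       # dropout on the visible layer
model.add(Dense(64, activation="relu", kernel_constraint=max_norm(4)))  # constrain weight size
model.add(Dropout(0.3))                                          # dropout on a hidden layer
model.add(Dense(1, activation="sigmoid"))

sgd = SGD(lr=0.1, momentum=0.9, decay=1e-4)                      # larger learning rate + momentum
model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])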
Keras Neural Network Hyperparameter Tuning List
batch_size, nb_epoch
Optimization algorithm: ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
Learning Rate & Momentum for SGD
Weight Init: ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
Activation Function:['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
Dropout Rate: -> Keras's regularization
the dropout rate for regularization in an effort to limit overfitting and improve the model’s ability to generalize.
[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
# of Neurons:[1, 5, 10, 15, 20, 25, 30]
Also, generally, a large enough single layer network can approximate any other neural network, at least in theory.
- Reproducibility is a Problem. Although we set the seed for the random number generator in NumPy, the results are not 100% reproducible. There is more to reproducibility when grid searching wrapped Keras models than is presented in this post.
Start with Single Hidden Layer and add more layers
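A hedged sketch of grid searching a wrapped Keras model (the grid values and build_model arguments are illustrative, X and y are assumed to exist, and "epochs" is the newer name for the nb_epoch parameter listed above):

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_model(dropout_rate=0.2, neurons=10):
    # Single hidden layer to start with; add more layers later if needed.
    model = Sequential()
    model.add(Dense(neurons, input_dim=8, activation="relu"))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

model = KerasClassifier(build_fn=build_model, verbose=0)
param_grid = {"batch_size": [16, 32], "epochs": [10, 50],
              "dropout_rate": [0.0, 0.2, 0.5], "neurons": [5, 10, 20]}
grid = GridSearchCV(model, param_grid, cv=3)
# grid.fit(X, y)   # X, y are assumed to exist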