Data Science Guy
Wednesday, March 22, 2017
Linear Regression
Linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables, but it has been borrowed by machine learning. It is both a statistical algorithm and a machine learning algorithm.
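As a concrete illustration, here is a minimal sketch of fitting a linear regression with scikit-learn; the data below is synthetic, invented purely for this example.

```python
# A minimal sketch of fitting a linear regression; the data is synthetic,
# with made-up slope (3.0), intercept (2.0), and noise level.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # one numerical input variable
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, 100)    # linear signal plus noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)               # should recover roughly 3.0 and 2.0
```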
Monday, March 20, 2017
Regression: Correlation vs Causality
Multiple regression, like all statistical techniques based on correlation, has a severe limitation: correlation doesn't prove causation.
Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable;[2] for example, correlation does not imply causation.
https://www.ma.utexas.edu/users/mks/statmistakes/causality.html
Age -> Shoe size
Age -> Reading score
Shoe size and reading score correlated
Shoe size is not a causal variable for reading score, but it can be an actionable variable.
For marketing, we can't obtain age due to privacy constraints, but we can use shoe size to find kids with higher reading scores.
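Here is a hypothetical simulation of that confounding pattern, with made-up coefficients: age drives both shoe size and reading score, so the two correlate strongly even though neither causes the other.

```python
# A made-up simulation of the confounding example above: age -> shoe size and
# age -> reading score, so shoe size and reading score correlate without
# either causing the other. All coefficients are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
age = rng.uniform(5, 12, 1000)                        # kids aged 5-12
shoe_size = 0.8 * age + rng.normal(0, 0.5, 1000)      # age -> shoe size
reading = 10.0 * age + rng.normal(0, 5.0, 1000)       # age -> reading score

r = np.corrcoef(shoe_size, reading)[0, 1]
print(f"correlation(shoe size, reading score) = {r:.2f}")  # strongly positive
```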
============================================
Conversion rate study
Thursday, March 9, 2017
Search Engine vs Recommendation Engine
A recommendation engine is a search engine without user intent.
With a search engine, customers find products through an active search, assuming customers know what they want and how to describe it when forming their search query.
Recommendation engines proactively identify products that have a high probability of being something the consumer might want to buy.
Sunday, February 19, 2017
Why SVD?
Singular value decomposition (SVD) is not the same as reducing the dimensionality of the data. It is a method of decomposing a matrix into other matrices that have lots of wonderful properties which I won't go into here. For more on SVD, see the Wikipedia page.
Reducing the dimensionality of your data is sometimes very useful. It may be that you have a lot more variables than observations; this is not uncommon in genomic work. It may be that we have several variables that are very highly correlated, e.g., when they are heavily influenced by a small number of underlying factors, and we wish to recover some approximation to the underlying factors. Dimensionality-reducing techniques such as principal component analysis, multidimensional scaling, and canonical variate analysis give us insights into the relationships between observations and/or variables that we might not be able to get any other way.
A concrete example: some years ago I was analyzing an employee satisfaction survey that had over 100 questions on it. Well, no manager is ever going to be able to look at 100+ questions worth of answers, even summarized, and do more than guess at what it all means, because who can tell how the answers are related and what is driving them, really? I performed a factor analysis on the data, for which I had over 10,000 observations, and came up with five very clear and readily interpretable factors which could be used to develop manager-specific scores (one for each factor) that would summarize the entirety of the 100+ question survey. A much better solution than the Excel spreadsheet dump that had been the prior method of reporting results!
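To make the distinction concrete, here is a small NumPy sketch: the SVD itself reconstructs the matrix exactly, and dimensionality reduction only happens when you choose to keep the top-k singular values. The matrix here is random, purely for illustration.

```python
# SVD is a factorization, not dimensionality reduction by itself; reduction
# is the extra step of truncating to the top-k singular values. A is made up.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # the decomposition itself
assert np.allclose(A, U @ np.diag(s) @ Vt)         # exact reconstruction

k = 2                                              # keep only 2 dimensions
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-2 approximation
print(np.linalg.norm(A - A_k))                     # reconstruction error
```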
Tuesday, February 14, 2017
Bag of Words
Conceptually, we can view the bag-of-words model as a special case of the n-gram model, with n = 1.
TF & TF-IDF belong to Vector Space Model
Bag-of-words: For a given document, you extract only the unigram words (aka terms) to create an unordered list of words. No POS tag, no syntax, no semantics, no position, no bigrams, no trigrams. Only the unigram words themselves, making for a bunch of words to represent the document. Thus: Bag-of-words.
We do not consider the order of words in a document: "John is quicker than Mary" and "Mary is quicker than John" are represented the same way. This is called a bag-of-words model.
The Vector Space Model doesn't consider ordering either.
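A quick sketch of that order-blindness using scikit-learn's vectorizers; the two example sentences are the ones quoted above.

```python
# Word order is discarded in bag-of-words, so the two sentences below map to
# the same vector. TF-IDF weighting is equally order-free.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["John is quicker than Mary", "Mary is quicker than John"]

bow = CountVectorizer().fit_transform(docs).toarray()
print(bow[0])                       # counts over the shared 5-word vocabulary
print((bow[0] == bow[1]).all())     # True: identical bag-of-words vectors

tfidf = TfidfVectorizer().fit_transform(docs)   # same story for TF-IDF
```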
Monday, February 13, 2017
Random Forest Limitation
Random forest for bag-of-words? No.
Random forest is a very good, robust and versatile method, but it's no mystery that it's not the best choice for high-dimensional sparse data. And the BoW representation is a perfect example of sparse and high-dimensional data.
We covered bag of words a few times before, for example in A bag of words and a nice little network. In that post, we used a neural network for classification, but the truth is that a linear model in all its glorious simplicity is usually the first choice. We’ll use logistic regression, for now leaving hyperparams at their default values.
Validation AUC for logistic regression is 92.8%, and it trains much faster than a random forest. If you're going to remember only one thing from this article, remember to use a linear model for sparse high-dimensional data such as text represented as bag-of-words.
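As a hedged sketch of that comparison with scikit-learn: the 20 newsgroups corpus below is a stand-in assumption for whatever dataset the original experiment used, so the exact numbers will differ from the 92.8% AUC quoted above, but the linear model is typically faster and at least as accurate on sparse text features.

```python
# Logistic regression vs. random forest on sparse bag-of-words features.
# The 20 newsgroups corpus and the two categories are assumptions made for
# this sketch, not the dataset from the original post.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

cats = ["sci.space", "rec.autos"]
data = fetch_20newsgroups(subset="all", categories=cats)
X = TfidfVectorizer().fit_transform(data.data)       # high-dimensional, sparse
X_tr, X_te, y_tr, y_te = train_test_split(X, data.target, random_state=0)

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    auc = roc_auc_score(y_te, clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(type(clf).__name__, round(auc, 3))
```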