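sapply means "simplified lapply". Both apply a function to each element of their input; the difference is that lapply always returns a list, while sapply tries to simplify the result to a vector or matrix and falls back to a list when it cannot.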
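A minimal sketch of the difference in return types. The list `nums` and the variable names are made up for illustration.

```r
# Hypothetical input: a named list of integer vectors
nums <- list(a = 1:3, b = 4:6)

as_list <- lapply(nums, mean)   # always a list, one element per input
as_vec  <- sapply(nums, mean)   # simplified to a named numeric vector

is.list(as_list)      # TRUE
is.numeric(as_vec)    # TRUE

# When the results have different lengths, sapply cannot simplify
# and returns a list, just like lapply
ragged <- sapply(list(1:2, 1:5), rev)
is.list(ragged)       # TRUE
```

If a fixed return type matters, `vapply` is the stricter alternative, since it requires the expected type and length to be declared up front.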
Thursday, March 31, 2016
Initialize A Dataframe in R
df <- data.frame(matrix(ncol = 300, nrow = 100))
Remove Data/Objects in R
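rm(list = ls()) # remove every object in the workspace
rm(list = grep("^paper", ls(), value = TRUE, invert = TRUE)) # remove everything except objects whose names start with "paper"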
Wednesday, March 30, 2016
Handling Missing Values in R
is.na(a)
is.na(x1) <- which(x1 == 7)
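A short sketch of what the assignment form of `is.na` does. The vector `x1` below is invented for illustration.

```r
# Hypothetical data containing the value 7 twice
x1 <- c(1, 7, 3, 7, 5)

is.na(x1)                     # all FALSE: nothing is missing yet
is.na(x1) <- which(x1 == 7)   # mark every 7 as missing
x1                            # 1 NA 3 NA 5
sum(is.na(x1))                # 2
```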
Recoding Data in R
# recode missing values
A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)
A[ is.na(A) ] <- 0
#Let’s re-code all values less than 5 to the value 99.
A[ A < 5 ] <- 99
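One thing worth checking when chaining these two recodes: the zeros that replaced the missing values are themselves below 5, so they also become 99. A sketch, using copies `B` and `C` (names chosen here for illustration) so the two outcomes can be compared:

```r
A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)

# Recode NA to 0 first, then everything below 5 to 99
B <- A
B[is.na(B)] <- 0
B[B < 5] <- 99
B   # 99 99 99 5 99 7 99 99 5 99 6

# Recode only the observed values below 5 and keep NA as NA
C <- A
C[!is.na(C) & C < 5] <- 99
C   # 99 99 NA 5 99 7 NA NA 5 99 6
```

Which of the two is wanted depends on whether the missing values should end up in the 99 group.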
Subsetting A Dataframe
library(lattice)
a<-barley[4] # create a data frame
a<-barley[,4] # create a vector
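The same distinction can be checked on any data frame. The sketch below uses a small made-up data frame `df` rather than `barley`, so it runs without the lattice package.

```r
# Hypothetical data frame for illustration
df <- data.frame(id = 1:3,
                 yield = c(27.0, 48.9, 27.4),
                 site = c("A", "B", "C"))

one_col_df  <- df[2]                   # single bracket, no comma: data frame
one_col_vec <- df[, 2]                 # row/column form: plain vector
also_vec    <- df[[2]]                 # double bracket: plain vector
kept_df     <- df[, 2, drop = FALSE]   # drop = FALSE keeps the data frame

class(one_col_df)    # "data.frame"
class(one_col_vec)   # "numeric"
```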
RF vs GB
There are two main reasons why you would use Random Forests over Gradient Boosted Decision Trees, and they are both pretty related:
1. RF are much easier to tune than GBM
2. RF are harder to overfit than GBM
And, regarding (2), while it is not true that RF do not overfit (contrary to what many are led to believe by Breiman's strong assertions[2]), it is true that they are more robust to overfitting and require less tuning to avoid it.
In some sense, RF is a tree ensemble that is more "plug'n'play" than GBM. However, it is generally true that a well-tuned GBM can outperform an RF.
Also, as Tianqi Chen mentioned, RF has traditionally been easier to parallelize. However, that is no longer a good reason, given that there are now efficient ways to parallelize GBMs as well.
Both are ensemble learning methods that predict (regression or classification) by combining the outputs of individual trees. They differ in the order in which the trees are built and in the way the results are combined.
Random Forests train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree, and less likely to overfit the training data. There are typically two parameters in RF: the number of trees and the number of features to be selected at each node.
GBTs build trees one at a time, where each new tree helps to correct the errors made by the previously trained trees. With each tree added, the model becomes more expressive. There are typically three parameters: the number of trees, the depth of the trees, and the learning rate. Each tree built is generally shallow.
GBDT training generally takes longer because the trees are built sequentially. However, benchmark results have shown that GBDT are often better learners than Random Forests.
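The two ways of building and combining trees can be sketched in base R. This is a deliberately simplified illustration, not a real Random Forest or GBM: it uses one predictor and one-split trees ("stumps"), so the "independent" ensemble is plain bagging without the per-node feature sampling that Random Forests add. The data, the helper names `fit_stump` and `predict_stump`, and the values of `n_trees` and `rate` are all assumptions made for the example.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

# Fit a one-split regression tree (a "stump") on a single predictor
fit_stump <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2   # midpoints between values
  sse <- sapply(cuts, function(cc) {
    left <- y[x <= cc]
    right <- y[x > cc]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(y[x <= best]), right = mean(y[x > best]))
}
predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

n_trees <- 50

# Independent trees: each stump sees its own bootstrap sample,
# and the predictions are averaged at the end
bag <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(n, replace = TRUE)
  fit_stump(x[idx], y[idx])
})
pred_bag <- rowMeans(sapply(bag, function(s) predict_stump(s, x)))

# Sequential trees: each stump is fitted to the residuals of the
# ensemble so far and added with a small learning rate
rate <- 0.1
pred_boost <- rep(mean(y), n)
for (i in seq_len(n_trees)) {
  s <- fit_stump(x, y - pred_boost)
  pred_boost <- pred_boost + rate * predict_stump(s, x)
}

mse <- function(pred) mean((y - pred)^2)
c(bagged = mse(pred_bag), boosted = mse(pred_boost))   # training error only
```

The bagging loop could run in any order or in parallel, whereas each boosting step needs the predictions of all previous steps, which is the reason sequential training is usually slower.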
An overview of differences and some benchmarks results in terms of error rate and training time are given in link below:
Scatter Plot For Variables of A Dataframe
plot(df$var1, df$var2)
Assessing Model Accuracy
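MSE: mean squared error, the average squared difference between the observed and predicted values (regression)
Error Rate: the proportion of observations that are misclassified (classification)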
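Both measures are one-liners in R. A small sketch with made-up observed and predicted values (all vectors below are hypothetical):

```r
# Regression: mean squared error
y     <- c(3.0, 5.0, 2.5, 7.0)   # observed values
y_hat <- c(2.5, 5.0, 4.0, 8.0)   # predicted values
mse <- mean((y - y_hat)^2)
mse          # 0.875

# Classification: share of predictions that differ from the true label
labels <- c("yes", "no", "yes", "yes", "no")
pred   <- c("yes", "yes", "yes", "no", "no")
error_rate <- mean(labels != pred)
error_rate   # 0.4
```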
Classification
The most widely-used classifiers: logistic regression, linear discriminant analysis, and K-nearest neighbors.
More computer-intensive methods: generalized additive models, trees, random forests, boosting, and support vector machines.
Machine Learning Terminology
Classifier: a classification technique
Response Variable(Y): can be quantitative or qualitative
Quantitative: numerical
Qualitative: categorical
Tuesday, March 29, 2016
Random Forests In A Nutshell
Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). Breiman and Cutler's random forest approach is implemented via the randomForest package.
Here is an example.
# Random Forest prediction of Kyphosis data
library(randomForest)
library(rpart) # provides the kyphosis data set
fit <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
print(fit) # view results
importance(fit) # importance of each predictor
Regression Model Strategy
Diagnostics
Transformation
Variable Selection
Plotting A Mathematical Expression
plotcurve <- function(equation = "y = sqrt(1/(1+x^2))", ...) {
  leftright <- strsplit(equation, split = "=")[[1]]
  left <- leftright[1]   # The part to the left of the "="
  right <- leftright[2]  # The part to the right of the "="
  expr <- parse(text = right)
  xname <- all.vars(expr)
  if (length(xname) > 1)
    stop(paste("There are multiple variables, i.e.",
               paste(xname, collapse = " & "),
               "on the right of the equation"))
  if (length(list(...)) == 0) {
    x <- 1:10  # default values when none are supplied
  } else {
    nam <- names(list(...))
    if (nam != xname) stop("Clash of variable names")
    x <- list(...)[[1]]
  }
  assign(xname, x)  # expose the values under the name used in the equation
  y <- eval(expr)
  yexpr <- parse(text = left)[[1]]
  xexpr <- parse(text = xname)[[1]]
  plot(x, y, ylab = yexpr, xlab = xexpr, type = "n")
  lines(spline(x, y))
  mainexpr <- parse(text = paste(left, "==", right))
  title(main = mainexpr)
}
plotcurve()
plotcurve("ang=asin(sqrt(p))", p=(1:49)/50)
Searching R Functions For A Specified Token
mygrep <- function(str) {
  tempobj <- ls(envir = sys.frame(-1))  # objects in the calling frame
  objstring <- character(0)
  for (i in tempobj) {
    myfunc <- get(i)
    if (is.function(myfunc))
      if (length(grep(str, deparse(myfunc))))
        objstring <- c(objstring, i)  # keep functions whose source mentions str
  }
  return(objstring)
}
mygrep("for")
Saturday, March 26, 2016
Applying The Same Operation On Multiple Data Frames In R
Solution #1:
a<-c(1,2,3,4)
b<-c(5,6,7,8)
pos1<-data.frame(cbind(a,b))
pos2<-pos1+1
pos3<-pos1+2
var1<-paste0('pos',1:3)
exps<-paste0(var1,'$x<-',var1,'$a+10')
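for(exp in exps){
  eval(parse(text=exp))
}
Solution #2: Use lapply
x<-list(pos1,pos2,pos3)
# return the modified data frame from the function; otherwise lapply returns only the new column values
x<-lapply(x, function(d) { d$x<-d$a+10; d })
lapply(x,function(d)cor(d[,2],d[,3]))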
http://stackoverflow.com/questions/16115745/applying-a-set-of-operations-across-several-data-frames-in-r
http://stackoverflow.com/questions/19249303/applying-lapply-on-multiple-data-frames-in-a-list-r
https://www.datacamp.com/community/tutorials/r-tutorial-apply-family
Thursday, March 24, 2016
Commands To Inspect Data Structure In R
str(a)
class(a)
typeof(a)
length(a)
attributes(a)
names(a)
dim(a)
is.character(a)
is.numeric(a)
is.double(a)
is.integer(a)
is.logical(a)
is.atomic(a)
is.function(a)
is.vector(a)
is.data.frame(a)
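A quick sketch of what some of these commands report for a data frame. The object `a` below is a made-up example.

```r
# Hypothetical data frame with one integer and one character column
a <- data.frame(id = 1:3, name = c("x", "y", "z"), stringsAsFactors = FALSE)

str(a)                 # compact overview of columns and types
class(a)               # "data.frame"
typeof(a)              # "list": a data frame is stored as a list of columns
length(a)              # 2, the number of columns
dim(a)                 # 3 2
names(a)               # "id" "name"
attributes(a)          # names, class and row.names
is.data.frame(a)       # TRUE
is.integer(a$id)       # TRUE
is.character(a$name)   # TRUE
```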
Tuesday, March 22, 2016
Keyboard Shortcuts For R Studio
Console

Description | Windows & Linux | Mac |
---|---|---|
Move cursor to Console | Ctrl+2 | Ctrl+2 |
Clear console | Ctrl+L | Ctrl+L |
Move cursor to beginning of line | Home | Command+Left |
Move cursor to end of line | End | Command+Right |
Navigate command history | Up/Down | Up/Down |
Popup command history | Ctrl+Up | Command+Up |
Interrupt currently executing command | Esc | Esc |
Change working directory | Ctrl+Shift+H | Ctrl+Shift+H |

Source

Description | Windows & Linux | Mac |
---|---|---|
Goto File/Function | Ctrl+. | Ctrl+. |
Move cursor to Source Editor | Ctrl+1 | Ctrl+1 |
New document (except on Chrome/Windows) | Ctrl+Shift+N | Command+Shift+N |
New document (Chrome only) | Ctrl+Alt+Shift+N | Command+Shift+Alt+N |
Open document | Ctrl+O | Command+O |
Save active document | Ctrl+S | Command+S |
Close active document (except on Chrome) | Ctrl+W | Command+W |
Close active document (Chrome only) | Ctrl+Alt+W | Command+Option+W |
Close all open documents | Ctrl+Shift+W | Command+Shift+W |
Preview HTML (Markdown and HTML) | Ctrl+Shift+K | Command+Shift+K |
Knit Document (knitr) | Ctrl+Shift+K | Command+Shift+K |
Compile Notebook | Ctrl+Shift+K | Command+Shift+K |
Compile PDF (TeX and Sweave) | Ctrl+Shift+K | Command+Shift+K |
Insert chunk (Sweave and Knitr) | Ctrl+Alt+I | Command+Option+I |
Insert code section | Ctrl+Shift+R | Command+Shift+R |
Run current line/selection | Ctrl+Enter | Command+Enter |
Run current line/selection (retain cursor position) | Alt+Enter | Option+Enter |
Re-run previous region | Ctrl+Shift+P | Command+Shift+P |
Run current document | Ctrl+Alt+R | Command+Option+R |
Run from document beginning to current line | Ctrl+Alt+B | Command+Option+B |
Run from current line to document end | Ctrl+Alt+E | Command+Option+E |
Run the current function definition | Ctrl+Alt+F | Command+Option+F |
Run the current code section | Ctrl+Alt+T | Command+Option+T |
Run previous Sweave/Rmd code | Ctrl+Alt+P | Command+Option+P |
Run the current Sweave/Rmd chunk | Ctrl+Alt+C | Command+Option+C |
Run the next Sweave/Rmd chunk | Ctrl+Alt+N | Command+Option+N |
Source a file | Ctrl+Shift+O | Command+Shift+O |
Source the current document | Ctrl+Shift+S | Command+Shift+S |
Source the current document (with echo) | Ctrl+Shift+Enter | Command+Shift+Enter |
Fold Selected | Alt+L | Cmd+Option+L |
Unfold Selected | Shift+Alt+L | Cmd+Shift+Option+L |
Fold All | Alt+O | Cmd+Option+O |
Unfold All | Shift+Alt+O | Cmd+Shift+Option+O |
Go to line | Shift+Alt+G | Cmd+Shift+Option+G |
Jump to | Shift+Alt+J | Cmd+Shift+Option+J |
Switch to tab | Ctrl+Shift+. | Ctrl+Shift+. |
Previous tab | Ctrl+F11 | Ctrl+F11 |
Next tab | Ctrl+F12 | Ctrl+F12 |
First tab | Ctrl+Shift+F11 | Ctrl+Shift+F11 |
Last tab | Ctrl+Shift+F12 | Ctrl+Shift+F12 |
Navigate back | Ctrl+F9 | Cmd+F9 |
Navigate forward | Ctrl+F10 | Cmd+F10 |
Extract function from selection | Ctrl+Alt+X | Command+Option+X |
Extract variable from selection | Ctrl+Alt+V | Command+Option+V |
Reindent lines | Ctrl+I | Command+I |
Comment/uncomment current line/selection | Ctrl+Shift+C | Command+Shift+C |
Reflow Comment | Ctrl+Shift+/ | Command+Shift+/ |
Reformat Selection | Ctrl+Shift+A | Command+Shift+A |
Show Diagnostics | Ctrl+Shift+Alt+P | Command+Shift+Alt+P |
Transpose Letters | | Ctrl+T |
Move Lines Up/Down | Alt+Up/Down | Option+Up/Down |
Copy Lines Up/Down | Shift+Alt+Up/Down | Command+Option+Up/Down |
Jump to Matching Brace/Paren | Ctrl+P | Ctrl+P |
Expand to Matching Brace/Paren | Ctrl+Shift+E | Ctrl+Shift+E |
Select to Matching Brace/Paren | Ctrl+Shift+Alt+E | Ctrl+Shift+Alt+E |
Add Cursor Above Current Cursor | Ctrl+Alt+Up | Ctrl+Alt+Up |
Add Cursor Below Current Cursor | Ctrl+Alt+Down | Ctrl+Alt+Down |
Move Active Cursor Up | Ctrl+Alt+Shift+Up | Ctrl+Alt+Shift+Up |
Move Active Cursor Down | Ctrl+Alt+Shift+Down | Ctrl+Alt+Shift+Down |
Find and Replace | Ctrl+F | Command+F |
Find Next | Win: F3, Linux: Ctrl+G | Command+G |
Find Previous | Win: Shift+F3, Linux: Ctrl+Shift+G | Command+Shift+G |
Use Selection for Find | Ctrl+F3 | Command+E |
Replace and Find | Ctrl+Shift+J | Command+Shift+J |
Find in Files | Ctrl+Shift+F | Command+Shift+F |
Check Spelling | F7 | F7 |

Editing (Console and Source)

Description | Windows & Linux | Mac |
---|---|---|
Undo | Ctrl+Z | Command+Z |
Redo | Ctrl+Shift+Z | Command+Shift+Z |
Cut | Ctrl+X | Command+X |
Copy | Ctrl+C | Command+C |
Paste | Ctrl+V | Command+V |
Select All | Ctrl+A | Command+A |
Jump to Word | Ctrl+Left/Right | Option+Left/Right |
Jump to Start/End | Ctrl+Home/End or Ctrl+Up/Down | Command+Home/End or Command+Up/Down |
Delete Line | Ctrl+D | Command+D |
Select | Shift+[Arrow] | Shift+[Arrow] |
Select Word | Ctrl+Shift+Left/Right | Option+Shift+Left/Right |
Select to Line Start | Alt+Shift+Left | Command+Shift+Left |
Select to Line End | Alt+Shift+Right | Command+Shift+Right |
Select Page Up/Down | Shift+PageUp/PageDown | Shift+PageUp/Down |
Select to Start/End | Ctrl+Shift+Home/End or Shift+Alt+Up/Down | Command+Shift+Up/Down |
Delete Word Left | Ctrl+Backspace | Option+Backspace or Ctrl+Option+Backspace |
Delete Word Right | | Option+Delete |
Delete to Line End | | Ctrl+K |
Delete to Line Start | | Option+Backspace |
Indent | Tab (at beginning of line) | Tab (at beginning of line) |
Outdent | Shift+Tab | Shift+Tab |
Yank line up to cursor | Ctrl+U | Ctrl+U |
Yank line after cursor | Ctrl+K | Ctrl+K |
Insert currently yanked text | Ctrl+Y | Ctrl+Y |
Insert assignment operator | Alt+- | Option+- |
Insert pipe operator | Ctrl+Shift+M | Cmd+Shift+M |
Show help for function at cursor | F1 | F1 |
Show source code for function at cursor | F2 | F2 |
Find usages for symbol at cursor (C++) | Ctrl+Alt+U | Cmd+Option+U |

Completions (Console and Source)

Description | Windows & Linux | Mac |
---|---|---|
Attempt completion | Tab or Ctrl+Space | Tab or Command+Space |
Navigate candidates | Up/Down | Up/Down |
Accept selected candidate | Enter, Tab, or Right | Enter, Tab, or Right |
Dismiss completion popup | Esc | Esc |

Views

Description | Windows & Linux | Mac |
---|---|---|
Move focus to Source Editor | Ctrl+1 | Ctrl+1 |
Move focus to Console | Ctrl+2 | Ctrl+2 |
Move focus to Help | Ctrl+3 | Ctrl+3 |
Show History | Ctrl+4 | Ctrl+4 |
Show Files | Ctrl+5 | Ctrl+5 |
Show Plots | Ctrl+6 | Ctrl+6 |
Show Packages | Ctrl+7 | Ctrl+7 |
Show Environment | Ctrl+8 | Ctrl+8 |
Show Git/SVN | Ctrl+9 | Ctrl+9 |
Show Build | Ctrl+0 | Ctrl+0 |
Sync Editor & PDF Preview | Ctrl+F8 | Cmd+F8 |
Show Keyboard Shortcut Reference | Alt+Shift+K | Option+Shift+K |

Build

Description | Windows & Linux | Mac |
---|---|---|
Build and Reload | Ctrl+Shift+B | Cmd+Shift+B |
Load All (devtools) | Ctrl+Shift+L | Cmd+Shift+L |
Test Package (Desktop) | Ctrl+Shift+T | Cmd+Shift+T |
Test Package (Web) | Ctrl+Alt+F7 | Cmd+Alt+F7 |
Check Package | Ctrl+Shift+E | Cmd+Shift+E |
Document Package | Ctrl+Shift+D | Cmd+Shift+D |

Debug

Description | Windows & Linux | Mac |
---|---|---|
Toggle Breakpoint | Shift+F9 | Shift+F9 |
Execute Next Line | F10 | F10 |
Step Into Function | Shift+F4 | Shift+F4 |
Finish Function/Loop | Shift+F6 | Shift+F6 |
Continue | Shift+F5 | Shift+F5 |
Stop Debugging | Shift+F8 | Shift+F8 |

Plots

Description | Windows & Linux | Mac |
---|---|---|
Previous plot | Ctrl+Alt+F11 | Command+Option+F11 |
Next plot | Ctrl+Alt+F12 | Command+Option+F12 |

Git/SVN

Description | Windows & Linux | Mac |
---|---|---|
Diff active source document | Ctrl+Alt+D | Ctrl+Option+D |
Commit changes | Ctrl+Alt+M | Ctrl+Option+M |
Scroll diff view | Ctrl+Up/Down | Ctrl+Up/Down |
Stage/Unstage (Git) | Spacebar | Spacebar |
Stage/Unstage and move to next (Git) | Enter | Enter |

Session

Description | Windows & Linux | Mac |
---|---|---|
Quit Session (desktop only) | Ctrl+Q | Command+Q |
Restart R Session | Ctrl+Shift+F10 | Command+Shift+F10 |
Friday, March 18, 2016
Keyboard Shortcut For Stopping A Running Code In R Studio
Move the cursor to the Console window and press the Esc key.
Thursday, March 17, 2016
Sorting in R
Sorting one column:
all[order(all$trivago_id),]
Sorting 2 columns:
all[order(all$trivago_id, all$avg_pos),]
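A small runnable sketch of both cases. The data frame below is invented; only the column names `trivago_id` and `avg_pos` are taken from the lines above.

```r
# Hypothetical data with the same column names
all <- data.frame(trivago_id = c(2, 1, 2, 1),
                  avg_pos    = c(3.5, 1.2, 1.0, 4.8))

by_one <- all[order(all$trivago_id), ]                # one sort key
by_two <- all[order(all$trivago_id, all$avg_pos), ]   # second key breaks ties

# A minus sign reverses the order of a numeric key
by_two_desc <- all[order(all$trivago_id, -all$avg_pos), ]

by_two$avg_pos        # 1.2 4.8 1.0 3.5
by_two_desc$avg_pos   # 4.8 1.2 3.5 1.0
```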
Count By Group In R
library(plyr)
count(ti_pos1, 'cpc_cnt_7d');
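If plyr is not available, base R's `table` gives the same counts. A sketch with made-up values; only the names `ti_pos1` and `cpc_cnt_7d` come from the call above, and the output column name `freq` is chosen here for illustration.

```r
# Hypothetical data with the same column name
ti_pos1 <- data.frame(cpc_cnt_7d = c(0, 2, 2, 5, 0, 2))

counts <- table(ti_pos1$cpc_cnt_7d)   # one count per distinct value
counts

# Turn the table into a data frame with one row per group
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("cpc_cnt_7d", "freq")
counts_df
```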
Wednesday, March 16, 2016
What is the difference between NaN and Inf, and NULL and NA in R?
http://www.quantlego.com/howto/special-missing-values-in-r/
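A brief sketch of how the four values behave in base R, as a quick reference alongside the link:

```r
1 / 0           # Inf: a number too large to represent, or division by zero
-1 / 0          # -Inf
0 / 0           # NaN: "not a number", an undefined numeric result

is.nan(0 / 0)   # TRUE
is.na(NaN)      # TRUE: NaN also counts as missing
is.nan(NA)      # FALSE: NA is missing, but not NaN

length(NA)      # 1: NA is a placeholder that occupies a slot
length(NULL)    # 0: NULL is the absence of an object

c(1, NA, 3)     # length 3, the missing value is kept
c(1, NULL, 3)   # length 2, NULL simply disappears

is.finite(c(1, Inf, NA, NaN))   # TRUE FALSE FALSE FALSE
```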
Tuesday, March 15, 2016
Regression
http://scc.stat.ucla.edu/page_attachments/0000/0139/reg_1.pdf
Convert data.frame columns from factors to characters
If you want to recreate your existing data frame without changing the global option, you can rebuild it with lapply:
bob <- data.frame(lapply(bob, as.character), stringsAsFactors=FALSE)
To replace only factors:
i <- sapply(bob, is.factor)
bob[i] <- lapply(bob[i], as.character)
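A runnable sketch of the factor-only variant. The contents of `bob` are made up for illustration.

```r
# Hypothetical data frame with one integer and two factor columns
bob <- data.frame(id    = 1:3,
                  city  = factor(c("Rome", "Oslo", "Rome")),
                  grade = factor(c("a", "b", "a")))

i <- sapply(bob, is.factor)              # TRUE for the factor columns only
bob[i] <- lapply(bob[i], as.character)   # convert just those columns

sapply(bob, class)   # id stays "integer"; city and grade become "character"
```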
Monday, March 14, 2016
How to Count Unique/Distinct Values of A Variable in R
base R:
length(unique(mydata$colA))
Sunday, March 13, 2016
Summarize data by group in R
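tapply(a$avg_cpc_1d, a$avg_pos, mean) # named vector of group means
aggregate(avg_cpc_1d~avg_pos, a, mean) # the same summary as a data frame
library(dplyr) # needed for group_by, summarise and %>%
clk_summ<-data.frame(group_by(clk2, OND)%>%summarise(cnt=n())) # row count per group
a<-tapply(gp$txn,gp$product_ln_name,sum) # group totals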
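A small base R sketch of the first two calls on made-up data. Only the column names `avg_pos` and `avg_cpc_1d` come from the lines above.

```r
# Hypothetical data with the same column names
a <- data.frame(avg_pos    = c(1, 1, 2, 2, 2),
                avg_cpc_1d = c(0.50, 0.70, 0.20, 0.40, 0.30))

# tapply returns a named vector: one mean per group
by_tapply <- tapply(a$avg_cpc_1d, a$avg_pos, mean)
by_tapply      # group 1: 0.6, group 2: 0.3

# aggregate returns a data frame with one row per group
by_aggregate <- aggregate(avg_cpc_1d ~ avg_pos, a, mean)
by_aggregate
```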
I have R data frame like this:
I need to get data frame in the following form:
Group number may vary, but their names and quantity could be obtained by calling
levels(factor(data$group))
What manipulations should be done with the data to get the result?
Here is the plyr one line variant using ddply:
Here is another one line variant using new package data.table.
This one is faster, though this is noticeable only on table with 100k rows. Timings on my Macbook Pro with 2.53 Ghz Core 2 Duo processor and R 2.11.1:
Further savings are possible if we use setkey: