Thursday, March 31, 2016

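What Does sapply Stand For?

sapply is the simplified ("s") version of lapply. Both apply a function to every element of their input. The difference is in the return value: lapply always returns a list, while sapply tries to simplify the result to a vector (or matrix) and falls back to a list only when it cannot.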
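A minimal sketch of the difference in return types, using a made-up list called nums (the name and values are only for illustration):

```r
# Apply the same function with lapply() and sapply() and compare the results.
nums <- list(a = 1:3, b = 4:6)

res_list <- lapply(nums, sum)  # always a list, one element per input element
res_vec  <- sapply(nums, sum)  # simplified to a named vector

str(res_list)
str(res_vec)
```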

Initialize A Dataframe in R

df <- data.frame(matrix(ncol = 300, nrow = 100))

Remove Data/Objects in R

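# remove every object in the workspace
rm(list = ls())


# remove every object except those whose names start with "paper"
rm(list = grep("^paper", ls(), value = TRUE, invert = TRUE))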
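To try the pattern-based removal without touching your real workspace, here is a small sketch that runs it inside a scratch environment. The environment e and the object names are hypothetical, chosen only for the demonstration.

```r
# Build a scratch environment with three objects.
e <- new.env()
assign("paper_a", 1, envir = e)
assign("paper_b", 2, envir = e)
assign("tmp", 3, envir = e)

# invert = TRUE selects the names that do NOT match "^paper".
to_drop <- grep("^paper", ls(e), value = TRUE, invert = TRUE)
rm(list = to_drop, envir = e)

ls(e)  # only the "paper" objects remain
```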

Wednesday, March 30, 2016

Miscellaneous Functions in R

which: returns the positions (indices) of the TRUE elements of a logical vector.
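A short sketch of which and its relatives which.max and which.min, on a made-up vector x:

```r
x <- c(3, 7, 1, 7, 5)

idx <- which(x == 7)   # positions where the condition is TRUE
idx

which.max(x)           # position of the first maximum
which.min(x)           # position of the first minimum
```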



Handling Missing Values in R

is.na(a)   # TRUE for every element of a that is missing


is.na(x1) <- which(x1 == 7)   # set every element of x1 that equals 7 to NA
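A small worked sketch of both forms, on a made-up vector x1 that already contains one NA:

```r
x1 <- c(2, 7, NA, 4, 7)

before <- is.na(x1)           # logical vector marking the missing elements

# which() drops the NA produced by the comparison, so only positions 2 and 5 are set.
is.na(x1) <- which(x1 == 7)

x1
n_missing <- sum(is.na(x1))   # number of missing values after recoding
avg <- mean(x1, na.rm = TRUE) # ignore NAs when averaging
```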

Recoding Data in R

# recode missing values
A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)
A[ is.na(A) ] <- 0



# recode all values less than 5 to the value 99
A[ A < 5 ] <- 99
# recode the value 7 to missing (NA)
is.na(x1) <- which(x1 == 7)

Subsetting A Dataframe

library(lattice)

a<-barley[4] # single index: returns a data frame with one column

a<-barley[,4] # row, column index: returns the column itself, not a data frame
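The same distinction can be checked on the built-in mtcars data, which needs no extra package. This is only an illustrative sketch; the variable names are made up.

```r
one_col_df <- mtcars[4]                  # data frame with one column (hp)
vec        <- mtcars[, 4]                # plain numeric vector
vec2       <- mtcars[[4]]                # same vector, list-style extraction
still_df   <- mtcars[, 4, drop = FALSE]  # keep the data frame structure

class(one_col_df)
class(vec)
```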

RF vs GB

There are two main reasons why you would use Random Forests over Gradient Boosted Decision Trees, and they are both pretty related:

  1. RF are much easier to tune than GBM
  2. RF are harder to overfit than GBM
Related to (1), RF basically has only one hyperparameter to set: the number of features to randomly select at each node. Even for that one there is a rule of thumb, namely to use the square root of the total number of features, which works pretty well in most cases[1]. GBMs, on the other hand, have several hyperparameters, including the number of trees, the depth (or number of leaves), and the shrinkage (or learning rate).

And, regarding (2), while it is not true that RF do not overfit (contrary to what many are led to believe by Breiman's strong assertions[2]), it is true that they are more robust to overfitting and require less tuning to avoid it.

In some sense, RF is a tree ensemble that is more "plug'n'play" than GBM. However, it is generally true that a well-tuned GBM can outperform an RF.

Also, as Tianqi Chen mentioned, RF has traditionally been easier to parallelize. However, that is no longer a good reason, given that there are now efficient ways to parallelize GBMs as well.

Both are ensemble learning methods and predict (regression or classification) by combining the outputs from individual trees. They differ in the order in which the trees are built and in the way the results are combined.

Random Forests train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree, and less likely to overfit the training data. There are typically two parameters in RF: the number of trees and the number of features to be selected at each node.

GBTs build trees one at a time, where each new tree helps to correct the errors made by the previously trained trees. With each tree added, the model becomes even more expressive. There are typically three parameters: the number of trees, the depth of the trees, and the learning rate. Each tree built is generally shallow.

GBDT training generally takes longer because the trees are built sequentially. However, benchmark results have shown that GBDTs are better learners than Random Forests.

An overview of the differences and some benchmark results in terms of error rate and training time are given in the link below:

Scatter Plot For Variables of A Dataframe

plot(df$var1, df$var2) # x variable first, then y variable

Assessing Model Accuracy

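MSE: mean squared error, the average of the squared differences between predicted and observed values (regression)
Error Rate: the proportion of observations that are misclassified (classification)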
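Both measures are one-liners in R. The sketch below uses small made-up vectors (y, y_hat, truth, pred) purely to show the arithmetic.

```r
# Regression: mean squared error
y     <- c(3, 5, 2, 8)   # observed values
y_hat <- c(2, 5, 4, 7)   # predicted values
mse <- mean((y - y_hat)^2)         # (1 + 0 + 4 + 1) / 4

# Classification: error rate
truth <- c("yes", "no", "yes", "no", "yes")
pred  <- c("yes", "yes", "yes", "no", "no")
error_rate <- mean(pred != truth)  # 2 wrong out of 5

mse
error_rate
```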

Classification

The most widely-used classifiers: logistic regression, linear discriminant analysis, and K-nearest neighbors.

More computer-intensive methods: generalized additive models, trees, random forests, boosting, and support vector machines.


Machine Learning Terminology

Classifier: a classification technique
Response Variable (Y): can be quantitative or qualitative
Quantitative: numerical
Qualitative: categorical

Tuesday, March 29, 2016

Random Forests In A Nutshell

Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). Breiman and Cutler's random forest approach is implemented via the randomForest package.

Here is an example.
# Random Forest prediction of Kyphosis data
library(randomForest)
library(rpart) # provides the kyphosis data set
fit <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
print(fit) # view results 
importance(fit) # importance of each predictor

Regression Model Strategy

Diagnostics
Transformation
Variable Selection
Diagnostics

Plotting A Mathematical Expression


plotcurve <-
  function(equation = "y = sqrt(1/(1+x^2))", ...){
    leftright <- strsplit(equation, split = "=")[[1]]
    left <- leftright[1] # The part to the left of the "="
    right <- leftright[2] # The part to the right of the "="
    expr <- parse(text=right)
    xname <- all.vars(expr)
    if(length(xname) > 1)stop(paste("There are multiple variables, i.e.",paste(xname,
                                                                               collapse=" & "),
                                    "on the right of the equation"))
    if(length(list(...))==0)assign(xname, 1:10)
    else {
      nam <- names(list(...))
      if(nam!=xname)stop("Clash of variable names")
      assign("x", list(...)[[1]])
      assign(xname, x)
    }
    y <- eval(expr)
    yexpr <- parse(text=left)[[1]]
    xexpr <- parse(text=xname)[[1]]
    plot(x, y, ylab = yexpr, xlab = xexpr, type="n")
    lines(spline(x,y))
    mainexpr <- parse(text=paste(left, "==", right))
    title(main = mainexpr)
  }

plotcurve()
plotcurve("ang=asin(sqrt(p))", p=(1:49)/50)

Searching R Functions For A Specified Token

# named mygrep so that it does not mask base::grep, which is used inside
mygrep<-function(str){
  tempobj<-ls(envir=sys.frame(-1))
  objstring<-character(0)
  for(i in tempobj) {
    myfunc<-get(i)
    if(is.function(myfunc))
      if(length(grep(str,deparse(myfunc))))
        objstring<-c(objstring,i)
  }
  return(objstring)
}

mygrep("for")

Saturday, March 26, 2016

Applying The Same Operation On Multiple Data Frames In R

Solution #1:

a<-c(1,2,3,4)
b<-c(5,6,7,8)
pos1<-data.frame(cbind(a,b))
pos2<-pos1+1
pos3<-pos1+2

var1<-paste0('pos',1:3)
exps<-paste0(var1,'$x<-',var1,'$a+10')

for(exp in exps){
  eval(parse(text=exp))
}

Solution #2: Use lapply

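x<-list(pos1,pos2,pos3)

# the function must return the modified data frame; the assignment alone only changes a local copy
x<-lapply(x, function(d) { d$x<-d$a+10; d })

# correlation between the second and third columns of each data frame
lapply(x,function(d)cor(d[,2],d[,3]))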


http://stackoverflow.com/questions/16115745/applying-a-set-of-operations-across-several-data-frames-in-r

http://stackoverflow.com/questions/19249303/applying-lapply-on-multiple-data-frames-in-a-list-r

https://www.datacamp.com/community/tutorials/r-tutorial-apply-family



Thursday, March 24, 2016

Commands To Inspect Data Structure In R

str(a)
class(a)
typeof(a)
length(a)
attributes(a)
names(a)
dim(a)

is.character()
is.numeric()
is.double()
is.integer()
is.logical()
is.atomic()
is.function()
is.vector()
is.data.frame()
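As a quick illustration, here is what a few of these commands report for a small made-up data frame a (the column names are only for the example):

```r
a <- data.frame(id = 1:3, name = c("x", "y", "z"), stringsAsFactors = FALSE)

str(a)         # compact overview of columns and types
class(a)       # "data.frame"
typeof(a)      # "list": a data frame is stored as a list of columns
length(a)      # number of columns
dim(a)         # rows, columns
names(a)       # column names

is.data.frame(a)
is.character(a$name)
is.integer(a$id)
```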

Tuesday, March 22, 2016

Keyboard Shortcuts For R Studio

Keyboard Shortcuts

Console

Description | Windows & Linux | Mac
Move cursor to Console | Ctrl+2 | Ctrl+2
Clear console | Ctrl+L | Ctrl+L
Move cursor to beginning of line | Home | Command+Left
Move cursor to end of line | End | Command+Right
Navigate command history | Up/Down | Up/Down
Popup command history | Ctrl+Up | Command+Up
Interrupt currently executing command | Esc | Esc
Change working directory | Ctrl+Shift+H | Ctrl+Shift+H

Source

Description | Windows & Linux | Mac
Goto File/Function | Ctrl+. | Ctrl+.
Move cursor to Source Editor | Ctrl+1 | Ctrl+1
New document (except on Chrome/Windows) | Ctrl+Shift+N | Command+Shift+N
New document (Chrome only) | Ctrl+Alt+Shift+N | Command+Shift+Alt+N
Open document | Ctrl+O | Command+O
Save active document | Ctrl+S | Command+S
Close active document (except on Chrome) | Ctrl+W | Command+W
Close active document (Chrome only) | Ctrl+Alt+W | Command+Option+W
Close all open documents | Ctrl+Shift+W | Command+Shift+W
Preview HTML (Markdown and HTML) | Ctrl+Shift+K | Command+Shift+K
Knit Document (knitr) | Ctrl+Shift+K | Command+Shift+K
Compile Notebook | Ctrl+Shift+K | Command+Shift+K
Compile PDF (TeX and Sweave) | Ctrl+Shift+K | Command+Shift+K
Insert chunk (Sweave and Knitr) | Ctrl+Alt+I | Command+Option+I
Insert code section | Ctrl+Shift+R | Command+Shift+R
Run current line/selection | Ctrl+Enter | Command+Enter
Run current line/selection (retain cursor position) | Alt+Enter | Option+Enter
Re-run previous region | Ctrl+Shift+P | Command+Shift+P
Run current document | Ctrl+Alt+R | Command+Option+R
Run from document beginning to current line | Ctrl+Alt+B | Command+Option+B
Run from current line to document end | Ctrl+Alt+E | Command+Option+E
Run the current function definition | Ctrl+Alt+F | Command+Option+F
Run the current code section | Ctrl+Alt+T | Command+Option+T
Run previous Sweave/Rmd code | Ctrl+Alt+P | Command+Option+P
Run the current Sweave/Rmd chunk | Ctrl+Alt+C | Command+Option+C
Run the next Sweave/Rmd chunk | Ctrl+Alt+N | Command+Option+N
Source a file | Ctrl+Shift+O | Command+Shift+O
Source the current document | Ctrl+Shift+S | Command+Shift+S
Source the current document (with echo) | Ctrl+Shift+Enter | Command+Shift+Enter
Fold Selected | Alt+L | Cmd+Option+L
Unfold Selected | Shift+Alt+L | Cmd+Shift+Option+L
Fold All | Alt+O | Cmd+Option+O
Unfold All | Shift+Alt+O | Cmd+Shift+Option+O
Go to line | Shift+Alt+G | Cmd+Shift+Option+G
Jump to | Shift+Alt+J | Cmd+Shift+Option+J
Switch to tab | Ctrl+Shift+. | Ctrl+Shift+.
Previous tab | Ctrl+F11 | Ctrl+F11
Next tab | Ctrl+F12 | Ctrl+F12
First tab | Ctrl+Shift+F11 | Ctrl+Shift+F11
Last tab | Ctrl+Shift+F12 | Ctrl+Shift+F12
Navigate back | Ctrl+F9 | Cmd+F9
Navigate forward | Ctrl+F10 | Cmd+F10
Extract function from selection | Ctrl+Alt+X | Command+Option+X
Extract variable from selection | Ctrl+Alt+V | Command+Option+V
Reindent lines | Ctrl+I | Command+I
Comment/uncomment current line/selection | Ctrl+Shift+C | Command+Shift+C
Reflow Comment | Ctrl+Shift+/ | Command+Shift+/
Reformat Selection | Ctrl+Shift+A | Command+Shift+A
Show Diagnostics | Ctrl+Shift+Alt+P | Command+Shift+Alt+P
Transpose Letters |  | Ctrl+T
Move Lines Up/Down | Alt+Up/Down | Option+Up/Down
Copy Lines Up/Down | Shift+Alt+Up/Down | Command+Option+Up/Down
Jump to Matching Brace/Paren | Ctrl+P | Ctrl+P
Expand to Matching Brace/Paren | Ctrl+Shift+E | Ctrl+Shift+E
Select to Matching Brace/Paren | Ctrl+Shift+Alt+E | Ctrl+Shift+Alt+E
Add Cursor Above Current Cursor | Ctrl+Alt+Up | Ctrl+Alt+Up
Add Cursor Below Current Cursor | Ctrl+Alt+Down | Ctrl+Alt+Down
Move Active Cursor Up | Ctrl+Alt+Shift+Up | Ctrl+Alt+Shift+Up
Move Active Cursor Down | Ctrl+Alt+Shift+Down | Ctrl+Alt+Shift+Down
Find and Replace | Ctrl+F | Command+F
Find Next | Win: F3, Linux: Ctrl+G | Command+G
Find Previous | Win: Shift+F3, Linux: Ctrl+Shift+G | Command+Shift+G
Use Selection for Find | Ctrl+F3 | Command+E
Replace and Find | Ctrl+Shift+J | Command+Shift+J
Find in Files | Ctrl+Shift+F | Command+Shift+F
Check Spelling | F7 | F7

Editing (Console and Source)

Description | Windows & Linux | Mac
Undo | Ctrl+Z | Command+Z
Redo | Ctrl+Shift+Z | Command+Shift+Z
Cut | Ctrl+X | Command+X
Copy | Ctrl+C | Command+C
Paste | Ctrl+V | Command+V
Select All | Ctrl+A | Command+A
Jump to Word | Ctrl+Left/Right | Option+Left/Right
Jump to Start/End | Ctrl+Home/End or Ctrl+Up/Down | Command+Home/End or Command+Up/Down
Delete Line | Ctrl+D | Command+D
Select | Shift+[Arrow] | Shift+[Arrow]
Select Word | Ctrl+Shift+Left/Right | Option+Shift+Left/Right
Select to Line Start | Alt+Shift+Left | Command+Shift+Left
Select to Line End | Alt+Shift+Right | Command+Shift+Right
Select Page Up/Down | Shift+PageUp/PageDown | Shift+PageUp/Down
Select to Start/End | Ctrl+Shift+Home/End or Shift+Alt+Up/Down | Command+Shift+Up/Down
Delete Word Left | Ctrl+Backspace | Option+Backspace or Ctrl+Option+Backspace
Delete Word Right |  | Option+Delete
Delete to Line End |  | Ctrl+K
Delete to Line Start |  | Option+Backspace
Indent | Tab (at beginning of line) | Tab (at beginning of line)
Outdent | Shift+Tab | Shift+Tab
Yank line up to cursor | Ctrl+U | Ctrl+U
Yank line after cursor | Ctrl+K | Ctrl+K
Insert currently yanked text | Ctrl+Y | Ctrl+Y
Insert assignment operator | Alt+- | Option+-
Insert pipe operator | Ctrl+Shift+M | Cmd+Shift+M
Show help for function at cursor | F1 | F1
Show source code for function at cursor | F2 | F2
Find usages for symbol at cursor (C++) | Ctrl+Alt+U | Cmd+Option+U

Completions (Console and Source)

Description | Windows & Linux | Mac
Attempt completion | Tab or Ctrl+Space | Tab or Command+Space
Navigate candidates | Up/Down | Up/Down
Accept selected candidate | Enter, Tab, or Right | Enter, Tab, or Right
Dismiss completion popup | Esc | Esc

Views

Description | Windows & Linux | Mac
Move focus to Source Editor | Ctrl+1 | Ctrl+1
Move focus to Console | Ctrl+2 | Ctrl+2
Move focus to Help | Ctrl+3 | Ctrl+3
Show History | Ctrl+4 | Ctrl+4
Show Files | Ctrl+5 | Ctrl+5
Show Plots | Ctrl+6 | Ctrl+6
Show Packages | Ctrl+7 | Ctrl+7
Show Environment | Ctrl+8 | Ctrl+8
Show Git/SVN | Ctrl+9 | Ctrl+9
Show Build | Ctrl+0 | Ctrl+0
Sync Editor & PDF Preview | Ctrl+F8 | Cmd+F8
Show Keyboard Shortcut Reference | Alt+Shift+K | Option+Shift+K

Build

Description | Windows & Linux | Mac
Build and Reload | Ctrl+Shift+B | Cmd+Shift+B
Load All (devtools) | Ctrl+Shift+L | Cmd+Shift+L
Test Package (Desktop) | Ctrl+Shift+T | Cmd+Shift+T
Test Package (Web) | Ctrl+Alt+F7 | Cmd+Alt+F7
Check Package | Ctrl+Shift+E | Cmd+Shift+E
Document Package | Ctrl+Shift+D | Cmd+Shift+D

Debug

Description | Windows & Linux | Mac
Toggle Breakpoint | Shift+F9 | Shift+F9
Execute Next Line | F10 | F10
Step Into Function | Shift+F4 | Shift+F4
Finish Function/Loop | Shift+F6 | Shift+F6
Continue | Shift+F5 | Shift+F5
Stop Debugging | Shift+F8 | Shift+F8

Plots

Description | Windows & Linux | Mac
Previous plot | Ctrl+Alt+F11 | Command+Option+F11
Next plot | Ctrl+Alt+F12 | Command+Option+F12

Git/SVN

Description | Windows & Linux | Mac
Diff active source document | Ctrl+Alt+D | Ctrl+Option+D
Commit changes | Ctrl+Alt+M | Ctrl+Option+M
Scroll diff view | Ctrl+Up/Down | Ctrl+Up/Down
Stage/Unstage (Git) | Spacebar | Spacebar
Stage/Unstage and move to next (Git) | Enter | Enter

Session

Description | Windows & Linux | Mac
Quit Session (desktop only) | Ctrl+Q | Command+Q
Restart R Session | Ctrl+Shift+F10 | Command+Shift+F10

Friday, March 18, 2016

Keyboard Shortcut For Stopping Running Code In RStudio

Move the cursor to the Console window and press the Esc key.

Thursday, March 17, 2016

Sorting in R



Sorting one column:

all[order(all$trivago_id),]

Sorting 2 columns:

all[order(all$trivago_id, all$avg_pos),] 
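Since the data frame above is not available here, this is a sketch of the same idea on a small made-up data frame d with columns id and pos. It also shows a descending sort, which is an addition to the note above.

```r
d <- data.frame(id = c(2, 1, 2, 1), pos = c(3, 4, 1, 2))

by_one  <- d[order(d$id), ]          # sort by id only; ties keep their original order
by_two  <- d[order(d$id, d$pos), ]   # sort by id, then by pos within id
by_desc <- d[order(-d$id, d$pos), ]  # id descending, pos ascending

by_two
```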

Count By Group In R

library(plyr)

count(ti_pos1, 'cpc_cnt_7d') # one row per distinct value, with its frequency in column freq
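If plyr is not installed, base R can produce a similar frequency table. The sketch below uses a made-up data frame d with a grouping column grp; it is an alternative, not what the plyr call does internally.

```r
d <- data.frame(grp = c("a", "b", "a", "c", "a", "b"))

# table() counts the occurrences; as.data.frame() turns the counts into rows.
counts <- as.data.frame(table(d$grp), stringsAsFactors = FALSE)
names(counts) <- c("grp", "freq")

counts
```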

Wednesday, March 16, 2016

What is the difference between NaN and Inf, and NULL and NA in R?

http://www.quantlego.com/howto/special-missing-values-in-r/
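For a quick feel of the four special values without following the link, here is a small base R sketch. It only shows behaviour that can be checked at the console.

```r
# NA: a missing value that still occupies a position
length(NA)             # 1
is.na(NA)              # TRUE

# NULL: the absence of an object; it occupies no position
length(NULL)           # 0
is.null(NULL)          # TRUE
length(c(1, NULL, 3))  # 2, the NULL is dropped
length(c(1, NA, 3))    # 3, the NA is kept

# NaN: "not a number", the result of an undefined numeric operation
nan_val <- 0 / 0
is.nan(nan_val)        # TRUE
is.na(nan_val)         # TRUE, NaN also counts as missing
is.nan(NA)             # FALSE, NA is not NaN

# Inf: infinity, the result of overflow or division by zero
inf_val <- 1 / 0
is.infinite(inf_val)   # TRUE
is.finite(inf_val)     # FALSE
```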

Tuesday, March 15, 2016

Regression

http://scc.stat.ucla.edu/page_attachments/0000/0139/reg_1.pdf

Convert data.frame columns from factors to characters

If you want to convert your existing data frame without changing the global stringsAsFactors option, you can recreate it with an lapply statement:
bob <- data.frame(lapply(bob, as.character), stringsAsFactors=FALSE)


To replace only factors:
i <- sapply(bob, is.factor)
bob[i] <- lapply(bob[i], as.character)
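A self-contained sketch of the "replace only factors" variant. The data frame bob is made up here, and stringsAsFactors = TRUE is set explicitly because recent R versions no longer create factors by default.

```r
bob <- data.frame(name = c("a", "b"), score = c(1.5, 2),
                  stringsAsFactors = TRUE)
before <- sapply(bob, class)   # name is a factor, score is numeric

i <- sapply(bob, is.factor)    # logical index of the factor columns
bob[i] <- lapply(bob[i], as.character)

after <- sapply(bob, class)    # name is now character, score is unchanged
after
```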

Monday, March 14, 2016

How to Count Unique/Distinct Values of A Variable in R


base R:


length(unique(mydata$colA))





Sunday, March 13, 2016

Summarize data by group in R


tapply(a$avg_cpc_1d, a$avg_pos, mean)
aggregate(avg_cpc_1d~avg_pos, a, mean)



library(dplyr) # provides group_by, summarise, n and %>%
clk_summ<-data.frame(group_by(clk2, OND)%>%summarise(cnt=n()))

a<-tapply(gp$txn,gp$product_ln_name,sum)

I have an R data frame like this:
        age group
1   23.0883     1
2   25.8344     1
3   29.4648     1
4   32.7858     2
5   33.6372     1
6   34.9350     1
7   35.2115     2
8   35.2115     2
9   35.2115     2
10  36.7803     1
...
I need to get a data frame in the following form:
group mean     sd
1     34.5     5.6
2     32.3     4.2
...
Group number may vary, but their names and quantity could be obtained by calling levels(factor(data$group))
What manipulations should be done with the data to get the result?
Here is the plyr one-line variant using ddply:
library(plyr)
dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,replace=TRUE))
ddply(dt,~group,summarise,mean=mean(age),sd=sd(age))
Here is another one-line variant using the data.table package:
library(data.table)
dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,replace=TRUE)))
dt <- data.table(dtf)
dt[,list(mean=mean(age),sd=sd(age)),by=group]
This one is faster, though the difference is noticeable only on a table with 100k rows. Timings on my MacBook Pro with a 2.53 GHz Core 2 Duo processor and R 2.11.1:
> system.time(aa <- ddply(dtf,~group,summarise,mean=mean(age),sd=sd(age)))
   user  system elapsed 
  0.513   0.180   0.692 
> system.time(aa <- dt[,list(mean=mean(age),sd=sd(age)),by=group])
   user  system elapsed 
  0.087   0.018   0.103 
Further savings are possible if we use setkey:
> setkey(dt,group)
> system.time(dt[,list(mean=mean(age),sd=sd(age)),by=group])
   user  system elapsed 
  0.040   0.007   0.048 
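For completeness, the same group/mean/sd table can be sketched in base R without any package. This is an alternative to the plyr and data.table answers, shown on a tiny made-up data frame dat so the numbers can be checked by hand.

```r
dat <- data.frame(age   = c(22, 25, 28, 32, 34, 36),
                  group = c(1, 1, 1, 2, 2, 2))

# One aggregate() call per statistic, then join the two results on group.
means <- aggregate(age ~ group, data = dat, FUN = mean)
sds   <- aggregate(age ~ group, data = dat, FUN = sd)
res   <- merge(means, sds, by = "group", suffixes = c("_mean", "_sd"))

res
```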