Sunday, March 13, 2016

Summarize data by group in R


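# Mean of avg_cpc_1d for each value of avg_pos
tapply(a$avg_cpc_1d, a$avg_pos, mean)
# Same summary through the formula interface; returns a data frame
aggregate(avg_cpc_1d ~ avg_pos, data = a, FUN = mean)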



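library(dplyr)  # provides group_by(), summarise(), n() and %>%
clk_summ <- data.frame(group_by(clk2, OND) %>% summarise(cnt = n()))

# Total txn per product line
a <- tapply(gp$txn, gp$product_ln_name, sum)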
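
The data frames a, clk2, and gp in the notes above are not defined in this post. As a minimal sketch, assuming a made-up stand-in for a with the same column names (the values are invented purely for illustration), the base R calls behave like this:

```r
# Hypothetical stand-in for the data frame `a`; the values are made up
a <- data.frame(
  avg_pos    = c(1, 1, 2, 2, 3),
  avg_cpc_1d = c(0.50, 0.70, 0.40, 0.20, 0.10)
)

# Group means as an array named by group
by_pos_vec <- tapply(a$avg_cpc_1d, a$avg_pos, mean)

# Group means as a data frame with columns avg_pos and avg_cpc_1d
by_pos_df <- aggregate(avg_cpc_1d ~ avg_pos, data = a, FUN = mean)

# Row counts per group: a base R counterpart of summarise(cnt = n())
cnt <- as.data.frame(table(avg_pos = a$avg_pos), responseName = "cnt")

print(by_pos_vec)
print(by_pos_df)
print(cnt)
```

tapply is convenient for a quick look, while aggregate and the count table return data frames that are easier to merge or plot afterwards.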

I have an R data frame like this:
        age group
1   23.0883     1
2   25.8344     1
3   29.4648     1
4   32.7858     2
5   33.6372     1
6   34.9350     1
7   35.2115     2
8   35.2115     2
9   35.2115     2
10  36.7803     1
...
I need to get a data frame in the following form:
group mean     sd
1     34.5     5.6
2     32.3     4.2
...
The number of groups may vary, but their names and count can be obtained by calling levels(factor(data$group)).
What manipulations should be done with the data to get the result?

  
Do the commas in the result data frame mean something special, or are they just the decimal point? – mpiktas Mar 13 '11 at 12:46
I suspected that. All of Europe uses the comma except the British. – mpiktas Mar 13 '11 at 13:04
Despite not being British, I prefer the dot as the decimal separator. – Roman Luštrik Mar 14 '11 at 12:17
The British also drive on the wrong side. – RockScience Mar 15 '11 at 11:23
@RockScience: We drive on the right side - the left side. Everyone else is on the wrong right side! (And we wonder why people struggle with our language!) – Mark K Cowan Sep 16 '13 at 16:30

9 Answers


Accepted answer (101 votes)
Here is a one-line variant using ddply from the plyr package:
library(plyr)
dt <- data.frame(age = rchisq(20, 10), group = sample(1:2, 20, replace = TRUE))
ddply(dt, ~group, summarise, mean = mean(age), sd = sd(age))
Here is another one-line variant using the newer data.table package:
library(data.table)
dtf <- data.frame(age = rchisq(100000, 10), group = factor(sample(1:10, 100000, replace = TRUE)))
dt <- data.table(dtf)
dt[, list(mean = mean(age), sd = sd(age)), by = group]
This one is faster, though the difference is noticeable only on a table with 100k rows. Timings on my MacBook Pro with a 2.53 GHz Core 2 Duo processor and R 2.11.1:
> system.time(aa <- ddply(dtf,~group,summarise,mean=mean(age),sd=sd(age)))
   user  system elapsed 
  0.513   0.180   0.692 
> system.time(aa <- dt[,list(mean=mean(age),sd=sd(age)),by=group])
   user  system elapsed 
  0.087   0.018   0.103 
Further savings are possible if we use setkey:
> setkey(dt,group)
> system.time(dt[,list(mean=mean(age),sd=sd(age)),by=group])
   user  system elapsed 
  0.040   0.007   0.048 
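
For comparison, the table requested in the question can also be built with base R alone. The sketch below is not part of the original answer. It types in only the ten rows shown in the question (the question's data continues past them, so the results will not match the 34.5 and 32.3 shown there), and the names dat and res are placeholders chosen for illustration.

```r
# The ten rows visible in the question; the real data set is longer
dat <- data.frame(
  age   = c(23.0883, 25.8344, 29.4648, 32.7858, 33.6372,
            34.9350, 35.2115, 35.2115, 35.2115, 36.7803),
  group = c(1, 1, 1, 2, 1, 1, 2, 2, 2, 1)
)

# tapply returns one value per group, ordered by sorted group value,
# so the three columns below line up row by row
res <- data.frame(
  group = sort(unique(dat$group)),
  mean  = as.vector(tapply(dat$age, dat$group, mean)),
  sd    = as.vector(tapply(dat$age, dat$group, sd))
)

print(res)
```

This avoids any package dependency, at the cost of one pass over the data per statistic; for large tables the data.table timings above suggest that approach is the better choice.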
Thanks for the update and the benchmark info! – chl Mar 15 '11 at 10:24
@chl, it gave me a chance to try out this new data.table package. It looks really promising. – mpiktas Mar 15 '11 at 12:54
+6000 for data.table. It really is so much faster than ddply, even for me on datasets smaller than 100k (I have one with just 20k rows). Must be something to do with the functions I am applying, but ddply will take minutes and data.table a few seconds. – atomicules Sep 22 '11 at 15:22
  
Simple typo: I think you meant dt <- data.table(dtf) instead of dt <- data.table(dt) in the second code block. That way, you are creating the data table from a data frame instead of from the dt function from the stats package. I tried editing it, but I cannot do edits under six characters. – Christopher Bottoms Oct 24 '14 at 18:50
  
Thanks, fixed it. – mpiktas Oct 25 '14 at 10:16
