Data Science Guy

Question

I have an R data frame like this:

        age group
1   23.0883     1
2   25.8344     1
3   29.4648     1
4   32.7858     2
5   33.6372     1
6   34.9350     1
7   35.2115     2
8   35.2115     2
9   35.2115     2
10  36.7803     1
...

I need to get a data frame in the following form:

group mean     sd
1     34.5     5.6
2     32.3     4.2
...

The number of groups may vary, but their names and count can be obtained by calling levels(factor(data$group)).

What manipulations should be done with the data to get the result?

Do the commas in the result data frame mean something special, or are they just the decimal point? — mpiktas, Mar 13 '11 at 12:46
I suspected that. All of Europe uses the comma except the British. — mpiktas, Mar 13 '11 at 13:04
Despite not being British, I prefer dot for decimal separator. — Roman Luštrik, Mar 14 '11 at 12:17
@RockScience: We drive on the right side - the left side. Everyone else is on the wrong right side! (And we wonder why people struggle with our language!) — Mark K Cowan, Sep 16 '13 at 16:30

mpiktas · Accepted Answer · 2014-10-25 10:16:17Z

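Here is a one-line variant using ddply from the plyr package:

library(plyr)
# simulated example data: 20 ages, randomly assigned to group 1 or 2
dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,replace=TRUE))
ddply(dt,~group,summarise,mean=mean(age),sd=sd(age))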

Here is another one-line variant, using the data.table package:

library(data.table)
# larger simulated data set: 100k rows in 10 groups
dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,replace=TRUE)))
dt <- data.table(dtf)
dt[,list(mean=mean(age),sd=sd(age)),by=group]

This one is faster, though the difference is noticeable only on a table with 100k rows. Timings on my MacBook Pro with a 2.53 GHz Core 2 Duo processor and R 2.11.1:

> system.time(aa <- ddply(dtf,~group,summarise,mean=mean(age),sd=sd(age)))
   user  system elapsed 
  0.513   0.180   0.692 
> system.time(aa <- dt[,list(mean=mean(age),sd=sd(age)),by=group])
   user  system elapsed 
  0.087   0.018   0.103 

Further savings are possible if we use setkey:

> setkey(dt,group)
> system.time(dt[,list(mean=mean(age),sd=sd(age)),by=group])
   user  system elapsed 
  0.040   0.007   0.048 
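
For comparison, the same kind of summary table can be sketched in base R without any extra packages. This is only an illustration, not one of the timed variants above: it assumes the ten rows shown in the question as sample data, and the names `means`, `sds` and `result` are made up for the example.

```r
# Sample data: the ten rows shown in the question
data <- data.frame(
  age = c(23.0883, 25.8344, 29.4648, 32.7858, 33.6372,
          34.9350, 35.2115, 35.2115, 35.2115, 36.7803),
  group = c(1, 1, 1, 2, 1, 1, 2, 2, 2, 1)
)

# aggregate() returns one row per group, sorted by group;
# each summary statistic is computed in its own call
means <- aggregate(age ~ group, data = data, FUN = mean)
sds   <- aggregate(age ~ group, data = data, FUN = sd)

# Combine the two summaries into the requested layout: group, mean, sd
result <- data.frame(group = means$group, mean = means$age, sd = sds$age)
print(result)
```

Both aggregate() calls use the same formula and data, so their rows come back in the same group order and can be combined column by column. On large tables this is likely to be slower than the data.table variant.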

@chl, it gave me a chance to try out this new data.table package. It looks really promising. — mpiktas, Mar 15 '11 at 12:54
+6000 for data.table. It really is so much faster than ddply, even for me on datasets smaller than 100k (I have one with just 20k rows). Must be something to do with the functions I am applying, but ddply will take minutes and data.table a few seconds. — atomicules, Sep 22 '11 at 15:22
Simple typo: I think you meant dt <- data.table(dtf) instead of dt <- data.table(dt) in the second code block. That way, you are creating the data table from a data frame instead of from the dt function from the stats package. I tried editing it, but I cannot do edits under six characters. — Christopher Bottoms, Oct 24 '14 at 18:50

Sunday, March 13, 2016

Summarize data by group in R

closed as off-topic by gung, user777, kjetil b halvorsen, John, Peter Flom♦ Sep 11 '15 at 23:23

