aggregate vs tapply; is there a middle ground?


aggregate vs tapply; is there a middle ground?

Hans Gardfjell
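I faced a similar problem. Here's what I did:

tmp <- data.frame(A = sample(LETTERS[1:5], 10, replace = TRUE),
                  B = sample(letters[1:5], 10, replace = TRUE),
                  C = rnorm(10))
# Sum C within each observed (A, B) pair; aggregate() drops empty subsets
tmp1 <- with(tmp, aggregate(C, list(A = A, B = B), sum))
# Build every combination of the observed A and B values
tmp2 <- expand.grid(A = sort(unique(tmp$A)), B = sort(unique(tmp$B)))
# Left join, so combinations with no data show up with NA
merge(tmp2, tmp1, all.x = TRUE)

At least it is fewer than 10 extra lines of code. Anyone with a simpler solution?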

Cheers, Hans


lebouton wrote:

>
>Dear all,
>
>I want to do a series of comparisons among 4 categorical variables:
>
>a <- aggregate(y, list(var1, var2, var3, var4), sum)
>
>This gets me a very nice 2-dimensional data frame with one column per
>variable, BUT, as help for aggregate says, <<empty subsets are
>removed>>.  I don't see in help(aggregate) how I can change this.
>
>In contrast,
>a <- tapply(y, list(var1, var2, var3, var4), sum)
>
>gives me results for everything including empty subsets, but in an
>awkward 4-dimensional array that takes me another 10 lines of
>inefficient code to turn into a 2D data.frame.
>
>Is there a way to directly do this calculation INCLUDING results for
>empty subsets, and still obtain a 2D array, matrix, or data.frame?  OR
>alternatively is there a simple way to mush the 4D result from the
>tapply into a 2D matrix/data.frame?
>
>thanks very much in advance for any help!
>
>-jlb
>
>--
>************************************
>Joseph P. LeBouton
>Forest Ecology PhD Candidate
>Department of Forestry
>Michigan State University
>East Lansing, Michigan 48824
>
>Office phone: 517-355-7744
>email: lebouton at msu.edu


--

*********************************
Hans Gardfjell
Ecology and Environmental Science
Umeå University
90187 Umeå, Sweden
email: [hidden email]
phone:  +46 907865267
mobile: +46 705984464

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: aggregate vs tapply; is there a middle ground?

hadley wickham
> I faced a similar problem. Here's what I did:
>
> tmp <- data.frame(A = sample(LETTERS[1:5], 10, replace = TRUE),
>                   B = sample(letters[1:5], 10, replace = TRUE),
>                   C = rnorm(10))
> tmp1 <- with(tmp, aggregate(C, list(A = A, B = B), sum))
> tmp2 <- expand.grid(A = sort(unique(tmp$A)), B = sort(unique(tmp$B)))
> merge(tmp2, tmp1, all.x = TRUE)
>
> At least it is fewer than 10 extra lines of code. Anyone with a simpler solution?

Well, you can almost do this with the reshape package:

library(reshape)
tmp <- data.frame(A = sample(LETTERS[1:5], 10, replace = TRUE),
                  B = sample(letters[1:5], 10, replace = TRUE),
                  C = rnorm(10))
a <- recast(tmp, A + B ~ ., sum)
# see also recast(tmp, A ~ B, sum)
add.all.combinations(a, row="A", cols = "B")

Where add.all.combinations basically does what you outlined above --
it would be easy enough to generalise to multiple dimensions.
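
A rough base-R sketch of what such a generalisation might look like, assuming the grouping columns are passed as a character vector. The helper name sum_all_combinations and its arguments are made up for illustration; they are not part of reshape:

```r
# Hypothetical helper (not from reshape): aggregate `value` over the
# grouping columns named in `by`, then add rows for the combinations
# that never occur in the data.
sum_all_combinations <- function(data, by, value, fun = sum) {
  # Aggregate over observed combinations only (empty subsets are dropped)
  agg <- aggregate(data[value], data[by], fun)
  # Every combination of the observed values of each grouping column
  grid <- expand.grid(lapply(data[by], function(x) sort(unique(x))),
                      stringsAsFactors = FALSE)
  # Left join: unobserved combinations get NA in the value column
  merge(grid, agg, by = by, all.x = TRUE)
}

set.seed(1)  # arbitrary seed, only so the example is reproducible
tmp <- data.frame(A = sample(LETTERS[1:5], 10, replace = TRUE),
                  B = sample(letters[1:5], 10, replace = TRUE),
                  C = rnorm(10))
res <- sum_all_combinations(tmp, by = c("A", "B"), value = "C")
```

Passing more names in `by` (for example four grouping columns) should work the same way, since both the grid and the join are built from whatever columns are listed.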

Hadley


Re: aggregate vs tapply; is there a middle ground?

Peter Dalgaard
hadley wickham <[hidden email]> writes:

> > I faced a similar problem. Here's what I did:
> >
> > tmp <- data.frame(A = sample(LETTERS[1:5], 10, replace = TRUE),
> >                   B = sample(letters[1:5], 10, replace = TRUE),
> >                   C = rnorm(10))
> > tmp1 <- with(tmp, aggregate(C, list(A = A, B = B), sum))
> > tmp2 <- expand.grid(A = sort(unique(tmp$A)), B = sort(unique(tmp$B)))
> > merge(tmp2, tmp1, all.x = TRUE)
> >
> > At least it is fewer than 10 extra lines of code. Anyone with a simpler solution?
>
> Well, you can almost do this with the reshape package:
>
> library(reshape)
> tmp <- data.frame(A = sample(LETTERS[1:5], 10, replace = TRUE),
>                   B = sample(letters[1:5], 10, replace = TRUE),
>                   C = rnorm(10))
> a <- recast(tmp, A + B ~ ., sum)
> # see also recast(tmp, A ~ B, sum)
> add.all.combinations(a, row="A", cols = "B")
>
> Where add.all.combinations basically does what you outlined above --
> it would be easy enough to generalise to multiple dimensions.

Anything wrong with

> as.data.frame(with(tmp,as.table(tapply(C,list(A=A,B=B),sum))))
   A B       Freq
1  A a         NA
2  B a -0.2524320
3  C a  3.8539264
4  D a         NA
5  A c  0.7227294
6  B c -0.2694669
7  C c  0.4760957
8  D c         NA
9  A e         NA
10 B e  0.1800500
11 C e         NA
12 D e -1.0350928

(except for the silly column name; passing responseName="sum" to
as.data.frame should fix that).
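
Spelled out with the column name fixed, and carried over to four grouping variables as in the original question, this might look like the sketch below. The data frame d and its values are invented for illustration; only the variable names var1..var4 and y come from the question:

```r
set.seed(42)  # arbitrary seed, only so the example is reproducible
n <- 20
# Made-up example data with four categorical variables and a response
d <- data.frame(var1 = sample(c("a", "b"), n, replace = TRUE),
                var2 = sample(c("c", "d"), n, replace = TRUE),
                var3 = sample(c("e", "f"), n, replace = TRUE),
                var4 = sample(c("g", "h"), n, replace = TRUE),
                y    = rnorm(n))

# tapply() gives a 4-D array with NA in the cells that have no data
arr <- with(d, tapply(y, list(var1 = var1, var2 = var2,
                              var3 = var3, var4 = var4), sum))

# as.table() + as.data.frame() flattens it to one row per cell;
# responseName replaces the default "Freq" column name
res <- as.data.frame(as.table(arr), responseName = "sum")
```

The result has one column per grouping variable plus the sum column, and keeps the empty combinations as NA rows.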

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])                  FAX: (+45) 35327907


Re: aggregate vs tapply; is there a middle ground?

Hans Gardfjell
Thanks Peter!

I had a "feeling" that there must be a simpler, better, more elegant
solution.

/Hans



--

*********************************
Hans Gardfjell
Ecology and Environmental Science
Umeå University
90187 Umeå, Sweden
email: [hidden email]
phone:  +46 907865267
mobile: +46 705984464
