

Hi,
I read about the by() function, but it does not seem to do the job I
need. Here is the problem:
Say I have a data frame with three columns. The first one contains
strings that describe the data points, with repeats (for example, days
of the week). The other two contain numbers. Something like this:
Day val1 val2
Tue 1 2
Tue 2 8
Tue 3 5
Wed 1 2
Wed 1 8
etc.
Now I would like to have a data frame with averages for each day:
Day val1 val2
Tue 2 5
Wed 1 5
etc.
I know I can do tapply(DF$val2, DF$Day, mean) to get the means for
val2. But I would like to have a data frame as the result (as in reality I
have many more columns).
Further question: where can I find a good, advanced introduction to R
data types? R's help() function just kills my brain, and the tutorials
are very limited.
My kind regards,
On Wed, 14 Dec 2005, January Weiner wrote:
> Hi,
>
> I read about the by() function, but it does not seem to do the job I
> need. Here is the problem:
by() will work, you just need to use the right function in it.
You want
by(df[,-1], df$Day, function.that.means.each.column)
so all you need to do is write function.that.means.each.column()
In this case there is a builtin function, colMeans, so you don't even
have to write it.
More generally (eg the approach would work for medians as well)
by(df[,-1], df$Day, function(today) apply(today, 2, mean))
Finally, you could just use aggregate().
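For reference, a minimal sketch of the aggregate() route on the example data from the question. The name `means` is only illustrative, and the call assumes the grouping column is the first column of the data frame.

```r
# Example data from the question
df <- data.frame(
  Day  = c("Tue", "Tue", "Tue", "Wed", "Wed"),
  val1 = c(1, 2, 3, 1, 1),
  val2 = c(2, 8, 5, 2, 8)
)

# Mean of every non-grouping column, one row per Day.
# The result is an ordinary data frame with a Day column.
means <- aggregate(df[, -1], by = list(Day = df$Day), FUN = mean)
print(means)
```

Swapping `mean` for `median` (or any other summary function) works the same way.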
thomas
Hello again,
On 12/14/05, Thomas Lumley wrote:
> You want
>
> by(df[,-1], df$Day, function.that.means.each.column)
OK, slowly :) I don't understand it.
Why df[,-1] and not df? Don't we lose the df$Day entries?
(by the way, why does typeof(df) show "list"? I thought that
read.table() returns a data frame?)
> so all you need to do is write function.that.means.each.column()
> In this case there is a builtin function, colMeans, so you don't even
> have to write it.
Hmmmmm, I tried it and it did not work. That is, it works -- but not as
intended :).
Fake example:
> df <- data.frame(Day=c("Tue","Tue","Tue", "Wed", "Wed"), val1=seq(1,5), val2=3*seq(1,5))
> df
  Day val1 val2
1 Tue    1    3
2 Tue    2    6
3 Tue    3    9
4 Wed    4   12
5 Wed    5   15
> ddf <- by(df[,-1], df$Day, colMeans)
> ddf
df$Day: Tue
val1 val2
   2    6

df$Day: Wed
val1 val2
 4.5 13.5
> ddf$Day
NULL
> ddf$val1
NULL
In real data, instead of "days", I have around 6000 items, so I need
them to be in one column called "Days" (or whatever). OK. So correct
me if I understand wrongly what is happening here:
by() divides df in data frame subsets and applies a function
(colMeans) to each of them. The result of colMeans ... manual says
that colMeans returns the following:
A numeric or complex array of suitable size, or a vector if the
result is one-dimensional. The 'dimnames' (or 'names' for a
vector result) are taken from the original array.
...which doesn't tell me much. typeof(colMeans(...)) tells me
"double" but I think it lies. OK, let's assume it is a vector (should
be, I assume the result is one-dimensional, as I can hardly imagine a
multidimensional result).
So in the end I have a list with as many columns as I have days, and
in each column I have a vector with N named dimensions, where N is the
number of variables in the original data frame bar one. But what I
would like to have is a data frame with exactly the same column names,
and rows being just a summary. And no clue how to convert one in the
other :)
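A small sketch that inspects what colMeans() returns for one group, using the fake data frame from this message. The name `tue` is only illustrative. As far as I can tell, typeof() is not lying here: it reports the storage type of the elements ("double"), while the object itself is a plain named numeric vector.

```r
df <- data.frame(Day = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = seq(1, 5), val2 = 3 * seq(1, 5))

# Column means of the numeric columns for the "Tue" rows only
tue <- colMeans(df[df$Day == "Tue", -1])

print(typeof(tue))     # storage type of the elements
print(is.vector(tue))  # a plain vector, not a matrix or data frame
print(names(tue))      # names are taken from the column names
print(tue[["val1"]])   # individual means are accessed by name
```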
> More generally (eg the approach would work for medians as well)
>
> by(df[,-1], df$Day, function(today) apply(today, 2, mean))
Huh? Why is it df[,-1] now? I think I'm completely lost.
> Finally, you could just use aggregate().
Probably, yes. As soon as I figure out how to use it, that is :) (an
hour later: OK, I got it! yuppie!) However, what I really needed was
something like this:
ddf <- by(df[,-1], df$Day, function(z) { return(cor(z$val1,z$val2)) ; } )
(but I still don't know how to convert it to a friendly data frame...)
Thanks for the answers!
Hi,
On 12/15/05, Gabor Grothendieck wrote:
> You don't get them as a column but you get them as the
> component labels.
>
> by(df, df$Day, function(x) colMeans(x[,-1]))
>
> If you convert it to a data frame you get them as the rownames:
>
> do.call("rbind", by(df, df$Day, function(x) colMeans(x[,-1])))
Thanks! that helps a lot. But I still run into problems with this.
Sorry for bothering you with newbie questions, if my problems are
trivial, point me to a suitable guide (I did read the introductory
materials on R).
First: it works for colMeans, but it does not work for a function like this:
do.call("rbind", by(df, df$Day, function(x) cor(df$val1, df$val2)))
it says "Error in do.call(....) : second argument must be a list". I
do not understand this, as the second argument is "b" of the class
"by", as it was in the case of colMeans, so it did not change...?
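A possible explanation, offered as an assumption rather than something confirmed in this thread: when the function returns a single number per group, by() simplifies its result to an array instead of a list, and do.call() then refuses it. Note also that the function above uses df rather than its argument x, so every group would get the correlation of the whole data frame. The sketch below sidesteps both issues with split() and sapply(); the names `per_day`, `res` and the column name `r` are only illustrative.

```r
df <- data.frame(Day = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = seq(1, 5), val2 = 3 * seq(1, 5))

# split() gives a named list with one data frame per day;
# sapply() applies cor() to each piece and returns a named numeric vector.
per_day <- sapply(split(df, df$Day), function(x) cor(x$val1, x$val2))

# Turn the named vector into a data frame with an explicit Day column
res <- data.frame(Day = names(per_day), r = unname(per_day))
print(res)
```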
Second: in the case of colMeans (where it works) it returns a matrix, and
I have trouble getting it back to a data.frame, so that I can access
blah$Day. Instead, I have something like this:
> do.call("rbind",b)
    V2 V3 V4 V5       V7
Tue 19 15  2  0 1.538462
Wed  5  3  6  1 1.285714
...and I do not know how to access, for example, the values for "Tue",
except with [1,] -- which is somewhat problematic. For example, I
would like to display the 3 days for which V7 is highest. How can I
do that?
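One way this might be done, sketched on the small fake data frame from earlier in the thread rather than on the real V2...V7 data (which is not shown here). The names `m`, `res` and `top` are only illustrative, and val2 stands in for V7.

```r
df <- data.frame(Day = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = seq(1, 5), val2 = 3 * seq(1, 5))

# Matrix of per-day column means; the days end up as row names
m <- do.call(rbind, lapply(split(df[, -1], df$Day), colMeans))

# Rows of a matrix can be selected by name instead of by position
print(m["Tue", ])

# Move the row names into a real Day column to get a data frame
res <- data.frame(Day = rownames(m), m, row.names = NULL)

# The (up to) 3 days with the highest val2
top <- head(res[order(res$val2, decreasing = TRUE), ], 3)
print(top)
```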
> I think you want class(df) which shows its a data frame.
Oops. Sorry, I didn't guess it from the manual :)
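A short sketch of the difference, on a hypothetical two-row data frame: typeof() reports how the object is stored internally (a data frame is stored as a list of columns), while class() reports what kind of object it is.

```r
df <- data.frame(Day = c("Tue", "Wed"), val1 = c(1, 2))

print(typeof(df))         # internal storage: a list of columns
print(class(df))          # the class used for printing, indexing, etc.
print(is.data.frame(df))
print(is.list(df))        # both are true at the same time
```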
> aggregate(df[,-1], df[,1,drop = FALSE], mean)
But why is df[,1,drop=FALSE] a list? I don't get it...
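A sketch of what drop = FALSE changes, using the fake data frame from earlier in the thread. Without it, selecting one column returns a bare vector; with it, the result stays a one-column data frame, and a data frame is a list of columns, which is the form aggregate() expects for its grouping argument. The names `one_col`, `a` and `b` are only illustrative.

```r
df <- data.frame(Day = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = seq(1, 5), val2 = 3 * seq(1, 5))

one_col <- df[, 1, drop = FALSE]
print(is.data.frame(one_col))  # still a data frame ...
print(is.list(one_col))        # ... and therefore also a list
print(is.list(df[, 1]))        # without drop = FALSE: a plain vector

# Both forms of the grouping argument give the same table of means
a <- aggregate(df[, -1], df[, 1, drop = FALSE], mean)
b <- aggregate(df[, -1], list(Day = df$Day), mean)
print(a)
```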
> aggregate(df[,-1], list(Day = df$Day), mean)
Yeah, I figured out that one.
> Another alternative is to use summaryBy from the doBy package found
> at http://genetics.agrsci.dk/~sorenh/misc/ :
>
> library(doBy)
> summaryBy(cbind(var1, var2) ~ Day, data = df)
I think I am not confident enough with the basic data types in R, I
need to understand them before I go over to specialized packages :)
Again, thanks a lot,
One other point. The cor example could be done using tapply like
this:
tapply(rownames(df), df$Day, function(r) cor(df[r,"val1"], df[r, "val2"]))
