Hi,
I read about the by() function, but it does not seem to do the job I
need. Here is the problem:

Say I have a data frame with three columns. The first one contains
strings that describe the data points, with repeats (for example, days
of a week). The other two contain numbers. Something like this:

Day val1 val2
Tue 1    2
Tue 2    8
Tue 3    5
Wed 1    2
Wed 1    8
etc.

Now I would like to have a data frame with averages for each day:

Day val1 val2
Tue 2    5
Wed 1    5
etc.

I know I can do tapply(DF$val2, DF$Day, mean) to get the means for
val2. But I would like to have a data frame as the result (as in reality
I have many more columns).

Further question: where can I find a good, advanced introduction to R
data types? R's help() function just kills my brain, and the tutorials
are very limited.

My kind regards,

January Weiner

--
------------ January Weiner 3 ---------------------+---------------
Division of Bioinformatics, University of Muenster | Schloßplatz 4
(+49)(251)8321634                                  | D48149 Münster
http://www.uni-muenster.de/Biologie.Botanik/ebb/   | Germany

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
On Wed, 14 Dec 2005, January Weiner wrote:
> Hi,
>
> I read about the by() function, but it does not seem to do the job I
> need. Here is the problem:

by() will work, you just need to use the right function in it.

You want

   by(df[,-1], df$Day, function.that.means.each.column)

so all you need to do is write function.that.means.each.column()

In this case there is a built-in function, colMeans, so you don't even
have to write it.

More generally (eg the approach would work for medians as well)

   by(df[,1], df$Day, function(today) apply(today, 2, mean))

Finally, you could just use aggregate().

   -thomas

[hidden email]    University of Washington, Seattle
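A minimal sketch of the two by() calls suggested above, using the sample data from the original post. The second call assumes that df[,-1] (all columns except Day) was intended in the general form, since apply() needs the numeric columns rather than the Day column; the names means and medians are illustrative only.

```r
# Sample data copied from the original post.
df <- data.frame(
  Day  = c("Tue", "Tue", "Tue", "Wed", "Wed"),
  val1 = c(1, 2, 3, 1, 1),
  val2 = c(2, 8, 5, 2, 8)
)

# Built-in column means, computed separately for each day.
means <- by(df[, -1], df$Day, colMeans)

# Assumed intended general form: df[, -1], not df[, 1].
# Any per-column summary works here; median is used as the example.
medians <- by(df[, -1], df$Day, function(today) apply(today, 2, median))

means[["Tue"]]    # val1 = 2, val2 = 5
medians[["Wed"]]  # val1 = 1, val2 = 5
```

Each component of the result is a named numeric vector, one per day.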
Hello again,
On 12/14/05, Thomas Lumley <[hidden email]> wrote:
> You want
>
>    by(df[,-1], df$Day, function.that.means.each.column)

OK, slowly :-) I don't understand it.

- why df[,-1] and not df? don't we lose the df$Day entries?

(by the way, why does typeof(df) show "list"? I thought that
read.table() returns a data frame?)

> so all you need to do is write function.that.means.each.column()
> In this case there is a built-in function, colMeans, so you don't even
> have to write it.

Hmmmmm, I tried it and it did not work. That is, it works - but not as
intended :-).

Fake example:

    > df <- data.frame(Day=c("Tue","Tue","Tue", "Wed", "Wed"), val1=seq(1,5), val2=3*seq(1,5))
    > df
      Day val1 val2
    1 Tue    1    3
    2 Tue    2    6
    3 Tue    3    9
    4 Wed    4   12
    5 Wed    5   15
    > ddf <- by(df[,-1], df$Day, colMeans)
    > ddf
    df$Day: Tue
    val1 val2
       2    6
    ------------------------------------------------------------
    df$Day: Wed
    val1 val2
     4.5 13.5
    > ddf$Day
    NULL
    > ddf$val1
    NULL

In real data, instead of "days", I have around 6000 items, so I need
them to be in one column called "Days" (or whatever). OK. So correct
me if I understand wrongly what is happening here:

by() divides df into data frame subsets and applies a function
(colMeans) to each of them. The result of colMeans ... the manual says
that colMeans returns the following:

     A numeric or complex array of suitable size, or a vector if the
     result is one-dimensional.  The 'dimnames' (or 'names' for a
     vector result) are taken from the original array.

...which doesn't tell me much. typeof(colMeans(...)) tells me
"double" but I think it lies. OK, let's assume it is a vector (should
be, I assume the result is one-dimensional, as I can hardly imagine a
multidimensional result).

So in the end I have a list with as many components as I have days, and
in each component I have a vector with N named elements, where N is the
number of variables in the original data frame bar one. But what I
would like to have is a data frame with exactly the same column names,
and rows being just a summary. And no clue how to convert one into the
other :-)

> More generally (eg the approach would work for medians as well)
>
>    by(df[,1], df$Day, function(today) apply(today, 2, mean))

Huh? why is it df[,1] now? I think I'm completely lost.

> Finally, you could just use aggregate().

Probably, yes. As soon as I figure out how to use it, that is :-) (an
hour later: OK, I got it! yuppie!) However what I really needed was
something like this:

    ddf <- by(df[,-1], df$Day, function(z) { return(cor(z$val1,z$val2)) ; } )

(but I still don't know how to convert it to a friendly data frame...)

Thanks for the answers!

January
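A short sketch of the typeof() versus class() question raised above, using the fake example from this message. typeof() reports how an object is stored, while class() reports how R treats it, so "list" and "double" are accurate rather than misleading. The variable name cm is illustrative only.

```r
df <- data.frame(Day  = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = 1:5,
                 val2 = 3 * (1:5))

typeof(df)     # "list": a data frame is stored as a list of columns
class(df)      # "data.frame": the class that methods dispatch on

cm <- colMeans(df[, -1])
typeof(cm)     # "double": the storage type of the numbers
is.vector(cm)  # TRUE: a plain named numeric vector
names(cm)      # "val1" "val2": names taken from the columns
cm[["val2"]]   # 9: elements can be picked out by name
```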
On 12/15/05, January Weiner <[hidden email]> wrote:
> On 12/14/05, Thomas Lumley <[hidden email]> wrote:
> > You want
> >
> >    by(df[,-1], df$Day, function.that.means.each.column)
>
> OK, slowly :-) I don't understand it.
>
> - why df[,-1] and not df? don't we lose the df$Day entries?

You don't get them as a column but you get them as the
component labels.

   by(df, df$Day, function(x) colMeans(x[,-1]))

If you convert it to a data frame you get them as the rownames:

   do.call("rbind", by(df, df$Day, function(x) colMeans(x[,-1])))

> (by the way, why does typeof(df) show "list"? I thought that
> read.table() returns a data frame?)

I think you want class(df), which shows it's a data frame.

> > More generally (eg the approach would work for medians as well)
> >
> >    by(df[,1], df$Day, function(today) apply(today, 2, mean))
>
> Huh? why is it df[,1] now? I think I'm completely lost.

df[,1] and df$Day both refer to the same first column.

> > Finally, you could just use aggregate().
>
> Probably, yes. As soon as I figure out how to use it, that is :-)

   aggregate(df[,-1], df[,1,drop = FALSE], mean)

or

   aggregate(df[,-1], list(Day = df$Day), mean)

The second arg of aggregate must be a list, which is why we used
drop = FALSE in the first instance and an explicit list in the second.

Another alternative is to use summaryBy from the doBy package found
at http://genetics.agrsci.dk/~sorenh/misc/ :

   library(doBy)
   summaryBy(cbind(val1, val2) ~ Day, data = df)

> However what I really needed was something like this:
>
> ddf <- by(df[,-1], df$Day, function(z) { return(cor(z$val1,z$val2)) ; } )
>
> (but I still don't know how to convert it to a friendly data frame...)

   do.call("rbind", ddf)
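A small sketch of the aggregate() calls above on the fake example data, showing why drop = FALSE matters for the second argument. The name res is illustrative only, and the printed layout in the comments is what a recent version of R is expected to show.

```r
df <- data.frame(Day  = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = 1:5,
                 val2 = 3 * (1:5))

is.list(df[, 1])                # FALSE: simplified to a plain vector
is.list(df[, 1, drop = FALSE])  # TRUE: still a one-column data frame

# Group by Day and take the mean of every remaining column.
res <- aggregate(df[, -1], list(Day = df$Day), mean)
res
#   Day val1 val2
# 1 Tue  2.0  6.0
# 2 Wed  4.5 13.5

res$Day  # the grouping values are an ordinary column of the result
```

The result is already a data frame, so no further conversion is needed.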
Hi,
On 12/15/05, Gabor Grothendieck <[hidden email]> wrote:
> You don't get them as a column but you get them as the
> component labels.
>
>    by(df, df$Day, function(x) colMeans(x[,-1]))
>
> If you convert it to a data frame you get them as the rownames:
>
>    do.call("rbind", by(df, df$Day, function(x) colMeans(x[,-1])))

Thanks! That helps a lot. But I still run into problems with this.
Sorry for bothering you with newbie questions; if my problems are
trivial, point me to a suitable guide (I did read the introductory
materials on R).

First: it works for colMeans, but it does not work for a function like this:

    do.call("rbind", by(df, df$Day, function(x) cor(df$val1, df$val2)))

It says "Error in do.call(....) : second argument must be a list". I
do not understand this, as the second argument is "b" of the class
"by", as it was in the case of colMeans, so it did not change...?

Second: in the case of colMeans (where it works) it returns a matrix, and
I have trouble getting it back to a data.frame, so I can access
blah$Day. Instead, I have something like this:

    > do.call("rbind",b)
        V2 V3 V4 V5       V7
    Tue 19 15  2  0 1.538462
    Wed  5  3  6  1 1.285714

...and I do not know how to access, for example, values for "Tue",
except with [1,] -- which is somewhat problematic. For example, I
would like to display the 3 days for which V7 is highest. How can I
do that?

> I think you want class(df) which shows its a data frame.

Oops. Sorry, I didn't guess it from the manual :-)

>    aggregate(df[,-1], df[,1,drop = FALSE], mean)

But why is df[,1,drop=FALSE] a list? I don't get it...

>    aggregate(df[,-1], list(Day = df$Day), mean)

Yeah, I figured out that one.

> Another alternative is to use summaryBy from the doBy package

I think I am not confident enough with the basic data types in R, I
need to understand them before I go over to specialized packages :-)

Again, thanks a lot,
January
On 12/16/05, January Weiner <[hidden email]> wrote:
> First: it works for colMeans, but it does not work for a function like this:
>
> do.call("rbind", by(df, df$Day, function(x) cor(df$val1, df$val2))

There are a number of problems:

1. the function does not depend on x and therefore will return the
same result for each day group.

2. although ?by says it returns a list, it apparently simplifies the result,
contrary to the documentation, in certain cases. Try this:

   do.call("rbind", as.list(by(df, df$Day, function(x) cor(x$val1, x$val2))))

or this:

   do.call("rbind", by(df, df$Day, function(x) list(cor = cor(x$val1, x$val2))))

3. In the sample data from your first post val1 is constant for Wed, so
you won't be able to get a correlation. That's the source of the warning
that you get when running the line in #2.

> Second: in case of colMeans (where it works) it returns a matrix, and
> I have troubles getting it back to the data.frame, so I can access
> blah$Day. Instead, I have smth like that:

Try blah[,"Day"] which works with both matrices and data frames.

> > do.call("rbind",b)
>     V2 V3 V4 V5       V7
> Tue 19 15  2  0 1.538462
> Wed  5  3  6  1 1.285714

Another possibility is to coerce it to a data frame:

   as.data.frame(do.call("rbind", b))

or change your function to return a list.

> >    aggregate(df[,-1], df[,1,drop = FALSE], mean)
>
> But why is df[,1,drop=FALSE] a list? I don't get it...

Because df[,1,drop = FALSE] is a one-column data frame and data frames
are lists. Had we not specified drop = FALSE, the single selected column
would have been simplified automatically to a plain vector, which is
not a list. We do not want that simplification here.
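One possible way to get from the row-named matrix to a data frame with a real Day column, and then to look rows up by name or rank them, is sketched below on the fake example data. This is an illustration rather than something proposed in the thread: val2 stands in for the V7 column of the real data, and the names m and out are hypothetical.

```r
df <- data.frame(Day  = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = 1:5,
                 val2 = 3 * (1:5))

# One row of column means per day; the days end up as row names.
m <- do.call("rbind", by(df, df$Day, function(x) colMeans(x[, -1])))

m["Tue", ]  # select a row by name instead of by position

# Move the row names into an ordinary column called Day.
out <- data.frame(Day = rownames(m), m)
rownames(out) <- NULL

out[out$Day == "Tue", ]

# Days with the largest val2 first; head() keeps at most n rows.
head(out[order(out$val2, decreasing = TRUE), ], n = 3)
```

With only two days in the toy data, the last call returns both rows, Wed first.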
Gabor Grothendieck wrote:
>>I think I am not confident enough with the basic data types in R, I
>>need to understand them before I go over to specialized packages :-)
>>Again, thanks a lot,
>>January

You might want to look at the summarize function in the Hmisc package.

Frank

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
In reply to this post by Gabor Grothendieck
One other point. The cor example could be done using tapply like
this:

   tapply(rownames(df), df$Day, function(r) cor(df[r,"val1"], df[r, "val2"]))
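A sketch of the tapply() approach above, extended with one possible conversion of the result into a data frame with a Day column. The fake example data is used, where both groups happen to be perfectly correlated; the names r and cors are illustrative only.

```r
df <- data.frame(Day  = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = 1:5,
                 val2 = 3 * (1:5))

# Group the row names by day, then correlate the two columns
# using only the rows that belong to each group.
r <- tapply(rownames(df), df$Day,
            function(rows) cor(df[rows, "val1"], df[rows, "val2"]))

# tapply() returns a named 1-d array; flatten it into two columns.
cors <- data.frame(Day = names(r), cor = as.vector(r))
cors
```

Because each call to cor() returns a single number, tapply() simplifies the result to a numeric array, so no do.call("rbind", ...) step is needed here.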