Quantcast

Using summaryBy with weighted data

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Using summaryBy with weighted data

Solomon Messing
Dear Soren and R users:

I am trying to use the summaryBy function with weights.  Is this possible?  An example that illustrates what I am trying to do follows:

library(doBy)
## make up some data
response = rnorm(100)
group = c(rep(1,20), rep(2,20), rep(3,20), rep(4,20), rep(5,20))
weights = runif(100, 0, 1)
mydata = data.frame(response,group,weights)

## run summaryBy without weights:
summaryBy(response~group, data = mydata, FUN = mean)

## attempt to run summaryBy with weights, throws error
summaryBy(x~group, data = mydata, FUN = weighted.mean, w=weights )

## throws the error:
# Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { :
# arguments must have same length

My guess is that summaryBy is not giving weighted.mean() each group of weights, but instead is passing all of the weights in the data set each time it calls weighted.mean().  Do you know if there is some way to get summaryBy to pass weights to weighted.mean() only for each group?  

I suspect this functionality would be a tremendous benefit to R users who regularly work with weighted data, such as myself.  

Thanks,

Solomon Messing
www.stanford.edu/~messing

PS I know this basic example can be done using lapply(split(...)) approach referenced here:

http://www.mail-archive.com/r-help@.../msg12349.html 

but for more complex tasks the lapply approach will mean writing a lot of extra code to run everything and then to get things formatted as nicely as summaryBy() was designed to do.


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Using summaryBy with weighted data

Joshua Wiley-2
Dear Solomon,

On Sun, Jan 16, 2011 at 10:27 PM, Solomon Messing
<[hidden email]> wrote:

> Dear Soren and R users:
>
> I am trying to use the summaryBy function with weights.  Is this possible?  An example that illustrates what I am trying to do follows:
>
> library(doBy)
> ## make up some data
> response = rnorm(100)
> group = c(rep(1,20), rep(2,20), rep(3,20), rep(4,20), rep(5,20))
> weights = runif(100, 0, 1)
> mydata = data.frame(response,group,weights)
>
> ## run summaryBy without weights:
> summaryBy(response~group, data = mydata, FUN = mean)
>
> ## attempt to run summaryBy with weights, throws error
> summaryBy(x~group, data = mydata, FUN = weighted.mean, w=weights )
>
> ## throws the error:
> # Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { :
> #                                       arguments must have same length
>
> My guess is that summaryBy is not giving weighted.mean() each group of weights, but instead is passing all of the weights in the data set each time it calls weighted.mean().

Yes, of course.  It has no way of knowing that the weights should also
be being broken down by group....they are not in the formula.

>  Do you know if there is some way to get summaryBy to pass weights to weighted.mean() only for each group?

Ideally there would be a way to pass more than one variable to a
function (e.g., response and weights) or just an entire object
(mydata) broken down by group.  Then you would just make a wrapper
function to pass the right values to the x and w arguments of
weighted.mean.  Instead here is a somewhat hacked version:

library(doBy)
## make up some data (easier)
mydata <- data.frame(response = rnorm(100),
 group = rep(1:5, each = 20), weights = runif(100, 0, 1))

## manually compute weighted mean
tmp <- summaryBy(response*weights ~ group, data = mydata, FUN = sum)
tmp[,2] <- tmp[,2]/with(mydata, tapply(weights, group, sum))
tmp ## weighted means

## here's the 'problem', if you will, even with  +, they are passed
one at a time
summaryBy(response + weights ~ group, data = mydata, FUN = str)
summaryBy(mydata ~ group, data = mydata, FUN = str)

## here is an option using by():
xy <- by(mydata, mydata$group, function(z) weighted.mean(z$response, z$weights))
xy
## if you don't like the formatting....
data.frame(group = names(c(xy)), weighted.mean = c(xy))

HTH,

Josh

>
> I suspect this functionality would be a tremendous benefit to R users who regularly work with weighted data, such as myself.
>
> Thanks,
>
> Solomon Messing
> www.stanford.edu/~messing
>
> PS I know this basic example can be done using lapply(split(...)) approach referenced here:
>
> http://www.mail-archive.com/r-help@.../msg12349.html
>
> but for more complex tasks the lapply approach will mean writing a lot of extra code to run everything and then to get things formatted as nicely as summaryBy() was designed to do.
>
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Using summaryBy with weighted data

David Freedman
In reply to this post by Solomon Messing
You might use the plyr package to get group-wise weighted means

library(plyr)
ddply(mydata,~group,summarise, b=mean(weights), c=weighted.mean(response,weights))

hth
david freedman
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Using summaryBy with weighted data

Solomon Messing
In reply to this post by Joshua Wiley-2
Thanks Josh.  I built on your example and ended up with the code below--if you or anyone sees any issues please let me know.  It would be great if there were a slicker way to get these kinds of summary stats in R, but this gets the job done.

# takes data frame z with weights w and data x, returns weighted mean, weighted SE, and N
msenw = function(z){
        N = length(na.omit(z)$response)
        i = which(!is.na(z$response))
        return(
                        c( W.M = weighted.mean(z$response, z$weights, na.rm=T),
                        W.SE = sqrt(wtd.var(z$response, weights = z$weights))/sqrt(sum(z$weights[i])),
                        N=N ) )
}

library(doBy)
library(Hmisc)
## make up some data (easier)
mydata <- data.frame(response = rnorm(100),
                group = rep(1:5, each = 20), weights = runif(100, 0, 1))

xy <- by(mydata, mydata$group, msenw)
data.frame( group = names(c(xy)), do.call(rbind, xy) )

## can be extended to other data using:
xy <- by(data.frame(response = mydata$response, weights = mydata$weights), mydata$group, msenw)


Solomon Messing
www.stanford.edu/~messing



On Jan 16, 2011, at 11:16 PM, Joshua Wiley wrote:

> Dear Solomon,
>
> On Sun, Jan 16, 2011 at 10:27 PM, Solomon Messing
> <[hidden email]> wrote:
>> Dear Soren and R users:
>>
>> I am trying to use the summaryBy function with weights.  Is this possible?  An example that illustrates what I am trying to do follows:
>>
>> library(doBy)
>> ## make up some data
>> response = rnorm(100)
>> group = c(rep(1,20), rep(2,20), rep(3,20), rep(4,20), rep(5,20))
>> weights = runif(100, 0, 1)
>> mydata = data.frame(response,group,weights)
>>
>> ## run summaryBy without weights:
>> summaryBy(response~group, data = mydata, FUN = mean)
>>
>> ## attempt to run summaryBy with weights, throws error
>> summaryBy(x~group, data = mydata, FUN = weighted.mean, w=weights )
>>
>> ## throws the error:
>> # Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { :
>> #                                       arguments must have same length
>>
>> My guess is that summaryBy is not giving weighted.mean() each group of weights, but instead is passing all of the weights in the data set each time it calls weighted.mean().
>
> Yes, of course.  It has no way of knowing that the weights should also
> be being broken down by group....they are not in the formula.
>
>>  Do you know if there is some way to get summaryBy to pass weights to weighted.mean() only for each group?
>
> Ideally there would be a way to pass more than one variable to a
> function (e.g., response and weights) or just an entire object
> (mydata) broken down by group.  Then you would just make a wrapper
> function to pass the right values to the x and w arguments of
> weighted.mean.  Instead here is a somewhat hacked version:
>
> library(doBy)
> ## make up some data (easier)
> mydata <- data.frame(response = rnorm(100),
> group = rep(1:5, each = 20), weights = runif(100, 0, 1))
>
> ## manually compute weighted mean
> tmp <- summaryBy(response*weights ~ group, data = mydata, FUN = sum)
> tmp[,2] <- tmp[,2]/with(mydata, tapply(weights, group, sum))
> tmp ## weighted means
>
> ## here's the 'problem', if you will, even with  +, they are passed
> one at a time
> summaryBy(response + weights ~ group, data = mydata, FUN = str)
> summaryBy(mydata ~ group, data = mydata, FUN = str)
>
> ## here is an option using by():
> xy <- by(mydata, mydata$group, function(z) weighted.mean(z$response, z$weights))
> xy
> ## if you don't like the formatting....
> data.frame(group = names(c(xy)), weighted.mean = c(xy))
>
> HTH,
>
> Josh
>
>>
>> I suspect this functionality would be a tremendous benefit to R users who regularly work with weighted data, such as myself.
>>
>> Thanks,
>>
>> Solomon Messing
>> www.stanford.edu/~messing
>>
>> PS I know this basic example can be done using lapply(split(...)) approach referenced here:
>>
>> http://www.mail-archive.com/r-help@.../msg12349.html
>>
>> but for more complex tasks the lapply approach will mean writing a lot of extra code to run everything and then to get things formatted as nicely as summaryBy() was designed to do.
>>
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Joshua Wiley
> Ph.D. Student, Health Psychology
> University of California, Los Angeles
> http://www.joshuawiley.com/


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Using summaryBy with weighted data

djmuseR
Hi:

Does this do what you need?

wstats <- function(d) {
             require(Hmisc)
             N <- length(d$response[!is.na(d$response)])
             c(WM = wtd.mean(d$response, d$weights),
               WSE = sqrt(wtd.var(d$response, d$weights)),
               N = N)
    }
library(plyr)
ddply(mydata, .(group), wstats)
  group            WM       WSE  N
1     1  0.1302255752 1.1911298 20
2     2 -0.2814664362 0.8582928 20
3     3 -0.3640550516 1.2618343 20
4     4  0.0002852392 1.1463205 20
5     5 -0.0070283053 1.2315683 20

The trick to writing this function for input into plyr is that the argument
is a data frame. When called in ddply(), the function wstats() will be
applied to each sub-frame corresponding to the grouping factor(s). Inside
it, the variables of interest are extracted relative to the input data frame
and the three quantities are computed. I used wtd.mean() and wtd.var() from
Hmisc, as both will remove NAs by default. In the ddply call, the function
name is simply cited since a sub-data frame is the sole argument of the
function.

I couldn't figure out how to get doBy to get this to work, as it seems best
suited to functions of one argument (a single response), but here's an
alternative using the data.table package:

library(data.table)
# Assumes Hmisc is already loaded...
myDT <- data.table(mydata, key = 'group')
myDT[, list(N = length(response[!is.na(response)]),
             wtdMean = wtd.mean(response, weights),
             wtdSE = sqrt(wtd.var(response, weights))), by = 'group']
     group  N       wtdMean     wtdSE
[1,]     1 20  0.1302255752 1.1911298
[2,]     2 20 -0.2814664362 0.8582928
[3,]     3 20 -0.3640550516 1.2618343
[4,]     4 20  0.0002852392 1.1463205
[5,]     5 20 -0.0070283053 1.2315683

data.table uses a different model of data organization from data frames. A
simplistic description is that it you can think of a data.table as analogous
to a table in a DBMS. Notice that the 'function call' is indexed inside the
data table: the first 'subscript' corresponds to what are called I()
operations  (analogous to 'select' statements in an SQL); the second
'subscript' corresponds to J() operations, (analogous to  'where'
statements), while the third argument is the by group(s), or sub-data
tables, to which (in this case) the J() operations apply.

For functions that take multiple arguments and that are meant to be applied
in a groupwise fashion, I find plyr and data.table to be very good options.
There are also base package alternatives (e.g., some combination of
lapply(), mapply() and do.call()) and several other packages, but plyr and
data.table are generally pretty good at handling most of the niggling
details. Having said that, both have learning curves - data.table, in
particular, will be much easier to pick up if you have some background in
SQLs, since its syntax uses primary principles of SQL.

data.table has a vignette and FAQ, along with an independent help list - for
details, see its page on R-forge:
http://r-forge.r-project.org/projects/datatable/
For plyr's documentation, see
http://had.co.nz/plyr/
A link to its mailing list is found on that page as well.

HTH,
Dennis

On Mon, Jan 17, 2011 at 10:24 AM, Solomon Messing <[hidden email]
> wrote:

> Thanks Josh.  I built on your example and ended up with the code below--if
> you or anyone sees any issues please let me know.  It would be great if
> there were a slicker way to get these kinds of summary stats in R, but this
> gets the job done.
>
> # takes data frame z with weights w and data x, returns weighted mean,
> weighted SE, and N
> msenw = function(z){
>        N = length(na.omit(z)$response)
>        i = which(!is.na(z$response))
>        return(
>                        c( W.M = weighted.mean(z$response, z$weights,
> na.rm=T),
>                        W.SE = sqrt(wtd.var(z$response, weights =
> z$weights))/sqrt(sum(z$weights[i])),
>                        N=N ) )
> }
>
> library(doBy)
> library(Hmisc)
> ## make up some data (easier)
> mydata <- data.frame(response = rnorm(100),
>                group = rep(1:5, each = 20), weights = runif(100, 0, 1))
>
> xy <- by(mydata, mydata$group, msenw)
> data.frame( group = names(c(xy)), do.call(rbind, xy) )
>
> ## can be extended to other data using:
> xy <- by(data.frame(response = mydata$response, weights = mydata$weights),
> mydata$group, msenw)
>
>
> Solomon Messing
> www.stanford.edu/~messing <http://www.stanford.edu/%7Emessing>
>
>
>
> On Jan 16, 2011, at 11:16 PM, Joshua Wiley wrote:
>
> > Dear Solomon,
> >
> > On Sun, Jan 16, 2011 at 10:27 PM, Solomon Messing
> > <[hidden email]> wrote:
> >> Dear Soren and R users:
> >>
> >> I am trying to use the summaryBy function with weights.  Is this
> possible?  An example that illustrates what I am trying to do follows:
> >>
> >> library(doBy)
> >> ## make up some data
> >> response = rnorm(100)
> >> group = c(rep(1,20), rep(2,20), rep(3,20), rep(4,20), rep(5,20))
> >> weights = runif(100, 0, 1)
> >> mydata = data.frame(response,group,weights)
> >>
> >> ## run summaryBy without weights:
> >> summaryBy(response~group, data = mydata, FUN = mean)
> >>
> >> ## attempt to run summaryBy with weights, throws error
> >> summaryBy(x~group, data = mydata, FUN = weighted.mean, w=weights )
> >>
> >> ## throws the error:
> >> # Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) {
> :
> >> #                                       arguments must have same length
> >>
> >> My guess is that summaryBy is not giving weighted.mean() each group of
> weights, but instead is passing all of the weights in the data set each time
> it calls weighted.mean().
> >
> > Yes, of course.  It has no way of knowing that the weights should also
> > be being broken down by group....they are not in the formula.
> >
> >>  Do you know if there is some way to get summaryBy to pass weights to
> weighted.mean() only for each group?
> >
> > Ideally there would be a way to pass more than one variable to a
> > function (e.g., response and weights) or just an entire object
> > (mydata) broken down by group.  Then you would just make a wrapper
> > function to pass the right values to the x and w arguments of
> > weighted.mean.  Instead here is a somewhat hacked version:
> >
> > library(doBy)
> > ## make up some data (easier)
> > mydata <- data.frame(response = rnorm(100),
> > group = rep(1:5, each = 20), weights = runif(100, 0, 1))
> >
> > ## manually compute weighted mean
> > tmp <- summaryBy(response*weights ~ group, data = mydata, FUN = sum)
> > tmp[,2] <- tmp[,2]/with(mydata, tapply(weights, group, sum))
> > tmp ## weighted means
> >
> > ## here's the 'problem', if you will, even with  +, they are passed
> > one at a time
> > summaryBy(response + weights ~ group, data = mydata, FUN = str)
> > summaryBy(mydata ~ group, data = mydata, FUN = str)
> >
> > ## here is an option using by():
> > xy <- by(mydata, mydata$group, function(z) weighted.mean(z$response,
> z$weights))
> > xy
> > ## if you don't like the formatting....
> > data.frame(group = names(c(xy)), weighted.mean = c(xy))
> >
> > HTH,
> >
> > Josh
> >
> >>
> >> I suspect this functionality would be a tremendous benefit to R users
> who regularly work with weighted data, such as myself.
> >>
> >> Thanks,
> >>
> >> Solomon Messing
> >> www.stanford.edu/~messing <http://www.stanford.edu/%7Emessing>
> >>
> >> PS I know this basic example can be done using lapply(split(...))
> approach referenced here:
> >>
> >> http://www.mail-archive.com/r-help@.../msg12349.html
> >>
> >> but for more complex tasks the lapply approach will mean writing a lot
> of extra code to run everything and then to get things formatted as nicely
> as summaryBy() was designed to do.
> >>
> >>
> >>        [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> [hidden email] mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
> >
> > --
> > Joshua Wiley
> > Ph.D. Student, Health Psychology
> > University of California, Los Angeles
> > http://www.joshuawiley.com/
>
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Using summaryBy with weighted data

Søren Højsgaard
In reply to this post by Joshua Wiley-2
It is currently not possible to pass weights in summaryBy.
Regards
Søren

________________________________________
Fra: Joshua Wiley [[hidden email]]
Sendt: 17. januar 2011 08:16
Til: Solomon Messing
Cc: [hidden email]; Søren Højsgaard
Emne: Re: [R] Using summaryBy with weighted data

Dear Solomon,

On Sun, Jan 16, 2011 at 10:27 PM, Solomon Messing
<[hidden email]> wrote:

> Dear Soren and R users:
>
> I am trying to use the summaryBy function with weights.  Is this possible?  An example that illustrates what I am trying to do follows:
>
> library(doBy)
> ## make up some data
> response = rnorm(100)
> group = c(rep(1,20), rep(2,20), rep(3,20), rep(4,20), rep(5,20))
> weights = runif(100, 0, 1)
> mydata = data.frame(response,group,weights)
>
> ## run summaryBy without weights:
> summaryBy(response~group, data = mydata, FUN = mean)
>
> ## attempt to run summaryBy with weights, throws error
> summaryBy(x~group, data = mydata, FUN = weighted.mean, w=weights )
>
> ## throws the error:
> # Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { :
> #                                       arguments must have same length
>
> My guess is that summaryBy is not giving weighted.mean() each group of weights, but instead is passing all of the weights in the data set each time it calls weighted.mean().

Yes, of course.  It has no way of knowing that the weights should also
be being broken down by group....they are not in the formula.

>  Do you know if there is some way to get summaryBy to pass weights to weighted.mean() only for each group?

Ideally there would be a way to pass more than one variable to a
function (e.g., response and weights) or just an entire object
(mydata) broken down by group.  Then you would just make a wrapper
function to pass the right values to the x and w arguments of
weighted.mean.  Instead here is a somewhat hacked version:

library(doBy)
## make up some data (easier)
mydata <- data.frame(response = rnorm(100),
 group = rep(1:5, each = 20), weights = runif(100, 0, 1))

## manually compute weighted mean
tmp <- summaryBy(response*weights ~ group, data = mydata, FUN = sum)
tmp[,2] <- tmp[,2]/with(mydata, tapply(weights, group, sum))
tmp ## weighted means

## here's the 'problem', if you will, even with  +, they are passed
one at a time
summaryBy(response + weights ~ group, data = mydata, FUN = str)
summaryBy(mydata ~ group, data = mydata, FUN = str)

## here is an option using by():
xy <- by(mydata, mydata$group, function(z) weighted.mean(z$response, z$weights))
xy
## if you don't like the formatting....
data.frame(group = names(c(xy)), weighted.mean = c(xy))

HTH,

Josh

>
> I suspect this functionality would be a tremendous benefit to R users who regularly work with weighted data, such as myself.
>
> Thanks,
>
> Solomon Messing
> www.stanford.edu/~messing
>
> PS I know this basic example can be done using lapply(split(...)) approach referenced here:
>
> http://www.mail-archive.com/r-help@.../msg12349.html
>
> but for more complex tasks the lapply approach will mean writing a lot of extra code to run everything and then to get things formatted as nicely as summaryBy() was designed to do.
>
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Using summaryBy with weighted data

Solomon Messing
In reply to this post by djmuseR
Thanks Dennis, looks like there's even less boiler plate code with plyr.  By the way, what I labelled "W.SE" is meant to represent the weighted standard error of the mean.  Your "WSE" calculations appear to be providing the weighted standard deviation of the variable.  Is this a matter of needing to change my labels to W.SEM to avoid this kind of confusion, or is there literature suggesting that I should be using the standard deviation of the variable to estimate the weighted standard error of the mean?

Thanks,

-Solomon

On Jan 17, 2011, at 1:11 PM, Dennis Murphy wrote:

> Hi:
>
> Does this do what you need?
>
> wstats <- function(d) {
>              require(Hmisc)
>              N <- length(d$response[!is.na(d$response)])
>              c(WM = wtd.mean(d$response, d$weights),
>                WSE = sqrt(wtd.var(d$response, d$weights)),
>                N = N)
>     }
> library(plyr)
> ddply(mydata, .(group), wstats)
>   group            WM       WSE  N
> 1     1  0.1302255752 1.1911298 20
> 2     2 -0.2814664362 0.8582928 20
> 3     3 -0.3640550516 1.2618343 20
> 4     4  0.0002852392 1.1463205 20
> 5     5 -0.0070283053 1.2315683 20
>
> The trick to writing this function for input into plyr is that the argument is a data frame. When called in ddply(), the function wstats() will be applied to each sub-frame corresponding to the grouping factor(s). Inside it, the variables of interest are extracted relative to the input data frame and the three quantities are computed. I used wtd.mean() and wtd.var() from Hmisc, as both will remove NAs by default. In the ddply call, the function name is simply cited since a sub-data frame is the sole argument of the function.
>
> I couldn't figure out how to get doBy to get this to work, as it seems best suited to functions of one argument (a single response), but here's an alternative using the data.table package:
>
> library(data.table)
> # Assumes Hmisc is already loaded...
> myDT <- data.table(mydata, key = 'group')
> myDT[, list(N = length(response[!is.na(response)]),
>              wtdMean = wtd.mean(response, weights),
>              wtdSE = sqrt(wtd.var(response, weights))), by = 'group']
>      group  N       wtdMean     wtdSE
> [1,]     1 20  0.1302255752 1.1911298
> [2,]     2 20 -0.2814664362 0.8582928
> [3,]     3 20 -0.3640550516 1.2618343
> [4,]     4 20  0.0002852392 1.1463205
> [5,]     5 20 -0.0070283053 1.2315683
>
> data.table uses a different model of data organization from data frames. A simplistic description is that it you can think of a data.table as analogous to a table in a DBMS. Notice that the 'function call' is indexed inside the data table: the first 'subscript' corresponds to what are called I() operations  (analogous to 'select' statements in an SQL); the second 'subscript' corresponds to J() operations, (analogous to  'where' statements), while the third argument is the by group(s), or sub-data tables, to which (in this case) the J() operations apply.
>
> For functions that take multiple arguments and that are meant to be applied in a groupwise fashion, I find plyr and data.table to be very good options. There are also base package alternatives (e.g., some combination of lapply(), mapply() and do.call()) and several other packages, but plyr and data.table are generally pretty good at handling most of the niggling details. Having said that, both have learning curves - data.table, in particular, will be much easier to pick up if you have some background in SQLs, since its syntax uses primary principles of SQL.
>
> data.table has a vignette and FAQ, along with an independent help list - for details, see its page on R-forge:
> http://r-forge.r-project.org/projects/datatable/
> For plyr's documentation, see
> http://had.co.nz/plyr/
> A link to its mailing list is found on that page as well.
>
> HTH,
> Dennis
>
> On Mon, Jan 17, 2011 at 10:24 AM, Solomon Messing <[hidden email]> wrote:
> Thanks Josh.  I built on your example and ended up with the code below--if you or anyone sees any issues please let me know.  It would be great if there were a slicker way to get these kinds of summary stats in R, but this gets the job done.
>
> # takes data frame z with weights w and data x, returns weighted mean, weighted SE, and N
> msenw = function(z){
>        N = length(na.omit(z)$response)
>        i = which(!is.na(z$response))
>        return(
>                        c( W.M = weighted.mean(z$response, z$weights, na.rm=T),
>                        W.SE = sqrt(wtd.var(z$response, weights = z$weights))/sqrt(sum(z$weights[i])),
>                        N=N ) )
> }
>
> library(doBy)
> library(Hmisc)
> ## make up some data (easier)
> mydata <- data.frame(response = rnorm(100),
>                group = rep(1:5, each = 20), weights = runif(100, 0, 1))
>
> xy <- by(mydata, mydata$group, msenw)
> data.frame( group = names(c(xy)), do.call(rbind, xy) )
>
> ## can be extended to other data using:
> xy <- by(data.frame(response = mydata$response, weights = mydata$weights), mydata$group, msenw)
>
>
> Solomon Messing
> www.stanford.edu/~messing
>
>
>
> On Jan 16, 2011, at 11:16 PM, Joshua Wiley wrote:
>
> > Dear Solomon,
> >
> > On Sun, Jan 16, 2011 at 10:27 PM, Solomon Messing
> > <[hidden email]> wrote:
> >> Dear Soren and R users:
> >>
> >> I am trying to use the summaryBy function with weights.  Is this possible?  An example that illustrates what I am trying to do follows:
> >>
> >> library(doBy)
> >> ## make up some data
> >> response = rnorm(100)
> >> group = c(rep(1,20), rep(2,20), rep(3,20), rep(4,20), rep(5,20))
> >> weights = runif(100, 0, 1)
> >> mydata = data.frame(response,group,weights)
> >>
> >> ## run summaryBy without weights:
> >> summaryBy(response~group, data = mydata, FUN = mean)
> >>
> >> ## attempt to run summaryBy with weights, throws error
> >> summaryBy(x~group, data = mydata, FUN = weighted.mean, w=weights )
> >>
> >> ## throws the error:
> >> # Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { :
> >> #                                       arguments must have same length
> >>
> >> My guess is that summaryBy is not giving weighted.mean() each group of weights, but instead is passing all of the weights in the data set each time it calls weighted.mean().
> >
> > Yes, of course.  It has no way of knowing that the weights should also
> > be being broken down by group....they are not in the formula.
> >
> >>  Do you know if there is some way to get summaryBy to pass weights to weighted.mean() only for each group?
> >
> > Ideally there would be a way to pass more than one variable to a
> > function (e.g., response and weights) or just an entire object
> > (mydata) broken down by group.  Then you would just make a wrapper
> > function to pass the right values to the x and w arguments of
> > weighted.mean.  Instead here is a somewhat hacked version:
> >
> > library(doBy)
> > ## make up some data (easier)
> > mydata <- data.frame(response = rnorm(100),
> > group = rep(1:5, each = 20), weights = runif(100, 0, 1))
> >
> > ## manually compute weighted mean
> > tmp <- summaryBy(response*weights ~ group, data = mydata, FUN = sum)
> > tmp[,2] <- tmp[,2]/with(mydata, tapply(weights, group, sum))
> > tmp ## weighted means
> >
> > ## here's the 'problem', if you will, even with  +, they are passed
> > one at a time
> > summaryBy(response + weights ~ group, data = mydata, FUN = str)
> > summaryBy(mydata ~ group, data = mydata, FUN = str)
> >
> > ## here is an option using by():
> > xy <- by(mydata, mydata$group, function(z) weighted.mean(z$response, z$weights))
> > xy
> > ## if you don't like the formatting....
> > data.frame(group = names(c(xy)), weighted.mean = c(xy))
> >
> > HTH,
> >
> > Josh
> >
> >>
> >> I suspect this functionality would be a tremendous benefit to R users who regularly work with weighted data, such as myself.
> >>
> >> Thanks,
> >>
> >> Solomon Messing
> >> www.stanford.edu/~messing
> >>
> >> PS I know this basic example can be done using lapply(split(...)) approach referenced here:
> >>
> >> http://www.mail-archive.com/r-help@.../msg12349.html
> >>
> >> but for more complex tasks the lapply approach will mean writing a lot of extra code to run everything and then to get things formatted as nicely as summaryBy() was designed to do.
> >>
> >>
> >>        [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> [hidden email] mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
> >
> > --
> > Joshua Wiley
> > Ph.D. Student, Health Psychology
> > University of California, Los Angeles
> > http://www.joshuawiley.com/
>
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Using summaryBy with weighted data

Sebastián Daza
Hi everyone,
I am trying to run Sweave.bat (batchfiles_0.6-1) from the command line
on Windows, but I get this error:

C:\batchfiles_0.6-1>Sweave.bat Sweave-test-1
"Error: rterm.exe not found"

I don't know how to set up the path if this one were the problem... I
ran rcmd.bat and I got this... so I don't know if it is a path problem.

C:\batchfiles_0.6-1>Rcmd,bat
R_ARCH=/x64
R_ARCH0=x64
R_ARCH0=x64
cmdpath=C:\R\R-2.12.1\bin\x64\Rcmd.exe
args=,bat
'bat' is not recognized as an internal or external command,
operable program or batch file.

the path of rterm.exe in my computer is: C:\R\R-2.12.1\bin\x64
thank you in advance!

--
Sebastián Daza
[hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

sweave.bat

Sebastián Daza
In reply to this post by Solomon Messing
Hi everyone,
I am trying to run Sweave.bat (batchfiles_0.6-1) from the command line
on Windows, but I get this error:

C:\batchfiles_0.6-1>Sweave.bat Sweave-test-1
"Error: rterm.exe not found"

I don't know how to set up the path if this one were the problem... I
ran rcmd.bat and I got this... so I don't know if it is a path problem.

C:\batchfiles_0.6-1>Rcmd,bat
R_ARCH=/x64
R_ARCH0=x64
R_ARCH0=x64
cmdpath=C:\R\R-2.12.1\bin\x64\Rcmd.exe
args=,bat
'bat' is not recognized as an internal or external command,
operable program or batch file.

the path of rterm.exe in my computer is: C:\R\R-2.12.1\bin\x64
thank you in advance!

--
Sebastián Daza
[hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: sweave.bat

Gabor Grothendieck
2011/1/17 Sebastián Daza <[hidden email]>:

> Hi everyone,
> I am trying to run Sweave.bat (batchfiles_0.6-1) from the command line on
> Windows, but I get this error:
>
> C:\batchfiles_0.6-1>Sweave.bat Sweave-test-1
> "Error: rterm.exe not found"
>
> I don't know how to set up the path if this one were the problem... I ran
> rcmd.bat and I got this... so I don't know if it is a path problem.
>
> C:\batchfiles_0.6-1>Rcmd,bat
> R_ARCH=/x64
> R_ARCH0=x64
> R_ARCH0=x64
> cmdpath=C:\R\R-2.12.1\bin\x64\Rcmd.exe
> args=,bat
> 'bat' is not recognized as an internal or external command,
> operable program or batch file.
>
> the path of rterm.exe in my computer is: C:\R\R-2.12.1\bin\x64
> thank you in advance!
>

You should not have to set up any paths. The whole point of these
batch commands are to save you from doing that.  If you contact me
privately we can try to determine what has gone wrong with Sweave.bat
.

Regarding Rcmd.bat, the comma in your command line should be a dot.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Loading...