Problem with Weighted Variance in Hmisc


Problem with Weighted Variance in Hmisc

Tom La Bone-2
 

The function wtd.var(x,w) in Hmisc calculates the weighted variance of x,
where w are the weights.  It seems to me that wtd.var(x,w) should equal
var(x) when all of the weights are equal, but this does not appear to be the
case. Can someone point out where I am going wrong here?  Thanks.

 

Tom La Bone




Re: Problem with Weighted Variance in Hmisc

jiho
On 2007-June-01, at 01:03, Tom La Bone wrote:
> The function wtd.var(x,w) in Hmisc calculates the weighted variance of x,
> where w are the weights.  It seems to me that wtd.var(x,w) should equal
> var(x) when all of the weights are equal, but this does not appear to be
> the case. Can someone point out where I am going wrong here?  Thanks.

The true formula of the weighted variance is this one:
        http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weighvar.pdf
For computation purposes, however, wtd.var uses another definition, which
treats the weights as repeat counts rather than true weights. If the weights
are normalized (sum to one), the two formulas are equal. If your weights are
true weights rather than repeat counts, I would recommend using the normwt
option. With normwt=TRUE, your issue is solved:

 > a=1:10
 > b=a
 > b[]=2
 > b
[1] 2 2 2 2 2 2 2 2 2 2
 > wtd.var(a,b)
[1] 8.68421
# all weights equal 2 <=> there are two repeats of each element of a
 > var(c(a,a))
[1] 8.68421
 > wtd.var(a,b,normwt=T)
[1] 9.166667
 > var(a)
[1] 9.166667

Cheers,

JiHO
---
http://jo.irisson.free.fr/


Re: Problem with Weighted Variance in Hmisc

Tom La Bone-2
Thanks.  I have another related question:  

The equation for weighted variance given in the NIST DataPlot documentation
is the usual variance equation with the weights inserted.  The variance of
the weighted mean is then this weighted variance divided by N.

There is another approach to calculating the weighted variance of the
weighted mean that propagates the uncertainty of each term in the weighted
mean (see Data Reduction and Error Analysis for the Physical Sciences by
Bevington & Robinson).  The two approaches do not give the same answer. Can
anyone suggest a reference that discusses the merits of the DataPlot
approach versus the Bevington approach?
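
For concreteness, here is a rough sketch (made-up numbers, not from any
documentation) of the two quantities I am comparing, assuming the
Bevington-style weights are inverse variances, w = 1/sigma^2:

x <- c(10.2, 9.8, 10.5, 10.1)        # hypothetical measurements
s <- c(0.20, 0.30, 0.25, 0.20)       # hypothetical uncertainties
w <- 1 / s^2                         # weights = 1/sigma^2 (an assumption)

xbar.w <- sum(w * x) / sum(w)        # weighted mean
n <- sum(w != 0)                     # number of nonzero weights

# DataPlot-style weighted variance, then divided by N
s2.w   <- sum(w * (x - xbar.w)^2) / (((n - 1) / n) * sum(w))
v.nist <- s2.w / n

# Bevington-style propagation of the individual uncertainties:
# Var(xbar.w) = 1 / sum(1/sigma^2) = 1 / sum(w)
v.bev <- 1 / sum(w)

c(nist = v.nist, bevington = v.bev)  # the two estimates generally differ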

Tom La Bone


Re: Problem with Weighted Variance in Hmisc

Frank Harrell
In reply to this post by jiho

The issue is what is being assumed for N in the denominator of the
variance formula, since the unbiased estimator subtracts one.  Using
normwt=TRUE means you are in effect assuming N is the number of elements
in the data vector, ignoring the weights.
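
A minimal sketch of that arithmetic (my own illustration, not the Hmisc
source):

a <- 1:10
b <- rep(2, 10)
xbar <- sum(b * a) / sum(b)             # weighted mean, same either way
# weights treated as repeat counts: divisor is sum(w) - 1
sum(b * (a - xbar)^2) / (sum(b) - 1)    # 8.68421, like wtd.var(a, b)
# normwt=TRUE in effect rescales w to sum to length(x): divisor becomes n - 1
bn <- b * length(a) / sum(b)
sum(bn * (a - xbar)^2) / (sum(bn) - 1)  # 9.166667, like wtd.var(a, b, normwt=TRUE)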

Frank Harrell



--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University


Re: Problem with Weighted Variance in Hmisc

jiho
In reply to this post by Tom La Bone-2
On 2007-June-01, at 13:00, Tom La Bone wrote:

> [...] Can anyone suggest a reference that discusses the merits of the
> DataPlot approach versus the Bevington approach?

I am no expert, but I did a little research on the subject, and it seems
there is no exact analytical equivalent of the standard error of the mean in
weighted statistics (i.e. there is no exact standard error/variance of the
weighted mean). That's why you find so many formulas: they are all
approximations that make more or less sense. If you have access to
ScienceDirect, there is a paper which compares three of those analytical
approximations to the variance estimated by bootstrap:

@article{Gatz1995AE,
        Author = {Gatz, Donald F and Smith, Luther},
        Date-Added = {2007-05-19 14:15:58 +0200},
        Date-Modified = {2007-06-01 14:12:17 +0200},
        Filed = {Yes},
        Journal = {Atmospheric Environment},
        Number = {11},
        Pages = {1185--1193},
        Rating = {0},
        Title = {The standard error of a weighted mean concentration--I.  
Bootstrapping vs other methods},
        Url = {http://www.sciencedirect.com/science/article/B6VH3-3YGV47C-6X/2/18b627259a75ff9b765410aaa231e352},
        Volume = {29},
        Year = {1995},
        Abstract = {Concentrations of chemical constituents of precipitation  
are frequently expressed in terms of the precipitation-weighted mean,  
which has several desirable properties. Unfortunately, the weighted  
mean has no analytical analog of the standard error of the arithmetic  
mean for use in characterizing its statistical uncertainty. Several  
approximate expressions have been used previously in the literature,  
but there is no consensus as to which is best. This paper compares  
three methods from the literature with a standard based on  
bootstrapping. Comparative calculations were carried out for nine  
major ions measured at 222 sampling sites in the National Atmospheric  
Deposition/National Trends Network (NADP/NTN). The ratio variance  
approximation of Cochran (1977) gave results that were not  
statistically different from those of bootstrapping, and is suggested  
as the method of choice for routine computing of the standard error  
of the weighted mean. The bootstrap method has advantages of its own,  
including the fact that it is nonparametric, but requires additional  
effort and computation time.}}

The analytical formula that is closest to the bootstrap is this one:

var.wtd.mean.cochran <- function(x, w)
# Computes the variance of a weighted mean following the Cochran (1977)
# definition. Requires wtd.mean() from Hmisc.
{
        n     <- length(w)
        xWbar <- wtd.mean(x, w)
        wbar  <- mean(w)
        out   <- n / ((n - 1) * sum(w)^2) *
                 (sum((w * x - wbar * xWbar)^2) -
                  2 * xWbar * sum((w - wbar) * (w * x - wbar * xWbar)) +
                  xWbar^2 * sum((w - wbar)^2))
        return(out)
}

It's the one I am retaining.
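
As a quick sanity check (my own made-up example, not from the paper), one
can compare it against a naive bootstrap of the weighted mean:

library(Hmisc)                     # for wtd.mean(), used by the function above
set.seed(1)
x <- rnorm(50, mean = 10)          # made-up data
w <- runif(50, 0.5, 2)             # made-up weights

var.wtd.mean.cochran(x, w)

# naive bootstrap: resample (x, w) pairs and recompute the weighted mean
boot.means <- replicate(2000, {
  i <- sample(seq_along(x), replace = TRUE)
  wtd.mean(x[i], w[i])
})
var(boot.means)                    # should be of a similar magnitude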

NB: part two of the paper cited above may also interest you:

@article{Gatz1995AEa,
        Author = {Gatz, Donald F and Smith, Luther},
        Date-Added = {2007-05-19 14:15:58 +0200},
        Date-Modified = {2007-06-01 12:11:53 +0200},
        Filed = {Yes},
        Journal = {Atmospheric Environment},
        Number = {11},
        Pages = {1195--1200},
        Title = {The standard error of a weighted mean concentration--II.  
Estimating confidence intervals},
        Url = {http://www.sciencedirect.com/science/article/B6VH3-3YGV47C-6Y/2/a187487377ef52b741e3dabdfca97517},
        Volume = {29},
        Year = {1995},
        Abstract = {One motivation for estimating the standard error, SEMw,  
of a weighted mean concentration, Mw, of an ion in precipitation is  
to use it to compute a confidence interval for Mw. Typically this is  
done by multiplying the standard error by a factor that depends on  
the degree of confidence one wishes to express, on the assumption  
that the weighted mean has a normal distribution. This paper compares  
confidence intervals of Mw concentrations of ions in precipitation,  
as computed using the assumption of a normal distribution, with those  
estimated from distributions produced by bootstrapping. The  
hypothesis that Mw was normally distributed was rejected about half  
the time (at the 5{\%} significance level) in tests involving nine  
major ions measured at ten diverse sites in the National Atmospheric  
Deposition Program/National Trends Network (NADP/NTN). Most of these  
rejections occurred at sites with fewer than 100 samples, in  
agreement with previous results. Nevertheless, the hypothesis was  
often rejected at sites with more than 100 samples as well. The  
maximum error (relative to Mw) in the 95{\%} confidence limits made  
by assuming a normal distribution of the Mw at the ten sites examined  
was about 27{\%}. Most such errors were less than 10{\%}, and errors  
were smaller at sampling sites with > 100 samples than at those with  
< 100 samples.}}

Cheers,

JiHO
---
http://jo.irisson.free.fr/


Re: Problem with Weighted Variance in Hmisc

Tom La Bone-2
Wonderful!  Thanks for the assistance.

Tom La Bone

