A basic statistics question

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

A basic statistics question

Ron_M
Hi,

I would need to get a clarification on a quite fundamental statistics property, hope expeRts here would not mind if I post that here.

I leant that variance-covariance matrix of the standardized data is equal to the correlation matrix for the unstandardized data. So I used following data.

Data <- structure(c(7L, 5L, 9L, 7L, 8L, 7L, 6L, 6L, 5L, 7L, 8L, 6L, 7L,  7L, 6L, 7L, 7L, 6L, 8L, 6L, 7L, 7L, 7L, 8L, 7L, 9L, 8L, 7L, 7L,  0L, 10L, 10L, 10L, 7L, 6L, 8L, 5L, 5L, 6L, 6L, 7L, 11L, 9L, 10L,  0L, 13L, 13L, 10L, 7L, 7L, 7L, 10L, 7L, 5L, 8L, 7L, 10L, 10L,  10L, 6L, 7L, 6L, 6L, 8L, 8L, 7L, 7L, 7L, 7L, 8L, 7L, 8L, 6L,  6L, 8L, 7L, 4L, 7L, 7L, 10L, 10L, 6L, 7L, 7L, 12L, 12L, 8L, 5L,  5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 5L, 4L, 5L, 5L, 5L, 6L,  7L, 5L, 7L, 5L, 7L, 7L, 7L, 7L, 8L, 7L, 6L, 7L, 7L, 6L, 7L, 7L,  6L, 4L, 4L, 6L, 6L, 7L, 8L, 7L, 11L, 10L, 8L, 7L, 6L, 6L, 11L,  5L, 4L, 6L, 6L, 6L, 7L, 8L, 7L, 12L, 4L, 4L, 2L, 5L, 6L, 7L,  6L, 6L, 5L, 6L, 5L, 7L, 7L, 7L, 6L, 5L, 6L, 6L, 5L, 5L, 6L, 6L,  4L, 4L, 5L, 10L, 10L, 7L, 7L, 6L, 4L, 6L, 10L, 7L, 4L, 6L, 6L,  6L, 8L, 8L, 8L, 7L, 8L, 9L, 10L, 7L, 6L, 6L, 8L, 6L, 8L, 3L,  3L, 4L, 5L, 5L, 6L, 5L, 5L, 6L, 4L, 8L, 7L, 3L, 5L, 6L, 9L, 8L,  9L, 10L, 8L, 9L, 8L, 9L, 8L, 8L, 9L, 11L, 10L, 9L, 9L, 13L,
 13L,  10L, 7L, 7L, 7L, 9L, 8L, 7L, 6L, 10L, 8L, 7L, 8L, 8L, 3L, 4L,  3L, 7L, 6L, 6L, 6L, 6L, 5L, 6L, 6L, 6L, 2L, 5L, 7L, 9L, 8L, 9L,  10L, 8L, 8L, 9L, 9L, 11L, 11L, 11L, 10L, 9L, 9L, 11L, 2L, 3L,  2L, 2L, 2L, 1L, 4L, 4L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 4L, 6L, 4L,  5L, 2L, 3L, 5L, 4L, 4L, 2L, 4L, 4L, 5L, 4L, 2L, 7L, 3L, 3L, 10L,  13L, 11L, 9L, 9L, 7L, 8L, 9L, 6L, 7L, 6L, 5L, 3L, 13L, 3L, 3L,  0L, 1L, 4L, 5L, 3L, 3L, 0L, 2L, 20L, 3L, 2L, 6L, 5L, 5L, 5L,  2L, 2L, 5L, 5L, 5L, 4L, 3L, 4L, 4L, 3L, 4L, 10L, 10L, 9L, 8L,  4L, 4L, 8L, 7L, 10L, 3L, 1L, 9L, 5L, 11L, 9L), .Dim = c(45L,  8L), .Dimnames = list(NULL, c("V1", "V7", "V13", "V19", "V25",  "V31", "V37", "V43")))

    
Data_Normalized <- apply(Data, 2, function(x) return((x - mean(x))/sd(x)))

(t(Data_Normalized) %*% Data_Normalized)/dim(Data_Normalized)[1]



Point is that I am not getting exact CORR matrix. Can somebody point me what I am missing here?

Thanks for your pointer. 

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: A basic statistics question

Ted Harding
On 12-Aug-2014 19:57:29 Ron Michael wrote:

> Hi,
>
> I would need to get a clarification on a quite fundamental statistics
> property, hope expeRts here would not mind if I post that here.
>
> I leant that variance-covariance matrix of the standardized data is equal to
> the correlation matrix for the unstandardized data. So I used following data.
>
> Data <- structure(c(7L, 5L, 9L, 7L, 8L, 7L, 6L, 6L, 5L, 7L, 8L, 6L, 7L,  7L,
> 6L, 7L, 7L, 6L, 8L, 6L, 7L, 7L, 7L, 8L, 7L, 9L, 8L, 7L, 7L,  0L, 10L, 10L,
> 10L, 7L, 6L, 8L, 5L, 5L, 6L, 6L, 7L, 11L, 9L, 10L,  0L, 13L, 13L, 10L, 7L,
> 7L, 7L, 10L, 7L, 5L, 8L, 7L, 10L, 10L,  10L, 6L, 7L, 6L, 6L, 8L, 8L, 7L, 7L,
> 7L, 7L, 8L, 7L, 8L, 6L,  6L, 8L, 7L, 4L, 7L, 7L, 10L, 10L, 6L, 7L, 7L, 12L,
> 12L, 8L, 5L,  5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 5L, 4L, 5L, 5L, 5L, 6L,
> 7L, 5L, 7L, 5L, 7L, 7L, 7L, 7L, 8L, 7L, 6L, 7L, 7L, 6L, 7L, 7L,  6L, 4L, 4L,
> 6L, 6L, 7L, 8L, 7L, 11L, 10L, 8L, 7L, 6L, 6L, 11L,  5L, 4L, 6L, 6L, 6L, 7L,
> 8L, 7L, 12L, 4L, 4L, 2L, 5L, 6L, 7L,  6L, 6L, 5L, 6L, 5L, 7L, 7L, 7L, 6L, 5L,
> 6L, 6L, 5L, 5L, 6L, 6L,  4L, 4L, 5L, 10L, 10L, 7L, 7L, 6L, 4L, 6L, 10L, 7L,
> 4L, 6L, 6L,  6L, 8L, 8L, 8L, 7L, 8L, 9L, 10L, 7L, 6L, 6L, 8L, 6L, 8L, 3L,
> 3L, 4L, 5L, 5L, 6L, 5L, 5L, 6L, 4L, 8L, 7L, 3L, 5L, 6L, 9L, 8L,  9L, 10L, 8L,
> 9L, 8L, 9L, 8L, 8L, 9L, 11L, 10L, 9L, 9L, 13L,
>  13L,  10L, 7L, 7L, 7L, 9L, 8L, 7L, 6L, 10L, 8L, 7L, 8L, 8L, 3L, 4L,  3L, 7L,
> 6L, 6L, 6L, 6L, 5L, 6L, 6L, 6L, 2L, 5L, 7L, 9L, 8L, 9L,  10L, 8L, 8L, 9L, 9L,
> 11L, 11L, 11L, 10L, 9L, 9L, 11L, 2L, 3L,  2L, 2L, 2L, 1L, 4L, 4L, 2L, 2L, 1L,
> 1L, 1L, 3L, 3L, 4L, 6L, 4L,  5L, 2L, 3L, 5L, 4L, 4L, 2L, 4L, 4L, 5L, 4L, 2L,
> 7L, 3L, 3L, 10L,  13L, 11L, 9L, 9L, 7L, 8L, 9L, 6L, 7L, 6L, 5L, 3L, 13L, 3L,
> 3L,  0L, 1L, 4L, 5L, 3L, 3L, 0L, 2L, 20L, 3L, 2L, 6L, 5L, 5L, 5L,  2L, 2L,
> 5L, 5L, 5L, 4L, 3L, 4L, 4L, 3L, 4L, 10L, 10L, 9L, 8L,  4L, 4L, 8L, 7L, 10L,
> 3L, 1L, 9L, 5L, 11L, 9L), .Dim = c(45L,  8L), .Dimnames = list(NULL, c("V1",
> "V7", "V13", "V19", "V25",  "V31", "V37", "V43")))
>
> ____  
> Data_Normalized <- apply(Data, 2, function(x) return((x - mean(x))/sd(x)))
>
> (t(Data_Normalized) %*% Data_Normalized)/dim(Data_Normalized)[1]
>
>
>
> Point is that I am not getting exact CORR matrix. Can somebody point me
> what I am missing here?
>
> Thanks for your pointer.

Try:
  Data_Normalized <- apply(Data, 2, function(x) return((x - mean(x))/sd(x)))
  (t(Data_Normalized) %*% Data_Normalized)/(dim(Data_Normalized)[1]-1)

and compare the result with

  cor(Data)

And why? Look at

  ?sd

and note that:

  Details:
     Like 'var' this uses denominator n - 1.

Hoping this helps,
Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <[hidden email]>
Date: 12-Aug-2014  Time: 22:32:26
This message was sent by XFMail

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: A basic statistics question

Rolf Turner
In reply to this post by Ron_M
On 13/08/14 07:57, Ron Michael wrote:
> Hi,
>
> I would need to get a clarification on a quite fundamental statistics property, hope expeRts here would not mind if I post that here.
>
> I leant that variance-covariance matrix of the standardized data is equal to the correlation matrix for the unstandardized data. So I used following data.

<SNIP>

> (t(Data_Normalized) %*% Data_Normalized)/dim(Data_Normalized)[1]
>
>
>
> Point is that I am not getting exact CORR matrix. Can somebody point me what I am missing here?

You are using a denominator of "n" in calculating your "covariance"
matrix for your normalized data.  But these data were normalized using
the sd() function which (correctly) uses a denominator of n-1 so as to
obtain an unbiased estimator of the population standard deviation.

If you calculated

    (t(Data_Normalized) %*% Data_Normalized)/(dim(Data_Normalized)[1]-1)

then you would get the same result as you get from cor(Data) (to within
about 1e-15).

cheers,

Rolf Turner

--
Rolf Turner
Technical Editor ANZJS

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: A basic statistics question

Ted Harding
On 12-Aug-2014 21:41:52 Rolf Turner wrote:

> On 13/08/14 07:57, Ron Michael wrote:
>> Hi,
>>
>> I would need to get a clarification on a quite fundamental statistics
>> property, hope expeRts here would not mind if I post that here.
>>
>> I leant that variance-covariance matrix of the standardized data is equal to
>> the correlation matrix for the unstandardized data. So I used following
>> data.
>
> <SNIP>
>
>> (t(Data_Normalized) %*% Data_Normalized)/dim(Data_Normalized)[1]
>>
>> Point is that I am not getting exact CORR matrix. Can somebody point
>> me what I am missing here?
>
> You are using a denominator of "n" in calculating your "covariance"
> matrix for your normalized data.  But these data were normalized using
> the sd() function which (correctly) uses a denominator of n-1 so as to
> obtain an unbiased estimator of the population standard deviation.
>
> If you calculated
>
>     (t(Data_Normalized) %*% Data_Normalized)/(dim(Data_Normalized)[1]-1)
>
> then you would get the same result as you get from cor(Data) (to within
> about 1e-15).
>
> cheers,
> Rolf Turner

One could argue about "(correctly)"!

>From the "descriptive statistics" point of view, if one is given a single
number x, then this dataset has no variation, so one could say that
sd(x) = 0. And this is what one would get with a denominator of "n".

But if the single value x is viewed as sampled from a distribution
(with positive dispersion), then the value of x gives no information
about the SD of the distribution. If you use denominator (n-1) then
sd(x) = NA, i.e. is indeterminate (as it should be in this application).

The important thing when using pre-programmed functions is to know
which is being used. R uses (n-1), and this can be found from
looking at

  ?sd

or (with more detail) at

  ?cor

Ron had assumed that the denominator was n, apparently not being aware
that R uses (n-1).

Just a few thoughts ...
Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <[hidden email]>
Date: 12-Aug-2014  Time: 23:22:09
This message was sent by XFMail

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: A basic statistics question

Ted Harding
On 12-Aug-2014 22:22:13 Ted Harding wrote:

> On 12-Aug-2014 21:41:52 Rolf Turner wrote:
>> On 13/08/14 07:57, Ron Michael wrote:
>>> Hi,
>>>
>>> I would need to get a clarification on a quite fundamental statistics
>>> property, hope expeRts here would not mind if I post that here.
>>>
>>> I leant that variance-covariance matrix of the standardized data is
>>> equal to the correlation matrix for the unstandardized data. So I
>>> used following data.
>>
>> <SNIP>
>>
>>> (t(Data_Normalized) %*% Data_Normalized)/dim(Data_Normalized)[1]
>>>
>>> Point is that I am not getting exact CORR matrix. Can somebody point
>>> me what I am missing here?
>>
>> You are using a denominator of "n" in calculating your "covariance"
>> matrix for your normalized data.  But these data were normalized using
>> the sd() function which (correctly) uses a denominator of n-1 so as to
>> obtain an unbiased estimator of the population standard deviation.
>>
>> If you calculated
>>
>>     (t(Data_Normalized) %*% Data_Normalized)/(dim(Data_Normalized)[1]-1)
>>
>> then you would get the same result as you get from cor(Data) (to within
>> about 1e-15).
>>
>> cheers,
>> Rolf Turner
>
> One could argue about "(correctly)"!
>
>>From the "descriptive statistics" point of view, if one is given a single
> number x, then this dataset has no variation, so one could say that
> sd(x) = 0. And this is what one would get with a denominator of "n".
>
> But if the single value x is viewed as sampled from a distribution
> (with positive dispersion), then the value of x gives no information
> about the SD of the distribution. If you use denominator (n-1) then
> sd(x) = NA, i.e. is indeterminate (as it should be in this application).
>
> The important thing when using pre-programmed functions is to know
> which is being used. R uses (n-1), and this can be found from
> looking at
>
>   ?sd
>
> or (with more detail) at
>
>   ?cor
>
> Ron had assumed that the denominator was n, apparently not being aware
> that R uses (n-1).
>
> Just a few thoughts ...
> Ted.
> -------------------------------------------------
> E-Mail: (Ted Harding) <[hidden email]>
> Date: 12-Aug-2014  Time: 23:22:09

After some hesitation (not wanting to prolong a thread whose original
query has been effectively settled), nevertheless I think it may be
worth stating the general picture, for the sake of users who might
be confused or misled regarding the use of  the functions var(), cov(),
cor(), sd() etc.

The distinction to become aware of is between "summary" statistics
and statistics which will be used for estimation/inference.

Given a set of numbers, x[1], x[2], ... , x[N], they have a mean

  MEAN = sum(x)/N

and their variance VAR is the mean of the deviations between the x[i]
and MEAN:

  VAR = (sum(x - MEAN)^2)/N

If a random value X is drawn from {x[1], x[2], ... , x[N]}, with
uniform probability, then the expectation of X is again

  E(X) = MEAN

and the variance of X is again

  Var(X) = E((X - MEAN)^2) = VAR

with MEAN and VAR as given above. And the R function mean(x) will
return MEAN is its value. However, the R function var(x) will not
return VAR -- it will return (N/(N-1))*VAR

The above definitions of MEAN and VAR use division by N, i.e.
"denominator = N". But the R functions var(x), sd(x) etc. divide by
(N-1), i.e. use "denominator = N-1", as explained in
  ?var
  ?sd
etc.

The basic reason for this is that, given a random sample X[1], ... ,X[n]
of size n from {x[1], x[2], ... , x[N]}, the expectation of

  sum((X - mean(X))^2)

is (n-1)*VAR, so to obtain an unbiased estimate of VAR from the X
sample one must use the "bias-corrected" sample variance

  var(X) = sum((X - mean(X))^2)/(n-1)

i.e. "denominator = (n-1)" as described in the help pages.

So the function var(), with denominator = (n-1), is "correct" for
obtaining an unbiased estimator. But it will not be correct for the
variance sum((X - mean(X))^2)/n of the numbers X[1], ... ,X[n].

Since sd() also uses denominator = (n-1), sd(X) is the square root
of var(X). But, while var(X) is unbiased for VAR, sd(X) is not
unbiased for SD = sqrt(VAR), since, for a positive ransom variable Y,
in general

  E(sqrt(Y)) < sqrt(E(Y))

i.e. E(sd(X)) = E(sqrt(var(X))) < sqrt(E(var(X))) = sqrt(VAR) = SD.

The R functions var() etc., which use denominator = (n-1), do not
have an option which allows the user to choose denominator = n.

Therefore, in particular, a user who has a set of numbers
  {x[1], x[2], ... , x[N]}
(e.g. a record of a popuation) and wants the SD of these (e.g. for
use in summary statistics Mean and SD), could inadvertently use
R's sd(x), expecting SD, without being alerted to the fact that it
will give the wrong answer. And the only way round it is to explicitly
write one's own correction, e.g.

  SD <- function(x){n<-length(x); sd(x)*sqrt((n-1)/n)}

Indeed, this topic has got me wondering how many times I may have
blindly used sd(x) in the past, as if it were going to give me the
standard (sum(x - mean(x))^2)/length(x) result!

As I wrote earlier, when there is more than one definition which
might be used, the important thing is to know which one is being used,
and to correct accordingly if it is not the one you want.

Best wishes to all,
Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <[hidden email]>
Date: 13-Aug-2014  Time: 19:49:02
This message was sent by XFMail

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: A basic statistics question

Peter Dalgaard-2

On 13 Aug 2014, at 20:49 , (Ted Harding) <[hidden email]> wrote:

> Indeed, this topic has got me wondering how many times I may have
> blindly used sd(x) in the past, as if it were going to give me the
> standard (sum(x - mean(x))^2)/length(x) result!


At the risk of flogging a horse that has been dead for the better part of a century, I don't think there is anything "standard" about an SD with a divisor of N, and the biasedness of the version with N-1 divisor is not really the crucial issue. Rather, the distinction is between

- one sample from a known finite distribution
- multiple samples from an unknown distribution

and in particular between whether the mean is estimated or known.

One argument for the N-1 divisor in the normal case is that you can transform data to one observation with unknown mean and N-1 independent observations with mean known to be 0. The variance estimate will be a function of the N-1 variables, and thus there is no reason to let the mere existence of the uninformative Nth variable change the estimator.

Of course few people really care about N vs. N-1 but in larger linear models, it becomes N-p and p can be a sizeable fraction of N.  

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: A basic statistics question

Joshua Wiley-2
In reply to this post by Rolf Turner
On Wed, Aug 13, 2014 at 7:41 AM, Rolf Turner <[hidden email]>
wrote:

> On 13/08/14 07:57, Ron Michael wrote:
>
>> Hi,
>>
>> I would need to get a clarification on a quite fundamental statistics
>> property, hope expeRts here would not mind if I post that here.
>>
>> I leant that variance-covariance matrix of the standardized data is equal
>> to the correlation matrix for the unstandardized data. So I used following
>> data.
>>
>
> <SNIP>
>
>
>  (t(Data_Normalized) %*% Data_Normalized)/dim(Data_Normalized)[1]
>>
>>
>>
>> Point is that I am not getting exact CORR matrix. Can somebody point me
>> what I am missing here?
>>
>
> You are using a denominator of "n" in calculating your "covariance" matrix
> for your normalized data.  But these data were normalized using the sd()
> function which (correctly) uses a denominator of n-1 so as to obtain an
> unbiased estimator of the population standard deviation.
>

As a small point n - 1 is not _quite_ an unbiased estimator of the
population SD see Cureton. (1968).
Unbiased Estimation of the Standard Deviation, The American Statistician,
22(1).

To see this in action:

res <- unlist(parLapply(cl, 1:1e7, function(i) sd(rnorm(10, mean = 0, sd =
1))))
correction <- function(n) {
    gamma((n-1)/2) * sqrt((n-1)/2) / gamma(n/2)
}
mean(res)
# 0.972583
mean(res * correction(10))
# 0.9999216

The calculation for sample variance is an unbiased estimate of the
population variance, but square root is a nonlinear function and the square
root of an unbiased estimator is not itself necessarily unbiased.




>
> If you calculated
>
>
>    (t(Data_Normalized) %*% Data_Normalized)/(dim(Data_Normalized)[1]-1)
>
> then you would get the same result as you get from cor(Data) (to within
> about 1e-15).
>
> cheers,
>
> Rolf Turner
>
> --
> Rolf Turner
> Technical Editor ANZJS
>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



--
Joshua F. Wiley
Ph.D. Student, UCLA Department of Psychology
http://joshuawiley.com/
Senior Analyst, Elkhart Group Ltd.
http://elkhartgroup.com
Office: 260.673.5518

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: A basic statistics question

Rolf Turner
On 16/08/14 01:29, Joshua Wiley wrote:

>
> On Wed, Aug 13, 2014 at 7:41 AM, Rolf Turner <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 13/08/14 07:57, Ron Michael wrote:
>
>         Hi,
>
>         I would need to get a clarification on a quite fundamental
>         statistics property, hope expeRts here would not mind if I post
>         that here.
>
>         I leant that variance-covariance matrix of the standardized data
>         is equal to the correlation matrix for the unstandardized data.
>         So I used following data.
>
>
>     <SNIP>
>
>
>         (t(Data_Normalized) %*% Data_Normalized)/dim(Data___Normalized)[1]
>
>
>
>         Point is that I am not getting exact CORR matrix. Can somebody
>         point me what I am missing here?
>
>
>     You are using a denominator of "n" in calculating your "covariance"
>     matrix for your normalized data.  But these data were normalized
>     using the sd() function which (correctly) uses a denominator of n-1
>     so as to obtain an unbiased estimator of the population standard
>     deviation.
>
>
> As a small point n - 1 is not _quite_ an unbiased estimator of the
> population SD see Cureton. (1968).
> Unbiased Estimation of the Standard Deviation, The American
> Statistician, 22(1).
>
> To see this in action:
>
> res <- unlist(parLapply(cl, 1:1e7, function(i) sd(rnorm(10, mean = 0, sd
> = 1))))
> correction <- function(n) {
>      gamma((n-1)/2) * sqrt((n-1)/2) / gamma(n/2)
> }
> mean(res)
> # 0.972583
> mean(res * correction(10))
> # 0.9999216
>
> The calculation for sample variance is an unbiased estimate of the
> population variance, but square root is a nonlinear function and the
> square root of an unbiased estimator is not itself necessarily unbiased.


Aaaaarrrggghhh.  Yes of course.  I *know* that you don't get an unbiased
estimate of the sd by using n-1 in the denominator; you get an unbiased
estimate of the variance and as you say, sqrt() is a non-linear function
.....

I just didn't think carefully enuff before I wrote.  Thanks for pulling
me up on this error.

cheers,

Rolf


--
Rolf Turner
Technical Editor ANZJS

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.