Normality test

Williams, Robin
Hi,
I am looking for a normality test in R to see if a vector of data I have
can be assumed to be normally distributed and hence used in a linear
regression.
> help.search("normality test")
suggests the Shapiro test, ?shapiro.test.
Maybe I am interpreting things incorrectly (as is usually the case), but
am I right in assuming that this is a composite test for normality, and
hence that a high p-value would suggest that the sample is normally
distributed? As a test I ran
shapiro.test(rnorm(4500))
a few times and got very different p-values, so I cannot be sure.
I had assumed that a random normal sample of 4500 would have a very high
p-value on every occasion, but it appears not, which is interesting.
  Are there any other tests that people would recommend over this one in
the base packages? I assume not as help.search did not suggest any.
  So am I right about a high p-value suggesting normality?
Many thanks for any help.  
 

Robin Williams
Met Office summer intern - Health Forecasting
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Normality test

Duncan Murdoch
On 03/09/2008 10:33 AM, Williams, Robin wrote:
> Hi,
> I am looking for a normality test in R to see if a vector of data I have
> can be assumed to be normally distributed and hence used in a linear
> regression.

In raw data that is suitable for standard linear regression, each
observation is normally distributed, but the mean varies from observation
to observation.  The necessary assumption is that the errors are normally
distributed with zero mean, but the data itself also includes the
non-random parts of the model.  The effect of the varying means is that
the data will generally *not* appear to come from a normal distribution
if you just throw it all into a vector and look at it.
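
A minimal sketch of this point on simulated data. The sample size, coefficients, seed and variable names are arbitrary assumptions, not taken from the thread. The errors are exactly normal, yet the pooled response fails the test because its mean moves with x:

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
n <- 500
x <- rep(c(0, 10), each = n / 2)  # two groups with very different means
y <- 1 + 2 * x + rnorm(n)         # errors are exactly N(0, 1)

# Pooled response: a mixture of two normals, so clearly bimodal.
p_raw <- shapiro.test(y)$p.value

# Residuals from the fitted line: the varying mean has been removed.
fit <- lm(y ~ x)
p_resid <- shapiro.test(residuals(fit))$p.value

c(raw = p_raw, residuals = p_resid)
```

The p-value for the raw vector should be essentially zero, while the one for the residuals is just a draw from (roughly) a uniform distribution and can land anywhere.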

So let's assume you're working with residuals from a linear fit.  The
residuals should be normally distributed with mean zero, but their
variances won't be equal.  It may be that in a large dataset this will
be enough to get a false declaration of non-normality even with
perfectly normal errors.  In a small dataset you'll rarely have enough
power to detect non-normality.
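
To see why the residual variances differ: under the usual model, residual i has variance sigma^2 * (1 - h_i), where h_i is the leverage of observation i. A rough sketch on simulated data (the design with one extreme x value is an illustrative assumption):

```r
set.seed(1)
n <- 30
x <- c(runif(n - 1), 5)      # one high-leverage point at x = 5
y <- 1 + 2 * x + rnorm(n)    # true error variance sigma^2 = 1
fit <- lm(y ~ x)

h <- hatvalues(fit)          # leverages h_i (diagonal of the hat matrix)
resid_var_factor <- 1 - h    # Var(residual_i) = sigma^2 * (1 - h_i)
range(resid_var_factor)      # far from constant when leverage varies

# rstandard() divides each residual by s * sqrt(1 - h_i), which puts
# the residuals back on a common scale.
r_std <- rstandard(fit)
```

With a balanced design the leverages are nearly equal and the effect is small; it matters most when a few points sit far from the rest in x.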

So overall, don't use something like shapiro.test for what you have in
mind.  Any recent regression text should give advice on model
diagnostics that will do a better job.
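
One common diagnostic of this kind (an assumption about what a regression text would suggest, not something specified in this reply) is a normal Q-Q plot of the standardized residuals. A sketch on simulated data:

```r
set.seed(2)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200)
fit <- lm(y ~ x)

r  <- rstandard(fit)               # standardized residuals
qq <- qqnorm(r, plot.it = FALSE)   # Q-Q coordinates without drawing
qq_cor <- cor(qq$x, qq$y)          # near 1 when the points lie on a line

# In an interactive session, the usual visual checks would be:
# plot(fit, which = 1)   # residuals vs fitted values
# plot(fit, which = 2)   # normal Q-Q plot of standardized residuals
```

The plots show how the residuals depart from normality (skewness, heavy tails, outliers), which a single p-value cannot.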

>> help.search("normality test")
> suggests the Shapiro test, ?shapiro.test.
> Now maybe I am interpreting things incorrectly (as is usually the case),
> am I right in assuming that this is a composite test for normality, and
> hence a high p-value would suggest that the sample is normally
> distributed?

A low p-value (e.g. p < 0.05) could suggest there is evidence of
non-normality, but p > 0.05 just shows a lack of evidence.  In the case
where the data is truly normally distributed, you'd expect p to be
uniformly distributed between 0 and 1.  (I have an article in the
current American Statistician suggesting ways to teach p-values to
emphasize this; unfortunately, it seems to be a surprise to a lot of
people.)
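
A quick simulation of this, as a sketch only (the sample size of 100, the 2000 replicates and the seed are arbitrary choices): repeat the test on truly normal data and look at how the p-values spread out.

```r
set.seed(123)
p <- replicate(2000, shapiro.test(rnorm(100))$p.value)

mean(p < 0.05)                    # should be close to 0.05
quantile(p, c(0.25, 0.5, 0.75))   # roughly 0.25, 0.5, 0.75 if uniform
# hist(p)                         # roughly flat in an interactive session
```

This is why repeated calls to shapiro.test(rnorm(4500)) give very different p-values: under the null, every value between 0 and 1 is about equally likely.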

Duncan Murdoch

Re: Normality test

Greg Snow-2
In reply to this post by Williams, Robin
What is the distribution of the p-value when the null hypothesis is true?

This is an important question that unfortunately tends to get glossed over or left out completely in many courses due to the amount of information that needs to be packed into them.

For most appropriate tests, when the null hypothesis is true and all other assumptions hold, the p-value is distributed as uniform(0,1).  Hence the probability of a type I error is alpha for any value of alpha.  Therefore, when the null is true, a p-value near 0.99, 0.051, 0.049, or 0.0001 is equally likely.
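
A small simulation sketch of the "alpha for any value of alpha" statement, using a one-sample t-test on data for which the null is true. The choice of test, sample size, replicate count and seed are illustrative assumptions:

```r
set.seed(7)
# One-sample t-test of mean 0 on data whose true mean is 0: the null holds.
p <- replicate(5000, t.test(rnorm(20))$p.value)

alpha <- c(0.01, 0.05, 0.20, 0.50)
type1 <- sapply(alpha, function(a) mean(p < a))
rbind(alpha, type1)   # each rejection rate should sit near its alpha
```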

If you want a high p-value for a normality test, just collect only 1 data point. No matter what its value is, it is completely consistent with the assumption that it came from some normal distribution (p-value=1).

For large sample sizes the important question is not "did this data come from an exact normal distribution?", but rather, "Is the distribution this data came from close enough to normal?".
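
As a rough illustration of that distinction (the t distribution with 5 degrees of freedom is just an assumed example of a bell-shaped but not exactly normal distribution), compare the test on a small and a large sample from the same source:

```r
set.seed(99)
# t distribution with 5 degrees of freedom: bell-shaped, moderately heavy tails.
small <- rt(30, df = 5)
large <- rt(5000, df = 5)   # 5000 is the largest n shapiro.test accepts

p_small <- shapiro.test(small)$p.value
p_large <- shapiro.test(large)$p.value
c(n30 = p_small, n5000 = p_large)
```

With 5000 points the test will typically reject decisively, while with 30 points it often will not, even though both samples come from the same distribution. Whether that departure matters for the analysis at hand is a separate question that the p-value does not answer.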

If you really feel the need for a test of normality in large sample sizes, then see this post:
http://finzi.psych.upenn.edu/R/Rhelp02a/archive/136160.html

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
(801) 408-8111


