On 03/09/2008 10:33 AM, Williams, Robin wrote:

> Hi,

> I am looking for a normality test in R to see if a vector of data I have

> can be assumed to be normally distributed and hence used in a linear

> regression.

In data suitable for standard linear regression, each observation is normally

distributed, but the mean varies from observation to observation. The

necessary assumption is that the errors are normally distributed with

zero mean, but the data itself also includes the non-random parts of the

model. The effect of the varying means is that the data will generally

*not* appear to come from a normal distribution if you just throw it all

into a vector and look at it.
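
A minimal sketch of this effect, using simulated data (the names x, y and fit, the seed, and the two-group design are all made up for illustration and are not from the original question). The errors are exactly normal, yet the pooled response looks far from normal because its mean depends on x:

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 500
x <- rep(c(0, 10), each = n / 2)  # two groups with very different means
y <- 1 + 2 * x + rnorm(n)         # errors are exactly normal

fit <- lm(y ~ x)

# Shapiro-Wilk on the raw response versus on the residuals
p_raw   <- shapiro.test(y)$p.value
p_resid <- shapiro.test(residuals(fit))$p.value

c(raw = p_raw, residuals = p_resid)
```

The raw response is a mixture of two well-separated normals, so the test on y rejects decisively even though nothing is wrong with the model.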

So let's assume you're working with residuals from a linear fit. The

residuals should be normally distributed with mean zero, but their

variances won't be equal. It may be that in a large dataset this will

be enough to get a false declaration of non-normality even with

perfectly normal errors. In a small dataset you'll rarely have enough

power to detect non-normality.
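
The unequal variances come from leverage: under the usual model the i-th residual has variance sigma^2 * (1 - h_ii), where h_ii is the i-th diagonal element of the hat matrix. A small sketch with simulated data (the design, with one deliberately extreme x value, is an assumption chosen to make the effect visible):

```r
set.seed(2)                       # arbitrary seed
n <- 20
x <- c(runif(n - 1), 5)           # one point far from the rest: high leverage
y <- 1 + 2 * x + rnorm(n)         # true error variance is 1 for every point

fit <- lm(y ~ x)
h <- hatvalues(fit)               # leverages h_ii

# Residual variances are proportional to 1 - h_ii, so they differ by point
range(1 - h)

# rstandard() rescales each residual by its own estimated standard deviation
r_manual <- residuals(fit) / (summary(fit)$sigma * sqrt(1 - h))
all.equal(r_manual, rstandard(fit))
```

Standardized residuals put the points on a common scale, which is why diagnostic plots are usually based on them rather than on the raw residuals.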

So overall, don't use something like shapiro.test for what you have in

mind. Any recent regression text should give advice on model

diagnostics that will do a better job.
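
As one possible starting point (not necessarily what any particular text recommends), base R's plot method for lm objects draws the usual graphical diagnostics. The data below are simulated purely for illustration:

```r
set.seed(3)                       # arbitrary seed
x <- runif(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

op <- par(mfrow = c(1, 2))        # two panels side by side
plot(fit, which = 1)              # residuals versus fitted values
plot(fit, which = 2)              # normal Q-Q plot of standardized residuals
par(op)                           # restore the previous graphics settings
```

Curvature or a funnel shape in the first panel, or a systematic departure from the line in the second, is usually more informative than a single p-value from a formal test.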

>> help.search("normality test")

> suggests the Shapiro test, ?shapiro.test.

> Now maybe I am interpreting things incorrectly (as is usually the case),

> am I right in assuming that this is a composite test for normality, and

> hence a high p-value would suggest that the sample is normally

> distributed?

A low p-value (e.g. p < 0.05) could suggest there is evidence of

non-normality, but p > 0.05 just shows a lack of evidence. In the case

where the data is truly normally distributed, you'd expect p to be

uniformly distributed between 0 and 1. (I have an article in the

current American Statistician suggesting ways to teach p-values to

emphasize this; unfortunately, it seems to be a surprise to a lot of

people.)
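
A quick simulation sketch of the uniformity claim (the sample size of 50, the 2000 replicates and the seed are arbitrary choices): repeatedly test truly normal samples and look at the distribution of the resulting p-values.

```r
set.seed(4)                       # arbitrary seed
# Each replicate draws a fresh normal sample and records the test's p-value
p <- replicate(2000, shapiro.test(rnorm(50))$p.value)

mean(p < 0.05)                    # should be close to 0.05
quantile(p, c(0.25, 0.5, 0.75))   # should be near 0.25, 0.5, 0.75
hist(p, breaks = 20)              # roughly flat histogram
```

This is why repeated calls to shapiro.test(rnorm(4500)) give very different p-values: with normal data, any value between 0 and 1 is about equally likely.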

Duncan Murdoch

> As a test I did

> shapiro.test(rnorm(4500))

> a few times, and achieved very different p-values, so I cannot be sure.

> I had assumed that a random sample of 4500 would have a very high

> p-value on all occasions but it appears not, this is interesting.

> Are there any other tests that people would recommend over this one in

> the base packages? I assume not as help.search did not suggest any.

> So am I right about a high p-value suggesting normality?

> Many thanks for any help.

>

>

> Robin Williams

> Met Office summer intern - Health Forecasting

> ______________________________________________

> [hidden email] mailing list

> https://stat.ethz.ch/mailman/listinfo/r-help

> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html

> and provide commented, minimal, self-contained, reproducible code.
