Kolmogorov Smirnov Test

kbrownk
I'm using ks.test (mydata, dnorm) on my data. I know some of my
different variable samples (mydata1, mydata2, etc) must be normally
distributed but the p value is always < 2.0^-16 (the 2.0 can change
but not the exponent).

I want to test mydata against a normal distribution. What could I be
doing wrong?

I tried instead using rnorm to create a normal distribution: y = rnorm
(68,mean=mydata, sd=mydata), where N= the sample size from mydata.
Then I ran the k-s: ks.test (mydata,y). Should this work?

One issue I had was that some of my data has a minimum value of 0, but
rnorm, run as I have it above, can generate negative numbers.

Also, some of my variables will likely be better tested against non-normal
distributions (uniform etc.), but I figure I should first learn
how to use ks.test at all.

I used to use SPSS but am really trying to jump into R instead, but I
find that the help assumes too much statistical knowledge.

I'm guessing I have a long road before I get this, so any bits of
information that may help me get a bit further will be appreciated!

Thanks,
kbrownk

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Kolmogorov Smirnov Test

Greg Snow-2
The way you are running the test, the null hypothesis is that the data come from a normal distribution with mean = 0 and standard deviation = 1.  If your minimum data value is 0, then it seems very unlikely that the mean is 0.  So the test is being strongly influenced by the mean and standard deviation, not just the shape of the distribution.

Note that the KS test was not designed to test against a distribution with parameters estimated from the same data (you can do the test, but it makes the p-value inaccurate).  You can do a little better by simulating the process and comparing the KS statistic to the simulations rather than looking at the computed p-value.
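
A minimal sketch of the simulation idea described above, assuming the null is "normal with unknown mean and sd". The helper names ks_stat_fitted and simulated_ks_pvalue are hypothetical, as are the seed and sample sizes. The observed KS statistic (with parameters estimated from the data) is compared against statistics computed the same way on simulated normal samples, which is the idea behind the Lilliefors variant of the test.

```r
# KS statistic against a normal whose mean and sd are estimated from x itself
ks_stat_fitted <- function(x) {
  unname(ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$statistic)
}

# Simulated p-value: share of simulated null statistics at least as large
# as the observed one. Standard normal samples suffice here because the
# fitted statistic does not change when the data are shifted or rescaled.
simulated_ks_pvalue <- function(x, n_sim = 1000) {
  d_obs <- ks_stat_fitted(x)
  d_sim <- replicate(n_sim, ks_stat_fitted(rnorm(length(x))))
  (1 + sum(d_sim >= d_obs)) / (n_sim + 1)
}

set.seed(42)                                   # arbitrary, for reproducibility
x_normal <- rnorm(68, mean = 10, sd = 2)       # stand-in for "mydata"
x_skewed <- rexp(200)                          # clearly non-normal sample

p_normal <- simulated_ks_pvalue(x_normal)
p_skewed <- simulated_ks_pvalue(x_skewed)
```

Only the statistic from ks.test is used; its printed p-value is ignored, since that is the part that becomes unreliable when the parameters are estimated.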

However you should ask yourself why you are doing the normality tests in the first place.  The common reasons that people do this don't match with what the tests actually test (see the fortunes on normality).

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111



Re: Kolmogorov Smirnov Test

kbrownk
Thanks for the feedback. My goal is to run a simple test to show that
the data cannot be rejected as either normally or uniformly
distributed (depending on the variable), which is what a previous K-S
test run using SPSS had shown. The actual distribution I compare my
sample to matters only in that it should be rejected if my data were
multimodal. This way I can suggest the data is from the same population. I
later run PCA and cluster analyses to confirm this, but I want an easy
stat to start with for the individual variables.

I didn't think I was comparing my data against itself, but rather
against a normal distribution with the same mean and standard deviation.
Using the mean seems necessary, so is it incorrect to have the same
standard deviation too? I need to go back and read up on the K-S test to
see what the appropriate constraints are before bothering anyone for
more help. Sorry, I thought I had it.

Thanks again,
kbrownk


Re: Kolmogorov Smirnov Test

ted.harding-3
On 11-Nov-10 04:22:55, Kerry wrote:
> I'm using ks.test (mydata, dnorm) on my data.

I think your problem may lie here! If you look at the documentation
for ks.test, available with the command:
  help("ks.test")
or simply:
  ?ks.test
you will read the following near the beginning:

Usage: ks.test(x, y, ...,
Arguments:
       x: a numeric vector of data values.
       y: either a numeric vector of data values, or a character string
          naming a cumulative distribution function or an actual
          cumulative distribution function such as 'pnorm'.

Note *cumulative* and *'pnorm'*. You say that you used 'dnorm'.
"dnorm" is R's name for the *density* function of the Normal
distribution, while the name for the *cumulative distribution*
function is "pnorm". So try the K-S test instead with

  ks.test(mydata, pnorm, ... )

where (as also stated in '?ks.test') the "..." is to be replaced
by a list of values for the parameters of the named cumulative
distribution. For example (since the parameters for pnorm are
its mean and SD):

   ks.test(mydata, pnorm, mean(mydata), sd(mydata) )

A toy example (comparing the two usages):

## First, using pnorm as above:
  Y <- rnorm(200)
  ks.test(Y,"pnorm",mean(Y),sd(Y))
  #         One-sample Kolmogorov-Smirnov test
  # data:  Y
  # D = 0.0251, p-value = 0.9996
  # alternative hypothesis: two-sided
## Note the nice P-value

## Next, using dnorm as you wrote:
 ks.test(Y,"dnorm",mean(Y),sd(Y))
  #         One-sample Kolmogorov-Smirnov test
  # data:  Y
  # D = 0.9965, p-value < 2.2e-16
  # alternative hypothesis: two-sided
## (Note the similarity to the p-values you report)!
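
To see why the 'dnorm' call gives D close to 1, here is a rough sketch that computes the K-S distance by hand. The helper ks_distance and the seed are assumptions made for illustration; the formula is the usual largest gap between the empirical step function and the supplied function, checked on both sides of each jump.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
y  <- rnorm(200)
n  <- length(y)
ys <- sort(y)
i  <- seq_len(n)

# Largest gap between the supplied function values (at the sorted data)
# and the empirical CDF, just after and just before each jump.
ks_distance <- function(f_values) {
  max(pmax(i / n - f_values, f_values - (i - 1) / n))
}

d_pnorm <- ks_distance(pnorm(ys, mean(y), sd(y)))  # a genuine CDF, rises to 1
d_dnorm <- ks_distance(dnorm(ys, mean(y), sd(y)))  # a density, never gets near 1 here

# The same two statistics as reported by ks.test
d_pnorm_ks <- unname(ks.test(y, "pnorm", mean(y), sd(y))$statistic)
d_dnorm_ks <- unname(ks.test(y, "dnorm", mean(y), sd(y))$statistic)
```

The empirical CDF climbs to 1 at the largest observation while the density is tiny out in the tail, so the gap there is nearly 1 whatever the data look like.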

For the details of 'dnorm', 'pnorm' and the like, see the help at:

   ?dnorm
or
   ?pnorm

(both lead to the same page). Granted, for a newcomer to R the
documentation (which often relies heavily on cross-referencing,
and sometimes the cross-references can be difficult to identify)
can be difficult to get to grips with. So look on this (which is
one of the easier cases) as an initiation into getting to grips
with R.
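
The same pattern should carry over to the uniform case mentioned in the original question, using 'punif' and its 'min' and 'max' parameters. This is only a sketch: the data are simulated, and the bounds 2 and 5 are assumed to be known in advance rather than read off the sample.

```r
set.seed(2)                                  # arbitrary, for reproducibility
x <- runif(100, min = 2, max = 5)            # stand-in for a uniform variable

# Bounds passed explicitly through "...", as with mean and sd for pnorm
p_right <- ks.test(x, "punif", min = 2, max = 5)$p.value

# Bounds omitted: punif defaults to min = 0, max = 1, so the null is
# "uniform on [0, 1]" and data lying in [2, 5] are rejected outright
p_default <- ks.test(x, "punif")$p.value
```

As with pnorm, leaving the parameters out does not make the test "shape only"; it silently tests against the default parameter values.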

Hoping this helps,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <[hidden email]>
Fax-to-email: +44 (0)870 094 0861
Date: 11-Nov-10                                       Time: 09:46:52
------------------------------ XFMail ------------------------------


Re: Kolmogorov Smirnov Test

Greg Snow-2
Consider the following simulations (also fixing the pnorm instead of dnorm that Ted pointed out and I missed):

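# Null fully specified in advance: mean and sd are known, not estimated.
out1 <- replicate(10000, {
        x <- rnorm(1000, 100, 3)
        ks.test( x, pnorm, mean=100, sd=3 )$p.value
        } )

# Parameters estimated from the same sample that is being tested.
out2 <- replicate(10000, {
        x <- rnorm(1000, 100, 3)
        ks.test( x, pnorm, mean=mean(x), sd=sd(x) )$p.value
        } )

# Under a true null, both histograms should look flat (uniform).
par(mfrow=c(2,1))
hist(out1)
hist(out2)

# Proportion of simulations rejected at alpha = 0.05.
mean(out1 <= 0.05 )
mean(out2 <= 0.05 )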


In both cases the null hypothesis is true (or at least a meaningful approximation to true) so the p-values should follow a uniform distribution.  In the case of out1 where the mean and sd are specified as part of the null the p-values are reasonably uniform and the rejection rate is close to alpha (should asymptotically approach alpha as the number of simulations increases).  However looking at out2, where the parameters are set not by outside knowledge or tests, but rather from the observed data, the p-values are clearly not uniform and the rejection rate is far from alpha.





Re: Kolmogorov Smirnov Test

kbrownk
Thanks Ted and Greg. I had actually tried pnorm and after having
problems, thought maybe I was misunderstanding dnorm as a variable in
ks.test due to over- (more likely under) thinking it. I'm assuming now
that ks.test will consider my data in cumulative form (makes sense now
that I think about it, but I didn't want to assume any steps that the
R version of k-s test takes). I plan to explore the ideas and run the
simulations you sent in full over the weekend.

Thanks again!
Kerry
