

Hi all,
This question is partly about R (I'm not sure whether I used an R function
correctly), but also about stats in general. I'm sorry if this is
considered off-topic.
I'm currently working on a data set with two sets of samples. The csv file
of the data can be found here: http://pastebin.com/200v10py
I would like to use the KS test to see whether these two sets of samples come from
different distributions.
I ran the following R script:
# read data from the file
> data = read.csv('data.csv')
> ks.test(data[[1]], data[[2]])
Two-sample Kolmogorov-Smirnov test
data: data[[1]] and data[[2]]
D = 0.025, p-value = 0.9132
alternative hypothesis: two-sided
The KS test shows that these two samples are very similar. (In fact, they
should come from the same distribution.)
However, for various reasons, the actual data I will get will not be the raw
values but normalized ones (zero mean, unit variance). So I tried
to normalize the raw data I have and ran the KS test again:
> ks.test(scale(data[[1]]), scale(data[[2]]))
Two-sample Kolmogorov-Smirnov test
data: scale(data[[1]]) and scale(data[[2]])
D = 0.3273, p-value < 2.2e-16
alternative hypothesis: two-sided
The p-value becomes almost zero after normalization, indicating that these two
samples are significantly different (i.e., from different distributions).
My question is: how could normalization make two similar samples
become different from each other? I can see that if two samples are
different, normalization could make them similar. However, if two sets
of data are similar, then intuitively, applying the same operation to both
should leave them similar, or at least not make them differ too
much.
I did some further analysis of the data. I also tried to normalize the
data into the [0,1] range (using the formula (x - min(x))/(max(x) - min(x))), but
the same thing happened. At first, I thought outliers might have caused this
problem (I can see that an outlier could cause this problem if I normalize
the data into the [0,1] range), so I deleted all data whose absolute value is larger
than 4 standard deviations. But it still didn't help.
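For concreteness, the two transformations in question can be written as small R helpers (the helper names are illustrative, not from the original data set):

```r
# The two normalizations referred to above (helper names are illustrative):
to_unit_range <- function(x) (x - min(x)) / (max(x) - min(x))   # map into [0, 1]
drop_outliers <- function(x) x[abs(x - mean(x)) <= 4 * sd(x)]   # keep |z| <= 4 SD

to_unit_range(c(2, 5, 11))    # endpoints map to 0 and 1
```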
Plus, I even plotted the eCDFs, and they *really* look the same to me even
after normalization. Is anything wrong with my usage of the R function?
Since the data contains ties, I also tried ks.boot (
http://sekhon.berkeley.edu/matching/ks.boot.html ), but I got the same
result.
Could anyone help me explain why this happened? Also, do you have any
suggestions for hypothesis testing on normalized data? (The data I have right
now is simulated. In the real world, I cannot get the raw data, only the
normalized values.)
Regards,
Monnand
______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


The main issue is that, while the original distributions are the same, you shift the two samples *by different amounts* (by about 0.01 SD), and you have a large sample size (n = 1000). Thus the new distributions are not the same.
This is a problem with testing for equality of distributions. With large samples, even a small deviation is significant.
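A small simulation (illustrative only, not the poster's data) shows this large-sample effect: hold a tiny location shift fixed and let n grow, and the two-sample KS p-value collapses.

```r
# Illustrative simulation: a fixed, tiny shift between two otherwise
# identical normal distributions. The shift never changes, but the
# KS p-value shrinks toward zero as the sample size grows.
set.seed(1)
shift <- 0.1
for (n in c(100, 1000, 100000)) {
  p <- ks.test(rnorm(n), rnorm(n, mean = shift))$p.value
  cat(sprintf("n = %6d   p-value = %.3g\n", n, p))
}
```

With the deviation held constant, only the sample size drives the significance, which is why 1000 points can look "significantly different" while a 100-point subsample does not.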
Chris
-----Original Message-----
From: Monnand [mailto: [hidden email]]
Sent: Sunday, January 11, 2015 10:13 PM
To: [hidden email]
Subject: [R] two-sample KS test: data becomes significantly different after normalization
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues


Thank you, Chris!
I think it is exactly the problem you mentioned. I didn't think of
1000 data points as a large sample at first.
I downsampled the data from 1000 points to 100 points and ran the KS test
again. It worked as expected. Is there any typical method for comparing
two large samples? I also tried KL divergence, but it only gives me a
number and does not tell me how large the distance should be to be
considered significantly different.
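One option (a sketch, with an arbitrary illustrative margin, not an established rule): report the KS statistic D itself as a distance in [0, 1], since D does not blow up with n the way the p-value collapses, and compare it to a margin chosen for the application.

```r
# Hypothetical helper: judge two large samples by the KS distance D
# (a number in [0, 1]) against an application-chosen margin, rather
# than by the p-value. suppressWarnings() handles ties warnings.
ks_distance <- function(a, b) {
  unname(suppressWarnings(ks.test(a, b))$statistic)
}

set.seed(42)
ks_distance(rnorm(1e4), rnorm(1e4, mean = 0.01))  # tiny shift: D stays small
ks_distance(rnorm(1e4), rnorm(1e4, mean = 1))     # real shift: D is large
```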
Regards,
Monnand
On Mon, Jan 12, 2015 at 9:32 AM, Andrews, Chris < [hidden email]> wrote:


This sounds more like quality control than hypothesis testing. Rather than statistical significance, you want to determine what is an acceptable difference (an 'equivalence margin', if you will). And that is a question about the application, not a statistical one.
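As a sketch of what such an equivalence-style check could look like (the margin and the bootstrap scheme here are illustrative assumptions, not a standard named procedure): resample both samples, take an upper quantile of the bootstrapped KS distance, and declare the samples "close enough" only if that bound stays under the chosen margin.

```r
# Illustrative equivalence-style check (not a standard named test):
# declare two samples equivalent if an upper bootstrap bound on the
# KS distance D stays below a margin chosen for the application.
ks_equivalent <- function(a, b, margin = 0.1, B = 200) {
  d_boot <- replicate(B, suppressWarnings(
    ks.test(sample(a, replace = TRUE),
            sample(b, replace = TRUE))$statistic))
  unname(quantile(d_boot, 0.95)) < margin
}

set.seed(7)
ks_equivalent(rnorm(2000), rnorm(2000))            # same distribution
ks_equivalent(rnorm(2000), rnorm(2000, mean = 1))  # shifted by 1 SD
```

The margin itself has to come from the application, as noted above; 0.1 is purely a placeholder.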
________________________________________
From: Monnand [ [hidden email]]
Sent: Monday, January 12, 2015 10:14 PM
To: Andrews, Chris
Cc: [hidden email]
Subject: Re: [R] two-sample KS test: data becomes significantly different after normalization


I know this must be a wrong method, but I cannot help asking: can I just
use the p-value from the KS test, saying that if the p-value is greater than
\beta, then the two samples are from the same distribution? If the definition
of the p-value is the probability that the null hypothesis is true, then why
do so few people use the p-value as a "true" probability? E.g., normally,
people will not multiply or add p-values to get the probability that two
independent null hypotheses are both true, or that one of them is true. I
have had this question for a very long time.
Monnand
On Tue Jan 13 2015 at 2:47:30 PM Andrews, Chris < [hidden email]>
wrote:


>>>>> Monnand < [hidden email]>
>>>>> on Wed, 14 Jan 2015 07:17:02 +0000 writes:
> I know this must be a wrong method, but I cannot help to ask: Can I only
> use the p-value from KS test, saying if p-value is greater than \beta, then
> two samples are from the same distribution. If the definition of p-value is
> the probability that the null hypothesis is true,
Ouch, ouch, ouch, ouch!!!!!!!!
The worst misuse/misunderstanding of statistics, now even on R-help...
Please get help from a statistician!
And erase that sentence from your mind (unless you are a pro
and want to keep it for anecdotal or didactical purposes...).


Your definition of the p-value is not correct. See, for example, http://en.wikipedia.org/wiki/P-value#Misunderstandings
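A quick simulated sanity check (illustrative) of why that definition cannot be right: when the null hypothesis is true by construction, the p-value does not sit near 1 ("null is true"); it is spread roughly uniformly over [0, 1].

```r
# Simulate the null: both samples always come from the same N(0, 1).
# If the p-value were "P(null is true)", it should be near 1 every time;
# instead it is roughly uniform on [0, 1] by construction of the test.
set.seed(123)
pvals <- replicate(1000, ks.test(rnorm(50), rnorm(50))$p.value)
mean(pvals < 0.05)   # roughly the nominal false-positive rate, 0.05
```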
Original Message
From: Monnand [mailto: [hidden email]]
Sent: Wednesday, January 14, 2015 2:17 AM
To: Andrews, Chris
Cc: [hidden email]
Subject: Re: [R] two-sample KS test: data becomes significantly different after normalization
I know this must be a wrong method, but I cannot help to ask: Can I only
use the pvalue from KS test, saying if pvalue is greater than \beta, then
two samples are from the same distribution. If the definition of pvalue is
the probability that the null hypothesis is true, then why there's little
people uses pvalue as a "true" probability. e.g. normally, people will not
multiply or add pvalues to get the probability that two independent null
hypothesis are both true or one of them is true. I had this question for
very long time.
Monnand
On Tue Jan 13 2015 at 2:47:30 PM Andrews, Chris < [hidden email]>
wrote:
> This sounds more like quality control than hypothesis testing. Rather
> than statistical significance, you want to determine what is an acceptable
> difference (an 'equivalence margin', if you will). And that is a question
> about the application, not a statistical one.
> ________________________________________
> From: Monnand [ [hidden email]]
> Sent: Monday, January 12, 2015 10:14 PM
> To: Andrews, Chris
> Cc: [hidden email]
> Subject: Re: [R] twosample KS test: data becomes significantly different
> after normalization
>
> Thank you, Chris!
>
> I think it is exactly the problem you mentioned. I did consider
> 1000point data is a large one at first.
>
> I downsampled the data from 1000 points to 100 points and ran KS test
> again. It worked as expected. Is there any typical method to compare
> two large samples? I also tried KL diverge, but it only gives me some
> number but does not tell me how large the distance is should be
> considered as significantly different.
>
> Regards,
> Monnand
>
> On Mon, Jan 12, 2015 at 9:32 AM, Andrews, Chris < [hidden email]>
> wrote:
> >
> > The main issue is that the original distributions are the same, you
> shift the two samples *by different amounts* (about 0.01 SD), and you have
> a large (n=1000) sample size. Thus the new distributions are not the same.
> >
> > This is a problem with testing for equality of distributions. With
> large samples, even a small deviation is significant.
> >
> > Chris
> >
> > Original Message
> > From: Monnand [mailto: [hidden email]]
> > Sent: Sunday, January 11, 2015 10:13 PM
> > To: [hidden email]
> > Subject: [R] twosample KS test: data becomes significantly different
> after normalization
> >
> > Hi all,
> >
> > This question is sort of related to R (I'm not sure if I used an R
> function
> > correctly), but also related to stats in general. I'm sorry if this is
> > considered as offtopic.
> >
> > I'm currently working on a data set with two sets of samples. The csv
> file
> > of the data could be found here: http://pastebin.com/200v10py> >
> > I would like to use KS test to see if these two sets of samples are from
> > different distributions.
> >
> > I ran the following R script:
> >
> > # read data from the file
> >> data = read.csv('data.csv')
> >> ks.test(data[[1]], data[[2]])
> > Twosample KolmogorovSmirnov test
> >
> > data: data[[1]] and data[[2]]
> > D = 0.025, pvalue = 0.9132
> > alternative hypothesis: twosided
> > The KS test shows that these two samples are very similar. (In fact, they
> > should come from same distribution.)
> >
> > However, due to some reasons, instead of the raw values, the actual data
> > that I will get will be normalized (zero mean, unit variance). So I tried
> > to normalize the raw data I have and run the KS test again:
> >
> >> ks.test(scale(data[[1]]), scale(data[[2]]))
> > Twosample KolmogorovSmirnov test
> >
> > data: scale(data[[1]]) and scale(data[[2]])
> > D = 0.3273, pvalue < 2.2e16
> > alternative hypothesis: twosided
> > The pvalue becomes almost zero after normalization indicating these two
> > samples are significantly different (from different distributions).
> >
> > My question is: how could normalization make two similar samples
> > become different from each other? I can see that if two samples are
> > different, then normalization could make them similar. However, if two
> > sets of data are similar, then intuitively, applying the same operation
> > to both should leave them similar, or at least not make them differ
> > from each other too much.
> >
> > I did some further analysis of the data. I also tried normalizing the
> > data into the [0,1] range (using the formula (x - min(x)) / (max(x) - min(x))),
> > but the same thing happened. At first, I thought outliers caused this
> > problem (I can see how an outlier could cause it when normalizing into
> > the [0,1] range), so I deleted all data whose absolute value is larger
> > than 4 standard deviations. But it still didn't help.
> >
> > Plus, I even plotted the eCDFs, and they *really* look the same to me even
> > after normalization. Is anything wrong with my usage of the R function?
> >
> > Since the data contains ties, I also tried ks.boot (
> > http://sekhon.berkeley.edu/matching/ks.boot.html ), but I got the same
> > result.
> >
> > Could anyone help me explain why this happened? Also, any suggestions
> > about hypothesis testing on normalized data? (The data I have right now is
> > simulated data. In the real world, I cannot get the raw data, only the
> > normalized version.)
> >
> > Regards,
> > Monnand
> >
> >
> >
> > **********************************************************
> > Electronic Mail is not secure, may not be read every day, and should not
> be used for urgent or sensitive issues
>
>
______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Thank you, Chris and Martin!
On Wed Jan 14 2015 at 7:31:12 AM Andrews, Chris < [hidden email]>
wrote:
> Your definition of p-value is not correct. See, for example,
> http://en.wikipedia.org/wiki/P-value#Misunderstandings
>
> Original Message
> From: Monnand [mailto: [hidden email]]
> Sent: Wednesday, January 14, 2015 2:17 AM
> To: Andrews, Chris
> Cc: [hidden email]
> Subject: Re: [R] two-sample KS test: data becomes significantly different
> after normalization
>
> I know this must be a wrong method, but I cannot help asking: can I just
> use the p-value from the KS test, saying that if the p-value is greater than
> \beta, then the two samples are from the same distribution? If the
> definition of the p-value is the probability that the null hypothesis is
> true, then why do so few people use the p-value as a "true" probability?
> E.g., people normally do not multiply or add p-values to get the probability
> that two independent null hypotheses are both true, or that one of them is
> true. I have had this question for a very long time.
>
> Monnand
>
> On Tue Jan 13 2015 at 2:47:30 PM Andrews, Chris < [hidden email]>
> wrote:
>
> > This sounds more like quality control than hypothesis testing. Rather
> > than statistical significance, you want to determine what is an acceptable
> > difference (an 'equivalence margin', if you will). And that is a question
> > about the application, not a statistical one.
> > ________________________________________
> > From: Monnand [ [hidden email]]
> > Sent: Monday, January 12, 2015 10:14 PM
> > To: Andrews, Chris
> > Cc: [hidden email]
> > Subject: Re: [R] two-sample KS test: data becomes significantly different
> > after normalization
> >
> > Thank you, Chris!
> >
> > I think it is exactly the problem you mentioned. I did not realize at
> > first that 1000-point data counts as a large sample.
> >
> > I downsampled the data from 1000 points to 100 points and ran the KS test
> > again. It worked as expected. Is there any typical method for comparing
> > two large samples? I also tried the KL divergence, but it only gives me a
> > number and does not tell me how large the distance must be to count as
> > significantly different.
> >
> > Regards,
> > Monnand
> >
> > On Mon, Jan 12, 2015 at 9:32 AM, Andrews, Chris < [hidden email]>
> > wrote:
> > >
> > > The main issue is that the original distributions are the same, but you
> > > shift the two samples *by different amounts* (about 0.01 SD), and you
> > > have a large (n=1000) sample size. Thus the new distributions are not
> > > the same.
> > >
> > > This is a problem with testing for equality of distributions. With
> > large samples, even a small deviation is significant.
> > >
> > > Chris