Numerical stability in chisq.test

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Numerical stability in chisq.test

yzan
The chisq.test on line 57 contains following code:
        STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))

However, based on book "Accuracy and stability of numerical algorithms" available from:
        http://ftp.demec.ufpr.br/CFD/bibliografia/Higham_2002_Accuracy%20and%20Stability%20of%20Numerical%20Algorithms.pdf
Table 4.1 on page 89, it is better to sort the data in increasing order than in decreasing order, when the data are non-negative.

An example:
        x = matrix(c(rep(1.1, 10000)), 10^16, nrow = 10001, ncol = 1)    # We have a vector with 10000*1.1 and 1*10^16
        c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))
The result:
        10000000000010996 10000000000011000
When we sort the data in the increasing order, we get the correct result. If we sort the data in the decreasing order, we get a result that is off by 4.

Shouldn't the sort be in the increasing order rather than in the decreasing order?

Best regards,
Jan Motl


PS: This post is based on discussion on https://stackoverflow.com/questions/47847295/why-does-chisq-test-sort-data-in-descending-order-before-summation and the response from the post to [hidden email].
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Numerical stability in chisq.test

Kurt Hornik-5
>>>>> Jan Motl writes:

> The chisq.test on line 57 contains following code:
> STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))

The preceding 2 lines seem relevant:

            ## Sorting before summing may look strange, but seems to be
            ## a sensible way to deal with rounding issues (PR#3486):
            STATISTIC <- sum(sort((x - E) ^ 2 / E, decreasing = TRUE))

-k

> However, based on book "Accuracy and stability of numerical algorithms" available from:
> http://ftp.demec.ufpr.br/CFD/bibliografia/Higham_2002_Accuracy%20and%20Stability%20of%20Numerical%20Algorithms.pdf
> Table 4.1 on page 89, it is better to sort the data in increasing order than in decreasing order, when the data are non-negative.

> An example:
> x = matrix(c(rep(1.1, 10000)), 10^16, nrow = 10001, ncol = 1)    # We have a vector with 10000*1.1 and 1*10^16
> c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))
> The result:
> 10000000000010996 10000000000011000
> When we sort the data in the increasing order, we get the correct result. If we sort the data in the decreasing order, we get a result that is off by 4.

> Shouldn't the sort be in the increasing order rather than in the decreasing order?

> Best regards,
> Jan Motl


> PS: This post is based on discussion on https://stackoverflow.com/questions/47847295/why-does-chisq-test-sort-data-in-descending-order-before-summation and the response from the post to [hidden email].
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Numerical stability in chisq.test

Peter Dalgaard-2


> On 28 Dec 2017, at 13:08 , Kurt Hornik <[hidden email]> wrote:
>
>>>>>> Jan Motl writes:
>
>> The chisq.test on line 57 contains following code:
>> STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))
>
> The preceding 2 lines seem relevant:
>
>            ## Sorting before summing may look strange, but seems to be
>            ## a sensible way to deal with rounding issues (PR#3486):
>            STATISTIC <- sum(sort((x - E) ^ 2 / E, decreasing = TRUE))
>
> -k

My thoughts too. PR 3486 is about simulated tables that theoretically have STATISTIC equal to the one observed, but come out slightly different, messing up the simulated p value. The sort is not actually intended to squeeze the very last bit of accuracy out of the computation, just to make sure that the round-off affects equivalent tables in the same way. "Fixing" the code may therefore unfix PR#3486; at the very least some care is required if this is modified.  

-pd


>
>> However, based on book "Accuracy and stability of numerical algorithms" available from:
>> http://ftp.demec.ufpr.br/CFD/bibliografia/Higham_2002_Accuracy%20and%20Stability%20of%20Numerical%20Algorithms.pdf
>> Table 4.1 on page 89, it is better to sort the data in increasing order than in decreasing order, when the data are non-negative.
>
>> An example:
>> x = matrix(c(rep(1.1, 10000)), 10^16, nrow = 10001, ncol = 1)    # We have a vector with 10000*1.1 and 1*10^16
>> c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))
>> The result:
>> 10000000000010996 10000000000011000
>> When we sort the data in the increasing order, we get the correct result. If we sort the data in the decreasing order, we get a result that is off by 4.
>
>> Shouldn't the sort be in the increasing order rather than in the decreasing order?
>
>> Best regards,
>> Jan Motl
>
>
>> PS: This post is based on discussion on https://stackoverflow.com/questions/47847295/why-does-chisq-test-sort-data-in-descending-order-before-summation and the response from the post to [hidden email].
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Numerical stability in chisq.test

yzan
Hi,

there is also PR#8224, which seems to be relevant. I executed the following code:

## Modify the function
chisq.test2 <- edit(chisq.test) # Modify to use increasing order of sorting at line 57


## PR#8224 (patological contingency table)
m <- matrix(c(1,0,7,16),2,2);

# Original
original <- chisq.test(m, sim=T)$p.value
for(i in (1:2000)){original <- c(original, chisq.test(m, sim=T)$p.value)}

# Modified
modified <- chisq.test2(m, sim=T)$p.value
for(i in (1:2000)){modified <- c(modified, chisq.test2(m, sim=T)$p.value)}

# Evaluation
t.test(original, modified)


## PR#3486 (invariance to transposition)
x <- rbind(c(149, 151), c(1, 8))

# Original
c2x <- chisq.test(x, sim=T, B=100000)$p.value
for(i in (1:200)){c2x<-c(c2x,chisq.test(x, sim=T,B=100000)$p.value)}
c2tx <- chisq.test(t(x), sim=T, B=100000)$p.value
for(i in (1:200)){c2tx<-c(c2tx,chisq.test(t(x), sim=T, B=100000)$p.value)}
sum(abs(c2x-c2tx))

# Modified
mc2x <- chisq.test2(x, sim=T, B=100000)$p.value
for(i in (1:200)){mc2x <- c(mc2x, chisq.test2(x, sim=T, B=100000)$p.value)}
mc2tx <- chisq.test2(t(x), sim=T, B=100000)$p.value
for(i in (1:200)){mc2tx <- c(mc2tx, chisq.test2(t(x), sim=T, B=100000)$p.value)}
sum(abs(mc2x-mc2tx))

# Evaluation
t.test((c2x-c2tx), (mc2x-mc2tx))

on two computers:
        1) OS: OS X 10.11.6, x86_64, darwin15.6.0; Version: R version 3.4.2 (2017-09-28)
        2) OS: Windows XP, i386, mingw32; Version: R version 3.4.3 (2017-11-30)

On both computers, the increasing and decreasing order return approximately the same results.

Best regards,
 Jan Motl

> My thoughts too. PR 3486 is about simulated tables that theoretically have STATISTIC equal to the one observed, but come out slightly different, messing up the simulated p value. The sort is not actually intended to squeeze the very last bit of accuracy out of the computation, just to make sure that the round-off affects equivalent tables in the same way. "Fixing" the code may therefore unfix PR#3486; at the very least some care is required if this is modified.  


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel