chisq.test contains the following code on line 57:
    STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))

However, based on the book "Accuracy and Stability of Numerical Algorithms", available from:
http://ftp.demec.ufpr.br/CFD/bibliografia/Higham_2002_Accuracy%20and%20Stability%20of%20Numerical%20Algorithms.pdf
Table 4.1 on page 89, it is better to sort the data in increasing order than in decreasing order when the data are non-negative.

An example:

    x <- matrix(c(rep(1.1, 10000), 10^16), nrow = 10001, ncol = 1)  # a vector with 10000 * 1.1 and 1 * 10^16
    c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))

The result:

    10000000000010996 10000000000011000

When we sort the data in increasing order, we get the correct result. If we sort the data in decreasing order, the result is off by 4.

Shouldn't the sort be in increasing order rather than in decreasing order?

Best regards,
Jan Motl

PS: This post is based on the discussion at https://stackoverflow.com/questions/47847295/why-does-chisq-test-sort-data-in-descending-order-before-summation and the response from the post to [hidden email].

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
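The order effect is a property of IEEE-754 doubles, not of R. A minimal sketch in Python with plain left-to-right accumulation (note: R's sum() typically accumulates in extended precision on x86, so the R figures quoted above show smaller errors than this sketch):

```python
# Summation order with IEEE-754 doubles: adding many small terms to one
# large term in decreasing order loses their contribution step by step,
# while increasing order accumulates the small terms first.

def naive_sum(xs):
    """Sum left to right in double precision, no compensation."""
    s = 0.0
    for v in xs:
        s += v
    return s

small, big = [1.1] * 10000, [1e16]
print(naive_sum(small + big))  # 1.0000000000011e+16 -- increasing order, correctly rounded
print(naive_sum(big + small))  # 1.000000000002e+16  -- decreasing order, off by 9000
```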
>>>>> Jan Motl writes:
> The chisq.test on line 57 contains following code:
> STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))

The preceding 2 lines seem relevant:

    ## Sorting before summing may look strange, but seems to be
    ## a sensible way to deal with rounding issues (PR#3486):
    STATISTIC <- sum(sort((x - E) ^ 2 / E, decreasing = TRUE))

-k
> On 28 Dec 2017, at 13:08, Kurt Hornik <[hidden email]> wrote:
>
>>>>>> Jan Motl writes:
>
>> The chisq.test on line 57 contains following code:
>> STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))
>
> The preceding 2 lines seem relevant:
>
>     ## Sorting before summing may look strange, but seems to be
>     ## a sensible way to deal with rounding issues (PR#3486):
>     STATISTIC <- sum(sort((x - E) ^ 2 / E, decreasing = TRUE))
>
> -k

My thoughts too. PR#3486 is about simulated tables that theoretically have a STATISTIC equal to the one observed, but come out slightly different, messing up the simulated p-value. The sort is not actually intended to squeeze the very last bit of accuracy out of the computation, just to make sure that the round-off affects equivalent tables in the same way. "Fixing" the code may therefore unfix PR#3486; at the very least, some care is required if this is modified.

-pd
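Peter's point about equivalent tables can be illustrated with a small sketch (Python, with made-up contribution values): a table and its transpose yield the same multiset of (x - E)^2/E terms in a different order, and floating-point addition is not associative, so the naive sums can disagree; summing after sorting, in either direction, canonicalizes the order so that equivalent tables round identically.

```python
# Sketch of the PR#3486 rationale. The contribution values below are
# invented for illustration; what matters is that "terms" and
# "transposed" contain the same values in different orders.

def naive_sum(xs):
    s = 0.0
    for v in xs:
        s += v
    return s

terms = [0.1, 0.7, 1e15, 0.3, 2.2]       # "row-major" order of contributions
transposed = [1e15, 0.3, 0.1, 2.2, 0.7]  # same values, "column-major" order

print(naive_sum(terms) == naive_sum(transposed))                  # False
print(naive_sum(sorted(terms)) == naive_sum(sorted(transposed)))  # True
```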
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]
Hi,
there is also PR#8224, which seems to be relevant. I executed the following code:

    ## Modify the function
    chisq.test2 <- edit(chisq.test)  # modify to use increasing order of sorting at line 57

    ## PR#8224 (pathological contingency table)
    m <- matrix(c(1, 0, 7, 16), 2, 2)

    # Original
    original <- chisq.test(m, sim = T)$p.value
    for (i in 1:2000) {original <- c(original, chisq.test(m, sim = T)$p.value)}

    # Modified
    modified <- chisq.test2(m, sim = T)$p.value
    for (i in 1:2000) {modified <- c(modified, chisq.test2(m, sim = T)$p.value)}

    # Evaluation
    t.test(original, modified)

    ## PR#3486 (invariance to transposition)
    x <- rbind(c(149, 151), c(1, 8))

    # Original
    c2x <- chisq.test(x, sim = T, B = 100000)$p.value
    for (i in 1:200) {c2x <- c(c2x, chisq.test(x, sim = T, B = 100000)$p.value)}
    c2tx <- chisq.test(t(x), sim = T, B = 100000)$p.value
    for (i in 1:200) {c2tx <- c(c2tx, chisq.test(t(x), sim = T, B = 100000)$p.value)}
    sum(abs(c2x - c2tx))

    # Modified
    mc2x <- chisq.test2(x, sim = T, B = 100000)$p.value
    for (i in 1:200) {mc2x <- c(mc2x, chisq.test2(x, sim = T, B = 100000)$p.value)}
    mc2tx <- chisq.test2(t(x), sim = T, B = 100000)$p.value
    for (i in 1:200) {mc2tx <- c(mc2tx, chisq.test2(t(x), sim = T, B = 100000)$p.value)}
    sum(abs(mc2x - mc2tx))

    # Evaluation
    t.test(c2x - c2tx, mc2x - mc2tx)

on two computers:

1) OS: OS X 10.11.6, x86_64, darwin15.6.0; Version: R version 3.4.2 (2017-09-28)
2) OS: Windows XP, i386, mingw32; Version: R version 3.4.3 (2017-11-30)

On both computers, the increasing and the decreasing order return approximately the same results.

Best regards,
Jan Motl
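As an aside not raised in the thread: an order-independent alternative to sorting is compensated summation, which Higham's book also covers. Illustrated here in Python, whose math.fsum returns the correctly rounded double sum of its inputs regardless of their order:

```python
# Compensated summation is order-independent: math.fsum gives the
# correctly rounded sum, so both orderings of the thread's example
# agree exactly.
import math

xs = [1.1] * 10000 + [1e16]
print(math.fsum(xs) == math.fsum(reversed(xs)))  # True
print(math.fsum(xs) == 10000000000011000.0)      # True: correctly rounded
```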