Are r2dtable and C_r2dtable behaving correctly?

Are r2dtable and C_r2dtable behaving correctly?

Gustavo Fernandez Bayon
Hello,

While doing some enrichment tests using chisq.test() with simulated
p-values, I noticed some strange behaviour. The computed p-value was
extremely small, so I decided to dig a little deeper and debug
chisq.test(). I noticed then that the simulated statistics returned by the
following call

tmp <- .Call(C_chisq_sim, sr, sc, B, E)

were all the same, very small numbers. This, at first, seemed strange to
me. So I decided to do some simulations myself, and started playing around
with the r2dtable() function. The problem is that, with my row and column
marginals, r2dtable() always returns the same matrix. Here is a
minimal example:

rr <- c(209410, 276167)
cc <- c(25000, 460577)
ms <- r2dtable(3, rr, cc)

I have tested this code on two machines and it always returned the same
list of length three containing the same matrix three times. The repeated
matrix is the following:

[[1]]
      [,1]   [,2]
[1,] 10782 198628
[2,] 14218 261949

[[2]]
      [,1]   [,2]
[1,] 10782 198628
[2,] 14218 261949

[[3]]
      [,1]   [,2]
[1,] 10782 198628
[2,] 14218 261949

I also coded a small function returning the value of the chi-squared
statistic using the previous fixed marginals and taking the value at [1, 1]
as input. This helped me to plot a curve and notice that the repeating
matrix was the one that yielded the minimum chi-squared statistic.
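A sketch of such a helper (my reconstruction, not Gustavo's original code; the name `chisq_stat11` is mine): with both marginals fixed, a 2x2 table is fully determined by its [1, 1] cell, so the statistic reduces to a function of that one value, and its minimum over integers falls at the cell's expected value.

```r
## Reconstruction sketch: chi-squared statistic of the 2x2 table with
## fixed marginals rr, cc, viewed as a function of the [1, 1] cell.
rr <- c(209410, 276167)
cc <- c(25000, 460577)
N  <- sum(rr)                         # grand total; equals sum(cc)

chisq_stat11 <- function(x11) {
  ## the other three cells are forced by the marginals
  obs <- matrix(c(x11,         cc[1] - x11,
                  rr[1] - x11, rr[2] - (cc[1] - x11)),
                nrow = 2)
  E <- outer(rr, cc) / N              # expected counts under independence
  sum((obs - E)^2 / E)
}

## the recurring table's [1, 1] entry minimizes the statistic:
x <- 10500:11100
curve_vals <- vapply(x, chisq_stat11, numeric(1))
x[which.min(curve_vals)]              # 10782, the repeated value
```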

This behaviour persists if I use larger marginals (adding 100000 to every
element of the marginals, for example),

> rr <- c(309410, 376167)
> cc <- c(125000, 560577)
> r2dtable(3, rr, cc)
[[1]]
      [,1]   [,2]
[1,] 56414 252996
[2,] 68586 307581

[[2]]
      [,1]   [,2]
[1,] 56414 252996
[2,] 68586 307581

[[3]]
      [,1]   [,2]
[1,] 56414 252996
[2,] 68586 307581

but not if we use smaller ones:

> rr <- c(9410, 76167)
> cc <- c(25000, 60577)
> r2dtable(3, rr, cc)
[[1]]
      [,1]  [,2]
[1,]  2721  6689
[2,] 22279 53888

[[2]]
      [,1]  [,2]
[1,]  2834  6576
[2,] 22166 54001

[[3]]
      [,1]  [,2]
[1,]  2778  6632
[2,] 22222 53945

I have looked inside the C code for the C_r2dtable() and rcont2()
functions, but I cannot do much more than guess where this behaviour
originates, so I would like to ask for help from anybody more experienced
with the R internals. My guess is that there is some kind of inflection
point depending on the total sample size of the table, or that the
generation algorithm tends to produce matrices concentrated around the
minimum.

This is the output from my sessionInfo():

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8
 [4] LC_COLLATE=es_ES.UTF-8     LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] profvis_0.3.3                           bindrcpp_0.2
 [3] FDb.InfiniumMethylation.hg19_2.2.0      org.Hs.eg.db_3.4.1
 [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.28.4
 [7] AnnotationDbi_1.38.2                    Biobase_2.36.2
 [9] GenomicRanges_1.28.4                    GenomeInfoDb_1.12.2
[11] IRanges_2.10.2                          S4Vectors_0.14.3
[13] BiocGenerics_0.22.0                     epian_0.1.0

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.6.3 purrr_0.2.3                reshape2_1.4.2
 [4] lattice_0.20-35            htmltools_0.3.6            rtracklayer_1.36.4
 [7] blob_1.1.0                 XML_3.98-1.9               rlang_0.1.2
[10] foreign_0.8-67             glue_1.1.1                 DBI_0.7
[13] BiocParallel_1.10.1        bit64_0.9-7                matrixStats_0.52.2
[16] GenomeInfoDbData_0.99.0    bindr_0.1                  plyr_1.8.4
[19] stringr_1.2.0              zlibbioc_1.22.0            Biostrings_2.44.2
[22] htmlwidgets_0.9            psych_1.7.5                memoise_1.1.0
[25] biomaRt_2.32.1             broom_0.4.2                Rcpp_0.12.12
[28] DelayedArray_0.2.7         XVector_0.16.0             bit_1.1-12
[31] Rsamtools_1.28.0           mnormt_1.5-5               digest_0.6.12
[34] stringi_1.1.5              dplyr_0.7.2                grid_3.4.0
[37] tools_3.4.0                bitops_1.0-6               magrittr_1.5
[40] RCurl_1.95-4.8             tibble_1.3.3               RSQLite_2.0
[43] tidyr_0.7.0                pkgconfig_2.0.1            Matrix_1.2-9
[46] assertthat_0.2.0           R6_2.2.2                   GenomicAlignments_1.12.1
[49] nlme_3.1-131               compiler_3.4.0

Any hint or help would be much appreciated. We do not use the simulated
version of chisq.test() much at the lab, but I would like to better
understand what is happening.

Kind regards,
Gustavo


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Are r2dtable and C_r2dtable behaving correctly?

Martin Maechler
>>>>> Gustavo Fernandez Bayon <[hidden email]>
>>>>>     on Thu, 24 Aug 2017 16:42:36 +0200 writes:

    > [... original message quoted in full above ...]

Yes.  You can also do

   unique(r2dtable(100, rr, cc))

and see that the result is constant.

I'm pretty sure this is still due to some integer overflow,
in spite of the fact that I spent quite some time fixing such
problems in Dec 2003; see the 14-year-old bug PR#5701
  https://bugs.r-project.org/bugzilla/show_bug.cgi?id=5701#c2

It has to be said that this is based on an algorithm published
in 1981, specifically - from  help(r2dtable) -

     Patefield, W. M. (1981) Algorithm AS159.  An efficient method of
     generating r x c tables with given row and column totals.
     _Applied Statistics_ *30*, 91-97.

   For those with JSTOR access (typically via your University),
   available at http://www.jstor.org/stable/2346669

When I started reading it, the algorithm does indeed seem to start
from the expected value of a cell entry and then "explore from
there"... and I do wonder if there is not a flaw somewhere in the
algorithm:

I've now found that a bit more than a year ago, 'paljenczy' found on SO
  https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated
that the generated tables do indeed seem too concentrated around the
mean. Basically, his example:

> set.seed(1)
> system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100)))
   user  system elapsed
  0.218   0.025   0.244
> A11 <- vapply(tabs, function(x) x[1, 1], numeric(1))
> table(A11)

    34     35     36     37     38     39     40     41     42     43
     2     17     40    129    334    883   2026   4522   8766  15786
    44     45     46     47     48     49     50     51     52     53
 26850  42142  59535  78851  96217 107686 112438 108237  95761  78737
    54     55     56     57     58     59     60     61     62     63
 59732  41474  26939  16006   8827   4633   2050    865    340    116
    64     65     66     67
    38     13      7      1
>

For a  2x2  table, there's really only one degree of freedom,
hence the above characterizes the full distribution for that
case.

I would have expected to see all possible values in  0:100
instead of such a "normal like" distribution with carrier only
in [34, 67].

There are newer publications and maybe newer algorithms.
So maybe the algorithm is "flawed by design" for a really large
total number of observations, rather than wrong.
Seems interesting ...

Martin Maechler


Re: Are r2dtable and C_r2dtable behaving correctly?

Jari Oksanen
It is not about a "really large total number of observations", but:

set.seed(4711)
tabs <- r2dtable(1e6, c(2, 2), c(2, 2))
A11 <- vapply(tabs, function(x) x[1, 1], numeric(1))
table(A11)

A11
     0      1      2
166483 666853 166664

There are three possible matrices, and these come out in proportions 1:4:1, the one with all cells filled with ones being
most common.

Cheers, Jari O.
________________________________________
From: R-devel <[hidden email]> on behalf of Martin Maechler <[hidden email]>
Sent: 25 August 2017 11:30
To: Gustavo Fernandez Bayon
Cc: [hidden email]
Subject: Re: [Rd] Are r2dtable and C_r2dtable behaving correctly?

    > [... Martin Maechler's message quoted in full above ...]

Re: Are r2dtable and C_r2dtable behaving correctly?

Peter Dalgaard-2
In reply to this post by Martin Maechler

> On 25 Aug 2017, at 10:30 , Martin Maechler <[hidden email]> wrote:
>
[...]

> https://stackoverflow.com/questions/37309276/r-r2dtable-contingency-tables-are-too-concentrated
>
>
>> set.seed(1); system.time(tabs <- r2dtable(1e6, c(100, 100), c(100, 100))); A11 <- vapply(tabs, function(x) x[1, 1], numeric(1))
>   user  system elapsed
>  0.218   0.025   0.244
>> table(A11)
>
>    34     35     36     37     38     39     40     41     42     43
>     2     17     40    129    334    883   2026   4522   8766  15786
>    44     45     46     47     48     49     50     51     52     53
> 26850  42142  59535  78851  96217 107686 112438 108237  95761  78737
>    54     55     56     57     58     59     60     61     62     63
> 59732  41474  26939  16006   8827   4633   2050    865    340    116
>    64     65     66     67
>    38     13      7      1
>>
>
> For a  2x2  table, there's really only one degree of freedom,
> hence the above characterizes the full distribution for that
> case.
>
> I would have expected to see all possible values in  0:100
> instead of such a "normal like" distribution with carrier only
> in [34, 67].

Hmm, am I missing a point here?

> round(dhyper(0:100,100,100,100)*1e6)
  [1]      0      0      0      0      0      0      0      0      0      0
 [11]      0      0      0      0      0      0      0      0      0      0
 [21]      0      0      0      0      0      0      0      0      0      0
 [31]      0      0      0      1      4     13     43    129    355    897
 [41]   2087   4469   8819  16045  26927  41700  59614  78694  95943 108050
 [51] 112416 108050  95943  78694  59614  41700  26927  16045   8819   4469
 [61]   2087    897    355    129     43     13      4      1      0      0
 [71]      0      0      0      0      0      0      0      0      0      0
 [81]      0      0      0      0      0      0      0      0      0      0
 [91]      0      0      0      0      0      0      0      0      0      0
[101]      0
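Peter's point can also be checked head-on (an editorial sketch, not from the thread; the 0.05 tolerance is an arbitrary choice): the simulated [1, 1] cell should follow the hypergeometric density, so the total variation distance between the empirical and theoretical distributions should be near zero.

```r
## Compare the simulated distribution of the [1, 1] cell with the
## hypergeometric density it should follow under fixed marginals.
set.seed(1)
tabs <- r2dtable(1e5, c(100, 100), c(100, 100))
A11  <- vapply(tabs, function(x) x[1, 1], numeric(1))

## empirical frequencies over the full range 0:100, vs. theory
emp   <- as.vector(table(factor(A11, levels = 0:100))) / length(A11)
theor <- dhyper(0:100, 100, 100, 100)

## total variation distance: half the sum of absolute differences
tv <- sum(abs(emp - theor)) / 2
tv
```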


--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]


Re: Are r2dtable and C_r2dtable behaving correctly?

Peter Dalgaard-2
In reply to this post by Jari Oksanen

> On 25 Aug 2017, at 11:23 , Jari Oksanen <[hidden email]> wrote:
>
> It is not about a "really large total number of observations", but:
>
> set.seed(4711);tabs <- r2dtable(1e6, c(2, 2), c(2, 2)); A11 <- vapply(tabs, function(x) x[1, 1], numeric(1));table(A11)
>
> A11
>     0      1      2
> 166483 666853 166664
>
> There are three possible matrices, and these come out in proportions 1:4:1, the one with all cells filled with ones being
> most common.

... and

> dhyper(0:2,2,2,2)
[1] 0.1666667 0.6666667 0.1666667
> dhyper(0:2,2,2,2) *6
[1] 1 4 1

so that is exactly what you would expect. However,

> dhyper(10782,209410, 276167, 25000)
[1] 0.005230889

so you wouldn't expect 10782 to recur. It is close to the mean of the hypergeometric distribution, but

> sim <- rhyper(1e6,209410, 276167, 25000)
> mean(sim)
[1] 10781.53
> sd(sim)
[1] 76.14209

(and incidentally, rhyper is plenty fast enough that you don't really need r2dtable for the 2x2 case)
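As a sketch of that shortcut (my own illustration; the helper name `r2x2` is an assumption, not code from the thread): the [1, 1] cell of a random 2x2 table with fixed marginals is hypergeometric, and the other three cells follow from the marginals.

```r
## Generate random 2x2 tables with fixed row sums rr and column sums cc
## via rhyper(): draw the [1, 1] cell (row-1 items among the first cc[1]
## draws), then fill in the remaining cells from the marginals.
r2x2 <- function(n, rr, cc) {
  x11 <- rhyper(n, rr[1], rr[2], cc[1])
  lapply(x11, function(x)
    matrix(c(x,           cc[1] - x,              # column 1
             rr[1] - x,   rr[2] - (cc[1] - x)),   # column 2
           nrow = 2))
}

set.seed(42)
rr <- c(209410, 276167)
cc <- c(25000, 460577)
tabs <- r2x2(5, rr, cc)

## every generated table reproduces the requested marginals exactly
ok <- all(vapply(tabs, function(m)
  all(rowSums(m) == rr) && all(colSums(m) == cc), logical(1)))
ok
```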

-pd

>
> Cheers, Jari O.
>
> [... rest of the quoted thread elided ...]


Re: Are r2dtable and C_r2dtable behaving correctly?

Martin Maechler
In reply to this post by Peter Dalgaard-2
>>>>> Peter Dalgaard <[hidden email]>
>>>>>     on Fri, 25 Aug 2017 11:43:40 +0200 writes:

    >> On 25 Aug 2017, at 10:30 , Martin Maechler <[hidden email]> wrote:
    >> [... table and comments quoted above ...]

    > Hmm, am I missing a point here?

    >> round(dhyper(0:100,100,100,100)*1e6)
    > [... output as above ...]

No, you ain't,  I was. :-(
Martin



Re: Are r2dtable and C_r2dtable behaving correctly?

Peter Dalgaard-2
In reply to this post by Peter Dalgaard-2

> On 25 Aug 2017, at 12:04 , Peter Dalgaard <[hidden email]> wrote:
>
>> There are three possible matrices, and these come out in proportions 1:4:1, the one with all cells filled with ones being
>> most common.
>
> ... and
>
>> dhyper(0:2,2,2,2)
> [1] 0.1666667 0.6666667 0.1666667
>> dhyper(0:2,2,2,2) *6
> [1] 1 4 1
>
> so that is exactly what you would expect.

And, incidentally, this is the "statistician's socks" puzzle from introductory probability:

A statistician has 2 white socks and 2 black socks. Late for work on a dark November morning, he puts on two socks at random. What is the probability that he goes to work with different-coloured socks?
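For the record, the puzzle's answer in dhyper() terms: drawing 2 socks from 2 white and 2 black, mismatched socks means exactly one white among the two drawn.

```r
## P(exactly 1 white in 2 draws from 2 white + 2 black)
dhyper(1, 2, 2, 2)   # = 4/6 = 2/3
```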
