bias issue in sample() (PR 17494)

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

bias issue in sample() (PR 17494)

Tierney, Luke

Before the next release we really should to sort out the bias issue in
sample() reported by Ottoboni and Stark in
https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
filed aa a bug report by Duncan Murdoch at
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.

Here are two examples of bad behavior through current R-devel:

     set.seed(123)
     m <- (2/5) * 2^32
     x <- sample(m, 1000000, replace = TRUE)
     table(x %% 2, x > m / 2)
     ##
     ##    FALSE   TRUE
     ## 0 300620 198792
     ## 1 200196 300392

     table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
     ##
     ##      0      1
     ## 429054 570946

I committed a modification to R_unif_index to address this by
generating random bits (blocks of 16) and rejection sampling, but for
now this is only enabled if the environment variable R_NEW_SAMPLE is
set before the first call.

Some things still needed:

- someone to look over the change and see if there are any issues
- adjustment of RNGkind to allowing the old behavior to be selected
- make the new behavior the default
- adjust documentation
- ???

Unfortunately I don't have enough free cycles to do this, but I can
help if someone else can take the lead.

There are two other places I found that might suffer from the same
issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
have done that for walker_ProbSampleReplace, but the wilcox change
might need adjusting to support the standalone math library and I
don't feel confident enough I'd get that right.

Best,

luke


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: bias issue in sample() (PR 17494)

Gabriel Becker-2
Luke,

I'm happy to help with this. Its great to see this get tackled (I've cc'ed
Kelli Ottoboni who helped flag this issue).

I can prepare a patch for the RNGkind related stuff and the doc update.

As for ???, what are your (and others') thoughts about the possibility of
a) a reproducibility API which takes either an R version (or maybe
alternatively a date) and sets the RNGkind to the default for that
version/date, and/or b) that sessionInfo be modified to capture (and
display) the RNGkind in effect.

Best,
~G


On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <[hidden email]>
wrote:

>
> Before the next release we really should to sort out the bias issue in
> sample() reported by Ottoboni and Stark in
> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
> filed aa a bug report by Duncan Murdoch at
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>
> Here are two examples of bad behavior through current R-devel:
>
>      set.seed(123)
>      m <- (2/5) * 2^32
>      x <- sample(m, 1000000, replace = TRUE)
>      table(x %% 2, x > m / 2)
>      ##
>      ##    FALSE   TRUE
>      ## 0 300620 198792
>      ## 1 200196 300392
>
>      table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
>      ##
>      ##      0      1
>      ## 429054 570946
>
> I committed a modification to R_unif_index to address this by
> generating random bits (blocks of 16) and rejection sampling, but for
> now this is only enabled if the environment variable R_NEW_SAMPLE is
> set before the first call.
>
> Some things still needed:
>
> - someone to look over the change and see if there are any issues
> - adjustment of RNGkind to allowing the old behavior to be selected
> - make the new behavior the default
> - adjust documentation
> - ???
>
> Unfortunately I don't have enough free cycles to do this, but I can
> help if someone else can take the lead.
>
> There are two other places I found that might suffer from the same
> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
> src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
> have done that for walker_ProbSampleReplace, but the wilcox change
> might need adjusting to support the standalone math library and I
> don't feel confident enough I'd get that right.
>
> Best,
>
> luke
>
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>     Actuarial Science
> 241 Schaeffer Hall                  email:   [hidden email]
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: bias issue in sample() (PR 17494)

Kirill Müller-2
Gabe


As mentioned on Twitter, I think the following behavior should be fixed
as part of the upcoming changes:

R.version.string
## [1] "R Under development (unstable) (2019-02-25 r76160)"
.Machine$double.digits
## [1] 53
set.seed(123)
RNGkind()
## [1] "Mersenne-Twister" "Inversion"        "Rejection"
length(table(runif(1e6)))
## [1] 999863

I don't expect any collisions when using Mersenne-Twister to generate a
million floating point values. I'm not sure what causes this behavior,
but it's documented in ?Random:

"Do not rely on randomness of low-order bits from RNGs. Most of the
supplied uniform generators return 32-bit integer values that are
converted to doubles, so they take at most 2^32 distinct values and long
runs will return duplicated values (Wichmann-Hill is the exception, and
all give at least 30 varying bits.)"

The "Wichman-Hill" bit is interesting:

RNGkind("Wichmann-Hill")
length(table(runif(1e6)))
## [1] 1000000
length(table(runif(1e6)))
## [1] 1000000

Mersenne-Twister has a much much larger periodicity than Wichmann-Hill,
it would be great to see the above behavior also for Mersenne-Twister.
Thanks for considering.


Best regards

Kirill


On 20.02.19 08:01, Gabriel Becker wrote:

> Luke,
>
> I'm happy to help with this. Its great to see this get tackled (I've cc'ed
> Kelli Ottoboni who helped flag this issue).
>
> I can prepare a patch for the RNGkind related stuff and the doc update.
>
> As for ???, what are your (and others') thoughts about the possibility of
> a) a reproducibility API which takes either an R version (or maybe
> alternatively a date) and sets the RNGkind to the default for that
> version/date, and/or b) that sessionInfo be modified to capture (and
> display) the RNGkind in effect.
>
> Best,
> ~G
>
>
> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <[hidden email]>
> wrote:
>
>> Before the next release we really should to sort out the bias issue in
>> sample() reported by Ottoboni and Stark in
>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
>> filed aa a bug report by Duncan Murdoch at
>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>>
>> Here are two examples of bad behavior through current R-devel:
>>
>>       set.seed(123)
>>       m <- (2/5) * 2^32
>>       x <- sample(m, 1000000, replace = TRUE)
>>       table(x %% 2, x > m / 2)
>>       ##
>>       ##    FALSE   TRUE
>>       ## 0 300620 198792
>>       ## 1 200196 300392
>>
>>       table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
>>       ##
>>       ##      0      1
>>       ## 429054 570946
>>
>> I committed a modification to R_unif_index to address this by
>> generating random bits (blocks of 16) and rejection sampling, but for
>> now this is only enabled if the environment variable R_NEW_SAMPLE is
>> set before the first call.
>>
>> Some things still needed:
>>
>> - someone to look over the change and see if there are any issues
>> - adjustment of RNGkind to allowing the old behavior to be selected
>> - make the new behavior the default
>> - adjust documentation
>> - ???
>>
>> Unfortunately I don't have enough free cycles to do this, but I can
>> help if someone else can take the lead.
>>
>> There are two other places I found that might suffer from the same
>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
>> src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
>> have done that for walker_ProbSampleReplace, but the wilcox change
>> might need adjusting to support the standalone math library and I
>> don't feel confident enough I'd get that right.
>>
>> Best,
>>
>> luke
>>
>>
>> --
>> Luke Tierney
>> Ralph E. Wareham Professor of Mathematical Sciences
>> University of Iowa                  Phone:             319-335-3386
>> Department of Statistics and        Fax:               319-335-3017
>>      Actuarial Science
>> 241 Schaeffer Hall                  email:   [hidden email]
>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: bias issue in sample() (PR 17494)

Ralf Stubner
Kirill,

I think some level of collision is actually expected! R uses a 32bit MT
that can produce 2^32 different doubles. The probability for a collision
within a million draws is

> pbirthday(1e6, classes = 2^32)
[1] 1

Greetings
Ralf


On 26.02.19 07:06, Kirill Müller wrote:

> Gabe
>
>
> As mentioned on Twitter, I think the following behavior should be fixed
> as part of the upcoming changes:
>
> R.version.string
> ## [1] "R Under development (unstable) (2019-02-25 r76160)"
> .Machine$double.digits
> ## [1] 53
> set.seed(123)
> RNGkind()
> ## [1] "Mersenne-Twister" "Inversion"        "Rejection"
> length(table(runif(1e6)))
> ## [1] 999863
>
> I don't expect any collisions when using Mersenne-Twister to generate a
> million floating point values. I'm not sure what causes this behavior,
> but it's documented in ?Random:
>
> "Do not rely on randomness of low-order bits from RNGs. Most of the
> supplied uniform generators return 32-bit integer values that are
> converted to doubles, so they take at most 2^32 distinct values and long
> runs will return duplicated values (Wichmann-Hill is the exception, and
> all give at least 30 varying bits.)"
>
> The "Wichman-Hill" bit is interesting:
>
> RNGkind("Wichmann-Hill")
> length(table(runif(1e6)))
> ## [1] 1000000
> length(table(runif(1e6)))
> ## [1] 1000000
>
> Mersenne-Twister has a much much larger periodicity than Wichmann-Hill,
> it would be great to see the above behavior also for Mersenne-Twister.
> Thanks for considering.
>
>
> Best regards
>
> Kirill
>
>
> On 20.02.19 08:01, Gabriel Becker wrote:
>> Luke,
>>
>> I'm happy to help with this. Its great to see this get tackled (I've
>> cc'ed
>> Kelli Ottoboni who helped flag this issue).
>>
>> I can prepare a patch for the RNGkind related stuff and the doc update.
>>
>> As for ???, what are your (and others') thoughts about the possibility of
>> a) a reproducibility API which takes either an R version (or maybe
>> alternatively a date) and sets the RNGkind to the default for that
>> version/date, and/or b) that sessionInfo be modified to capture (and
>> display) the RNGkind in effect.
>>
>> Best,
>> ~G
>>
>>
>> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <[hidden email]>
>> wrote:
>>
>>> Before the next release we really should to sort out the bias issue in
>>> sample() reported by Ottoboni and Stark in
>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
>>> filed aa a bug report by Duncan Murdoch at
>>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>>>
>>> Here are two examples of bad behavior through current R-devel:
>>>
>>>       set.seed(123)
>>>       m <- (2/5) * 2^32
>>>       x <- sample(m, 1000000, replace = TRUE)
>>>       table(x %% 2, x > m / 2)
>>>       ##
>>>       ##    FALSE   TRUE
>>>       ## 0 300620 198792
>>>       ## 1 200196 300392
>>>
>>>       table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
>>>       ##
>>>       ##      0      1
>>>       ## 429054 570946
>>>
>>> I committed a modification to R_unif_index to address this by
>>> generating random bits (blocks of 16) and rejection sampling, but for
>>> now this is only enabled if the environment variable R_NEW_SAMPLE is
>>> set before the first call.
>>>
>>> Some things still needed:
>>>
>>> - someone to look over the change and see if there are any issues
>>> - adjustment of RNGkind to allowing the old behavior to be selected
>>> - make the new behavior the default
>>> - adjust documentation
>>> - ???
>>>
>>> Unfortunately I don't have enough free cycles to do this, but I can
>>> help if someone else can take the lead.
>>>
>>> There are two other places I found that might suffer from the same
>>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
>>> src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
>>> have done that for walker_ProbSampleReplace, but the wilcox change
>>> might need adjusting to support the standalone math library and I
>>> don't feel confident enough I'd get that right.
>>>
>>> Best,
>>>
>>> luke
>>>
>>>
>>> --
>>> Luke Tierney
>>> Ralph E. Wareham Professor of Mathematical Sciences
>>> University of Iowa                  Phone:             319-335-3386
>>> Department of Statistics and        Fax:               319-335-3017
>>>      Actuarial Science
>>> 241 Schaeffer Hall                  email:   [hidden email]
>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>     [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
--
Ralf Stubner
Senior Software Engineer / Trainer

daqana GmbH
Dortustraße 48
14467 Potsdam

T: +49 331 23 61 93 11
F: +49 331 23 61 93 90
M: +49 162 20 91 196
Mail: [hidden email]

Sitz: Potsdam
Register: AG Potsdam HRB 27966
Ust.-IdNr.: DE300072622
Geschäftsführer: Dr.-Ing. Stefan Knirsch, Prof. Dr. Dr. Karl-Kuno Kunze


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: bias issue in sample() (PR 17494)

Kirill Müller-2
Ralf


I don't doubt this is expected with the current implementation, I doubt
the implementation is desirable. Suggesting to turn this to

pbirthday(1e6, classes = 2^53)
## [1] 5.550956e-05

(which is still non-zero, but much less likely to cause confusion.)


Best regards

Kirill

On 26.02.19 10:18, Ralf Stubner wrote:

> Kirill,
>
> I think some level of collision is actually expected! R uses a 32bit MT
> that can produce 2^32 different doubles. The probability for a collision
> within a million draws is
>
>> pbirthday(1e6, classes = 2^32)
> [1] 1
>
> Greetings
> Ralf
>
>
> On 26.02.19 07:06, Kirill Müller wrote:
>> Gabe
>>
>>
>> As mentioned on Twitter, I think the following behavior should be fixed
>> as part of the upcoming changes:
>>
>> R.version.string
>> ## [1] "R Under development (unstable) (2019-02-25 r76160)"
>> .Machine$double.digits
>> ## [1] 53
>> set.seed(123)
>> RNGkind()
>> ## [1] "Mersenne-Twister" "Inversion"        "Rejection"
>> length(table(runif(1e6)))
>> ## [1] 999863
>>
>> I don't expect any collisions when using Mersenne-Twister to generate a
>> million floating point values. I'm not sure what causes this behavior,
>> but it's documented in ?Random:
>>
>> "Do not rely on randomness of low-order bits from RNGs. Most of the
>> supplied uniform generators return 32-bit integer values that are
>> converted to doubles, so they take at most 2^32 distinct values and long
>> runs will return duplicated values (Wichmann-Hill is the exception, and
>> all give at least 30 varying bits.)"
>>
>> The "Wichman-Hill" bit is interesting:
>>
>> RNGkind("Wichmann-Hill")
>> length(table(runif(1e6)))
>> ## [1] 1000000
>> length(table(runif(1e6)))
>> ## [1] 1000000
>>
>> Mersenne-Twister has a much much larger periodicity than Wichmann-Hill,
>> it would be great to see the above behavior also for Mersenne-Twister.
>> Thanks for considering.
>>
>>
>> Best regards
>>
>> Kirill
>>
>>
>> On 20.02.19 08:01, Gabriel Becker wrote:
>>> Luke,
>>>
>>> I'm happy to help with this. Its great to see this get tackled (I've
>>> cc'ed
>>> Kelli Ottoboni who helped flag this issue).
>>>
>>> I can prepare a patch for the RNGkind related stuff and the doc update.
>>>
>>> As for ???, what are your (and others') thoughts about the possibility of
>>> a) a reproducibility API which takes either an R version (or maybe
>>> alternatively a date) and sets the RNGkind to the default for that
>>> version/date, and/or b) that sessionInfo be modified to capture (and
>>> display) the RNGkind in effect.
>>>
>>> Best,
>>> ~G
>>>
>>>
>>> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <[hidden email]>
>>> wrote:
>>>
>>>> Before the next release we really should to sort out the bias issue in
>>>> sample() reported by Ottoboni and Stark in
>>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
>>>> filed aa a bug report by Duncan Murdoch at
>>>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>>>>
>>>> Here are two examples of bad behavior through current R-devel:
>>>>
>>>>        set.seed(123)
>>>>        m <- (2/5) * 2^32
>>>>        x <- sample(m, 1000000, replace = TRUE)
>>>>        table(x %% 2, x > m / 2)
>>>>        ##
>>>>        ##    FALSE   TRUE
>>>>        ## 0 300620 198792
>>>>        ## 1 200196 300392
>>>>
>>>>        table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
>>>>        ##
>>>>        ##      0      1
>>>>        ## 429054 570946
>>>>
>>>> I committed a modification to R_unif_index to address this by
>>>> generating random bits (blocks of 16) and rejection sampling, but for
>>>> now this is only enabled if the environment variable R_NEW_SAMPLE is
>>>> set before the first call.
>>>>
>>>> Some things still needed:
>>>>
>>>> - someone to look over the change and see if there are any issues
>>>> - adjustment of RNGkind to allowing the old behavior to be selected
>>>> - make the new behavior the default
>>>> - adjust documentation
>>>> - ???
>>>>
>>>> Unfortunately I don't have enough free cycles to do this, but I can
>>>> help if someone else can take the lead.
>>>>
>>>> There are two other places I found that might suffer from the same
>>>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
>>>> src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
>>>> have done that for walker_ProbSampleReplace, but the wilcox change
>>>> might need adjusting to support the standalone math library and I
>>>> don't feel confident enough I'd get that right.
>>>>
>>>> Best,
>>>>
>>>> luke
>>>>
>>>>
>>>> --
>>>> Luke Tierney
>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>> University of Iowa                  Phone:             319-335-3386
>>>> Department of Statistics and        Fax:               319-335-3017
>>>>       Actuarial Science
>>>> 241 Schaeffer Hall                  email:   [hidden email]
>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>>
>>>> ______________________________________________
>>>> [hidden email] mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>      [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: bias issue in sample() (PR 17494)

Tierney, Luke
On Tue, 26 Feb 2019, Kirill Müller wrote:

> Ralf
>
>
> I don't doubt this is expected with the current implementation, I doubt the
> implementation is desirable. Suggesting to turn this to
>
> pbirthday(1e6, classes = 2^53)
> ## [1] 5.550956e-05

That isn't a small number given simulation sizes people routinely run
these days. Just about right to miss an issue in a pilot run and get
bitten on the real one.

In the inversion generator for normals we already use a higher
resolution uniform produced from two regular ones. I considered
switching to that approach for all uniforms, either in addition to or
instead of changing the uniform integer sampling algorithm used in
sample(). But that would have been even more disruptive:

- all simulation results (except normals) would change;
- there would be a performance penalty;
- the streams would be used up twice as fast;

I would also probably be necessary to rethink things like how to use
the L'Ecuyer generator to produce multiple streams in the `parallel`
package.

We may need to take this route in the future, but it didn't seem like
a good idea at this time.

Best,

luke

>
> (which is still non-zero, but much less likely to cause confusion.)
>
>
> Best regards
>
> Kirill
>
> On 26.02.19 10:18, Ralf Stubner wrote:
>> Kirill,
>>
>> I think some level of collision is actually expected! R uses a 32bit MT
>> that can produce 2^32 different doubles. The probability for a collision
>> within a million draws is
>>
>>> pbirthday(1e6, classes = 2^32)
>> [1] 1
>>
>> Greetings
>> Ralf
>>
>>
>> On 26.02.19 07:06, Kirill Müller wrote:
>>> Gabe
>>>
>>>
>>> As mentioned on Twitter, I think the following behavior should be fixed
>>> as part of the upcoming changes:
>>>
>>> R.version.string
>>> ## [1] "R Under development (unstable) (2019-02-25 r76160)"
>>> .Machine$double.digits
>>> ## [1] 53
>>> set.seed(123)
>>> RNGkind()
>>> ## [1] "Mersenne-Twister" "Inversion"        "Rejection"
>>> length(table(runif(1e6)))
>>> ## [1] 999863
>>>
>>> I don't expect any collisions when using Mersenne-Twister to generate a
>>> million floating point values. I'm not sure what causes this behavior,
>>> but it's documented in ?Random:
>>>
>>> "Do not rely on randomness of low-order bits from RNGs. Most of the
>>> supplied uniform generators return 32-bit integer values that are
>>> converted to doubles, so they take at most 2^32 distinct values and long
>>> runs will return duplicated values (Wichmann-Hill is the exception, and
>>> all give at least 30 varying bits.)"
>>>
>>> The "Wichman-Hill" bit is interesting:
>>>
>>> RNGkind("Wichmann-Hill")
>>> length(table(runif(1e6)))
>>> ## [1] 1000000
>>> length(table(runif(1e6)))
>>> ## [1] 1000000
>>>
>>> Mersenne-Twister has a much much larger periodicity than Wichmann-Hill,
>>> it would be great to see the above behavior also for Mersenne-Twister.
>>> Thanks for considering.
>>>
>>>
>>> Best regards
>>>
>>> Kirill
>>>
>>>
>>> On 20.02.19 08:01, Gabriel Becker wrote:
>>>> Luke,
>>>>
>>>> I'm happy to help with this. Its great to see this get tackled (I've
>>>> cc'ed
>>>> Kelli Ottoboni who helped flag this issue).
>>>>
>>>> I can prepare a patch for the RNGkind related stuff and the doc update.
>>>>
>>>> As for ???, what are your (and others') thoughts about the possibility of
>>>> a) a reproducibility API which takes either an R version (or maybe
>>>> alternatively a date) and sets the RNGkind to the default for that
>>>> version/date, and/or b) that sessionInfo be modified to capture (and
>>>> display) the RNGkind in effect.
>>>>
>>>> Best,
>>>> ~G
>>>>
>>>>
>>>> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <[hidden email]>
>>>> wrote:
>>>>
>>>>> Before the next release we really should to sort out the bias issue in
>>>>> sample() reported by Ottoboni and Stark in
>>>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
>>>>> filed aa a bug report by Duncan Murdoch at
>>>>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>>>>>
>>>>> Here are two examples of bad behavior through current R-devel:
>>>>>
>>>>>        set.seed(123)
>>>>>        m <- (2/5) * 2^32
>>>>>        x <- sample(m, 1000000, replace = TRUE)
>>>>>        table(x %% 2, x > m / 2)
>>>>>        ##
>>>>>        ##    FALSE   TRUE
>>>>>        ## 0 300620 198792
>>>>>        ## 1 200196 300392
>>>>>
>>>>>        table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2)
>>>>>        ##
>>>>>        ##      0      1
>>>>>        ## 429054 570946
>>>>>
>>>>> I committed a modification to R_unif_index to address this by
>>>>> generating random bits (blocks of 16) and rejection sampling, but for
>>>>> now this is only enabled if the environment variable R_NEW_SAMPLE is
>>>>> set before the first call.
>>>>>
>>>>> Some things still needed:
>>>>>
>>>>> - someone to look over the change and see if there are any issues
>>>>> - adjustment of RNGkind to allowing the old behavior to be selected
>>>>> - make the new behavior the default
>>>>> - adjust documentation
>>>>> - ???
>>>>>
>>>>> Unfortunately I don't have enough free cycles to do this, but I can
>>>>> help if someone else can take the lead.
>>>>>
>>>>> There are two other places I found that might suffer from the same
>>>>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
>>>>> src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
>>>>> have done that for walker_ProbSampleReplace, but the wilcox change
>>>>> might need adjusting to support the standalone math library and I
>>>>> don't feel confident enough I'd get that right.
>>>>>
>>>>> Best,
>>>>>
>>>>> luke
>>>>>
>>>>>
>>>>> --
>>>>> Luke Tierney
>>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>>> University of Iowa                  Phone:             319-335-3386
>>>>> Department of Statistics and        Fax:               319-335-3017
>>>>>       Actuarial Science
>>>>> 241 Schaeffer Hall                  email:   [hidden email]
>>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>>>
>>>>> ______________________________________________
>>>>> [hidden email] mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>
>>>>      [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> [hidden email] mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel