Bias in R's random integers?

Carl Boettiger
Dear list,

It looks to me as though R samples random integers using an intuitive but
biased algorithm, mapping a random number on [0,1) from the PRNG to a
random integer, e.g.
https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808

Many other languages use various rejection-sampling approaches, which
provide an unbiased method for sampling; this is done in Go, Python, and
others described here:  https://arxiv.org/abs/1805.10941 (I believe the
biased algorithm currently used in R is also described there).  I'm not an
expert in this area, but does it make sense for R to adopt one of the
unbiased random sampling algorithms outlined there and used in other
languages?  Would a patch providing such an algorithm be welcome?  What
concerns would need to be addressed first?
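
For readers unfamiliar with the approach, here is a minimal sketch of the
rejection idea in R.  This is not R's code, and draw32() is a placeholder
for an unbiased 32-bit integer source, which base R does not expose
directly:

## Unbiased integer on 1:m by rejection: discard raw draws that fall in the
## incomplete final block, so every value in 1:m covers exactly the same
## number of raw outcomes.
unbiased_index <- function(m, draw32) {
  stopifnot(m >= 1, m <= 2^32)
  limit <- 2^32 - (2^32 %% m)    # largest multiple of m not exceeding 2^32
  repeat {
    k <- draw32()                # assumed uniform on 0:(2^32 - 1)
    if (k < limit) return((k %% m) + 1)
  }
}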

I believe this issue was also raised by Kellie & Philip in
http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and more
recently in
https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
pointing to the Python implementation for comparison:
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265

Thanks!

Carl
--

http://carlboettiger.info

Re: Bias in R's random integers?

Duncan Murdoch
I think the analyses are correct, but I doubt if a change to the default
is likely to be accepted as it would make it more difficult to reproduce
older results.

On the other hand, a contribution of a new function like sample() but
not suffering from the bias would be good.  The normal way to make such
a contribution is in a user-contributed package.

By the way, R code illustrating the bias is probably not very hard to
put together.  I believe the bias manifests itself in sample() producing
values with two different probabilities (instead of all equal
probabilities).  Those may differ by as much as one part in 2^32.  It's
very difficult to detect a probability difference that small, but if you
define the partition of values into the high-probability values vs. the
low-probability values, you can probably detect the difference in a
feasible simulation.
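
One way to see the partition, as an illustrative sketch: with a single
w-bit draw, sample() effectively computes floor(m * K / 2^w) for K uniform
on 0:(2^w - 1), so the number of grid points landing on each value can be
tabulated directly.  A small w keeps the double arithmetic exact:

counts_for <- function(m, w) {
  i <- 0:(m - 1)
  floor((i + 1) * 2^w / m) - floor(i * 2^w / m)
}
table(counts_for(10, 8))  # 256 grid points over 10 values: 4 values get 25, 6 get 26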

Duncan Murdoch

Re: Bias in R's random integers?

Iñaki Ucar
On Wed, 19 Sep 2018 at 14:43, Duncan Murdoch
(<[hidden email]>) wrote:

> By the way, R code illustrating the bias is probably not very hard to
> put together.  I believe the bias manifests itself in sample() producing
> values with two different probabilities (instead of all equal
> probabilities).  Those may differ by as much as one part in 2^32.

According to Kellie and Philip, in the attachment of the thread
referenced by Carl, "The maximum ratio of selection probabilities can
get as large as 1.5 if n is just below 2^31".
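
A rough sketch of where a figure like 1.5 can come from: with a single
32-bit draw, each value in 1:n receives either floor(2^32 / n) or
floor(2^32 / n) + 1 of the 2^32 equally likely grid points.  For n just
below 2^31 that count is 2 or 3, e.g.

n <- 2^31 - 1
floor(2^32 / n)       # 2
floor(2^32 / n) + 1   # 3

so the worst-case ratio of selection probabilities approaches 3/2.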

Iñaki


Re: Bias in R's random integers?

David Hugh-Jones
In reply to this post by Duncan Murdoch
On Wed, 19 Sep 2018 at 13:43, Duncan Murdoch <[hidden email]>
wrote:

>
> I think the analyses are correct, but I doubt if a change to the default
> is likely to be accepted as it would make it more difficult to reproduce
> older results.


I'm a bit alarmed by the logic here. Unbiased sampling seems basic for a
statistical language. As a consumer of R I'd like to think that e.g. my
bootstrapped p values are correct.
Surely if the old results depend on the biased algorithm, then they are
false results?
--
Sent from Gmail Mobile


Re: Bias in R's random integers?

bbolker


On 2018-09-19 09:40 AM, David Hugh-Jones wrote:

> I'm a bit alarmed by the logic here. Unbiased sampling seems basic for a
> statistical language. As a consumer of R I'd like to think that e.g. my
> bootstrapped p values are correct.
> Surely if the old results depend on the biased algorithm, then they are
> false results?

   Balancing backward compatibility and correctness is a tough problem
here.  If this goes into base R, what's the best way to do it?  What was
the protocol for migrating away from the "buggy Kinderman-Ramage"
generator, back in the day?  (That change went into version 1.7.0,
released in April 2003.)

  I couldn't find the exact commit in the GitHub mirror; this one is related:

https://github.com/wch/r-source/commit/7ad3044639fd1fe093c655e573fd1a67aa7f55f6#diff-dbcad570d4fb9b7005550ff630543b37



===
‘normal.kind’ can be ‘"Kinderman-Ramage"’, ‘"Buggy
     Kinderman-Ramage"’ (not for ‘set.seed’), ‘"Ahrens-Dieter"’,
     ‘"Box-Muller"’, ‘"Inversion"’ (the default), or ‘"user-supplied"’.
     (For inversion, see the reference in ‘qnorm’.)  The
     Kinderman-Ramage generator used in versions prior to 1.7.0 (now
     called ‘"Buggy"’) had several approximation errors and should only
     be used for reproduction of old results.
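
For reference, that migration surfaced through RNGkind(): the pre-1.7.0
behaviour stayed selectable by name rather than remaining the default.  A
sketch of how a user opts back in (calls shown for illustration):

RNGkind(normal.kind = "Buggy Kinderman-Ramage")  # reproduce pre-1.7.0 normals
rnorm(3)
RNGkind(normal.kind = "Inversion")               # back to the current default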


Re: Bias in R's random integers?

Duncan Murdoch
In reply to this post by Iñaki Ucar
On 19/09/2018 9:09 AM, Iñaki Ucar wrote:

> According to Kellie and Philip, in the attachment of the thread
> referenced by Carl, "The maximum ratio of selection probabilities can
> get as large as 1.5 if n is just below 2^31".

Sorry, I didn't write very well.  I meant to say that the difference in
probabilities would be 2^-32, not that the ratio of probabilities would
be 1 + 2^-32.

By the way, I don't see the statement giving the ratio as 1.5, but maybe
I was looking in the wrong place.  In Theorem 1 of the paper I was
looking at, the ratio is "1 + m 2^{-w + 1}".  In that formula m is your
n.  If it is near 2^31, R uses w = 57 random bits, so the ratio would
exceed 1 by a very, very small amount (one part in 2^25).

The worst case for R would happen when m is just below 2^25, where w
is at least 31 for the default generators.  In that case the ratio could
be about 1.03.
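
A quick check of those two figures from the formula, for illustration:

ratio <- function(m, w) 1 + m * 2^(-w + 1)
ratio(2^31, 57)   # 1 + 2^-25, about 1.00000003
ratio(2^25, 31)   # 1.03125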

Duncan Murdoch


Re: Bias in R's random integers?

Philip B. Stark
The 53 bits only encode at most 2^{32} possible values, because the source
of the float is the output of a 32-bit PRNG (the obsolete version of MT).
53 bits isn't the relevant number here.

The selection ratios can get close to 2. Computer scientists don't do it
the way R does, for a reason.

Regards,
Philip


--
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor,  Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark


Re: Bias in R's random integers?

Duncan Murdoch
In reply to this post by David Hugh-Jones
On 19/09/2018 9:40 AM, David Hugh-Jones wrote:

> I'm a bit alarmed by the logic here. Unbiased sampling seems basic for a
> statistical language. As a consumer of R I'd like to think that e.g. my
> bootstrapped p values are correct.
> Surely if the old results depend on the biased algorithm, then they are
> false results?

All Monte Carlo results contain Monte Carlo error.  Using the biased
function will have some additional error, but for almost all
simulations, it will be negligible compared to the Monte Carlo error.  I
suspect the only simulations where the bias was anywhere near the same
order of magnitude as the Monte Carlo error would be ones designed with
this specific code in mind.
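
For scale, a rough illustrative comparison: the Monte Carlo standard error
of an estimated probability with N replications is about 0.5 / sqrt(N),
while the difference in probability between any two individual values here
is at most 2^-32.

N <- 1e6
0.5 / sqrt(N)   # about 5e-4
2^-32           # about 2.3e-10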

Duncan Murdoch


Re: Bias in R's random integers?

Philip B. Stark
That depends on the number of replications, among other things.

Moreover, because of the bias, the usual formulae for uncertainty in
estimates based on random samples, etc., are incorrect: sample() does not
give a simple random sample.


Re: Bias in R's random integers?

Duncan Murdoch
In reply to this post by Philip B. Stark
On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> The 53 bits only encode at most 2^{32} possible values, because the
> source of the float is the output of a 32-bit PRNG (the obsolete version
> of MT). 53 bits isn't the relevant number here.

No, two calls to unif_rand() are used.  There are two 32-bit values, but
some of the bits are thrown away.

Duncan Murdoch


Re: Bias in R's random integers?

Philip B. Stark
No, the 2nd call only happens when m > 2**31. Here's the code:

(RNG.c, lines 793ff)

double R_unif_index(double dn)
{
    double cut = INT_MAX;

    switch(RNG_kind) {
    case KNUTH_TAOCP:
    case USER_UNIF:
    case KNUTH_TAOCP2:
        cut = 33554431.0; /* 2^25 - 1 */
        break;
    default:
        break;
    }

    double u = dn > cut ? ru() : unif_rand();
    return floor(dn * u);
}


Re: Bias in R's random integers?

Duncan Murdoch
On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> No, the 2nd call only happens when m > 2**31. Here's the code:

Yes, you're right. Sorry!

So the ratio really does come close to 2.  However, the difference in
probabilities between outcomes is still at most 2^-32 when m is less
than that cutoff.  That's not feasible to detect; the only detectable
difference would happen if some event was constructed to hold an
abundance of outcomes with especially low (or especially high) probability.

As I said in my original post, it's probably not hard to construct such
a thing, but as I've said more recently, it probably wouldn't happen by
chance.  Here's one attempt to do it:

Call the values from unif_rand() "the unif_rand() outcomes".  Call the
values from sample() the sample outcomes.

It would be easiest to see the error if half of the sample() outcomes
used two unif_rand() outcomes, and half used just one.  That would mean
m should be (2/3) * 2^32, but that's too big and would trigger the other
version.

So how about half use 2 unif_rands(), and half use 3?  That means m =
(2/5) * 2^32 ≈ 1717986918.  A good guess is that sample() outcomes would
alternate between the two possibilities, so our event could be even
versus odd outcomes.

Let's try it:

 > m <- (2/5)*2^32
 > m > 2^31
[1] FALSE
 > x <- sample(m, 1000000, replace = TRUE)
 > table(x %% 2)

      0      1
399850 600150

Since m is an even number, the true proportions of evens and odds should
be exactly 0.5.  That's some pretty strong evidence of the bug in the
generator.  (Note that the ratio of the observed probabilities is about
1.5, so I may not be the first person to have done this.)
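
As a quick quantification, a goodness-of-fit test against the nominal
50/50 split:

chisq.test(c(399850, 600150), p = c(0.5, 0.5))
## the observed proportion 0.60015 sits about 200 standard errors above 0.5,
## far beyond plausible Monte Carlo noise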

I'm still not convinced that there has ever been a simulation run with
detectable bias compared to Monte Carlo error unless it (like this one)
was designed specifically to show the problem.

Duncan Murdoch


Re: Bias in R's random integers?

Philip B. Stark
Hi Duncan--

Nice simulation!

The absolute difference in probabilities is small, but the maximum relative
difference grows from something negligible to almost 2 as m approaches
2**31.

Because the L_1 distance between the uniform distribution on {1, ..., m}
and what you actually get is large, there have to be test functions whose
expectations are quite different under the two distributions. Whether those
correspond to commonly used statistics or not, I have no idea.

Regarding backwards compatibility: as a user, I'd rather the default
sample() do the best possible thing, and take an extra step to use
something like sample(..., legacy=TRUE) if I want to reproduce old results.

Regards,
Philip


Re: Bias in R's random integers?

Philip B. Stark
One more thing, apropos this:

> I'm still not convinced that there has ever been a simulation run with
> detectable bias compared to Monte Carlo error unless it (like this one) was
> designed specifically to show the problem.


I often use random permutations to simulate p-values to calibrate
permutation tests. If I'm trying to simulate the probability of a
low-probability event, this could matter a lot.

Best wishes,
Philip

On Wed, Sep 19, 2018 at 12:52 PM Philip B. Stark <[hidden email]>
wrote:

> Hi Duncan--
>
> Nice simulation!
>
> The absolute difference in probabilities is small, but the maximum
> relative difference grows from something negligible to almost 2 as m
> approaches 2**31.
>
> Because the L_1 distance between the uniform distribution on {1, ..., m}
> and what you actually get is large, there have to be test functions whose
> expectations are quite different under the two distributions. Whether those
> correspond to commonly used statistics or not, I have no idea.
>
> Regarding backwards compatibility: as a user, I'd rather the default
> sample() do the best possible thing, and take an extra step to use
> something like sample(..., legacy=TRUE) if I want to reproduce old results.
>
> Regards,
> Philip
>
> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <[hidden email]>
> wrote:
>
>> On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>> > No, the 2nd call only happens when m > 2**31. Here's the code:
>>
>> Yes, you're right. Sorry!
>>
>> So the ratio really does come close to 2.  However, the difference in
>> probabilities between outcomes is still at most 2^-32 when m is less
>> than that cutoff.  That's not feasible to detect; the only detectable
>> difference would happen if some event was constructed to hold an
>> abundance of outcomes with especially low (or especially high)
>> probability.
>>
>> As I said in my original post, it's probably not hard to construct such
>> a thing, but as I've said more recently, it probably wouldn't happen by
>> chance.  Here's one attempt to do it:
>>
>> Call the values from unif_rand() "the unif_rand() outcomes".  Call the
>> values from sample() the sample outcomes.
>>
>> It would be easiest to see the error if half of the sample() outcomes
>> used two unif_rand() outcomes, and half used just one.  That would mean
>> m should be (2/3) * 2^32, but that's too big and would trigger the other
>> version.
>>
>> So how about half use 2 unif_rands(), and half use 3?  That means m =
>> (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes would
>> alternate between the two possibilities, so our event could be even
>> versus odd outcomes.
>>
>> Let's try it:
>>
>>  > m <- (2/5)*2^32
>>  > m > 2^31
>> [1] FALSE
>>  > x <- sample(m, 1000000, replace = TRUE)
>>  > table(x %% 2)
>>
>>       0      1
>> 399850 600150
>>
>> Since m is an even number, the true proportions of evens and odds should
>> be exactly 0.5.  That's some pretty strong evidence of the bug in the
>> generator.  (Note that the ratio of the observed probabilities is about
>> 1.5, so I may not be the first person to have done this.)
>>
>> I'm still not convinced that there has ever been a simulation run with
>> detectable bias compared to Monte Carlo error unless it (like this one)
>> was designed specifically to show the problem.
>>
>> Duncan Murdoch
>>
>> >
>> > (RNG.c, lines 793ff)
>> >
>> > double R_unif_index(double dn)
>> > {
>> >      double cut = INT_MAX;
>> >
>> >      switch(RNG_kind) {
>> >      case KNUTH_TAOCP:
>> >      case USER_UNIF:
>> >      case KNUTH_TAOCP2:
>> > cut = 33554431.0; /* 2^25 - 1 */
>> > break;
>> >      default:
>> > break;
>> >     }
>> >
>> >      double u = dn > cut ? ru() : unif_rand();
>> >      return floor(dn * u);
>> > }
>> >
>> > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <
>> [hidden email]
>> > <mailto:[hidden email]>> wrote:
>> >
>> >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>> >      > The 53 bits only encode at most 2^{32} possible values, because
>> the
>> >      > source of the float is the output of a 32-bit PRNG (the obsolete
>> >     version
>> >      > of MT). 53 bits isn't the relevant number here.
>> >
>> >     No, two calls to unif_rand() are used.  There are two 32 bit values,
>> >     but
>> >     some of the bits are thrown away.
>> >
>> >     Duncan Murdoch
>> >
>> >      >
>> >      > The selection ratios can get close to 2. Computer scientists
>> >     don't do it
>> >      > the way R does, for a reason.
>> >      >
>> >      > Regards,
>> >      > Philip
>> >      >
>> >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>> >     <[hidden email] <mailto:[hidden email]>
>> >      > <mailto:[hidden email]
>> >     <mailto:[hidden email]>>> wrote:
>> >      >
>> >      >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
>> >      >      > El mié., 19 sept. 2018 a las 14:43, Duncan Murdoch
>> >      >      > (<[hidden email]
>> >     <mailto:[hidden email]> <mailto:[hidden email]
>> >     <mailto:[hidden email]>>>)
>> >      >     escribió:
>> >      >      >>
>> >      >      >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:

--
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor,  Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark


Re: Bias in R's random integers?

Duncan Murdoch-2
In reply to this post by Philip B. Stark
On 19/09/2018 3:52 PM, Philip B. Stark wrote:

> Hi Duncan--
>
> Nice simulation!
>
> The absolute difference in probabilities is small, but the maximum
> relative difference grows from something negligible to almost 2 as m
> approaches 2**31.
>
> Because the L_1 distance between the uniform distribution on {1, ..., m}
> and what you actually get is large, there have to be test functions
> whose expectations are quite different under the two distributions.

That is a mathematically true statement, but I suspect it is not very
relevant.  Pseudo-random number generators always have test functions
whose sample averages are quite different from the expectation under the
true distribution.  Remember von Neumann's "state of sin" quote.  The
bug in sample() just means it is easier to find such a function than it
would otherwise be.

The practical question is whether such a function is likely to arise in
practice or not.

 > Whether those correspond to commonly used statistics or not, I have no
 > idea.

I am pretty confident that this bug rarely matters.

> Regarding backwards compatibility: as a user, I'd rather the default
> sample() do the best possible thing, and take an extra step to use
> something like sample(..., legacy=TRUE) if I want to reproduce old results.

I suspect there's a good chance the bug I discovered today (non-integer
x values not being truncated) will be declared to be a feature, and the
documentation will be changed.  Then the rejection sampling approach
would need to be quite a bit more complicated.

I think a documentation warning about the accuracy of sampling
probabilities would also be a sufficient fix here, and would be quite a
bit less trouble than changing the default sample().  But as I said in
my original post, a contribution of a function without this bug would be
a nice addition.

Duncan Murdoch

>
> Regards,
> Philip
>
> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>      > No, the 2nd call only happens when m > 2**31. Here's the code:
>
>     Yes, you're right. Sorry!
>
>     So the ratio really does come close to 2.  However, the difference in
>     probabilities between outcomes is still at most 2^-32 when m is less
>     than that cutoff.  That's not feasible to detect; the only detectable
>     difference would happen if some event was constructed to hold an
>     abundance of outcomes with especially low (or especially high)
>     probability.
>
>     As I said in my original post, it's probably not hard to construct such
>     a thing, but as I've said more recently, it probably wouldn't happen by
>     chance.  Here's one attempt to do it:
>
>     Call the values from unif_rand() "the unif_rand() outcomes".  Call the
>     values from sample() the sample outcomes.
>
>     It would be easiest to see the error if half of the sample() outcomes
>     used two unif_rand() outcomes, and half used just one.  That would mean
>     m should be (2/3) * 2^32, but that's too big and would trigger the
>     other
>     version.
>
>     So how about half use 2 unif_rands(), and half use 3?  That means m =
>     (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes
>     would
>     alternate between the two possibilities, so our event could be even
>     versus odd outcomes.
>
>     Let's try it:
>
>       > m <- (2/5)*2^32
>       > m > 2^31
>     [1] FALSE
>       > x <- sample(m, 1000000, replace = TRUE)
>       > table(x %% 2)
>
>            0      1
>     399850 600150
>
>     Since m is an even number, the true proportions of evens and odds
>     should
>     be exactly 0.5.  That's some pretty strong evidence of the bug in the
>     generator.  (Note that the ratio of the observed probabilities is about
>     1.5, so I may not be the first person to have done this.)
>
>     I'm still not convinced that there has ever been a simulation run with
>     detectable bias compared to Monte Carlo error unless it (like this one)
>     was designed specifically to show the problem.
>
>     Duncan Murdoch
>
>      >
>      > (RNG.c, lines 793ff)
>      >
>      > double R_unif_index(double dn)
>      > {
>      >      double cut = INT_MAX;
>      >
>      >      switch(RNG_kind) {
>      >      case KNUTH_TAOCP:
>      >      case USER_UNIF:
>      >      case KNUTH_TAOCP2:
>      > cut = 33554431.0; /* 2^25 - 1 */
>      > break;
>      >      default:
>      > break;
>      >     }
>      >
>      >      double u = dn > cut ? ru() : unif_rand();
>      >      return floor(dn * u);
>      > }
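
For concreteness, here is a minimal sketch in C of the kind of bit-based
rejection sampling being discussed, as a contrast to the floor(dn * u) in
the quoted R_unif_index() above.  It is not R's code: next_u01() is a
placeholder for a [0,1) source such as unif_rand(), the sketch assumes the
top 16 bits of each draw are uniform (true for the usual 32-bit
generators), and it ignores the non-integer-x question raised earlier.

#include <math.h>
#include <stdint.h>

/* Placeholder for a [0,1) uniform source such as unif_rand(). */
double next_u01(void);

/* Draw 'nbits' unbiased random bits, 16 bits per call to next_u01(). */
static uint64_t random_bits(int nbits)
{
    uint64_t v = 0;
    for (int got = 0; got < nbits; got += 16) {
        uint64_t chunk = (uint64_t) floor(next_u01() * 65536.0); /* 16 bits */
        v = (v << 16) | chunk;
    }
    return v & ((UINT64_C(1) << nbits) - 1);   /* keep exactly nbits bits */
}

/* Unbiased integer uniform on {0, ..., n - 1}, for 0 < n <= 2^31:
 * draw ceil(log2(n)) bits and reject values >= n, instead of
 * returning floor(dn * u). */
double unbiased_index(double n)
{
    int nbits = (int) ceil(log2(n));
    double v;
    do {
        v = (double) random_bits(nbits);
    } while (v >= n);
    return v;
}

Note that rejection consumes a variable number of next_u01() draws per
index, so a change along these lines necessarily breaks bit-for-bit
reproduction of old simulation results, which is the backward-compatibility
cost discussed above.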

Re: Bias in R's random integers?

David Hugh-Jones-3
It doesn't seem too hard to come up with plausible ways in which this could
give bad results. Suppose I sample rows from a large dataset, maybe for
bootstrapping. Suppose the rows are non-randomly ordered, e.g. odd rows are
males, even rows are females. Oops! Very non-representative sample,
bootstrap p values are garbage.

David

--
Sent from Gmail Mobile


Re: Bias in R's random integers?

Duncan Murdoch-2
On 19/09/2018 5:57 PM, David Hugh-Jones wrote:
>
> It doesn't seem too hard to come up with plausible ways in which this
> could give bad results. Suppose I sample rows from a large dataset,
> maybe for bootstrapping. Suppose the rows are non-randomly ordered, e.g.
> odd rows are males, even rows are females. Oops! Very non-representative
> sample, bootstrap p values are garbage.

That would only happen if your dataset was exactly 1717986918 elements
in size. (And in fact, it will be less extreme than I posted:  I had x
set to 1717986918.4, as described in another thread.  If you use an
integer value you need a different pattern; add or subtract an element
or two and the pattern needed to see a problem changes drastically.)

But if you're sampling from a dataset of that exact size, then you
should worry about this bug. Don't use sample().  Use the algorithm that
Carl described.

Duncan Murdoch
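
To make that concrete, here is a small usage sketch, illustrative only: it
reuses the hypothetical unbiased_index() sketched earlier in this thread
(not any function that actually exists in R or on CRAN) to draw 1-based
bootstrap row indices instead of taking floor(unif_rand() * nrow).

/* 'unbiased_index' refers to the sketch earlier in this thread,
 * not to an existing R or CRAN function. */
double unbiased_index(double n);   /* uniform on {0, ..., n - 1} */

/* Fill 'out' with nboot 1-based bootstrap row indices for a data set of
 * nrow rows (e.g. nrow = 1717986918 in the scenario above). */
void bootstrap_indices(double nrow, int nboot, int *out)
{
    for (int i = 0; i < nboot; i++)
        out[i] = (int) unbiased_index(nrow) + 1;
}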


Re: Bias in R's random integers?

bbolker
In reply to this post by Duncan Murdoch-2

  A quick point of order here: arguing with Duncan in this forum is
helpful to expose ideas, but probably neither side will convince the
other; eventually, if you want this adopted in core R, you'll need to
convince an R-core member to pursue this fix.

  In the meantime, a good, well-tested implementation in a
user-contributed package (presumably written in C for speed) would be
enormously useful.  Volunteers ... ?
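
As a hedged sketch of what the C core of such a package might look like
(the names and the 32-bit source are assumptions, not an existing
implementation), here is the classic rejection scheme used by
arc4random_uniform-style functions: discard just enough of the generator's
range that what remains divides evenly by the bound.

#include <stdint.h>

/* Placeholder for whatever full 32-bit source the package wraps; R's
 * unif_rand() would need an adapter that reassembles whole 32-bit words. */
uint32_t next_u32(void);

/* Unbiased draw from {0, ..., upper_bound - 1}: reject the first
 * (2^32 mod upper_bound) values so the accepted range is an exact
 * multiple of upper_bound, then reduce.  Expected number of draws < 2. */
uint32_t runif_bounded(uint32_t upper_bound)
{
    if (upper_bound < 2)
        return 0;
    uint32_t min = -upper_bound % upper_bound;  /* 2^32 mod upper_bound */
    uint32_t r;
    do {
        r = next_u32();
    } while (r < min);
    return r % upper_bound;
}

Bounds above 2^32 - 1 (the long-vector case) would need a 64-bit variant of
the same idea.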



On 2018-09-19 04:19 PM, Duncan Murdoch wrote:

> On 19/09/2018 3:52 PM, Philip B. Stark wrote:
>> Hi Duncan--
>>
>> Nice simulation!
>>
>> The absolute difference in probabilities is small, but the maximum
>> relative difference grows from something negligible to almost 2 as m
>> approaches 2**31.
>>
>> Because the L_1 distance between the uniform distribution on {1, ...,
>> m} and what you actually get is large, there have to be test functions
>> whose expectations are quite different under the two distributions.
>
> That is a mathematically true statement, but I suspect it is not very
> relevant.  Pseudo-random number generators always have test functions
> whose sample averages are quite different from the expectation under the
> true distribution.  Remember Von Neumann's "state of sin" quote.  The
> bug in sample() just means it is easier to find such a function than it
> would otherwise be.
>
> The practical question is whether such a function is likely to arise in
> practice or not.
>
>> Whether those correspond to commonly used statistics or not, I have no
>> idea.
>
> I am pretty confident that this bug rarely matters.
>
>> Regarding backwards compatibility: as a user, I'd rather the default
>> sample() do the best possible thing, and take an extra step to use
>> something like sample(..., legacy=TRUE) if I want to reproduce old
>> results.
>
> I suspect there's a good chance the bug I discovered today (non-integer
> x values not being truncated) will be declared to be a feature, and the
> documentation will be changed.  Then the rejection sampling approach
> would need to be quite a bit more complicated.
>
> I think a documentation warning about the accuracy of sampling
> probabilities would also be a sufficient fix here, and would be quite a
> bit less trouble than changing the default sample().  But as I said in
> my original post, a contribution of a function without this bug would be
> a nice addition.
>
> Duncan Murdoch
>
>>
>> Regards,
>> Philip
>>
>> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
>> <[hidden email] <mailto:[hidden email]>> wrote:
>>
>>     On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>>      > No, the 2nd call only happens when m > 2**31. Here's the code:
>>
>>     Yes, you're right. Sorry!
>>
>>     So the ratio really does come close to 2.  However, the difference in
>>     probabilities between outcomes is still at most 2^-32 when m is less
>>     than that cutoff.  That's not feasible to detect; the only detectable
>>     difference would happen if some event was constructed to hold an
>>     abundance of outcomes with especially low (or especially high)
>>     probability.
>>
>>     As I said in my original post, it's probably not hard to construct
>> such
>>     a thing, but as I've said more recently, it probably wouldn't
>> happen by
>>     chance.  Here's one attempt to do it:
>>
>>     Call the values from unif_rand() "the unif_rand() outcomes".  Call
>> the
>>     values from sample() the sample outcomes.
>>
>>     It would be easiest to see the error if half of the sample() outcomes
>>     used two unif_rand() outcomes, and half used just one.  That would
>> mean
>>     m should be (2/3) * 2^32, but that's too big and would trigger the
>>     other
>>     version.
>>
>>     So how about half use 2 unif_rands(), and half use 3?  That means m =
>>     (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes
>>     would
>>     alternate between the two possibilities, so our event could be even
>>     versus odd outcomes.
>>
>>     Let's try it:
>>
>>       > m <- (2/5)*2^32
>>       > m > 2^31
>>     [1] FALSE
>>       > x <- sample(m, 1000000, replace = TRUE)
>>       > table(x %% 2)
>>
>>            0      1
>>     399850 600150
>>
>>     Since m is an even number, the true proportions of evens and odds
>>     should
>>     be exactly 0.5.  That's some pretty strong evidence of the bug in the
>>     generator.  (Note that the ratio of the observed probabilities is
>> about
>>     1.5, so I may not be the first person to have done this.)
>>
>>     I'm still not convinced that there has ever been a simulation run
>> with
>>     detectable bias compared to Monte Carlo error unless it (like this
>> one)
>>     was designed specifically to show the problem.
>>
>>     Duncan Murdoch
>>
>>      >
>>      > (RNG.c, lines 793ff)
>>      >
>>      > double R_unif_index(double dn)
>>      > {
>>      >      double cut = INT_MAX;
>>      >
>>      >      switch(RNG_kind) {
>>      >      case KNUTH_TAOCP:
>>      >      case USER_UNIF:
>>      >      case KNUTH_TAOCP2:
>>      > cut = 33554431.0; /* 2^25 - 1 */
>>      > break;
>>      >      default:
>>      > break;
>>      >     }
>>      >
>>      >      double u = dn > cut ? ru() : unif_rand();
>>      >      return floor(dn * u);
>>      > }
>>      >
>>      > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
>>     <[hidden email] <mailto:[hidden email]>
>>      > <mailto:[hidden email]
>>     <mailto:[hidden email]>>> wrote:
>>      >
>>      >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>>      >      > The 53 bits only encode at most 2^{32} possible values,
>>     because the
>>      >      > source of the float is the output of a 32-bit PRNG (the
>>     obsolete
>>      >     version
>>      >      > of MT). 53 bits isn't the relevant number here.
>>      >
>>      >     No, two calls to unif_rand() are used.  There are two 32 bit
>>     values,
>>      >     but
>>      >     some of the bits are thrown away.
>>      >
>>      >     Duncan Murdoch
>>      >
>>      >      >
>>      >      > The selection ratios can get close to 2. Computer
>> scientists
>>      >     don't do it
>>      >      > the way R does, for a reason.
>>      >      >
>>      >      > Regards,
>>      >      > Philip
>>      >      >
>>      >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>>      >     <[hidden email] <mailto:[hidden email]>
>>     <mailto:[hidden email] <mailto:[hidden email]>>
>>      >      > <mailto:[hidden email]
>>     <mailto:[hidden email]>
>>      >     <mailto:[hidden email]
>>     <mailto:[hidden email]>>>> wrote:
>>      >      >
>>      >      >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
>>      >      >      > El mié., 19 sept. 2018 a las 14:43, Duncan Murdoch
>>      >      >      > (<[hidden email]
>>     <mailto:[hidden email]>
>>      >     <mailto:[hidden email]
>>     <mailto:[hidden email]>> <mailto:[hidden email]
>>     <mailto:[hidden email]>
>>      >     <mailto:[hidden email]
>>     <mailto:[hidden email]>>>>)
>>      >      >     escribió:
>>      >      >      >>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Bias in R's random integers?

Carl Boettiger
For a well-tested C algorithm, based on my reading of Lemire, the unbiased
"algorithm 3" in https://arxiv.org/abs/1805.10941 is already part of the C
standard library in OpenBSD and macOS (as arc4random_uniform), and in the
GNU standard library.  Lemire also provides C++ code in the appendix of his
piece for both this and the faster "nearly divisionless" algorithm.
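
To make that concrete, here is a minimal C sketch of the
arc4random_uniform-style rejection idea (reject the first 2^32 mod m values,
then take the remainder).  This is not R's code and not the code from
Lemire's appendix; the xorshift32 generator and the names next_u32() and
bounded_rand() are just placeholders, and the even/odd count at the end
mirrors Duncan's check from earlier in the thread.

#include <stdint.h>
#include <stdio.h>

static uint32_t state = 2463534242u;   /* placeholder generator state */

static uint32_t next_u32(void)         /* xorshift32: stand-in 32-bit source */
{
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

/* Uniform draw from {0, 1, ..., m - 1}, assuming m > 0: reject the first
   (2^32 mod m) values so the accepted range is an exact multiple of m. */
static uint32_t bounded_rand(uint32_t m)
{
    uint32_t threshold = -m % m;       /* == 2^32 mod m */
    for (;;) {
        uint32_t r = next_u32();
        if (r >= threshold)            /* every residue class mod m is now
                                          equally represented */
            return r % m;
    }
}

int main(void)
{
    /* m = (2/5) * 2^32, the value used elsewhere in the thread to expose
       the bias of the floor(m * u) construction; here evens and odds
       should come out close to 50/50. */
    uint32_t m = 1717986918u;
    long counts[2] = {0, 0};
    for (int i = 0; i < 1000000; i++)
        counts[bounded_rand(m) % 2]++;
    printf("even: %ld  odd: %ld\n", counts[0], counts[1]);
    return 0;
}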

It would be excellent if any R core members were interested in considering
bindings to these algorithms as a patch, or might express expectations for
how that patch would have to operate (e.g. re Duncan's comment about
non-integer arguments to sample size).  Otherwise, an R package binding
seems like a good starting point, but I'm not the right volunteer.


On Wed, Sep 19, 2018 at 4:22 PM Ben Bolker <[hidden email]> wrote:

>
>   A quick point of order here: arguing with Duncan in this forum is
> helpful to expose ideas, but probably neither side will convince the
> other; eventually, if you want this adopted in core R, you'll need to
> convince an R-core member to pursue this fix.
>
>   In the meantime, a good, well-tested implementation in a
> user-contributed package (presumably written in C for speed) would be
> enormously useful.  Volunteers ... ?
>
>
>
> On 2018-09-19 04:19 PM, Duncan Murdoch wrote:
> > On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> >> Hi Duncan--
> >>
> >> Nice simulation!
> >>
> >> The absolute difference in probabilities is small, but the maximum
> >> relative difference grows from something negligible to almost 2 as m
> >> approaches 2**31.
> >>
> >> Because the L_1 distance between the uniform distribution on {1, ...,
> >> m} and what you actually get is large, there have to be test functions
> >> whose expectations are quite different under the two distributions.
> >
> > That is a mathematically true statement, but I suspect it is not very
> > relevant.  Pseudo-random number generators always have test functions
> > whose sample averages are quite different from the expectation under the
> > true distribution.  Remember Von Neumann's "state of sin" quote.  The
> > bug in sample() just means it is easier to find such a function than it
> > would otherwise be.
> >
> > The practical question is whether such a function is likely to arise in
> > practice or not.
> >
> >> Whether those correspond to commonly used statistics or not, I have no
> >> idea.
> >
> > I am pretty confident that this bug rarely matters.
> >
> >> Regarding backwards compatibility: as a user, I'd rather the default
> >> sample() do the best possible thing, and take an extra step to use
> >> something like sample(..., legacy=TRUE) if I want to reproduce old
> >> results.
> >
> > I suspect there's a good chance the bug I discovered today (non-integer
> > x values not being truncated) will be declared to be a feature, and the
> > documentation will be changed.  Then the rejection sampling approach
> > would need to be quite a bit more complicated.
> >
> > I think a documentation warning about the accuracy of sampling
> > probabilities would also be a sufficient fix here, and would be quite a
> > bit less trouble than changing the default sample().  But as I said in
> > my original post, a contribution of a function without this bug would be
> > a nice addition.
> >
> > Duncan Murdoch
> >
> >>
> >> Regards,
> >> Philip
> >>
> >> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
> >> <[hidden email] <mailto:[hidden email]>> wrote:
> >>
> >>     On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> >>      > No, the 2nd call only happens when m > 2**31. Here's the code:
> >>
> >>     Yes, you're right. Sorry!
> >>
> >>     So the ratio really does come close to 2.  However, the difference in
> >>     probabilities between outcomes is still at most 2^-32 when m is less
> >>     than that cutoff.  That's not feasible to detect; the only detectable
> >>     difference would happen if some event was constructed to hold an
> >>     abundance of outcomes with especially low (or especially high)
> >>     probability.
> >>
> >>     As I said in my original post, it's probably not hard to construct
> >>     such a thing, but as I've said more recently, it probably wouldn't
> >>     happen by chance.  Here's one attempt to do it:
> >>
> >>     Call the values from unif_rand() "the unif_rand() outcomes".  Call the
> >>     values from sample() the sample outcomes.
> >>
> >>     It would be easiest to see the error if half of the sample() outcomes
> >>     used two unif_rand() outcomes, and half used just one.  That would mean
> >>     m should be (2/3) * 2^32, but that's too big and would trigger the
> >>     other version.
> >>
> >>     So how about half use 2 unif_rands(), and half use 3?  That means m =
> >>     (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes would
> >>     alternate between the two possibilities, so our event could be even
> >>     versus odd outcomes.
> >>
> >>     Let's try it:
> >>
> >>       > m <- (2/5)*2^32
> >>       > m > 2^31
> >>     [1] FALSE
> >>       > x <- sample(m, 1000000, replace = TRUE)
> >>       > table(x %% 2)
> >>
> >>            0      1
> >>     399850 600150
> >>
> >>     Since m is an even number, the true proportions of evens and odds should
> >>     be exactly 0.5.  That's some pretty strong evidence of the bug in the
> >>     generator.  (Note that the ratio of the observed probabilities is about
> >>     1.5, so I may not be the first person to have done this.)
> >>
> >>     I'm still not convinced that there has ever been a simulation run with
> >>     detectable bias compared to Monte Carlo error unless it (like this one)
> >>     was designed specifically to show the problem.
> >>
> >>     Duncan Murdoch
> >>
> >>      >
> >>      > (RNG.c, lines 793ff)
> >>      >
> >>      > double R_unif_index(double dn)
> >>      > {
> >>      >      double cut = INT_MAX;
> >>      >
> >>      >      switch(RNG_kind) {
> >>      >      case KNUTH_TAOCP:
> >>      >      case USER_UNIF:
> >>      >      case KNUTH_TAOCP2:
> >>      >          cut = 33554431.0; /* 2^25 - 1 */
> >>      >          break;
> >>      >      default:
> >>      >          break;
> >>      >      }
> >>      >
> >>      >      double u = dn > cut ? ru() : unif_rand();
> >>      >      return floor(dn * u);
> >>      > }
> >>      >
> >>      > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
> >>      > <[hidden email]> wrote:
> >>      >
> >>      >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> >>      >      > The 53 bits only encode at most 2^{32} possible values,
> >>      >      > because the source of the float is the output of a 32-bit
> >>      >      > PRNG (the obsolete version of MT). 53 bits isn't the relevant
> >>      >      > number here.
> >>      >
> >>      >     No, two calls to unif_rand() are used.  There are two 32 bit
> >>      >     values, but some of the bits are thrown away.
> >>      >
> >>      >     Duncan Murdoch
> >>      >
> >>      >      >
> >>      >      > The selection ratios can get close to 2. Computer scientists
> >>      >      > don't do it the way R does, for a reason.
> >>      >      >
> >>      >      > Regards,
> >>      >      > Philip
> >>      >      >
> >>      >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> >>      >      > <[hidden email]> wrote:
> >>      >      >
> >>      >      >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> >>      >      >      > On Wed, 19 Sep 2018 at 14:43, Duncan Murdoch
> >>      >      >      > (<[hidden email]>) wrote:
> >>      >      >      >
> >>      >      >      > According to Kellie and Philip, in the attachment of
> >>      >      >      > the thread referenced by Carl, "The maximum ratio of
> >>      >      >      > selection probabilities can get as large as 1.5 if n
> >>      >      >      > is just below 2^31".
> >>      >      >
> >>      >      >     Sorry, I didn't write very well.  I meant to say that the
> >>      >      >     difference in probabilities would be 2^-32, not that the
> >>      >      >     ratio of probabilities would be 1 + 2^-32.
> >>      >      >
> >>      >      >     By the way, I don't see the statement giving the ratio as
> >>      >      >     1.5, but maybe I was looking in the wrong place.  In
> >>      >      >     Theorem 1 of the paper I was looking in, the ratio was
> >>      >      >     "1 + m 2^{-w + 1}".  In that formula m is your n.  If it
> >>      >      >     is near 2^31, R uses w = 57 random bits, so the ratio
> >>      >      >     would be very, very small (one part in 2^25).
> >>      >      >
> >>      >      >     The worst case for R would happen when m is just below
> >>      >      >     2^25, where w is at least 31 for the default generators.
> >>      >      >     In that case the ratio could be about 1.03.
> >>      >      >
> >>      >      >     Duncan Murdoch
> >>      >      >
--

http://carlboettiger.info

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Bias in R's random integers?

Ralf Stubner
On 9/20/18 1:43 AM, Carl Boettiger wrote:

> For a well-tested C algorithm, based on my reading of Lemire, the unbiased
> "algorithm 3" in https://arxiv.org/abs/1805.10941 is part already of the C
> standard library in OpenBSD and macOS (as arc4random_uniform), and in the
> GNU standard library.  Lemire also provides C++ code in the appendix of his
> piece for both this and the faster "nearly divisionless" algorithm.
>
> It would be excellent if any R core members were interested in considering
> bindings to these algorithms as a patch, or might express expectations for
> how that patch would have to operate (e.g. re Duncan's comment about
> non-integer arguments to sample size).  Otherwise, an R package binding
> seems like a good starting point, but I'm not the right volunteer.
It is difficult to do this in a package, since R does not provide access
to the random bits generated by the RNG. Only a float in (0,1) is
available via unif_rand(). However, if one is willing to use an external
RNG, it is of course possible. After reading about Lemire's work [1], I
had planned to integrate such an unbiased sampling scheme into the dqrng
package, which I have now started. [2]
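
For readers who have not followed the links, the bounded-integer step itself
is small.  Below is a rough C sketch of Lemire's "nearly divisionless"
method, assuming the package controls its own raw 32-bit stream; it is a
sketch only, not the dqrng implementation (which is C++), and the splitmix64
stand-in and the names next_u32() and bounded_rand() are placeholders.

#include <stdint.h>
#include <stdio.h>

static uint64_t s = 0x9e3779b97f4a7c15u;      /* placeholder generator state */

static uint32_t next_u32(void)                /* splitmix64, top 32 bits */
{
    uint64_t z = (s += 0x9e3779b97f4a7c15u);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9u;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebu;
    return (uint32_t)((z ^ (z >> 31)) >> 32);
}

/* Uniform on {0, ..., m - 1}, m > 0: a 32x32 -> 64 bit multiply replaces the
   modulo in the common case; rejection only handles the small biased
   fraction of the range. */
static uint32_t bounded_rand(uint32_t m)
{
    uint64_t x = next_u32();
    uint64_t prod = x * (uint64_t)m;
    uint32_t low = (uint32_t)prod;
    if (low < m) {
        uint32_t threshold = -m % m;          /* 2^32 mod m */
        while (low < threshold) {
            x = next_u32();
            prod = x * (uint64_t)m;
            low = (uint32_t)prod;
        }
    }
    return (uint32_t)(prod >> 32);
}

int main(void)
{
    for (int i = 0; i < 5; i++)               /* a few draws in [0, 10) */
        printf("%u\n", bounded_rand(10u));
    return 0;
}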

Using Duncan's example, the results look much better:

> library(dqrng)
> m <- (2/5)*2^32
> y <- dqsample(m, 1000000, replace = TRUE)
> table(y %% 2)

     0      1
500252 499748

Currently I am taking the other interpretation of "truncated":

> table(dqsample(2.5, 1000000, replace = TRUE))

     0      1
499894 500106

I will adjust this to whatever is decided for base R.


However, there is currently neither long vector nor weighted sampling
support. And the performance without replacement is quite bad compared
to R's algorithm with hashing.

cheerio
ralf

[1] via http://www.pcg-random.org/posts/bounded-rands.html
[2] https://github.com/daqana/dqrng/tree/feature/sample

--
Ralf Stubner
Senior Software Engineer / Trainer

daqana GmbH
Dortustraße 48
14467 Potsdam

T: +49 331 23 61 93 11
F: +49 331 23 61 93 90
M: +49 162 20 91 196
Mail: [hidden email]

Registered office: Potsdam
Commercial register: AG Potsdam HRB 27966 P
VAT ID: DE300072622
Managing director: Prof. Dr. Dr. Karl-Kuno Kunze


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
