Extreme bunching of random values from runif with Mersenne-Twister seed

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

Extreme bunching of random values from runif with Mersenne-Twister seed

Tirthankar Chakravarty-2
This is cross-posted from SO (https://stackoverflow.com/q/47079702/1414455),
but I now feel that this needs someone from R-Devel to help understand why
this is happening.

We are facing a weird situation in our code when using R's [`runif`][1] and
setting seed with `set.seed` with the `kind = NULL` option (which resolves,
unless I am mistaken, to `kind = "default"`; the default being
`"Mersenne-Twister"`).

We set the seed using (8 digit) unique IDs generated by an upstream system,
before calling `runif`:

    seeds = c(
      "86548915", "86551615", "86566163", "86577411", "86584144",
      "86584272", "86620568", "86724613", "86756002", "86768593",
"86772411",
      "86781516", "86794389", "86805854", "86814600", "86835092",
"86874179",
      "86876466", "86901193", "86987847", "86988080")

    random_values = sapply(seeds, function(x) {
      set.seed(x)
      y = runif(1, 17, 26)
      return(y)
    })

This gives values that are **extremely** bunched together.

    > summary(random_values)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      25.13   25.36   25.66   25.58   25.83   25.94

This behaviour of `runif` goes away when we use `kind =
"Knuth-TAOCP-2002"`, and we get values that appear to be much more evenly
spread out.

    random_values = sapply(seeds, function(x) {
      set.seed(x, kind = "Knuth-TAOCP-2002")
      y = runif(1, 17, 26)
      return(y)
    })

*Output omitted.*

---

**The most interesting thing here is that this does not happen on Windows
-- only happens on Ubuntu** (`sessionInfo` output for Ubuntu & Windows
below).

# Windows output: #

    > seeds = c(
    +   "86548915", "86551615", "86566163", "86577411", "86584144",
    +   "86584272", "86620568", "86724613", "86756002", "86768593",
"86772411",
    +   "86781516", "86794389", "86805854", "86814600", "86835092",
"86874179",
    +   "86876466", "86901193", "86987847", "86988080")
    >
    > random_values = sapply(seeds, function(x) {
    +   set.seed(x)
    +   y = runif(1, 17, 26)
    +   return(y)
    + })
    >
    > summary(random_values)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      17.32   20.14   23.00   22.17   24.07   25.90

Can someone help understand what is going on?

Ubuntu
------

    R version 3.4.0 (2017-04-21)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 16.04.2 LTS

    Matrix products: default
    BLAS: /usr/lib/libblas/libblas.so.3.6.0
    LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

    locale:
    [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8
     [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8
    [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets
methods   base

    other attached packages:
    [1] RMySQL_0.10.8               DBI_0.6-1
     [3] jsonlite_1.4                tidyjson_0.2.2
     [5] optiRum_0.37.3              lubridate_1.6.0
     [7] httr_1.2.1                  gdata_2.18.0
     [9] XLConnect_0.2-12            XLConnectJars_0.2-12
    [11] data.table_1.10.4           stringr_1.2.0
    [13] readxl_1.0.0                xlsx_0.5.7
    [15] xlsxjars_0.6.1              rJava_0.9-8
    [17] sqldf_0.4-10                RSQLite_1.1-2
    [19] gsubfn_0.6-6                proto_1.0.0
    [21] dplyr_0.5.0                 purrr_0.2.4
    [23] readr_1.1.1                 tidyr_0.6.3
    [25] tibble_1.3.0                tidyverse_1.1.1
    [27] rBayesianOptimization_1.1.0 xgboost_0.6-4
    [29] MLmetrics_1.1.1             caret_6.0-76
    [31] ROCR_1.0-7                  gplots_3.0.1
    [33] effects_3.1-2               pROC_1.10.0
    [35] pscl_1.4.9                  lattice_0.20-35
    [37] MASS_7.3-47                 ggplot2_2.2.1

    loaded via a namespace (and not attached):
    [1] splines_3.4.0      foreach_1.4.3      AUC_0.3.0
modelr_0.1.0
     [5] gtools_3.5.0       assertthat_0.2.0   stats4_3.4.0
 cellranger_1.1.0
     [9] quantreg_5.33      chron_2.3-50       digest_0.6.10
rvest_0.3.2
    [13] minqa_1.2.4        colorspace_1.3-2   Matrix_1.2-10
plyr_1.8.4
    [17] psych_1.7.3.21     XML_3.98-1.7       broom_0.4.2
SparseM_1.77
    [21] haven_1.0.0        scales_0.4.1       lme4_1.1-13
MatrixModels_0.4-1
    [25] mgcv_1.8-17        car_2.1-5          nnet_7.3-12
lazyeval_0.2.0
    [29] pbkrtest_0.4-7     mnormt_1.5-5       magrittr_1.5
 memoise_1.0.0
    [33] nlme_3.1-131       forcats_0.2.0      xml2_1.1.1
 foreign_0.8-69
    [37] tools_3.4.0        hms_0.3            munsell_0.4.3
compiler_3.4.0
    [41] caTools_1.17.1     rlang_0.1.1        grid_3.4.0
 nloptr_1.0.4
    [45] iterators_1.0.8    bitops_1.0-6       tcltk_3.4.0
gtable_0.2.0
    [49] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2     R6_2.2.0

    [53] knitr_1.15.1       KernSmooth_2.23-15 stringi_1.1.5
Rcpp_0.12.11



Windows
-------

    > sessionInfo()
    R version 3.3.2 (2016-10-31)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows >= 8 x64 (build 9200)

    locale:
    [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
LC_MONETARY=English_India.1252
    [4] LC_NUMERIC=C                   LC_TIME=English_India.1252

    attached base packages:
    [1] graphics  grDevices utils     datasets  grid      stats
 methods   base

    other attached packages:
     [1] bindrcpp_0.2         h2o_3.14.0.3         ggrepel_0.6.5
eulerr_1.1.0         VennDiagram_1.6.17
     [6] futile.logger_1.4.3  scales_0.4.1         FinCal_0.6.3
 xml2_1.0.0           httr_1.3.0
    [11] wesanderson_0.3.2    wordcloud_2.5        RColorBrewer_1.1-2
 htmltools_0.3.6      urltools_1.6.0
    [16] timevis_0.4          dtplyr_0.0.1         magrittr_1.5
 shiny_1.0.5          RODBC_1.3-14
    [21] zoo_1.8-0            sqldf_0.4-10         RSQLite_1.1-2
gsubfn_0.6-6         proto_1.0.0
    [26] gdata_2.17.0         stringr_1.2.0        XLConnect_0.2-12
 XLConnectJars_0.2-12 data.table_1.10.4
    [31] xlsx_0.5.7           xlsxjars_0.6.1       rJava_0.9-8
readxl_0.1.1         googlesheets_0.2.1
    [36] jsonlite_1.5         tidyjson_0.2.1       RMySQL_0.10.9
RPostgreSQL_0.4-1    DBI_0.5-1
    [41] dplyr_0.7.2          purrr_0.2.3          readr_1.1.1
tidyr_0.7.0          tibble_1.3.3
    [46] ggplot2_2.2.0        tidyverse_1.0.0      lubridate_1.6.0

    loaded via a namespace (and not attached):
     [1] gtools_3.5.0         assertthat_0.2.0     triebeard_0.3.0
cellranger_1.1.0     yaml_2.1.14
     [6] slam_0.1-40          lattice_0.20-34      glue_1.1.1
 chron_2.3-48         digest_0.6.12.1
    [11] colorspace_1.3-1     httpuv_1.3.5         plyr_1.8.4
 pkgconfig_2.0.1      xtable_1.8-2
    [16] lazyeval_0.2.0       mime_0.5             memoise_1.0.0
tools_3.3.2          hms_0.3
    [21] munsell_0.4.3        lambda.r_1.1.9       rlang_0.1.1
RCurl_1.95-4.8       labeling_0.3
    [26] bitops_1.0-6         tcltk_3.3.2          gtable_0.2.0
 reshape2_1.4.2       R6_2.2.0
    [31] bindr_0.1            futile.options_1.0.0 stringi_1.1.2
Rcpp_0.12.12.1

  [1]: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/Uniform.html

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Martin Maechler
>>>>> Tirthankar Chakravarty <[hidden email]>
>>>>>     on Fri, 3 Nov 2017 13:19:12 +0530 writes:

    > This is cross-posted from SO
    > (https://stackoverflow.com/q/47079702/1414455), but I now
    > feel that this needs someone from R-Devel to help
    > understand why this is happening.

Why R-devel -- R-help would have been appropriate:

It seems you have not read the help page for
set.seed as I expect it from posters to R-devel.
Why would you use strings instead of integers if you *had* read it ?

    > We are facing a weird situation in our code when using R's
    > [`runif`][1] and setting seed with `set.seed` with the
    > `kind = NULL` option (which resolves, unless I am
    > mistaken, to `kind = "default"`; the default being
    > `"Mersenne-Twister"`).

again this is not what the help page says; rather

 | The use of ‘kind = NULL’ or ‘normal.kind = NULL’ in ‘RNGkind’ or
 | ‘set.seed’ selects the currently-used generator (including that
 | used in the previous session if the workspace has been restored):
 | if no generator has been used it selects ‘"default"’.

but as you have > 90 (!!) packages in your sessionInfo() below,
why should we (or you) know if some of the things you did
before or (implicitly) during loading all these packages did not
change the RNG kind ?

    > We set the seed using (8 digit) unique IDs generated by an
    > upstream system, before calling `runif`:

    >     seeds = c( "86548915", "86551615", "86566163",
    > "86577411", "86584144", "86584272", "86620568",
    > "86724613", "86756002", "86768593", "86772411",
    > "86781516", "86794389", "86805854", "86814600",
    > "86835092", "86874179", "86876466", "86901193",
    > "86987847", "86988080")

    >  random_values = sapply(seeds, function(x) {
    >   set.seed(x)
    >   y = runif(1, 17, 26)
    >   return(y)
    > })

Why do you do that?

1) You should set the seed *once*, not multiple times in one simulation.

2) Assuming that your strings are correctly translated to integers
   and the same on all platforms, independent of locales (!) etc,
   you are again not following the simple instruction on the help page:

     ‘set.seed’ uses a single integer argument to set as many seeds as
     are required.  It is intended as a simple way to get quite
     different seeds by specifying small integer arguments, and also as
     .....
     .....

Note:   ** small ** integer
Why do you assume   86901193  to be a small integer ?

    > This gives values that are **extremely** bunched together.

    >> summary(random_values)
    >        Min. 1st Qu.  Median Mean 3rd Qu.  Max.  25.13
    > 25.36 25.66 25.58 25.83 25.94

    > This behaviour of `runif` goes away when we use `kind =
    > "Knuth-TAOCP-2002"`, and we get values that appear to be
    > much more evenly spread out.

    >     random_values = sapply(seeds, function(x) {
    > set.seed(x, kind = "Knuth-TAOCP-2002") y = runif(1, 17,
    > 26) return(y) })

    > *Output omitted.*

    > ---

    > **The most interesting thing here is that this does not
    > happen on Windows -- only happens on Ubuntu**
    > (`sessionInfo` output for Ubuntu & Windows below).

    > # Windows output: #

    >> seeds = c(
    >     + "86548915", "86551615", "86566163", "86577411",
    > "86584144", + "86584272", "86620568", "86724613",
    > "86756002", "86768593", "86772411", + "86781516",
    > "86794389", "86805854", "86814600", "86835092",
    > "86874179", + "86876466", "86901193", "86987847",
    > "86988080")
    >>
    >> random_values = sapply(seeds, function(x) {
    >     + set.seed(x) + y = runif(1, 17, 26) + return(y) + })
    >>
    >> summary(random_values)
    >        Min. 1st Qu.  Median Mean 3rd Qu.  Max.  17.32
    > 20.14 23.00 22.17 24.07 25.90

    > Can someone help understand what is going on?

    > Ubuntu
    > ------

    > R version 3.4.0 (2017-04-21)
    > Platform: x86_64-pc-linux-gnu (64-bit)
    > Running under: Ubuntu 16.04.2 LTS

You have not learned to get a current version of R.
===> You should not write to R-devel (sorry if this may sound harsh ..)

Hint:
   We know that  Ubuntu LTS -- by its virtue of LTS (Long Time
   Support) will not update R.
   But the Ubuntu/Debian pages on CRAN tell you how to ensure to
   automatically get current versions of R on your ubuntu-run computer
   (Namely by adding a CRAN mirror to your ubuntu sources)

And then in your sessionInfo :

    ....
       38 packages attached + 56 namespaces loaded !!
    ....

   and similar nonsense (tons of packages+namespaces)
   on Windows which uses an even more outdated version of
   R 3.3.2.

-------------

Can you please learn to work with a minimal reproducible example MRE
(well you are close in your R code, but not if you load 50
 packages and do how-knows-what before running the example,
 you RNGkind() and many other things could have been changed ...)

Since you run ubuntu, you know the shell and you could
(after installing a current version of R) put your MRE in a
small *.R script and do

   R CMD BATCH --vanilla  MRE.R

which will produce MRE.Rout  with all input/output

BTW: Even on Windoze you can do similarly, once you've found the
location of 'Rcmd.exe':

   ......\Rcmd BATCH --vanilla MRE.R

should work there as well and deliver MRE.Rout

- - - - -
After doing all this, your problem may still be just
because you are using much too large integers for the 'seed'
argument of set.seed()

I really really strongly believe you should have used R-help
instead of R-devel.

Best,
Martin Maechler

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Lukas Stadler
If I interpret the original message as “I think there’s something wrong with R's random number generator”:
Your assumption is that going from the seed to the first random number is a good hash function, which it isn’t.
E.g., with Mersenne Twister it’s a couple of multiplications, bit shifts, xors and ands, and the few bits that vary in your seed end up in the less significant bits of the result.
Something like the “digest” package might be what you want, it provides proper hash functions.

- Lukas

> On 3 Nov 2017, at 10:39, Martin Maechler <[hidden email]> wrote:
>
>>>>>> Tirthankar Chakravarty <[hidden email]>
>>>>>>    on Fri, 3 Nov 2017 13:19:12 +0530 writes:
>
>> This is cross-posted from SO
>> (https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_q_47079702_1414455&d=DwIGaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=sySSOv_y4gUrdhItlSw7q2z3RRR8JsPrnS8RhIHA9W4&m=mDEuT7697Im9mtm3dqOQF3Abpcn1ZsA1E_sZE-PZIGg&s=qm177vnypIq1tc3Km5gwocAEmlwieB9pD5jkClG0I-U&e=), but I now
>> feel that this needs someone from R-Devel to help
>> understand why this is happening.
>
> Why R-devel -- R-help would have been appropriate:
>
> It seems you have not read the help page for
> set.seed as I expect it from posters to R-devel.
> Why would you use strings instead of integers if you *had* read it ?
>
>> We are facing a weird situation in our code when using R's
>> [`runif`][1] and setting seed with `set.seed` with the
>> `kind = NULL` option (which resolves, unless I am
>> mistaken, to `kind = "default"`; the default being
>> `"Mersenne-Twister"`).
>
> again this is not what the help page says; rather
>
> | The use of ‘kind = NULL’ or ‘normal.kind = NULL’ in ‘RNGkind’ or
> | ‘set.seed’ selects the currently-used generator (including that
> | used in the previous session if the workspace has been restored):
> | if no generator has been used it selects ‘"default"’.
>
> but as you have > 90 (!!) packages in your sessionInfo() below,
> why should we (or you) know if some of the things you did
> before or (implicitly) during loading all these packages did not
> change the RNG kind ?
>
>> We set the seed using (8 digit) unique IDs generated by an
>> upstream system, before calling `runif`:
>
>>    seeds = c( "86548915", "86551615", "86566163",
>> "86577411", "86584144", "86584272", "86620568",
>> "86724613", "86756002", "86768593", "86772411",
>> "86781516", "86794389", "86805854", "86814600",
>> "86835092", "86874179", "86876466", "86901193",
>> "86987847", "86988080")
>
>> random_values = sapply(seeds, function(x) {
>>  set.seed(x)
>>  y = runif(1, 17, 26)
>>  return(y)
>> })
>
> Why do you do that?
>
> 1) You should set the seed *once*, not multiple times in one simulation.
>
> 2) Assuming that your strings are correctly translated to integers
>   and the same on all platforms, independent of locales (!) etc,
>   you are again not following the simple instruction on the help page:
>
>     ‘set.seed’ uses a single integer argument to set as many seeds as
>     are required.  It is intended as a simple way to get quite
>     different seeds by specifying small integer arguments, and also as
>     .....
>     .....
>
> Note:   ** small ** integer
> Why do you assume   86901193  to be a small integer ?
>
>> This gives values that are **extremely** bunched together.
>
>>> summary(random_values)
>>       Min. 1st Qu.  Median Mean 3rd Qu.  Max.  25.13
>> 25.36 25.66 25.58 25.83 25.94
>
>> This behaviour of `runif` goes away when we use `kind =
>> "Knuth-TAOCP-2002"`, and we get values that appear to be
>> much more evenly spread out.
>
>>    random_values = sapply(seeds, function(x) {
>> set.seed(x, kind = "Knuth-TAOCP-2002") y = runif(1, 17,
>> 26) return(y) })
>
>> *Output omitted.*
>
>> ---
>
>> **The most interesting thing here is that this does not
>> happen on Windows -- only happens on Ubuntu**
>> (`sessionInfo` output for Ubuntu & Windows below).
>
>> # Windows output: #
>
>>> seeds = c(
>>    + "86548915", "86551615", "86566163", "86577411",
>> "86584144", + "86584272", "86620568", "86724613",
>> "86756002", "86768593", "86772411", + "86781516",
>> "86794389", "86805854", "86814600", "86835092",
>> "86874179", + "86876466", "86901193", "86987847",
>> "86988080")
>>>
>>> random_values = sapply(seeds, function(x) {
>>    + set.seed(x) + y = runif(1, 17, 26) + return(y) + })
>>>
>>> summary(random_values)
>>       Min. 1st Qu.  Median Mean 3rd Qu.  Max.  17.32
>> 20.14 23.00 22.17 24.07 25.90
>
>> Can someone help understand what is going on?
>
>> Ubuntu
>> ------
>
>> R version 3.4.0 (2017-04-21)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 16.04.2 LTS
>
> You have not learned to get a current version of R.
> ===> You should not write to R-devel (sorry if this may sound harsh ..)
>
> Hint:
>   We know that  Ubuntu LTS -- by its virtue of LTS (Long Time
>   Support) will not update R.
>   But the Ubuntu/Debian pages on CRAN tell you how to ensure to
>   automatically get current versions of R on your ubuntu-run computer
>   (Namely by adding a CRAN mirror to your ubuntu sources)
>
> And then in your sessionInfo :
>
>    ....
>       38 packages attached + 56 namespaces loaded !!
>    ....
>
>   and similar nonsense (tons of packages+namespaces)
>   on Windows which uses an even more outdated version of
>   R 3.3.2.
>
> -------------
>
> Can you please learn to work with a minimal reproducible example MRE
> (well you are close in your R code, but not if you load 50
> packages and do how-knows-what before running the example,
> you RNGkind() and many other things could have been changed ...)
>
> Since you run ubuntu, you know the shell and you could
> (after installing a current version of R) put your MRE in a
> small *.R script and do
>
>   R CMD BATCH --vanilla  MRE.R
>
> which will produce MRE.Rout  with all input/output
>
> BTW: Even on Windoze you can do similarly, once you've found the
> location of 'Rcmd.exe':
>
>   ......\Rcmd BATCH --vanilla MRE.R
>
> should work there as well and deliver MRE.Rout
>
> - - - - -
> After doing all this, your problem may still be just
> because you are using much too large integers for the 'seed'
> argument of set.seed()
>
> I really really strongly believe you should have used R-help
> instead of R-devel.
>
> Best,
> Martin Maechler
>
> ______________________________________________
> [hidden email] mailing list
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_r-2Ddevel&d=DwIGaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=sySSOv_y4gUrdhItlSw7q2z3RRR8JsPrnS8RhIHA9W4&m=mDEuT7697Im9mtm3dqOQF3Abpcn1ZsA1E_sZE-PZIGg&s=ua3fUgGQ4bG_ImAKJ-_AHRdtFz0xtqvoA--cKTvFI1Q&e=

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Tirthankar Chakravarty-2
In reply to this post by Martin Maechler
Martin,

Thanks for the helpful reply. Alas I had forgotten that (implied)
unfavorable comparisons of *nix systems with Windows systems would likely
draw irate (but always substantive) responses on the R-devel list -- poor
phrasing on my part. :)

Regardless, let me try to address some of the concerns related to the
construction of the MRE itself and try to see if we can clean away the
shrubbery & zero down on the core issue, since I continue to believe that
this is an issue with either R's implementation or a bad interaction of the
seeds supplied with the Mersenne-Twister algorithm itself. The latter would
require a deeper understanding of the algorithm than I have at the moment.
If we can rule out the former through this thread, then I will pursue the
latter solution path.

Responses inline below, but summarizing:

1. All examples now are run using "R CMD BATCH --vanilla" as you have
suggested, to ensure that no other loaded packages or namespace changes
have interfered with the behaviour of `set.seed`.
2. Converting the character vector to integer vector has no impact on the
output.
3. Upgrading to the latest version of R has no impact on the output.
4. Multiplying the seed vector by 10L causes the behaviour to vanish,
calling into question the large integer theory.


On Fri, Nov 3, 2017 at 3:09 PM, Martin Maechler <[hidden email]>
wrote:

> Why R-devel -- R-help would have been appropriate:
>

> It seems you have not read the help page for
> set.seed as I expect it from posters to R-devel.
> Why would you use strings instead of integers if you *had* read it ?
>

The manual (which we did read) says:

seed a single value, interpreted as an integer,

We were confident of R coercing characters to integers correctly. We
tested, prior to making this posting that the behaviour remains intact if
we change the `seeds` variable from a character vector to the "equivalent"
integer vector by hand.

> seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L,
86584272L,
+   86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
+   86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
+   86901193L, 86987847L, 86988080L)
>
> random_values = sapply(seeds, function(x) {
+   set.seed(x)
+   y = runif(1, 17, 26)
+   return(y)
+ })
>
> summary(random_values)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  25.13   25.36   25.66   25.58   25.83   25.94



>     > We are facing a weird situation in our code when using R's
>     > [`runif`][1] and setting seed with `set.seed` with the
>     > `kind = NULL` option (which resolves, unless I am
>     > mistaken, to `kind = "default"`; the default being
>     > `"Mersenne-Twister"`).
>
> again this is not what the help page says; rather
>
>  | The use of ‘kind = NULL’ or ‘normal.kind = NULL’ in ‘RNGkind’ or
>  | ‘set.seed’ selects the currently-used generator (including that
>  | used in the previous session if the workspace has been restored):
>  | if no generator has been used it selects ‘"default"’.
>
> but as you have > 90 (!!) packages in your sessionInfo() below,
> why should we (or you) know if some of the things you did
> before or (implicitly) during loading all these packages did not
> change the RNG kind ?
>

Agreed. We are running this system in production, and we will need
`set.seed` to behave reliably with this session, however, as you say, we
are claiming that there is an issue with the PRNG, so should isolate to an
environment that does not have any of the attendant potential confounding
factors that come with having 90 packages loaded (did you count?).

As mentioned above, we have rerun all examples using "R CMD BATCH
--vanilla" and we can report that the output is unchanged.


>
>     > We set the seed using (8 digit) unique IDs generated by an
>     > upstream system, before calling `runif`:
>
>     >     seeds = c( "86548915", "86551615", "86566163",
>     > "86577411", "86584144", "86584272", "86620568",
>     > "86724613", "86756002", "86768593", "86772411",
>     > "86781516", "86794389", "86805854", "86814600",
>     > "86835092", "86874179", "86876466", "86901193",
>     > "86987847", "86988080")
>
>     >  random_values = sapply(seeds, function(x) {
>     >   set.seed(x)
>     >   y = runif(1, 17, 26)
>     >   return(y)
>     > })
>
> Why do you do that?
>
> 1) You should set the seed *once*, not multiple times in one simulation.
>

This code is written like this since this seed is set every time the
function (API) is called for call-level replicability. It doesn't make a
lot of sense in an MRE, but this is a critical component of the larger
function. We do acknowledge that for any one of the seeds in the vector
`seeds` the vector of draws appears to have the uniform distribution.


> 2) Assuming that your strings are correctly translated to integers
>    and the same on all platforms, independent of locales (!) etc,
>    you are again not following the simple instruction on the help page:
>
>      ‘set.seed’ uses a single integer argument to set as many seeds as
>      are required.  It is intended as a simple way to get quite
>      different seeds by specifying small integer arguments, and also as
>      .....
>      .....
>
> Note:   ** small ** integer
> Why do you assume   86901193  to be a small integer ?
>

Because 86901193/2^32 = 0.02. What is a "small integer"?


>
>     > This gives values that are **extremely** bunched together.
>
>     >> summary(random_values)
>     >        Min. 1st Qu.  Median Mean 3rd Qu.  Max.  25.13
>     > 25.36 25.66 25.58 25.83 25.94
>
>     > This behaviour of `runif` goes away when we use `kind =
>     > "Knuth-TAOCP-2002"`, and we get values that appear to be
>     > much more evenly spread out.
>
>     >     random_values = sapply(seeds, function(x) {
>     > set.seed(x, kind = "Knuth-TAOCP-2002") y = runif(1, 17,
>     > 26) return(y) })
>
>     > *Output omitted.*
>
>     > ---
>
>     > **The most interesting thing here is that this does not
>     > happen on Windows -- only happens on Ubuntu**
>     > (`sessionInfo` output for Ubuntu & Windows below).
>
>     > # Windows output: #
>
>     >> seeds = c(
>     >     + "86548915", "86551615", "86566163", "86577411",
>     > "86584144", + "86584272", "86620568", "86724613",
>     > "86756002", "86768593", "86772411", + "86781516",
>     > "86794389", "86805854", "86814600", "86835092",
>     > "86874179", + "86876466", "86901193", "86987847",
>     > "86988080")
>     >>
>     >> random_values = sapply(seeds, function(x) {
>     >     + set.seed(x) + y = runif(1, 17, 26) + return(y) + })
>     >>
>     >> summary(random_values)
>     >        Min. 1st Qu.  Median Mean 3rd Qu.  Max.  17.32
>     > 20.14 23.00 22.17 24.07 25.90
>
>     > Can someone help understand what is going on?
>
>     > Ubuntu
>     > ------
>
>     > R version 3.4.0 (2017-04-21)
>     > Platform: x86_64-pc-linux-gnu (64-bit)
>     > Running under: Ubuntu 16.04.2 LTS
>
> You have not learned to get a current version of R.
> ===> You should not write to R-devel (sorry if this may sound harsh ..)
>

We do spend a while on certain versions of R since upgrading our systems in
production is not something we are able to do frequently & this version is
only 6 months old. However, addressing your concern, upgrading to R 3.4.2
leaves the output unchanged.


> - - - - -
> After doing all this, your problem may still be just
> because you are using much too large integers for the 'seed'
> argument of set.seed()
>

Note that multiplying the reported set of seeds by 10, results in expected
output, so not clear if there is a sweet spot that bugs out the
Mersenne-Twister algorithm:

seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L, 86584272L,
  86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
  86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
  86901193L, 86987847L, 86988080L)*10

random_values = sapply(seeds, function(x) {
  set.seed(x)
  y = runif(1, 17, 26)
  return(y)
})

summary(random_values)



> I really really strongly believe you should have used R-help
> instead of R-devel.
>
> Best,
> Martin Maechler
>

If you continue to believe with the inputs given in this reply that this
should be on R-help, we will switch over.

Your continued help would be appreciated in understanding the issue.

T

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Serguei Sokol
Le 03/11/2017 à 14:24, Tirthankar Chakravarty a écrit :

> Martin,
>
> Thanks for the helpful reply. Alas I had forgotten that (implied)
> unfavorable comparisons of *nix systems with Windows systems would likely
> draw irate (but always substantive) responses on the R-devel list -- poor
> phrasing on my part. :)
>
> Regardless, let me try to address some of the concerns related to the
> construction of the MRE itself and try to see if we can clean away the
> shrubbery & zero down on the core issue, since I continue to believe that
> this is an issue with either R's implementation or a bad interaction of the
> seeds supplied with the Mersenne-Twister algorithm itself.
Is there an issue or not may depend on how the vector 'seeds' was obtained.
If we simply do:

r=range(seeds)
s=seq(r[1], r[2])
# pick up seeds giving the runif() in (25; 26) interval
s25=s[sapply(s, function(i) {set.seed(i); runif(1, 17, 26) > 25})]
all(seeds %in% s25) # TRUE
length(s25)/diff(r) # 0.1107351

Thus, the proportion of such seeds is about 1/9 which is coherent with
the fraction of the interval (25; 26) in (17; 26).
Now, you can pick up any 21 numbers from s25 vector (which is 48631 long) and say
"Look! It's weird, all values drawn by runif() are > 25!"
But s25 has nothing strange by itself. If we plot kind of cumulative distribution

plot(s25, type="l")

It shows a distribution very close to uniform which means that such seeds
are not grouped more densely or rarely somewhere.
So, how your set of seeds was obtained?

Best,
Serguei.

>   The latter would
> require a deeper understanding of the algorithm than I have at the moment.
> If we can rule out the former through this thread, then I will pursue the
> latter solution path.
>
> Responses inline below, but summarizing:
>
> 1. All examples now are run using "R CMD BATCH --vanilla" as you have
> suggested, to ensure that no other loaded packages or namespace changes
> have interfered with the behaviour of `set.seed`.
> 2. Converting the character vector to integer vector has no impact on the
> output.
> 3. Upgrading to the latest version of R has no impact on the output.
> 4. Multiplying the seed vector by 10L causes the behaviour to vanish,
> calling into question the large integer theory.
>
>
> On Fri, Nov 3, 2017 at 3:09 PM, Martin Maechler <[hidden email]>
> wrote:
>
>> Why R-devel -- R-help would have been appropriate:
>>
>> It seems you have not read the help page for
>> set.seed as I expect it from posters to R-devel.
>> Why would you use strings instead of integers if you *had* read it ?
>>
> The manual (which we did read) says:
>
> seed a single value, interpreted as an integer,
>
> We were confident of R coercing characters to integers correctly. We
> tested, prior to making this posting that the behaviour remains intact if
> we change the `seeds` variable from a character vector to the "equivalent"
> integer vector by hand.
>
>> seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L,
> 86584272L,
> +   86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
> +   86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
> +   86901193L, 86987847L, 86988080L)
>> random_values = sapply(seeds, function(x) {
> +   set.seed(x)
> +   y = runif(1, 17, 26)
> +   return(y)
> + })
>> summary(random_values)
>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>    25.13   25.36   25.66   25.58   25.83   25.94
>
>
>
>>      > We are facing a weird situation in our code when using R's
>>      > [`runif`][1] and setting seed with `set.seed` with the
>>      > `kind = NULL` option (which resolves, unless I am
>>      > mistaken, to `kind = "default"`; the default being
>>      > `"Mersenne-Twister"`).
>>
>> again this is not what the help page says; rather
>>
>>   | The use of ‘kind = NULL’ or ‘normal.kind = NULL’ in ‘RNGkind’ or
>>   | ‘set.seed’ selects the currently-used generator (including that
>>   | used in the previous session if the workspace has been restored):
>>   | if no generator has been used it selects ‘"default"’.
>>
>> but as you have > 90 (!!) packages in your sessionInfo() below,
>> why should we (or you) know if some of the things you did
>> before or (implicitly) during loading all these packages did not
>> change the RNG kind ?
>>
> Agreed. We are running this system in production, and we will need
> `set.seed` to behave reliably with this session, however, as you say, we
> are claiming that there is an issue with the PRNG, so should isolate to an
> environment that does not have any of the attendant potential confounding
> factors that come with having 90 packages loaded (did you count?).
>
> As mentioned above, we have rerun all examples using "R CMD BATCH
> --vanilla" and we can report that the output is unchanged.
>
>
>>      > We set the seed using (8 digit) unique IDs generated by an
>>      > upstream system, before calling `runif`:
>>
>>      >     seeds = c( "86548915", "86551615", "86566163",
>>      > "86577411", "86584144", "86584272", "86620568",
>>      > "86724613", "86756002", "86768593", "86772411",
>>      > "86781516", "86794389", "86805854", "86814600",
>>      > "86835092", "86874179", "86876466", "86901193",
>>      > "86987847", "86988080")
>>
>>      >  random_values = sapply(seeds, function(x) {
>>      >   set.seed(x)
>>      >   y = runif(1, 17, 26)
>>      >   return(y)
>>      > })
>>
>> Why do you do that?
>>
>> 1) You should set the seed *once*, not multiple times in one simulation.
>>
> This code is written like this since this seed is set every time the
> function (API) is called for call-level replicability. It doesn't make a
> lot of sense in an MRE, but this is a critical component of the larger
> function. We do acknowledge that for any one of the seeds in the vector
> `seeds` the vector of draws appears to have the uniform distribution.
>
>
>> 2) Assuming that your strings are correctly translated to integers
>>     and the same on all platforms, independent of locales (!) etc,
>>     you are again not following the simple instruction on the help page:
>>
>>       ‘set.seed’ uses a single integer argument to set as many seeds as
>>       are required.  It is intended as a simple way to get quite
>>       different seeds by specifying small integer arguments, and also as
>>       .....
>>       .....
>>
>> Note:   ** small ** integer
>> Why do you assume   86901193  to be a small integer ?
>>
> Because 86901193/2^32 = 0.02. What is a "small integer"?
>
>
>>      > This gives values that are **extremely** bunched together.
>>
>>      >> summary(random_values)
>>      >        Min. 1st Qu.  Median Mean 3rd Qu.  Max.  25.13
>>      > 25.36 25.66 25.58 25.83 25.94
>>
>>      > This behaviour of `runif` goes away when we use `kind =
>>      > "Knuth-TAOCP-2002"`, and we get values that appear to be
>>      > much more evenly spread out.
>>
>>      >     random_values = sapply(seeds, function(x) {
>>      > set.seed(x, kind = "Knuth-TAOCP-2002") y = runif(1, 17,
>>      > 26) return(y) })
>>
>>      > *Output omitted.*
>>
>>      > ---
>>
>>      > **The most interesting thing here is that this does not
>>      > happen on Windows -- only happens on Ubuntu**
>>      > (`sessionInfo` output for Ubuntu & Windows below).
>>
>>      > # Windows output: #
>>
>>      >> seeds = c(
>>      >     + "86548915", "86551615", "86566163", "86577411",
>>      > "86584144", + "86584272", "86620568", "86724613",
>>      > "86756002", "86768593", "86772411", + "86781516",
>>      > "86794389", "86805854", "86814600", "86835092",
>>      > "86874179", + "86876466", "86901193", "86987847",
>>      > "86988080")
>>      >>
>>      >> random_values = sapply(seeds, function(x) {
>>      >     + set.seed(x) + y = runif(1, 17, 26) + return(y) + })
>>      >>
>>      >> summary(random_values)
>>      >        Min. 1st Qu.  Median Mean 3rd Qu.  Max.  17.32
>>      > 20.14 23.00 22.17 24.07 25.90
>>
>>      > Can someone help understand what is going on?
>>
>>      > Ubuntu
>>      > ------
>>
>>      > R version 3.4.0 (2017-04-21)
>>      > Platform: x86_64-pc-linux-gnu (64-bit)
>>      > Running under: Ubuntu 16.04.2 LTS
>>
>> You have not learned to get a current version of R.
>> ===> You should not write to R-devel (sorry if this may sound harsh ..)
>>
> We do spend a while on certain versions of R since upgrading our systems in
> production is not something we are able to do frequently & this version is
> only 6 months old. However, addressing your concern, upgrading to R 3.4.2
> leaves the output unchanged.
>
>
>> - - - - -
>> After doing all this, your problem may still be just
>> because you are using much too large integers for the 'seed'
>> argument of set.seed()
>>
> Note that multiplying the reported set of seeds by 10, results in expected
> output, so not clear if there is a sweet spot that bugs out the
> Mersenne-Twister algorithm:
>
> seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L, 86584272L,
>    86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
>    86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
>    86901193L, 86987847L, 86988080L)*10
>
> random_values = sapply(seeds, function(x) {
>    set.seed(x)
>    y = runif(1, 17, 26)
>    return(y)
> })
>
> summary(random_values)
>
>
>
>> I really really strongly believe you should have used R-help
>> instead of R-devel.
>>
>> Best,
>> Martin Maechler
>>
> If you continue to believe with the inputs given in this reply that this
> should be on R-help, we will switch over.
>
> Your continued help would be appreciated in understanding the issue.
>
> T
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


--
Serguei Sokol
Ingenieur de recherche INRA

Cellule mathématique
LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04

tel: +33 5 6155 9849
email: [hidden email]
http://www.lisbp.fr

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

R devel mailing list
In reply to this post by Tirthankar Chakravarty-2
The random numbers in a stream initialized with one seed should have about
the desired distribution.  You don't win by changing the seed all the
time.  Your seeds caused the first numbers of a bunch of streams to be
about the same, but the second and subsequent entries in each stream do
look uniformly distributed.

You didn't say what your 'upstream process' was, but it is easy to come up
with seeds that give about the same first value:

> Filter(function(s){set.seed(s);runif(1,17,26)>25.99}, 1:10000)
 [1]  514  532 1951 2631 3974 4068 4229 6092 6432 7264 9090



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, Nov 3, 2017 at 12:49 AM, Tirthankar Chakravarty <
[hidden email]> wrote:

> This is cross-posted from SO (https://stackoverflow.com/q/47079702/1414455
> ),
> but I now feel that this needs someone from R-Devel to help understand why
> this is happening.
>
> We are facing a weird situation in our code when using R's [`runif`][1] and
> setting seed with `set.seed` with the `kind = NULL` option (which resolves,
> unless I am mistaken, to `kind = "default"`; the default being
> `"Mersenne-Twister"`).
>
> We set the seed using (8 digit) unique IDs generated by an upstream system,
> before calling `runif`:
>
>     seeds = c(
>       "86548915", "86551615", "86566163", "86577411", "86584144",
>       "86584272", "86620568", "86724613", "86756002", "86768593",
> "86772411",
>       "86781516", "86794389", "86805854", "86814600", "86835092",
> "86874179",
>       "86876466", "86901193", "86987847", "86988080")
>
>     random_values = sapply(seeds, function(x) {
>       set.seed(x)
>       y = runif(1, 17, 26)
>       return(y)
>     })
>
> This gives values that are **extremely** bunched together.
>
>     > summary(random_values)
>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>       25.13   25.36   25.66   25.58   25.83   25.94
>
> This behaviour of `runif` goes away when we use `kind =
> "Knuth-TAOCP-2002"`, and we get values that appear to be much more evenly
> spread out.
>
>     random_values = sapply(seeds, function(x) {
>       set.seed(x, kind = "Knuth-TAOCP-2002")
>       y = runif(1, 17, 26)
>       return(y)
>     })
>
> *Output omitted.*
>
> ---
>
> **The most interesting thing here is that this does not happen on Windows
> -- only happens on Ubuntu** (`sessionInfo` output for Ubuntu & Windows
> below).
>
> # Windows output: #
>
>     > seeds = c(
>     +   "86548915", "86551615", "86566163", "86577411", "86584144",
>     +   "86584272", "86620568", "86724613", "86756002", "86768593",
> "86772411",
>     +   "86781516", "86794389", "86805854", "86814600", "86835092",
> "86874179",
>     +   "86876466", "86901193", "86987847", "86988080")
>     >
>     > random_values = sapply(seeds, function(x) {
>     +   set.seed(x)
>     +   y = runif(1, 17, 26)
>     +   return(y)
>     + })
>     >
>     > summary(random_values)
>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>       17.32   20.14   23.00   22.17   24.07   25.90
>
> Can someone help understand what is going on?
>
> Ubuntu
> ------
>
>     R version 3.4.0 (2017-04-21)
>     Platform: x86_64-pc-linux-gnu (64-bit)
>     Running under: Ubuntu 16.04.2 LTS
>
>     Matrix products: default
>     BLAS: /usr/lib/libblas/libblas.so.3.6.0
>     LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>
>     locale:
>     [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C
>      [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8
>      [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
>      [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8
>      [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8
>     [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
>
>     attached base packages:
>     [1] parallel  stats     graphics  grDevices utils     datasets
> methods   base
>
>     other attached packages:
>     [1] RMySQL_0.10.8               DBI_0.6-1
>      [3] jsonlite_1.4                tidyjson_0.2.2
>      [5] optiRum_0.37.3              lubridate_1.6.0
>      [7] httr_1.2.1                  gdata_2.18.0
>      [9] XLConnect_0.2-12            XLConnectJars_0.2-12
>     [11] data.table_1.10.4           stringr_1.2.0
>     [13] readxl_1.0.0                xlsx_0.5.7
>     [15] xlsxjars_0.6.1              rJava_0.9-8
>     [17] sqldf_0.4-10                RSQLite_1.1-2
>     [19] gsubfn_0.6-6                proto_1.0.0
>     [21] dplyr_0.5.0                 purrr_0.2.4
>     [23] readr_1.1.1                 tidyr_0.6.3
>     [25] tibble_1.3.0                tidyverse_1.1.1
>     [27] rBayesianOptimization_1.1.0 xgboost_0.6-4
>     [29] MLmetrics_1.1.1             caret_6.0-76
>     [31] ROCR_1.0-7                  gplots_3.0.1
>     [33] effects_3.1-2               pROC_1.10.0
>     [35] pscl_1.4.9                  lattice_0.20-35
>     [37] MASS_7.3-47                 ggplot2_2.2.1
>
>     loaded via a namespace (and not attached):
>     [1] splines_3.4.0      foreach_1.4.3      AUC_0.3.0
> modelr_0.1.0
>      [5] gtools_3.5.0       assertthat_0.2.0   stats4_3.4.0
>  cellranger_1.1.0
>      [9] quantreg_5.33      chron_2.3-50       digest_0.6.10
> rvest_0.3.2
>     [13] minqa_1.2.4        colorspace_1.3-2   Matrix_1.2-10
> plyr_1.8.4
>     [17] psych_1.7.3.21     XML_3.98-1.7       broom_0.4.2
> SparseM_1.77
>     [21] haven_1.0.0        scales_0.4.1       lme4_1.1-13
> MatrixModels_0.4-1
>     [25] mgcv_1.8-17        car_2.1-5          nnet_7.3-12
> lazyeval_0.2.0
>     [29] pbkrtest_0.4-7     mnormt_1.5-5       magrittr_1.5
>  memoise_1.0.0
>     [33] nlme_3.1-131       forcats_0.2.0      xml2_1.1.1
>  foreign_0.8-69
>     [37] tools_3.4.0        hms_0.3            munsell_0.4.3
> compiler_3.4.0
>     [41] caTools_1.17.1     rlang_0.1.1        grid_3.4.0
>  nloptr_1.0.4
>     [45] iterators_1.0.8    bitops_1.0-6       tcltk_3.4.0
> gtable_0.2.0
>     [49] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2     R6_2.2.0
>
>     [53] knitr_1.15.1       KernSmooth_2.23-15 stringi_1.1.5
> Rcpp_0.12.11
>
>
>
> Windows
> -------
>
>     > sessionInfo()
>     R version 3.3.2 (2016-10-31)
>     Platform: x86_64-w64-mingw32/x64 (64-bit)
>     Running under: Windows >= 8 x64 (build 9200)
>
>     locale:
>     [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
> LC_MONETARY=English_India.1252
>     [4] LC_NUMERIC=C                   LC_TIME=English_India.1252
>
>     attached base packages:
>     [1] graphics  grDevices utils     datasets  grid      stats
>  methods   base
>
>     other attached packages:
>      [1] bindrcpp_0.2         h2o_3.14.0.3         ggrepel_0.6.5
> eulerr_1.1.0         VennDiagram_1.6.17
>      [6] futile.logger_1.4.3  scales_0.4.1         FinCal_0.6.3
>  xml2_1.0.0           httr_1.3.0
>     [11] wesanderson_0.3.2    wordcloud_2.5        RColorBrewer_1.1-2
>  htmltools_0.3.6      urltools_1.6.0
>     [16] timevis_0.4          dtplyr_0.0.1         magrittr_1.5
>  shiny_1.0.5          RODBC_1.3-14
>     [21] zoo_1.8-0            sqldf_0.4-10         RSQLite_1.1-2
> gsubfn_0.6-6         proto_1.0.0
>     [26] gdata_2.17.0         stringr_1.2.0        XLConnect_0.2-12
>  XLConnectJars_0.2-12 data.table_1.10.4
>     [31] xlsx_0.5.7           xlsxjars_0.6.1       rJava_0.9-8
> readxl_0.1.1         googlesheets_0.2.1
>     [36] jsonlite_1.5         tidyjson_0.2.1       RMySQL_0.10.9
> RPostgreSQL_0.4-1    DBI_0.5-1
>     [41] dplyr_0.7.2          purrr_0.2.3          readr_1.1.1
> tidyr_0.7.0          tibble_1.3.3
>     [46] ggplot2_2.2.0        tidyverse_1.0.0      lubridate_1.6.0
>
>     loaded via a namespace (and not attached):
>      [1] gtools_3.5.0         assertthat_0.2.0     triebeard_0.3.0
> cellranger_1.1.0     yaml_2.1.14
>      [6] slam_0.1-40          lattice_0.20-34      glue_1.1.1
>  chron_2.3-48         digest_0.6.12.1
>     [11] colorspace_1.3-1     httpuv_1.3.5         plyr_1.8.4
>  pkgconfig_2.0.1      xtable_1.8-2
>     [16] lazyeval_0.2.0       mime_0.5             memoise_1.0.0
> tools_3.3.2          hms_0.3
>     [21] munsell_0.4.3        lambda.r_1.1.9       rlang_0.1.1
> RCurl_1.95-4.8       labeling_0.3
>     [26] bitops_1.0-6         tcltk_3.3.2          gtable_0.2.0
>  reshape2_1.4.2       R6_2.2.0
>     [31] bindr_0.1            futile.options_1.0.0 stringi_1.1.2
> Rcpp_0.12.12.1
>
>   [1]: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/
> Uniform.html
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Tirthankar Chakravarty-2
Bill,

I have clarified this on SO, and I will copy that clarification in here:

"Sure, we tested them on other 8-digit numbers as well & we could not
replicate. However, these are honest-to-goodness numbers generated by a
non-adversarial system that has no conception of these numbers being used
for anything other than a unique key for an entity -- these are not a
specially constructed edge case. Would be good to know what seeds will and
will not work, and why."

These numbers are generated by an application that serves a form, and
associates form IDs in a sequence. The application calls our API depending
on the form values entered by users, which in turn calls our R code that
executes some code that needs an RNG. Since the API has to be stateless, to
be able to replicate the results for possible debugging, we need to draw
random numbers in a way that we can replicate the results of the API
response -- we use the form ID as seeds.

I repeat, there is no design or anything adversarial about the way that
these numbers were generated -- the system generating these numbers and the
users entering inputs have no conception of our use of an RNG -- this is
meant to just be a random sequence of form IDs. This issue was discovered
completely by chance when the output of the API was observed to be highly
non-random. It is possible that it is a 1/10^8 chance, but that is hard to
believe, given that the API hit depends on user input. Note also that the
issue goes away when we use a different RNG as mentioned below.

T

On Fri, Nov 3, 2017 at 9:58 PM, William Dunlap <[hidden email]> wrote:

> The random numbers in a stream initialized with one seed should have about
> the desired distribution.  You don't win by changing the seed all the
> time.  Your seeds caused the first numbers of a bunch of streams to be
> about the same, but the second and subsequent entries in each stream do
> look uniformly distributed.
>
> You didn't say what your 'upstream process' was, but it is easy to come up
> with seeds that give about the same first value:
>
> > Filter(function(s){set.seed(s);runif(1,17,26)>25.99}, 1:10000)
>  [1]  514  532 1951 2631 3974 4068 4229 6092 6432 7264 9090
>
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Fri, Nov 3, 2017 at 12:49 AM, Tirthankar Chakravarty <
> [hidden email]> wrote:
>
>> This is cross-posted from SO (https://stackoverflow.com/q/4
>> 7079702/1414455),
>> but I now feel that this needs someone from R-Devel to help understand why
>> this is happening.
>>
>> We are facing a weird situation in our code when using R's [`runif`][1]
>> and
>> setting seed with `set.seed` with the `kind = NULL` option (which
>> resolves,
>> unless I am mistaken, to `kind = "default"`; the default being
>> `"Mersenne-Twister"`).
>>
>> We set the seed using (8 digit) unique IDs generated by an upstream
>> system,
>> before calling `runif`:
>>
>>     seeds = c(
>>       "86548915", "86551615", "86566163", "86577411", "86584144",
>>       "86584272", "86620568", "86724613", "86756002", "86768593",
>> "86772411",
>>       "86781516", "86794389", "86805854", "86814600", "86835092",
>> "86874179",
>>       "86876466", "86901193", "86987847", "86988080")
>>
>>     random_values = sapply(seeds, function(x) {
>>       set.seed(x)
>>       y = runif(1, 17, 26)
>>       return(y)
>>     })
>>
>> This gives values that are **extremely** bunched together.
>>
>>     > summary(random_values)
>>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>       25.13   25.36   25.66   25.58   25.83   25.94
>>
>> This behaviour of `runif` goes away when we use `kind =
>> "Knuth-TAOCP-2002"`, and we get values that appear to be much more evenly
>> spread out.
>>
>>     random_values = sapply(seeds, function(x) {
>>       set.seed(x, kind = "Knuth-TAOCP-2002")
>>       y = runif(1, 17, 26)
>>       return(y)
>>     })
>>
>> *Output omitted.*
>>
>> ---
>>
>> **The most interesting thing here is that this does not happen on Windows
>> -- only happens on Ubuntu** (`sessionInfo` output for Ubuntu & Windows
>> below).
>>
>> # Windows output: #
>>
>>     > seeds = c(
>>     +   "86548915", "86551615", "86566163", "86577411", "86584144",
>>     +   "86584272", "86620568", "86724613", "86756002", "86768593",
>> "86772411",
>>     +   "86781516", "86794389", "86805854", "86814600", "86835092",
>> "86874179",
>>     +   "86876466", "86901193", "86987847", "86988080")
>>     >
>>     > random_values = sapply(seeds, function(x) {
>>     +   set.seed(x)
>>     +   y = runif(1, 17, 26)
>>     +   return(y)
>>     + })
>>     >
>>     > summary(random_values)
>>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>       17.32   20.14   23.00   22.17   24.07   25.90
>>
>> Can someone help understand what is going on?
>>
>> Ubuntu
>> ------
>>
>>     R version 3.4.0 (2017-04-21)
>>     Platform: x86_64-pc-linux-gnu (64-bit)
>>     Running under: Ubuntu 16.04.2 LTS
>>
>>     Matrix products: default
>>     BLAS: /usr/lib/libblas/libblas.so.3.6.0
>>     LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>>
>>     locale:
>>     [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C
>>      [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8
>>      [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
>>      [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8
>>      [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8
>>     [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
>>
>>     attached base packages:
>>     [1] parallel  stats     graphics  grDevices utils     datasets
>> methods   base
>>
>>     other attached packages:
>>     [1] RMySQL_0.10.8               DBI_0.6-1
>>      [3] jsonlite_1.4                tidyjson_0.2.2
>>      [5] optiRum_0.37.3              lubridate_1.6.0
>>      [7] httr_1.2.1                  gdata_2.18.0
>>      [9] XLConnect_0.2-12            XLConnectJars_0.2-12
>>     [11] data.table_1.10.4           stringr_1.2.0
>>     [13] readxl_1.0.0                xlsx_0.5.7
>>     [15] xlsxjars_0.6.1              rJava_0.9-8
>>     [17] sqldf_0.4-10                RSQLite_1.1-2
>>     [19] gsubfn_0.6-6                proto_1.0.0
>>     [21] dplyr_0.5.0                 purrr_0.2.4
>>     [23] readr_1.1.1                 tidyr_0.6.3
>>     [25] tibble_1.3.0                tidyverse_1.1.1
>>     [27] rBayesianOptimization_1.1.0 xgboost_0.6-4
>>     [29] MLmetrics_1.1.1             caret_6.0-76
>>     [31] ROCR_1.0-7                  gplots_3.0.1
>>     [33] effects_3.1-2               pROC_1.10.0
>>     [35] pscl_1.4.9                  lattice_0.20-35
>>     [37] MASS_7.3-47                 ggplot2_2.2.1
>>
>>     loaded via a namespace (and not attached):
>>     [1] splines_3.4.0      foreach_1.4.3      AUC_0.3.0
>> modelr_0.1.0
>>      [5] gtools_3.5.0       assertthat_0.2.0   stats4_3.4.0
>>  cellranger_1.1.0
>>      [9] quantreg_5.33      chron_2.3-50       digest_0.6.10
>> rvest_0.3.2
>>     [13] minqa_1.2.4        colorspace_1.3-2   Matrix_1.2-10
>> plyr_1.8.4
>>     [17] psych_1.7.3.21     XML_3.98-1.7       broom_0.4.2
>> SparseM_1.77
>>     [21] haven_1.0.0        scales_0.4.1       lme4_1.1-13
>> MatrixModels_0.4-1
>>     [25] mgcv_1.8-17        car_2.1-5          nnet_7.3-12
>> lazyeval_0.2.0
>>     [29] pbkrtest_0.4-7     mnormt_1.5-5       magrittr_1.5
>>  memoise_1.0.0
>>     [33] nlme_3.1-131       forcats_0.2.0      xml2_1.1.1
>>  foreign_0.8-69
>>     [37] tools_3.4.0        hms_0.3            munsell_0.4.3
>> compiler_3.4.0
>>     [41] caTools_1.17.1     rlang_0.1.1        grid_3.4.0
>>  nloptr_1.0.4
>>     [45] iterators_1.0.8    bitops_1.0-6       tcltk_3.4.0
>> gtable_0.2.0
>>     [49] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2     R6_2.2.0
>>
>>     [53] knitr_1.15.1       KernSmooth_2.23-15 stringi_1.1.5
>> Rcpp_0.12.11
>>
>>
>>
>> Windows
>> -------
>>
>>     > sessionInfo()
>>     R version 3.3.2 (2016-10-31)
>>     Platform: x86_64-w64-mingw32/x64 (64-bit)
>>     Running under: Windows >= 8 x64 (build 9200)
>>
>>     locale:
>>     [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
>> LC_MONETARY=English_India.1252
>>     [4] LC_NUMERIC=C                   LC_TIME=English_India.1252
>>
>>     attached base packages:
>>     [1] graphics  grDevices utils     datasets  grid      stats
>>  methods   base
>>
>>     other attached packages:
>>      [1] bindrcpp_0.2         h2o_3.14.0.3         ggrepel_0.6.5
>> eulerr_1.1.0         VennDiagram_1.6.17
>>      [6] futile.logger_1.4.3  scales_0.4.1         FinCal_0.6.3
>>  xml2_1.0.0           httr_1.3.0
>>     [11] wesanderson_0.3.2    wordcloud_2.5        RColorBrewer_1.1-2
>>  htmltools_0.3.6      urltools_1.6.0
>>     [16] timevis_0.4          dtplyr_0.0.1         magrittr_1.5
>>  shiny_1.0.5          RODBC_1.3-14
>>     [21] zoo_1.8-0            sqldf_0.4-10         RSQLite_1.1-2
>> gsubfn_0.6-6         proto_1.0.0
>>     [26] gdata_2.17.0         stringr_1.2.0        XLConnect_0.2-12
>>  XLConnectJars_0.2-12 data.table_1.10.4
>>     [31] xlsx_0.5.7           xlsxjars_0.6.1       rJava_0.9-8
>> readxl_0.1.1         googlesheets_0.2.1
>>     [36] jsonlite_1.5         tidyjson_0.2.1       RMySQL_0.10.9
>> RPostgreSQL_0.4-1    DBI_0.5-1
>>     [41] dplyr_0.7.2          purrr_0.2.3          readr_1.1.1
>> tidyr_0.7.0          tibble_1.3.3
>>     [46] ggplot2_2.2.0        tidyverse_1.0.0      lubridate_1.6.0
>>
>>     loaded via a namespace (and not attached):
>>      [1] gtools_3.5.0         assertthat_0.2.0     triebeard_0.3.0
>> cellranger_1.1.0     yaml_2.1.14
>>      [6] slam_0.1-40          lattice_0.20-34      glue_1.1.1
>>  chron_2.3-48         digest_0.6.12.1
>>     [11] colorspace_1.3-1     httpuv_1.3.5         plyr_1.8.4
>>  pkgconfig_2.0.1      xtable_1.8-2
>>     [16] lazyeval_0.2.0       mime_0.5             memoise_1.0.0
>> tools_3.3.2          hms_0.3
>>     [21] munsell_0.4.3        lambda.r_1.1.9       rlang_0.1.1
>> RCurl_1.95-4.8       labeling_0.3
>>     [26] bitops_1.0-6         tcltk_3.3.2          gtable_0.2.0
>>  reshape2_1.4.2       R6_2.2.0
>>     [31] bindr_0.1            futile.options_1.0.0 stringi_1.1.2
>> Rcpp_0.12.12.1
>>
>>   [1]: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/Unif
>> orm.html
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

R devel mailing list
Another other generator is subject to the same problem with the same
probabilitiy.

> Filter(function(s){set.seed(s,
kind="Knuth-TAOCP-2002");runif(1,17,26)>25.99}, 1:10000)
 [1]  280  415  826 1372 2224 2544 3270 3594 3809 4116 4236 5018 5692 7043
7212 7364 7747 9256 9491 9568 9886



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, Nov 3, 2017 at 10:31 AM, Tirthankar Chakravarty <
[hidden email]> wrote:

>
> Bill,
>
> I have clarified this on SO, and I will copy that clarification in here:
>
> "Sure, we tested them on other 8-digit numbers as well & we could not
> replicate. However, these are honest-to-goodness numbers generated by a
> non-adversarial system that has no conception of these numbers being used
> for anything other than a unique key for an entity -- these are not a
> specially constructed edge case. Would be good to know what seeds will and
> will not work, and why."
>
> These numbers are generated by an application that serves a form, and
> associates form IDs in a sequence. The application calls our API depending
> on the form values entered by users, which in turn calls our R code that
> executes some code that needs an RNG. Since the API has to be stateless, to
> be able to replicate the results for possible debugging, we need to draw
> random numbers in a way that we can replicate the results of the API
> response -- we use the form ID as seeds.
>
> I repeat, there is no design or anything adversarial about the way that
> these numbers were generated -- the system generating these numbers and
> the users entering inputs have no conception of our use of an RNG -- this
> is meant to just be a random sequence of form IDs. This issue was
> discovered completely by chance when the output of the API was observed to
> be highly non-random. It is possible that it is a 1/10^8 chance, but that
> is hard to believe, given that the API hit depends on user input. Note also
> that the issue goes away when we use a different RNG as mentioned below.
>
> T
>
> On Fri, Nov 3, 2017 at 9:58 PM, William Dunlap <[hidden email]> wrote:
>
>> The random numbers in a stream initialized with one seed should have
>> about the desired distribution.  You don't win by changing the seed all the
>> time.  Your seeds caused the first numbers of a bunch of streams to be
>> about the same, but the second and subsequent entries in each stream do
>> look uniformly distributed.
>>
>> You didn't say what your 'upstream process' was, but it is easy to come
>> up with seeds that give about the same first value:
>>
>> > Filter(function(s){set.seed(s);runif(1,17,26)>25.99}, 1:10000)
>>  [1]  514  532 1951 2631 3974 4068 4229 6092 6432 7264 9090
>>
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Fri, Nov 3, 2017 at 12:49 AM, Tirthankar Chakravarty <
>> [hidden email]> wrote:
>>
>>> This is cross-posted from SO (https://stackoverflow.com/q/4
>>> 7079702/1414455),
>>> but I now feel that this needs someone from R-Devel to help understand
>>> why
>>> this is happening.
>>>
>>> We are facing a weird situation in our code when using R's [`runif`][1]
>>> and
>>> setting seed with `set.seed` with the `kind = NULL` option (which
>>> resolves,
>>> unless I am mistaken, to `kind = "default"`; the default being
>>> `"Mersenne-Twister"`).
>>>
>>> We set the seed using (8 digit) unique IDs generated by an upstream
>>> system,
>>> before calling `runif`:
>>>
>>>     seeds = c(
>>>       "86548915", "86551615", "86566163", "86577411", "86584144",
>>>       "86584272", "86620568", "86724613", "86756002", "86768593",
>>> "86772411",
>>>       "86781516", "86794389", "86805854", "86814600", "86835092",
>>> "86874179",
>>>       "86876466", "86901193", "86987847", "86988080")
>>>
>>>     random_values = sapply(seeds, function(x) {
>>>       set.seed(x)
>>>       y = runif(1, 17, 26)
>>>       return(y)
>>>     })
>>>
>>> This gives values that are **extremely** bunched together.
>>>
>>>     > summary(random_values)
>>>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>       25.13   25.36   25.66   25.58   25.83   25.94
>>>
>>> This behaviour of `runif` goes away when we use `kind =
>>> "Knuth-TAOCP-2002"`, and we get values that appear to be much more evenly
>>> spread out.
>>>
>>>     random_values = sapply(seeds, function(x) {
>>>       set.seed(x, kind = "Knuth-TAOCP-2002")
>>>       y = runif(1, 17, 26)
>>>       return(y)
>>>     })
>>>
>>> *Output omitted.*
>>>
>>> ---
>>>
>>> **The most interesting thing here is that this does not happen on Windows
>>> -- only happens on Ubuntu** (`sessionInfo` output for Ubuntu & Windows
>>> below).
>>>
>>> # Windows output: #
>>>
>>>     > seeds = c(
>>>     +   "86548915", "86551615", "86566163", "86577411", "86584144",
>>>     +   "86584272", "86620568", "86724613", "86756002", "86768593",
>>> "86772411",
>>>     +   "86781516", "86794389", "86805854", "86814600", "86835092",
>>> "86874179",
>>>     +   "86876466", "86901193", "86987847", "86988080")
>>>     >
>>>     > random_values = sapply(seeds, function(x) {
>>>     +   set.seed(x)
>>>     +   y = runif(1, 17, 26)
>>>     +   return(y)
>>>     + })
>>>     >
>>>     > summary(random_values)
>>>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>       17.32   20.14   23.00   22.17   24.07   25.90
>>>
>>> Can someone help understand what is going on?
>>>
>>> Ubuntu
>>> ------
>>>
>>>     R version 3.4.0 (2017-04-21)
>>>     Platform: x86_64-pc-linux-gnu (64-bit)
>>>     Running under: Ubuntu 16.04.2 LTS
>>>
>>>     Matrix products: default
>>>     BLAS: /usr/lib/libblas/libblas.so.3.6.0
>>>     LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>>>
>>>     locale:
>>>     [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C
>>>      [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8
>>>      [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
>>>      [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8
>>>      [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8
>>>     [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
>>>
>>>     attached base packages:
>>>     [1] parallel  stats     graphics  grDevices utils     datasets
>>> methods   base
>>>
>>>     other attached packages:
>>>     [1] RMySQL_0.10.8               DBI_0.6-1
>>>      [3] jsonlite_1.4                tidyjson_0.2.2
>>>      [5] optiRum_0.37.3              lubridate_1.6.0
>>>      [7] httr_1.2.1                  gdata_2.18.0
>>>      [9] XLConnect_0.2-12            XLConnectJars_0.2-12
>>>     [11] data.table_1.10.4           stringr_1.2.0
>>>     [13] readxl_1.0.0                xlsx_0.5.7
>>>     [15] xlsxjars_0.6.1              rJava_0.9-8
>>>     [17] sqldf_0.4-10                RSQLite_1.1-2
>>>     [19] gsubfn_0.6-6                proto_1.0.0
>>>     [21] dplyr_0.5.0                 purrr_0.2.4
>>>     [23] readr_1.1.1                 tidyr_0.6.3
>>>     [25] tibble_1.3.0                tidyverse_1.1.1
>>>     [27] rBayesianOptimization_1.1.0 xgboost_0.6-4
>>>     [29] MLmetrics_1.1.1             caret_6.0-76
>>>     [31] ROCR_1.0-7                  gplots_3.0.1
>>>     [33] effects_3.1-2               pROC_1.10.0
>>>     [35] pscl_1.4.9                  lattice_0.20-35
>>>     [37] MASS_7.3-47                 ggplot2_2.2.1
>>>
>>>     loaded via a namespace (and not attached):
>>>     [1] splines_3.4.0      foreach_1.4.3      AUC_0.3.0
>>> modelr_0.1.0
>>>      [5] gtools_3.5.0       assertthat_0.2.0   stats4_3.4.0
>>>  cellranger_1.1.0
>>>      [9] quantreg_5.33      chron_2.3-50       digest_0.6.10
>>> rvest_0.3.2
>>>     [13] minqa_1.2.4        colorspace_1.3-2   Matrix_1.2-10
>>> plyr_1.8.4
>>>     [17] psych_1.7.3.21     XML_3.98-1.7       broom_0.4.2
>>> SparseM_1.77
>>>     [21] haven_1.0.0        scales_0.4.1       lme4_1.1-13
>>> MatrixModels_0.4-1
>>>     [25] mgcv_1.8-17        car_2.1-5          nnet_7.3-12
>>> lazyeval_0.2.0
>>>     [29] pbkrtest_0.4-7     mnormt_1.5-5       magrittr_1.5
>>>  memoise_1.0.0
>>>     [33] nlme_3.1-131       forcats_0.2.0      xml2_1.1.1
>>>  foreign_0.8-69
>>>     [37] tools_3.4.0        hms_0.3            munsell_0.4.3
>>> compiler_3.4.0
>>>     [41] caTools_1.17.1     rlang_0.1.1        grid_3.4.0
>>>  nloptr_1.0.4
>>>     [45] iterators_1.0.8    bitops_1.0-6       tcltk_3.4.0
>>> gtable_0.2.0
>>>     [49] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2
>>>  R6_2.2.0
>>>
>>>     [53] knitr_1.15.1       KernSmooth_2.23-15 stringi_1.1.5
>>> Rcpp_0.12.11
>>>
>>>
>>>
>>> Windows
>>> -------
>>>
>>>     > sessionInfo()
>>>     R version 3.3.2 (2016-10-31)
>>>     Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>     Running under: Windows >= 8 x64 (build 9200)
>>>
>>>     locale:
>>>     [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
>>> LC_MONETARY=English_India.1252
>>>     [4] LC_NUMERIC=C                   LC_TIME=English_India.1252
>>>
>>>     attached base packages:
>>>     [1] graphics  grDevices utils     datasets  grid      stats
>>>  methods   base
>>>
>>>     other attached packages:
>>>      [1] bindrcpp_0.2         h2o_3.14.0.3         ggrepel_0.6.5
>>> eulerr_1.1.0         VennDiagram_1.6.17
>>>      [6] futile.logger_1.4.3  scales_0.4.1         FinCal_0.6.3
>>>  xml2_1.0.0           httr_1.3.0
>>>     [11] wesanderson_0.3.2    wordcloud_2.5        RColorBrewer_1.1-2
>>>  htmltools_0.3.6      urltools_1.6.0
>>>     [16] timevis_0.4          dtplyr_0.0.1         magrittr_1.5
>>>  shiny_1.0.5          RODBC_1.3-14
>>>     [21] zoo_1.8-0            sqldf_0.4-10         RSQLite_1.1-2
>>> gsubfn_0.6-6         proto_1.0.0
>>>     [26] gdata_2.17.0         stringr_1.2.0        XLConnect_0.2-12
>>>  XLConnectJars_0.2-12 data.table_1.10.4
>>>     [31] xlsx_0.5.7           xlsxjars_0.6.1       rJava_0.9-8
>>> readxl_0.1.1         googlesheets_0.2.1
>>>     [36] jsonlite_1.5         tidyjson_0.2.1       RMySQL_0.10.9
>>> RPostgreSQL_0.4-1    DBI_0.5-1
>>>     [41] dplyr_0.7.2          purrr_0.2.3          readr_1.1.1
>>> tidyr_0.7.0          tibble_1.3.3
>>>     [46] ggplot2_2.2.0        tidyverse_1.0.0      lubridate_1.6.0
>>>
>>>     loaded via a namespace (and not attached):
>>>      [1] gtools_3.5.0         assertthat_0.2.0     triebeard_0.3.0
>>> cellranger_1.1.0     yaml_2.1.14
>>>      [6] slam_0.1-40          lattice_0.20-34      glue_1.1.1
>>>  chron_2.3-48         digest_0.6.12.1
>>>     [11] colorspace_1.3-1     httpuv_1.3.5         plyr_1.8.4
>>>  pkgconfig_2.0.1      xtable_1.8-2
>>>     [16] lazyeval_0.2.0       mime_0.5             memoise_1.0.0
>>> tools_3.3.2          hms_0.3
>>>     [21] munsell_0.4.3        lambda.r_1.1.9       rlang_0.1.1
>>> RCurl_1.95-4.8       labeling_0.3
>>>     [26] bitops_1.0-6         tcltk_3.3.2          gtable_0.2.0
>>>  reshape2_1.4.2       R6_2.2.0
>>>     [31] bindr_0.1            futile.options_1.0.0 stringi_1.1.2
>>> Rcpp_0.12.12.1
>>>
>>>   [1]: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/Unif
>>> orm.html
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Tirthankar Chakravarty-2
Bill,

Appreciate the point that both you and Serguei are making, but the sequence
in question is not a selected or filtered set. These are values as observed
in a sequence from a  mechanism described below. The probabilities required
to generate this exact sequence in the wild seem staggering to me.

T

On Fri, Nov 3, 2017 at 11:27 PM, William Dunlap <[hidden email]> wrote:

> Another other generator is subject to the same problem with the same
> probabilitiy.
>
> > Filter(function(s){set.seed(s, kind="Knuth-TAOCP-2002");runif(1,17,26)>25.99},
> 1:10000)
>  [1]  280  415  826 1372 2224 2544 3270 3594 3809 4116 4236 5018 5692 7043
> 7212 7364 7747 9256 9491 9568 9886
>
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Fri, Nov 3, 2017 at 10:31 AM, Tirthankar Chakravarty <
> [hidden email]> wrote:
>
>>
>> Bill,
>>
>> I have clarified this on SO, and I will copy that clarification in here:
>>
>> "Sure, we tested them on other 8-digit numbers as well & we could not
>> replicate. However, these are honest-to-goodness numbers generated by a
>> non-adversarial system that has no conception of these numbers being used
>> for anything other than a unique key for an entity -- these are not a
>> specially constructed edge case. Would be good to know what seeds will and
>> will not work, and why."
>>
>> These numbers are generated by an application that serves a form, and
>> associates form IDs in a sequence. The application calls our API depending
>> on the form values entered by users, which in turn calls our R code that
>> executes some code that needs an RNG. Since the API has to be stateless, to
>> be able to replicate the results for possible debugging, we need to draw
>> random numbers in a way that we can replicate the results of the API
>> response -- we use the form ID as seeds.
>>
>> I repeat, there is no design or anything adversarial about the way that
>> these numbers were generated -- the system generating these numbers and
>> the users entering inputs have no conception of our use of an RNG -- this
>> is meant to just be a random sequence of form IDs. This issue was
>> discovered completely by chance when the output of the API was observed to
>> be highly non-random. It is possible that it is a 1/10^8 chance, but that
>> is hard to believe, given that the API hit depends on user input. Note also
>> that the issue goes away when we use a different RNG as mentioned below.
>>
>> T
>>
>> On Fri, Nov 3, 2017 at 9:58 PM, William Dunlap <[hidden email]> wrote:
>>
>>> The random numbers in a stream initialized with one seed should have
>>> about the desired distribution.  You don't win by changing the seed all the
>>> time.  Your seeds caused the first numbers of a bunch of streams to be
>>> about the same, but the second and subsequent entries in each stream do
>>> look uniformly distributed.
>>>
>>> You didn't say what your 'upstream process' was, but it is easy to come
>>> up with seeds that give about the same first value:
>>>
>>> > Filter(function(s){set.seed(s);runif(1,17,26)>25.99}, 1:10000)
>>>  [1]  514  532 1951 2631 3974 4068 4229 6092 6432 7264 9090
>>>
>>>
>>>
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
>>>
>>> On Fri, Nov 3, 2017 at 12:49 AM, Tirthankar Chakravarty <
>>> [hidden email]> wrote:
>>>
>>>> This is cross-posted from SO (https://stackoverflow.com/q/4
>>>> 7079702/1414455),
>>>> but I now feel that this needs someone from R-Devel to help understand
>>>> why
>>>> this is happening.
>>>>
>>>> We are facing a weird situation in our code when using R's [`runif`][1]
>>>> and
>>>> setting seed with `set.seed` with the `kind = NULL` option (which
>>>> resolves,
>>>> unless I am mistaken, to `kind = "default"`; the default being
>>>> `"Mersenne-Twister"`).
>>>>
>>>> We set the seed using (8 digit) unique IDs generated by an upstream
>>>> system,
>>>> before calling `runif`:
>>>>
>>>>     seeds = c(
>>>>       "86548915", "86551615", "86566163", "86577411", "86584144",
>>>>       "86584272", "86620568", "86724613", "86756002", "86768593",
>>>> "86772411",
>>>>       "86781516", "86794389", "86805854", "86814600", "86835092",
>>>> "86874179",
>>>>       "86876466", "86901193", "86987847", "86988080")
>>>>
>>>>     random_values = sapply(seeds, function(x) {
>>>>       set.seed(x)
>>>>       y = runif(1, 17, 26)
>>>>       return(y)
>>>>     })
>>>>
>>>> This gives values that are **extremely** bunched together.
>>>>
>>>>     > summary(random_values)
>>>>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>>       25.13   25.36   25.66   25.58   25.83   25.94
>>>>
>>>> This behaviour of `runif` goes away when we use `kind =
>>>> "Knuth-TAOCP-2002"`, and we get values that appear to be much more
>>>> evenly
>>>> spread out.
>>>>
>>>>     random_values = sapply(seeds, function(x) {
>>>>       set.seed(x, kind = "Knuth-TAOCP-2002")
>>>>       y = runif(1, 17, 26)
>>>>       return(y)
>>>>     })
>>>>
>>>> *Output omitted.*
>>>>
>>>> ---
>>>>
>>>> **The most interesting thing here is that this does not happen on
>>>> Windows
>>>> -- only happens on Ubuntu** (`sessionInfo` output for Ubuntu & Windows
>>>> below).
>>>>
>>>> # Windows output: #
>>>>
>>>>     > seeds = c(
>>>>     +   "86548915", "86551615", "86566163", "86577411", "86584144",
>>>>     +   "86584272", "86620568", "86724613", "86756002", "86768593",
>>>> "86772411",
>>>>     +   "86781516", "86794389", "86805854", "86814600", "86835092",
>>>> "86874179",
>>>>     +   "86876466", "86901193", "86987847", "86988080")
>>>>     >
>>>>     > random_values = sapply(seeds, function(x) {
>>>>     +   set.seed(x)
>>>>     +   y = runif(1, 17, 26)
>>>>     +   return(y)
>>>>     + })
>>>>     >
>>>>     > summary(random_values)
>>>>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>>       17.32   20.14   23.00   22.17   24.07   25.90
>>>>
>>>> Can someone help understand what is going on?
>>>>
>>>> Ubuntu
>>>> ------
>>>>
>>>>     R version 3.4.0 (2017-04-21)
>>>>     Platform: x86_64-pc-linux-gnu (64-bit)
>>>>     Running under: Ubuntu 16.04.2 LTS
>>>>
>>>>     Matrix products: default
>>>>     BLAS: /usr/lib/libblas/libblas.so.3.6.0
>>>>     LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>>>>
>>>>     locale:
>>>>     [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C
>>>>      [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8
>>>>      [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
>>>>      [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8
>>>>      [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8
>>>>     [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
>>>>
>>>>     attached base packages:
>>>>     [1] parallel  stats     graphics  grDevices utils     datasets
>>>> methods   base
>>>>
>>>>     other attached packages:
>>>>     [1] RMySQL_0.10.8               DBI_0.6-1
>>>>      [3] jsonlite_1.4                tidyjson_0.2.2
>>>>      [5] optiRum_0.37.3              lubridate_1.6.0
>>>>      [7] httr_1.2.1                  gdata_2.18.0
>>>>      [9] XLConnect_0.2-12            XLConnectJars_0.2-12
>>>>     [11] data.table_1.10.4           stringr_1.2.0
>>>>     [13] readxl_1.0.0                xlsx_0.5.7
>>>>     [15] xlsxjars_0.6.1              rJava_0.9-8
>>>>     [17] sqldf_0.4-10                RSQLite_1.1-2
>>>>     [19] gsubfn_0.6-6                proto_1.0.0
>>>>     [21] dplyr_0.5.0                 purrr_0.2.4
>>>>     [23] readr_1.1.1                 tidyr_0.6.3
>>>>     [25] tibble_1.3.0                tidyverse_1.1.1
>>>>     [27] rBayesianOptimization_1.1.0 xgboost_0.6-4
>>>>     [29] MLmetrics_1.1.1             caret_6.0-76
>>>>     [31] ROCR_1.0-7                  gplots_3.0.1
>>>>     [33] effects_3.1-2               pROC_1.10.0
>>>>     [35] pscl_1.4.9                  lattice_0.20-35
>>>>     [37] MASS_7.3-47                 ggplot2_2.2.1
>>>>
>>>>     loaded via a namespace (and not attached):
>>>>     [1] splines_3.4.0      foreach_1.4.3      AUC_0.3.0
>>>> modelr_0.1.0
>>>>      [5] gtools_3.5.0       assertthat_0.2.0   stats4_3.4.0
>>>>  cellranger_1.1.0
>>>>      [9] quantreg_5.33      chron_2.3-50       digest_0.6.10
>>>> rvest_0.3.2
>>>>     [13] minqa_1.2.4        colorspace_1.3-2   Matrix_1.2-10
>>>> plyr_1.8.4
>>>>     [17] psych_1.7.3.21     XML_3.98-1.7       broom_0.4.2
>>>> SparseM_1.77
>>>>     [21] haven_1.0.0        scales_0.4.1       lme4_1.1-13
>>>> MatrixModels_0.4-1
>>>>     [25] mgcv_1.8-17        car_2.1-5          nnet_7.3-12
>>>> lazyeval_0.2.0
>>>>     [29] pbkrtest_0.4-7     mnormt_1.5-5       magrittr_1.5
>>>>  memoise_1.0.0
>>>>     [33] nlme_3.1-131       forcats_0.2.0      xml2_1.1.1
>>>>  foreign_0.8-69
>>>>     [37] tools_3.4.0        hms_0.3            munsell_0.4.3
>>>> compiler_3.4.0
>>>>     [41] caTools_1.17.1     rlang_0.1.1        grid_3.4.0
>>>>  nloptr_1.0.4
>>>>     [45] iterators_1.0.8    bitops_1.0-6       tcltk_3.4.0
>>>> gtable_0.2.0
>>>>     [49] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2
>>>>  R6_2.2.0
>>>>
>>>>     [53] knitr_1.15.1       KernSmooth_2.23-15 stringi_1.1.5
>>>> Rcpp_0.12.11
>>>>
>>>>
>>>>
>>>> Windows
>>>> -------
>>>>
>>>>     > sessionInfo()
>>>>     R version 3.3.2 (2016-10-31)
>>>>     Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>     Running under: Windows >= 8 x64 (build 9200)
>>>>
>>>>     locale:
>>>>     [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
>>>> LC_MONETARY=English_India.1252
>>>>     [4] LC_NUMERIC=C                   LC_TIME=English_India.1252
>>>>
>>>>     attached base packages:
>>>>     [1] graphics  grDevices utils     datasets  grid      stats
>>>>  methods   base
>>>>
>>>>     other attached packages:
>>>>      [1] bindrcpp_0.2         h2o_3.14.0.3         ggrepel_0.6.5
>>>> eulerr_1.1.0         VennDiagram_1.6.17
>>>>      [6] futile.logger_1.4.3  scales_0.4.1         FinCal_0.6.3
>>>>  xml2_1.0.0           httr_1.3.0
>>>>     [11] wesanderson_0.3.2    wordcloud_2.5        RColorBrewer_1.1-2
>>>>  htmltools_0.3.6      urltools_1.6.0
>>>>     [16] timevis_0.4          dtplyr_0.0.1         magrittr_1.5
>>>>  shiny_1.0.5          RODBC_1.3-14
>>>>     [21] zoo_1.8-0            sqldf_0.4-10         RSQLite_1.1-2
>>>> gsubfn_0.6-6         proto_1.0.0
>>>>     [26] gdata_2.17.0         stringr_1.2.0        XLConnect_0.2-12
>>>>  XLConnectJars_0.2-12 data.table_1.10.4
>>>>     [31] xlsx_0.5.7           xlsxjars_0.6.1       rJava_0.9-8
>>>> readxl_0.1.1         googlesheets_0.2.1
>>>>     [36] jsonlite_1.5         tidyjson_0.2.1       RMySQL_0.10.9
>>>> RPostgreSQL_0.4-1    DBI_0.5-1
>>>>     [41] dplyr_0.7.2          purrr_0.2.3          readr_1.1.1
>>>> tidyr_0.7.0          tibble_1.3.3
>>>>     [46] ggplot2_2.2.0        tidyverse_1.0.0      lubridate_1.6.0
>>>>
>>>>     loaded via a namespace (and not attached):
>>>>      [1] gtools_3.5.0         assertthat_0.2.0     triebeard_0.3.0
>>>> cellranger_1.1.0     yaml_2.1.14
>>>>      [6] slam_0.1-40          lattice_0.20-34      glue_1.1.1
>>>>  chron_2.3-48         digest_0.6.12.1
>>>>     [11] colorspace_1.3-1     httpuv_1.3.5         plyr_1.8.4
>>>>  pkgconfig_2.0.1      xtable_1.8-2
>>>>     [16] lazyeval_0.2.0       mime_0.5             memoise_1.0.0
>>>> tools_3.3.2          hms_0.3
>>>>     [21] munsell_0.4.3        lambda.r_1.1.9       rlang_0.1.1
>>>> RCurl_1.95-4.8       labeling_0.3
>>>>     [26] bitops_1.0-6         tcltk_3.3.2          gtable_0.2.0
>>>>  reshape2_1.4.2       R6_2.2.0
>>>>     [31] bindr_0.1            futile.options_1.0.0 stringi_1.1.2
>>>> Rcpp_0.12.12.1
>>>>
>>>>   [1]: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/Unif
>>>> orm.html
>>>>
>>>>         [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> [hidden email] mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>
>>>
>>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Radford Neal
In reply to this post by Tirthankar Chakravarty-2
In the code below, you seem to be essentially using the random number
generator to implement a hash function.  This isn't a good idea.

My impression is that pseudo-random number generation methods are
generally evaluated by whether the sequence produced from any seed
"appears" to be random.  Informally, there may be some effort to make
long sequences started with seeds 1, 2, 3, etc. appear unrelated,
since that is a common use pattern when running a simulation several
times to check on variability.  But you are relying on the FIRST
number from each sequence being apparently unrelated to the seed.  
I think few or none of the people designing pseudo-random number
generators evaluate their methods by that criterion.

There is, however, a large literature on hash functions, which is
what you should look at.

But if you want a quick fix, perhaps looking not at the first number
in the sequence, but rather (say) the 10th, might be preferable.

   Radford Neal


> > seeds = c(86548915L, 86551615L, 86566163L, 86577411L, 86584144L,
> 86584272L,
> +   86620568L, 86724613L, 86756002L, 86768593L, 86772411L, 86781516L,
> +   86794389L, 86805854L, 86814600L, 86835092L, 86874179L, 86876466L,
> +   86901193L, 86987847L, 86988080L)
> >
> > random_values = sapply(seeds, function(x) {
> +   set.seed(x)
> +   y = runif(1, 17, 26)
> +   return(y)
> + })
> >
> > summary(random_values)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   25.13   25.36   25.66   25.58   25.83   25.94

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Daniel Nordlund-3
In reply to this post by Tirthankar Chakravarty-2
Tirthankar,

"random number generators" do not produce random numbers.  Any given
generator produces a fixed sequence of numbers that appear to meet
various tests of randomness.  By picking a seed you enter that sequence
in a particular place and subsequent numbers in the sequence appear to
be unrelated.  There are no guarantees that if YOU pick a SET of seeds
they won't produce a set of values that are of a similar magnitude.

You can likely solve your problem by following Radford Neal's advice of
not using the the first number from each seed.  However, you don't need
to use anything more than the second number.  So, you can modify your
function as follows:

function(x) {
       set.seed(x, kind = "default")
       y = runif(2, 17, 26)
       return(y[2])
     }

Hope this is helpful,

Dan

--
Daniel Nordlund
Port Townsend, WA  USA


On 11/3/2017 11:30 AM, Tirthankar Chakravarty wrote:

> Bill,
>
> Appreciate the point that both you and Serguei are making, but the sequence
> in question is not a selected or filtered set. These are values as observed
> in a sequence from a  mechanism described below. The probabilities required
> to generate this exact sequence in the wild seem staggering to me.
>
> T
>
> On Fri, Nov 3, 2017 at 11:27 PM, William Dunlap <[hidden email]> wrote:
>
>> Another other generator is subject to the same problem with the same
>> probabilitiy.
>>
>>> Filter(function(s){set.seed(s, kind="Knuth-TAOCP-2002");runif(1,17,26)>25.99},
>> 1:10000)
>>   [1]  280  415  826 1372 2224 2544 3270 3594 3809 4116 4236 5018 5692 7043
>> 7212 7364 7747 9256 9491 9568 9886
>>
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Fri, Nov 3, 2017 at 10:31 AM, Tirthankar Chakravarty <
>> [hidden email]> wrote:
>>
>>>
>>> Bill,
>>>
>>> I have clarified this on SO, and I will copy that clarification in here:
>>>
>>> "Sure, we tested them on other 8-digit numbers as well & we could not
>>> replicate. However, these are honest-to-goodness numbers generated by a
>>> non-adversarial system that has no conception of these numbers being used
>>> for anything other than a unique key for an entity -- these are not a
>>> specially constructed edge case. Would be good to know what seeds will and
>>> will not work, and why."
>>>
>>> These numbers are generated by an application that serves a form, and
>>> associates form IDs in a sequence. The application calls our API depending
>>> on the form values entered by users, which in turn calls our R code that
>>> executes some code that needs an RNG. Since the API has to be stateless, to
>>> be able to replicate the results for possible debugging, we need to draw
>>> random numbers in a way that we can replicate the results of the API
>>> response -- we use the form ID as seeds.
>>>
>>> I repeat, there is no design or anything adversarial about the way that
>>> these numbers were generated -- the system generating these numbers and
>>> the users entering inputs have no conception of our use of an RNG -- this
>>> is meant to just be a random sequence of form IDs. This issue was
>>> discovered completely by chance when the output of the API was observed to
>>> be highly non-random. It is possible that it is a 1/10^8 chance, but that
>>> is hard to believe, given that the API hit depends on user input. Note also
>>> that the issue goes away when we use a different RNG as mentioned below.
>>>
>>> T
>>>
>>> On Fri, Nov 3, 2017 at 9:58 PM, William Dunlap <[hidden email]> wrote:
>>>
>>>> The random numbers in a stream initialized with one seed should have
>>>> about the desired distribution.  You don't win by changing the seed all the
>>>> time.  Your seeds caused the first numbers of a bunch of streams to be
>>>> about the same, but the second and subsequent entries in each stream do
>>>> look uniformly distributed.
>>>>
>>>> You didn't say what your 'upstream process' was, but it is easy to come
>>>> up with seeds that give about the same first value:
>>>>
>>>>> Filter(function(s){set.seed(s);runif(1,17,26)>25.99}, 1:10000)
>>>>   [1]  514  532 1951 2631 3974 4068 4229 6092 6432 7264 9090
>>>>
>>>>
>>>>
>>>> Bill Dunlap
>>>> TIBCO Software
>>>> wdunlap tibco.com
>>>>
>>>> On Fri, Nov 3, 2017 at 12:49 AM, Tirthankar Chakravarty <
>>>> [hidden email]> wrote:
>>>>
>>>>> This is cross-posted from SO (https://stackoverflow.com/q/4
>>>>> 7079702/1414455),
>>>>> but I now feel that this needs someone from R-Devel to help understand
>>>>> why
>>>>> this is happening.
>>>>>
>>>>> We are facing a weird situation in our code when using R's [`runif`][1]
>>>>> and
>>>>> setting seed with `set.seed` with the `kind = NULL` option (which
>>>>> resolves,
>>>>> unless I am mistaken, to `kind = "default"`; the default being
>>>>> `"Mersenne-Twister"`).
>>>>>
>>>>> We set the seed using (8 digit) unique IDs generated by an upstream
>>>>> system,
>>>>> before calling `runif`:
>>>>>
>>>>>      seeds = c(
>>>>>        "86548915", "86551615", "86566163", "86577411", "86584144",
>>>>>        "86584272", "86620568", "86724613", "86756002", "86768593",
>>>>> "86772411",
>>>>>        "86781516", "86794389", "86805854", "86814600", "86835092",
>>>>> "86874179",
>>>>>        "86876466", "86901193", "86987847", "86988080")
>>>>>
>>>>>      random_values = sapply(seeds, function(x) {
>>>>>        set.seed(x)
>>>>>        y = runif(1, 17, 26)
>>>>>        return(y)
>>>>>      })
>>>>>
>>>>> This gives values that are **extremely** bunched together.
>>>>>
>>>>>      > summary(random_values)
>>>>>         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>>>        25.13   25.36   25.66   25.58   25.83   25.94
>>>>>
>>>>> This behaviour of `runif` goes away when we use `kind =
>>>>> "Knuth-TAOCP-2002"`, and we get values that appear to be much more
>>>>> evenly
>>>>> spread out.
>>>>>
>>>>>      random_values = sapply(seeds, function(x) {
>>>>>        set.seed(x, kind = "Knuth-TAOCP-2002")
>>>>>        y = runif(1, 17, 26)
>>>>>        return(y)
>>>>>      })
>>>>>
>>>>> *Output omitted.*
>>>>>
>>>>> ---
>>>>>
>>>>> **The most interesting thing here is that this does not happen on
>>>>> Windows
>>>>> -- only happens on Ubuntu** (`sessionInfo` output for Ubuntu & Windows
>>>>> below).
>>>>>
>>>>> # Windows output: #
>>>>>
>>>>>      > seeds = c(
>>>>>      +   "86548915", "86551615", "86566163", "86577411", "86584144",
>>>>>      +   "86584272", "86620568", "86724613", "86756002", "86768593",
>>>>> "86772411",
>>>>>      +   "86781516", "86794389", "86805854", "86814600", "86835092",
>>>>> "86874179",
>>>>>      +   "86876466", "86901193", "86987847", "86988080")
>>>>>      >
>>>>>      > random_values = sapply(seeds, function(x) {
>>>>>      +   set.seed(x)
>>>>>      +   y = runif(1, 17, 26)
>>>>>      +   return(y)
>>>>>      + })
>>>>>      >
>>>>>      > summary(random_values)
>>>>>         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>>>        17.32   20.14   23.00   22.17   24.07   25.90
>>>>>
>>>>> Can someone help understand what is going on?
>>>>>
>>>>> Ubuntu
>>>>> ------
>>>>>
>>>>>      R version 3.4.0 (2017-04-21)
>>>>>      Platform: x86_64-pc-linux-gnu (64-bit)
>>>>>      Running under: Ubuntu 16.04.2 LTS
>>>>>
>>>>>      Matrix products: default
>>>>>      BLAS: /usr/lib/libblas/libblas.so.3.6.0
>>>>>      LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>>>>>
>>>>>      locale:
>>>>>      [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C
>>>>>       [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8
>>>>>       [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
>>>>>       [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8
>>>>>       [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8
>>>>>      [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
>>>>>
>>>>>      attached base packages:
>>>>>      [1] parallel  stats     graphics  grDevices utils     datasets
>>>>> methods   base
>>>>>
>>>>>      other attached packages:
>>>>>      [1] RMySQL_0.10.8               DBI_0.6-1
>>>>>       [3] jsonlite_1.4                tidyjson_0.2.2
>>>>>       [5] optiRum_0.37.3              lubridate_1.6.0
>>>>>       [7] httr_1.2.1                  gdata_2.18.0
>>>>>       [9] XLConnect_0.2-12            XLConnectJars_0.2-12
>>>>>      [11] data.table_1.10.4           stringr_1.2.0
>>>>>      [13] readxl_1.0.0                xlsx_0.5.7
>>>>>      [15] xlsxjars_0.6.1              rJava_0.9-8
>>>>>      [17] sqldf_0.4-10                RSQLite_1.1-2
>>>>>      [19] gsubfn_0.6-6                proto_1.0.0
>>>>>      [21] dplyr_0.5.0                 purrr_0.2.4
>>>>>      [23] readr_1.1.1                 tidyr_0.6.3
>>>>>      [25] tibble_1.3.0                tidyverse_1.1.1
>>>>>      [27] rBayesianOptimization_1.1.0 xgboost_0.6-4
>>>>>      [29] MLmetrics_1.1.1             caret_6.0-76
>>>>>      [31] ROCR_1.0-7                  gplots_3.0.1
>>>>>      [33] effects_3.1-2               pROC_1.10.0
>>>>>      [35] pscl_1.4.9                  lattice_0.20-35
>>>>>      [37] MASS_7.3-47                 ggplot2_2.2.1
>>>>>
>>>>>      loaded via a namespace (and not attached):
>>>>>      [1] splines_3.4.0      foreach_1.4.3      AUC_0.3.0
>>>>> modelr_0.1.0
>>>>>       [5] gtools_3.5.0       assertthat_0.2.0   stats4_3.4.0
>>>>>   cellranger_1.1.0
>>>>>       [9] quantreg_5.33      chron_2.3-50       digest_0.6.10
>>>>> rvest_0.3.2
>>>>>      [13] minqa_1.2.4        colorspace_1.3-2   Matrix_1.2-10
>>>>> plyr_1.8.4
>>>>>      [17] psych_1.7.3.21     XML_3.98-1.7       broom_0.4.2
>>>>> SparseM_1.77
>>>>>      [21] haven_1.0.0        scales_0.4.1       lme4_1.1-13
>>>>> MatrixModels_0.4-1
>>>>>      [25] mgcv_1.8-17        car_2.1-5          nnet_7.3-12
>>>>> lazyeval_0.2.0
>>>>>      [29] pbkrtest_0.4-7     mnormt_1.5-5       magrittr_1.5
>>>>>   memoise_1.0.0
>>>>>      [33] nlme_3.1-131       forcats_0.2.0      xml2_1.1.1
>>>>>   foreign_0.8-69
>>>>>      [37] tools_3.4.0        hms_0.3            munsell_0.4.3
>>>>> compiler_3.4.0
>>>>>      [41] caTools_1.17.1     rlang_0.1.1        grid_3.4.0
>>>>>   nloptr_1.0.4
>>>>>      [45] iterators_1.0.8    bitops_1.0-6       tcltk_3.4.0
>>>>> gtable_0.2.0
>>>>>      [49] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2
>>>>>   R6_2.2.0
>>>>>
>>>>>      [53] knitr_1.15.1       KernSmooth_2.23-15 stringi_1.1.5
>>>>> Rcpp_0.12.11
>>>>>
>>>>>
>>>>>
>>>>> Windows
>>>>> -------
>>>>>
>>>>>      > sessionInfo()
>>>>>      R version 3.3.2 (2016-10-31)
>>>>>      Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>      Running under: Windows >= 8 x64 (build 9200)
>>>>>
>>>>>      locale:
>>>>>      [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
>>>>> LC_MONETARY=English_India.1252
>>>>>      [4] LC_NUMERIC=C                   LC_TIME=English_India.1252
>>>>>
>>>>>      attached base packages:
>>>>>      [1] graphics  grDevices utils     datasets  grid      stats
>>>>>   methods   base
>>>>>
>>>>>      other attached packages:
>>>>>       [1] bindrcpp_0.2         h2o_3.14.0.3         ggrepel_0.6.5
>>>>> eulerr_1.1.0         VennDiagram_1.6.17
>>>>>       [6] futile.logger_1.4.3  scales_0.4.1         FinCal_0.6.3
>>>>>   xml2_1.0.0           httr_1.3.0
>>>>>      [11] wesanderson_0.3.2    wordcloud_2.5        RColorBrewer_1.1-2
>>>>>   htmltools_0.3.6      urltools_1.6.0
>>>>>      [16] timevis_0.4          dtplyr_0.0.1         magrittr_1.5
>>>>>   shiny_1.0.5          RODBC_1.3-14
>>>>>      [21] zoo_1.8-0            sqldf_0.4-10         RSQLite_1.1-2
>>>>> gsubfn_0.6-6         proto_1.0.0
>>>>>      [26] gdata_2.17.0         stringr_1.2.0        XLConnect_0.2-12
>>>>>   XLConnectJars_0.2-12 data.table_1.10.4
>>>>>      [31] xlsx_0.5.7           xlsxjars_0.6.1       rJava_0.9-8
>>>>> readxl_0.1.1         googlesheets_0.2.1
>>>>>      [36] jsonlite_1.5         tidyjson_0.2.1       RMySQL_0.10.9
>>>>> RPostgreSQL_0.4-1    DBI_0.5-1
>>>>>      [41] dplyr_0.7.2          purrr_0.2.3          readr_1.1.1
>>>>> tidyr_0.7.0          tibble_1.3.3
>>>>>      [46] ggplot2_2.2.0        tidyverse_1.0.0      lubridate_1.6.0
>>>>>
>>>>>      loaded via a namespace (and not attached):
>>>>>       [1] gtools_3.5.0         assertthat_0.2.0     triebeard_0.3.0
>>>>> cellranger_1.1.0     yaml_2.1.14
>>>>>       [6] slam_0.1-40          lattice_0.20-34      glue_1.1.1
>>>>>   chron_2.3-48         digest_0.6.12.1
>>>>>      [11] colorspace_1.3-1     httpuv_1.3.5         plyr_1.8.4
>>>>>   pkgconfig_2.0.1      xtable_1.8-2
>>>>>      [16] lazyeval_0.2.0       mime_0.5             memoise_1.0.0
>>>>> tools_3.3.2          hms_0.3
>>>>>      [21] munsell_0.4.3        lambda.r_1.1.9       rlang_0.1.1
>>>>> RCurl_1.95-4.8       labeling_0.3
>>>>>      [26] bitops_1.0-6         tcltk_3.3.2          gtable_0.2.0
>>>>>   reshape2_1.4.2       R6_2.2.0
>>>>>      [31] bindr_0.1            futile.options_1.0.0 stringi_1.1.2
>>>>> Rcpp_0.12.12.1
>>>>>
>>>>>    [1]: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/Unif
>>>>> orm.html
>>>>>
>>>>>          [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> [hidden email] mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>
>>>>
>>>>
>>>
>>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Duncan Murdoch-2
On 04/11/2017 10:20 PM, Daniel Nordlund wrote:

> Tirthankar,
>
> "random number generators" do not produce random numbers.  Any given
> generator produces a fixed sequence of numbers that appear to meet
> various tests of randomness.  By picking a seed you enter that sequence
> in a particular place and subsequent numbers in the sequence appear to
> be unrelated.  There are no guarantees that if YOU pick a SET of seeds
> they won't produce a set of values that are of a similar magnitude.
>
> You can likely solve your problem by following Radford Neal's advice of
> not using the the first number from each seed.  However, you don't need
> to use anything more than the second number.  So, you can modify your
> function as follows:
>
> function(x) {
>         set.seed(x, kind = "default")
>         y = runif(2, 17, 26)
>         return(y[2])
>       }
>
> Hope this is helpful,

That's assuming that the chosen seeds are unrelated to the function
output, which seems unlikely on the face of it.  You can certainly
choose a set of seeds that give high values on the second draw just as
easily as you can choose seeds that give high draws on the first draw.

The interesting thing about this problem is that Tirthankar doesn't
believe that the seed selection process is aware of the function output.
  I would say that it must be, and he should be investigating how that
happens if he is worried about the output, he shouldn't be worrying
about R's RNG.

Duncan Murdoch

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Tirthankar Chakravarty-2
Duncan, Daniel,

Thanks and indeed we intend to take the advice that Radford and Lukas have
provided in this thread.

I do want to re-iterate that the generating system itself cannot have any
conception of the use of form IDs as seeds for a PRNG *and* the system
itself only generates a sequence of form IDs, which are then filtered & are
passed to our API depending on basic rules on user inputs in that form.
Either in our production system a truly remarkable probability event has
happened or that the Mersenne-Twister is very susceptible to the first draw
in the sequence to be correlated across closely related seeds. Both of
these require understanding the Mersenne-Twister better.

The solution here as has been suggested is to use a different RNG with
adequate burn-in (in which case even MT would work) or to look more
carefully at our problem and understand if we just need a hash function.

In either case, we will cease to question R's implementation of
Mersenne-Twister (for the time being). :)

T



On Sun, Nov 5, 2017 at 7:47 PM, Duncan Murdoch <[hidden email]>
wrote:

> On 04/11/2017 10:20 PM, Daniel Nordlund wrote:
>
>> Tirthankar,
>>
>> "random number generators" do not produce random numbers.  Any given
>> generator produces a fixed sequence of numbers that appear to meet
>> various tests of randomness.  By picking a seed you enter that sequence
>> in a particular place and subsequent numbers in the sequence appear to
>> be unrelated.  There are no guarantees that if YOU pick a SET of seeds
>> they won't produce a set of values that are of a similar magnitude.
>>
>> You can likely solve your problem by following Radford Neal's advice of
>> not using the the first number from each seed.  However, you don't need
>> to use anything more than the second number.  So, you can modify your
>> function as follows:
>>
>> function(x) {
>>         set.seed(x, kind = "default")
>>         y = runif(2, 17, 26)
>>         return(y[2])
>>       }
>>
>> Hope this is helpful,
>>
>
> That's assuming that the chosen seeds are unrelated to the function
> output, which seems unlikely on the face of it.  You can certainly choose a
> set of seeds that give high values on the second draw just as easily as you
> can choose seeds that give high draws on the first draw.
>
> The interesting thing about this problem is that Tirthankar doesn't
> believe that the seed selection process is aware of the function output.  I
> would say that it must be, and he should be investigating how that happens
> if he is worried about the output, he shouldn't be worrying about R's RNG.
>
> Duncan Murdoch
>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Peter Dalgaard-2
In reply to this post by Duncan Murdoch-2

> On 5 Nov 2017, at 15:17 , Duncan Murdoch <[hidden email]> wrote:
>
> On 04/11/2017 10:20 PM, Daniel Nordlund wrote:
>> Tirthankar,
>> "random number generators" do not produce random numbers.  Any given
>> generator produces a fixed sequence of numbers that appear to meet
>> various tests of randomness.  By picking a seed you enter that sequence
>> in a particular place and subsequent numbers in the sequence appear to
>> be unrelated.  There are no guarantees that if YOU pick a SET of seeds
>> they won't produce a set of values that are of a similar magnitude.
>> You can likely solve your problem by following Radford Neal's advice of
>> not using the the first number from each seed.  However, you don't need
>> to use anything more than the second number.  So, you can modify your
>> function as follows:
>> function(x) {
>>        set.seed(x, kind = "default")
>>        y = runif(2, 17, 26)
>>        return(y[2])
>>      }
>> Hope this is helpful,
>
> That's assuming that the chosen seeds are unrelated to the function output, which seems unlikely on the face of it.  You can certainly choose a set of seeds that give high values on the second draw just as easily as you can choose seeds that give high draws on the first draw.
>
> The interesting thing about this problem is that Tirthankar doesn't believe that the seed selection process is aware of the function output.  I would say that it must be, and he should be investigating how that happens if he is worried about the output, he shouldn't be worrying about R's RNG.
>

Hmm, no. The basic issue is that RNGs are constructed so that with x_{n+1} = f(x_n),
x_1, x_2, x_3,... will look random, not so that f(s_1), f(s_2), f(s_3), ... will look random for any s_1, s_2, ... . This is true, even if seeds s_1, s_2, ... are not chosen so as to mess with the RNG. In the present case, it seems that the seeds around 86e6 tend to give similar output. On the other hand, it is not _just_ the similarity in magnitude that does it, try e.g.

s <- as.integer(runif(1000000, 86.54e6, 86.98e6))
r <- sapply(s, function(s){set.seed(s); runif(1,17,26)})
plot(s,r, pch=".")

and no obvious pattern emerges. My best guess is that the seeds are not only of similar magnitude, but also have other bit-pattern similarities.

(Isn't there a Knuth quote to the effect that "Every random number generator will fail in at least one application"?)

One remaining issue is whether it is really true that the same seeds givee different output on different platforms. That shouldn't happen, I believe.


> Duncan Murdoch
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Paul Gilbert-2
I'll point out that there is there is a large literature on generating
pseudo random numbers for parallel processes, and it is not as easy as
one (at least me) would intuitively think. By a contra-positive like
thinking one might guess that it will not be easy to pick seeds in a way
that will produce independent sequences.

(I'm a bit confused about the objective but) If the objective is to
produce independent sequence from some different seeds then the RNGs for
parallel processing might be a good place to start. (And, BTW, if you
want to reproduce parallel generated random numbers you need to keep
track of both the starting seed and the number of nodes.)

Paul Gilbert

On 11/05/2017 10:58 AM, peter dalgaard wrote:

>
>> On 5 Nov 2017, at 15:17 , Duncan Murdoch <[hidden email]> wrote:
>>
>> On 04/11/2017 10:20 PM, Daniel Nordlund wrote:
>>> Tirthankar,
>>> "random number generators" do not produce random numbers.  Any given
>>> generator produces a fixed sequence of numbers that appear to meet
>>> various tests of randomness.  By picking a seed you enter that sequence
>>> in a particular place and subsequent numbers in the sequence appear to
>>> be unrelated.  There are no guarantees that if YOU pick a SET of seeds
>>> they won't produce a set of values that are of a similar magnitude.
>>> You can likely solve your problem by following Radford Neal's advice of
>>> not using the the first number from each seed.  However, you don't need
>>> to use anything more than the second number.  So, you can modify your
>>> function as follows:
>>> function(x) {
>>>         set.seed(x, kind = "default")
>>>         y = runif(2, 17, 26)
>>>         return(y[2])
>>>       }
>>> Hope this is helpful,
>>
>> That's assuming that the chosen seeds are unrelated to the function output, which seems unlikely on the face of it.  You can certainly choose a set of seeds that give high values on the second draw just as easily as you can choose seeds that give high draws on the first draw.
>>
>> The interesting thing about this problem is that Tirthankar doesn't believe that the seed selection process is aware of the function output.  I would say that it must be, and he should be investigating how that happens if he is worried about the output, he shouldn't be worrying about R's RNG.
>>
>
> Hmm, no. The basic issue is that RNGs are constructed so that with x_{n+1} = f(x_n),
> x_1, x_2, x_3,... will look random, not so that f(s_1), f(s_2), f(s_3), ... will look random for any s_1, s_2, ... . This is true, even if seeds s_1, s_2, ... are not chosen so as to mess with the RNG. In the present case, it seems that the seeds around 86e6 tend to give similar output. On the other hand, it is not _just_ the similarity in magnitude that does it, try e.g.
>
> s <- as.integer(runif(1000000, 86.54e6, 86.98e6))
> r <- sapply(s, function(s){set.seed(s); runif(1,17,26)})
> plot(s,r, pch=".")
>
> and no obvious pattern emerges. My best guess is that the seeds are not only of similar magnitude, but also have other bit-pattern similarities.
>
> (Isn't there a Knuth quote to the effect that "Every random number generator will fail in at least one application"?)
>
> One remaining issue is whether it is really true that the same seeds givee different output on different platforms. That shouldn't happen, I believe.
>
>
>> Duncan Murdoch
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Serguei Sokol
In reply to this post by Duncan Murdoch-2
Le 05/11/2017 à 15:17, Duncan Murdoch a écrit :

> On 04/11/2017 10:20 PM, Daniel Nordlund wrote:
>> Tirthankar,
>>
>> "random number generators" do not produce random numbers.  Any given
>> generator produces a fixed sequence of numbers that appear to meet
>> various tests of randomness.  By picking a seed you enter that sequence
>> in a particular place and subsequent numbers in the sequence appear to
>> be unrelated.  There are no guarantees that if YOU pick a SET of seeds
>> they won't produce a set of values that are of a similar magnitude.
>>
>> You can likely solve your problem by following Radford Neal's advice of
>> not using the the first number from each seed.  However, you don't need
>> to use anything more than the second number.  So, you can modify your
>> function as follows:
>>
>> function(x) {
>>         set.seed(x, kind = "default")
>>         y = runif(2, 17, 26)
>>         return(y[2])
>>       }
>>
>> Hope this is helpful,
>
> That's assuming that the chosen seeds are unrelated to the function output, which seems unlikely on the face of it.  You can certainly choose a set of seeds
> that give high values on the second draw just as easily as you can choose seeds that give high draws on the first draw.
To confirm this statement, I did

s2_25=s[sapply(s, function(i) {set.seed(i); runif(2, 17, 26)[2] > 25})]
length(s2_25) # 48990

For memory, we had
length(s25) # 48631 out of 439166

which is much similar length.
So if we take the second or even the 10-th pseudo-random value we can
fall as easily (or as hard) at a seed sequence giving some narrow set.

Serguei.

>
> The interesting thing about this problem is that Tirthankar doesn't believe that the seed selection process is aware of the function output.  I would say that
> it must be, and he should be investigating how that happens if he is worried about the output, he shouldn't be worrying about R's RNG.
>
> Duncan Murdoch
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Extreme bunching of random values from runif with Mersenne-Twister seed

Duncan Murdoch-2
In reply to this post by Peter Dalgaard-2
On 05/11/2017 10:58 AM, peter dalgaard wrote:

>
>> On 5 Nov 2017, at 15:17 , Duncan Murdoch <[hidden email]> wrote:
>>
>> On 04/11/2017 10:20 PM, Daniel Nordlund wrote:
>>> Tirthankar,
>>> "random number generators" do not produce random numbers.  Any given
>>> generator produces a fixed sequence of numbers that appear to meet
>>> various tests of randomness.  By picking a seed you enter that sequence
>>> in a particular place and subsequent numbers in the sequence appear to
>>> be unrelated.  There are no guarantees that if YOU pick a SET of seeds
>>> they won't produce a set of values that are of a similar magnitude.
>>> You can likely solve your problem by following Radford Neal's advice of
>>> not using the the first number from each seed.  However, you don't need
>>> to use anything more than the second number.  So, you can modify your
>>> function as follows:
>>> function(x) {
>>>         set.seed(x, kind = "default")
>>>         y = runif(2, 17, 26)
>>>         return(y[2])
>>>       }
>>> Hope this is helpful,
>>
>> That's assuming that the chosen seeds are unrelated to the function output, which seems unlikely on the face of it.  You can certainly choose a set of seeds that give high values on the second draw just as easily as you can choose seeds that give high draws on the first draw.
>>
>> The interesting thing about this problem is that Tirthankar doesn't believe that the seed selection process is aware of the function output.  I would say that it must be, and he should be investigating how that happens if he is worried about the output, he shouldn't be worrying about R's RNG.
>>
>
> Hmm, no. The basic issue is that RNGs are constructed so that with x_{n+1} = f(x_n),
> x_1, x_2, x_3,... will look random, not so that f(s_1), f(s_2), f(s_3), ... will look random for any s_1, s_2, ... . This is true, even if seeds s_1, s_2, ... are not chosen so as to mess with the RNG. In the present case, it seems that the seeds around 86e6 tend to give similar output. On the other hand, it is not _just_ the similarity in magnitude that does it, try e.g.
>
> s <- as.integer(runif(1000000, 86.54e6, 86.98e6))
> r <- sapply(s, function(s){set.seed(s); runif(1,17,26)})
> plot(s,r, pch=".")
>
> and no obvious pattern emerges. My best guess is that the seeds are not only of similar magnitude, but also have other bit-pattern similarities.
>
> (Isn't there a Knuth quote to the effect that "Every random number generator will fail in at least one application"?)
>
> One remaining issue is whether it is really true that the same seeds givee different output on different platforms. That shouldn't happen, I believe.

I don't think there's a platform difference if the same generator is
used. In my tests, I get the Ubuntu results on both MacOS and Windows.
In one of the earlier messages, Tirthankar said he was using
RNGkind(kind = NULL), which means earlier experiments with a different
generator would taint the results.

Duncan Murdoch

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel