Hi all,
For some odd reason, when running naïve Bayes, k-NN, etc., I get slightly different results (e.g., error rates, classification probabilities) from run to run even though I am using the same random seed. Nothing else (input-wise) is changing. The only randomness should be in the partitioning, and I have set the seed before that point.

My question is simply: should the location of the set.seed command matter, provided that it is applied before any commands that involve randomness (such as partitioning)?

If you need to see the code, it is below.

Thank you,
Gary

A. Separate the original (in-sample) data from the new (out-of-sample) data. Set a random seed.

> InvestTech <- as.data.frame(InvestTechRevised)
> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
> InvestTech <- InvestTech[1:5000, ]
> set.seed(654321)

B. Install and load the caret, ggplot2 and e1071 packages.

> install.packages("caret")
> install.packages("ggplot2")
> install.packages("e1071")
> library(caret)
> library(ggplot2)
> library(e1071)

C. Bin the predictor variables with approximately equal counts using the cut_number function from the ggplot2 package. We will use 20 bins.

> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)

D. Partition the original (in-sample) data into 60% training and 40% validation sets.

> n <- nrow(InvestTech)
> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
> InvestTechTrain <- InvestTech[train, ]
> InvestTechVal <- InvestTech[-train, ]

E. Use the naiveBayes function in the e1071 package to fit the model.

> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
> prob <- predict(model, newdata = InvestTechVal, type = "raw")
> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)

F. Use the confusionMatrix function in the caret package to output the confusion matrix.

> confMtr <- confusionMatrix(pred, unlist(InvestTechVal[, 3]), mode = "everything", positive = "1")
> accuracy <- confMtr$overall[1]
> valError <- 1 - accuracy
> confMtr

G. Classify the 18 new (out-of-sample) readers using the following code.

> prob <- predict(model, newdata = outOfSample, type = "raw")
> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
> cbind(pred, prob, outOfSample[, -3])

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
In case you don't get an answer from someone more knowledgeable:
1. I don't know.
2. But it is possible that other packages that are loaded after set.seed() fool with the RNG.
3. So I would call set.seed just before you invoke each random number generation, to be safe.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <[hidden email]> wrote:
> [...]
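A minimal sketch of point 3 above, using only base R: re-seeding immediately before the one random step makes the partition reproducible regardless of what happened to the RNG stream in between.

```r
## Re-seed right before the random step: the two partitions match even
## though an unrelated random draw advances the stream in between.
set.seed(654321)
train1 <- sample(1:100, size = 60, replace = FALSE)

runif(5)                   # unrelated randomness disturbs the stream

set.seed(654321)           # re-seed immediately before the random step
train2 <- sample(1:100, size = 60, replace = FALSE)

identical(train1, train2)  # TRUE
```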
If your computations involve the parallel package then set.seed(seed)
may not produce repeatable results. E.g.,

> cl <- parallel::makeCluster(3) # Create cluster with 3 nodes on local host
> set.seed(100); runif(2)
[1] 0.3077661 0.2576725
> parallel::parSapply(cl, 101:103, function(i) runif(2, i, i+1))
         [,1]     [,2]     [,3]
[1,] 101.7779 102.5308 103.3459
[2,] 101.8128 102.6114 103.9102
>
> set.seed(100); runif(2)
[1] 0.3077661 0.2576725
> parallel::parSapply(cl, 101:103, function(i) runif(2, i, i+1))
         [,1]     [,2]     [,3]
[1,] 101.1628 102.9643 103.2684
[2,] 101.9205 102.6937 103.7907

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, Feb 26, 2018 at 4:30 PM, Bert Gunter <[hidden email]> wrote:
> [...]
I am willing to go out on that limb and say the answer to the OP's question is yes: the RN sequence in R should be reproducible. I agree, though, that it doesn't look like he is actually taking care not to run code that would disturb the generator.
--
Sent from my phone. Please excuse my brevity.

On February 26, 2018 4:30:47 PM PST, Bert Gunter <[hidden email]> wrote:
> [...]
On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <[hidden email]>
wrote:

(Sorry to be a bit slow responding.)

You have not supplied a complete example, which would be good in this case because what you are suggesting could be a serious bug in R or a package. Serious journals require reproducibility these days. For example, JSS is very clear on this point.

To your question

> My question simply is: should the location of the set.seed command matter,
> provided that it is applied before any commands which involve randomness
> (such as partitioning)?

the answer is no, it should not matter. But the proviso is important.

You can determine where things are messing up using something like

set.seed(654321)
zk <- RNGkind()  # [1] "Mersenne-Twister" "Inversion"
zk
z <- runif(2)
z
set.seed(654321)

# install.packages(c('caret', 'ggplot2', 'e1071'))
library(caret)
all(runif(2) == z)  # should be TRUE, but it is not always

set.seed(654321)
library(ggplot2)
all(runif(2) == z)  # should be TRUE

set.seed(654321)
library(e1071)
all(runif(2) == z)  # should be TRUE

all(RNGkind() == zk)  # should be TRUE

On my computer, package caret seems sometimes, but not always, to do something that advances or changes the RNG. So you will need to set the seed after that package is loaded if you want reproducibility.

As Bill Dunlap points out, parallel can introduce much more complicated issues. If you are in fact using parallel then we really need a new thread with a better subject line, and the discussion will get much messier. The short answer is that, yes, you should be able to get reproducible results with parallel computing. If you cannot, then you are almost certainly doing something wrong. To publish you really must have reproducible results.

In the example that Bill gave, I think the problem is that set.seed() only resets the seed in the main thread; the nodes continue to operate with unreset RNGs. To demonstrate this to yourself you can do

library(parallel)
cl <- parallel::makeCluster(3)
parallel::clusterCall(cl, function() set.seed(100))
parallel::clusterCall(cl, function() RNGkind())
parallel::clusterCall(cl, function() runif(2))  # similar result from all nodes
# [1] 0.3077661 0.2576725

However, do *NOT* do that in real work: you will be getting the same RNG stream from each node. If you are using random numbers and parallel, you need to read a lot more, and probably consider a variant of the "L'Ecuyer" generator or something else designed for parallel computing.

One special point I will mention because it does not seem to be widely appreciated: the number of nodes affects the random stream, so recording the number of compute nodes along with the RNG and seed information is important for reproducible results. This has the unfortunate consequence that an experiment cannot be reproduced on a larger cluster. (If anyone knows differently, I would very much like to hear.)

Paul Gilbert

> [...]
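A hedged sketch of the "designed for parallel computing" route mentioned above: base R's parallel package provides clusterSetRNGStream(), which switches the nodes to the "L'Ecuyer-CMRG" generator and gives each node its own stream seeded from iseed. Resetting with the same iseed reproduces the parallel draws (with the node count unchanged).

```r
library(parallel)

cl <- makeCluster(2)

## Put each node on its own "L'Ecuyer-CMRG" stream, derived from iseed.
clusterSetRNGStream(cl, iseed = 100)
r1 <- parSapply(cl, 1:2, function(i) runif(2))

## Resetting the node streams with the same iseed reproduces the draws.
clusterSetRNGStream(cl, iseed = 100)
r2 <- parSapply(cl, 1:2, function(i) runif(2))

identical(r1, r2)  # TRUE
stopCluster(cl)
```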
Thank you, everybody, who replied! I appreciate your valuable advice! I will move the set.seed() command to after all packages have been installed and loaded.
Best regards,
Gary

Sent from my iPad

> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <[hidden email]> wrote:
> [...]
The following helps identify when .GlobalEnv$.Random.seed has changed:
rng_tracker <- local({
  last <- .GlobalEnv$.Random.seed
  function(...) {
    curr <- .GlobalEnv$.Random.seed
    if (!identical(curr, last)) {
      warning(".Random.seed changed")
      last <<- curr
    }
    TRUE
  }
})
addTaskCallback(rng_tracker, name = "RNG tracker")

EXAMPLE:

> sample.int(1L)
[1] 1
Warning: .Random.seed changed

This will help you find, for instance:

## Loading ggplot2 does not affect the RNG seed
> loadNamespace("ggplot2")
<environment: namespace:ggplot2>

## But attaching it does
> library("ggplot2")
Warning: .Random.seed changed

which reveals:

> ggplot2:::.onAttach
function (...)
{
    if (!interactive() || stats::runif(1) > 0.1)
        return()
    tips <- c("Need help? Try the ggplot2 mailing list: http://groups.google.com/group/ggplot2.",
        "Find out what's changed in ggplot2 at http://github.com/tidyverse/ggplot2/releases.",
        "Use suppressPackageStartupMessages() to eliminate package startup messages.",
        "Stackoverflow is a great place to get help: http://stackoverflow.com/tags/ggplot2.",
        "Need help getting started? Try the cookbook for R: http://www.cookbook-r.com/Graphs/",
        "Want to understand how all the pieces fit together? Buy the ggplot2 book: http://ggplot2.org/book/")
    tip <- sample(tips, 1)
    packageStartupMessage(paste(strwrap(tip), collapse = "\n"))
}
<environment: namespace:ggplot2>

There are probably many cases of this in different R packages.

R WISH: There could be a

preserveRandomSeed({
  tip <- sample(tips, 1)
})

function in R for these types of random needs where true random properties are non-critical.
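For what it's worth, a minimal sketch of such a helper (the name preserveRandomSeed follows the wish above; it is not part of base R): save .Random.seed, evaluate the expression, and restore the seed on exit.

```r
## Hypothetical helper, not in base R: evaluate `expr`, then restore
## .Random.seed so incidental draws do not disturb the user's RNG stream.
preserveRandomSeed <- function(expr) {
  seed <- .GlobalEnv$.Random.seed  # NULL if no seed exists yet
  on.exit(
    if (is.null(seed)) {
      rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    } else {
      assign(".Random.seed", seed, envir = .GlobalEnv, inherits = FALSE)
    }
  )
  expr
}

set.seed(1)
before <- runif(1)

set.seed(1)
preserveRandomSeed(sample(1:10, 1))  # draws a number, then restores the seed
after <- runif(1)

identical(before, after)  # TRUE
```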
This type of "draw-a-random-number-and-reset-the-seed" is for instance used in parallel:::initDefaultClusterOptions() which is called when the 'parallel' package is loaded: seed <- .GlobalEnv$.Random.seed ran1 <- sample.int(.Machine$integer.max - 1L, 1L) / .Machine$integer.max port <- 11000 + 1000 * ((ran1 + unclass(Sys.time()) / 300) %% 1) if(is.null(seed)) ## there was none, initially rm( ".Random.seed", envir = .GlobalEnv, inherits = FALSE) else # reset assign(".Random.seed", seed, envir = .GlobalEnv, inherits = FALSE) /Henrik On Sun, Mar 4, 2018 at 1:40 PM, Gary Black <[hidden email]> wrote: > Thank you, everybody, who replied! I appreciate your valuable advise! I will move the location of the set.seed() command to after all packages have been installed and loaded. > > Best regards, > Gary > > Sent from my iPad > >> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <[hidden email]> wrote: >> >> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <[hidden email]> >> wrote: >> >> (Sorry to be a bit slow responding.) >> >> You have not supplied a complete example, which would be good in this case because what you are suggesting could be a serious bug in R or a package. Serious journals require reproducibility these days. For example, JSS is very clear on this point. >> >> To your question >> > My question simply is: should the location of the set.seed command matter, >> > provided that it is applied before any commands which involve randomness >> > (such as partitioning)? >> >> the answer is no, it should not matter. But the proviso is important. 
>> >> You can determine where things are messing up using something like >> >> set.seed(654321) >> zk <- RNGkind() # [1] "Mersenne-Twister" "Inversion" >> zk >> z <- runif(2) >> z >> set.seed(654321) >> >> # install.packages(c('caret', 'ggplot2', 'e1071')) >> library(caret) >> all(runif(2) == z) # should be true but it is not always >> >> set.seed(654321) >> library(ggplot2) >> all(runif(2) == z) # should be true >> >> set.seed(654321) >> library(e1071) >> all(runif(2) == z) # should be true >> >> all(RNGkind() == zk) # should be true >> >> On my computer package caret seems to sometimes, but not always, do something that advances or changes the RNG. So you will need to set the seed after that package is loaded if you want reproducibility. >> >> As Bill Dunlap points out, parallel can introduce much more complicated issues. If you are in fact using parallel then we really need a new thread with a better subject line, and the discussion will get much messier. >> >> The short answer is that, yes you should be able to get reproducible results with parallel computing. If you cannot then you are almost certainly doing something wrong. To publish you really must have reproducible results. >> >> In the example that Bill gave, I think the problem is that set.seed() only resets the seed in the main thread, the nodes continue to operate with unreset RNG. To demonstrate this to yourself you can do >> >> library(parallel) >> cl <- parallel::makeCluster(3) >> parallel::clusterCall(cl, function()set.seed(100)) >> parallel::clusterCall(cl, function()RNGkind()) >> parallel::clusterCall(cl, function()runif(2)) # similar result from all nodes >> # [1] 0.3077661 0.2576725 >> >> However, do *NOT* do that in real work. You will be getting the same RNG stream from each node. If you are using random numbers and parallel you need to read a lot more, and probably consider a variant of the "L'Ecuyer" generator or something designed for parallel computing. 
>>
>> One special point I will mention because it does not seem to be widely
>> appreciated: the number of nodes affects the random stream, so recording
>> the number of compute nodes along with the RNG and seed information is
>> important for reproducible results. This has the unfortunate consequence
>> that an experiment cannot be reproduced on a larger cluster. (If anyone
>> knows differently I would very much like to hear.)
>>
>> Paul Gilbert
>>
>>
>> > Hi all,
>> >
>> > For some odd reason when running naïve bayes, k-NN, etc., I get slightly
>> > different results (e.g., error rates, classification probabilities) from
>> > run to run even though I am using the same random seed.
>> >
>> > Nothing else (input-wise) is changing, but my results are somewhat
>> > different from run to run. The only randomness should be in the
>> > partitioning, and I have set the seed before this point.
>> >
>> > My question simply is: should the location of the set.seed command matter,
>> > provided that it is applied before any commands which involve randomness
>> > (such as partitioning)?
>> >
>> > If you need to see the code, it is below:
>> >
>> > Thank you,
>> > Gary
>> >
>> >
>> > A. Separate the original (in-sample) data from the new (out-of-sample)
>> > data. Set a random seed.
>> >
>> >> InvestTech <- as.data.frame(InvestTechRevised)
>> >> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
>> >> InvestTech <- InvestTech[1:5000, ]
>> >> set.seed(654321)
>> >
>> > B. Install and load the caret, ggplot2 and e1071 packages.
>> >
>> >> install.packages("caret")
>> >> install.packages("ggplot2")
>> >> install.packages("e1071")
>> >> library(caret)
>> >> library(ggplot2)
>> >> library(e1071)
>> >
>> > C. Bin the predictor variables with approximately equal counts using
>> > the cut_number function from the ggplot2 package. We will use 20 bins.
>> >
>> >> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
>> >> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
>> >> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
>> >> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
>> >
>> > D. Partition the original (in-sample) data into 60% training and 40%
>> > validation sets.
>> >
>> >> n <- nrow(InvestTech)
>> >> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
>> >> InvestTechTrain <- InvestTech[train, ]
>> >> InvestTechVal <- InvestTech[-train, ]
>> >
>> > E. Use the naiveBayes function in the e1071 package to fit the model.
>> >
>> >> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
>> >> prob <- predict(model, newdata = InvestTechVal, type = "raw")
>> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>> >
>> > F. Use the confusionMatrix function in the caret package to output the
>> > confusion matrix.
>> >
>> >> confMtr <- confusionMatrix(pred, unlist(InvestTechVal[, 3]), mode = "everything", positive = "1")
>> >> accuracy <- confMtr$overall[1]
>> >> valError <- 1 - accuracy
>> >> confMtr
>> >
>> > G. Classify the 18 new (out-of-sample) readers using the following
>> > code.
>> >
>> >> prob <- predict(model, newdata = outOfSample, type = "raw")
>> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>> >> cbind(pred, prob, outOfSample[, -3])
>>
>>
>> If your computations involve the parallel package then set.seed(seed)
>> may not produce repeatable results.
E.g., >> >> > cl <- parallel::makeCluster(3) # Create cluster with 3 nodes on local >> host >> > set.seed(100); runif(2) >> [1] 0.3077661 0.2576725 >> > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1)) >> [,1] [,2] [,3] >> [1,] 101.7779 102.5308 103.3459 >> [2,] 101.8128 102.6114 103.9102 >> > >> > set.seed(100); runif(2) >> [1] 0.3077661 0.2576725 >> > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1)) >> [,1] [,2] [,3] >> [1,] 101.1628 102.9643 103.2684 >> [2,] 101.9205 102.6937 103.7907 >> >> >> Bill Dunlap >> TIBCO Software >> wdunlap tibco.com > > ______________________________________________ > [hidden email] mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ [hidden email] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
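Paul's advice above (use a generator designed for parallel work, such as L'Ecuyer's) is built into the parallel package via clusterSetRNGStream(), which hands each node its own "L'Ecuyer-CMRG" stream. A minimal sketch, reusing Bill's example; with the same iseed and the same number of nodes, the parallel results repeat:

```r
library(parallel)

cl <- makeCluster(3)

## Distribute independent L'Ecuyer-CMRG streams to the nodes.
clusterSetRNGStream(cl, iseed = 100)
r1 <- parSapply(cl, 101:103, function(i) runif(2, i, i + 1))

## Resetting the streams reproduces the same draws on each node.
clusterSetRNGStream(cl, iseed = 100)
r2 <- parSapply(cl, 101:103, function(i) runif(2, i, i + 1))

identical(r1, r2)   # TRUE: every node's stream was reset to the same point
stopCluster(cl)
```

Note Paul's caveat still applies: the result depends on how work is split across nodes, so the cluster size must be recorded alongside the seed.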
On 04/03/2018 5:54 PM, Henrik Bengtsson wrote:
> The following helps identify when .GlobalEnv$.Random.seed has changed: > > rng_tracker <- local({ > last <- .GlobalEnv$.Random.seed > function(...) { > curr <- .GlobalEnv$.Random.seed > if (!identical(curr, last)) { > warning(".Random.seed changed") > last <<- curr > } > TRUE > } > }) > > addTaskCallback(rng_tracker, name = "RNG tracker") > > > EXAMPLE: > >> sample.int(1L) > [1] 1 > Warning: .Random.seed changed > > This will help you find for instance: > > ## Loading ggplot2 does not affect the RNG seed >> loadNamespace("ggplot2") > <environment: namespace:ggplot2> > > ## But attaching it does >> library("ggplot2") > Warning: .Random.seed changed > > which reveals: > >> ggplot2:::.onAttach > function (...) > { > if (!interactive() || stats::runif(1) > 0.1) > return() > tips <- c("Need help? Try the ggplot2 mailing list: > http://groups.google.com/group/ggplot2.", > "Find out what's changed in ggplot2 at > http://github.com/tidyverse/ggplot2/releases.", > "Use suppressPackageStartupMessages() to eliminate package > startup messages.", > "Stackoverflow is a great place to get help: > http://stackoverflow.com/tags/ggplot2.", > "Need help getting started? Try the cookbook for R: > http://www.cookbook-r.com/Graphs/", > "Want to understand how all the pieces fit together? Buy the > ggplot2 book: http://ggplot2.org/book/") > tip <- sample(tips, 1) > packageStartupMessage(paste(strwrap(tip), collapse = "\n")) > } > <environment: namespace:ggplot2> > > There are probably many case of this in different R packages. > > > R WISH: > > There could be a > > preserveRandomSeed({ > tip <- sample(tips, 1) > }) > > function in R for these type of random needs where true random > properties are non-critical. 
This type of > "draw-a-random-number-and-reset-the-seed" is for instance used in > parallel:::initDefaultClusterOptions() which is called when the > 'parallel' package is loaded: > > seed <- .GlobalEnv$.Random.seed > ran1 <- sample.int(.Machine$integer.max - 1L, 1L) / .Machine$integer.max > port <- 11000 + 1000 * ((ran1 + unclass(Sys.time()) / 300) %% 1) > if(is.null(seed)) ## there was none, initially > rm( ".Random.seed", envir = .GlobalEnv, inherits = FALSE) > else # reset > assign(".Random.seed", seed, envir = .GlobalEnv, inherits = FALSE) An issue is that .Random.seed doesn't contain the full state of the RNG system, so restoring it doesn't necessarily lead to an identical sequence of output. The only way to guarantee the sequence will repeat is to call set.seed(n), and that only leads to a tiny fraction of possible states. Expanding .Random.seed so that it does contain the full state would be a good idea, and that would make your preserveRandomSeed really easy to write. Here's a demo that .Random.seed is not enough: > set.seed(123, normal.kind = "Box-Muller") > rnorm(1) [1] -0.1613431 > save <- .Random.seed > rnorm(1) [1] 0.6706031 > .Random.seed <- save > rnorm(1) [1] -0.4194403 If .Random.seed were the complete state, the 2nd and 3rd rnorm results would be the same. Duncan Murdoch > > /Henrik > > On Sun, Mar 4, 2018 at 1:40 PM, Gary Black <[hidden email]> wrote: >> Thank you, everybody, who replied! I appreciate your valuable advise! I will move the location of the set.seed() command to after all packages have been installed and loaded. >> >> Best regards, >> Gary >> >> Sent from my iPad >> >>> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <[hidden email]> wrote: >>> >>> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <[hidden email]> >>> wrote: >>> >>> (Sorry to be a bit slow responding.) >>> >>> You have not supplied a complete example, which would be good in this case because what you are suggesting could be a serious bug in R or a package. 
Serious journals require reproducibility these days. For example, JSS is very clear on this point. >>> >>> To your question >>>> My question simply is: should the location of the set.seed command matter, >>>> provided that it is applied before any commands which involve randomness >>>> (such as partitioning)? >>> >>> the answer is no, it should not matter. But the proviso is important. >>> >>> You can determine where things are messing up using something like >>> >>> set.seed(654321) >>> zk <- RNGkind() # [1] "Mersenne-Twister" "Inversion" >>> zk >>> z <- runif(2) >>> z >>> set.seed(654321) >>> >>> # install.packages(c('caret', 'ggplot2', 'e1071')) >>> library(caret) >>> all(runif(2) == z) # should be true but it is not always >>> >>> set.seed(654321) >>> library(ggplot2) >>> all(runif(2) == z) # should be true >>> >>> set.seed(654321) >>> library(e1071) >>> all(runif(2) == z) # should be true >>> >>> all(RNGkind() == zk) # should be true >>> >>> On my computer package caret seems to sometimes, but not always, do something that advances or changes the RNG. So you will need to set the seed after that package is loaded if you want reproducibility. >>> >>> As Bill Dunlap points out, parallel can introduce much more complicated issues. If you are in fact using parallel then we really need a new thread with a better subject line, and the discussion will get much messier. >>> >>> The short answer is that, yes you should be able to get reproducible results with parallel computing. If you cannot then you are almost certainly doing something wrong. To publish you really must have reproducible results. >>> >>> In the example that Bill gave, I think the problem is that set.seed() only resets the seed in the main thread, the nodes continue to operate with unreset RNG. 
To demonstrate this to yourself you can do >>> >>> library(parallel) >>> cl <- parallel::makeCluster(3) >>> parallel::clusterCall(cl, function()set.seed(100)) >>> parallel::clusterCall(cl, function()RNGkind()) >>> parallel::clusterCall(cl, function()runif(2)) # similar result from all nodes >>> # [1] 0.3077661 0.2576725 >>> >>> However, do *NOT* do that in real work. You will be getting the same RNG stream from each node. If you are using random numbers and parallel you need to read a lot more, and probably consider a variant of the "L'Ecuyer" generator or something designed for parallel computing. >>> >>> One special point I will mention because it does not seem to be widely >>> appreciated: the number of nodes affects the random stream, so recording the number of compute nodes along with the RNG and seed information is important for reproducible results. This has the unfortunate consequence that an experiment cannot be reproduced on a larger cluster. (If anyone knows differently I would very much like to hear.) >>> >>> Paul Gilbert >>> >>> >>>> Hi all, >>>> >>>> For some odd reason when running naïve bayes, k-NN, etc., I get slightly >>>> different results (e.g., error rates, classification probabilities) from >>>> run >>>> to run even though I am using the same random seed. >>>> >>>> Nothing else (input-wise) is changing, but my results are somewhat >>>> different >>>> from run to run. The only randomness should be in the partitioning, and I >>>> have set the seed before this point. >>>> >>>> My question simply is: should the location of the set.seed command matter, >>>> provided that it is applied before any commands which involve randomness >>>> (such as partitioning)? >>>> >>>> If you need to see the code, it is below: >>>> >>>> Thank you, >>>> Gary >>>> >>>> >>>> A. Separate the original (in-sample) data from the new (out-of-sample) >>>> data. Set a random seed. 
>>>> >>>>> InvestTech <- as.data.frame(InvestTechRevised) >>>>> outOfSample <- InvestTech[5001:nrow(InvestTech), ] >>>>> InvestTech <- InvestTech[1:5000, ] >>>>> set.seed(654321) >>>> >>>> B. Install and load the caret, ggplot2 and e1071 packages. >>>> >>>>> install.packages(“caret”) >>>>> install.packages(“ggplot2”) >>>>> install.packages(“e1071”) >>>>> library(caret) >>>>> library(ggplot2) >>>>> library(e1071) >>>> >>>> C. Bin the predictor variables with approximately equal counts using >>>> the cut_number function from the ggplot2 package. We will use 20 bins. >>>> >>>>> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20) >>>>> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20) >>>>> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20) >>>>> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20) >>>> >>>> D. Partition the original (in-sample) data into 60% training and 40% >>>> validation sets. >>>> >>>>> n <- nrow(InvestTech) >>>>> train <- sample(1:n, size = 0.6 * n, replace = FALSE) >>>>> InvestTechTrain <- InvestTech[train, ] >>>>> InvestTechVal <- InvestTech[-train, ] >>>> >>>> E. Use the naiveBayes function in the e1071 package to fit the model. >>>> >>>>> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain) >>>>> prob <- predict(model, newdata = InvestTechVal, type = “raw”) >>>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0) >>>> >>>> F. Use the confusionMatrix function in the caret package to output the >>>> confusion matrix. >>>> >>>>> confMtr <- confusionMatrix(pred,unlist(InvestTechVal[, 3]),mode = >>>> “everything”, positive = “1”) >>>>> accuracy <- confMtr$overall[1] >>>>> valError <- 1 – accuracy >>>>> confMtr >>>> >>>> G. Classify the 18 new (out-of-sample) readers using the following >>>> code. 
>>>>> prob <- predict(model, newdata = outOfSample, type = “raw”) >>>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0) >>>>> cbind(pred, prob, outOfSample[, -3]) >>>> >>>> >>>> >>> >>> >>> If your computations involve the parallel package then set.seed(seed) >>> may not produce repeatable results. E.g., >>> >>>> cl <- parallel::makeCluster(3) # Create cluster with 3 nodes on local >>> host >>>> set.seed(100); runif(2) >>> [1] 0.3077661 0.2576725 >>>> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1)) >>> [,1] [,2] [,3] >>> [1,] 101.7779 102.5308 103.3459 >>> [2,] 101.8128 102.6114 103.9102 >>>> >>>> set.seed(100); runif(2) >>> [1] 0.3077661 0.2576725 >>>> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1)) >>> [,1] [,2] [,3] >>> [1,] 101.1628 102.9643 103.2684 >>> [2,] 101.9205 102.6937 103.7907 >>> >>> >>> Bill Dunlap >>> TIBCO Software >>> wdunlap tibco.com >> >> ______________________________________________ >> [hidden email] mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ > [hidden email] mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ [hidden email] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
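Duncan's demo above shows that saving and restoring .Random.seed is not enough for the Box-Muller normal generator. The remedy documented in help(".Random.seed") is to reproduce work by calling set.seed() with explicit kind and normal.kind, which resets the full generator state; a short sketch:

```r
## Reproduce later by re-seeding with explicit generator kinds,
## rather than saving/restoring .Random.seed (per help(".Random.seed")).
set.seed(123, kind = "Mersenne-Twister", normal.kind = "Box-Muller")
x <- rnorm(3)

set.seed(123, kind = "Mersenne-Twister", normal.kind = "Box-Muller")
y <- rnorm(3)

identical(x, y)   # TRUE: the Box-Muller state is reset along with the seed
```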
On Sun, Mar 4, 2018 at 10:18 AM, Paul Gilbert <[hidden email]> wrote:
> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <[hidden email]> > wrote: > > (Sorry to be a bit slow responding.) > > You have not supplied a complete example, which would be good in this case > because what you are suggesting could be a serious bug in R or a package. > Serious journals require reproducibility these days. For example, JSS is > very clear on this point. > > To your question >> My question simply is: should the location of the set.seed command >> matter, >> provided that it is applied before any commands which involve randomness >> (such as partitioning)? > > the answer is no, it should not matter. But the proviso is important. > > You can determine where things are messing up using something like > > set.seed(654321) > zk <- RNGkind() # [1] "Mersenne-Twister" "Inversion" > zk > z <- runif(2) > z > set.seed(654321) > > # install.packages(c('caret', 'ggplot2', 'e1071')) > library(caret) > all(runif(2) == z) # should be true but it is not always > > set.seed(654321) > library(ggplot2) > all(runif(2) == z) # should be true > > set.seed(654321) > library(e1071) > all(runif(2) == z) # should be true > > all(RNGkind() == zk) # should be true > > On my computer package caret seems to sometimes, but not always, do > something that advances or changes the RNG. So you will need to set the seed > after that package is loaded if you want reproducibility. > > As Bill Dunlap points out, parallel can introduce much more complicated > issues. If you are in fact using parallel then we really need a new thread > with a better subject line, and the discussion will get much messier. > > The short answer is that, yes you should be able to get reproducible results > with parallel computing. If you cannot then you are almost certainly doing > something wrong. To publish you really must have reproducible results. 
> > In the example that Bill gave, I think the problem is that set.seed() only > resets the seed in the main thread, the nodes continue to operate with > unreset RNG. To demonstrate this to yourself you can do > > library(parallel) > cl <- parallel::makeCluster(3) > parallel::clusterCall(cl, function()set.seed(100)) > parallel::clusterCall(cl, function()RNGkind()) > parallel::clusterCall(cl, function()runif(2)) # similar result from all > nodes > # [1] 0.3077661 0.2576725 > > However, do *NOT* do that in real work. You will be getting the same RNG > stream from each node. If you are using random numbers and parallel you need > to read a lot more, and probably consider a variant of the "L'Ecuyer" > generator or something designed for parallel computing. > > One special point I will mention because it does not seem to be widely > appreciated: the number of nodes affects the random stream, so recording the > number of compute nodes along with the RNG and seed information is important > for reproducible results. This has the unfortunate consequence that an > experiment cannot be reproduced on a larger cluster. (If anyone knows > differently I would very much like to hear.) [Disclaimer: I'm the author] future.apply::future_lapply(X, ..., future.seed) etc. can produce identical RNG results regardless of how 'X' is chunked up. 
For example, library(future.apply) task <- function(i) { c(i = i, random = sample.int(10, size = 1), pid = Sys.getpid()) } y <- list() plan(multiprocess, workers = 1L) y[[1]] <- future_sapply(1:10, FUN = task, future.seed = 42) plan(multiprocess, workers = 2L) y[[2]] <- future_sapply(1:10, FUN = task, future.seed = 42) plan(multiprocess, workers = 3L) y[[3]] <- future_sapply(1:10, FUN = task, future.seed = 42) gives the exact same random output: > y [[1]] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] i 1 2 3 4 5 6 7 8 9 10 random 5 10 1 8 7 9 3 5 10 4 pid 31933 31933 31933 31933 31933 31933 31933 31933 31933 31933 [[2]] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] i 1 2 3 4 5 6 7 8 9 10 random 5 10 1 8 7 9 3 5 10 4 pid 32141 32141 32141 32141 32141 32142 32142 32142 32142 32142 [[3]] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] i 1 2 3 4 5 6 7 8 9 10 random 5 10 1 8 7 9 3 5 10 4 pid 32199 32199 32199 32200 32200 32200 32200 32201 32201 32201 To base the RNG on the current RNG seed (== .GlobalEnv$.Random.seed), one can use 'future.seed = TRUE'. For performance reasons, I choose the default to be 'future.seed = FALSE', because there can be a substantial overhead in setting up reproducible L'Ecuyer subRNG-streams for all elements in 'X'. I think the snowFT package by Sevcikova & Rossini also provides this mechanism; Hana Sevcikova is also behind the rlecuyer package. Hope this helps /Henrik > > Paul Gilbert > > > >> Hi all, >> >> For some odd reason when running naïve bayes, k-NN, etc., I get slightly >> different results (e.g., error rates, classification probabilities) from >> run >> to run even though I am using the same random seed. >> >> Nothing else (input-wise) is changing, but my results are somewhat >> different >> from run to run. The only randomness should be in the partitioning, and I >> have set the seed before this point. 
>> >> My question simply is: should the location of the set.seed command >> matter, >> provided that it is applied before any commands which involve randomness >> (such as partitioning)? >> >> If you need to see the code, it is below: >> >> Thank you, >> Gary >> >> >> A. Separate the original (in-sample) data from the new >> (out-of-sample) >> data. Set a random seed. >> >>> InvestTech <- as.data.frame(InvestTechRevised) >>> outOfSample <- InvestTech[5001:nrow(InvestTech), ] >>> InvestTech <- InvestTech[1:5000, ] >>> set.seed(654321) >> >> B. Install and load the caret, ggplot2 and e1071 packages. >> >>> install.packages(“caret”) >>> install.packages(“ggplot2”) >>> install.packages(“e1071”) >>> library(caret) >>> library(ggplot2) >>> library(e1071) >> >> C. Bin the predictor variables with approximately equal counts using >> the cut_number function from the ggplot2 package. We will use 20 bins. >> >>> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20) >>> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20) >>> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20) >>> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20) >> >> D. Partition the original (in-sample) data into 60% training and 40% >> validation sets. >> >>> n <- nrow(InvestTech) >>> train <- sample(1:n, size = 0.6 * n, replace = FALSE) >>> InvestTechTrain <- InvestTech[train, ] >>> InvestTechVal <- InvestTech[-train, ] >> >> E. Use the naiveBayes function in the e1071 package to fit the model. >> >>> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain) >>> prob <- predict(model, newdata = InvestTechVal, type = “raw”) >>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0) >> >> F. Use the confusionMatrix function in the caret package to output >> the >> confusion matrix. >> >>> confMtr <- confusionMatrix(pred,unlist(InvestTechVal[, 3]),mode = >> “everything”, positive = “1”) >>> accuracy <- confMtr$overall[1] >>> valError <- 1 – accuracy >>> confMtr >> >> G. 
Classify the 18 new (out-of-sample) readers using the following >> code. >>> prob <- predict(model, newdata = outOfSample, type = “raw”) >>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0) >>> cbind(pred, prob, outOfSample[, -3]) >> >> >> > > > If your computations involve the parallel package then set.seed(seed) > may not produce repeatable results. E.g., > >> cl <- parallel::makeCluster(3) # Create cluster with 3 nodes on local > host >> set.seed(100); runif(2) > [1] 0.3077661 0.2576725 >> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1)) > [,1] [,2] [,3] > [1,] 101.7779 102.5308 103.3459 > [2,] 101.8128 102.6114 103.9102 >> >> set.seed(100); runif(2) > [1] 0.3077661 0.2576725 >> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1)) > [,1] [,2] [,3] > [1,] 101.1628 102.9643 103.2684 > [2,] 101.9205 102.6937 103.7907 > > > Bill Dunlap > TIBCO Software > wdunlap tibco.com > > ______________________________________________ > [hidden email] mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ [hidden email] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
On Sun, Mar 4, 2018 at 3:23 PM, Duncan Murdoch <[hidden email]> wrote:
> On 04/03/2018 5:54 PM, Henrik Bengtsson wrote: >> >> The following helps identify when .GlobalEnv$.Random.seed has changed: >> >> rng_tracker <- local({ >> last <- .GlobalEnv$.Random.seed >> function(...) { >> curr <- .GlobalEnv$.Random.seed >> if (!identical(curr, last)) { >> warning(".Random.seed changed") >> last <<- curr >> } >> TRUE >> } >> }) >> >> addTaskCallback(rng_tracker, name = "RNG tracker") >> >> >> EXAMPLE: >> >>> sample.int(1L) >> >> [1] 1 >> Warning: .Random.seed changed >> >> This will help you find for instance: >> >> ## Loading ggplot2 does not affect the RNG seed >>> >>> loadNamespace("ggplot2") >> >> <environment: namespace:ggplot2> >> >> ## But attaching it does >>> >>> library("ggplot2") >> >> Warning: .Random.seed changed >> >> which reveals: >> >>> ggplot2:::.onAttach >> >> function (...) >> { >> if (!interactive() || stats::runif(1) > 0.1) >> return() >> tips <- c("Need help? Try the ggplot2 mailing list: >> http://groups.google.com/group/ggplot2.", >> "Find out what's changed in ggplot2 at >> http://github.com/tidyverse/ggplot2/releases.", >> "Use suppressPackageStartupMessages() to eliminate package >> startup messages.", >> "Stackoverflow is a great place to get help: >> http://stackoverflow.com/tags/ggplot2.", >> "Need help getting started? Try the cookbook for R: >> http://www.cookbook-r.com/Graphs/", >> "Want to understand how all the pieces fit together? Buy the >> ggplot2 book: http://ggplot2.org/book/") >> tip <- sample(tips, 1) >> packageStartupMessage(paste(strwrap(tip), collapse = "\n")) >> } >> <environment: namespace:ggplot2> >> >> There are probably many case of this in different R packages. >> >> >> R WISH: >> >> There could be a >> >> preserveRandomSeed({ >> tip <- sample(tips, 1) >> }) >> >> function in R for these type of random needs where true random >> properties are non-critical. 
This type of >> "draw-a-random-number-and-reset-the-seed" is for instance used in >> parallel:::initDefaultClusterOptions() which is called when the >> 'parallel' package is loaded: >> >> seed <- .GlobalEnv$.Random.seed >> ran1 <- sample.int(.Machine$integer.max - 1L, 1L) / .Machine$integer.max >> port <- 11000 + 1000 * ((ran1 + unclass(Sys.time()) / 300) %% 1) >> if(is.null(seed)) ## there was none, initially >> rm( ".Random.seed", envir = .GlobalEnv, inherits = FALSE) >> else # reset >> assign(".Random.seed", seed, envir = .GlobalEnv, inherits = FALSE) > > > An issue is that .Random.seed doesn't contain the full state of the RNG > system, so restoring it doesn't necessarily lead to an identical sequence of > output. The only way to guarantee the sequence will repeat is to call > set.seed(n), and that only leads to a tiny fraction of possible states. > > Expanding .Random.seed so that it does contain the full state would be a > good idea, and that would make your preserveRandomSeed really easy to write. > > Here's a demo that .Random.seed is not enough: > >> set.seed(123, normal.kind = "Box-Muller") >> rnorm(1) > [1] -0.1613431 >> save <- .Random.seed >> rnorm(1) > [1] 0.6706031 >> .Random.seed <- save >> rnorm(1) > [1] -0.4194403 > > If .Random.seed were the complete state, the 2nd and 3rd rnorm results would > be the same. Ah... good point - I forgot about that "oddity", which is documented in help(".Random.seed"): ".Random.seed saves the seed set for the uniform random-number generator, at least for the system generators. It does not necessarily save the state of other generators, and in particular does not save the state of the Box–Muller normal generator. If you want to reproduce work later, call set.seed (preferably with explicit values for kind and normal.kind) rather than set .Random.seed." So, this is is only for some of the RNG kinds. 
Is the reason for this limitation that it is not possible for R (not even the R internals) to get hold of some of the RNG generators? In other words, it is unlikely to ever be fixed? Since there is no bijective function to infer `x` such that `set.seed(x)` resets the RNG state properly, the best we have is '.Random.seed' for resetting it. Would a start be to implement preserveRandomSeed() for supported RNGkind():s and give an informative warning in other cases? /Henrik > > Duncan Murdoch > > > >> >> /Henrik >> >> On Sun, Mar 4, 2018 at 1:40 PM, Gary Black <[hidden email]> >> wrote: >>> >>> Thank you, everybody, who replied! I appreciate your valuable advise! I >>> will move the location of the set.seed() command to after all packages have >>> been installed and loaded. >>> >>> Best regards, >>> Gary >>> >>> Sent from my iPad >>> >>>> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <[hidden email]> wrote: >>>> >>>> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <[hidden email]> >>>> wrote: >>>> >>>> (Sorry to be a bit slow responding.) >>>> >>>> You have not supplied a complete example, which would be good in this >>>> case because what you are suggesting could be a serious bug in R or a >>>> package. Serious journals require reproducibility these days. For example, >>>> JSS is very clear on this point. >>>> >>>> To your question >>>>> >>>>> My question simply is: should the location of the set.seed command >>>>> matter, >>>>> provided that it is applied before any commands which involve >>>>> randomness >>>>> (such as partitioning)? >>>> >>>> >>>> the answer is no, it should not matter. But the proviso is important. 
>>>> >>>> You can determine where things are messing up using something like >>>> >>>> set.seed(654321) >>>> zk <- RNGkind() # [1] "Mersenne-Twister" "Inversion" >>>> zk >>>> z <- runif(2) >>>> z >>>> set.seed(654321) >>>> >>>> # install.packages(c('caret', 'ggplot2', 'e1071')) >>>> library(caret) >>>> all(runif(2) == z) # should be true but it is not always >>>> >>>> set.seed(654321) >>>> library(ggplot2) >>>> all(runif(2) == z) # should be true >>>> >>>> set.seed(654321) >>>> library(e1071) >>>> all(runif(2) == z) # should be true >>>> >>>> all(RNGkind() == zk) # should be true >>>> >>>> On my computer package caret seems to sometimes, but not always, do >>>> something that advances or changes the RNG. So you will need to set the seed >>>> after that package is loaded if you want reproducibility. >>>> >>>> As Bill Dunlap points out, parallel can introduce much more complicated >>>> issues. If you are in fact using parallel then we really need a new thread >>>> with a better subject line, and the discussion will get much messier. >>>> >>>> The short answer is that, yes you should be able to get reproducible >>>> results with parallel computing. If you cannot then you are almost certainly >>>> doing something wrong. To publish you really must have reproducible results. >>>> >>>> In the example that Bill gave, I think the problem is that set.seed() >>>> only resets the seed in the main thread, the nodes continue to operate with >>>> unreset RNG. To demonstrate this to yourself you can do >>>> >>>> library(parallel) >>>> cl <- parallel::makeCluster(3) >>>> parallel::clusterCall(cl, function()set.seed(100)) >>>> parallel::clusterCall(cl, function()RNGkind()) >>>> parallel::clusterCall(cl, function()runif(2)) # similar result from all >>>> nodes >>>> # [1] 0.3077661 0.2576725 >>>> >>>> However, do *NOT* do that in real work. You will be getting the same RNG >>>> stream from each node. 
>>>> If you are using random numbers and parallel you need to read a lot
>>>> more, and probably consider a variant of the "L'Ecuyer" generator or
>>>> something designed for parallel computing.
>>>>
>>>> One special point I will mention because it does not seem to be widely
>>>> appreciated: the number of nodes affects the random stream, so recording
>>>> the number of compute nodes along with the RNG and seed information is
>>>> important for reproducible results. This has the unfortunate consequence
>>>> that an experiment cannot be reproduced on a larger cluster. (If anyone
>>>> knows differently I would very much like to hear.)
>>>>
>>>> Paul Gilbert
>>>>
>>>>> Hi all,
>>>>>
>>>>> For some odd reason when running naïve bayes, k-NN, etc., I get
>>>>> slightly different results (e.g., error rates, classification
>>>>> probabilities) from run to run even though I am using the same random
>>>>> seed.
>>>>>
>>>>> Nothing else (input-wise) is changing, but my results are somewhat
>>>>> different from run to run. The only randomness should be in the
>>>>> partitioning, and I have set the seed before this point.
>>>>>
>>>>> My question simply is: should the location of the set.seed command
>>>>> matter, provided that it is applied before any commands which involve
>>>>> randomness (such as partitioning)?
>>>>>
>>>>> If you need to see the code, it is below:
>>>>>
>>>>> Thank you,
>>>>> Gary
>>>>>
>>>>> A. Separate the original (in-sample) data from the new
>>>>> (out-of-sample) data. Set a random seed.
>>>>>
>>>>>> InvestTech <- as.data.frame(InvestTechRevised)
>>>>>> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
>>>>>> InvestTech <- InvestTech[1:5000, ]
>>>>>> set.seed(654321)
>>>>>
>>>>> B. Install and load the caret, ggplot2 and e1071 packages.
>>>>>
>>>>>> install.packages("caret")
>>>>>> install.packages("ggplot2")
>>>>>> install.packages("e1071")
>>>>>> library(caret)
>>>>>> library(ggplot2)
>>>>>> library(e1071)
>>>>>
>>>>> C. Bin the predictor variables with approximately equal counts using
>>>>> the cut_number function from the ggplot2 package. We will use 20 bins.
>>>>>
>>>>>> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
>>>>>> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
>>>>>> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
>>>>>> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
>>>>>
>>>>> D. Partition the original (in-sample) data into 60% training and 40%
>>>>> validation sets.
>>>>>
>>>>>> n <- nrow(InvestTech)
>>>>>> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
>>>>>> InvestTechTrain <- InvestTech[train, ]
>>>>>> InvestTechVal <- InvestTech[-train, ]
>>>>>
>>>>> E. Use the naiveBayes function in the e1071 package to fit the model.
>>>>>
>>>>>> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
>>>>>> prob <- predict(model, newdata = InvestTechVal, type = "raw")
>>>>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>>>>>
>>>>> F. Use the confusionMatrix function in the caret package to output
>>>>> the confusion matrix.
>>>>>
>>>>>> confMtr <- confusionMatrix(pred, unlist(InvestTechVal[, 3]),
>>>>>>                            mode = "everything", positive = "1")
>>>>>> accuracy <- confMtr$overall[1]
>>>>>> valError <- 1 - accuracy
>>>>>> confMtr
>>>>>
>>>>> G. Classify the 18 new (out-of-sample) readers using the following
>>>>> code.
>>>>>
>>>>>> prob <- predict(model, newdata = outOfSample, type = "raw")
>>>>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>>>>>> cbind(pred, prob, outOfSample[, -3])
>>>>
>>>> If your computations involve the parallel package then set.seed(seed)
>>>> may not produce repeatable results.
>>>> E.g.,
>>>>
>>>>> cl <- parallel::makeCluster(3)  # Create cluster with 3 nodes on local host
>>>>>
>>>>> set.seed(100); runif(2)
>>>> [1] 0.3077661 0.2576725
>>>>>
>>>>> parallel::parSapply(cl, 101:103, function(i) runif(2, i, i+1))
>>>>          [,1]     [,2]     [,3]
>>>> [1,] 101.7779 102.5308 103.3459
>>>> [2,] 101.8128 102.6114 103.9102
>>>>>
>>>>> set.seed(100); runif(2)
>>>> [1] 0.3077661 0.2576725
>>>>>
>>>>> parallel::parSapply(cl, 101:103, function(i) runif(2, i, i+1))
>>>>          [,1]     [,2]     [,3]
>>>> [1,] 101.1628 102.9643 103.2684
>>>> [2,] 101.9205 102.6937 103.7907
>>>>
>>>> Bill Dunlap
>>>> TIBCO Software
>>>> wdunlap tibco.com

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
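[The messages above point to the "L'Ecuyer" generator as the supported route to reproducible parallel results. A minimal sketch, not from the original thread, using the parallel package's built-in clusterSetRNGStream(), which seeds each node with its own "L'Ecuyer-CMRG" substream:]

```r
# Sketch: reproducible parallel draws via clusterSetRNGStream(). Each node
# gets a distinct "L'Ecuyer-CMRG" substream derived from iseed, so results
# repeat across runs with the same iseed and the same number of nodes.
library(parallel)

cl <- makeCluster(3)

clusterSetRNGStream(cl, iseed = 100)
r1 <- parSapply(cl, 101:103, function(i) runif(2, i, i + 1))

clusterSetRNGStream(cl, iseed = 100)   # reset every node's substream
r2 <- parSapply(cl, 101:103, function(i) runif(2, i, i + 1))

stopCluster(cl)

identical(r1, r2)   # TRUE: the streams were reset, so the draws repeat
```

Unlike calling set.seed(100) on every node (which, as noted above, makes all nodes draw the *same* numbers), clusterSetRNGStream() gives each node a distinct yet reproducible substream.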
On 03/04/2018 07:14 PM, Henrik Bengtsson wrote:
> On Sun, Mar 4, 2018 at 3:23 PM, Duncan Murdoch <[hidden email]> wrote:
...
>> An issue is that .Random.seed doesn't contain the full state of the RNG
>> system, so restoring it doesn't necessarily lead to an identical
>> sequence of output. The only way to guarantee the sequence will repeat
>> is to call set.seed(n), and that only leads to a tiny fraction of
>> possible states.
>>
>> Expanding .Random.seed so that it does contain the full state would be
>> a good idea, and that would make your preserveRandomSeed really easy to
>> write.
>>
>> Here's a demo that .Random.seed is not enough:
>>
>>> set.seed(123, normal.kind = "Box-Muller")
>>> rnorm(1)
>> [1] -0.1613431
>>> save <- .Random.seed
>>> rnorm(1)
>> [1] 0.6706031
>>> .Random.seed <- save
>>> rnorm(1)
>> [1] -0.4194403
>>
>> If .Random.seed were the complete state, the 2nd and 3rd rnorm results
>> would be the same.

To be pedantic, it is not the RNG state that is the problem; it is the state of the "Box-Muller" normal transformation. And, again being pedantic:

> So, this is only for some of the RNG kinds.

As I recall, it is not a problem for any RNG kinds; it is only a problem with the Box-Muller normal.kind. Things may have changed, and parallel adds the need to record the number of nodes.

> Is the reason for this limitation that it is not possible for R (not
> even the R internals) to get hold of some of the RNG generators? In
> other words, it is unlikely to ever be fixed?

It has been around for a very long time (since at least R 0.99) and, as I recall, it was not easy to fix. I think the more substantial reason is that Box-Muller is no longer the default or preferred normal generator. It may only be used for backward compatibility, in which case messing with it could be a disaster with very little potential benefit.
Paul
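[For contrast with the Box-Muller demo above: a small check, not from the original thread, that under the default normal.kind = "Inversion" saving and restoring .Random.seed does reproduce the stream, since no cached normal deviate is involved.]

```r
# Sketch: under the default Mersenne-Twister + Inversion combination,
# .Random.seed carries the full state, so a save/restore repeats the draws.
set.seed(123)           # default RNGkind; no Box-Muller cache in play
rnorm(1)                # advance the generator once
saved <- .Random.seed   # snapshot the state
x <- rnorm(1)
.Random.seed <- saved   # restore the snapshot
y <- rnorm(1)
identical(x, y)         # TRUE with Inversion, unlike the Box-Muller demo
```

One caveat: `.Random.seed` lives in the workspace, so inside a function the restore must be done there, e.g. `assign(".Random.seed", saved, envir = globalenv())`.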