Sampling

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Sampling

Judith Flores
Hi there,

   I want to generate different samples using the
followindg code:


g<-sample(LETTERS[1:2], 24, replace=T)

   How can I specify that I need 12 "A"s and 12 "B"s?

Thank you,

Judith


      ____________________________________________________________________________________
Be a better friend, newshound, and

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

Daniel Dunn-2
sample(rep(LETTERS[1:2],12), 24, replace=F)

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On
Behalf Of Judith Flores
Sent: Tuesday, February 05, 2008 1:52 PM
To: RHelp
Subject: [R] Sampling

Hi there,

   I want to generate different samples using the followindg code:


g<-sample(LETTERS[1:2], 24, replace=T)

   How can I specify that I need 12 "A"s and 12 "B"s?

Thank you,

Judith


 
____________________________________________________________________________
________
Be a better friend, newshound, and

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

KEVIN ZEMBOWER
In reply to this post by Judith Flores
Would this work:
g<-sample(rep(LETTERS[1:2],12), 24, replace=F)

HTH
-Kevin

-----Original Message-----
From: [hidden email] [mailto:[hidden email]]
On Behalf Of Judith Flores
Sent: Tuesday, February 05, 2008 1:52 PM
To: RHelp
Subject: [R] Sampling

Hi there,

   I want to generate different samples using the
followindg code:


g<-sample(LETTERS[1:2], 24, replace=T)

   How can I specify that I need 12 "A"s and 12 "B"s?

Thank you,

Judith


 
________________________________________________________________________
____________
Be a better friend, newshound, and

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

Chuck Cleland
In reply to this post by Judith Flores
On 2/5/2008 1:52 PM, Judith Flores wrote:

> Hi there,
>
>    I want to generate different samples using the
> followindg code:
>
>
> g<-sample(LETTERS[1:2], 24, replace=T)
>
>    How can I specify that I need 12 "A"s and 12 "B"s?
>
> Thank you,
>
> Judith

x <- rep(c("A","B"), each=12)

x
  [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
[12] "A" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
[23] "B" "B"

# "sample(x) generates a random permutation of the elements of x"

g <- sample(x)

g
  [1] "A" "A" "B" "A" "B" "B" "B" "B" "A" "B" "A"
[12] "A" "B" "A" "A" "A" "B" "B" "B" "A" "A" "B"
[23] "A" "B"

>       ____________________________________________________________________________________
> Be a better friend, newshound, and
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Chuck Cleland, Ph.D.
NDRI, Inc.
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 512-0171 (M, W, F)
fax: (917) 438-0894

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

Jorge I Velez
In reply to this post by Daniel Dunn-2
Hi Judith,

Also you can try

NSIM=20  # Number of samples
res=apply(matrix(rep(rep(LETTERS[1:2],12),NSIM),ncol=NSIM),2,sample)
res

I hope this helps.


Jorge Iván Vélez



On 2/5/08, Daniel Dunn <[hidden email]> wrote:

>
> sample(rep(LETTERS[1:2],12), 24, replace=F)
>
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]]
> On
> Behalf Of Judith Flores
> Sent: Tuesday, February 05, 2008 1:52 PM
> To: RHelp
> Subject: [R] Sampling
>
> Hi there,
>
>   I want to generate different samples using the followindg code:
>
>
> g<-sample(LETTERS[1:2], 24, replace=T)
>
>   How can I specify that I need 12 "A"s and 12 "B"s?
>
> Thank you,
>
> Judith
>
>
>
>
> ____________________________________________________________________________
> ________
> Be a better friend, newshound, and
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html>
> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html>
> and provide commented, minimal, self-contained, reproducible code.
>
        [[alternative HTML version deleted]]


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

Daniel Nordlund
In reply to this post by Judith Flores
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf
> Of Judith Flores
> Sent: Tuesday, February 05, 2008 10:52 AM
> To: RHelp
> Subject: [R] Sampling
>
> Hi there,
>
>    I want to generate different samples using the
> followindg code:
>
>
> g<-sample(LETTERS[1:2], 24, replace=T)
>
>    How can I specify that I need 12 "A"s and 12 "B"s?
>
> Thank you,
>
> Judith
>
>

Judith,

Does this do what you want?

   g<-sample(rep(LETTERS[1:2],12))

Hope this is helpful,

Dan

Daniel Nordlund
Bothell, WA  USA

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

markleeds
In reply to this post by Judith Flores
>From: Judith Flores <[hidden email]>
>Date: 2008/02/05 Tue PM 12:52:06 CST
>To: RHelp <[hidden email]>
>Subject: [R] Sampling

you just can put 24 in there and
then use replace=FALSE

sample(rep(letters[1:2],c(12,12),24,replace=FALSE))


>Hi there,
>
>   I want to generate different samples using the
>followindg code:
>
>
>g<-sample(LETTERS[1:2], 24, replace=T)
>
>   How can I specify that I need 12 "A"s and 12 "B"s?
>
>Thank you,
>
>Judith
>
>
>      ____________________________________________________________________________________
>Be a better friend, newshound, and
>
>______________________________________________
>[hidden email] mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

Tim Hesterberg
In reply to this post by Judith Flores
>   I want to generate different samples using the
>followindg code:
>
>g<-sample(LETTERS[1:2], 24, replace=T)
>
>   How can I specify that I need 12 "A"s and 12 "B"s?

I introduced the concept of "sampling with minimal replacement" into the
S-PLUS version of sample to handle things like this:
        sample(LETTERS[1:2], 24, minimal = T)

This is very useful in variance reduction applications, to approximately
stratify but with introducing bias.  I'd like to see this in R.


I'll raise a related issue - sampling with unequal probabilities,
without replacement.  R does the wrong thing, in my opinion:

> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
> table(values)
values
  1   2   3
834 574 592

The selection probabilities are not proportional to the specified
probabilities.  

In contrast, in S-PLUS:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
> table(values)
    1   2   3
 1000 501 499

You can specify minimal = FALSE to get the same behavior as R:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25), minimal = F))
> table(values)
   1   2   3
 844 592 564

There is a reason this is associated with the concept of sampling with
minimal replacement.  Consider for example:
        sample(1:4, size = 3, prob = 1:4/10)
The expected frequencies of (1,2,3,4) should be proportional
to size*prob = c(.3,.6,.9,1.2).  That isn't possible when sampling
without replacement.  Sampling with minimal replacement allows this;
observation 4 is included in every sample, and is included twice in
20% of the samples.

Tim Hesterberg

Disclaimer - these are my opinions, not those of my employer.

========================================================
| Tim Hesterberg       Senior Research Scientist       |
| [hidden email]  Insightful Corp.                |
| (206)802-2319        1700 Westlake Ave. N, Suite 500 |
| (206)283-8691 (fax)  Seattle, WA 98109-3044, U.S.A.  |
|                      www.insightful.com/Hesterberg   |
========================================================
I'll teach short courses:
Advanced Programming in S-PLUS: San Antonio TX, March 26-27, 2008.
Bootstrap Methods and Permutation Tests: San Antonio, March 28, 2008.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

Peter Dalgaard
Tim Hesterberg wrote:

>> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
>> table(values)
>>    
> values
>   1   2   3
> 834 574 592
>
> The selection probabilities are not proportional to the specified
> probabilities.  
>
> In contrast, in S-PLUS:
>  
>> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
>> table(values)
>>    
>     1   2   3
>  1000 501 499
>
>  
But is that the right thing? If you can use prob=c(.6, .2, .2) and get
1200 - 400 - 400, then I'm not going to play poker with you....

The interpretation in R is that you are dealing with "fat cards", i.e.
card 1 is twice as thick as the other two, so there is 50% chance of
getting the _first_ card as a 1 and additionally, (.25+.25)*2/3 to get
the 2nd card as a 1 for a total of .8333. And since the two cards are
different the expected number of occurrences of card 1 in 1000 samples
is 833.  What is the interpretation in S-PLUS?

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])                  FAX: (+45) 35327907

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

Tim Hesterberg
>Tim Hesterberg wrote:
>>I'll raise a related issue - sampling with unequal probabilities,
>>without replacement.  R does the wrong thing, in my opinion:
>>...
>Peter Dalgaard wrote:
>But is that the right thing? ...
(See bottom for more of the previous messages.)


First, consider the common case, where size * max(prob) < 1 --
sampling with unequal probabilities without replacement.

Why do people do sampling with unequal probabilities, without
replacement?  A typical application would be sampling with probability
proportional to size, or more generally where the desire is that
selection probabilities match some criterion.

The default S-PLUS algorithm does that.  The selection probabilities
at each of step 1, 2, ..., size are all equal to prob, and the overall
probabilities of selection are size*prob.

The R algorithm (and old S+ algorithm, prior to S+6.2) did not;
at step 1 the selection probabilities are correct, but not in
subsequent steps.  With size = 2 and prob = c(.5, .25, .25),
for example, the selection probabilities at step 2 are (.33, .33, .33).

The R/old S+ algorithm is the result of programming "sampling with
probability proportional to prob without replacement" literally,
sampling greedily at each step, ignoring dependence, without thinking
about that fact that that doesn't give the desired unconditional
probabilities after step 1.  In practice one often needs to know the
actual selection probabilities, and they are difficult to compute
except in toy problems with small size.


Now turn to the less common case -- possible overselection -- sampling
with unequal probabilities with minimal replacement.  One example of
this would be in multistage sampling.  At stage one you select cities
with probability proportional to prob.  If size*max(prob) > 1, then a
city may be selected more than once.  If selected more than once, the
second stage selects that many individuals from the city.


Another example of both cases is an improvement on SIR
(sampling-importance-resampling).  In Bayesian simulation, if
importance sampling is used, each observation ends up with a value and
a weight.  Often one would prefer the output to be a sample without
weights.  In SIR this is done by sampling from the observations with
probability proportional to the weight, <<with replacement>>.  As a
result some observations may be sampled multiple times.

It would be better to do this with minimal replacement.  If
size*max(prob) <= 1 then no observation is sampled multiple times.
If size*max(prob) > 1 then some observations may be sampled
multiple times, but there will tend to be less of this (and more
unique values sampled) than if sampling with replacement.


Here's a final example of the less common case.  This one is more
colorful, but a bit long; at the end I'll relate this to another
improvement on SIR.

"Splitting" and "Russian Roulette" are complementary techniques that
date back to the early days of Monte Carlo simulation, developed for
estimating the probability of neutrons or other particles penetrating
a shield.

The simplest procedure is to generate a bunch of random neutrons, and
follow each on its path through the shield - it flies a random
distance until it hits an atom, there is a certain probability of
absorption, otherwise it is reflected in a random direction, etc.  In
the end you count the proportion of neutrons that penetrate.  The
problem is that the penetration probability is so small that a
you have to start simulating with an enormous number of neutrons, to
accurately estimate the small probability.

The next improvement is to bias the sampling, and assign each
neutron a weight.  The weight W starts at 1.  When the neutron hits
an atom, instead of being absorbed with probability p, multiply W by p.
You can also bias the direction, in favor of of the direction of
the shield boundary, and counteract that sampling bias by multiplying
W by the relative likelihood (likelihood of observed direction under
random sampling / likelihood under biased sampling).  As a neutron
proceeds its W keeps getting up-weighted and down-weighted.  The final
estimate of penetration is based on the average W.  The problem with
this is that the variance of W across neutrons can be huge; most
W's are very small, and a few are much larger than the others.

That is where Russian roulette and splitting come in.  In the middle
of a simulated path, if a neutron has a very small W, it undergoes
Russian roulette - there is a probability P that it is killed; if not
killed its weight is divided by (1-P).  Conversely, if the W is
relatively large, the particle is split into K simulated particles,
each with initial weight W/K, which are subsequently simulated
independently.  The two techniques together reduce the variance of W
across neutrons.

Sampling with unequal probabilities and minimal replacement can be
used to similar effect.  Sample from the current population of
particles with probabilities proportional to 'prob', then divide each
particle's W by (size*prob). Particles that are sampled more than once
are split.


Now relate that back to Bayesian importance sampling performed
sequentially.  Say the simulation is split into two stages.  After
the first stage there are a number of observations with associated
weights.  At this point the observations can be subject to Russian
roulette and splitting, or sampling with minimal replacement and
unequal probabilities, to reduce the variance of the weights
entering the second stage.  The 'prob' could be proportional to W, or
some other function such as sqrt(W).


>Tim Hesterberg wrote:
>>I'll raise a related issue - sampling with unequal probabilities,
>>without replacement.  R does the wrong thing, in my opinion:
>>
>>> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
>>> table(values)
>>>    
>> values
>>   1   2   3
>> 834 574 592
>>
>> The selection probabilities are not proportional to the specified
>> probabilities.  
>>
>> In contrast, in S-PLUS:
>>  
>>> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
>>> table(values)
>>>    
>>     1   2   3
>>  1000 501 499
>>
>Peter Dalgaard wrote:
>But is that the right thing? If you can use prob=c(.6, .2, .2) and get
>1200 - 400 - 400, then I'm not going to play poker with you....
>
>The interpretation in R is that you are dealing with "fat cards", i.e.
>card 1 is twice as thick as the other two, so there is 50% chance of
>getting the _first_ card as a 1 and additionally, (.25+.25)*2/3 to get
>the 2nd card as a 1 for a total of .8333. And since the two cards are
>different the expected number of occurrences of card 1 in 1000 samples
>is 833.  What is the interpretation in S-PLUS?

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

Thomas Lumley
On Wed, 6 Feb 2008, Tim Hesterberg wrote:

>> Tim Hesterberg wrote:
>>> I'll raise a related issue - sampling with unequal probabilities,
>>> without replacement.  R does the wrong thing, in my opinion:
>>> ...
>> Peter Dalgaard wrote:
>> But is that the right thing? ...
> (See bottom for more of the previous messages.)
>
>
> First, consider the common case, where size * max(prob) < 1 --
> sampling with unequal probabilities without replacement.
>
> Why do people do sampling with unequal probabilities, without
> replacement?  A typical application would be sampling with probability
> proportional to size, or more generally where the desire is that
> selection probabilities match some criterion.

In real survey PPS sampling it also matters what the pairwise joint
selection probabilities are -- and there are *many* algorithms, with
different properties. Yves Till'e has written an R package that implements
some of them, and the pps package implements others.

> The default S-PLUS algorithm does that.  The selection probabilities
> at each of step 1, 2, ..., size are all equal to prob, and the overall
> probabilities of selection are size*prob.

Umm, no, they aren't.

Splus 7.0.3 doesn't say explicitly what its algorithm is, but is happy to
take a sample of size 10 from a population of size 10 with unequal
sampling probabilities.  The overall selection probability *can't* be
anything other than 1 for each element -- sampling without replacement and
proportional to any other set of  probabilities is impossible.

Even in a milder case -- samples of size 5 from 1:10 with probabilities
proportional to 1:10 -- the deviation is noticeable in 1000 replications.
In this case sampling with the specified probabilities is actually
possible, but S-PLUS doesn't do it.

Now, it might be useful to add another replace=FALSE sampler to sample(),
such as the newish Conditional Poisson Sampler based on the work of
S.X.Chen.  This does give correct marginal probabilities of inclusion, and
the pairwise joint probabilities are not too hard to compute.

I don't think that dropping the current sequential PPS implementation is
a good idea. The help page does explain the algorithm, though it might be
useful to add an explicit note that the marginal probabilities of sampling
are not the supplied probabilities.


  -thomas

Thomas Lumley Assoc. Professor, Biostatistics
[hidden email] University of Washington, Seattle

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

Tim Hesterberg
Thomas Lumley wrote:

>On Wed, 6 Feb 2008, Tim Hesterberg wrote:
>
>>> Tim Hesterberg wrote:
>>>> I'll raise a related issue - sampling with unequal probabilities,
>>>> without replacement.  R does the wrong thing, in my opinion:
>>>> ...
>>> Peter Dalgaard wrote:
>>> But is that the right thing? ...
>> (See bottom for more of the previous messages.)
>>
>>
>> First, consider the common case, where size * max(prob) < 1 --
>> sampling with unequal probabilities without replacement.
>>
>> Why do people do sampling with unequal probabilities, without
>> replacement?  A typical application would be sampling with probability
>> proportional to size, or more generally where the desire is that
>> selection probabilities match some criterion.
>
>In real survey PPS sampling it also matters what the pairwise joint
>selection probabilities are -- and there are *many* algorithms, with
>different properties. Yves Till'e has written an R package that implements
>some of them, and the pps package implements others.

Brewer & Hanif (1983) Sampling with unequal probabilities, Springer
give 50 algorithms.

>> The default S-PLUS algorithm does that.  The selection probabilities
>> at each of step 1, 2, ..., size are all equal to prob, and the overall
>> probabilities of selection are size*prob.
>
>Umm, no, they aren't.
>
>Splus 7.0.3 doesn't say explicitly what its algorithm is, but is happy to
>take a sample of size 10 from a population of size 10 with unequal
>sampling probabilities.  The overall selection probability *can't* be
>anything other than 1 for each element -- sampling without replacement and
>proportional to any other set of  probabilities is impossible.

The default S-PLUS algorithm when probabilities are supplied is
sampling with minimal replacement, not without replacement.  "Minimal
replacement" is not interpreted strictly - duplicates may occur
if (size*max(prob)>1), so that selection probabilities are right
at every step:
> 10 * (1:10 / sum(1:10)) # expected counts
 [1] 0.1818182 0.3636364 0.5454545 0.7272727 0.9090909 1.0909091 1.2727273
 [8] 1.4545455 1.6363636 1.8181818
> sample(10, size=10, prob = 1:10) # default arguments give minimal replacement
 [1]  9  9  5 10  3 10  7  6  7  8

One correction to my statement - in the case that size * max(prob) >
1, replace replace "overall probabilities of selection are size*prob"
with "expected number of times each observation is selected is size*prob".


In contrast, minimal=F gives the same algorithm as R, without replacement:
> sample(10, size=10, prob = 1:10, minimal = F)
 [1]  8 10  9  6  7  3  4  5  1  2
Note that the later draws tend to be observations with small prob,
because that is what is left over.

Here is a quick simulation to demonstrate selection probabilities:
> values <- sapply(1:10^3, function(i) sample(5, size=3, prob = 1:5))
> apply(values, 1, table) / 10^3

In S-PLUS, with minimal replacement
   [,1]  [,2]  [,3]
1 0.068 0.060 0.065
2 0.145 0.140 0.137
3 0.181 0.214 0.197
4 0.274 0.243 0.276
5 0.332 0.343 0.325

In R:
   [,1]  [,2]  [,3]
1 0.051 0.082 0.117
2 0.140 0.159 0.211
3 0.208 0.209 0.230
4 0.269 0.261 0.230
5 0.332 0.289 0.212

Column k corresponds to the kth draw.  Note that in S+ all columns
match the specified probabilities prob = (1:5/15),
while in R only the first column does.

>Splus 7.0.3 doesn't say explicitly what its algorithm is, ...

I'll add it to the help file, and post to r-devel.

>Even in a milder case -- samples of size 5 from 1:10 with probabilities
>proportional to 1:10 -- the deviation is noticeable in 1000 replications.
>In this case sampling with the specified probabilities is actually
>possible, but S-PLUS doesn't do it.

I think you're looking at the case of sampling without replacement,
not sampling with minimal replacement.

For sampling without replacement S-PLUS uses the same algorithm as R
-- at each step, draw an observation with probabilities proportional
to prob, omitting observations already drawn.

>Now, it might be useful to add another replace=FALSE sampler to sample(),
>such as the newish Conditional Poisson Sampler based on the work of
>S.X.Chen.  This does give correct marginal probabilities of inclusion, and
>the pairwise joint probabilities are not too hard to compute.

That could be interesting.  The pairwise joint probabilities are
important in some applications.  Are all the pairwise joint
probabilities nonzero (when size*prob allows that)?

Pedro J. Saavedra  (2005)
Comparison of Two Weighting Schemes for Sampling with Minimal Replacement
http://www.amstat.org/Sections/Srms/Proceedings/y2005/Files/JSM2005-000882.pdf
finds that the other scheme has some advantages over the one I used.

A historical note - Saavedra notes that Chromy (1979) had the concept
of sampling with minimal replacement under a different name.
I used the term in the S+Resample version released 8/26/2002.

Note that any algorithm for sampling without replacement can be
extended to sampling with minimal replacement, by first drawing
observations using the integer part of (size*prob), then reducing size
and adjusting prob prior to random drawing.

>I don't think that dropping the current sequential PPS implementation is
>a good idea. The help page does explain the algorithm, though it might be
>useful to add an explicit note that the marginal probabilities of sampling
>are not the supplied probabilities.
>
> -thomas

I don't think it should be dropped; it should be available for
backward compatibility.  But it should not be the default.

Tim

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Sampling

Thomas Lumley
On Thu, 7 Feb 2008, Tim Hesterberg wrote:

> Thomas Lumley wrote:
>> Now, it might be useful to add another replace=FALSE sampler to sample(),
>> such as the newish Conditional Poisson Sampler based on the work of
>> S.X.Chen.  This does give correct marginal probabilities of inclusion, and
>> the pairwise joint probabilities are not too hard to compute.
>
> That could be interesting.  The pairwise joint probabilities are
> important in some applications.  Are all the pairwise joint
> probabilities nonzero (when size*prob allows that)?
>

Yes.

> Note that any algorithm for sampling without replacement can be
> extended to sampling with minimal replacement, by first drawing
> observations using the integer part of (size*prob), then reducing size
> and adjusting prob prior to random drawing.

Certainly -- the difficult part is sampling without replacement.

  -thomas

Thomas Lumley Assoc. Professor, Biostatistics
[hidden email] University of Washington, Seattle

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.