bootstrap resampling question

bootstrap resampling question

Laszlo
Hello there,

I have a problem concerning bootstrapping in R, specifically the resampling part of it. I will try to sum it up in a simplified way so as not to confuse anybody.

I have a small database consisting of 20 observations (basically numbers from 1 to 20, I mean: 1, 2, 3, 4, 5, ... 18, 19, 20).

I would like to resample this database many times for the bootstrap process, subject to two conditions. First, each resampled database should also have 20 observations, drawn with replacement from the 20 numbers mentioned above. That much is presumably obvious. The second, more difficult condition is that any one number can be selected at most 5 times. To make this clear, here are two examples of valid resampled databases:

(1st database)          1,2,1,2,1,2,1,2,1,2,3,3,3,3,3,4,4,4,4,4
(4 different numbers are chosen, each selected 5 times)

(2nd database)          1,8,8,6,8,8,8,2,3,4,5,6,6,6,6,7,19,1,1,1
(Two numbers, 8 and 6, are selected 5 times, the number "1" is selected four times, and the others are selected fewer than 4 times)

My first idea was the sample function, which has arguments such as replace=TRUE and prob=..., where prob is a vector giving the probability of selecting each number. So I first tried to calculate those probabilities. I thought the problem could basically be described as a k-combination with repetitions. Unfortunately, the only thing I have been able to calculate so far is the total number of all possible selections, which amounts to 137 846 527 049.

Does anybody know how to implement my second, "tricky" condition with one of the R functions? Are the 'boot' and 'bootstrap' packages capable of handling this? I guess they are, I just haven't figured it out yet...

Thanks very much! Best regards,
Laszlo Bodnar

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: bootstrap resampling question

Giovanni Petris
A simple way of sampling with replacement from 1:20, with the additional
constraint that each number can be selected at most five times, is

> sample(rep(1:20, 5), 20)

HTH,
Giovanni


--

Giovanni Petris  <[hidden email]>
Associate Professor
Department of Mathematical Sciences
University of Arkansas - Fayetteville, AR 72701
Ph: (479) 575-6324, 575-8630 (fax)
http://definetti.uark.edu/~gpetris/


Re: bootstrap resampling question

Greg Snow-2
In reply to this post by Laszlo
Here are a couple of thoughts.

If you want to use the boot package, note that the statistic function you give it simply receives the bootstrapped indexes. You could test the indexes against your condition of no more than 5 of each and, if the test fails, return an NA instead of computing the statistic. Then remove the NAs from your output (you should increase the total number of samples tried so that you have a reasonable number left after deletion).
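
A minimal sketch of this rejection idea, assuming the statistic of interest is the mean (the original question does not say which statistic is wanted). The function name capped_mean, the seed, and R = 2000 are illustrative choices, not part of the thread:

```r
library(boot)

set.seed(42)   # arbitrary seed, only for reproducibility
x <- 1:20      # the 20 observations from the question

# Statistic for boot(): gets the data and the resampled indexes.
# Returns NA when any observation is selected more than max_times times.
# (capped_mean and the use of mean() are assumptions for illustration.)
capped_mean <- function(data, idx, max_times = 5) {
  if (max(tabulate(idx, nbins = length(data))) > max_times) return(NA_real_)
  mean(data[idx])
}

b <- boot(x, capped_mean, R = 2000)   # extra replicates to allow for rejections
valid <- b$t[!is.na(b$t)]             # keep only resamples that met the condition
length(valid)                         # number of usable bootstrap replicates
```

With 20 draws from 20 values, a count above 5 should be fairly rare, so most replicates are expected to survive, but the exact number kept varies from run to run.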

If you want to do it by hand, use the rep function to create 5 replicates of your data, then sample from that without replacement. You can get up to 5 copies of each value (from the replication by hand), but no more, since it samples without replacement.
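
The by-hand approach could look roughly like the sketch below; the variable names and the seed are illustrative. Note that a later reply in this thread questions whether this gives the same selection probabilities as ordinary sampling with replacement, so treat it as a way to enforce the cap rather than as an exact equivalent:

```r
set.seed(7)                            # arbitrary seed, only for reproducibility
x <- 1:20                              # the 20 observations from the question
pool <- rep(x, times = 5)              # five copies of every observation
resample <- sample(pool, length(x))    # replace = FALSE is the default
table(resample)                        # counts per value; none can exceed 5
```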

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111



Re: bootstrap resampling question

Jonathan P Daily
In reply to this post by Giovanni Petris
I'm not sure that is equivalent to sampling with replacement: if the
first "draw" is 1, then the probability that the next draw will also be 1 is
4/99 instead of the 1/20 it would be in sampling with replacement. I
think the way to do this would be what Greg suggested - something like:

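# Oversample with replacement, keep at most the first 5 occurrences of each
# value, then take the first 20 retained draws in their original order.
bigsamp <- sample(1:20, 100, replace = TRUE)
keep <- lapply(1:20, function(x) head(which(bigsamp == x), 5))
idx <- sort(unlist(keep))[1:20]
samp <- bigsamp[idx]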
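
The probability argument above can be checked with a small simulation. This is only a sketch: the seed and the number of simulated pairs (n_sim) are arbitrary, and it looks only at the first two draws, which is enough to compare the chance of a repeat under the two schemes:

```r
set.seed(1)       # arbitrary seed, only for reproducibility
n_sim <- 20000    # illustrative number of simulated pairs of draws

# Two draws without replacement from five copies of 1:20 (the rep() approach)
pool <- rep(1:20, 5)
rep_match <- mean(replicate(n_sim, {
  s <- sample(pool, 2)
  s[1] == s[2]
}))

# Two draws with ordinary sampling with replacement
wr_match <- mean(replicate(n_sim, {
  s <- sample(1:20, 2, replace = TRUE)
  s[1] == s[2]
}))

# Simulated frequencies next to the exact values 4/99 and 1/20
c(rep_approach = rep_match, exact_rep = 4/99,
  with_replacement = wr_match, exact_wr = 1/20)
```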

--------------------------------------
Jonathan P. Daily
Technician - USGS Leetown Science Center
11649 Leetown Road
Kearneysville WV, 25430
(304) 724-4480
"Is the room still a room when its empty? Does the room,
 the thing itself have purpose? Or do we, what's the word... imbue it."
     - Jubal Early, Firefly


Re: bootstrap resampling question

Giovanni Petris
Good point. I'll take my suggestion back...

Giovanni
