Thanks you very much for you answer. The mistake was in the statistic function

I didn't specificy indices

Best wishes to all R users and R core team

Sincerly.

Marc Schwartz <

[hidden email]> a écrit : On Fri, 2005-12-30 at 18:37 +0100, justin bem wrote:

> Hi,

>

> I have a difficulty with the bootstrap procedure in boot package.

> How can I specify the size of sample at each bootstrap ?

> I use

> >myboot<-boot(data,boot.fun,R=300)

> when I display myboot$t I get a vector with just one value the than

> the compute statistic in my data file after reading about bootstrap in

> the book MASS I add

> >set.seed(101)

> But I get the same result. than mean than at each boot the sample

> draw is exactly the data file. How can I do resolve this ?

>

> Sincerly !

Justin,

First, when creating a new post, please do not do so by replying to an

existing post. This screws up the threading in the list archive, making

it difficult for folks to search for your post and replies, when the

thread ends up being embedded in another unrelated one.

Second, with bootstrapping, the bootstrap samples are done using

sampling with replacement from the original sample. Thus, the bootstrap

replicates are, by default, the same size as the original sample.

You might want to look at the examples in ?sample, specifically the

resample() function created there, for some additional insight.

Using set.seed() simply allows for the ability to repeat the same

sequence of sampling randomizations. It has no effect on the replicate

sample size.

In terms of what you are seeing with respect to only getting a single

statistic value, I suspect that there is a problem in how you have

defined your statistic function.

Using the example from page 134 in MASS4:

library(MASS)

library(boot)

set.seed(101)

gal.boot <- boot(galaxies, function(x, i) median(x[i]), R = 1000)

> str(gal.boot$t)

num [1:1000, 1] 20930 21492 20196 20179 21184 ...

Here 't' is of length 1000, the same as R.

The median function is passed the galaxies data as 'x' and a set of

sampled indices 'i'. Thus, x[i] is the randomly selected replicate

passed to median(). Each replicate has 82 elements (the length of

galaxies) and this is repeated 1000 times.

Note that by default, the 'statistic' function in boot() is defined to

take two arguments, 'data' and 'i'. Since median() does not use these,

we create the boot() function call as above. We could have done it as:

my.median <- function(data, i)

{

median(data[i])

}

set.seed(101)

my.boot <- boot(galaxies, my.median, R = 1000)

# Now let's compare the resultant boot objects

> all.equal(gal.boot, my.boot)

[1] "Component 6: target, current don't match when deparsed"

[2] "Component 8: target, current don't match when deparsed"

They are identical except for the 'statistic' and 'call' elements, which

simply reflects the different statistic function expressions used.

The key is that the statistic function (in a routine bootstrap such as

these) takes two arguments. It could take more (see below).

If you want to get a feel for how each of the 1000 replicates looks, you

could use the boot.array() function:

boot.array(gal.boot, indices = TRUE)

boot.array() with 'indices = TRUE' will return a matrix of values

representing each replicate as a row (for 1000 rows in this case) and

the _indices_ of the values from galaxies (1:82) that was used in each

replicate (row).

With 'indices = FALSE' (the default here), boot.array() will give you

information pertaining to how frequently each of the 82 elements in

galaxies was used in each replicate. Keep in mind that this is sampling

WITH replacement, so some values will be used more than once in each

replicate.

For example:

> table(boot.array(gal.boot, indices = FALSE))

0 1 2 3 4 5 6 7

30027 30317 15134 4998 1248 233 37 6

This shows that the 82 values in galaxies were used anywhere from 0 to 7

times in any given replicate. The sum() of the above is 82000 of course

and each replicate does have 82 elements:

# Get the sum of each row from the boot.array() call

# above and create a table

> table(rowSums(boot.array(gal.boot, indices = FALSE)))

82

1000

If you actually wanted to generate the replicates used, you could do

something like:

matrix(galaxies[boot.array(gal.boot, indices = TRUE)],

ncol = 82, byrow = TRUE)

which would return a 1000 x 82 matrix with the actual galaxies data as

sampled in the 1000 replicates.

In terms of defining the sample size in each replicate, there is an

example in Davison and Hinkley's book (cited in ?boot) and for which the

boot package is supporting software. The example is on page 528 and

provides an approach for specifying the sample size for the replicates,

if that it what you truly want to do. This example uses the 'city' data:

> city

u x

1 138 143

2 93 104

3 61 69

4 179 260

5 48 75

6 37 63

7 29 50

8 23 48

9 30 111

10 2 50

city.subset <- function(data, i, n = 10)

{

# This step takes the sampled indices

# and returns the first 'n' of them.

# Then subsets 'data' using these, in

# this case, using the first 'n' rows of

# all of the sampled rows in the city dataset

d <- data[i[1:n], ]

# Now the statistic is calculated on a

# replicate with a sample size of 'n'

mean(d[, 2]) / mean(d[, 1])

}

# Let's do the boot with 200 replicates

# each with a sample size of 5. 'n' is passed as

# part of the "..." argument in boot().

city.boot <- boot(city, city.subset, R = 200, n = 5)

Note that the 'statistic' function here takes three arguments:

data: this will be 'city'

i: sampled indices

n: replicate sample size

As pointed out in D&H, the boot.array() function would need to be

adjusted here to work properly:

boot.array(city.boot, indices = TRUE)[, 1:5]

so that you only return the first 5 values for each replicate.

There are important considerations if your replicates are sized less

than the original sample size, so I would not do this lightly and would

recommend securing a copy of D&H to cover some of this area.

HTH,

Marc Schwartz

---------------------------------

[[alternative HTML version deleted]]

______________________________________________

[hidden email] mailing list

https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide!

http://www.R-project.org/posting-guide.html