Hi everyone!
In a package I'm developing, I have created a custom function to get
jackknife standard errors for the parameters of a gnm model (which is
essentially the same as a glm model for this issue). I'd like to add
support for the bootstrap using package boot, but I couldn't find how to
proceed.

The problem is that my data is a table object. Thus, I don't have one
individual per line: when the object is converted to a data frame, one
row represents one cell, or one combination of factor levels. I cannot
pass this to boot() as the "data" argument and use "indices" from my
custom statistic() function, since I would drop cells, not individual
observations.

A very inefficient solution would be to create a data frame with one row
per observation, by replicating each cell using its frequencies. That's
really a brute force solution, though. ;-)

The other way would be to generate importance weights based on observed
frequencies, and to multiply the original data by the weights at each
iteration, but I'm not sure that's correct. Thoughts?

Thanks for your help!

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
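For reference, the brute-force expansion mentioned in the question can be sketched in a few lines of base R. This is only an illustration under assumptions: the toy 2x2 table and the names cells, obs and boot_tab are made up here and do not come from the package being discussed.

```r
# Toy 2x2 table (made-up data), including one empty cell
tab <- as.table(matrix(c(3, 0, 2, 5), nrow = 2,
                       dimnames = list(a = c("a1", "a2"), b = c("b1", "b2"))))

# One row per cell, with the counts in column Freq
cells <- as.data.frame(tab)

# Brute force: one row per observation, replicating each cell Freq times
obs <- cells[rep(seq_len(nrow(cells)), times = cells$Freq), c("a", "b")]
rownames(obs) <- NULL

# Resample rows with replacement and rebuild a table of the same shape
# (factor levels are kept, so empty cells stay in the table)
set.seed(1)
idx <- sample(nrow(obs), replace = TRUE)
boot_tab <- table(obs[idx, ])
```

The expanded data frame has sum(tab) rows, which is what makes this approach wasteful for tables with few cells and many observations.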
One approach is to bootstrap the vector 1:n, where n is the number
of individuals, with a function that does:

f <- function(vectorOfIndices, theTable) {
  # (1) Create a new table with the same dimensions, but with the counts
  #     in the table based on vectorOfIndices.
  # (2) Calculate the statistics of interest on the new table.
}

When f is called with 1:n, the table it creates should be the same
as the original table. When called with a bootstrap sample of
values from 1:n, it should create a table corresponding to the
bootstrap sample.

Tim Hesterberg
http://www.timhesterberg.net
(resampling, water bottle rockets, computers to Costa Rica, shower = 2650 light bulbs, ...)

NEW! Mathematical Statistics with Resampling and R, Chihara & Hesterberg
http://www.amazon.com/Mathematical-Statistics-Resampling-Laura-Chihara/dp/1118029852/ref=sr_1_1?ie=UTF8
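Step (1) of the suggestion above could look roughly like the following. This is a minimal sketch and not code from the thread: the helper name table_from_indices and the toy table are hypothetical, and step (2) is left out because it depends on the statistic of interest.

```r
# Hypothetical helper: rebuild a table from a vector of observation indices.
# Observation k (for k in 1..n) is mapped to the cell it was counted in.
table_from_indices <- function(vectorOfIndices, theTable) {
  # Cell number of each of the n observations, in cell order
  cellOfObs <- rep(seq_along(theTable), times = as.vector(theTable))
  newTable <- theTable
  # Count how many of the selected observations fall in each cell;
  # assigning with [] keeps the dimensions, dimnames and class
  newTable[] <- tabulate(cellOfObs[vectorOfIndices], nbins = length(theTable))
  newTable
}

# Toy table (made-up data)
tab <- as.table(matrix(c(3, 0, 2, 5), nrow = 2))
n <- sum(tab)

same_as_original <- table_from_indices(seq_len(n), tab)

set.seed(1)
one_resample <- table_from_indices(sample(n, replace = TRUE), tab)
```

With boot, this could presumably be used as boot(seq_len(n), function(d, i) myStatistic(table_from_indices(d[i], tab)), R = 999), where myStatistic is a placeholder for the statistic computed on the table.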
On Wednesday 12 September 2012 at 07:08 -0700, Tim Hesterberg wrote:
> One approach is to bootstrap the vector 1:n, where n is the number
> of individuals, with a function that does:
> f <- function(vectorOfIndices, theTable) {
>   (1) create a new table with the same dimensions, but with the counts
>   in the table based on vectorOfIndices.
>   (2) Calculate the statistics of interest on the new table.
> }
>
> When f is called with 1:n, the table it creates should be the same
> as the original table. When called with a bootstrap sample of
> values from 1:n, it should create a table corresponding to the
> bootstrap sample.

Indeed, that's another solution I considered, but I wanted to be sure
nothing more reasonable exists. You're right that it's more efficient
than replicating the whole data set. But still, with a typical table of
less than 100 cells and several thousands of observations, this means
creating a potentially long vector, much larger than the original data;
nothing really hard with common machines, to be sure.

If no other way exists, I'll use this. Thanks.
>On Wednesday 12 September 2012 at 07:08 -0700, Tim Hesterberg wrote:
>> One approach is to bootstrap the vector 1:n, where n is the number
>> of individuals, with a function that does:
>> f <- function(vectorOfIndices, theTable) {
>>   (1) create a new table with the same dimensions, but with the counts
>>   in the table based on vectorOfIndices.
>>   (2) Calculate the statistics of interest on the new table.
>> }
>>
>> When f is called with 1:n, the table it creates should be the same
>> as the original table. When called with a bootstrap sample of
>> values from 1:n, it should create a table corresponding to the
>> bootstrap sample.
>Indeed, that's another solution I considered, but I wanted to be sure
>nothing more reasonable exists. You're right that it's more efficient
>than replicating the whole data set. But still, with a typical table of
>less than 100 cells and several thousands of observations, this means
>creating a potentially long vector, much larger than the original data;
>nothing really hard with common machines, to be sure.
>
>If no other way exists, I'll use this. Thanks.

In your original posting you also suggested:

>>>The other way would be generate importance weights based on observed
>>>frequencies, and to multiply the original data by the weights at each
>>>iteration, but I'm not sure that's correct. Thoughts?

You could do:

bootstrapTable <- x  # where x is the original table
# one slot per bootstrap sample (assumes myFunction returns a single number)
replicates <- numeric(numberOfBootstrapSamples)
for(i in seq_len(numberOfBootstrapSamples)) {
  # draw sum(x) observations with cell probabilities proportional to x
  bootstrapTable[] <- rmultinom(1, size = sum(x), prob = x)
  replicates[i] <- myFunction(bootstrapTable)
}
# caveat - not tested

I can't tell from help(boot) whether you could do it correctly there.
boot has a 'weights' argument that you could use for the sampling
probabilities, but you also need a way to tell it to draw sum(x)
observations. Or, you could also pass boot a "parametric" sampler.
But be careful if you use boot in either of these ways; you not only
need to generate the bootstrap samples, you also need to make sure that
it does all other calculations correctly, including calculating the
statistic for the original data, calculating jackknife statistics if
they are used for confidence intervals, etc.

Wistful sigh - this would be pretty easy to do with S+Resample.

Tim Hesterberg
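The "parametric" sampler idea might be sketched as follows. Treat this as an untested-in-anger illustration under assumptions: log_odds_ratio is a made-up stand-in for the real statistic, multinom_gen and the toy table x are hypothetical names, and the boot() call is guarded so the rest runs even where the boot package is not installed. The sampler has the two-argument form (data, mle) that boot() documents for ran.gen when sim = "parametric".

```r
# Stand-in statistic (hypothetical): log odds ratio of a 2x2 table
log_odds_ratio <- function(tab) {
  log(tab[1, 1] * tab[2, 2] / (tab[1, 2] * tab[2, 1]))
}

# Multinomial sampler: draws sum(data) observations with cell
# probabilities given by mle, and returns a table shaped like data
multinom_gen <- function(data, mle) {
  data[] <- rmultinom(1, size = sum(data), prob = mle)
  data
}

# Toy table (made-up data) and its observed cell proportions
x <- as.table(matrix(c(30, 20, 15, 35), nrow = 2))
cell_probs <- as.vector(x) / sum(x)

set.seed(42)
one_draw <- multinom_gen(x, cell_probs)

if (requireNamespace("boot", quietly = TRUE)) {
  # In parametric mode the statistic receives the (simulated) data only
  b <- boot::boot(x, statistic = log_odds_ratio, R = 200,
                  sim = "parametric", ran.gen = multinom_gen,
                  mle = cell_probs)
  boot_se <- sd(b$t[, 1])
}
```

As far as I can tell, in this mode boot computes the statistic for the original data by calling the statistic on x directly, but the caution above still applies: anything relying on jackknife values or observation-level indices would need to be checked separately.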
On Wednesday 12 September 2012 at 07:08 -0700, Tim Hesterberg wrote:
> One approach is to bootstrap the vector 1:n, where n is the number
> of individuals, with a function that does:
> f <- function(vectorOfIndices, theTable) {
>   (1) create a new table with the same dimensions, but with the counts
>   in the table based on vectorOfIndices.
>   (2) Calculate the statistics of interest on the new table.
> }
>
> When f is called with 1:n, the table it creates should be the same
> as the original table. When called with a bootstrap sample of
> values from 1:n, it should create a table corresponding to the
> bootstrap sample.

[...] described above being implemented as below. The idea is to assign
an index to each observation, and identify which cell the observation
comes from using the cumulative sum. Instead of going over all indices
and incrementing the corresponding cell count for each, I decided to
start with the original data, decrementing the counts for missing
indices, and incrementing them for duplicates. There are probably better
implementations, but performance-wise it seems good enough.

# tab is a table object
f <- function(tab, indices) {
  # cs[k] is the index of the last observation belonging to cell k
  cs <- cumsum(tab)

  # Remove missing observations
  for(i in setdiff(1:sum(tab), indices)) {
    index <- min(which(i <= cs))
    tab[index] <- tab[index] - 1
  }

  # Add duplicate observations
  for(i in indices[duplicated(indices)]) {
    index <- min(which(i <= cs))
    tab[index] <- tab[index] + 1
  }

  # Return the resampled table (without this, f would return NULL)
  tab
}

Thanks for the pointers!
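One possible alternative to the two loops is to look up the cell of every index at once with findInterval() on the cumulative sum and then count with tabulate(). The sketch below is not from the thread: f_loop restates the loop-based function with an explicit return value, f_vec is a hypothetical vectorized variant, and the toy table is made up. The two are compared on one random resample.

```r
# Loop-based version, as in the message above, returning the table
f_loop <- function(tab, indices) {
  cs <- cumsum(tab)
  for (i in setdiff(seq_len(sum(tab)), indices)) {
    index <- min(which(i <= cs))
    tab[index] <- tab[index] - 1
  }
  for (i in indices[duplicated(indices)]) {
    index <- min(which(i <= cs))
    tab[index] <- tab[index] + 1
  }
  tab
}

# Hypothetical vectorized variant: observation i belongs to the first
# cell whose cumulative count is >= i, i.e. 1 + (number of cs values < i)
f_vec <- function(tab, indices) {
  cell <- findInterval(indices, cumsum(tab), left.open = TRUE) + 1L
  tab[] <- tabulate(cell, nbins = length(tab))
  tab
}

# Toy table (made-up data), including one empty cell
tab <- as.table(matrix(c(3, 0, 2, 5), nrow = 2))

set.seed(1)
idx <- sample(sum(tab), replace = TRUE)
agree <- all(f_loop(tab, idx) == f_vec(tab, idx))
```

Neither version needs the long vector of per-observation cell labels; only the bootstrap indices themselves have length n.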
