R question: generating data using MASS

uf_mike
Hi, all! I'm new to R but need to use it to solve a little problem I'm having with a paper I'm writing. The question has a few components and I'd appreciate guidance on any of them.

1. The most essential thing is that I need to generate some multivariate normal data on a restricted integer range (1 to 7). I know I can use the mvrnorm command from MASS to do this, but I have a couple of questions about that:
- I can make the simulated data, but I don't know how to issue a command that restricts the generated data to a specific range (1 to 7) and to integers only.
- Is there a way to specify a single desired correlation between all the variables (i.e., I want, say, five variables to all be correlated about .30 with each other), rather than input the entire covariance matrix as Sigma?

2. I need to introduce missing data (NA) AFTER generating the data set, and I need it to be random and at a specific prevalence (say, 5%). Is there a simple way to take the initial data set and randomly replace 5% of values with NA missing values?

Thanks, I appreciate any guidance folks can offer. :-)

Re: R question: generating data using MASS

bbolker
uf_mike <michael.parent <at> ufl.edu> writes:

>
> Hi, all! I'm new to R but need to use it to solve a little problem I'm having
> with a paper I'm writing. The question has a few components and I'd
> appreciate guidance on any of them.
>
> 1. The most essential thing is that I need to generate some multivariate
> normal data on a restricted integer range (1 to 7). I know I can use MASS
> mvrnorm command to do this but have a couple questions about that:
> -I can make the simulated data but I don't know how to issue a command that
> restricts the generated data to be between a specific range (1 to 7), and
> integer-only.

   This problem isn't uniquely defined.  Are you willing to generate
more samples than you need and then throw away extreme values?  Or do
you want to 'censor' extreme values (i.e. set values <= 1 to 1 and
values >=7 to 7)?

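  x <- MASS::mvrnorm(10000,...)
  ## keep only the rows whose values all lie in [1,7]; x[x>=1 & x<=7]
  ## would flatten the matrix into a vector and lose the row structure
  x2 <- x[rowSums(x<1 | x>7)==0, , drop=FALSE]
  x3 <- x2[1:1000, , drop=FALSE]  ## or however many you need
  x4 <- round(x3)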
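
The 'censor' option mentioned above could look roughly like this. This is only a sketch: the mean of 4, the variance of 2, the three variables and the name `x_cens` are assumptions chosen for illustration, not values from the original question.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)  # arbitrary seed, only so the run is reproducible

## Assumed parameters: 3 independent variables, mean 4, variance 2
x <- mvrnorm(1000, mu = rep(4, 3), Sigma = diag(2, 3))

## Round to integers, then clamp: anything below 1 becomes 1,
## anything above 7 becomes 7. Assigning into x_cens[] makes sure
## the result keeps the matrix shape of x.
x_cens <- round(x)
x_cens[] <- pmin(pmax(x_cens, 1), 7)

range(x_cens)
dim(x_cens)
```

Unlike the throw-away approach, censoring keeps every row, but it piles extra mass onto the values 1 and 7.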


> -Is there a way to specify a single desired correlation between all the
> variables (i.e., I want, say, five variables to all be correlated about .30
> with each other), rather than input the entire covariance matrix as sigma?

   What's wrong with

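m <- matrix(0.3,nrow=5,ncol=5)  ## every pairwise correlation = 0.3
diag(m) <- 1                    ## 1s on the diagonal: a correlation matrix
m <- m*variance                 ## 'variance' is a placeholder for the common variance you want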

  ?
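
To check that a matrix built this way gives the intended correlations, something like the following could be run. It is a sketch only: the sample size, the mean of 4 and the common variance of 1.5 are assumed values, and `rho`, `p` and `Sigma` are illustrative names.

```r
library(MASS)  # provides mvrnorm()

set.seed(42)  # arbitrary seed, only so the run is reproducible

p <- 5           # number of variables
rho <- 0.3       # common pairwise correlation
variance <- 1.5  # assumed common variance

m <- matrix(rho, nrow = p, ncol = p)
diag(m) <- 1
Sigma <- m * variance  # covariance matrix passed to mvrnorm()

x <- mvrnorm(5000, mu = rep(4, p), Sigma = Sigma)

## Off-diagonal sample correlations should come out close to 0.3
r <- cor(x)
round(r, 2)

## cov2cor() turns Sigma back into the correlation matrix m
all.equal(cov2cor(Sigma), m)
```

Multiplying by a single variance only rescales the variables, so the correlations stay at 0.3 whatever value is used.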
>
> 2. I need to introduce missing data (NA) AFTER generating the data set, and
> I need it to be random and at a specific prevalence (say, 5%). Is there a
> simple way to take the initial data set and randomly replace 5% of values
> with NA missing values?

  x4[sample(seq_along(x4),size=round(0.05*length(x4)),replace=FALSE)] <- NA
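
The same idea works when the data are a matrix, because a matrix can be indexed by cell position just like a vector. A small sketch, using made-up stand-in data (200 rows by 5 columns) and the illustrative names `n_missing`, `idx` and `x_na`:

```r
set.seed(7)  # arbitrary seed, only so the run is reproducible

## Stand-in data: 200 rows x 5 columns of integers in 1..7
x4 <- matrix(sample(1:7, 200 * 5, replace = TRUE), nrow = 200, ncol = 5)

n_missing <- round(0.05 * length(x4))        # 5% of all cells
idx <- sample(length(x4), size = n_missing)  # distinct cell positions
x_na <- x4
x_na[idx] <- NA

mean(is.na(x_na))     # overall proportion missing
colSums(is.na(x_na))  # missing values per column
```

This fixes the overall share of missing cells at 5%; the count in any one column will vary from run to run.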
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: R question: generating data using MASS

uf_mike
Thanks!

"This problem isn't uniquely defined.  Are you willing to generate more samples than you need and then throw away extreme values?  Or do you want to 'censor' extreme values (i.e. set values <= 1 to 1 and values >=7 to 7)?"

I'd like to retain a normal distribution, so I wouldn't want to delete the other values or truncate them. Can I use the cut command on the data that gets generated and retain a normal(ish, at least) distribution?

Oh, thanks for the help on the matrix, that is easier, and also the random missingness, I will try those!

Thanks,
Mike


On Aug 29, 2011, at 2:29 AM, Ben Bolker wrote:

> [snip]

Re: R question: generating data using MASS

bbolker
Michael Parent <michael.parent <at> ufl.edu> writes:

>
> Thanks!
>
> "This problem isn't uniquely defined.  Are you
> willing to generate more samples than you need and then throw
> away extreme values?  Or do you want to 'censor'
> extreme values (i.e. set values <= 1 to 1 and values >=7 to 7)?"
>
> I'd like to retain a normal distribution so I wouldn't want to
> delete the other values or truncate them. Can
> I use the cut command on the data that gets generated and
> retain a normal(ish, at least) distribution?

  I don't quite understand how 'cut' (which transforms a continuous variable
into a categorical one) is going to help ... by definition,
a normal distribution is continuous (so discretizing the distribution
will make it non-normal) and has the real numbers as its domain
(so in theory you can't have a restricted domain and still have it
be normal).  If your standard deviation is small enough (say
mean=3.5 and sd=0.1) then you will never have to worry about
values beyond (1,7) in the lifetime of the universe, but if
your sd is larger (and you can't allow it to be smaller) then
you have to do *something* with the values that get generated
outside your chosen bounds ...
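
How much the bounds matter for a given standard deviation can be estimated with pnorm(). A rough sketch for a single variable; `p_outside` is a hypothetical helper name, and sd=1.5 is an assumed "larger" value chosen only for comparison:

```r
## Probability that one normal draw falls outside [lower, upper]
p_outside <- function(mean, sd, lower = 1, upper = 7) {
  pnorm(lower, mean, sd) + pnorm(upper, mean, sd, lower.tail = FALSE)
}

p_small <- p_outside(mean = 3.5, sd = 0.1)  # the sd from the example above
p_large <- p_outside(mean = 3.5, sd = 1.5)  # an assumed larger sd

p_small  # vanishingly small
p_large  # a few percent of draws
```

With several variables the chance that at least one value in a row lands outside the bounds is higher still, so some rule for those values is needed.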

 [snip to make Gmane happy]
