Hi,
I'd like to simulate 9 variables (3 binary, 3 categorical, and 3 continuous) with a known covariance matrix. Using mvtnorm and then dichotomizing/categorizing the variables is not efficient. Do you know of any package, or a way, to simulate mixed data?
Partly this depends on what you mean by a covariance between categorical (and binary) variables, and what is a covariance between a categorical and a continuous variable?

On Thu, Mar 29, 2012 at 12:31 PM, Burak Aydin <[hidden email]> wrote:
> Hi,
> I'd like to simulate 9 variables; 3 binary, 3 categorical and 3 continuous
> with a known covariance matrix.
> Using mvtnorm and later dichotomize/categorize variables is not efficient.
> Do you know any package or how to simulate mixed data?
>
> --
> View this message in context: http://r.789695.n4.nabble.com/simulate-correlated-binary-categorical-and-continuous-variable-tp4516433p4516433.html
> Sent from the R help mailing list archive at Nabble.com.

--
Gregory (Greg) L. Snow Ph.D.
[hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hello Greg,
Thanks for your time. Let's say I know the Pearson covariance matrix. When I use rmvnorm to simulate 9 variables and then dichotomize/categorize them, I can't retrieve the population covariance matrix.
In reply to this post by Burak Aydin
Burak Aydin <[hidden email]> asked:
> Let's say I know Pearson covariance matrix.
> When I use rmvnorm to simulate 9 variables and then
> dichotomize/categorize
> them, I can't retrieve the population covariance matrix.

library(mvtnorm)   # provides rmvnorm()
library(polycor)   # provides polychor()

# Simulate a bivariate normal with correlation r, dichotomize both
# variables at thresh, and estimate the latent correlation.
sim1 <- function(thresh=0.5, r=0.3) {
  x <- rmvnorm(1000, c(0,0), matrix(c(1,r,r,1), nrow=2))
  x[x>thresh] <- 2
  x[x<2] <- 1
  polychor(x[,1], x[,2])
}

tr <- double(100)
for(i in 1:100) tr[i] <- sim1()
summary(tr)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1571  0.2703  0.3041  0.3010  0.3328  0.4062

--
David Duffy (MBBS PhD)
email: [hidden email]  ph: INT+61+7+3362-0217  fax: -0101
Epidemiology Unit, Queensland Institute of Medical Research
300 Herston Rd, Brisbane, Queensland 4029, Australia
GPG 4D0B994A
In reply to this post by Burak Aydin
Your explanation below has me more confused than before. Now it is possible that it is just me, but it seems that if others understood it then someone else would have given a better answer by now.

Are you restricting your categorical and binary variables to be binned versions of underlying normals? If that is the case, I doubt that there would be a more efficient way than binning a normal variable.

If not, then can you show us more of what you want to produce, along with what you mean by correlation or covariance with categorical variables (which is meaningless without additional restrictions/assumptions)?

On Fri, Mar 30, 2012 at 3:41 PM, Burak Aydin <[hidden email]> wrote:
> Hello Greg,
> Thanks for your time,
> Let's say I know Pearson covariance matrix.
> When I use rmvnorm to simulate 9 variables and then dichotomize/categorize
> them, I can't retrieve the population covariance matrix.

--
Gregory (Greg) L. Snow Ph.D.
[hidden email]
Hello Greg,
Sorry for the confusion. Let's say I have a population with 6 variables that are correlated with each other. I can give you Pearson, tetrachoric, or polychoric correlation coefficients. 2 of the variables are continuous, 2 binary, and 2 categorical. Let's assume the following conditions:

Co1 and Co2 are normally distributed continuous random variables: Co1 ~ N(0,1), Co2 ~ N(100,15).
Ca1 and Ca2 are categorical variables: Ca1 probabilities = c(.02,.18,.28,.22,.30), Ca2 probabilities = c(.06,.18,.76).
Bi1 and Bi2 are binary, with marginal probabilities Bi1 p = 0.4, Bi2 p = 0.5.

And, again, I have the correlations.

When I try to simulate this population I fail. If I keep the means and probabilities the same, I lose the correct correlations. When I keep the correlations, I lose precision on the means and frequencies/probabilities.

See these links please:
http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/copulademo.html
http://stats.stackexchange.com/questions/22856/how-to-generate-correlated-test-data-that-has-bernoulli-categorical-and-contin
http://www.springerlink.com/content/011x633m554u843g/
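As a rough illustration of the latent-normal route discussed in this thread, the marginals listed above can be reproduced by cutting correlated standard normals at the quantiles implied by the category probabilities. This is only a sketch under assumptions: the latent correlation matrix R is a placeholder (the thread never gives the actual correlations), N(100,15) is read as mean 100 and standard deviation 15, and cut_latent is a hypothetical helper name. The correlations in R apply to the latent normals, so they correspond to tetrachoric/polychoric correlations rather than Pearson correlations of the coded variables.

```r
# Sketch: latent-normal simulation of mixed data, base R only.
set.seed(1)
n <- 10000

# Placeholder latent correlation matrix for (Co1, Co2, Ca1, Ca2, Bi1, Bi2);
# these values are illustrative, not the poster's correlations.
R <- matrix(0.3, 6, 6)
diag(R) <- 1

# chol(R) is upper triangular with t(U) %*% U = R, so the columns of Z
# are standard normals with correlation matrix R.
Z <- matrix(rnorm(n * 6), n, 6) %*% chol(R)

# Hypothetical helper: cut a standard normal at the quantiles implied by
# the category probabilities, returning integer codes 1..k.
cut_latent <- function(z, probs) {
  breaks <- c(-Inf, qnorm(cumsum(probs)[-length(probs)]), Inf)
  cut(z, breaks = breaks, labels = FALSE)
}

sim <- data.frame(
  Co1 = Z[, 1],                                # N(0, 1)
  Co2 = 100 + 15 * Z[, 2],                     # assumes 15 is the sd
  Ca1 = cut_latent(Z[, 3], c(.02, .18, .28, .22, .30)),
  Ca2 = cut_latent(Z[, 4], c(.06, .18, .76)),
  Bi1 = as.integer(Z[, 5] > qnorm(1 - 0.4)),   # P(Bi1 = 1) = 0.4
  Bi2 = as.integer(Z[, 6] > qnorm(1 - 0.5))    # P(Bi2 = 1) = 0.5
)

round(prop.table(table(sim$Ca1)), 3)   # sample category proportions
colMeans(sim[, c("Bi1", "Bi2")])       # sample proportions of ones
```

The sample proportions and means vary around the requested values from draw to draw, as they would for any random sample from that population.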
In reply to this post by David Duffy-2
Hello David Duffy-2,
I see that you just showed that using rmvnorm and then dichotomizing/categorizing should work. Thanks, but please take a look at this link:
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/CatContinuous
and this article: "Analysis by Categorizing or Dichotomizing Continuous Variables Is Inadvisable: An Example from the Natural History of Unruptured Aneurysms" by O. Naggara, J. Raymond, F. Guilbert, D. Roy, A. Weill, and D.G. Altman, 2011.

Plus, here is my explanatory code.

require(mvtnorm)
sigm=matrix(c(0.12, 0.05, 0.02, 0.00,
              0.05, 1.24, 0.38, 0.00,
              0.02, 0.38, 2.38, 0.03,
              0.00, 0.00, 0.03, 0.16), ncol=4, byrow=TRUE)
mu=rep(0,4)

#simulated data
dat1 = rmvnorm(1000,mean=mu,sigma=sigm)

#difference between sigmas before dichotomizing/categorizing
sigm-cov(dat1)

#difference between means before dichotomizing/categorizing
means1=apply(dat1,2,mean)
mu-means1

#dichotomization and categorization
#let's dichotomize the third variable
#I want to keep the mean the same (0.50)
dat2=dat1
dat2[,3]=ifelse(dat1[,3]>0.0,0,1)
means2=apply(dat2,2,mean)
mu-means2

# I kept the mean the same, but look at the difference in the cov matrices
sigm-cov(dat2)
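The change in the covariance matrix after dichotomizing is expected rather than a simulation error, and it can be worked out in closed form. For jointly normal (X, Y) where X has mean 0 and standard deviation s, the covariance between the indicator of X <= 0 and Y is -cov(X, Y) * dnorm(0) / s, and the indicator itself has variance 0.25. The sketch below checks this against a simulation using the sigm values from the post; it uses only base R (a Cholesky factor instead of rmvnorm), and the names expected, sd3, and dat are illustrative.

```r
set.seed(42)
sigm <- matrix(c(0.12, 0.05, 0.02, 0.00,
                 0.05, 1.24, 0.38, 0.00,
                 0.02, 0.38, 2.38, 0.03,
                 0.00, 0.00, 0.03, 0.16), ncol = 4, byrow = TRUE)

# Theoretical covariance matrix after recoding column 3 as 1 when
# x3 <= 0 and 0 otherwise (the same coding as in the post).
sd3 <- sqrt(sigm[3, 3])
expected <- sigm
expected[3, ] <- expected[, 3] <- -sigm[, 3] * dnorm(0) / sd3
expected[3, 3] <- 0.25            # variance of a Bernoulli(0.5) variable

# Monte Carlo check: rows of Z %*% chol(sigm) have covariance sigm.
n <- 200000
dat <- matrix(rnorm(n * 4), n, 4) %*% chol(sigm)
dat[, 3] <- ifelse(dat[, 3] > 0, 0, 1)

round(expected, 3)   # covariance implied by the formula
round(cov(dat), 3)   # covariance observed in the simulated data
```

Only the row and column of the dichotomized variable change; the covariances among the remaining continuous variables are unaffected.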
In reply to this post by Burak Aydin
On Sun, Apr 01, 2012 at 06:00:43PM -0700, Burak Aydin wrote:
> Hello Greg,
> Sorry for the confusion.
> Let's say, I have a population. I have 6 variables. They are correlated to
> each other. I can get you Pearson correlation, tetrachoric or polychoric
> correlation coefficients.
> 2 of them continuous, 2 binary, 2 categorical.
> Let's assume following conditions;
> Co1 and Co2 are normally distributed continuous random variables. Co1-- N
> (0,1), Co2--N(100,15)
> Ca1 and Ca2 are categorical variables. Ca1 probabilities
> =c(.02,.18,.28,.22,.30), Ca2 probs =c(.06,.18,.76)
> Bi1 and Bi2 are binaries, Marginal probabilities Bi1 p= 0.4, Bi2 p=0.5.
> And, again, I have the correlations.
>
> When I try to simulate this population I fail. If I keep the means and
> probabilities same I lose the correct correlations. When I keep
> correlations, I lose precision on means and frequencies/probabilities.

Hi.

One idea, which occurred to me, is the following. Formulate a model of the joint distribution with some parameters and a criterion function, which measures how much the data generated from the model differ from the required marginal distributions and the required correlations. Then run an optimization of the parameters to minimize the difference.

If you have enough data, then the model can be a table of estimated probabilities for all 5*3*2*2 = 60 combinations of the discrete variables and, for each of these combinations, the parameters of the conditional distribution of the 2 continuous variables, which can be a bivariate normal distribution. However, you probably do not have enough data for this.

Another approach starts from the distribution of the continuous variables, and the model for the discrete variables can be a logistic model using the continuous variables as input.

Another type of model, which may be suitable, is a Bayesian network. For this, you need to choose only a subset of the most important dependencies, so that the selected dependencies can be represented by a directed acyclic graph.

Petr Savicky.
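One small instance of the "criterion function plus optimization" idea above, for a single pair of binary variables: search for the latent normal correlation whose thresholded indicators have a required Pearson (phi) correlation. This is a sketch under assumptions, not the method proposed in the thread: the function names p11, phi_from_rho, and calibrate_rho are hypothetical, the target of 0.3 is made up, and only base R is used. Calibrating each pair separately does not guarantee that the resulting latent correlation matrix is positive definite.

```r
# P(Z1 > a, Z2 > b) for a standard bivariate normal with correlation rho,
# computed by one-dimensional numerical integration over Z1.
p11 <- function(rho, a, b) {
  integrate(function(z) dnorm(z) * pnorm((rho * z - b) / sqrt(1 - rho^2)),
            lower = a, upper = Inf)$value
}

# Pearson (phi) correlation of the indicators 1{Z1 > a}, 1{Z2 > b}, with
# thresholds chosen so that the indicators have means p1 and p2.
phi_from_rho <- function(rho, p1, p2) {
  a <- qnorm(1 - p1)
  b <- qnorm(1 - p2)
  (p11(rho, a, b) - p1 * p2) / sqrt(p1 * (1 - p1) * p2 * (1 - p2))
}

# Criterion: implied minus required correlation; solve for the latent rho.
calibrate_rho <- function(target, p1, p2) {
  uniroot(function(rho) phi_from_rho(rho, p1, p2) - target,
          interval = c(-0.99, 0.99), tol = 1e-8)$root
}

# Marginals of Bi1 and Bi2 from the thread; the target 0.3 is illustrative.
rho <- calibrate_rho(target = 0.3, p1 = 0.4, p2 = 0.5)
rho                            # latent correlation to give the normal generator
phi_from_rho(rho, 0.4, 0.5)    # implied Pearson correlation of the binaries
```

The latent correlation found this way is larger in magnitude than the target, which is why plugging the Pearson correlations straight into rmvnorm and then dichotomizing does not return them.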
In reply to this post by Burak Aydin
How are you calculating the correlations? That may be part of the problem: when you categorize a continuous variable you get a factor whose internal representation is a set of integers. If you try to get a correlation with that variable, it will not be the polychoric correlation.

Also, do you need your data to have the exact proportions and means that you show below? Or should they represent random samples from those populations, so that the actual proportions and means will vary a bit from what is specified?

If you are interested in tetrachoric and polychoric correlations, then generating the latent normals and categorizing seems the most straightforward method.

Also, which function (from which package) are you using to generate your normal variables? That may have some effect.

On Sun, Apr 1, 2012 at 7:00 PM, Burak Aydin <[hidden email]> wrote:
> Hello Greg,
> Sorry for the confusion.
> Let's say, I have a population. I have 6 variables. They are correlated to
> each other. I can get you Pearson correlation, tetrachoric or polychoric
> correlation coefficients.
> 2 of them continuous, 2 binary, 2 categorical.
> Let's assume following conditions;
> Co1 and Co2 are normally distributed continuous random variables. Co1-- N
> (0,1), Co2--N(100,15)
> Ca1 and Ca2 are categorical variables. Ca1 probabilities
> =c(.02,.18,.28,.22,.30), Ca2 probs =c(.06,.18,.76)
> Bi1 and Bi2 are binaries, Marginal probabilities Bi1 p= 0.4, Bi2 p=0.5.
> And, again, I have the correlations.
>
> When I try to simulate this population I fail. If I keep the means and
> probabilities same I lose the correct correlations. When I keep
> correlations, I lose precision on means and frequencies/probabilities.
> See these links please
> http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/copulademo.html
> http://stats.stackexchange.com/questions/22856/how-to-generate-correlated-test-data-that-has-bernoulli-categorical-and-contin
> http://www.springerlink.com/content/011x633m554u843g/

--
Gregory (Greg) L. Snow Ph.D.
[hidden email]
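The point about integer codes can be seen in a short simulation: the Pearson correlation computed on the category codes is smaller than the correlation of the latent normals that produced them. This is a sketch with made-up settings (latent correlation 0.5, both variables cut with the Ca2 probabilities from the thread) and base R only; recovering the latent value from the coded data would need a polychoric estimator such as polychor from the polycor package used earlier in the thread.

```r
set.seed(7)
n <- 50000
rho <- 0.5                                    # latent correlation (illustrative)
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)   # corr(z1, z2) equals rho in the population

# Categorize both latent variables using the Ca2 probabilities (.06, .18, .76).
breaks <- c(-Inf, qnorm(cumsum(c(.06, .18))), Inf)
c1 <- cut(z1, breaks, labels = FALSE)         # integer codes 1, 2, 3
c2 <- cut(z2, breaks, labels = FALSE)

cor(z1, z2)   # sample estimate of the latent correlation
cor(c1, c2)   # Pearson correlation of the integer codes, attenuated
```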