

Hi,
I'd like to simulate 9 variables (3 binary, 3 categorical and 3 continuous) with a known covariance matrix.
Using mvtnorm and then dichotomizing/categorizing the variables is not efficient.
Do you know of any package for simulating mixed data, or how to do it?


Hello Greg,
Thanks for your time.
Let's say I know the Pearson covariance matrix.
When I use rmvnorm to simulate 9 variables and then dichotomize/categorize them, I can't retrieve the population covariance matrix.


Burak Aydin <[hidden email]> asked:
> Let's say I know the Pearson covariance matrix.
> When I use rmvnorm to simulate 9 variables and then
> dichotomize/categorize them, I can't retrieve the population
> covariance matrix.
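library(mvtnorm)   # rmvnorm()
library(polycor)   # polychor()
# Simulate a bivariate normal with correlation r, dichotomize both
# variables at thresh, and estimate the tetrachoric correlation
sim1 <- function(thresh=0.5, r=0.3) {
  x <- rmvnorm(1000, c(0,0), matrix(c(1,r,r,1), nrow=2))
  x[x>thresh] <- 2
  x[x<2] <- 1
  polychor(x[,1], x[,2])
}
tr <- double(100)
for(i in 1:100) tr[i] <- sim1()
summary(tr)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1571  0.2703  0.3041  0.3010  0.3328  0.4062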

David Duffy (MBBS PhD)
email: [hidden email] ph: INT+61+7+33620217 fax: 0101
Epidemiology Unit, Queensland Institute of Medical Research
300 Herston Rd, Brisbane, Queensland 4029, Australia GPG 4D0B994A
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Your explanation has me more confused than before. It is possible
that it is just me, but it seems that if others understood it, someone
would have given a better answer by now. Are you restricting your
categorical and binary variables to be binned versions of underlying
normals? If that is the case, I doubt that there is a more efficient
way than binning a normal variable. If not, can you show us more of
what you want to produce, along with what you mean by correlation or
covariance with categorical variables (which is meaningless without
additional restrictions/assumptions)?
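
To make the "binned normal" idea concrete, here is a minimal sketch in base R. It assumes the five category probabilities quoted later in this thread for Ca1; the names probs, cuts and ca1 are illustrative only and do not come from any poster's code.

```r
set.seed(1)

# Category probabilities (the Ca1 example from this thread)
probs <- c(0.02, 0.18, 0.28, 0.22, 0.30)

# Thresholds on the standard normal scale: normal quantiles of the
# cumulative probabilities (the last one, 1, would be +Inf and is dropped)
cuts <- qnorm(cumsum(probs)[-length(probs)])

z   <- rnorm(1e5)                                            # latent standard normal
ca1 <- cut(z, breaks = c(-Inf, cuts, Inf), labels = FALSE)   # integer codes 1..5

# Observed proportions should be close to probs, up to sampling error
obs <- as.numeric(table(ca1)) / length(ca1)
round(obs, 3)
```

The marginal proportions are reproduced only up to sampling variation, which is usually what a random sample from the population should look like.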
On Fri, Mar 30, 2012 at 3:41 PM, Burak Aydin < [hidden email]> wrote:

Gregory (Greg) L. Snow Ph.D.
[hidden email]


Hello Greg,
Sorry for the confusion.
Let's say I have a population with 6 variables that are correlated with each other. I can give you Pearson, tetrachoric or polychoric correlation coefficients.
2 of them are continuous, 2 binary, 2 categorical.
Let's assume the following conditions:
Co1 and Co2 are normally distributed continuous random variables: Co1 ~ N(0,1), Co2 ~ N(100,15).
Ca1 and Ca2 are categorical variables: Ca1 probabilities = c(.02,.18,.28,.22,.30), Ca2 probabilities = c(.06,.18,.76).
Bi1 and Bi2 are binary, with marginal probabilities Bi1 p = 0.4, Bi2 p = 0.5.
And, again, I have the correlations.
When I try to simulate this population I fail. If I keep the means and probabilities the same, I lose the correct correlations. When I keep the correlations, I lose precision on the means and frequencies/probabilities.
Please see these links:
http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/copulademo.html
http://stats.stackexchange.com/questions/22856/howtogeneratecorrelatedtestdatathathasbernoullicategoricalandcontin
http://www.springerlink.com/content/011x633m554u843g/


Hello David Duffy2,
I see that you just showed that using rmvnorm and then dichotomizing/categorizing should work. Thanks, but please take a look at this link:
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/CatContinuous
and this article:
Analysis by Categorizing or Dichotomizing Continuous Variables Is Inadvisable: An Example from the Natural History of Unruptured Aneurysms
by O. Naggara, J. Raymond, F. Guilbert, D. Roy, A. Weill and D.G. Altman, 2011.
Plus, here is my explanatory code.
require(mvtnorm)
sigm=matrix(c(0.12, 0.05, 0.02, 0.00,
              0.05, 1.24, 0.38, 0.00,
              0.02, 0.38, 2.38, 0.03,
              0.00, 0.00, 0.03, 0.16),
            ncol=4, byrow=TRUE)
mu=rep(0,4)
#simulated data
dat1 = rmvnorm(1000,mean=mu,sigma=sigm)
#difference between sigmas before dichotomizing/categorizing
sigm-cov(dat1)
#difference between means before dichotomizing/categorizing
means1=apply(dat1,2,mean)
mu-means1
#dichotomization and categorization
#let's dichotomize the third variable
#I want to keep the mean the same (0.50)
dat2=dat1
dat2[,3]=ifelse(dat1[,3]>0.0,0,1)
means2=apply(dat2,2,mean)
mu-means2
#I kept the mean the same, but look at the difference in the cov matrices
sigm-cov(dat2)
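
The change in the covariance matrix after dichotomizing is expected rather than a simulation failure. As a hedged sketch: for a zero-mean bivariate normal (X, Y) cut at 0, standard theory gives Cov(X, 1{Y <= 0}) = -Cov(X, Y) * dnorm(0) / sd(Y), and the indicator itself has variance 0.25. The code below checks this against a simulation using the sigm values above; it draws the normals with base R's chol() instead of rmvnorm only to stay self-contained, and the names expected and observed are illustrative.

```r
set.seed(2)
sigm <- matrix(c(0.12, 0.05, 0.02, 0.00,
                 0.05, 1.24, 0.38, 0.00,
                 0.02, 0.38, 2.38, 0.03,
                 0.00, 0.00, 0.03, 0.16), ncol = 4, byrow = TRUE)

# Zero-mean multivariate normal draws via the Cholesky factor:
# if Z has identity covariance, Z %*% chol(sigm) has covariance sigm
n   <- 2e5
dat <- matrix(rnorm(n * 4), ncol = 4) %*% chol(sigm)

# Dichotomize the third variable with the same rule as in the post
b <- ifelse(dat[, 3] > 0, 0, 1)

# Theoretical covariance of variables 1, 2, 4 with the indicator
expected <- -sigm[c(1, 2, 4), 3] * dnorm(0) / sqrt(sigm[3, 3])
observed <- as.numeric(cov(dat[, c(1, 2, 4)], b))

round(rbind(expected, observed), 3)
var(b)   # close to 0.25, not the original 2.38
```

So the Pearson covariances involving a dichotomized variable are a known function of the latent covariances, not equal to them.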


On Sun, Apr 01, 2012 at 06:00:43PM -0700, Burak Aydin wrote:
> Hello Greg,
> Sorry for the confusion.
> Let's say I have a population with 6 variables that are correlated with
> each other. I can give you Pearson, tetrachoric or polychoric
> correlation coefficients.
> 2 of them are continuous, 2 binary, 2 categorical.
> Let's assume the following conditions:
> Co1 and Co2 are normally distributed continuous random variables:
> Co1 ~ N(0,1), Co2 ~ N(100,15).
> Ca1 and Ca2 are categorical variables: Ca1 probabilities
> = c(.02,.18,.28,.22,.30), Ca2 probabilities = c(.06,.18,.76).
> Bi1 and Bi2 are binary, with marginal probabilities Bi1 p = 0.4, Bi2 p = 0.5.
> And, again, I have the correlations.
>
> When I try to simulate this population I fail. If I keep the means and
> probabilities the same, I lose the correct correlations. When I keep the
> correlations, I lose precision on the means and frequencies/probabilities.
Hi.
One idea, which occurred to me, is the following. Formulate a model of
the joint distribution with some parameters and a criterion function
that measures how much the data generated from the model differ from
the required marginal distributions and the required correlations. Then
run an optimization over the parameters to minimize the difference.
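
A minimal sketch of this idea for a single pair of variables, under stated assumptions: the model is a latent bivariate normal with one parameter (the latent correlation rho), the two binaries use the marginal probabilities 0.4 and 0.5 from the thread, and the target Pearson correlation of 0.30 is hypothetical. The criterion is evaluated by simulation with fixed random draws so that it is a deterministic function of rho.

```r
set.seed(3)
n  <- 2e5
z1 <- rnorm(n)   # fixed draws, reused for every rho (common random numbers)
e  <- rnorm(n)

target <- 0.30          # hypothetical target Pearson correlation
p1 <- 0.4; p2 <- 0.5    # marginal probabilities from the thread

# Pearson correlation of the two binaries implied by latent correlation rho
binary_cor <- function(rho) {
  z2 <- rho * z1 + sqrt(1 - rho^2) * e
  b1 <- as.numeric(z1 < qnorm(p1))   # P(b1 = 1) = p1
  b2 <- as.numeric(z2 < qnorm(p2))   # P(b2 = 1) = p2
  cor(b1, b2)
}

# Criterion: squared distance from the target, minimized over rho
criterion <- function(rho) (binary_cor(rho) - target)^2
fit <- optimize(criterion, interval = c(-0.99, 0.99))

rho_hat <- fit$minimum
c(latent_rho = rho_hat, achieved = binary_cor(rho_hat))
```

With more variables the same pattern applies, with a vector of parameters passed to optim() and a criterion that sums the discrepancies over all required margins and correlations.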
If you have enough data, then the model can be a table of estimated
probabilities for all 5*3*2*2 = 60 combinations of the discrete
variables and for each of these combinations the parameters of the
conditional distribution on the 2 continuous variables, which can
be a bivariate normal distribution. However, you probably do not have
enough data for this.
Another approach starts from the distribution of the continuous
variables and the model for the discrete variables can be a logistic
model using the continuous variables as input.
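
As a hedged illustration of the logistic approach: the sketch below draws the two continuous variables (independently, for simplicity, and reading N(100,15) as standard deviation 15, which is an assumption), then generates one binary variable from a logistic model. The slopes b1 and b2 are hypothetical; the intercept is solved for numerically so that the marginal probability is 0.4.

```r
set.seed(4)
n   <- 1e5
co1 <- rnorm(n, mean = 0,   sd = 1)
co2 <- rnorm(n, mean = 100, sd = 15)

# Hypothetical slopes: they control how strongly bi1 depends on co1 and co2
b1  <- 0.8
b2  <- 0.03
lin <- b1 * co1 + b2 * (co2 - 100)

# Choose the intercept so that the average P(bi1 = 1) equals 0.4
f  <- function(a) mean(plogis(a + lin)) - 0.4
a0 <- uniroot(f, interval = c(-10, 10))$root

# Draw the binary variable conditionally on the continuous ones
bi1 <- rbinom(n, size = 1, prob = plogis(a0 + lin))

c(mean_bi1 = mean(bi1), cor_co1 = cor(bi1, co1), cor_co2 = cor(bi1, co2))
```

The slopes would then be tuned until the resulting correlations match the required ones; the margin stays fixed because the intercept is re-solved each time.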
Another type of model, which may be suitable, is a Bayesian network.
For this, you need to choose only a subset of the most important dependencies,
so that the selected dependencies can be represented by a directed acyclic
graph.
Petr Savicky.


How are you calculating the correlations? That may be part of the
problem: when you categorize a continuous variable you get a factor
whose internal representation is a set of integers. If you compute a
Pearson correlation with that variable, it will not be the polychoric
correlation.
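
A small sketch of this point, assuming a hypothetical latent correlation of 0.5 and the three Ca2 category probabilities from the thread. The Pearson correlation computed on the integer codes comes out clearly smaller than the latent correlation. The optional last step uses polyserial() from the polycor package, and only runs if that package is installed.

```r
set.seed(5)
n   <- 1e5
rho <- 0.5   # hypothetical latent correlation

z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Ca2 from the thread: three categories with probabilities .06, .18, .76
cuts <- qnorm(cumsum(c(0.06, 0.18, 0.76))[1:2])
ca2  <- cut(z2, breaks = c(-Inf, cuts, Inf), labels = FALSE)

latent_r  <- cor(z1, z2)    # close to rho
pearson_r <- cor(z1, ca2)   # Pearson on the integer codes: attenuated

c(latent = latent_r, pearson_on_codes = pearson_r)

# Optional: estimate the latent correlation from the categorized data
if (requireNamespace("polycor", quietly = TRUE)) {
  print(polycor::polyserial(z1, ca2))
}
```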
Also, do you need your data to have the exact proportions and means
that you gave? Or should they represent random samples from those
populations, so that the actual proportions and means will vary a
bit from what is specified?
If you are interested in tetrachoric and polychoric correlations, then
generating the latent normals and categorizing seems the most
straightforward method.
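
A minimal base R sketch of that method for the six variables described in the thread. The latent correlation matrix R (0.3 between every pair) is purely hypothetical, N(100,15) is read as standard deviation 15, and the helper bin() is an illustrative name, not a function from any package.

```r
set.seed(6)
n <- 1e5
k <- 6

# Hypothetical latent correlation matrix: 0.3 between every pair
R <- matrix(0.3, k, k); diag(R) <- 1
z <- matrix(rnorm(n * k), ncol = k) %*% chol(R)   # latent standard normals

# Helper: bin a latent normal so that categories 1..m have probabilities probs
bin <- function(x, probs) {
  cuts <- qnorm(cumsum(probs)[-length(probs)])
  cut(x, breaks = c(-Inf, cuts, Inf), labels = FALSE)
}

sim <- data.frame(
  Co1 = z[, 1],                                   # N(0, 1)
  Co2 = 100 + 15 * z[, 2],                        # mean 100, sd 15
  Ca1 = bin(z[, 3], c(.02, .18, .28, .22, .30)),
  Ca2 = bin(z[, 4], c(.06, .18, .76)),
  Bi1 = bin(z[, 5], c(.6, .4)) - 1,               # P(Bi1 = 1) = 0.4
  Bi2 = bin(z[, 6], c(.5, .5)) - 1                # P(Bi2 = 1) = 0.5
)

round(colMeans(sim[, c("Co1", "Co2", "Bi1", "Bi2")]), 2)
round(prop.table(table(sim$Ca1)), 2)
```

Here the specified correlations are the tetrachoric/polychoric ones on the latent scale; Pearson correlations computed on the binned columns will be smaller in magnitude.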
Also, which function (from which package) are you using to generate
your normal variables? That may have some effect.
On Sun, Apr 1, 2012 at 7:00 PM, Burak Aydin < [hidden email]> wrote:

Gregory (Greg) L. Snow Ph.D.
[hidden email]

