
crosstable and regression for survey data (weighted)


crosstable and regression for survey data (weighted)

haps
I have survey data that I am working on. I need to make some multi-way tables and regression analyses on the data. After attaching the data, this is the code I use for tables for four variables (sweight is the weight variable):

> a <- xtabs(sweight~research.area + gender + a2n2 + age)
> tmp <- ftable(a)

Is this correct? I don't think I need to use the strata and cluster variables, right?
 
And, below is the logistic regression code that I use for randomly sampled, or unweighted, data:
> logit.1 <- glm(var4 ~ var3 + var2 + var1, family = binomial(link = "logit"))
> summary(logit.1)
But how can I do the same analyses for the weighted data? Here is some additional info: There are four variables in the dataset that reflect the sampling structure. These are
strat: stratum (urban or (sub-county) rural).
clust: batch of interviews that were part of the same random walk
vill_neigh_code: village or neighbourhood code
sweight: weights

Re: crosstable and regression for survey data (weighted)

Pablo Domínguez Vaselli
Regarding regression models, there is some debate about whether the
sample design needs to be taken into account at all (SPSS, for instance,
doesn't), so you can run them the usual way without much remorse. Or you
can complicate your life (see below).

Your xtabs call seems OK to me. However, regarding tables and totals, you
can expand cases as SPSS and most software does (frequency weights) with
this code:

# assumes sweight holds whole numbers; rep() truncates fractional counts
mydata.x <- mydata[rep(seq_len(nrow(mydata)), times = mydata$sweight), ]

Once your dataframe is expanded this way, any totals and crosstabulations
will be right without passing a count variable to xtabs or other
functions, using just about any normal call you want (e.g. aggregate(),
table(), etc.). This approach is memory-intensive: the dataframe will have
as many rows as the target population.
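To make the equivalence concrete, here is a minimal sketch on invented toy data (the data frame and its values are assumptions for illustration, not the poster's survey). It compares a weighted xtabs call with a plain count on the expanded rows; with whole-number weights the two should agree cell by cell.

```r
# Toy data; all values here are made up for illustration.
mydata <- data.frame(
  gender  = c("f", "m", "f", "m"),
  area    = c("urban", "urban", "rural", "rural"),
  sweight = c(3, 2, 1, 4)   # whole-number weights, so expansion is exact
)

# Weighted table: the left-hand side of the formula is summed within each cell.
weighted <- xtabs(sweight ~ area + gender, data = mydata)

# Expanded data: repeat each row sweight times, then simply count rows.
mydata.x <- mydata[rep(seq_len(nrow(mydata)), times = mydata$sweight), ]
expanded <- xtabs(~ area + gender, data = mydata.x)

# Both routes give the same cell totals.
all(weighted == expanded)
```

With fractional weights the expansion route no longer applies directly, while the weighted xtabs call still sums the weights as given.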

However, in order to properly deal with complex sample data you need the
survey package (I think this is the only sound approach to your modelling
problem). This package will enable you to calculate design effects,
variance estimators and regression modelling taking the survey design into
account without hitting the RAM as above.

In that case, you must first feed the design variables to a survey design
object, using something like:

> library(survey)
> mydesign <- svydesign(ids=~vill_neigh_code+clust, strata=~strat,
+     weights=~sweight, data=mydata)

Do check the survey package's vignette and help files; this is tricky. It
will also help to know the population of each neighbourhood. You must also
check the nesting of your clusters (that is, whether the cluster ids reuse
names across strata).
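One way to check the nesting is sketched below in base R. The column names follow the thread, but the values are invented, so treat this as an illustration rather than a description of the actual data.

```r
# Toy data; column names follow the thread, the values are invented.
mydata <- data.frame(
  strat = c("urban", "urban", "rural", "rural"),
  clust = c(1, 2, 1, 2)   # ids 1 and 2 appear in both strata
)

# For each cluster id, count how many distinct strata it appears in.
strata_per_cluster <- tapply(mydata$strat, mydata$clust,
                             function(s) length(unique(s)))

# Any id seen in more than one stratum means the labels are reused.
reused <- names(strata_per_cluster)[strata_per_cluster > 1]
reused
```

If `reused` is not empty, svydesign() has a `nest=TRUE` argument that relabels cluster ids so they are treated as nested within strata.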

Note that the survey package has special functions for just about anything
(including getting your frequencies). All of them start with "svy", as
in "svytable", and return variance estimators (note that your estimates'
errors will vary from table to table in such a complex design). An example
from the survey package:

> data(api)
> xtabs(~sch.wide+stype, data=apipop)
> dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
> summary(dclus1)
> (tbl <- svytable(~sch.wide+stype, dclus1))

Once you've specified your survey design, you can fit a design-conscious
glm model using:

> mymodel <- svyglm(var1~var2+var3, design=mydesign, family=quasibinomial())
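As a runnable sketch of the same call, the following uses the api example data that ships with the survey package and the one-stage cluster design from its documentation. The choice of predictors is an assumption made only for illustration.

```r
library(survey)
data(api)   # example data shipped with the survey package

# One-stage cluster sample of school districts, as in the package examples.
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Design-based logistic regression; sch.wide is a Yes/No factor.
# Variables are referred to by bare name and looked up inside the design.
fit <- svyglm(sch.wide ~ ell + meals, design = dclus1,
              family = quasibinomial())
summary(fit)
```

Note that the formula uses only column names and that no separate data argument is passed: the design object already carries the data.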


If you're out of time just use normal xtabs and glm!


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: crosstable and regression for survey data (weighted)

haps
Thanks Pablo for your answer, it was very insightful, but I guess I got something wrong.

I formed a survey design as:
> library(survey)
> mydesign <- svydesign(ids=~vill_neigh_code+clust, strata=~strat, weights=~sweight, data=mydata)
where
strat: stratum (urban or (sub-county) rural).
clust: batch of interviews that were part of the same random walk
vill_neigh_code: village or neighbourhood code
sweight: probability weights
Then, I run a logistic regression as
> logit.1 <- svyglm(response~var1+var2+var3+var4+var5+var6, design=mydesign, data=mydata, nest=TRUE, family=quasibinomial())
And I get this error message:
Error in svyglm.survey.design(response ~ var1 + var2 + var3 + var4 +  :
  all variables must be in design= argument
What should I change in the syntax in this case?

Re: crosstable and regression for survey data (weighted)

Pablo Domínguez Vaselli
In reply to this post by haps
It seems the variable names you've used are not the same as those in the
design object.

"all variables must be in design= argument": this means every variable in
the model formula must be in the object you created with
mydesign <- svydesign(ids=~vill_neigh_code+clust,
strata=~strat, weights=~sweight, data=mydata)

Check the spelling. Note that "mydesign" is *not* a dataframe. That
means that mydesign[,5] or mydesign$myvar won't work (and of course neither
will naming the original dataframe "mydata"); you must use the
variable names alone.

for instance:

svyglm(api00~ell+meals+mobility, design=dstrat)

is correct, using only the var names, not dstrat[smth]~dstrat[smth]+
dstrat[smth]

If you write the names correctly, it should work.
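A quick way to find the offending name is to compare the formula's variables with the columns stored in the design. The sketch below uses the survey package's api data and a deliberately misspelled variable; the formula is hypothetical and only there to trigger the mismatch.

```r
library(survey)
data(api)   # example data shipped with the survey package

# Stratified design from the package examples.
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# A formula with one deliberately misspelled name ("meal" instead of "meals").
f <- api00 ~ ell + meal + mobility

# Column names of the data frame kept inside the design object.
design_vars <- names(dstrat$variables)

# Formula variables that the design does not contain.
missing_vars <- setdiff(all.vars(f), design_vars)
missing_vars
```

Anything listed in `missing_vars` is either misspelled or was never part of the data frame passed to svydesign(), for example a variable created in the workspace after the design was built.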

regards

pablo


Re: crosstable and regression for survey data (weighted)

haps
Thanks Pablo,

There must be a spelling issue then, although I can get the tables and other results for the same variables. In this case, I will go for the glm below, and hopefully this will not distort the results too much.

mylogit <- glm(response~ var1+ var2+ var3+ var4+ var5+ var6, weights = sweight, family = quasibinomial(link = "logit"))
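To get a feel for what this fallback changes, here is a hedged sketch on the survey package's api data (not the poster's survey). It fits the same model with a weighted glm and with svyglm. The coefficients should agree, while the standard errors will generally differ because glm ignores the strata and clusters.

```r
library(survey)
data(api)   # example data shipped with the survey package

# Stratified design from the package examples.
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

f <- sch.wide ~ ell + meals

# Weighted glm: uses the weights but knows nothing about the design.
glm_fit <- glm(f, data = apistrat, weights = pw, family = quasibinomial())

# Design-based fit on the same data.
svy_fit <- svyglm(f, design = dstrat, family = quasibinomial())

# Point estimates side by side, then standard errors side by side.
cbind(glm = coef(glm_fit), svyglm = coef(svy_fit))
cbind(glm = sqrt(diag(vcov(glm_fit))), svyglm = sqrt(diag(vcov(svy_fit))))
```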