Dummy variables or factors?

Luciano La Sala-2
Dear R-people,

I am analyzing epidemiological data with a GLMM, using lmer() from the lme4 package. I usually explore the assumption of linearity of continuous variables in the logit of the outcome by creating 4 categories of the variable, performing a bivariate logistic regression, and then plotting the coefficient of each category against the category's midpoint. That gives me a pretty good idea about the linearity assumption and possible departures from it.
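
For illustration, the check described above might look roughly like this in R. It is only a sketch on simulated data: the names `age`, `sick` and `age_cat` are hypothetical, quartile cut points are an assumption, and a plain `glm` stands in for the bivariate logistic regression.

```r
# Simulated data; all variable names here are hypothetical.
set.seed(1)
n    <- 2000
age  <- runif(n, 20, 80)
sick <- rbinom(n, 1, plogis(-3 + 0.05 * age))  # truly linear on the logit scale

# Cut the continuous variable into 4 categories at its quartiles.
breaks  <- quantile(age, probs = seq(0, 1, by = 0.25))
age_cat <- cut(age, breaks = breaks, include.lowest = TRUE)

# Bivariate logistic regression on the categorised variable.
fit <- glm(sick ~ age_cat, family = binomial)

# Coefficient of each category (0 for the reference) against its midpoint.
mids  <- (head(breaks, -1) + tail(breaks, -1)) / 2
coefs <- c(0, coef(fit)[-1])
plot(mids, coefs, type = "b",
     xlab = "Category midpoint",
     ylab = "Log odds ratio vs. first category")
```

If the points fall close to a straight line, the linearity assumption looks reasonable for that variable; a clear bend suggests a departure from it.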

I know of people who create 0/1 dummy variables in order to relax the linearity assumption. However, I've read that dummy variables are never needed (nor are they desirable) in R! Instead, one should use factor variables, which are much easier to work with than dummy variables, and the modelling function itself will create the necessary dummy variables.
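
As a small illustration of that last point, `model.matrix()` shows the 0/1 columns R builds from a factor under its default treatment contrasts. The variable `exposure` and its levels are made up for this sketch.

```r
# Hypothetical three-level exposure variable.
exposure <- factor(c("low", "mid", "high", "mid", "low"),
                   levels = c("low", "mid", "high"))

# model.matrix() shows the design matrix a modelling function would build:
# an intercept plus one 0/1 column per non-reference level.
X <- model.matrix(~ exposure)
print(X)

# The same columns built by hand, for comparison.
d_mid  <- as.numeric(exposure == "mid")
d_high <- as.numeric(exposure == "high")
```

The hand-made `d_mid` and `d_high` should match the `exposuremid` and `exposurehigh` columns of `X`, with "low" acting as the reference level.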

Having said that, if my data violate the linearity assumption, does using a factor for the variable in question help overcome the lack of linearity?

Thanks in advance,

Luciano    



______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Dummy variables or factors?

David Winsemius

On Oct 20, 2009, at 4:00 PM, Luciano La Sala wrote:

> Having said that, if my data violate the linearity assumption, does
> using a factor for the variable in question help overcome the lack
> of linearity?
>
No. If it is done by dividing the variable into a small number of
categories after looking at the data, it merely creates other (and
probably more severe) problems. If you are in the unusual (although
desirable) position of having a large number of events across the
range of the covariates in your data, you may be able to cut your
variable into quintiles or deciles and analyze the resulting factor,
but the preferred approach would be to fit a regression spline of
sufficient complexity.
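
A regression spline of this kind could be sketched as follows. This is only an illustration on simulated data: it uses `ns()` from the `splines` package in an ordinary `glm` rather than a mixed model, and the choice of `df = 4` and the names `x`, `y`, `fit_lin` and `fit_spline` are assumptions made for the example.

```r
library(splines)  # ships with R

# Simulated data with a curved effect on the logit scale; names are hypothetical.
set.seed(2)
n <- 2000
x <- runif(n, 0, 10)
y <- rbinom(n, 1, plogis(-2 + 0.15 * (x - 5)^2))

fit_lin    <- glm(y ~ x,             family = binomial)  # assumes linearity
fit_spline <- glm(y ~ ns(x, df = 4), family = binomial)  # natural cubic spline

# Likelihood-ratio test of the linear fit against the spline fit.
# A straight line is a special case of the natural spline, so the models are nested.
print(anova(fit_lin, fit_spline, test = "Chisq"))

# Fitted curve on the logit scale.
grid <- data.frame(x = seq(0, 10, length.out = 100))
plot(grid$x, predict(fit_spline, newdata = grid), type = "l",
     xlab = "x", ylab = "Fitted log odds")
```

Unlike a factor made from a few cut points, the spline keeps the variable continuous and lets the data determine the shape of the curve.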

> Thanks in advance.

--

David Winsemius, MD
Heritage Laboratories
West Hartford, CT


Re: Dummy variables or factors?

AndrewRoyal
The following is *significantly* easier to do than trying to add in
dummy variables by hand. The dummy variable approach is going to give
you exactly the same fit as the factor method, but possibly with a
different baseline.

Basically, you might want to search the lm help (?lm) and possibly
consult a stats book for information about how the design matrix is
constructed in both cases.

> xF <- factor(1:10)
> N <- 1000
> xFs <- sample(x = xF, N, replace = TRUE)
> yFs <- rnorm(N, mean = as.numeric(xFs))
> lm(yFs ~ xFs)

Call:
lm(formula = yFs ~ xFs)

Coefficients:
(Intercept)         xFs2         xFs3         xFs4         xFs5
     0.7845       1.1620       2.1474       3.1391       4.2183
       xFs6         xFs7         xFs8         xFs9        xFs10
     5.2621       6.0814       7.4170       8.2193       9.2987

> lm(yFs ~ diag(10)[, 1:9][xFs, ])

Call:
lm(formula = yFs ~ diag(10)[, 1:9][xFs, ])

Coefficients:
            (Intercept)  diag(10)[, 1:9][xFs, ]1  diag(10)[, 1:9][xFs, ]2
                 10.083                   -9.299                   -8.137
diag(10)[, 1:9][xFs, ]3  diag(10)[, 1:9][xFs, ]4  diag(10)[, 1:9][xFs, ]5
                 -7.151                   -6.160                   -5.080
diag(10)[, 1:9][xFs, ]6  diag(10)[, 1:9][xFs, ]7  diag(10)[, 1:9][xFs, ]8
                 -4.037                   -3.217                   -1.882
diag(10)[, 1:9][xFs, ]9
                 -1.079





Re: Dummy variables or factors?

AndrewRoyal
Oh dear, that doesn't look right at all.  I shall have a think about
what I did wrong and maybe follow my own advice and consult the doco
myself!



Re: Dummy variables or factors?

AndrewRoyal
Sorry for this third posting. The second method is the same as the
first after all: the coefficients of the first linear model *are* a
linear transformation of those of the second. The dummy-variable fit
simply uses level 10 as the baseline instead of level 1. I just got
confused with the pasting, 'tis all.
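
One way to check this equivalence, sketched here with a fixed seed so it is reproducible (the names `D`, `fit_factor`, `fit_dummy` and `fit_relevel` are introduced only for this example and are not part of the session above):

```r
set.seed(3)
xF  <- factor(1:10)
N   <- 1000
xFs <- sample(xF, N, replace = TRUE)
yFs <- rnorm(N, mean = as.numeric(xFs))

fit_factor <- lm(yFs ~ xFs)           # factor method, baseline: level 1
D          <- diag(10)[, 1:9][xFs, ]  # hand-made dummies, level 10 dropped
fit_dummy  <- lm(yFs ~ D)             # dummy method, baseline: level 10

# Identical fitted values mean the two models are the same fit.
print(all.equal(unname(fitted(fit_factor)), unname(fitted(fit_dummy))))

# Re-basing the factor on level 10 reproduces the dummy coefficients.
fit_relevel <- lm(yFs ~ relevel(xFs, ref = "10"))
print(all.equal(unname(coef(fit_relevel)), unname(coef(fit_dummy))))
```

Both comparisons should come back TRUE. The intercept of the dummy fit is the intercept of the factor fit plus its `xFs10` coefficient, which matches the output posted above (0.7845 + 9.2987 = 10.083).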

