modeling binary response variables

modeling binary response variables

Kevin J Emerson
R-devotees,

I have a question about modeling in the case where the response variable is
binary.

I have a response variable that is the probability of success, and four
descriptor variables. The response has a sigmoid relationship with one of
the variables. I would like to test for the effect of the
various descriptor variables on the percentage success of the binary trait.
I have looked at glm with family = "binomial" but am not sure I totally
understand its use (and therefore am not sure it is the appropriate test)
and am looking for two things: (1) is glm with family = 'binomial' the right
way to do this, and (2) are there any good references on how it works.
I have posted a plot of a sample of the data I am looking at as well as the
sample data used to generate the plots.

Sample Plot: http://www.uoregon.edu/~kemerson/tmp/plot.pdf
Sample Data: http://www.uoregon.edu/~kemerson/tmp/data.csv

Response variable is percent.dev (se2.dev are the errors from binomial
estimates given probability and number of samples).

Descriptor variables are num.days, ppd, temp, and pop.  

Any help would be greatly appreciated.

Cheers,
Kevin Emerson


====================================
Kevin J. Emerson
Bradshaw - Holzapfel Lab
1210 University of Oregon
Eugene, OR, 97403
email: [hidden email]
web: http://evodevo.uoregon.edu/people/emerson.html

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: modeling binary response variables

Daniel Malter
Hi Kevin, do you mean an S-shaped relationship between one variable and your response? So you have a response that is strictly constrained to the interval [0, 1], and these limits are not due to truncation or censoring (i.e. your response variable is truly a proportion).

This sounds like a good application for a binomial model, since a linear model may give fitted values outside the interval that you are allowed to observe (0, 1). The binomial model with a logit (or probit, or cloglog) link fixes that issue.
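A minimal sketch of that point, using simulated data rather than Kevin's (the variable names x, n, successes and prop are made up for illustration): a straight line fitted to sigmoid proportions can predict values below 0 or above 1, while the binomial fit cannot.

```r
# Simulated data (hypothetical, not Kevin's): 20 trials at each of 15 x values,
# with a success probability that follows a sigmoid in x.
set.seed(1)
x <- seq(-4, 4, length.out = 15)
n <- rep(20, length(x))
successes <- rbinom(length(x), size = n, prob = plogis(1.5 * x))
prop <- successes / n

# Linear model on the raw proportions: fitted values can leave [0, 1].
fit_lm <- lm(prop ~ x)
range(fitted(fit_lm))

# Binomial GLM with the default logit link: fitted values are probabilities.
fit_glm <- glm(cbind(successes, n - successes) ~ x, family = binomial)
range(fitted(fit_glm))
```

Swapping in binomial(link = "probit") or binomial(link = "cloglog") changes the shape of the curve but keeps the fitted values inside the unit interval.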

Since you have a proportion (the probability of success), you have something between 0 and 1. I suggest transforming it by multiplying the proportion by, say, 100 (or 1000) and rounding to the nearest integer. If Y is currently your proportion, do new.Y=round(Y*100). Then create the complementary count for each observation: counter.Y=100-new.Y.

Then you can run the binomial as follows:

reg=glm(cbind(new.Y,counter.Y)~predictors,binomial) ##runs the regression
summary(reg) ##shows the summary output of your regression
fitted(reg) ##shows the predicted values given your data matrix and your estimated model

You will want to check
a.) whether you need a binomial model at all (if your proportions actually fall in a much narrower range well inside 0,1, then you may be okay with a linear model).
b.) if a binomial is more appropriate, whether your data are overdispersed. Look at whether the residual deviance in the summary of your model is about equal to its residual degrees of freedom. If it is much larger, choose family quasibinomial instead of binomial when fitting the model.
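As a rough sketch of check b.), assuming simulated data with deliberately added group-level noise (none of these names or numbers come from Kevin's data), the ratio of residual deviance to residual degrees of freedom can be computed directly and compared with 1:

```r
# Simulated overdispersed data (hypothetical): the success probability varies
# between groups by more than the predictor x explains.
set.seed(42)
n_groups <- 40
x <- runif(n_groups, -2, 2)
n <- rep(30, n_groups)
p <- plogis(0.8 * x + rnorm(n_groups, sd = 1))  # extra group-level noise
successes <- rbinom(n_groups, size = n, prob = p)

fit <- glm(cbind(successes, n - successes) ~ x, family = binomial)

# Rough check: residual deviance per residual degree of freedom.
# Values well above 1 suggest overdispersion.
dispersion <- deviance(fit) / df.residual(fit)
dispersion

# quasibinomial gives the same coefficient estimates but estimates a
# dispersion parameter (Pearson-based) and scales the standard errors by it.
fit_q <- glm(cbind(successes, n - successes) ~ x, family = quasibinomial)
summary(fit_q)$dispersion
```

The ratio is only a rule of thumb; it is most informative when the number of trials per observation is reasonably large.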

Best,
Daniel



Re: modeling binary response variables

Simon Blomberg-4
Wait, are the proportions (probabilities) based on discrete data, or are
they truly continuous? If the latter, then beta regression might be more
appropriate (e.g. package betareg). If the former, include the sample
size for each proportion in the call to glm using the weights= argument.
Or set the data up so you have a column of numbers of "successes" and a
column of "failures" and use the cbind(successes, failures) notation from
Daniel's message. Multiplying your proportion by an arbitrary large number
is bad because you are in effect fudging the precision of the proportion
estimates.
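
A small sketch of the two set-ups for discrete data, using a made-up data frame (dat, successes, n and the helper se are hypothetical): the weights= form and the successes/failures form give the same estimates, whereas pretending that each proportion came from 1000 trials shrinks the standard errors.

```r
# Hypothetical data: 'successes' out of 'n' trials at each value of x.
dat <- data.frame(
  x         = c(1, 2, 3, 4, 5, 6),
  successes = c(1, 3, 6, 10, 14, 17),
  n         = c(18, 20, 19, 20, 21, 20)
)
dat$failures <- dat$n - dat$successes
dat$prop     <- dat$successes / dat$n

# Form 1: two-column response of successes and failures.
fit_counts <- glm(cbind(successes, failures) ~ x, family = binomial, data = dat)

# Form 2: observed proportion as response, number of trials as weights.
fit_weights <- glm(prop ~ x, family = binomial, weights = n, data = dat)

# Both forms describe the same likelihood, so the estimates should agree.
all.equal(coef(fit_counts), coef(fit_weights))

# Pretending each proportion came from 1000 trials tells the model it has
# far more data than it does, so the standard errors shrink artificially.
fake_succ <- round(dat$prop * 1000)
fit_fake <- glm(cbind(fake_succ, 1000 - fake_succ) ~ x,
                family = binomial, data = dat)
se <- function(fit) sqrt(diag(vcov(fit)))
rbind(real = se(fit_counts), inflated = se(fit_fake))
```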

HTH,

Simon.

--
Simon Blomberg, BSc (Hons), PhD, MAppStat.
Lecturer and Consultant Statistician
Faculty of Biological and Chemical Sciences
The University of Queensland
St. Lucia Queensland 4072
Australia
Room 320 Goddard Building (8)
T: +61 7 3365 2506
http://www.uq.edu.au/~uqsblomb
email: S.Blomberg1_at_uq.edu.au

Policies:
1.  I will NOT analyse your data for you.
2.  Your deadline is your problem.

The combination of some data and an aching desire for
an answer does not ensure that a reasonable answer can
be extracted from a given body of data. - John Tukey.


Re: modeling binary response variables

Daniel Malter
In reply to this post by Kevin J Emerson
I have a follow-up question to that. Is beta regression available for repeated-measures or panel data and, if so, is it available in R?

thx,
Daniel


Re: modeling binary response variables

Simon Blomberg-4
Jim Lindsey's repeated package has the function gnlmm which will fit
beta regressions with a random intercept and one level of nesting. I
don't know of any other options.

Cheers,

Simon.


Re: modeling binary response variables

Daniel Malter
thanks
