logistic regression model with non-integer weights


logistic regression model with non-integer weights

Ramón Casero Cañas

When fitting a logistic regression model using weights I get the
following warning

> data.model.w <- glm(ABN ~ TR, family=binomial(logit), weights=WEIGHT)
Warning message:
non-integer #successes in a binomial glm! in: eval(expr, envir, enclos)

Details follow

***

I have a binary dependent variable of abnormality

ABN = T, F, T, T, F, F, F...

and a continuous predictor

TR = 1.962752 1.871123 1.893543 1.685001 2.121500, ...



As the number of abnormal cases (ABN==T) is only 14%, and there is
substantial overlap between abnormal and normal cases, the logistic
regression fitted by glm is always much closer to the normal cases than
to the abnormal ones. In particular, the probability of abnormal is at most 0.4.

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.7607     0.7196   1.057   0.2905
TR2          -1.4853     0.4328  -3.432   0.0006 ***
---

I would like to compensate for the fact that the a priori probability of
abnormal cases is so low. I have created a weight vector

> WEIGHT <- ABN
> WEIGHT[ ABN == TRUE ] <-  1 / na / 2
> WEIGHT[ ABN == FALSE ] <-  1 / nn / 2

so that all weights add up to 1, where ``na'' is the number of abnormal
cases, and ``nn'' is the number of normal cases. That is, normal cases
have less weight in the model fitting because there are so many.

But then I get the warning message at the beginning of this email, and I
suspect that I'm doing something wrong. Must weights be integers, or at
least greater than one?
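For reference, here is a self-contained version of what I'm running, with made-up data standing in for mine (the real data are not included); it reproduces the warning:

```r
# Made-up data reusing the variable names above; the warning appears
# but glm still returns coefficient estimates.
set.seed(1)
TR  <- rnorm(200, mean = 1.9, sd = 0.2)             # fake predictor
ABN <- runif(200) < plogis(0.8 - 1.5 * TR)          # fake logical outcome
na <- sum(ABN); nn <- sum(!ABN)
WEIGHT <- ifelse(ABN, 1 / na / 2, 1 / nn / 2)       # weights sum to 1
fit <- glm(ABN ~ TR, family = binomial(logit), weights = WEIGHT)
# Warning message:
# non-integer #successes in a binomial glm!
coef(fit)
```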

Regards,

--
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: logistic regression model with non-integer weights

Michael Dewey
At 17:12 09/04/06, Ramón Casero Cañas wrote:

I have not seen a reply to this so far; apologies if I missed something.


>[...]
>
>I would like to compensate for the fact that the a priori probability of
>abnormal cases is so low. I have created a weight vector

I am not sure what problem you really want to solve, but it seems that
a) abnormality is rare, and
b) the logistic regression predicts it to be rare.
If you want a prediction system, why not try different cut-offs (other than
0.5 on the probability scale) and perhaps plot sensitivity and specificity
to help choose a cut-off?
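Something along these lines, assuming your fitted model is called `fit` (an untested sketch):

```r
# Sweep cut-offs and compute sensitivity and specificity at each one;
# assumes `fit` is the glm from your post and ABN the logical outcome.
p    <- fitted(fit)                     # fitted P(abnormal) per case
cuts <- seq(0.05, 0.95, by = 0.01)
sens <- sapply(cuts, function(k) mean(p[ABN]  >= k))
spec <- sapply(cuts, function(k) mean(p[!ABN] <  k))
plot(cuts, sens, type = "l", ylim = c(0, 1),
     xlab = "cut-off on P(abnormal)", ylab = "proportion")
lines(cuts, spec, lty = 2)
legend("bottomright", c("sensitivity", "specificity"), lty = 1:2)
```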


Michael Dewey
http://www.aghmed.fsnet.co.uk

Re: logistic regression model with non-integer weights

Ramón Casero Cañas
Michael Dewey wrote:
> At 17:12 09/04/06, Ramón Casero Cañas wrote:
>
> I am not sure what the problem you really want to solve is but it seems
> that
> a) abnormality is rare
> b) the logistic regression predicts it to be rare.
> If you want a prediction system why not try different cut-offs (other
> than 0.5 on the probability scale) and perhaps plot sensitivity and
> specificity to help to choose a cut-off?

Thanks for your suggestions, Michael. It took me some time to figure out
how to do this in R (as trivial as it may be for others). Some comments
about what I've done follow, in case anyone is interested.

The problem is that a) abnormality is rare (prevalence = 14%) and b) there is
not much difference in the independent variable between abnormal and
normal cases. So the logistic regression model predicts P(abnormal) <=
0.4. This confused me, as I expected a cut-off point of P = 0.5 to
decide between normal and abnormal. But you are right that another
cut-off point can be chosen.

For a cut-off of e.g. P(abnormal) = 0.15, Sensitivity = 65% and
Specificity = 52%. These are pretty bad, although for clinical purposes I
would say that the Positive/Negative Predictive Values are more interesting.
But then PPV = 19% and NPV = 90%, which isn't great either. As an overall
test of how good the model is for classification I have computed the area
under the ROC curve, following your suggestion of using sensitivity and
specificity.

I couldn't find how to do this directly in R, so I implemented it
myself (it's not difficult, but I'm new to R). I tried the ROCR package,
but apparently it doesn't cover binary outcomes.

The area under the ROC curve is 0.64, so I would say that even though the
model seems to fit the data, it just doesn't allow acceptable
discrimination, no matter what the cut-off point.
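In case it helps anyone, what I implemented is something like the following, using the equivalence between the ROC area and the scaled Mann-Whitney statistic (ties counted half); `p` holds the fitted probabilities and `ABN` the logical outcome (the model name is assumed):

```r
# ROC area as P(a random abnormal case gets a higher fitted probability
# than a random normal case), ties counted half.
p   <- fitted(data.model)              # fitted P(abnormal); model name assumed
gt  <- outer(p[ABN], p[!ABN], ">")
eq  <- outer(p[ABN], p[!ABN], "==")
auc <- mean(gt + 0.5 * eq)
# cross-check via the Wilcoxon/Mann-Whitney statistic:
W <- wilcox.test(p[ABN], p[!ABN], exact = FALSE)$statistic
stopifnot(all.equal(unname(W) / (sum(ABN) * sum(!ABN)), auc))
```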


I have also studied the effect of low prevalence. For this, I used the
ran.gen option of the boot function (package boot) to define a function
that resamples the data so that abnormal and normal cases are balanced.

A logistic regression model is fitted to each replicate in a parametric
bootstrap, to compute the bias of the estimates of the model
coefficients, beta0 and beta1. This shows very small bias for beta1, but
a rather large bias for beta0.

So I would say that prevalence has an effect on beta0, but not beta1.
This is good, because a common measure like the odds ratio depends only
on beta1.
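A sketch of the balanced resampling (assuming the data live in a data frame `d` with columns `ABN` and `TR`; the helper names are mine):

```r
library(boot)

# ran.gen: draw equal numbers of abnormal and normal cases,
# with replacement, so each replicate is balanced.
bal.gen <- function(data, mle) {
  ab <- data[data$ABN, ]
  no <- data[!data$ABN, ]
  n2 <- nrow(data) %/% 2
  rbind(ab[sample(nrow(ab), n2, replace = TRUE), ],
        no[sample(nrow(no), n2, replace = TRUE), ])
}

# statistic: refit the logistic model on each balanced replicate.
coefs <- function(data) coef(glm(ABN ~ TR, binomial, data = data))

b <- boot(d, coefs, R = 999, sim = "parametric", ran.gen = bal.gen)
colMeans(b$t) - b$t0        # bootstrap bias of (beta0, beta1)
```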

Cheers,

--
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog

Re: logistic regression model with non-integer weights

Rick Bilonick
On Sun, 2006-04-16 at 19:10 +0100, Ramón Casero Cañas wrote:

> [...]

The Epi package has function ROC that draws the ROC curve and computes
the AUC among other things.
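For example (argument names from memory; check ?ROC for the exact interface):

```r
library(Epi)
# Fits the logistic model internally and draws the ROC curve,
# annotating the AUC; assumes ABN and TR are in the workspace.
ROC(form = ABN ~ TR, plot = "ROC")
```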

Rick B.

Re: logistic regression model with non-integer weights

Frank Harrell
In reply to this post by Ramón Casero Cañas
Ramón Casero Cañas wrote:

> [...]
>
> So I would say that prevalence has an effect on beta0, but not beta1.
> This is good, because a common measure like the odds ratio depends only
> on beta1.

This makes me think you are trying to go against maximum likelihood to
optimize an improper criterion.  Forcing a single cutpoint to be chosen
seems to be at the heart of your problem.  There's nothing wrong with
using probabilities and letting the utility possessor make the final
decision.

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University


Re: logistic regression model with non-integer weights

Ramón Casero Cañas
Frank E Harrell Jr wrote:
>
> This makes me think you are trying to go against maximum likelihood to
> optimize an improper criterion.  Forcing a single cutpoint to be chosen
> seems to be at the heart of your problem.  There's nothing wrong with
> using probabilities and letting the utility possessor make the final
> decision.

I agree, and in fact I was thinking along those lines, but I also needed
a way of evaluating how good the model is at discriminating between
abnormal and normal cases, as opposed to e.g. GOF. The only way I know
of is the area under the ROC curve (thus setting cut-off points), which
also followed neatly from Michael Dewey's comments. Any alternatives
would be welcome :)

--
Ramón Casero Cañas

http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog


Re: logistic regression model with non-integer weights

Frank Harrell
Ramón Casero Cañas wrote:

> Frank E Harrell Jr wrote:
>
>>This makes me think you are trying to go against maximum likelihood to
>>optimize an improper criterion.  Forcing a single cutpoint to be chosen
>>seems to be at the heart of your problem.  There's nothing wrong with
>>using probabilities and letting the utility possessor make the final
>>decision.
>
>
> I agree, and in fact I was thinking along those lines, but I also needed
> a way of evaluating how good the model is at discriminating between
> abnormal and normal cases, as opposed to e.g. GOF. The only way I know
> of is the area under the ROC curve (thus setting cut-off points), which
> also followed neatly from Michael Dewey's comments. Any alternatives
> would be welcome :)
>

To get the ROC area you don't need to do any of that, and as you
indicated, it is a good discrimination measure.  The lrm function in the
Design package gives it to you automatically (C index), and you can also
get it with the Hmisc package's somers2 and rcorr.cens functions.  ROC
area is highly related to the Wilcoxon 2-sample test statistic for
comparing cases and non-cases.
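For example, with made-up data standing in for yours:

```r
library(Hmisc)
set.seed(2)
y <- rbinom(100, 1, 0.14)                 # fake 0/1 outcome
p <- plogis(-1 + y + rnorm(100))          # fake predicted probabilities
somers2(p, y)["C"]                        # C index = ROC area
```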

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
