# logistic regression model with non-integer weights

## logistic regression model with non-integer weights

When fitting a logistic regression model using weights I get the following warning:

```r
> data.model.w <- glm(ABN ~ TR, family=binomial(logit), weights=WEIGHT)
Warning message:
non-integer #successes in a binomial glm! in: eval(expr, envir, enclos)
```

Details follow.

I have a binary dependent variable of abnormality

    ABN = T, F, T, T, F, F, F, ...

and a continuous predictor

    TR = 1.962752, 1.871123, 1.893543, 1.685001, 2.121500, ...

As the number of abnormal cases (ABN==T) is only 14%, and there is large overlap between abnormal and normal cases, the logistic regression found by glm is always much closer to the normal cases than to the abnormal cases. In particular, the probability of abnormal is at most 0.4.

```
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.7607     0.7196   1.057   0.2905
TR2          -1.4853     0.4328  -3.432   0.0006 ***
```

I would like to compensate for the fact that the a priori probability of abnormal cases is so low. I have created a weight vector

```r
> WEIGHT <- ABN
> WEIGHT[ ABN == TRUE ]  <- 1 / na / 2
> WEIGHT[ ABN == FALSE ] <- 1 / nn / 2
```

so that all weights add up to 1, where `na` is the number of abnormal cases and `nn` is the number of normal cases. That is, normal cases have less weight in the model fitting because there are so many. But then I get the warning message at the beginning of this email, and I suspect that I'm doing something wrong. Must weights be integers, or at least greater than one?

Regards,

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
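For what it's worth, `glm()` with a binomial family interprets `weights` as case (frequency) counts, which is why non-integer values draw that warning. A minimal sketch of the situation follows; the data, the ~14% prevalence, and the variable names are synthetic stand-ins mirroring the post, not the poster's actual data. One common workaround, if only the point estimates matter, is the `quasibinomial` family, which drops the integer-count assumption:

```r
## Synthetic stand-ins for the poster's ABN/TR (hypothetical data).
set.seed(1)
n   <- 200
ABN <- runif(n) < 0.14                      # ~14% abnormal, as in the post
TR  <- rnorm(n, mean = 1.9, sd = 0.2) - 0.2 * ABN

na <- sum(ABN); nn <- sum(!ABN)
WEIGHT <- ifelse(ABN, 1 / na / 2, 1 / nn / 2)   # weights sum to 1

## Binomial glm treats 'weights' as case counts, so non-integer weights
## trigger: "non-integer #successes in a binomial glm!"
m1 <- glm(ABN ~ TR, family = binomial(logit), weights = WEIGHT)

## quasibinomial drops the integer-count assumption: same point
## estimates (same IRLS fit), different standard errors, no warning.
m2 <- glm(ABN ~ TR, family = quasibinomial(logit), weights = WEIGHT)
all.equal(coef(m1), coef(m2))   # point estimates agree
```

Whether reweighting is the right fix is a separate question; it mainly shifts the intercept (and hence the fitted probabilities) rather than the slope.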

## Re: logistic regression model with non-integer weights

At 17:12 09/04/06, Ramón Casero Cañas wrote:

I have not seen a reply to this so far; apologies if I missed something.

> When fitting a logistic regression model using weights I get the
> following warning
>
> > data.model.w <- glm(ABN ~ TR, family=binomial(logit), weights=WEIGHT)
> Warning message:
> non-integer #successes in a binomial glm! in: eval(expr, envir, enclos)
>
> [details of the data and model snipped]
>
> I would like to compensate for the fact that the a priori probability of
> abnormal cases is so low. I have created a weight vector

I am not sure what the problem you really want to solve is, but it seems that
a) abnormality is rare
b) the logistic regression predicts it to be rare.
If you want a prediction system, why not try different cut-offs (other than 0.5 on the probability scale) and perhaps plot sensitivity and specificity to help to choose a cut-off?

> > WEIGHT <- ABN
> > WEIGHT[ ABN == TRUE ]  <- 1 / na / 2
> > WEIGHT[ ABN == FALSE ] <- 1 / nn / 2
>
> so that all weights add up to 1, where `na` is the number of abnormal
> cases, and `nn` is the number of normal cases. That is, normal cases
> have less weight in the model fitting because there are so many.
>
> But then I get the warning message at the beginning of this email, and I
> suspect that I'm doing something wrong. Must weights be integers, or at
> least greater than one?

Michael Dewey
http://www.aghmed.fsnet.co.uk
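The cut-off scan Michael suggests might look like the following sketch. The data here are hypothetical stand-ins for the poster's `ABN` and `TR`:

```r
## Fit the model, then compute sensitivity and specificity at a range
## of probability cut-offs (hypothetical data mirroring the post).
set.seed(1)
ABN <- runif(200) < 0.14
TR  <- rnorm(200, 1.9, 0.2) - 0.2 * ABN
fit <- glm(ABN ~ TR, family = binomial)
p   <- fitted(fit)                 # fitted P(abnormal) per case

cutoffs <- seq(0.05, 0.40, by = 0.05)
perf <- t(sapply(cutoffs, function(k) {
  pred <- p >= k                   # classify "abnormal" above cut-off k
  c(cutoff      = k,
    sensitivity = mean(pred[ABN]),    # P(predict abn | truly abnormal)
    specificity = mean(!pred[!ABN]))  # P(predict nrm | truly normal)
}))
perf
## plot(1 - perf[, "specificity"], perf[, "sensitivity"], type = "b")  # ROC-style
```

Raising the cut-off trades sensitivity for specificity, so scanning a grid like this is one way to pick an operating point.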

## Re: logistic regression model with non-integer weights

Michael Dewey wrote:
> I am not sure what the problem you really want to solve is but it seems
> that
> a) abnormality is rare
> b) the logistic regression predicts it to be rare.
> If you want a prediction system why not try different cut-offs (other
> than 0.5 on the probability scale) and perhaps plot sensitivity and
> specificity to help to choose a cut-off?

Thanks for your suggestions, Michael. It took me some time to figure out how to do this in R (as trivial as it may be for others). Some comments about what I've done follow, in case anyone is interested.

The problem is a) abnormality is rare (Prevalence=14%) and b) there is not much difference in the independent variable between abnormal and normal. So the logistic regression model predicts that P(abnormal) <= 0.4. I got confused by this, as I expected a cut-off point of P=0.5 to decide between normal/abnormal. But you are right that another cut-off point can be chosen.

For a cut-off of e.g. P(abnormal)=0.15, Sensitivity=65% and Specificity=52%. They are pretty bad, although for clinical purposes I would say that Positive/Negative Predictive Values are more interesting. But then PPV=19% and NPV=90%, which isn't great. As an overall test of how good the model is for classification I have computed the area under the ROC, following your suggestion of using Sensitivity and Specificity.

I couldn't find how to do this directly with R, so I implemented it myself (it's not difficult, but I'm new here). I tried with package ROCR, but apparently it doesn't cover binary outcomes.

The area under the ROC is 0.64, so I would say that even though the model seems to fit the data, it just doesn't allow acceptable discrimination, no matter what the cut-off point.

I have also studied the effect of low prevalence. For this, I used option ran.gen in the boot function (package boot) to define a function that resamples the data so that it balances abnormal and normal cases. A logistic regression model is fitted to each replicate of this bootstrap, and from the replicates I compute the bias of the estimates of the model coefficients, beta0 and beta1. This shows very small bias for beta1, but a rather large bias for beta0.

So I would say that prevalence has an effect on beta0, but not beta1. This is good, because a common measure like the odds ratio depends only on beta1.

Cheers,

--
Ramón Casero Cañas
http://www.robots.ox.ac.uk/~rcasero/wiki
http://www.robots.ox.ac.uk/~rcasero/blog
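The two computations described, a hand-rolled AUC and a balanced `ran.gen` bootstrap, might look like the sketch below. The data are synthetic, and the balancing scheme is one plausible reading of the post, not the poster's actual code. The AUC uses the fact that, for a binary outcome, the area under the ROC equals the Wilcoxon statistic: the probability that a random abnormal case receives a higher fitted probability than a random normal one.

```r
## AUC via the rank (Wilcoxon / Mann-Whitney) formulation; midranks
## handle ties. p = predicted probabilities, y = logical outcome.
auc <- function(p, y) {
  r  <- rank(p)
  na <- sum(y); nn <- sum(!y)
  (sum(r[y]) - na * (na + 1) / 2) / (na * nn)
}

## Balanced resampling via boot(..., sim = "parametric", ran.gen = ...):
## each replicate draws equal numbers of abnormal and normal cases.
library(boot)
set.seed(1)
d <- data.frame(ABN = runif(200) < 0.14)       # hypothetical data
d$TR <- rnorm(200, 1.9, 0.2) - 0.2 * d$ABN

balanced <- function(data, mle) {
  ab <- data[data$ABN, ]; no <- data[!data$ABN, ]
  m  <- nrow(ab)
  rbind(ab[sample(m, m, replace = TRUE), ],
        no[sample(nrow(no), m, replace = TRUE), ])
}
coefs <- function(data) coef(glm(ABN ~ TR, family = binomial, data = data))
b <- boot(d, coefs, R = 200, sim = "parametric", ran.gen = balanced)
colMeans(b$t) - b$t0   # bootstrap estimate of bias in (beta0, beta1)
```

Consistent with the post, balancing the classes moves the intercept (prevalence enters the model through beta0) while the slope, and hence the odds ratio, is far less affected.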

## Re: logistic regression model with non-integer weights

On Sun, 2006-04-16 at 19:10 +0100, Ramón Casero Cañas wrote:
> As an overall test of how good the model is for classification I have
> computed the area under the ROC [...]
> I couldn't find how to do this directly with R, so I implemented it
> myself (it's not difficult but I'm new here). I tried with package ROCR,
> but apparently it doesn't cover binary outcomes.
> [rest of quoted message snipped]

The Epi package has function ROC that draws the ROC curve and computes the AUC among other things.

Rick B.
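A sketch of Rick's suggestion, assuming the Epi package is installed; the data are hypothetical stand-ins for the poster's `ABN` and `TR`:

```r
## Epi::ROC fits the logistic model, draws the ROC curve, and reports
## the area under it (hypothetical data mirroring the post).
library(Epi)
set.seed(1)
ABN <- runif(200) < 0.14
TR  <- rnorm(200, 1.9, 0.2) - 0.2 * ABN

r <- ROC(form = ABN ~ TR, plot = "ROC")   # draws the curve, annotates AUC
r$AUC                                      # area under the curve
```

This replaces the hand-rolled AUC computation with a one-liner, and also reports sensitivity/specificity at each cut-off via the returned `res` component.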

## Re: logistic regression model with non-integer weights

In reply to this post by Ramón Casero Cañas

Ramón Casero Cañas wrote:
> The area under the ROC is 0.64, so I would say that even though the
> model seems to fit the data, it just doesn't allow acceptable
> discrimination, no matter what the cut-off point.
> [rest of quoted message snipped]

This makes me think you are trying to go against maximum likelihood to optimize an improper criterion. Forcing a single cutpoint to be chosen seems to be at the heart of your problem. There's nothing wrong with using probabilities and letting the utility possessor make the final decision.

--
Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
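Frank's point can be made concrete with a small sketch: report the predicted probability, and let whoever holds the utilities turn it into a decision. The utility values below are entirely hypothetical; they exist only to show that the decision threshold falls out of the utilities rather than being fixed at 0.5.

```r
## Expected-utility decision from a predicted probability p:
##   act        EU = U_tp * p + U_fp * (1 - p)
##   don't act  EU = U_fn * p + U_tn * (1 - p)
## With the hypothetical utilities below, "act" wins iff p >= 1/11.
decide <- function(p, U_tp = 0, U_fp = -10, U_fn = -100, U_tn = 0) {
  ifelse(U_tp * p + U_fp * (1 - p) >= U_fn * p + U_tn * (1 - p),
         "treat", "dont_treat")
}
decide(c(0.05, 0.15, 0.40))
```

Different utilities imply different implicit cut-offs, which is exactly why fixing one cutpoint inside the model is premature.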