Quantcast

binomial GLM quasi separation

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

binomial GLM quasi separation

lincoln
Hi all,

I have run a (glm) analysis where the dependent variable is the gender (family=binomial) and the predictors are percentages.
I get a warning saying "fitted probabilities numerically 0 or 1 occurred" that is indicating that quasi-separation or separation is occurring.
This makes sense given that one of these predictors have a very influential effect that is depending on a specific threshold separating these effects, in other words in my analysis one of these variables predicts males about the 80% of times when its values are less or equal to zero and females about the 80% when its values are greater than zero.
I have been looking at other posts about this but I haven’t understood how I should act when the separation (or quasi separation) is not a statistical artifact but it is something real.
As suggested in http://r.789695.n4.nabble.com/OT-quasi-separation-in-a-logistic-GLM-td875726.html#a3850331 (the last post is mine) I tried to use brglm procedure that uses a penalized maximum likelihood but it made no difference.

What would you do if you were in my shoes?
Thanks in advance for any help.

Simone
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: binomial GLM quasi separation

bbolker
lincoln <miseno77 <at> hotmail.com> writes:

>
> Hi all,
>
> I have run a (glm) analysis where the dependent variable is the gender
> (family=binomial) and the predictors are percentages.
> I get a warning saying "fitted probabilities numerically 0 or 1 occurred"
> that is indicating that quasi-separation or separation is occurring.
> This makes sense given that one of these predictors have a very influential
> effect that is depending on a specific threshold separating these effects,
> in other words in my analysis one of these variables predicts males about
> the 80% of times when its values are less or equal to zero and females about
> the 80% when its values are greater than zero.
> I have been looking at other posts about this but I haven’t understood how I
> should act when the separation (or quasi separation) is not a statistical
> artifact but it is something real.
> As suggested in
> http://r.789695.n4.nabble.com/
>  OT-quasi-separation-in-a-logistic-GLM-td875726.html#a3850331
> http://r.789695.n4.nabble.com/
>  OT-quasi-separation-in-a-logistic-GLM-td875726.html#a3850331

  [warning, broke URLs to make gmane happy]

> (the last post is mine) I tried to use brglm procedure that uses a penalized
> maximum likelihood but it made no difference.
>

  I'm not sure what's going on here, and I don't know why brglm()
shouldn't work ... from a squint at your Nabble post (I can't
really see the figure very well), I agree that
the hcp profile is funky, but I wouldn't immediately conclude that
the profile is bad -- in particular, it seems that the x-axis range
is -45 to -15, rather than something like (-600,-300) as I would expect
from the estimated parameter (ca. -400) and standard error (ca. 60).
I would start by setting which=3 (to confine your attention to the
hcp parameter) and messing around with the gridsize, stepsize, stdn
parameters in profileModel to see what's going on.

 If that doesn't work you might have to post data, or a subset of
data, in order to get any more help ...

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: binomial GLM quasi separation

Uwe Ligges-3


On 13.10.2011 21:46, Ben Bolker wrote:

> lincoln<miseno77<at>  hotmail.com>  writes:
>
>>
>> Hi all,
>>
>> I have run a (glm) analysis where the dependent variable is the gender
>> (family=binomial) and the predictors are percentages.
>> I get a warning saying "fitted probabilities numerically 0 or 1 occurred"
>> that is indicating that quasi-separation or separation is occurring.
>> This makes sense given that one of these predictors have a very influential
>> effect that is depending on a specific threshold separating these effects,
>> in other words in my analysis one of these variables predicts males about
>> the 80% of times when its values are less or equal to zero and females about
>> the 80% when its values are greater than zero.
>> I have been looking at other posts about this but I haven’t understood how I
>> should act when the separation (or quasi separation) is not a statistical
>> artifact but it is something real.
>> As suggested in
>> http://r.789695.n4.nabble.com/
>>   OT-quasi-separation-in-a-logistic-GLM-td875726.html#a3850331
>> http://r.789695.n4.nabble.com/
>>   OT-quasi-separation-in-a-logistic-GLM-td875726.html#a3850331
>
>    [warning, broke URLs to make gmane happy]
>
>> (the last post is mine) I tried to use brglm procedure that uses a penalized
>> maximum likelihood but it made no difference.
>>
>
>    I'm not sure what's going on here, and I don't know why brglm()
> shouldn't work ... from a squint at your Nabble post (I can't
> really see the figure very well), I agree that
> the hcp profile is funky, but I wouldn't immediately conclude that
> the profile is bad -- in particular, it seems that the x-axis range
> is -45 to -15, rather than something like (-600,-300) as I would expect
> from the estimated parameter (ca. -400) and standard error (ca. 60).
> I would start by setting which=3 (to confine your attention to the
> hcp parameter) and messing around with the gridsize, stepsize, stdn
> parameters in profileModel to see what's going on.
>
>   If that doesn't work you might have to post data, or a subset of
> data, in order to get any more help ...


Or if just the separating hyperplane is to be found (and no tests have
to be considered), I'd use an lda rather than logistic regression in
such a case.

Uwe Ligges


> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: binomial GLM quasi separation

lincoln
In reply to this post by bbolker
As you suggested I had a further look at the profile by changing default values of stepsize (I tried to modify the others but apparently there was any change).
Here they go the scripts I have used:

> dati<-read.table("simone.txt",header=T,sep="\t",as.is=T)
> glm.sat<-glm(sex~twp+hwp+hcp+hnp,binomial,data=dati)
Mensajes de aviso perdidos
glm.fit: fitted probabilities numerically 0 or 1 occurred
>pp<-profileModel(glm.sat,quantile=qchisq(0.95,1),objective="ordinaryDeviance",which=4)
Preliminary iteration . Done

Profiling for parameter hcp ... Done
>pp1<-profileModel(glm.sat,quantile=qchisq(0.95,1),objective="ordinaryDeviance",which=4,stepsize=20)
Preliminary iteration . Done

Profiling for parameter hcp ... Done
>pp2<-profileModel(glm.sat,quantile=qchisq(0.95,1),objective="ordinaryDeviance",which=4,stepsize=100)
Preliminary iteration . Done

Profiling for parameter hcp ... Done
>plot(pp)
>mtext("Default stepsize",adj=0,cex=2,line=1)
>dev.new()
>plot(pp)
>mtext("Stepsize=20",adj=0,cex=2,line=1)
>dev.new()
>plot(pp)
>mtext("Stepsize=100",adj=0,cex=2,line=1)

And these are the plots as they look like:




I have tried to understand what is going on but I don't know how to interpret this.
It's quite a long time that I am trying to solve this but I have not been able to. Here ( simone.txt ) I attach a subset of the data I am working with that comprises the variables specified in the above glm model and by the way the "funky" variable called "hcp".
Thank you for take your time to help me in this.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: binomial GLM quasi separation

Gavin Simpson
On Fri, 2011-10-14 at 02:32 -0700, lincoln wrote:
> As you suggested I had a further look at the profile by changing default
> values of stepsize (I tried to modify the others but apparently there was
> any change).

Have you read ?glm, specifically this bit:

Details:

....

     For the background to warning messages about ‘fitted probabilities
     numerically 0 or 1 occurred’ for binomial GLMs, see Venables &
     Ripley (2002, pp. 197-8).

There, V&R say (me paraphrasing) that if there are some large fitted
\beta_i the curvature of the log-likelihood at the fitted \beta can be
much less than at \beta_i = 0. The Wald approximation underestimates the
change in the LL on setting \beta_i = 0. As the absolute value of the
fitted \beta_i becomes large (tends to infinity) the t statistic tends
to 0. This is the Hauck Donner effect.

Whilst I am (so very) far from being an expert here - this does seem to
fit the results you showed.

Furthermore, did you follow the steps Ioannis Kosmidis took me through
with my data in that email thread? I have with your data and everything
seems to follow the explanation/situation given by Ioannis. Namely, if
you increase the number of iterations and tolerance in the glm() call
you get the same fit as with a standard glm() call:

> ## normal glm()
> summary(mod)

Call:
glm(formula = sex ~ twp + hwp + hcp + hnp, family = binomial,
    data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-2.9703  -0.1760   0.3181   0.6061   3.5235  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.4362     0.2687   5.345 9.02e-08 ***
twp            5.5673     1.3602   4.093 4.26e-05 ***
hwp           -4.2793     2.3781  -1.799   0.0719 .  
hcp         -450.1918    56.6823  -7.942 1.98e-15 ***
hnp           -4.5302     3.2825  -1.380   0.1676    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 581.51  on 425  degrees of freedom
Residual deviance: 294.00  on 421  degrees of freedom
  (41 observations deleted due to missingness)
AIC: 304

Number of Fisher Scoring iterations: 8

> ## now with control = glm.control(epsilon = 1e-16, maxit = 1000)
> ## in the call
> summary(mod3)

Call:
glm(formula = sex ~ twp + hwp + hcp + hnp, family = binomial,
    data = dat, control = glm.control(epsilon = 1e-16, maxit = 1000))

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-2.9703  -0.1760   0.3181   0.6061   3.5235  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.4362     0.2687   5.345 9.02e-08 ***
twp            5.5673     1.3602   4.093 4.26e-05 ***
hwp           -4.2793     2.3781  -1.799   0.0719 .  
hcp         -450.1918    56.6823  -7.942 1.98e-15 ***
hnp           -4.5302     3.2825  -1.380   0.1676    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 581.51  on 425  degrees of freedom
Residual deviance: 294.00  on 421  degrees of freedom
  (41 observations deleted due to missingness)
AIC: 304

Number of Fisher Scoring iterations: 9

>From the thread you link to, we would expect the effects to diverge.

Profiling the model shows that the LL starts to increase again at low
values, but does so slowly. The LL is very flat around the estimates and
is far from 0, which seems to correspond with the description of the
Hauck Donner effect given By Venables and Ripley in their book. In your
case however, the statistic is still sufficiently large for it to be
identified as significant via the Wald test.

If we fit the model via brglm() we get essentially the same "result" as
fitted by glm():

> summary(mod2)

Call:
brglm(formula = sex ~ twp + hwp + hcp + hnp, family = binomial,
    data = dat)


Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.4262     0.2662   5.357 8.47e-08 ***
twp            5.3696     1.3323   4.030 5.57e-05 ***
hwp           -4.2813     2.3504  -1.821   0.0685 .  
hcp         -435.9212    55.0566  -7.918 2.42e-15 ***
hnp           -4.6295     3.2459  -1.426   0.1538    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 560.48  on 425  degrees of freedom
Residual deviance: 294.08  on 421  degrees of freedom
Penalized deviance: 302.8212
  (41 observations deleted due to missingness)
AIC:  304.08

Ioannis points out that if there were separation, the brglm results
would differ markedly from the glm() ones, and they don't.

As Ioannis mentioned in that thread, you can get the warning about the
fitted probabilities without a separation problem - In my case it was
because there were some very small fitted probabilities, which R was
just warning about.

HTH

G

> Here they go the scripts I have used:
>
> > dati<-read.table("simone.txt",header=T,sep="\t",as.is=T)
> > glm.sat<-glm(sex~twp+hwp+hcp+hnp,binomial,data=dati)
> Mensajes de aviso perdidos
> glm.fit: fitted probabilities numerically 0 or 1 occurred
> >pp<-profileModel(glm.sat,quantile=qchisq(0.95,1),objective="ordinaryDeviance",which=4)
> Preliminary iteration . Done
>
> Profiling for parameter hcp ... Done
> >pp1<-profileModel(glm.sat,quantile=qchisq(0.95,1),objective="ordinaryDeviance",which=4,stepsize=20)
> Preliminary iteration . Done
>
> Profiling for parameter hcp ... Done
> >pp2<-profileModel(glm.sat,quantile=qchisq(0.95,1),objective="ordinaryDeviance",which=4,stepsize=100)
> Preliminary iteration . Done
>
> Profiling for parameter hcp ... Done
> >plot(pp)
> >mtext("Default stepsize",adj=0,cex=2,line=1)
> >dev.new()
> >plot(pp)
> >mtext("Stepsize=20",adj=0,cex=2,line=1)
> >dev.new()
> >plot(pp)
> >mtext("Stepsize=100",adj=0,cex=2,line=1)
>
> And these are the plots as they look like:
> http://r.789695.n4.nabble.com/file/n3904261/plot1.png 
> http://r.789695.n4.nabble.com/file/n3904261/plot2.png 
> http://r.789695.n4.nabble.com/file/n3904261/plot3.png 
>
> I have tried to understand what is going on but I don't know how to
> interpret this.
> It's quite a long time that I am trying to solve this but I have not been
> able to. Here (  http://r.789695.n4.nabble.com/file/n3904261/simone.txt
> simone.txt  ) I attach a subset of the data I am working with that comprises
> the variables specified in the above glm model and by the way the "funky"
> variable called "hcp".
> Thank you for take your time to help me in this.
>
> --
> View this message in context: http://r.789695.n4.nabble.com/binomial-GLM-quasi-separation-tp3901687p3904261.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: binomial GLM quasi separation

lincoln
In reply to this post by Uwe Ligges-3
#Uwe:

I have realized that in the firstly linked post ( OT-quasi-separation-in-a-logistic-GLM ) I have told something misleading: in fact my independent variables are not log-normally distributed since there are lot of zeros that constitute the more frequent values. I have not been able to normalize them: I don't even  know if it is possible to do it.
For the assumption of normally distributed predictors I believe I can't use a lda.

#Gavin:

I have read carefully your thread but I am not sure to understand what you are suggesting (my gaps in statistics!). You say that it should be due to the Hauck Donner effect and that it is not a quasi separation or separation issue. Even though, I am still unsure to understand why I found such a high asymptotic standard error.

Anyway, how should I consider this result? Should I find another way to analyze this process or I could consider this as correct?

If I am understanding this enough, this warning message and the very high estimates should be due to  Hauck-Donner. Regarding that reference to Venables and Ripley (2002) on this issue, I have found this ( Hauck-Donner ) where it is said that "The practical advice, then, is to run the model with all of the variables, and then run again with the questionable one removed, and conduct a likelihood ratio test. and I suppose that the p-values for hcp should be the LRT p-value, isn't it?

Thanks for taking your time to help me in this.

Simone


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: binomial GLM quasi separation

Gavin Simpson
On Sat, 2011-10-15 at 09:11 -0700, lincoln wrote:
> #Uwe:
<snip />
> #Gavin:
>
> I have read carefully your thread but I am not sure to understand what you
> are suggesting (my gaps in statistics!). You say that it should be due to
> the /Hauck Donner/ effect and that it is not a quasi separation or
> separation issue. Even though, I am still unsure to understand why I found
> such a high asymptotic standard error.

I don't believe this is a separation issue - the sorts of things we'd
expect to see if this were due to separation do not show up.

Given the large estimate for the coefficient for the term it is not that
surprising that the associated uncertainty is also high:

> set.seed(2)
> var(runif(100))
[1] 0.08911998
> set.seed(2)
> var(runif(100)*10000)
[1] 8911998

All I did there was increase the "units" of the data in the second
example and the variance is huge, but only because the data were
expressed in units 10000 times bigger than the first example. In the
same way, the coefficient estimate is large so it's standard error is
also large; the question one needs to ask is, is the estimate of the
coefficient for hcp bounded away from zero, given the uncertainty in the
estimate.

If you were to produce a profile confidence interval it too would be
large.

So you have a large estimate, which is somewhat uncertain. Given that
the slope of the log likelihood is low at the estimate and quite
different from the slope at \beta == 0, it is not unreasonable to assume
that the Hauck Donner effect might be present...

> Anyway, how should I consider this result? Should I find another way to
> analyze this process or I could consider this as correct?

...however, in the case of the snippet of data you showed, it doesn't
affect the result - on the basis of the Wald test you would still accept
that hcp is significant/important. The Hauck Donner effect might be
leading to a lower value of the test statistic, but it hasn't affected
the outcome of the test.

To check, fit the model with and without hcp and then use the anova()
function to compare the two models. This will do a likelihood ratio
test.

> If I am understanding this enough, this warning message

Possibly, but it could just be that the fitted probabilities really are
0 or 1.

>  and the very high
> estimates should be due to  /Hauck-Donner/. Regarding that reference to
> Venables and Ripley (2002) on this issue, I have found this (
> http://kups.ku.edu/maillist/classes/ps707/2005/msg00023.html Hauck-Donner  )
> where it is said that "The practical advice, then, is to run the model with
> all of the variables, and then run again with the questionable one removed,
> and conduct a likelihood ratio test./ and I suppose that the p-values for
> hcp should be the LRT p-value, isn't it?

Yes. Well it is the result of applying a likelihood ratio test. I don't
think there is such a thing as *the* p-value for a term in a model, just
different ways of computing *a* p-value.

In this case, what does it matter? If the Wald test is *under*estimating
z but the term *is* still significant, the LRT should only confirm this
and give an even lower p-value than the already very low one.

> Thanks for taking your time to help me in this.

Would it hurt you to reply via an email? Regardless of what Nabble
thinks, R-help is a mailing list and your *posts* keep on removing all
the context - I have to keep on hunting for the thread in the archives
just to keep track of what you have told us.

G

> Simone
>
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/binomial-GLM-quasi-separation-tp3901687p3907716.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Loading...