# binomial GLM quasi separation

## binomial GLM quasi separation

Hi all,

I have run a glm() analysis where the dependent variable is gender (family = binomial) and the predictors are percentages. I get a warning saying "fitted probabilities numerically 0 or 1 occurred", which indicates that separation or quasi-separation is occurring. This makes sense, given that one of the predictors has a very influential effect that depends on a specific threshold: in my analysis that variable predicts males about 80% of the time when its values are less than or equal to zero, and females about 80% of the time when its values are greater than zero. I have looked at other posts about this, but I haven't understood how I should proceed when the separation (or quasi-separation) is not a statistical artifact but something real. As suggested in http://r.789695.n4.nabble.com/OT-quasi-separation-in-a-logistic-GLM-td875726.html#a3850331 (the last post is mine), I tried the brglm procedure, which uses a penalized maximum likelihood, but it made no difference. What would you do if you were in my shoes?

Thanks in advance for any help.

Simone
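[Editor's note: the glm-vs-brglm comparison the poster describes can be sketched as below. This is only a sketch: the predictor names `x1` and `x2` and the data frame `dati` are placeholders, and brglm (Kosmidis' bias-reduction package) is assumed to be installed.]

```r
## Compare ordinary maximum likelihood with Firth-type bias-reduced
## estimation. Under (quasi-)separation the ML coefficients and their
## standard errors diverge, while the bias-reduced ones stay finite,
## so a side-by-side comparison is a quick diagnostic.
library(brglm)

fit.ml <- glm(sex ~ x1 + x2, family = binomial, data = dati)
fit.br <- brglm(sex ~ x1 + x2, family = binomial, data = dati)

## If the two columns are of similar magnitude, separation is unlikely
## to be the explanation for the warning.
cbind(ML = coef(fit.ml), BR = coef(fit.br))
```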

## Re: binomial GLM quasi separation

lincoln <... hotmail.com> writes:

> [original question snipped]

I'm not sure what's going on here, and I don't know why brglm() shouldn't work. From a squint at your Nabble post (I can't really see the figure very well), I agree that the hcp profile is funky, but I wouldn't immediately conclude that the profile is bad. In particular, the x-axis range seems to be -45 to -15, rather than something like (-600, -300) as I would expect from the estimated parameter (ca. -400) and standard error (ca. 60). I would start by setting which=3 (to confine your attention to the hcp parameter) and experimenting with the gridsize, stepsize, and stdn parameters in profileModel to see what's going on. If that doesn't work, you might have to post your data, or a subset of it, in order to get any more help.
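[Editor's note: Ben's suggestion amounts to a call along these lines. The specific values for gridsize, stepsize, and stdn are arbitrary starting points to experiment with, and `fit` stands for the poster's fitted glm object; `which` indexes the coefficient to profile.]

```r
library(profileModel)

## Profile only the suspect coefficient and widen/refine the grid to
## check whether the odd-looking profile is a plotting artifact or a
## genuine feature of the log-likelihood.
pp <- profileModel(fit,
                   which     = 3,                   # the hcp coefficient
                   quantile  = qchisq(0.95, 1),
                   objective = "ordinaryDeviance",
                   gridsize  = 40,                  # more grid points
                   stepsize  = 50,                  # coarser steps
                   stdn      = 10)                  # profile further out
plot(pp)
```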
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

## Re: binomial GLM quasi separation

On 13.10.2011 21:46, Ben Bolker wrote:

> [quoted text snipped]

Or, if just the separating hyperplane is to be found (and no tests have to be considered), I'd use an LDA rather than logistic regression in such a case.

Uwe Ligges
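[Editor's note: Uwe's suggestion can be sketched as follows, using the variable names that appear later in the thread. This assumes classification is the only goal (LDA gives no Wald/LRT-style inference for the coefficients) and that rows with missing values are dropped first.]

```r
## Linear discriminant analysis sidesteps the divergence of the
## logistic MLE when the classes are (quasi-)separable: it estimates
## the separating hyperplane from the class means and pooled covariance.
library(MASS)

dd      <- na.omit(dati)                        # lda needs complete cases
fit.lda <- lda(sex ~ twp + hwp + hcp + hnp, data = dd)
pred    <- predict(fit.lda)

## Confusion matrix: how well the hyperplane classifies the sexes.
table(observed = dd$sex, predicted = pred$class)
```

Note that LDA formally assumes (multivariate) normal predictors within each class, which is the concern Simone raises further down the thread.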

## Re: binomial GLM quasi separation

In reply to this post by bbolker

As you suggested, I took a further look at the profile by changing the default value of stepsize (I tried modifying the others, but apparently there was no change). Here are the scripts I used:

```r
> dati <- read.table("simone.txt", header = TRUE, sep = "\t", as.is = TRUE)
> glm.sat <- glm(sex ~ twp + hwp + hcp + hnp, binomial, data = dati)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> pp <- profileModel(glm.sat, quantile = qchisq(0.95, 1),
+                    objective = "ordinaryDeviance", which = 4)
Preliminary iteration . Done
Profiling for parameter hcp ... Done
> pp1 <- profileModel(glm.sat, quantile = qchisq(0.95, 1),
+                     objective = "ordinaryDeviance", which = 4, stepsize = 20)
Preliminary iteration . Done
Profiling for parameter hcp ... Done
> pp2 <- profileModel(glm.sat, quantile = qchisq(0.95, 1),
+                     objective = "ordinaryDeviance", which = 4, stepsize = 100)
Preliminary iteration . Done
Profiling for parameter hcp ... Done
> plot(pp);  mtext("Default stepsize", adj = 0, cex = 2, line = 1)
> dev.new(); plot(pp1); mtext("Stepsize=20", adj = 0, cex = 2, line = 1)
> dev.new(); plot(pp2); mtext("Stepsize=100", adj = 0, cex = 2, line = 1)
```

And this is how the plots look:
http://r.789695.n4.nabble.com/file/n3904261/plot1.png
http://r.789695.n4.nabble.com/file/n3904261/plot2.png
http://r.789695.n4.nabble.com/file/n3904261/plot3.png

I have tried to understand what is going on, but I don't know how to interpret this. I have been trying to solve it for quite a long time without success. Here (simone.txt, http://r.789695.n4.nabble.com/file/n3904261/simone.txt) I attach a subset of the data I am working with, comprising the variables specified in the above glm model, including the "funky" variable hcp.

Thank you for taking the time to help me with this.

## Re: binomial GLM quasi separation

On Fri, 2011-10-14 at 02:32 -0700, lincoln wrote:

> As you suggested I took a further look at the profile by changing the
> default value of stepsize (I tried to modify the others but apparently
> there was no change).

Have you read ?glm, specifically this bit:

```
Details:
....
     For the background to warning messages about 'fitted probabilities
     numerically 0 or 1 occurred' for binomial GLMs, see Venables &
     Ripley (2002, pp. 197-8).
```

There, V&R say (me paraphrasing) that if there are some large fitted \beta_i, the curvature of the log-likelihood at the fitted \beta can be much less than at \beta_i = 0. The Wald approximation then underestimates the change in the log-likelihood on setting \beta_i = 0, and as the absolute value of the fitted \beta_i becomes large (tends to infinity), the t statistic tends to 0. This is the Hauck-Donner effect. Whilst I am (so very) far from being an expert here, this does seem to fit the results you showed.

Furthermore, did you follow the steps Ioannis Kosmidis took me through with my data in that email thread? I have with your data, and everything seems to follow the explanation/situation given by Ioannis. Namely, if you increase the number of iterations and the tolerance in the glm() call, you get the same fit as with a standard glm() call:

```r
> ## normal glm()
> summary(mod)

Call:
glm(formula = sex ~ twp + hwp + hcp + hnp, family = binomial,
    data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.9703  -0.1760   0.3181   0.6061   3.5235

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.4362     0.2687   5.345 9.02e-08 ***
twp            5.5673     1.3602   4.093 4.26e-05 ***
hwp           -4.2793     2.3781  -1.799   0.0719 .
hcp         -450.1918    56.6823  -7.942 1.98e-15 ***
hnp           -4.5302     3.2825  -1.380   0.1676
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 581.51  on 425  degrees of freedom
Residual deviance: 294.00  on 421  degrees of freedom
  (41 observations deleted due to missingness)
AIC: 304

Number of Fisher Scoring iterations: 8

> ## now with control = glm.control(epsilon = 1e-16, maxit = 1000)
> ## in the call
> summary(mod3)

Call:
glm(formula = sex ~ twp + hwp + hcp + hnp, family = binomial,
    data = dat, control = glm.control(epsilon = 1e-16, maxit = 1000))

[deviance residuals, coefficients, and deviances identical to the above]

Number of Fisher Scoring iterations: 9
```

From the thread you link to, we would expect the effects to diverge. Profiling the model shows that the log-likelihood starts to increase again at low values, but does so slowly. The log-likelihood is very flat around the estimates and is far from 0, which seems to correspond with the description of the Hauck-Donner effect given by Venables and Ripley in their book. In your case, however, the statistic is still sufficiently large for the term to be identified as significant via the Wald test.

If we fit the model via brglm() we get essentially the same "result" as fitted by glm():

```r
> summary(mod2)

Call:
brglm(formula = sex ~ twp + hwp + hcp + hnp, family = binomial,
    data = dat)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.4262     0.2662   5.357 8.47e-08 ***
twp            5.3696     1.3323   4.030 5.57e-05 ***
hwp           -4.2813     2.3504  -1.821   0.0685 .
hcp         -435.9212    55.0566  -7.918 2.42e-15 ***
hnp           -4.6295     3.2459  -1.426   0.1538
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 560.48  on 425  degrees of freedom
Residual deviance: 294.08  on 421  degrees of freedom
Penalized deviance: 302.8212
  (41 observations deleted due to missingness)
AIC:  304.08
```

Ioannis points out that if there were separation, the brglm() results would differ markedly from the glm() ones, and they don't. As Ioannis mentioned in that thread, you can get the warning about the fitted probabilities without a separation problem: in my case it was because there were some very small fitted probabilities, which R was just warning about.

HTH

G

> [quoted text snipped]

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

## Re: binomial GLM quasi separation

In reply to this post by Uwe Ligges-3

#Uwe: I have realized that in the first linked post (OT-quasi-separation-in-a-logistic-GLM) I said something misleading: in fact my independent variables are not log-normally distributed, since there are a lot of zeros, which are the most frequent values. I have not been able to normalize them, and I don't even know whether it is possible. Given the assumption of normally distributed predictors, I believe I can't use an LDA.

#Gavin: I have read your thread carefully, but I am not sure I understand what you are suggesting (my gaps in statistics!). You say that it should be due to the Hauck-Donner effect and that it is not a quasi-separation or separation issue. Even so, I still don't understand why I found such a high asymptotic standard error. Anyway, how should I treat this result? Should I find another way to analyze this process, or can I consider it correct? If I understand correctly, the warning message and the very high estimates should be due to Hauck-Donner. Regarding the reference to Venables and Ripley (2002) on this issue, I found a page (Hauck-Donner) where it is said that "The practical advice, then, is to run the model with all of the variables, and then run again with the questionable one removed, and conduct a likelihood ratio test." I suppose that the p-value for hcp should then be the LRT p-value, shouldn't it?

Thanks for taking the time to help me with this.

Simone
Reply | Threaded
Open this post in threaded view
|
Report Content as Inappropriate

## Re: binomial GLM quasi separation

On Sat, 2011-10-15 at 09:11 -0700, lincoln wrote:

> I am still unsure why I found such a high asymptotic standard error.

I don't believe this is a separation issue: the sorts of things we'd expect to see if this were due to separation do not show up. Given the large estimate for the coefficient for the term, it is not that surprising that the associated uncertainty is also high:

```r
> set.seed(2)
> var(runif(100))
[1] 0.08911998
> set.seed(2)
> var(runif(100) * 10000)
[1] 8911998
```

All I did there was increase the "units" of the data in the second example, and the variance is huge, but only because the data were expressed in units 10000 times larger than in the first example. In the same way, the coefficient estimate is large, so its standard error is also large; the question one needs to ask is whether the estimate of the coefficient for hcp is bounded away from zero, given the uncertainty in the estimate. If you were to produce a profile confidence interval, it too would be wide. So you have a large estimate, which is somewhat uncertain. Given that the slope of the log-likelihood is low at the estimate and quite different from the slope at \beta = 0, it is not unreasonable to assume that the Hauck-Donner effect might be present...

> Anyway, how should I consider this result? Should I find another way
> to analyze this process or could I consider this as correct?

...however, in the case of the snippet of data you showed, it doesn't affect the result: on the basis of the Wald test you would still accept that hcp is significant/important. The Hauck-Donner effect might be leading to a lower value of the test statistic, but it hasn't affected the outcome of the test.

To check, fit the model with and without hcp and then use the anova() function to compare the two models. This will do a likelihood ratio test.

> If I am understanding this enough, this warning message and the very
> high estimates should be due to Hauck-Donner.

Possibly, but it could just be that the fitted probabilities really are 0 or 1.

> I suppose that the p-value for hcp should be the LRT p-value, isn't it?

Yes; well, it is the result of applying a likelihood ratio test. I don't think there is such a thing as *the* p-value for a term in a model, just different ways of computing *a* p-value. In this case, what does it matter? If the Wald test is *under*estimating z but the term *is* still significant, the LRT should only confirm this and give an even lower p-value than the already very low one.

> Thanks for taking your time to help me in this.

Would it hurt you to reply via email? Regardless of what Nabble thinks, R-help is a mailing list, and your posts keep removing all the context; I have to keep hunting for the thread in the archives just to keep track of what you have told us.

G
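[Editor's note: the anova() comparison Gavin describes can be sketched as below, using the model object `glm.sat` from earlier in the thread. One caveat, labelled in the comments: hcp has missing values, so both models must be fitted to the same rows for the deviances to be comparable.]

```r
## Fit the full and reduced models on an identical set of complete
## cases, then compare deviances with a chi-squared likelihood ratio
## test. drop1() performs the same single-term test for every term.
dd      <- na.omit(dati)   # same rows for both fits (hcp has NAs)
glm.sat <- glm(sex ~ twp + hwp + hcp + hnp, binomial, data = dd)
glm.red <- update(glm.sat, . ~ . - hcp)

anova(glm.red, glm.sat, test = "Chisq")   # LRT for hcp
drop1(glm.sat, test = "Chisq")            # LRTs for all terms at once
```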