I'm trying to do a Hosmer-Lemeshow 'goodness of fit' test on my logistic regression model.
I found some code here:
http://sas-and-r.blogspot.com/2010/09/example-87-hosmer-and-lemeshow-goodness.html

The R code above is a little complicated for me, but I'm having trouble with my answer:

    Hosmer-Lemeshow:                              p = 0.6163585
    le Cessie and Houwelingen test (Design library): p = 0.2843620

The above link indicated they should be approximately equal, which in my case they are not. Any suggestions, or is there a package function people would recommend in R for use with a logistic regression model?

Thanks in advance,

-Rob Schutt

--------------------------------
Robert Schutt, MD, MCS
Resident - Department of Internal Medicine
University of Virginia, Charlottesville, Virginia
------------------

########################################################
# Compute the Hosmer-Lemeshow 'goodness-of-fit' test
cd.full_model = glm(formula = Collaterals ~ CHF + Age + CABG +
                      relevel(as.factor(num.obst.vessels), "one") +
                      Current.smoker + DM + HTN + ace.inhibitor + MI,
                    family = binomial(link = "logit"))

hosmerlem = function(y, yhat, g = 10) {
  cutyhat = cut(yhat,
                breaks = quantile(yhat, probs = seq(0, 1, 1/g)),
                include.lowest = TRUE)
  obs    = xtabs(cbind(1 - y, y) ~ cutyhat)
  expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
  chisq  = sum((obs - expect)^2 / expect)
  P = 1 - pchisq(chisq, g - 2)
  return(list(chisq = chisq, p.value = P))
}

hosmerlem(y = Collaterals, yhat = fitted(cd.full_model))

########################################################
# Compute the le Cessie and Houwelingen test
library(Design)   # lrm() comes from Design, so load it first
f <- lrm(Collaterals ~ CHF + Age + CABG +
           relevel(as.factor(num.obst.vessels), "one") +
           Current.smoker + DM + HTN + ace.inhibitor + MI,
         x = TRUE, y = TRUE)
resid(f, 'gof')

Output:

> hosmerlem(y=Collaterals, yhat=fitted(cd.full_model))
$chisq
[1] 6.275889

$p.value
[1] 0.6163585

> resid(f, 'gof')
Sum of squared errors     Expected value|H0                    SD
          118.5308396           118.3771115             0.1435944
                    Z                     P
            1.0705717             0.2843620
Please read the documentation carefully, and replace the Design package with the newer rms package.
The older Hosmer-Lemeshow test requires binning and has lower power. It also does not penalize for overfitting. The newer goodness-of-fit test in rms/Design should not be expected to agree with Hosmer-Lemeshow.

Frank
Frank Harrell
Department of Biostatistics, Vanderbilt University
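[Editorial note: Frank's point that the Hosmer-Lemeshow test "requires binning" can be sketched concretely: the p-value depends on the arbitrary choice of the number of groups g. The data and seed below are simulated for illustration.]

```r
# One model, several Hosmer-Lemeshow p-values, depending only on how we bin.
set.seed(2)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

hl.p <- function(y, yhat, g) {
  cutyhat <- cut(yhat,
                 breaks = quantile(yhat, probs = seq(0, 1, 1/g)),
                 include.lowest = TRUE)
  obs    <- xtabs(cbind(1 - y, y) ~ cutyhat)
  expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
  1 - pchisq(sum((obs - expect)^2 / expect), g - 2)
}

p.by.g <- sapply(c(6, 8, 10, 12), function(g) hl.p(y, fitted(fit), g))
p.by.g  # four different p-values for the same (correctly specified) model
```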
Again, thanks so much for your help.
Again, thanks so much for your help.

Is there a "for dummies" version for interpreting the le Cessie and Houwelingen test? I read the 1991 Biometrics paper but honestly got lost in the math. Is it interpreted the same way the Hosmer-Lemeshow test is, i.e., a non-significant result means the model fits well? What would my p-value of p = 0.284362 tell you about my model?

I would like to describe the goodness of fit of my model in a paper, but I'm worried the average medical reader would not know how to interpret the result of this test, whereas there are lots of references on interpreting Hosmer-Lemeshow. (My cross-validated c-statistic was 0.69 and R^2 was 0.15.)

Thanks again for all your help! (also with my k-fold cross-validation question!)

-Rob

--------------------------------
Robert Schutt, MD, MCS
Resident - Department of Internal Medicine
University of Virginia, Charlottesville, Virginia
The test in the rms package's residuals.lrm function is the le Cessie - van Houwelingen - Copas - Hosmer unweighted sum of squares test for global goodness of fit. Like all statistical tests, a large P-value carries no information other than that there was not sufficient evidence to reject the null hypothesis. Here the null hypothesis is that the true probabilities are those specified by the model. Such an omnibus test, though having good general power (better than Hosmer-Lemeshow), lacks power to detect specific alternatives. That's why I spend more time allowing for nonlinearity of predictors.
Frank
Frank Harrell
Department of Biostatistics, Vanderbilt University
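[Editorial note: "allowing for nonlinearity of predictors" can be sketched with a spline term. The example below uses ns() from the splines package, which ships with base R; with the rms package one would instead use rcs() inside lrm(). The data are simulated for illustration.]

```r
# Replace a linear age term with a natural spline and test the nonlinearity.
library(splines)
set.seed(3)
n   <- 400
age <- runif(n, 30, 80)
y   <- rbinom(n, 1, plogis(-3 + 0.002 * (age - 55)^2))  # truly nonlinear effect

fit.lin <- glm(y ~ age,        family = binomial)
fit.ns  <- glm(y ~ ns(age, 3), family = binomial)        # 3 df spline basis

# A likelihood-ratio test of the spline fit against the linear fit asks
# whether the nonlinear terms are needed:
anova(fit.lin, fit.ns, test = "Chisq")
```

This targets a specific alternative (a nonlinear dose-response) rather than relying on an omnibus test to notice it.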
I am a new user in R, so I am sorry if this is a basic question, but I am kind of lost here and I really would appreciate your help.
I have behavioral information on sea lions, and I have fit a binomial generalized linear model using Rcmdr to understand what variables affect where males' fights take place (water or land). One of my candidate models includes aggression type, the number of females around the fight, and the temperature at the moment.

*******************
> GLM.2 <- glm(Place ~ AgType + Females + Temp, family=binomial(cloglog),
+   data=June)
> summary(GLM.2)

Call:
glm(formula = Place ~ AgType + Females + Temp, family = binomial(cloglog),
    data = June)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4448  -0.7248   0.4220   0.6741   1.8798

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.43071    0.82164   4.175 2.97e-05 ***
AgType       -1.23463    0.37166  -3.322 0.000894 ***
Females       0.05812    0.01320   4.404 1.06e-05 ***
Temp         -0.08614    0.02453  -3.512 0.000444 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 175.79  on 141  degrees of freedom
Residual deviance: 125.33  on 138  degrees of freedom
AIC: 133.33

Number of Fisher Scoring iterations: 8
****************************************

When I run the hosmerlem test, it seems that I have a good fit to the data:

************************
> hosmerlem(Place, fitted(GLM.2))
      X^2        Df   P(>Chi)
7.5123428 8.0000000 0.4824925
*****************************

However, if I run only AgType, or AgType + Temp, the hosmerlem test gives me an error that says: ERROR: 'breaks' are not unique. I have been getting the same error when I try to run other models, or when I use different link functions like logit.

So, right now my questions are:

1. What does this error mean?
As far as I can see in the hosmerlem test code that I am using, it seems to depend on the yhat probs, but to be honest I don't really understand what this "yhat" is. The code I am using is:

hosmerlem <- function (y, yhat, g = 10)
{
    cutyhat <- cut(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)),
                   include.lowest = TRUE)
    obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
    expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
    chisq <- sum((obs - expect)^2/expect)
    P <- 1 - pchisq(chisq, g - 2)
    c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P)
}

2. I have seen in older posts that the le Cessie - van Houwelingen - Copas - Hosmer unweighted sum of squares test for global goodness of fit seems to be a better alternative. However, I am probably doing something wrong because I can't get it to work. I installed and loaded the rms package and ran the following command:

> resid(GLM.2, 'gof')

As far as I understand, 'gof' is another package with this description: "Implementation of model-checking technique based on cumulative residuals." So, for the command I also installed and loaded the 'gof' package. However, I get this:

ERROR: 'arg' should be one of "deviance", "pearson", "working", "response", "partial"

If I use any of the arguments that it suggests, it gives me a list of residuals, but it doesn't show the Z or the P. So, if someone could help me understand what I am doing wrong, I really would appreciate it.
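[Editorial note: the pattern that makes resid(fit, 'gof') work appears in the first message of this thread: the model must be fit with lrm() (which provides a 'gof' residual type), not with glm(). A sketch, noting that the June data are not shared here and that lrm fits a logit link rather than the cloglog used above:]

```r
# resid(GLM.2, 'gof') fails because GLM.2 is a glm object; residuals.glm
# only knows "deviance", "pearson", "working", "response", "partial".
# The 'gof' type belongs to the rms package's lrm fits, which must be
# created with x=TRUE, y=TRUE:
library(rms)
f <- lrm(Place ~ AgType + Females + Temp, data = June, x = TRUE, y = TRUE)
resid(f, 'gof')  # le Cessie - van Houwelingen - Copas - Hosmer test
```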
At 19:34 29/03/2012, JimeBoHe wrote:
>I am a new user in R, so I am sorry if this is a basic question but I am
>kind of lost here and I really would appreciate your help.

Dear Jimena

Comments in-line

>I have behavioral information of sea lions and I've done a binomial
>Generalized Linear model using Rcmdr [...]

[model call, summary(GLM.2) output, and hosmerlem result snipped; quoted in full in the message above]

>However if I run only AgType, or AgType + Temp, when running the hosmerlem
>test, it gives me an error that says: ERROR: 'breaks' are not unique.
>
>So... right now my questions are:
>
>1. What does this error mean?
>As far as I can see in the hosmerlem test code that I am using, it seems
>to depend on the yhat probs, but to be honest I don't really understand
>what this "yhat" is.

yhat is the predicted value of y (the outcome variable). You do not provide us with a reproducible example, so this is only a guess, but I suspect that if you look at the values of yhat for the failing models you will find you have many replicated values, and so at least two values of the breakpoints are identical. But, as I say, this is at best a guess.

[hosmerlem function definition snipped; quoted in full in the message above]

>2. I've seen in older posts that the le Cessie - van Houwelingen - Copas -
>Hosmer unweighted sum of squares test for global goodness of fit seems to
>be a better alternative. However I am probably doing something wrong
>because I can't get it to work.
>
>I installed and loaded the rms package and I ran the following command:
>
> > resid(GLM.2, 'gof')
>
>As far as I understand 'gof' is another package with this description:
>Implementation of model-checking technique based on cumulative residuals.
>So, for the command I also installed and loaded the 'gof' package

You seem to be so far away from success here that I think remote advice is not going to help.

>However I get this: ERROR: 'arg' should be one of "deviance", "pearson",
>"working", "response", "partial"
>
>If I use any of the arguments that it suggests, it gives me a list of
>residuals, but it doesn't show the Z or the P.

That is because it is doing what it is documented to do. I am not familiar with the gof function, nor do I see it in the index for rms (as you implied), so it is a bit hard to know what you are trying to do here.
>So, if someone could help me understand what I am doing wrong I really
>would appreciate it.

Michael Dewey
[hidden email]
http://www.aghmed.fsnet.co.uk/home.html
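[Editorial note: Michael's diagnosis above, that tied yhat values produce duplicated quantile breakpoints, can be reproduced and worked around in base R. The data below are simulated for illustration; a common workaround is wrapping the breakpoints in unique().]

```r
# With many tied fitted values (here, a single binary predictor gives only
# two distinct yhat values), several quantile breakpoints coincide and
# cut() refuses duplicated breaks -- the 'breaks' are not unique error.
set.seed(4)
x    <- rbinom(200, 1, 0.5)               # one binary predictor
y    <- rbinom(200, 1, plogis(-0.5 + x))
yhat <- fitted(glm(y ~ x, family = binomial))

br <- quantile(yhat, probs = seq(0, 1, 1/10))
length(br) - length(unique(br))            # > 0: duplicated breakpoints

# cut(yhat, breaks = br, include.lowest = TRUE)   # would fail as reported
cutyhat <- cut(yhat, breaks = unique(br), include.lowest = TRUE)
table(cutyhat)
# Note the cost: fewer groups than requested, so the g - 2 degrees of
# freedom assumed inside hosmerlem() would no longer be correct.
```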