Hi,
I've made a research about how to compare two regression line slopes (of y versus x for 2 groups, "group" being a factor ) using R. I knew the method based on the following statement : t = (b1 - b2) / sb1,b2 where b1 and b2 are the two slope coefficients and sb1,b2 the pooled standard error of the slope (b) which can be calculated in R this way: > df1 <- data.frame(x=1:3, y=1:3+rnorm(3)) > df2 <- data.frame(x=1:3, y=1:3+rnorm(3)) > fit1 <- lm(y~x, df1) > s1 <- summary(fit1)$coefficients > fit2 <- lm(y~x, df2) > s2 <- summary(fit2)$coefficients > db <- (s2[2,1]-s1[2,1]) > sd <- sqrt(s2[2,2]^2+s1[2,2]^2) > df <- (fit1$df.residual+fit2$df.residual) > td <- db/sd > 2*pt(-abs(td), df) [1] 0.9510506 However, I also found a procedure in Wonnacott & Wonnacott, that is based on the use of a mute variable D that will have a binary value according to the group to which a given point belongs (group : D=0; group 2: D=1). Then the equation that is computed is as follow: y = b0 + b1.x + D.b2.x which can be computed in R with: > fit <- lm(y ~ group + x + x:group) where y is the response of the 2 groups. The p-value of x:group gives the probability for the two slopes to be different, and the estimated values of parameters are these of both populations. These two methods have already been described in the mailing list but not confronted and discussed. So, my questions are: - are these methods different ? - which one should be preferentially used ? This is not really a question about R but more about statisticsâ€¦ I don't think I'm really clear and I know I'm not rigorous at all in my descriptions, but I hope someone will understand me. Thanks, Etienne ------------------------------------------------------------------- Etienne Toffin, PhD Student Unit of Social Ecology UniversitĂ© Libre de Bruxelles, CP 231 Boulevard du Triomphe B-1050 Brussels Belgium Tel: +32(0)2/650.55.30 Fax: +32(0)/650.59.87 Skype: etienne_titou http://www.ulb.ac.be/sciences/use/toffin.html ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
Etienne Toffin <etoffin <at> ulb.ac.be> writes:
> > I've made a research about how to compare two regression line slopes > (of y versus x for 2 groups, "group" being a factor ) using R. > > I knew the method based on the following statement : > t = (b1 - b2) / sb1,b2 > where b1 and b2 are the two slope coefficients and sb1,b2 the pooled > standard error of the slope (b) > > However, I also found a procedure in Wonnacott & Wonnacott, that is > based on the use of a mute variable D that will have a binary value > according to the group to which a given point belongs (group : D=0; > group 2: D=1). Then the equation that is computed is as follow: > y = b0 + b1.x + D.b2.x > > which can be computed in R with: > > fit <- lm(y ~ group + x + x:group) > where y is the response of the 2 groups. > The p-value of x:group gives the probability for the two slopes to be > different, and the estimated values of parameters are these of both > populations. > > These two methods have already been described in the mailing list but > not confronted and discussed. > So, my questions are: > - are these methods different ? > - which one should be preferentially used ? > These procedures are identical: the first has the virtue of being very mechanical and transparent, but the second is much easier (and easier to extend, e.g. to multiple groups): dat <- data.frame(x=rep(1:3,2),y=rep(1:3,2)+rnorm(6), f=factor(rep(1:2,each=3))) test1 <- function(dat) { fits <- lapply(split(dat,dat$f),lm,formula=y~x) sums <- lapply(fits,summary) coefs <- lapply(sums,coef) db <- coefs[[2]]["x","Estimate"]-coefs[[1]]["x","Estimate"] sd <- sqrt(sum(sapply(coefs,function(x) x["x","Std. Error"])^2)) df <- sum(sapply(fits,"[[","df.residual")) td <- db/sd c(est=db,sd=sd,tstat=td,prt=2*pt(-abs(td),df)) } test2 <- function(dat) { fit <- lm(y~x*f,data=dat) coef(summary(fit))["x:f2",] } rbind(test1(dat),test2(dat)) est sd tstat prt [1,] 0.5333555 1.382019 0.3859249 0.7367364 [2,] 0.5333555 1.382019 0.3859249 0.7367364 > ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
In reply to this post by Toffin Etienne
Hi,
Yes, the two methods are equivalent. The p-value R calculates is based on the same t-statistic used in your manual analysis. You can see this by doing the second method: y2 = rbind(df1, df2) y2 = cbind(c(0,0,0,1,1,1), y2) summary(lm(y2[,3] ~ y2[,1] + y2[,2] + y2[,2]*y2[,1])) Look at the values you previously calculated and see where they reappear... print(td) print(db) print(sd) Looked at from the other way, the models with the D's and so on is one way to explain where the t-test comes from. Just do H0: b2=0 vs H1: b2!=0, and sprinkle some independence and normality assumptions. It's probably preferable to use the automatic lm based method, because then you specify the model explicitly, while with the seemingly recipe based approach the actual models and hypotheses your are testing may not be clear. Plus you get nice diagnostic statistics and pretty graphs. The downside is that you might get lured into complacency... Zhou Fang PS: Your model equation isn't right. In both, we are also allowing the intercept to vary between groups. So really you want y = c + D.b0 + b1.x + D.b2.x ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
In reply to this post by Toffin Etienne
Hi,
Yes, the two methods are equivalent. The p-value R calculates is based on the same t-statistic used in your manual analysis. You can see this by doing the second method: y2 = rbind(df1, df2) y2 = cbind(c(0,0,0,1,1,1), y2) summary(lm(y2[,3] ~ y2[,1] + y2[,2] + y2[,2]*y2[,1])) Look at the values you previously calculated and see where they reappear... print(td) print(db) print(sd) Looked at from the other way, the models with the D's and so on is one way to explain where the t-test comes from. Just do H0: b2=0 vs H1: b2!=0, and sprinkle some independence and normality assumptions. It's probably preferable to use the automatic lm based method, because then you specify the model explicitly, while with the seemingly recipe based approach the actual models and hypotheses your are testing may not be clear. Plus you get nice diagnostic statistics and pretty graphs. The downside is that you might get lured into complacency... Zhou Fang PS: Your model equation isn't right. In both, we are also allowing the intercept to vary between groups. So really you want y = c + D.b0 + b1.x + D.b2.x ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
This post has NOT been accepted by the mailing list yet.
In reply to this post by Toffin Etienne
Hi friends!
Can someone please explain what '2*pt(-abs(td), df)' means in the code below: > df1 <- data.frame(x=1:3, y=1:3+rnorm(3)) > df2 <- data.frame(x=1:3, y=1:3+rnorm(3)) > fit1 <- lm(y~x, df1) > s1 <- summary(fit1)$coefficients > fit2 <- lm(y~x, df2) > s2 <- summary(fit2)$coefficients > db <- (s2[2,1]-s1[2,1]) > sd <- sqrt(s2[2,2]^2+s1[2,2]^2) > df <- (fit1$df.residual+fit2$df.residual) > td <- db/sd > 2*pt(-abs(td), df) [1] 0.9510506 What I see above in terms of the formula is: sd = sqrt(SE^2 + SE^2) and db = b2-b1 Then td = db/sd So here the denominator (that is Sb1,b2) is only sd. However, from what I understand (following Biostatistical Analysis by Zar, 1999), Sb1,b2 (or S(b1-b2)) = Sqrt([(s^2y.x)p/(sigma x^2)1] + [(s^2y.x)p/(sigma x^2)2]) (s^2y.x)p is the pooled residual mean square, and calculated as: [(residual SS)1 + (residual SS)2]/[(residual DF)1 + (residual DF)2] So what I infer from the discrepancy above is that, the formula used in the above code is different from the one I'm using. So, I'd definitely like to understand what '2*pt(-abs(td), df)' meant to do so that I can use the code above. Thanks for helping! |
Free forum by Nabble | Edit this page |