Time-dependent coefficients in a Cox model with categorical variants

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Time-dependent coefficients in a Cox model with categorical variants

Max Shell
Suppose I have a dataset contain three variants, looks like
> head(dta)

  Sex    tumorsize    Histology       time         status
    0            1.5            2              12.1000             0
    1            1.8            1              38.4000             0
.....................

Sex: 1 for male; 0 for female., two levels
Histology: 1 for SqCC; 2 for High risk AC; 3 for low risk AC, three levels
Now I need to get a Time-dependent coefficients cox fit:

library(survival)
for(i in c(1,3) dta[,i] <- factor(dta[,i])
fit <-
  coxph(
    Surv(time, status) ~  Sex + tumorsize +  Histology + tt(Histology),
    data = dta,
    tt = function(x, t, ...) x * log(t)
  )

But  I keep gettting this error says:

Error in if (any(infs)) warning(paste("Loglik converged before variable ",  :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(x, log(t)) : ‘*’ not meaningful for factors.

How can I fix it? I know that the "Sex" and "Histology" are both
categorical variants. I  want to have a model that have two β(t) = a +
blog(t) for each histology level.
Thank you!

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Time-dependent coefficients in a Cox model with categorical variants

David Winsemius

> On Jan 15, 2018, at 12:58 AM, Max Shell <[hidden email]> wrote:
>
> Suppose I have a dataset contain three variants, looks like
>> head(dta)
>
>  Sex    tumorsize    Histology       time         status
>    0            1.5            2              12.1000             0
>    1            1.8            1              38.4000             0
> .....................
>
> Sex: 1 for male; 0 for female., two levels
> Histology: 1 for SqCC; 2 for High risk AC; 3 for low risk AC, three levels
> Now I need to get a Time-dependent coefficients cox fit:
>
> library(survival)
> for(i in c(1,3) dta[,i] <- factor(dta[,i])
> fit <-
>  coxph(
>    Surv(time, status) ~  Sex + tumorsize +  Histology + tt(Histology),
>    data = dta,
>    tt = function(x, t, ...) x * log(t)
>  )
>
> But  I keep gettting this error says:
>
> Error in if (any(infs)) warning(paste("Loglik converged before variable ",  :
>  missing value where TRUE/FALSE needed
> In addition: Warning message:
> In Ops.factor(x, log(t)) : ‘*’ not meaningful for factors.

The error message seems pretty clear. You are passing a factor to the x parameter of the tt function, and then you are attempting to multiply that value times log(t). You are lucky that it was a factor and not jsut an integer because then you might not have gotten an error or a warning.  I worry that your hopes of separate estimates may not be easily supportable by the tt-mechanism. However, the example of its use in the help page for `coxph` shows a spline function being passed and the boundary knots and weights mush estimated, so my fears may be over-blown. The knot locations and weights are not reported in the print.coxph but it doesn't look too difficult to extract then from the attributes of the model.

You might look at the mechanism for estimation of spline component effects to see if you can learn how to estimate multiple components:

> coef( coxph(Surv(time, status) ~ ph.ecog + tt(age), data=lung,
+      tt=function(x,t,...) pspline(x + t/365.25))  )
           ph.ecog  ps(x + t/365.25)3  ps(x + t/365.25)4  ps(x + t/365.25)5  ps(x + t/365.25)6  ps(x + t/365.25)7
         0.4528363          0.2426635          0.4876185          0.7796924          1.0160954          1.0765967
 ps(x + t/365.25)8  ps(x + t/365.25)9 ps(x + t/365.25)10 ps(x + t/365.25)11 ps(x + t/365.25)12 ps(x + t/365.25)13
         1.0449439          0.9170725          0.9276695          1.1349794          1.2837341          1.7024045
ps(x + t/365.25)14
         2.1863712

> attr( coxph(Surv(time, status) ~ ph.ecog + tt(age), data=lung,
+      tt=function(x,t,...) pspline(x + t/365.25))$terms, "predvars" )
list(Surv(time, status), ph.ecog, pspline(age, nterm = 10, intercept = FALSE,
    Boundary.knots = c(39.0136892539357, 82.0848733744011), `NA` = NULL))

The coefficients for the spline also get reported as exponentiated values in the summary output. And if you  used a crossing operator in the formula you get some sort of interaction result. Whether it has any sensible interpretation is decision that's above my pay grade.

The code for `pspline` is readily available. It's not even necessary to sue the triple-colon or getAnywhere functions.

--
David.

>
> How can I fix it? I know that the "Sex" and "Histology" are both
> categorical variants. I  want to have a model that have two β(t) = a +
> blog(t) for each histology level.
> Thank you!
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Time-dependent coefficients in a Cox model with categorical variants

Therneau, Terry M., Ph.D.
In reply to this post by Max Shell

First, as others have said please obey the mailing list rules and turn of
First, as others have said please obey the mailing list rules and turn off html, not everyone uses an html email client.

Here is your code, formatted and with line numbers added.  I also fixed one error: "y" should be "status".

1. fit0 <- coxph(Surv(futime, status) ~ x1 + x2 + x3, data = data0)
2. p <- log(predict(fit0, newdata = data1, type = "expected"))
3. lp <- predict(fit0, newdata = data1, type = "lp")
4. logbase <- p - lp
5. fit1 <- glm(status ~ offset(p), family = poisson, data = data1)
6. fit2 <- glm(status~ lp + offset(logbase), family = poisson, data = data1)
7. group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf))
8. fit3 <- glm(status ~ -1 + group + offset(p), family = poisson, data = data1)

The key idea of the paper you referenced is that the counterpart to the Hosmer-Lemishow test (wrong if used directly in a Cox model) is to look at the predicted values from a Cox model as input to a Poisson regression.  That means adding the expected from the Cox model as a fixed term in the Poisson.  And like any other poisson that means offset(log(expected)) as a term.

The presence of time dependent covariates does nothing to change this, per se, since expected for time fixed is the same as for time varying.  In practice it does matter, at least philosophically.  Lines 1, 2, 5 do this just fine.

If data1 is not the same as data0, a new study say, then the test for intercept=0 from fit1 is a test of overall calibration.  Models like line 8 try to partition out where any differences actually lie.

The time-dependent covariates part lies in the fact that a single subject may be represented by multiple lines in data0 and/or data1.  Do you want to collapse that person into a single row before the glm fits?  If subject "Jones" is represented by 15 lines in the data and "Smith" by 2, it does seem a bit unfair to give Jones 15 observations in the glm fit.  But full discussion of this is as much philosophy as statistics, and is perhaps best done over a beer.

Terry T.

________________________________
From: Max Shell [[hidden email]]
Sent: Wednesday, January 17, 2018 10:25 AM
To: Therneau, Terry M., Ph.D.
Subject: Re: Time-dependent coefficients in a Cox model with categorical variants

Assessing calibration of Cox model with time-dependent coefficients<https://stats.stackexchange.com/questions/323569/assessing-calibration-of-cox-model-with-time-dependent-coefficients>

I am trying to find methods for testing and visualizing calibration to Cox models with time-depended coefficients. I have read your nice article<http://journals.sagepub.com/doi/10.1177/0962280213497434>. In this paper, we can fit three models:

fit0 <- coxph(Surv(futime, status) ~ x1 + x2 + x3, data = data0) p <- log(predict(fit0, newdata = data1, type = "expected")) lp <- predict(fit0, newdata = data1, type = "lp") logbase <- p - lp fit1 <- glm(y ~ offset(p), family = poisson, data = data1) fit2 <- glm(y ~ lp + offset(logbase), family = poisson, data = data1) group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf)) fit3 <- glm(y ~ -1 + group + offset(p), family = poisson, data = data1)

Here$B!$(BI simplely use data1$B!!(B<- data0[1:500,]

First, I get following error when running line 5.

Error in eval(predvars, data, env) : object 'y' not found

So I modifited the code by replacing the y as status looks like this:

fit1 <- glm(status ~ offset(p), family = poisson, data = data1) fit2 <- glm(status ~ lp + offset(logbase), family = poisson, data = data1) group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf)) fit3 <- glm(status ~ -1 + group + offset(p), family = poisson, data = data1)

Is this replacing correct?

Second, I try to introduce the time-transform use coxph with ttparament.

My code is:  fit0 <- coxph(Surv(time, status) ~ x1 + x2 + x3 + tt(x3), data = data0, function(x, t, ...) x * t) p <- log(predict(fit0, newdata = data1, type = "expected")) lp <- predict(fit0, newdata = data1, type = "lp") logbase <- p - lp fit1 <- glm(status ~ offset(p), family = poisson, data = data1) fit2 <- glm(status ~ lp + offset(logbase), family = poisson, data = data1) group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf)) fit3 <- glm(status ~ -1 + group + offset(p), family = poisson, data = data1)

My questions is:

  *   Is the code above correct?
  *   How to interpret the fit1, fit2, fit3? What's the connection between the three models and the calibration of the Cox model?
  *   How to generate the calibration plot using fit3? The article dose have a section discuss this, but no code is provided.

Thank you!

On Mon, Jan 15, 2018 at 9:23 PM, Therneau, Terry M., Ph.D. <[hidden email]<mailto:[hidden email]>> wrote:
The model formula " ~ Histology" knows how to change your 3 level categorical variable into two 0/1 dummy variables for a regression matrix.  The tt() call is a simple function, however, and ordinary multiplication and does not have those powers.  In this case you need to do the setup by hand : create your own 0/1 dummy variables and work with them.

sqcc <- ifelse(dta$Histology == 'Sqcc', 0, 1)
hrac <- ifelse(dta$Histology == 'High risk AC', 0, 1)
fit <- coxph(Surv(time, status) ~ Sex + sqcc + hrac + tt(sqcc) + tt(hrac),
                 data = dta, tt = list(function(x,t, ...) x*log(t),
                                       function(x, t, ...) x* log(t)))


Terry Therneau

PS I've rarely found x*log(t) to be useful, but perhaps you have already looked at the cox.zph plots and see that shape.


Suppose I have a dataset contain three variants, looks like
head(dta)
   Sex    tumorsize    Histology       time         status
     0            1.5            2              12.1000             0
     1            1.8            1              38.4000             0
.....................

Sex: 1 for male; 0 for female., two levels
Histology: 1 for SqCC; 2 for High risk AC; 3 for low risk AC, three levels
Now I need to get a Time-dependent coefficients cox fit:

library(survival)
for(i in c(1,3) dta[,i] <- factor(dta[,i])
fit <-
   coxph(
     Surv(time, status) ~  Sex + tumorsize +  Histology + tt(Histology),
     data = dta,
     tt = function(x, t, ...) x * log(t)
   )

But  I keep gettting this error says:

Error in if (any(infs)) warning(paste("Loglik converged before variable ",  :
   missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(x, log(t)) : ?*? not meaningful for factors.

How can I fix it? I know that the "Sex" and "Histology" are both
categorical variants. I  want to have a model that have two ?(t) = a +
blog(t) for each histology level.
Thank you?



        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Time-dependent coefficients in a Cox model with categorical variants

Jeff Newmiller
Offlist... for your information...

It is unfair to suggest that the mailing list participants are at fault for using old software.  Even if the mailing list participants use email programs that can handle HTML, any email that goes through the list gets the formatting stripped, which leaves it damaged to some degree. It might not seem like this because sometimes you CAN see formatting, but that only happens when you are listed in the "To" or "Cc" fields... the rest of the list saw a stripped version regardless of how good their mail program was. Just go look at the archives to confirm this. Net result is the rest of the participants see a more or less damaged version of the discussion/code whenever HTML is used on list.
--
Sent from my phone. Please excuse my brevity.

On January 18, 2018 11:38:17 AM PST, "Therneau, Terry M., Ph.D." <[hidden email]> wrote:

>
>First, as others have said please obey the mailing list rules and turn
>of
>First, as others have said please obey the mailing list rules and turn
>off html, not everyone uses an html email client.
>
>Here is your code, formatted and with line numbers added.  I also fixed
>one error: "y" should be "status".
>
>1. fit0 <- coxph(Surv(futime, status) ~ x1 + x2 + x3, data = data0)
>2. p <- log(predict(fit0, newdata = data1, type = "expected"))
>3. lp <- predict(fit0, newdata = data1, type = "lp")
>4. logbase <- p - lp
>5. fit1 <- glm(status ~ offset(p), family = poisson, data = data1)
>6. fit2 <- glm(status~ lp + offset(logbase), family = poisson, data =
>data1)
>7. group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf))
>8. fit3 <- glm(status ~ -1 + group + offset(p), family = poisson, data
>= data1)
>
>The key idea of the paper you referenced is that the counterpart to the
>Hosmer-Lemishow test (wrong if used directly in a Cox model) is to look
>at the predicted values from a Cox model as input to a Poisson
>regression.  That means adding the expected from the Cox model as a
>fixed term in the Poisson.  And like any other poisson that means
>offset(log(expected)) as a term.
>
>The presence of time dependent covariates does nothing to change this,
>per se, since expected for time fixed is the same as for time varying.
>In practice it does matter, at least philosophically.  Lines 1, 2, 5 do
>this just fine.
>
>If data1 is not the same as data0, a new study say, then the test for
>intercept=0 from fit1 is a test of overall calibration.  Models like
>line 8 try to partition out where any differences actually lie.
>
>The time-dependent covariates part lies in the fact that a single
>subject may be represented by multiple lines in data0 and/or data1.  Do
>you want to collapse that person into a single row before the glm fits?
>If subject "Jones" is represented by 15 lines in the data and "Smith"
>by 2, it does seem a bit unfair to give Jones 15 observations in the
>glm fit.  But full discussion of this is as much philosophy as
>statistics, and is perhaps best done over a beer.
>
>Terry T.
>
>________________________________
>From: Max Shell [[hidden email]]
>Sent: Wednesday, January 17, 2018 10:25 AM
>To: Therneau, Terry M., Ph.D.
>Subject: Re: Time-dependent coefficients in a Cox model with
>categorical variants
>
>Assessing calibration of Cox model with time-dependent
>coefficients<https://stats.stackexchange.com/questions/323569/assessing-calibration-of-cox-model-with-time-dependent-coefficients>
>
>I am trying to find methods for testing and visualizing calibration to
>Cox models with time-depended coefficients. I have read your nice
>article<http://journals.sagepub.com/doi/10.1177/0962280213497434>. In
>this paper, we can fit three models:
>
>fit0 <- coxph(Surv(futime, status) ~ x1 + x2 + x3, data = data0) p <-
>log(predict(fit0, newdata = data1, type = "expected")) lp <-
>predict(fit0, newdata = data1, type = "lp") logbase <- p - lp fit1 <-
>glm(y ~ offset(p), family = poisson, data = data1) fit2 <- glm(y ~ lp +
>offset(logbase), family = poisson, data = data1) group <- cut(lp,
>c(-Inf, quantile(lp, (1:9) / 10), Inf)) fit3 <- glm(y ~ -1 + group +
>offset(p), family = poisson, data = data1)
>
>Here$B!$(BI simplely use data1$B!!(B<- data0[1:500,]
>
>First, I get following error when running line 5.
>
>Error in eval(predvars, data, env) : object 'y' not found
>
>So I modifited the code by replacing the y as status looks like this:
>
>fit1 <- glm(status ~ offset(p), family = poisson, data = data1) fit2 <-
>glm(status ~ lp + offset(logbase), family = poisson, data = data1)
>group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf)) fit3 <-
>glm(status ~ -1 + group + offset(p), family = poisson, data = data1)
>
>Is this replacing correct?
>
>Second, I try to introduce the time-transform use coxph with
>ttparament.
>
>My code is:  fit0 <- coxph(Surv(time, status) ~ x1 + x2 + x3 + tt(x3),
>data = data0, function(x, t, ...) x * t) p <- log(predict(fit0, newdata
>= data1, type = "expected")) lp <- predict(fit0, newdata = data1, type
>= "lp") logbase <- p - lp fit1 <- glm(status ~ offset(p), family =
>poisson, data = data1) fit2 <- glm(status ~ lp + offset(logbase),
>family = poisson, data = data1) group <- cut(lp, c(-Inf, quantile(lp,
>(1:9) / 10), Inf)) fit3 <- glm(status ~ -1 + group + offset(p), family
>= poisson, data = data1)
>
>My questions is:
>
>  *   Is the code above correct?
>*   How to interpret the fit1, fit2, fit3? What's the connection
>between the three models and the calibration of the Cox model?
>*   How to generate the calibration plot using fit3? The article dose
>have a section discuss this, but no code is provided.
>
>Thank you!
>
>On Mon, Jan 15, 2018 at 9:23 PM, Therneau, Terry M., Ph.D.
><[hidden email]<mailto:[hidden email]>> wrote:
>The model formula " ~ Histology" knows how to change your 3 level
>categorical variable into two 0/1 dummy variables for a regression
>matrix.  The tt() call is a simple function, however, and ordinary
>multiplication and does not have those powers.  In this case you need
>to do the setup by hand : create your own 0/1 dummy variables and work
>with them.
>
>sqcc <- ifelse(dta$Histology == 'Sqcc', 0, 1)
>hrac <- ifelse(dta$Histology == 'High risk AC', 0, 1)
>fit <- coxph(Surv(time, status) ~ Sex + sqcc + hrac + tt(sqcc) +
>tt(hrac),
>                 data = dta, tt = list(function(x,t, ...) x*log(t),
>                                       function(x, t, ...) x* log(t)))
>
>
>Terry Therneau
>
>PS I've rarely found x*log(t) to be useful, but perhaps you have
>already looked at the cox.zph plots and see that shape.
>
>
>Suppose I have a dataset contain three variants, looks like
>head(dta)
>   Sex    tumorsize    Histology       time         status
>     0            1.5            2              12.1000             0
>     1            1.8            1              38.4000             0
>.....................
>
>Sex: 1 for male; 0 for female., two levels
>Histology: 1 for SqCC; 2 for High risk AC; 3 for low risk AC, three
>levels
>Now I need to get a Time-dependent coefficients cox fit:
>
>library(survival)
>for(i in c(1,3) dta[,i] <- factor(dta[,i])
>fit <-
>   coxph(
>    Surv(time, status) ~  Sex + tumorsize +  Histology + tt(Histology),
>     data = dta,
>     tt = function(x, t, ...) x * log(t)
>   )
>
>But  I keep gettting this error says:
>
>Error in if (any(infs)) warning(paste("Loglik converged before variable
>",  :
>   missing value where TRUE/FALSE needed
>In addition: Warning message:
>In Ops.factor(x, log(t)) : ?*? not meaningful for factors.
>
>How can I fix it? I know that the "Sex" and "Histology" are both
>categorical variants. I  want to have a model that have two ?(t) = a +
>blog(t) for each histology level.
>Thank you?
>
>
>
> [[alternative HTML version deleted]]
>
>______________________________________________
>[hidden email] mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.