results of a survival analysis change when converting the data to counting process format


results of a survival analysis change when converting the data to counting process format

Ferenci Tamas
Dear All,

Consider the following simple example:

library( survival )
data( veteran )

coef( coxph(Surv(time, status) ~ trt + prior + karno, data = veteran) )
         trt        prior        karno
 0.180197194 -0.005550919 -0.033771018

Note that we have neither time-dependent covariates nor time-varying
coefficients, so the results should be the same if we change to
counting process format, no matter where we cut the times.

That's true if we cut at event times:

veteran2 <- survSplit( Surv(time, status) ~ trt + prior + karno,
                       data = veteran, cut = unique( veteran$time ) )

coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran2 ) )
         trt        prior        karno
 0.180197194 -0.005550919 -0.033771018

But, quite interestingly, it is not true if we cut at every day:

veteran3 <- survSplit( Surv(time, status) ~ trt + prior + karno,
                       data = veteran, cut = 1:max(veteran$time) )

coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran3 ) )
         trt        prior        karno
 0.180197215 -0.005550913 -0.033771016

The difference is not large, but definitely more than just a rounding
error, or something like that.

What's going on? How can the results get wrong, especially by
including more cutpoints?

Thank you in advance,
Tamas

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: results of a survival analysis change when converting the data to counting process format

Göran Broström-3


On 2019-08-18 19:10, Ferenci Tamas wrote:

> What's going on? How can the results get wrong, especially by
> including more cutpoints?

All results are wrong, but they are useful (paraphrasing George EP Box).

Göran



Re: results of a survival analysis change when converting the data to counting process format

Göran Broström-3


On 2019-08-22 at 21:48, Göran Broström wrote:

> All results are wrong, but they are useful (paraphrasing George EP Box).

That said, it is a little surprising: the generated risk sets are (or
should be) identical in all cases, and one would expect the rounding
errors to be the same. But the data are stored differently, and ... who knows?

I tried your examples on my computer and got exactly the same results as
you, which surprised me.

G,



Re: results of a survival analysis change when converting the data to counting process format

Peter Dalgaard-2
I think this is a case of (a - x) + x != a in floating-point arithmetic. When updating the risk set, you subtract sums of covariates at the end of a time interval and then add them back at the beginning of the next interval. Something like that, anyway. As in

> x <- rnorm(1000)
> sum(c(x,-x))
[1] -1.387779e-17

Or (worse, since sum() is smart and doesn't operate sequentially)

> tt <- 0
> for (i in x) tt <- tt+i
> for (i in x) tt <- tt-i
> tt
[1] 2.553513e-14

-pd
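The same effect can be sketched outside R as well. The snippet below is purely illustrative (hypothetical code, not the coxph internals, and written in Python only because the effect is language-independent): it mimics a risk-set total that is incremented and decremented as intervals open and close, and contrasts the sequential result with an exactly rounded summation of the same terms.

```python
import math
import random

random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Sequential updates, like a running covariate total that is repeatedly
# incremented when observations enter the risk set and decremented when
# they leave:
tt = 0.0
for v in x:
    tt += v   # observation enters the risk set
for v in x:
    tt -= v   # observation leaves the risk set

# Mathematically tt is exactly 0; in floating point it usually is not.
print(tt)

# An exactly rounded summation of the same terms does return 0.0:
print(math.fsum(x + [-v for v in x]))
```

`math.fsum` tracks all the low-order bits and rounds only once at the end, which is why it recovers the exact cancellation that the sequential loop loses.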


--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]


Re: results of a survival analysis change when converting the data to counting process format

Andrews, Chris
In reply to this post by Göran Broström-3

# For what it is worth, even the second fit (cuts at observation times) does not give coefficient estimates identical to those from the original data structure.

answer <- coef( coxph(Surv(time, status) ~ trt + prior + karno, data = veteran) )
veteran2 <- survSplit( Surv(time, status) ~ trt + prior + karno, data = veteran, cut = unique( veteran$time ) )
answer2 <- coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran2 ) )
answer2 - answer
 #          trt         prior         karno
 # 2.775558e-16 -1.127570e-17 -6.938894e-18

# If you cut daily, but not all the way to 999, you get a few different fits

ff <- function(m) {
        veteran3 <- survSplit( Surv(time, status) ~ trt + prior + karno, data = veteran, cut = seq(m))
        answer3 <- coef( coxph(Surv(tstart,time, status) ~ trt + prior + karno, data = veteran3 ) )
        return(answer3)
}

answers3 <- sapply(100:999, ff)
plot(100:999, answers3[1,] - answer[1])

# But certainly all these differences are of no practical importance.




Re: results of a survival analysis change when converting the data to counting process format

Andrews, Chris
In reply to this post by Göran Broström-3

On the other hand, SAS gets the same answer all three ways.  And the answer SAS gets is closest to the one that R gets when using the daily cutting.

Obs  _TIES_  _TYPE_  _STATUS_     _NAME_  trt           prior         karno         _LNLIKE_
  1  EFRON   PARMS   0 Converged  time    0.1801972156  -0.005550913  -0.033771016  -483.9277463
  2  EFRON   PARMS   0 Converged  time    0.1801972156  -0.005550913  -0.033771016  -483.9277463
  3  EFRON   PARMS   0 Converged  time    0.1801972156  -0.005550913  -0.033771016  -483.9277463

proc phreg data=veteran1 outest = vet1;
        model time * status(0) = trt prior karno / ties = efron;
proc phreg data=veteran2 outest = vet2;
        model time * status(0) = trt prior karno / ties = efron entry=tstart;
proc phreg data=veteran3 outest = vet3;
        model time * status(0) = trt prior karno / ties = efron entry=tstart;

data vets; set vet1 vet2 vet3;

proc print data=vets width=FULL;

run;



Re: results of a survival analysis change when converting the data to counting process format

Ferenci Tamas
In reply to this post by Ferenci Tamas
Thanks for all those comments, and thanks Terry for the repair!

Tamas


On 23 August 2019, at 14:40:05, you wrote:


I've spent some time chasing this down, along with another error of a similar type. It's repaired in version 3.0-9 (soon to go to CRAN, or at least that is the plan; I'm working through one last failure in the checks).

For those who want to know more---

When there is start-stop data such as this: (0, 101], (101, 123], (123, 400], ..., the algorithm used by coxph is to work from largest to smallest time. Every time it hits the end of an interval (400, 123, 101) that observation is added into the risk set, and whenever it hits the start of an interval (123, 101, 0) the observation is removed. "Add" and "remove" also mean "add to the current estimate of the running mean and variance" of the covariates. At each event time that current mean and variance are used to update the loglik and its derivatives.

When a subject is diced into little pieces of (1,2] (2,3] (3,4], ..., like the last fit below, the sheer number of updates to the running mean/variance can lead to accumulated round-off error. The code works hard to avoid this, and some of that is subtle (round-off error is a tricky business) and has gone through several refinements. Nevertheless, the result for this data set should not have been that far off. Another user example from about the same time was worse, so I've been focusing on that one: one of the two splits led to an exp() overflow and one didn't, giving results that were completely different. This led to a more careful review and some changes that addressed the example below as well.

Terry T.
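The accumulation Terry describes can be illustrated with a small sketch (hypothetical Python, not the survival package's C code): a covariate mean maintained incrementally through many enter/leave updates drifts slightly from one recomputed exactly, and the drift grows with the number of split intervals.

```python
import math
import random

random.seed(42)
covs = [random.gauss(50.0, 20.0) for _ in range(100)]  # illustrative covariate values

# Incrementally maintained total and count, as in the largest-to-smallest sweep:
total, n = 0.0, 0
for _ in range(10_000):        # many tiny split intervals -> many updates
    for c in covs:
        total += c
        n += 1                 # observation enters the risk set
    for c in covs:
        total -= c
        n -= 1                 # observation leaves the risk set
for c in covs:
    total += c
    n += 1                     # final risk set holds every observation

running_mean = total / n
exact_mean = math.fsum(covs) / len(covs)  # exactly rounded reference

# The drift is tiny, but it scales with the number of updates, which is
# why splitting at every day can shift the estimates slightly.
print(running_mean - exact_mean)
```

Under exact arithmetic the difference printed would be zero; the incremental version only approximates it because each of the millions of additions and subtractions rounds to double precision.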
