vcov and survival

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

vcov and survival

Therneau, Terry M., Ph.D.
I have just noticed a difference in behavior between coxph and lm/glm: if one or more of
the coefficients from the fit in NA, then lm and glm omit that row/column from the
variance matrix; while coxph retains it but sets the values to zero.

   Is this something that should be "fixed", i.e., made to agree? I suspect that doing so
will break other packages, but then NA coefs are rather rare so perhaps not.

Terry Therneau

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: vcov and survival

Fox, John
Dear Terry,

Even the behaviour of lm() and glm() isn't entirely consistent. In both cases, singularity results in NA coefficients by default, and these are reported in the model summary and coefficient vector, but not in the coefficient covariance matrix:

----------------

> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
+              data=longley)
> summary(mod.lm)

Call:
lm(formula = Employed ~ GNP + Population + I(GNP + Population),
    data = longley)

Residuals:
     Min       1Q   Median       3Q      Max
-0.80899 -0.33282 -0.02329  0.25895  1.08800

Coefficients: (1 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
GNP                  0.06317    0.01065   5.933 4.96e-05 ***
Population          -0.40974    0.15214  -2.693   0.0184 *  
I(GNP + Population)       NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5459 on 13 degrees of freedom
Multiple R-squared:  0.9791, Adjusted R-squared:  0.9758
F-statistic: 303.9 on 2 and 13 DF,  p-value: 1.221e-11

> vcov(mod.lm)
            (Intercept)           GNP Population
(Intercept) 190.0269691  0.1445617813 -2.0954381
GNP           0.1445618  0.0001133631 -0.0016054
Population   -2.0954381 -0.0016053999  0.0231456
> coef(mod.lm)
        (Intercept)                 GNP          Population I(GNP + Population)
        88.93879831          0.06317244         -0.40974292                  NA
>
> mod.glm <- glm(Employed ~ GNP + Population + I(GNP + Population),
+               data=longley)
> summary(mod.glm)

Call:
glm(formula = Employed ~ GNP + Population + I(GNP + Population),
    data = longley)

Deviance Residuals:
     Min        1Q    Median        3Q       Max  
-0.80899  -0.33282  -0.02329   0.25895   1.08800  

Coefficients: (1 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
GNP                  0.06317    0.01065   5.933 4.96e-05 ***
Population          -0.40974    0.15214  -2.693   0.0184 *  
I(GNP + Population)       NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.2980278)

    Null deviance: 185.0088  on 15  degrees of freedom
Residual deviance:   3.8744  on 13  degrees of freedom
AIC: 30.715

Number of Fisher Scoring iterations: 2

> coef(mod.glm)
        (Intercept)                 GNP          Population I(GNP + Population)
        88.93879831          0.06317244         -0.40974292                  NA
> vcov(mod.glm)
            (Intercept)           GNP Population
(Intercept) 190.0269691  0.1445617813 -2.0954381
GNP           0.1445618  0.0001133631 -0.0016054
Population   -2.0954381 -0.0016053999  0.0231456

----------------

Moreoever, lm() has a singular.ok() argument that defaults to TRUE, but glm() doesn't have this argument:

----------------

> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
+              data=longley, singular.ok=FALSE)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  singular fit encountered

----------------

In my opinion, singularity should at least produce a warning, both in calls to lm() and glm(), and in summary() output. Even better, again in my opinion, would be to produce an error by default in this situation, but doing so would likely break too much existing code.

I prefer NA to 0 for the redundant coefficients because it at least suggests that the decision about what to exclude is arbitrary, and of course simply excluding coefficients isn't the only way to proceed.

Finally, the differences in behaviour between coef() and vcov() and between lm() and glm() aren't really sensible.

Maybe there's some reason for all this that escapes me.

Best,
 John

--------------------------------------
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: socserv.mcmaster.ca/jfox




> -----Original Message-----
> From: R-devel [mailto:[hidden email]] On Behalf Of
> Therneau, Terry M., Ph.D.
> Sent: Wednesday, September 13, 2017 6:19 PM
> To: [hidden email]
> Subject: [Rd] vcov and survival
>
> I have just noticed a difference in behavior between coxph and lm/glm:
> if one or more of the coefficients from the fit in NA, then lm and glm
> omit that row/column from the variance matrix; while coxph retains it
> but sets the values to zero.
>
>    Is this something that should be "fixed", i.e., made to agree? I
> suspect that doing so will break other packages, but then NA coefs are
> rather rare so perhaps not.
>
> Terry Therneau
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: vcov and survival

Martin Maechler
>>>>> Fox, John <[hidden email]>
>>>>>     on Wed, 13 Sep 2017 22:45:07 +0000 writes:

    > Dear Terry,
    > Even the behaviour of lm() and glm() isn't entirely consistent. In both cases, singularity results in NA coefficients by default, and these are reported in the model summary and coefficient vector, but not in the coefficient covariance matrix:

    > ----------------

    >> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
    > +              data=longley)
    >> summary(mod.lm)

    > Call:
    > lm(formula = Employed ~ GNP + Population + I(GNP + Population),
    > data = longley)

    > Residuals:
    > Min       1Q   Median       3Q      Max
    > -0.80899 -0.33282 -0.02329  0.25895  1.08800

    > Coefficients: (1 not defined because of singularities)
    > Estimate Std. Error t value Pr(>|t|)    
    > (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
    > GNP                  0.06317    0.01065   5.933 4.96e-05 ***
    > Population          -0.40974    0.15214  -2.693   0.0184 *  
    > I(GNP + Population)       NA         NA      NA       NA    
    > ---
    > Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    > Residual standard error: 0.5459 on 13 degrees of freedom
    > Multiple R-squared:  0.9791, Adjusted R-squared:  0.9758
    > F-statistic: 303.9 on 2 and 13 DF,  p-value: 1.221e-11

    >> vcov(mod.lm)
    > (Intercept)           GNP Population
    > (Intercept) 190.0269691  0.1445617813 -2.0954381
    > GNP           0.1445618  0.0001133631 -0.0016054
    > Population   -2.0954381 -0.0016053999  0.0231456
    >> coef(mod.lm)
    > (Intercept)                 GNP          Population I(GNP + Population)
    > 88.93879831          0.06317244         -0.40974292                  NA
    >>
    >> mod.glm <- glm(Employed ~ GNP + Population + I(GNP + Population),
    > +               data=longley)
    >> summary(mod.glm)

    > Call:
    > glm(formula = Employed ~ GNP + Population + I(GNP + Population),
    > data = longley)

    > Deviance Residuals:
    > Min        1Q    Median        3Q       Max  
    > -0.80899  -0.33282  -0.02329   0.25895   1.08800  

    > Coefficients: (1 not defined because of singularities)
    > Estimate Std. Error t value Pr(>|t|)    
    > (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
    > GNP                  0.06317    0.01065   5.933 4.96e-05 ***
    > Population          -0.40974    0.15214  -2.693   0.0184 *  
    > I(GNP + Population)       NA         NA      NA       NA    
    > ---
    > Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    > (Dispersion parameter for gaussian family taken to be 0.2980278)

    > Null deviance: 185.0088  on 15  degrees of freedom
    > Residual deviance:   3.8744  on 13  degrees of freedom
    > AIC: 30.715

    > Number of Fisher Scoring iterations: 2

    >> coef(mod.glm)
    > (Intercept)                 GNP          Population I(GNP + Population)
    > 88.93879831          0.06317244         -0.40974292                  NA
    >> vcov(mod.glm)
    > (Intercept)           GNP Population
    > (Intercept) 190.0269691  0.1445617813 -2.0954381
    > GNP           0.1445618  0.0001133631 -0.0016054
    > Population   -2.0954381 -0.0016053999  0.0231456

    > ----------------

    > Moreoever, lm() has a singular.ok() argument that defaults to TRUE, but glm() doesn't have this argument:

    > ----------------

    >> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
    > +              data=longley, singular.ok=FALSE)
    > Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
    > singular fit encountered

    > ----------------

    > In my opinion, singularity should at least produce a warning, both in calls to lm() and glm(), and in summary() output. Even better, again in my opinion, would be to produce an error by default in this situation, but doing so would likely break too much existing code.

Yes, I would not want to change.  Note that this is from S
already, i.e., long "ingrained".  I think there one argument was
that there are situations with factor predictors of many levels
and conceptually their 2- or even 3-way interactions (!)
where it is neat to just fit the model, (-> get residuals and
fitted values) and also see implicitly the "necessary rank" of
prediction space, or rather even more specifically, you see for
every factor how many levels are "distinguishable"/useful for
prediction, given the data.

    > I prefer NA to 0 for the redundant coefficients because it at least suggests that the decision about what to exclude is arbitrary, and of course simply excluding coefficients isn't the only way to proceed.

I'm less modest and would say *definitely*, NA's are highly
prefered in such a situation.

    > Finally, the differences in behaviour between coef() and vcov() and between lm() and glm() aren't really sensible.

I really haven't seen any difference between lm() and glm() in
the example above.  Maybe you can point them out for me.

I do quite agree that  vcov() should be compatible with
coef() [and summary()]  for both 'lm' and 'glm' methods, i.e.,
should get NA rows and columns there.  This would require
eliminating these before e.g. using it in solve(<vcov>, *) etc,
but I think it would be a good idea that the useR must deal with
these NAs actively.

Shall "we" try and see the fallout in CRAN space?

    > Maybe there's some reason for all this that escapes me.
(for the first one---"no error"--- I gave a reason)

    > Best,
    > John

    > --------------------------------------
    > John Fox, Professor Emeritus
    > McMaster University
    > Hamilton, Ontario, Canada
    > Web: socserv.mcmaster.ca/jfox




    >> -----Original Message-----
    >> From: R-devel [mailto:[hidden email]] On Behalf Of
    >> Therneau, Terry M., Ph.D.
    >> Sent: Wednesday, September 13, 2017 6:19 PM
    >> To: [hidden email]
    >> Subject: [Rd] vcov and survival
    >>
    >> I have just noticed a difference in behavior between coxph and lm/glm:
    >> if one or more of the coefficients from the fit in NA, then lm and glm
    >> omit that row/column from the variance matrix; while coxph retains it
    >> but sets the values to zero.
    >>
    >> Is this something that should be "fixed", i.e., made to agree? I
    >> suspect that doing so will break other packages, but then NA coefs are
    >> rather rare so perhaps not.
    >>
    >> Terry Therneau
    >>
    >> ______________________________________________
    >> [hidden email] mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel

    > ______________________________________________
    > [hidden email] mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: vcov and survival

Martin Maechler
>>>>> Martin Maechler <[hidden email]>
>>>>>     on Thu, 14 Sep 2017 10:13:02 +0200 writes:

>>>>> Fox, John <[hidden email]>
>>>>>     on Wed, 13 Sep 2017 22:45:07 +0000 writes:

    >> Dear Terry,
    >> Even the behaviour of lm() and glm() isn't entirely consistent. In both cases, singularity results in NA coefficients by default, and these are reported in the model summary and coefficient vector, but not in the coefficient covariance matrix:

    >> ----------------

    >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
    >> +              data=longley)
    >>> summary(mod.lm)

    >> Call:
    >> lm(formula = Employed ~ GNP + Population + I(GNP + Population),
    >> data = longley)

    >> Residuals:
    >> Min       1Q   Median       3Q      Max
    >> -0.80899 -0.33282 -0.02329  0.25895  1.08800

    >> Coefficients: (1 not defined because of singularities)
    >> Estimate Std. Error t value Pr(>|t|)    
    >> (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
    >> GNP                  0.06317    0.01065   5.933 4.96e-05 ***
    >> Population          -0.40974    0.15214  -2.693   0.0184 *  
    >> I(GNP + Population)       NA         NA      NA       NA    
    >> ---
    >> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    >> Residual standard error: 0.5459 on 13 degrees of freedom
    >> Multiple R-squared:  0.9791, Adjusted R-squared:  0.9758
    >> F-statistic: 303.9 on 2 and 13 DF,  p-value: 1.221e-11

    >>> vcov(mod.lm)
    >> (Intercept)           GNP Population
    >> (Intercept) 190.0269691  0.1445617813 -2.0954381
    >> GNP           0.1445618  0.0001133631 -0.0016054
    >> Population   -2.0954381 -0.0016053999  0.0231456
    >>> coef(mod.lm)
    >> (Intercept)                 GNP          Population I(GNP + Population)
    >> 88.93879831          0.06317244         -0.40974292                  NA
    >>>
    >>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP + Population),
    >> +               data=longley)
    >>> summary(mod.glm)

    >> Call:
    >> glm(formula = Employed ~ GNP + Population + I(GNP + Population),
    >> data = longley)

    >> Deviance Residuals:
    >> Min        1Q    Median        3Q       Max  
    >> -0.80899  -0.33282  -0.02329   0.25895   1.08800  

    >> Coefficients: (1 not defined because of singularities)
    >> Estimate Std. Error t value Pr(>|t|)    
    >> (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
    >> GNP                  0.06317    0.01065   5.933 4.96e-05 ***
    >> Population          -0.40974    0.15214  -2.693   0.0184 *  
    >> I(GNP + Population)       NA         NA      NA       NA    
    >> ---
    >> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    >> (Dispersion parameter for gaussian family taken to be 0.2980278)

    >> Null deviance: 185.0088  on 15  degrees of freedom
    >> Residual deviance:   3.8744  on 13  degrees of freedom
    >> AIC: 30.715

    >> Number of Fisher Scoring iterations: 2

    >>> coef(mod.glm)
    >> (Intercept)                 GNP          Population I(GNP + Population)
    >> 88.93879831          0.06317244         -0.40974292                  NA
    >>> vcov(mod.glm)
    >> (Intercept)           GNP Population
    >> (Intercept) 190.0269691  0.1445617813 -2.0954381
    >> GNP           0.1445618  0.0001133631 -0.0016054
    >> Population   -2.0954381 -0.0016053999  0.0231456

    >> ----------------

    >> Moreoever, lm() has a singular.ok() argument that defaults to TRUE, but glm() doesn't have this argument:

    >> ----------------

    >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
    >> +              data=longley, singular.ok=FALSE)
    >> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
    >> singular fit encountered

    >> ----------------

    >> In my opinion, singularity should at least produce a warning, both in calls to lm() and glm(), and in summary() output. Even better, again in my opinion, would be to produce an error by default in this situation, but doing so would likely break too much existing code.

    > Yes, I would not want to change.  Note that this is from S
    > already, i.e., long "ingrained".  I think there one argument was
    > that there are situations with factor predictors of many levels
    > and conceptually their 2- or even 3-way interactions (!)
    > where it is neat to just fit the model, (-> get residuals and
    > fitted values) and also see implicitly the "necessary rank" of
    > prediction space, or rather even more specifically, you see for
    > every factor how many levels are "distinguishable"/useful for
    > prediction, given the data.

    >> I prefer NA to 0 for the redundant coefficients because it at least suggests that the decision about what to exclude is arbitrary, and of course simply excluding coefficients isn't the only way to proceed.

    > I'm less modest and would say *definitely*, NA's are highly
    > prefered in such a situation.

    >> Finally, the differences in behaviour between coef() and vcov() and between lm() and glm() aren't really sensible.

    > I really haven't seen any difference between lm() and glm() in
    > the example above.  Maybe you can point them out for me.

.. now I saw it:
   lm() has  a 'singular.ok = TRUE' argument
   which you can set to FALSE if you prefer an error to NA coefficients.

I also agree with you John that it would be nice if  glm() also
got such an argument.
Patches are welcome and seem easy. Nowadays we prefer them
as attachments (diff/patch file!) at R's
  https://bugs.r-project.org bugzilla
against the svn source, here
  https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R
and
  https://svn.r-project.org/R/trunk/src/library/stats/man/glm.Rd

    > I do quite agree that  vcov() should be compatible with
    > coef() [and summary()]  for both 'lm' and 'glm' methods, i.e.,
    > should get NA rows and columns there.  This would require
    > eliminating these before e.g. using it in solve(<vcov>, *) etc,
    > but I think it would be a good idea that the useR must deal with
    > these NAs actively.

    > Shall "we" try and see the fallout in CRAN space?

    >> Maybe there's some reason for all this that escapes me.
    > (for the first one---"no error"--- I gave a reason)

    >> Best,
    >> John

    >> --------------------------------------
    >> John Fox, Professor Emeritus
    >> McMaster University
    >> Hamilton, Ontario, Canada
    >> Web: socserv.mcmaster.ca/jfox




    >>> -----Original Message-----
    >>> From: R-devel [mailto:[hidden email]] On Behalf Of
    >>> Therneau, Terry M., Ph.D.
    >>> Sent: Wednesday, September 13, 2017 6:19 PM
    >>> To: [hidden email]
    >>> Subject: [Rd] vcov and survival
    >>>
    >>> I have just noticed a difference in behavior between coxph and lm/glm:
    >>> if one or more of the coefficients from the fit in NA, then lm and glm
    >>> omit that row/column from the variance matrix; while coxph retains it
    >>> but sets the values to zero.
    >>>
    >>> Is this something that should be "fixed", i.e., made to agree? I
    >>> suspect that doing so will break other packages, but then NA coefs are
    >>> rather rare so perhaps not.
    >>>
    >>> Terry Therneau

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: vcov and survival

Therneau, Terry M., Ph.D.
Thanks all for your comments.  No one said "all the other vcov methods do ....", so I took
some time this AM to look at several listed in the vcov help page.
Here is the code for the first few examples: data2 is constructed specifically to create
an NA coef midway in the list.

data1 <- data.frame(y = c(1,2,10,50, 5, 4, 8, 40, 60, 20, 21, 22,
                           3,5,12,52, 7, 8,16, 48, 58, 28, 20,5),
                     x1 = factor(letters[rep(1:3, length=24)]),
                     x2 = factor(LETTERS[rep(1:4, length=24)]),
                     x3 = factor(rep(1:7, length=24)))
data2 <- subset(data1, x1 !='a' | x2 != 'C')

fit1 <- lm(y ~ x1*x2, data2)
table(is.na(coef(fit1)))
dim(vcov(fit1))

fit2 <- glm(y ~ x1*x2, data=data2, poisson)
table(is.na(coef(fit2)))
dim(vcov(fit2))

fit3 <- lme(y ~ x1*x2, random= ~1|x3, data2)

1. lm, mlm, glm, negbin objects all have an NA in coef(fit); and remove NA columns from
the vcov object.

2. I expected polr to return a generalized inverse of the Hessian since vcov.polr has a
call to ginv(object$Hessian), but it shortcuts earlier with a message
  "design appears to be rank-deficient, so dropping some coefs"
The undetermined coef appears in neither coef() more vcov().

3. rlm declares that it does not work with singular data.

4. multinom returns values for all coefficients and a full variance matrix.  However, the
returned variance is rank-deficient.  It is essentially a g-inverse of the Hessian.

5. coxph and survreg report an NA coef, and return a generalized inverse of the Hessian
matrix.  The g-inverse was chosen to be a particularly easy one in that you can spot
redundant colums via a row/col of zeros.

6. nlme fails with a singularity error.  I didn't check out gls.

So my original question of whether I should make coxph consistent with the others has no
answer, the 'others' are not consistent.

In response to two other points:
  >> In my opinion singularity should at least produce a warning...
I was one of those who lobbied heavily to change the singular.ok=FALSE default of lm to
TRUE.  Data is messy, I have work to do, and don't need a package constantly harping at me.

In the same vein, stuffing NA into the vcov result is more pure, but would cause a lot of
hassle.  I'm not sure that it is worth it.

For now, coxph will stay as is.
But again, thanks to all for comments and I'll look forward to any more discussion.

Terry T.


On 09/14/2017 03:23 AM, Martin Maechler wrote:

>>>>>> Martin Maechler <[hidden email]>
>>>>>>      on Thu, 14 Sep 2017 10:13:02 +0200 writes:
>
>>>>>> Fox, John <[hidden email]>
>>>>>>      on Wed, 13 Sep 2017 22:45:07 +0000 writes:
>
>      >> Dear Terry,
>      >> Even the behaviour of lm() and glm() isn't entirely consistent. In both cases, singularity results in NA coefficients by default, and these are reported in the model summary and coefficient vector, but not in the coefficient covariance matrix:
>
>      >> ----------------
>
>      >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
>      >> +              data=longley)
>      >>> summary(mod.lm)
>
>      >> Call:
>      >> lm(formula = Employed ~ GNP + Population + I(GNP + Population),
>      >> data = longley)
>
>      >> Residuals:
>      >> Min       1Q   Median       3Q      Max
>      >> -0.80899 -0.33282 -0.02329  0.25895  1.08800
>
>      >> Coefficients: (1 not defined because of singularities)
>      >> Estimate Std. Error t value Pr(>|t|)
>      >> (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
>      >> GNP                  0.06317    0.01065   5.933 4.96e-05 ***
>      >> Population          -0.40974    0.15214  -2.693   0.0184 *
>      >> I(GNP + Population)       NA         NA      NA       NA
>      >> ---
>      >> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
>      >> Residual standard error: 0.5459 on 13 degrees of freedom
>      >> Multiple R-squared:  0.9791, Adjusted R-squared:  0.9758
>      >> F-statistic: 303.9 on 2 and 13 DF,  p-value: 1.221e-11
>
>      >>> vcov(mod.lm)
>      >> (Intercept)           GNP Population
>      >> (Intercept) 190.0269691  0.1445617813 -2.0954381
>      >> GNP           0.1445618  0.0001133631 -0.0016054
>      >> Population   -2.0954381 -0.0016053999  0.0231456
>      >>> coef(mod.lm)
>      >> (Intercept)                 GNP          Population I(GNP + Population)
>      >> 88.93879831          0.06317244         -0.40974292                  NA
>      >>>
>      >>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP + Population),
>      >> +               data=longley)
>      >>> summary(mod.glm)
>
>      >> Call:
>      >> glm(formula = Employed ~ GNP + Population + I(GNP + Population),
>      >> data = longley)
>
>      >> Deviance Residuals:
>      >> Min        1Q    Median        3Q       Max
>      >> -0.80899  -0.33282  -0.02329   0.25895   1.08800
>
>      >> Coefficients: (1 not defined because of singularities)
>      >> Estimate Std. Error t value Pr(>|t|)
>      >> (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
>      >> GNP                  0.06317    0.01065   5.933 4.96e-05 ***
>      >> Population          -0.40974    0.15214  -2.693   0.0184 *
>      >> I(GNP + Population)       NA         NA      NA       NA
>      >> ---
>      >> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
>      >> (Dispersion parameter for gaussian family taken to be 0.2980278)
>
>      >> Null deviance: 185.0088  on 15  degrees of freedom
>      >> Residual deviance:   3.8744  on 13  degrees of freedom
>      >> AIC: 30.715
>
>      >> Number of Fisher Scoring iterations: 2
>
>      >>> coef(mod.glm)
>      >> (Intercept)                 GNP          Population I(GNP + Population)
>      >> 88.93879831          0.06317244         -0.40974292                  NA
>      >>> vcov(mod.glm)
>      >> (Intercept)           GNP Population
>      >> (Intercept) 190.0269691  0.1445617813 -2.0954381
>      >> GNP           0.1445618  0.0001133631 -0.0016054
>      >> Population   -2.0954381 -0.0016053999  0.0231456
>
>      >> ----------------
>
>      >> Moreoever, lm() has a singular.ok() argument that defaults to TRUE, but glm() doesn't have this argument:
>
>      >> ----------------
>
>      >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
>      >> +              data=longley, singular.ok=FALSE)
>      >> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
>      >> singular fit encountered
>
>      >> ----------------
>
>      >> In my opinion, singularity should at least produce a warning, both in calls to lm() and glm(), and in summary() output. Even better, again in my opinion, would be to produce an error by default in this situation, but doing so would likely break too much existing code.
>
>      > Yes, I would not want to change.  Note that this is from S
>      > already, i.e., long "ingrained".  I think there one argument was
>      > that there are situations with factor predictors of many levels
>      > and conceptually their 2- or even 3-way interactions (!)
>      > where it is neat to just fit the model, (-> get residuals and
>      > fitted values) and also see implicitly the "necessary rank" of
>      > prediction space, or rather even more specifically, you see for
>      > every factor how many levels are "distinguishable"/useful for
>      > prediction, given the data.
>
>      >> I prefer NA to 0 for the redundant coefficients because it at least suggests that the decision about what to exclude is arbitrary, and of course simply excluding coefficients isn't the only way to proceed.
>
>      > I'm less modest and would say *definitely*, NA's are highly
>      > prefered in such a situation.
>
>      >> Finally, the differences in behaviour between coef() and vcov() and between lm() and glm() aren't really sensible.
>
>      > I really haven't seen any difference between lm() and glm() in
>      > the example above.  Maybe you can point them out for me.
>
> .. now I saw it:
>     lm() has  a 'singular.ok = TRUE' argument
>     which you can set to FALSE if you prefer an error to NA coefficients.
>
> I also agree with you John that it would be nice if  glm() also
> got such an argument.
> Patches are welcome and seem easy. Nowadays we prefer them
> as attachments (diff/patch file!) at R's
>    https://bugs.r-project.org bugzilla
> against the svn source, here
>    https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R
> and
>    https://svn.r-project.org/R/trunk/src/library/stats/man/glm.Rd
>
>      > I do quite agree that  vcov() should be compatible with
>      > coef() [and summary()]  for both 'lm' and 'glm' methods, i.e.,
>      > should get NA rows and columns there.  This would require
>      > eliminating these before e.g. using it in solve(<vcov>, *) etc,
>      > but I think it would be a good idea that the useR must deal with
>      > these NAs actively.
>
>      > Shall "we" try and see the fallout in CRAN space?
>
>      >> Maybe there's some reason for all this that escapes me.
>      > (for the first one---"no error"--- I gave a reason)
>
>      >> Best,
>      >> John
>
>      >> --------------------------------------
>      >> John Fox, Professor Emeritus
>      >> McMaster University
>      >> Hamilton, Ontario, Canada
>      >> Web: socserv.mcmaster.ca/jfox
>
>
>
>
>      >>> -----Original Message-----
>      >>> From: R-devel [mailto:[hidden email]] On Behalf Of
>      >>> Therneau, Terry M., Ph.D.
>      >>> Sent: Wednesday, September 13, 2017 6:19 PM
>      >>> To: [hidden email]
>      >>> Subject: [Rd] vcov and survival
>      >>>
>      >>> I have just noticed a difference in behavior between coxph and lm/glm:
>      >>> if one or more of the coefficients from the fit in NA, then lm and glm
>      >>> omit that row/column from the variance matrix; while coxph retains it
>      >>> but sets the values to zero.
>      >>>
>      >>> Is this something that should be "fixed", i.e., made to agree? I
>      >>> suspect that doing so will break other packages, but then NA coefs are
>      >>> rather rare so perhaps not.
>      >>>
>      >>> Terry Therneau
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: vcov and survival

Fox, John
In reply to this post by Martin Maechler
Dear Martin,

I made three points which likely got lost because of the way I presented them:

(1) Singularity is an unusual situation and should be made more prominent. It typically reflects a problem with the data or the specification of the model. That's not to say that it *never* makes sense to allow singular fits (as in the situations you mentions).

I'd favour setting singular.ok=FALSE as the default, but in the absence of that a warning or at least a note. A compromise would be to have a singular.ok option() that would be FALSE out of the box.

Any changes would have to be made very carefully so as not to create chaos. That goes for the points below as well.

(2) coef() and vcov() behave inconsistently, which can be problematic because one often uses them together in code.

(3) As you noticed in your second message, lm() has a singular.ok argument and glm() doesn't.

I'll take a look at the code for glm() with an eye towards creating a patch, but I'm a bit reluctant to mess with the code for something as important as glm().

Best,
 John



> -----Original Message-----
> From: Martin Maechler [mailto:[hidden email]]
> Sent: Thursday, September 14, 2017 4:23 AM
> To: Martin Maechler <[hidden email]>
> Cc: Fox, John <[hidden email]>; Therneau, Terry M., Ph.D.
> <[hidden email]>; [hidden email]
> Subject: Re: [Rd] vcov and survival
>
> >>>>> Martin Maechler <[hidden email]>
> >>>>>     on Thu, 14 Sep 2017 10:13:02 +0200 writes:
>
> >>>>> Fox, John <[hidden email]>
> >>>>>     on Wed, 13 Sep 2017 22:45:07 +0000 writes:
>
>     >> Dear Terry,
>     >> Even the behaviour of lm() and glm() isn't entirely consistent. In both
> cases, singularity results in NA coefficients by default, and these are reported
> in the model summary and coefficient vector, but not in the coefficient
> covariance matrix:
>
>     >> ----------------
>
>     >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
>     >> +              data=longley)
>     >>> summary(mod.lm)
>
>     >> Call:
>     >> lm(formula = Employed ~ GNP + Population + I(GNP + Population),
>     >> data = longley)
>
>     >> Residuals:
>     >> Min       1Q   Median       3Q      Max
>     >> -0.80899 -0.33282 -0.02329  0.25895  1.08800
>
>     >> Coefficients: (1 not defined because of singularities)
>     >> Estimate Std. Error t value Pr(>|t|)
>     >> (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
>     >> GNP                  0.06317    0.01065   5.933 4.96e-05 ***
>     >> Population          -0.40974    0.15214  -2.693   0.0184 *
>     >> I(GNP + Population)       NA         NA      NA       NA
>     >> ---
>     >> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
>     >> Residual standard error: 0.5459 on 13 degrees of freedom
>     >> Multiple R-squared:  0.9791, Adjusted R-squared:  0.9758
>     >> F-statistic: 303.9 on 2 and 13 DF,  p-value: 1.221e-11
>
>     >>> vcov(mod.lm)
>     >> (Intercept)           GNP Population
>     >> (Intercept) 190.0269691  0.1445617813 -2.0954381
>     >> GNP           0.1445618  0.0001133631 -0.0016054
>     >> Population   -2.0954381 -0.0016053999  0.0231456
>     >>> coef(mod.lm)
>     >> (Intercept)                 GNP          Population I(GNP + Population)
>     >> 88.93879831          0.06317244         -0.40974292                  NA
>     >>>
>     >>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP + Population),
>     >> +               data=longley)
>     >>> summary(mod.glm)
>
>     >> Call:
>     >> glm(formula = Employed ~ GNP + Population + I(GNP + Population),
>     >> data = longley)
>
>     >> Deviance Residuals:
>     >> Min        1Q    Median        3Q       Max
>     >> -0.80899  -0.33282  -0.02329   0.25895   1.08800
>
>     >> Coefficients: (1 not defined because of singularities)
>     >> Estimate Std. Error t value Pr(>|t|)
>     >> (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
>     >> GNP                  0.06317    0.01065   5.933 4.96e-05 ***
>     >> Population          -0.40974    0.15214  -2.693   0.0184 *
>     >> I(GNP + Population)       NA         NA      NA       NA
>     >> ---
>     >> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
>     >> (Dispersion parameter for gaussian family taken to be 0.2980278)
>
>     >> Null deviance: 185.0088  on 15  degrees of freedom
>     >> Residual deviance:   3.8744  on 13  degrees of freedom
>     >> AIC: 30.715
>
>     >> Number of Fisher Scoring iterations: 2
>
>     >>> coef(mod.glm)
>     >> (Intercept)                 GNP          Population I(GNP + Population)
>     >> 88.93879831          0.06317244         -0.40974292                  NA
>     >>> vcov(mod.glm)
>     >> (Intercept)           GNP Population
>     >> (Intercept) 190.0269691  0.1445617813 -2.0954381
>     >> GNP           0.1445618  0.0001133631 -0.0016054
>     >> Population   -2.0954381 -0.0016053999  0.0231456
>
>     >> ----------------
>
>     >> Moreoever, lm() has a singular.ok() argument that defaults to TRUE, but
> glm() doesn't have this argument:
>
>     >> ----------------
>
>     >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
>     >> +              data=longley, singular.ok=FALSE)
>     >> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
>     >> singular fit encountered
>
>     >> ----------------
>
>     >> In my opinion, singularity should at least produce a warning, both in calls
> to lm() and glm(), and in summary() output. Even better, again in my opinion,
> would be to produce an error by default in this situation, but doing so would
> likely break too much existing code.
>
>     > Yes, I would not want to change.  Note that this is from S
>     > already, i.e., long "ingrained".  I think there one argument was
>     > that there are situations with factor predictors of many levels
>     > and conceptually their 2- or even 3-way interactions (!)
>     > where it is neat to just fit the model, (-> get residuals and
>     > fitted values) and also see implicitly the "necessary rank" of
>     > prediction space, or rather even more specifically, you see for
>     > every factor how many levels are "distinguishable"/useful for
>     > prediction, given the data.
>
>     >> I prefer NA to 0 for the redundant coefficients because it at least suggests
> that the decision about what to exclude is arbitrary, and of course simply
> excluding coefficients isn't the only way to proceed.
>
>     > I'm less modest and would say *definitely*, NA's are highly
>     > prefered in such a situation.
>
>     >> Finally, the differences in behaviour between coef() and vcov() and
> between lm() and glm() aren't really sensible.
>
>     > I really haven't seen any difference between lm() and glm() in
>     > the example above.  Maybe you can point them out for me.
>
> .. now I saw it:
>    lm() has  a 'singular.ok = TRUE' argument
>    which you can set to FALSE if you prefer an error to NA coefficients.
>
> I also agree with you John that it would be nice if  glm() also got such an
> argument.
> Patches are welcome and seem easy. Nowadays we prefer them as
> attachments (diff/patch file!) at R's
>   https://bugs.r-project.org bugzilla
> against the svn source, here
>   https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R
> and
>   https://svn.r-project.org/R/trunk/src/library/stats/man/glm.Rd
>
>     > I do quite agree that  vcov() should be compatible with
>     > coef() [and summary()]  for both 'lm' and 'glm' methods, i.e.,
>     > should get NA rows and columns there.  This would require
>     > eliminating these before e.g. using it in solve(<vcov>, *) etc,
>     > but I think it would be a good idea that the useR must deal with
>     > these NAs actively.
>
>     > Shall "we" try and see the fallout in CRAN space?
>
>     >> Maybe there's some reason for all this that escapes me.
>     > (for the first one---"no error"--- I gave a reason)
>
>     >> Best,
>     >> John
>
>     >> --------------------------------------
>     >> John Fox, Professor Emeritus
>     >> McMaster University
>     >> Hamilton, Ontario, Canada
>     >> Web: socserv.mcmaster.ca/jfox
>
>
>
>
>     >>> -----Original Message-----
>     >>> From: R-devel [mailto:[hidden email]] On Behalf Of
>     >>> Therneau, Terry M., Ph.D.
>     >>> Sent: Wednesday, September 13, 2017 6:19 PM
>     >>> To: [hidden email]
>     >>> Subject: [Rd] vcov and survival
>     >>>
>     >>> I have just noticed a difference in behavior between coxph and lm/glm:
>     >>> if one or more of the coefficients from the fit in NA, then lm and glm
>     >>> omit that row/column from the variance matrix; while coxph retains it
>     >>> but sets the values to zero.
>     >>>
>     >>> Is this something that should be "fixed", i.e., made to agree? I
>     >>> suspect that doing so will break other packages, but then NA coefs are
>     >>> rather rare so perhaps not.
>     >>>
>     >>> Terry Therneau

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: vcov and survival

Fox, John
In reply to this post by Martin Maechler
Dear Terry,

It's not surprising that different modeling functions behave differently in this respect because there's no articulated standard.

Please see my response to Martin for my take on the singular.ok argument. For a highly sophisticated user like you, singular.ok=TRUE isn't problematic -- you're not going to fail to notice an NA in the coefficient vector -- but I've seen students, e.g., doing exactly that. In principle having a singular.ok option defaulting to FALSE would satisfy everyone, but would probably break too much existing code.

Best,
 John

> -----Original Message-----
> From: Therneau, Terry M., Ph.D. [mailto:[hidden email]]
> Sent: Thursday, September 14, 2017 8:41 AM
> To: Martin Maechler <[hidden email]>
> Cc: Fox, John <[hidden email]>; Therneau, Terry M., Ph.D.
> <[hidden email]>; [hidden email]
> Subject: Re: [Rd] vcov and survival
>
> Thanks all for your comments.  No one said "all the other vcov methods do
> ....", so I took some time this AM to look at several listed in the vcov help page.
> Here is the code for the first few examples: data2 is constructed specifically to
> create an NA coef midway in the list.
>
> data1 <- data.frame(y = c(1,2,10,50, 5, 4, 8, 40, 60, 20, 21, 22,
>                            3,5,12,52, 7, 8,16, 48, 58, 28, 20,5),
>                      x1 = factor(letters[rep(1:3, length=24)]),
>                      x2 = factor(LETTERS[rep(1:4, length=24)]),
>                      x3 = factor(rep(1:7, length=24)))
> data2 <- subset(data1, x1 !='a' | x2 != 'C')
>
> fit1 <- lm(y ~ x1*x2, data2)
> table(is.na(coef(fit1)))
> dim(vcov(fit1))
>
> fit2 <- glm(y ~ x1*x2, data=data2, poisson)
> table(is.na(coef(fit2)))
> dim(vcov(fit2))
>
> fit3 <- lme(y ~ x1*x2, random= ~1|x3, data2)
>
> 1. lm, mlm, glm, negbin objects all have an NA in coef(fit); and remove NA
> columns from the vcov object.
>
> 2. I expected polr to return a generalized inverse of the Hessian since vcov.polr
> has a call to ginv(object$Hessian), but it shortcuts earlier with a message
>   "design appears to be rank-deficient, so dropping some coefs"
> The undetermined coef appears in neither coef() more vcov().
>
> 3. rlm declares that it does not work with singular data.
>
> 4. multinom returns values for all coefficients and a full variance matrix.
> However, the returned variance is rank-deficient.  It is essentially a g-inverse of
> the Hessian.
>
> 5. coxph and survreg report an NA coef, and return a generalized inverse of the
> Hessian matrix.  The g-inverse was chosen to be a particularly easy one in that
> you can spot redundant colums via a row/col of zeros.
>
> 6. nlme fails with a singularity error.  I didn't check out gls.
>
> So my original question of whether I should make coxph consistent with the
> others has no answer, the 'others' are not consistent.
>
> In response to two other points:
>   >> In my opinion singularity should at least produce a warning...
> I was one of those who lobbied heavily to change the singular.ok=FALSE
> default of lm to TRUE.  Data is messy, I have work to do, and don't need a
> package constantly harping at me.
>
> In the same vein, stuffing NA into the vcov result is more pure, but would
> cause a lot of hassle.  I'm not sure that it is worth it.
>
> For now, coxph will stay as is.
> But again, thanks to all for comments and I'll look forward to any more
> discussion.
>
> Terry T.
>
>
> On 09/14/2017 03:23 AM, Martin Maechler wrote:
> >>>>>> Martin Maechler <[hidden email]>
> >>>>>>      on Thu, 14 Sep 2017 10:13:02 +0200 writes:
> >
> >>>>>> Fox, John <[hidden email]>
> >>>>>>      on Wed, 13 Sep 2017 22:45:07 +0000 writes:
> >
> >      >> Dear Terry,
> >      >> Even the behaviour of lm() and glm() isn't entirely consistent. In both
> cases, singularity results in NA coefficients by default, and these are reported
> in the model summary and coefficient vector, but not in the coefficient
> covariance matrix:
> >
> >      >> ----------------
> >
> >      >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
> >      >> +              data=longley)
> >      >>> summary(mod.lm)
> >
> >      >> Call:
> >      >> lm(formula = Employed ~ GNP + Population + I(GNP + Population),
> >      >> data = longley)
> >
> >      >> Residuals:
> >      >> Min       1Q   Median       3Q      Max
> >      >> -0.80899 -0.33282 -0.02329  0.25895  1.08800
> >
> >      >> Coefficients: (1 not defined because of singularities)
> >      >> Estimate Std. Error t value Pr(>|t|)
> >      >> (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
> >      >> GNP                  0.06317    0.01065   5.933 4.96e-05 ***
> >      >> Population          -0.40974    0.15214  -2.693   0.0184 *
> >      >> I(GNP + Population)       NA         NA      NA       NA
> >      >> ---
> >      >> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> >
> >      >> Residual standard error: 0.5459 on 13 degrees of freedom
> >      >> Multiple R-squared:  0.9791, Adjusted R-squared:  0.9758
> >      >> F-statistic: 303.9 on 2 and 13 DF,  p-value: 1.221e-11
> >
> >      >>> vcov(mod.lm)
> >      >> (Intercept)           GNP Population
> >      >> (Intercept) 190.0269691  0.1445617813 -2.0954381
> >      >> GNP           0.1445618  0.0001133631 -0.0016054
> >      >> Population   -2.0954381 -0.0016053999  0.0231456
> >      >>> coef(mod.lm)
> >      >> (Intercept)                 GNP          Population I(GNP + Population)
> >      >> 88.93879831          0.06317244         -0.40974292                  NA
> >      >>>
> >      >>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP + Population),
> >      >> +               data=longley)
> >      >>> summary(mod.glm)
> >
> >      >> Call:
> >      >> glm(formula = Employed ~ GNP + Population + I(GNP + Population),
> >      >> data = longley)
> >
> >      >> Deviance Residuals:
> >      >> Min        1Q    Median        3Q       Max
> >      >> -0.80899  -0.33282  -0.02329   0.25895   1.08800
> >
> >      >> Coefficients: (1 not defined because of singularities)
> >      >> Estimate Std. Error t value Pr(>|t|)
> >      >> (Intercept)         88.93880   13.78503   6.452 2.16e-05 ***
> >      >> GNP                  0.06317    0.01065   5.933 4.96e-05 ***
> >      >> Population          -0.40974    0.15214  -2.693   0.0184 *
> >      >> I(GNP + Population)       NA         NA      NA       NA
> >      >> ---
> >      >> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> >
> >      >> (Dispersion parameter for gaussian family taken to be
> > 0.2980278)
> >
> >      >> Null deviance: 185.0088  on 15  degrees of freedom
> >      >> Residual deviance:   3.8744  on 13  degrees of freedom
> >      >> AIC: 30.715
> >
> >      >> Number of Fisher Scoring iterations: 2
> >
> >      >>> coef(mod.glm)
> >      >> (Intercept)                 GNP          Population I(GNP + Population)
> >      >> 88.93879831          0.06317244         -0.40974292                  NA
> >      >>> vcov(mod.glm)
> >      >> (Intercept)           GNP Population
> >      >> (Intercept) 190.0269691  0.1445617813 -2.0954381
> >      >> GNP           0.1445618  0.0001133631 -0.0016054
> >      >> Population   -2.0954381 -0.0016053999  0.0231456
> >
> >      >> ----------------
> >
> >      >> Moreoever, lm() has a singular.ok() argument that defaults to TRUE,
> but glm() doesn't have this argument:
> >
> >      >> ----------------
> >
> >      >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
> >      >> +              data=longley, singular.ok=FALSE)
> >      >> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
> >      >> singular fit encountered
> >
> >      >> ----------------
> >
> >      >> In my opinion, singularity should at least produce a warning, both in
> calls to lm() and glm(), and in summary() output. Even better, again in my
> opinion, would be to produce an error by default in this situation, but doing so
> would likely break too much existing code.
> >
> >      > Yes, I would not want to change.  Note that this is from S
> >      > already, i.e., long "ingrained".  I think there one argument was
> >      > that there are situations with factor predictors of many levels
> >      > and conceptually their 2- or even 3-way interactions (!)
> >      > where it is neat to just fit the model, (-> get residuals and
> >      > fitted values) and also see implicitly the "necessary rank" of
> >      > prediction space, or rather even more specifically, you see for
> >      > every factor how many levels are "distinguishable"/useful for
> >      > prediction, given the data.
> >
> >      >> I prefer NA to 0 for the redundant coefficients because it at least
> suggests that the decision about what to exclude is arbitrary, and of course
> simply excluding coefficients isn't the only way to proceed.
> >
> >      > I'm less modest and would say *definitely*, NA's are highly
> >      > prefered in such a situation.
> >
> >      >> Finally, the differences in behaviour between coef() and vcov() and
> between lm() and glm() aren't really sensible.
> >
> >      > I really haven't seen any difference between lm() and glm() in
> >      > the example above.  Maybe you can point them out for me.
> >
> > .. now I saw it:
> >     lm() has  a 'singular.ok = TRUE' argument
> >     which you can set to FALSE if you prefer an error to NA coefficients.
> >
> > I also agree with you John that it would be nice if  glm() also got
> > such an argument.
> > Patches are welcome and seem easy. Nowadays we prefer them as
> > attachments (diff/patch file!) at R's
> >    https://bugs.r-project.org bugzilla against the svn source, here
> >    https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R
> > and
> >    https://svn.r-project.org/R/trunk/src/library/stats/man/glm.Rd
> >
> >      > I do quite agree that  vcov() should be compatible with
> >      > coef() [and summary()]  for both 'lm' and 'glm' methods, i.e.,
> >      > should get NA rows and columns there.  This would require
> >      > eliminating these before e.g. using it in solve(<vcov>, *) etc,
> >      > but I think it would be a good idea that the useR must deal with
> >      > these NAs actively.
> >
> >      > Shall "we" try and see the fallout in CRAN space?
> >
> >      >> Maybe there's some reason for all this that escapes me.
> >      > (for the first one---"no error"--- I gave a reason)
> >
> >      >> Best,
> >      >> John
> >
> >      >> --------------------------------------
> >      >> John Fox, Professor Emeritus
> >      >> McMaster University
> >      >> Hamilton, Ontario, Canada
> >      >> Web: socserv.mcmaster.ca/jfox
> >
> >
> >
> >
> >      >>> -----Original Message-----
> >      >>> From: R-devel [mailto:[hidden email]] On Behalf Of
> >      >>> Therneau, Terry M., Ph.D.
> >      >>> Sent: Wednesday, September 13, 2017 6:19 PM
> >      >>> To: [hidden email]
> >      >>> Subject: [Rd] vcov and survival
> >      >>>
> >      >>> I have just noticed a difference in behavior between coxph and
> lm/glm:
> >      >>> if one or more of the coefficients from the fit in NA, then lm and glm
> >      >>> omit that row/column from the variance matrix; while coxph retains it
> >      >>> but sets the values to zero.
> >      >>>
> >      >>> Is this something that should be "fixed", i.e., made to agree? I
> >      >>> suspect that doing so will break other packages, but then NA coefs
> are
> >      >>> rather rare so perhaps not.
> >      >>>
> >      >>> Terry Therneau
> >
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: vcov and survival

Therneau, Terry M., Ph.D.
In reply to this post by Fox, John


On 09/14/2017 08:46 AM, Fox, John wrote:
> Dear Martin,
>
> I made three points which likely got lost because of the way I presented them:
>
> (1) Singularity is an unusual situation and should be made more prominent. It typically reflects a problem with the data or the specification of the model. That's not to say that it*never*  makes sense to allow singular fits (as in the situations you mentions).

In my medical work singularity is far from unusual.  It often results from imbalance in a
covariate, e.g., there are 4 pathological stages but one of them turns out to be rare.

>
> I'd favour setting singular.ok=FALSE as the default, but in the absence of that a warning or at least a note. A compromise would be to have a singular.ok option() that would be FALSE out of the box.

Originally the lm() default was singular.ok=FALSE.  It was a major pain in the ass and
there was widespread unhappiness.  Enough so that Splus changed it.  (And they were not
always very responsive to the users, so it took a lot of unrest.)   Another early default
was na.fail, based on similar logic that "missings are unusual and should require an
explicit response".  Don't repeat these mistakes.

Terry T.

>
> Any changes would have to be made very carefully so as not to create chaos. That goes for the points below as well.
>
> (2) coef() and vcov() behave inconsistently, which can be problematic because one often uses them together in code.
>
> (3) As you noticed in your second message, lm() has a singular.ok argument and glm() doesn't.
>
> I'll take a look at the code for glm() with an eye towards creating a patch, but I'm a bit reluctant to mess with the code for something as important as glm().
>
> Best,
>   John

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: vcov and survival

Martin Maechler
In reply to this post by Fox, John
>>>>> Fox, John <[hidden email]>
>>>>>     on Thu, 14 Sep 2017 13:46:44 +0000 writes:

    > Dear Martin, I made three points which likely got lost
    > because of the way I presented them:

    > (1) Singularity is an unusual situation and should be made
    > more prominent. It typically reflects a problem with the
    > data or the specification of the model. That's not to say
    > that it *never* makes sense to allow singular fits (as in
    > the situations you mentions).

    > I'd favour setting singular.ok=FALSE as the default, but
    > in the absence of that a warning or at least a note. A
    > compromise would be to have a singular.ok option() that
    > would be FALSE out of the box.

    > Any changes would have to be made very carefully so as not
    > to create chaos.

I for one, am too reluctant to want to change the default there.

    >  That goes for the points below as well.

    > (2) coef() and vcov() behave inconsistently, which can be
    > problematic because one often uses them together in code.

indeed; and I had agreed on that.
As of today, in R-devel only they now behave compatibly.
NEWS entry

    • The “default” ("lm" etc) methods of vcov() have gained new
      optional argument complete = TRUE which makes the vcov() methods
      more consistent with the coef() methods in the case of singular
      designs.  The former behavior is now achieved by vcov(*,
      complete=FALSE).


    > (3) As you noticed in your second message, lm() has a
    > singular.ok argument and glm() doesn't.

and that has been amended even earlier (a bit more than a month
ago) in R-devel svn rev 73380 with  NEWS  entry

    • glm() and glm.fit get the same singular.ok=TRUE argument that
      lm() has had forever.  As a consequence, in glm(*, method =
      <your_own>), user specified methods need to accept a singular.ok
      argument as well.

    > I'll take a look at the code for glm() with an eye towards
    > creating a patch, but I'm a bit reluctant to mess with the
    > code for something as important as glm().

and as a matter of fact you did send me +- the R code part of
that change.

My current plan is to also add the  'complete = TRUE' option to the
"basic" coef() methods, such that you also have consistent
coef(*, complete=FALSE)  and
vcov(*, complete=FALSE)  behaviors.

Thank you and Terry (and others?) for bringing up the issues and
discussing them thoroughly!

Best,
Martin.


    > Best, John



    >> -----Original Message----- From: Martin Maechler
    >> [mailto:[hidden email]] Sent: Thursday,
    >> September 14, 2017 4:23 AM To: Martin Maechler
    >> <[hidden email]> Cc: Fox, John
    >> <[hidden email]>; Therneau, Terry M., Ph.D.
    >> <[hidden email]>; [hidden email] Subject: Re:
    >> [Rd] vcov and survival
    >>
    >> >>>>> Martin Maechler <[hidden email]> >>>>>
    >> on Thu, 14 Sep 2017 10:13:02 +0200 writes:
    >>
    >> >>>>> Fox, John <[hidden email]> >>>>> on Wed, 13 Sep
    >> 2017 22:45:07 +0000 writes:
    >>
    >> >> Dear Terry, >> Even the behaviour of lm() and glm()
    >> isn't entirely consistent. In both cases, singularity
    >> results in NA coefficients by default, and these are
    >> reported in the model summary and coefficient vector, but
    >> not in the coefficient covariance matrix:
    >>
    >> >> ----------------
    >>
    >> >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP +
    >> Population), >> + data=longley) >>> summary(mod.lm)
    >>
    >> >> Call: >> lm(formula = Employed ~ GNP + Population +
    >> I(GNP + Population), >> data = longley)
    >>
    >> >> Residuals: >> Min 1Q Median 3Q Max >> -0.80899
    >> -0.33282 -0.02329 0.25895 1.08800
    >>
    >> >> Coefficients: (1 not defined because of singularities)
    >> >> Estimate Std. Error t value Pr(>|t|) >> (Intercept)
    >> 88.93880 13.78503 6.452 2.16e-05 *** >> GNP 0.06317
    >> 0.01065 5.933 4.96e-05 *** >> Population -0.40974 0.15214
    >> -2.693 0.0184 * >> I(GNP + Population) NA NA NA NA
    >> >> ---
    >> >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
    >> 0.1 ' ' 1
    >>
    >> >> Residual standard error: 0.5459 on 13 degrees of
    >> freedom >> Multiple R-squared: 0.9791, Adjusted
    >> R-squared: 0.9758 >> F-statistic: 303.9 on 2 and 13 DF,
    >> p-value: 1.221e-11
    >>
    >> >>> vcov(mod.lm) >> (Intercept) GNP Population >>
    >> (Intercept) 190.0269691 0.1445617813 -2.0954381 >> GNP
    >> 0.1445618 0.0001133631 -0.0016054 >> Population
    >> -2.0954381 -0.0016053999 0.0231456 >>> coef(mod.lm) >>
    >> (Intercept) GNP Population I(GNP + Population) >>
    >> 88.93879831 0.06317244 -0.40974292 NA
    >> >>>
    >> >>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP +
    >> Population), >> + data=longley) >>> summary(mod.glm)
    >>
    >> >> Call: >> glm(formula = Employed ~ GNP + Population +
    >> I(GNP + Population), >> data = longley)
    >>
    >> >> Deviance Residuals: >> Min 1Q Median 3Q Max >>
    >> -0.80899 -0.33282 -0.02329 0.25895 1.08800
    >>
    >> >> Coefficients: (1 not defined because of singularities)
    >> >> Estimate Std. Error t value Pr(>|t|) >> (Intercept)
    >> 88.93880 13.78503 6.452 2.16e-05 *** >> GNP 0.06317
    >> 0.01065 5.933 4.96e-05 *** >> Population -0.40974 0.15214
    >> -2.693 0.0184 * >> I(GNP + Population) NA NA NA NA
    >> >> ---
    >> >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
    >> 0.1 ' ' 1
    >>
    >> >> (Dispersion parameter for gaussian family taken to be
    >> 0.2980278)
    >>
    >> >> Null deviance: 185.0088 on 15 degrees of freedom >>
    >> Residual deviance: 3.8744 on 13 degrees of freedom >>
    >> AIC: 30.715
    >>
    >> >> Number of Fisher Scoring iterations: 2
    >>
    >> >>> coef(mod.glm) >> (Intercept) GNP Population I(GNP +
    >> Population) >> 88.93879831 0.06317244 -0.40974292 NA >>>
    >> vcov(mod.glm) >> (Intercept) GNP Population >>
    >> (Intercept) 190.0269691 0.1445617813 -2.0954381 >> GNP
    >> 0.1445618 0.0001133631 -0.0016054 >> Population
    >> -2.0954381 -0.0016053999 0.0231456
    >>
    >> >> ----------------
    >>
    >> >> Moreoever, lm() has a singular.ok() argument that
    >> defaults to TRUE, but glm() doesn't have this argument:
    >>
    >> >> ----------------
    >>
    >> >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP +
    >> Population), >> + data=longley, singular.ok=FALSE) >>
    >> Error in lm.fit(x, y, offset = offset, singular.ok =
    >> singular.ok, ...) : >> singular fit encountered
    >>
    >> >> ----------------
    >>
    >> >> In my opinion, singularity should at least produce a
    >> warning, both in calls to lm() and glm(), and in
    >> summary() output. Even better, again in my opinion, would
    >> be to produce an error by default in this situation, but
    >> doing so would likely break too much existing code.
    >>
    >> > Yes, I would not want to change.  Note that this is
    >> from S > already, i.e., long "ingrained".  I think there
    >> one argument was > that there are situations with factor
    >> predictors of many levels > and conceptually their 2- or
    >> even 3-way interactions (!)  > where it is neat to just
    >> fit the model, (-> get residuals and > fitted values) and
    >> also see implicitly the "necessary rank" of > prediction
    >> space, or rather even more specifically, you see for >
    >> every factor how many levels are "distinguishable"/useful
    >> for > prediction, given the data.
    >>
    >> >> I prefer NA to 0 for the redundant coefficients
    >> because it at least suggests that the decision about what
    >> to exclude is arbitrary, and of course simply excluding
    >> coefficients isn't the only way to proceed.
    >>
    >> > I'm less modest and would say *definitely*, NA's are
    >> highly > prefered in such a situation.
    >>
    >> >> Finally, the differences in behaviour between coef()
    >> and vcov() and between lm() and glm() aren't really
    >> sensible.
    >>
    >> > I really haven't seen any difference between lm() and
    >> glm() in > the example above.  Maybe you can point them
    >> out for me.
    >>
    >> .. now I saw it: lm() has a 'singular.ok = TRUE' argument
    >> which you can set to FALSE if you prefer an error to NA
    >> coefficients.
    >>
    >> I also agree with you John that it would be nice if glm()
    >> also got such an argument.  Patches are welcome and seem
    >> easy. Nowadays we prefer them as attachments (diff/patch
    >> file!) at R's https://bugs.r-project.org bugzilla against
    >> the svn source, here
    >> https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R
    >> and
    >> https://svn.r-project.org/R/trunk/src/library/stats/man/glm.Rd
    >>
    >> > I do quite agree that vcov() should be compatible with
    >> > coef() [and summary()] for both 'lm' and 'glm' methods,
    >> i.e., > should get NA rows and columns there.  This would
    >> require > eliminating these before e.g. using it in
    >> solve(<vcov>, *) etc, > but I think it would be a good
    >> idea that the useR must deal with > these NAs actively.
    >>
    >> > Shall "we" try and see the fallout in CRAN space?
    >>
    >> >> Maybe there's some reason for all this that escapes
    >> me.  > (for the first one---"no error"--- I gave a
    >> reason)
    >>
    >> >> Best, >> John
    >>
    >> >> --------------------------------------
    >> >> John Fox, Professor Emeritus >> McMaster University >>
    >> Hamilton, Ontario, Canada >> Web:
    >> socserv.mcmaster.ca/jfox
    >>
    >>
    >>
    >>
    >> >>> -----Original Message----- >>> From: R-devel
    >> [mailto:[hidden email]] On Behalf Of >>>
    >> Therneau, Terry M., Ph.D.  >>> Sent: Wednesday, September
    >> 13, 2017 6:19 PM >>> To: [hidden email] >>>
    >> Subject: [Rd] vcov and survival
    >> >>>
    >> >>> I have just noticed a difference in behavior between
    >> coxph and lm/glm: >>> if one or more of the coefficients
    >> from the fit in NA, then lm and glm >>> omit that
    >> row/column from the variance matrix; while coxph retains
    >> it >>> but sets the values to zero.
    >> >>>
    >> >>> Is this something that should be "fixed", i.e., made
    >> to agree? I >>> suspect that doing so will break other
    >> packages, but then NA coefs are >>> rather rare so
    >> perhaps not.
    >> >>>
    >> >>> Terry Therneau

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: vcov and survival

Fox, John
Dear Martin,

Thank you for taking care of this.

Best,
 John

> -----Original Message-----
> From: Martin Maechler [mailto:[hidden email]]
> Sent: Thursday, November 2, 2017 4:59 PM
> To: Fox, John <[hidden email]>
> Cc: Martin Maechler <[hidden email]>; Therneau, Terry M.,
> Ph.D. <[hidden email]>; [hidden email]
> Subject: RE: [Rd] vcov and survival
>
> >>>>> Fox, John <[hidden email]>
> >>>>>     on Thu, 14 Sep 2017 13:46:44 +0000 writes:
>
>     > Dear Martin, I made three points which likely got lost
>     > because of the way I presented them:
>
>     > (1) Singularity is an unusual situation and should be made
>     > more prominent. It typically reflects a problem with the
>     > data or the specification of the model. That's not to say
>     > that it *never* makes sense to allow singular fits (as in
>     > the situations you mentions).
>
>     > I'd favour setting singular.ok=FALSE as the default, but
>     > in the absence of that a warning or at least a note. A
>     > compromise would be to have a singular.ok option() that
>     > would be FALSE out of the box.
>
>     > Any changes would have to be made very carefully so as not
>     > to create chaos.
>
> I for one, am too reluctant to want to change the default there.
>
>     >  That goes for the points below as well.
>
>     > (2) coef() and vcov() behave inconsistently, which can be
>     > problematic because one often uses them together in code.
>
> indeed; and I had agreed on that.
> As of today, in R-devel only they now behave compatibly.
> NEWS entry
>
>     • The “default” ("lm" etc) methods of vcov() have gained new
>       optional argument complete = TRUE which makes the vcov() methods
>       more consistent with the coef() methods in the case of singular
>       designs.  The former behavior is now achieved by vcov(*,
>       complete=FALSE).
>
>
>     > (3) As you noticed in your second message, lm() has a
>     > singular.ok argument and glm() doesn't.
>
> and that has been amended even earlier (a bit more than a month
> ago) in R-devel svn rev 73380 with  NEWS  entry
>
>     • glm() and glm.fit get the same singular.ok=TRUE argument that
>       lm() has had forever.  As a consequence, in glm(*, method =
>       <your_own>), user specified methods need to accept a singular.ok
>       argument as well.
>
>     > I'll take a look at the code for glm() with an eye towards
>     > creating a patch, but I'm a bit reluctant to mess with the
>     > code for something as important as glm().
>
> and as a matter of fact you did send me +- the R code part of that change.
>
> My current plan is to also add the  'complete = TRUE' option to the "basic"
> coef() methods, such that you also have consistent coef(*, complete=FALSE)
> and vcov(*, complete=FALSE)  behaviors.
>
> Thank you and Terry (and others?) for bringing up the issues and discussing
> them thoroughly!
>
> Best,
> Martin.
>
>
>     > Best, John
>
>
>
>     >> -----Original Message----- From: Martin Maechler
>     >> [mailto:[hidden email]] Sent: Thursday,
>     >> September 14, 2017 4:23 AM To: Martin Maechler
>     >> <[hidden email]> Cc: Fox, John
>     >> <[hidden email]>; Therneau, Terry M., Ph.D.
>     >> <[hidden email]>; [hidden email] Subject: Re:
>     >> [Rd] vcov and survival
>     >>
>     >> >>>>> Martin Maechler <[hidden email]> >>>>>
>     >> on Thu, 14 Sep 2017 10:13:02 +0200 writes:
>     >>
>     >> >>>>> Fox, John <[hidden email]> >>>>> on Wed, 13 Sep
>     >> 2017 22:45:07 +0000 writes:
>     >>
>     >> >> Dear Terry, >> Even the behaviour of lm() and glm()
>     >> isn't entirely consistent. In both cases, singularity
>     >> results in NA coefficients by default, and these are
>     >> reported in the model summary and coefficient vector, but
>     >> not in the coefficient covariance matrix:
>     >>
>     >> >> ----------------
>     >>
>     >> >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP +
>     >> Population), >> + data=longley) >>> summary(mod.lm)
>     >>
>     >> >> Call: >> lm(formula = Employed ~ GNP + Population +
>     >> I(GNP + Population), >> data = longley)
>     >>
>     >> >> Residuals: >> Min 1Q Median 3Q Max >> -0.80899
>     >> -0.33282 -0.02329 0.25895 1.08800
>     >>
>     >> >> Coefficients: (1 not defined because of singularities)
>     >> >> Estimate Std. Error t value Pr(>|t|) >> (Intercept)
>     >> 88.93880 13.78503 6.452 2.16e-05 *** >> GNP 0.06317
>     >> 0.01065 5.933 4.96e-05 *** >> Population -0.40974 0.15214
>     >> -2.693 0.0184 * >> I(GNP + Population) NA NA NA NA
>     >> >> ---
>     >> >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
>     >> 0.1 ' ' 1
>     >>
>     >> >> Residual standard error: 0.5459 on 13 degrees of
>     >> freedom >> Multiple R-squared: 0.9791, Adjusted
>     >> R-squared: 0.9758 >> F-statistic: 303.9 on 2 and 13 DF,
>     >> p-value: 1.221e-11
>     >>
>     >> >>> vcov(mod.lm) >> (Intercept) GNP Population >>
>     >> (Intercept) 190.0269691 0.1445617813 -2.0954381 >> GNP
>     >> 0.1445618 0.0001133631 -0.0016054 >> Population
>     >> -2.0954381 -0.0016053999 0.0231456 >>> coef(mod.lm) >>
>     >> (Intercept) GNP Population I(GNP + Population) >>
>     >> 88.93879831 0.06317244 -0.40974292 NA
>     >> >>>
>     >> >>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP +
>     >> Population), >> + data=longley) >>> summary(mod.glm)
>     >>
>     >> >> Call: >> glm(formula = Employed ~ GNP + Population +
>     >> I(GNP + Population), >> data = longley)
>     >>
>     >> >> Deviance Residuals: >> Min 1Q Median 3Q Max >>
>     >> -0.80899 -0.33282 -0.02329 0.25895 1.08800
>     >>
>     >> >> Coefficients: (1 not defined because of singularities)
>     >> >> Estimate Std. Error t value Pr(>|t|) >> (Intercept)
>     >> 88.93880 13.78503 6.452 2.16e-05 *** >> GNP 0.06317
>     >> 0.01065 5.933 4.96e-05 *** >> Population -0.40974 0.15214
>     >> -2.693 0.0184 * >> I(GNP + Population) NA NA NA NA
>     >> >> ---
>     >> >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
>     >> 0.1 ' ' 1
>     >>
>     >> >> (Dispersion parameter for gaussian family taken to be
>     >> 0.2980278)
>     >>
>     >> >> Null deviance: 185.0088 on 15 degrees of freedom >>
>     >> Residual deviance: 3.8744 on 13 degrees of freedom >>
>     >> AIC: 30.715
>     >>
>     >> >> Number of Fisher Scoring iterations: 2
>     >>
>     >> >>> coef(mod.glm) >> (Intercept) GNP Population I(GNP +
>     >> Population) >> 88.93879831 0.06317244 -0.40974292 NA >>>
>     >> vcov(mod.glm) >> (Intercept) GNP Population >>
>     >> (Intercept) 190.0269691 0.1445617813 -2.0954381 >> GNP
>     >> 0.1445618 0.0001133631 -0.0016054 >> Population
>     >> -2.0954381 -0.0016053999 0.0231456
>     >>
>     >> >> ----------------
>     >>
>     >> >> Moreoever, lm() has a singular.ok() argument that
>     >> defaults to TRUE, but glm() doesn't have this argument:
>     >>
>     >> >> ----------------
>     >>
>     >> >>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP +
>     >> Population), >> + data=longley, singular.ok=FALSE) >>
>     >> Error in lm.fit(x, y, offset = offset, singular.ok =
>     >> singular.ok, ...) : >> singular fit encountered
>     >>
>     >> >> ----------------
>     >>
>     >> >> In my opinion, singularity should at least produce a
>     >> warning, both in calls to lm() and glm(), and in
>     >> summary() output. Even better, again in my opinion, would
>     >> be to produce an error by default in this situation, but
>     >> doing so would likely break too much existing code.
>     >>
>     >> > Yes, I would not want to change.  Note that this is
>     >> from S > already, i.e., long "ingrained".  I think there
>     >> one argument was > that there are situations with factor
>     >> predictors of many levels > and conceptually their 2- or
>     >> even 3-way interactions (!)  > where it is neat to just
>     >> fit the model, (-> get residuals and > fitted values) and
>     >> also see implicitly the "necessary rank" of > prediction
>     >> space, or rather even more specifically, you see for >
>     >> every factor how many levels are "distinguishable"/useful
>     >> for > prediction, given the data.
>     >>
>     >> >> I prefer NA to 0 for the redundant coefficients
>     >> because it at least suggests that the decision about what
>     >> to exclude is arbitrary, and of course simply excluding
>     >> coefficients isn't the only way to proceed.
>     >>
>     >> > I'm less modest and would say *definitely*, NA's are
>     >> highly > prefered in such a situation.
>     >>
>     >> >> Finally, the differences in behaviour between coef()
>     >> and vcov() and between lm() and glm() aren't really
>     >> sensible.
>     >>
>     >> > I really haven't seen any difference between lm() and
>     >> glm() in > the example above.  Maybe you can point them
>     >> out for me.
>     >>
>     >> .. now I saw it: lm() has a 'singular.ok = TRUE' argument
>     >> which you can set to FALSE if you prefer an error to NA
>     >> coefficients.
>     >>
>     >> I also agree with you John that it would be nice if glm()
>     >> also got such an argument.  Patches are welcome and seem
>     >> easy. Nowadays we prefer them as attachments (diff/patch
>     >> file!) at R's https://bugs.r-project.org bugzilla against
>     >> the svn source, here
>     >> https://svn.r-project.org/R/trunk/src/library/stats/R/glm.R
>     >> and
>     >> https://svn.r-project.org/R/trunk/src/library/stats/man/glm.Rd
>     >>
>     >> > I do quite agree that vcov() should be compatible with
>     >> > coef() [and summary()] for both 'lm' and 'glm' methods,
>     >> i.e., > should get NA rows and columns there.  This would
>     >> require > eliminating these before e.g. using it in
>     >> solve(<vcov>, *) etc, > but I think it would be a good
>     >> idea that the useR must deal with > these NAs actively.
>     >>
>     >> > Shall "we" try and see the fallout in CRAN space?
>     >>
>     >> >> Maybe there's some reason for all this that escapes
>     >> me.  > (for the first one---"no error"--- I gave a
>     >> reason)
>     >>
>     >> >> Best, >> John
>     >>
>     >> >> --------------------------------------
>     >> >> John Fox, Professor Emeritus >> McMaster University >>
>     >> Hamilton, Ontario, Canada >> Web:
>     >> socserv.mcmaster.ca/jfox
>     >>
>     >>
>     >>
>     >>
>     >> >>> -----Original Message----- >>> From: R-devel
>     >> [mailto:[hidden email]] On Behalf Of >>>
>     >> Therneau, Terry M., Ph.D.  >>> Sent: Wednesday, September
>     >> 13, 2017 6:19 PM >>> To: [hidden email] >>>
>     >> Subject: [Rd] vcov and survival
>     >> >>>
>     >> >>> I have just noticed a difference in behavior between
>     >> coxph and lm/glm: >>> if one or more of the coefficients
>     >> from the fit in NA, then lm and glm >>> omit that
>     >> row/column from the variance matrix; while coxph retains
>     >> it >>> but sets the values to zero.
>     >> >>>
>     >> >>> Is this something that should be "fixed", i.e., made
>     >> to agree? I >>> suspect that doing so will break other
>     >> packages, but then NA coefs are >>> rather rare so
>     >> perhaps not.
>     >> >>>
>     >> >>> Terry Therneau
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

New vcov(*, complete=TRUE) etc -- coef(<lm>) vs coef(<aov>)

Martin Maechler
In reply to this post by Martin Maechler
>>>>> Martin Maechler <[hidden email]>
>>>>>     on Thu, 2 Nov 2017 21:59:00 +0100 writes:

>>>>> Fox, John <[hidden email]>
>>>>>     on Thu, 14 Sep 2017 13:46:44 +0000 writes:

    >> Dear Martin, I made three points which likely got lost
    >> because of the way I presented them:

    >> (1) Singularity is an unusual situation and should be
    >> made more prominent. It typically reflects a problem with
    >> the data or the specification of the model. That's not to
    >> say that it *never* makes sense to allow singular fits
    >> (as in the situations you mentions).

    >> I'd favour setting singular.ok=FALSE as the default, but
    >> in the absence of that a warning or at least a note. A
    >> compromise would be to have a singular.ok option() that
    >> would be FALSE out of the box.

    >> Any changes would have to be made very carefully so as
    >> not to create chaos.

    > I for one, am too reluctant to want to change the default
    > there.

    >> That goes for the points below as well.

    >> (2) coef() and vcov() behave inconsistently, which can be
    >> problematic because one often uses them together in code.

    > indeed; and I had agreed on that.  As of today, in R-devel
    > only they now behave compatibly.  NEWS entry

    >     • The “default” ("lm" etc) methods of vcov() have
    > gained new optional argument complete = TRUE which makes
    > the vcov() methods more consistent with the coef() methods
    > in the case of singular designs.  The former behavior is
    > now achieved by vcov(*, complete=FALSE).


    >> (3) As you noticed in your second message, lm() has a
    >> singular.ok argument and glm() doesn't.

    > and that has been amended even earlier (a bit more than a
    > month ago) in R-devel svn rev 73380 with NEWS entry

    >     • glm() and glm.fit get the same singular.ok=TRUE
    > argument that lm() has had forever.  As a consequence, in
    > glm(*, method = <your_own>), user specified methods need
    > to accept a singular.ok argument as well.

    >> I'll take a look at the code for glm() with an eye
    >> towards creating a patch, but I'm a bit reluctant to mess
    >> with the code for something as important as glm().

    > and as a matter of fact you did send me +- the R code part
    > of that change.

    > My current plan is to also add the 'complete = TRUE'
    > option to the "basic" coef() methods, such that you also
    > have consistent coef(*, complete=FALSE) and vcov(*,
    > complete=FALSE) behaviors.

and indeed I had added the above a bit later.

However, to my surprise, I have now found that we have a
coef.aov() method -- completely undocumented which behaves *differently*:

where as the default coef() method which is called for lm(..)
results gives *all* coefficients, and gives  NA  for "aliased" ones,
the aov method *drops* the  NA  coefficients  and has done so
"forever"  (I've checked R version 1.1.1 of April 14, 2000).

vcov() on the other hand has not had a special "aov" method, but
treats aov() and lm() results the same... which means that in
R-devel the vcov() method for an aov() object  uses
'complete=TRUE' and gives NA rows and columns for the aliased coefficients,
whereas  coef.aov()  removes all the NAs  and  gives only the
"non-aliased" coefficients.   Consequently, in R-devel,
vcov(<aov>) and coef(<aov>)  are *now* incoherent, whereas these
two *were* coherent before the change.

I propose to
1. continue the strategy to keep coef() back-compatible  and
2. to *document* the "surprising" behavior of coef.aov()
3. introduce a  vcov.aov()  with complete=FALSE  default
   behavior which is compatile to the coef.aov() one [where I'd
   also introduce the no-change  'complete=FALSE' argument].

Hmm... again, this has been more work and more implications than
originally optimistically assumed..

Opinions, caveats, other feedback -- are very welcome!

Martin

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: New vcov(*, complete=TRUE) etc -- coef(<lm>) vs coef(<aov>)

Fox, John
In reply to this post by Martin Maechler
Dear Martin,

I think that your plan makes sense. It's too bad that aov() behaved differently in this respect from lm(), and thus created more work, but it's not be a bad thing that the difference is now explicit and documented.

I expect that that other problems like this will surface, particularly with contributed packages (and I know that you're aware that this has already happened with the car package). That is, packages that made provision for aliased coefficients based on the old behaviour of coef() and vcov() will now have to adapt to the new, more consistent behaviour.

Best,
 John

> -----Original Message-----
> From: R-devel [mailto:[hidden email]] On Behalf Of Martin
> Maechler
> Sent: Tuesday, November 7, 2017 4:48 PM
> To: [hidden email]
> Cc: Martin Maechler <[hidden email]>
> Subject: [Rd] New vcov(*, complete=TRUE) etc -- coef(<lm>) vs coef(<aov>)
>
> >>>>> Martin Maechler <[hidden email]>
> >>>>>     on Thu, 2 Nov 2017 21:59:00 +0100 writes:
>
> >>>>> Fox, John <[hidden email]>
> >>>>>     on Thu, 14 Sep 2017 13:46:44 +0000 writes:
>
>     >> Dear Martin, I made three points which likely got lost
>     >> because of the way I presented them:
>
>     >> (1) Singularity is an unusual situation and should be
>     >> made more prominent. It typically reflects a problem with
>     >> the data or the specification of the model. That's not to
>     >> say that it *never* makes sense to allow singular fits
>     >> (as in the situations you mentions).
>
>     >> I'd favour setting singular.ok=FALSE as the default, but
>     >> in the absence of that a warning or at least a note. A
>     >> compromise would be to have a singular.ok option() that
>     >> would be FALSE out of the box.
>
>     >> Any changes would have to be made very carefully so as
>     >> not to create chaos.
>
>     > I for one, am too reluctant to want to change the default
>     > there.
>
>     >> That goes for the points below as well.
>
>     >> (2) coef() and vcov() behave inconsistently, which can be
>     >> problematic because one often uses them together in code.
>
>     > indeed; and I had agreed on that.  As of today, in R-devel
>     > only they now behave compatibly.  NEWS entry
>
>     >     • The “default” ("lm" etc) methods of vcov() have
>     > gained new optional argument complete = TRUE which makes
>     > the vcov() methods more consistent with the coef() methods
>     > in the case of singular designs.  The former behavior is
>     > now achieved by vcov(*, complete=FALSE).
>
>
>     >> (3) As you noticed in your second message, lm() has a
>     >> singular.ok argument and glm() doesn't.
>
>     > and that has been amended even earlier (a bit more than a
>     > month ago) in R-devel svn rev 73380 with NEWS entry
>
>     >     • glm() and glm.fit get the same singular.ok=TRUE
>     > argument that lm() has had forever.  As a consequence, in
>     > glm(*, method = <your_own>), user specified methods need
>     > to accept a singular.ok argument as well.
>
>     >> I'll take a look at the code for glm() with an eye
>     >> towards creating a patch, but I'm a bit reluctant to mess
>     >> with the code for something as important as glm().
>
>     > and as a matter of fact you did send me +- the R code part
>     > of that change.
>
>     > My current plan is to also add the 'complete = TRUE'
>     > option to the "basic" coef() methods, such that you also
>     > have consistent coef(*, complete=FALSE) and vcov(*,
>     > complete=FALSE) behaviors.
>
> and indeed I had added the above a bit later.
>
> However, to my surprise, I have now found that we have a
> coef.aov() method -- completely undocumented which behaves *differently*:
>
> where as the default coef() method which is called for lm(..) results gives *all*
> coefficients, and gives  NA  for "aliased" ones, the aov method *drops* the  NA
> coefficients  and has done so "forever"  (I've checked R version 1.1.1 of April 14,
> 2000).
>
> vcov() on the other hand has not had a special "aov" method, but treats aov()
> and lm() results the same... which means that in R-devel the vcov() method for
> an aov() object  uses 'complete=TRUE' and gives NA rows and columns for the
> aliased coefficients, whereas  coef.aov()  removes all the NAs  and  gives only
> the
> "non-aliased" coefficients.   Consequently, in R-devel,
> vcov(<aov>) and coef(<aov>)  are *now* incoherent, whereas these two
> *were* coherent before the change.
>
> I propose to
> 1. continue the strategy to keep coef() back-compatible  and 2. to *document*
> the "surprising" behavior of coef.aov() 3. introduce a  vcov.aov()  with
> complete=FALSE  default
>    behavior which is compatile to the coef.aov() one [where I'd
>    also introduce the no-change  'complete=FALSE' argument].
>
> Hmm... again, this has been more work and more implications than originally
> optimistically assumed..
>
> Opinions, caveats, other feedback -- are very welcome!
>
> Martin
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: New vcov(*, complete=TRUE) etc -- coef(<lm>) vs coef(<aov>)

Martin Maechler
>>>>> Fox, John <[hidden email]>
>>>>>     on Tue, 7 Nov 2017 22:09:03 +0000 writes:

    > Dear Martin, I think that your plan makes sense. It's too
    > bad that aov() behaved differently in this respect from
    > lm(), and thus created more work, but it's not be a bad
    > thing that the difference is now explicit and documented.

    > I expect that that other problems like this will surface,
    > particularly with contributed packages (and I know that
    > you're aware that this has already happened with the car
    > package). That is, packages that made provision for
    > aliased coefficients based on the old behaviour of coef()
    > and vcov() will now have to adapt to the new, more
    > consistent behaviour.

    > Best, John

Thank you John for the confirmation (and see below).

    >> -----Original Message-----
    >> >>>>> Martin Maechler <[hidden email]>
    >> >>>>>     on Thu, 2 Nov 2017 21:59:00 +0100 writes:
    >>
    >> >>>>> Fox, John <[hidden email]>
    >> >>>>>     on Thu, 14 Sep 2017 13:46:44 +0000 writes:
    >>
    >> >> Dear Martin, I made three points which likely got lost
    >> >> because of the way I presented them:
    >>
    >> >> (1) Singularity is an unusual situation and should be
    >> >> made more prominent. It typically reflects a problem with
    >> >> the data or the specification of the model. That's not to
    >> >> say that it *never* makes sense to allow singular fits
    >> >> (as in the situations you mentions).
    >>
    >> >> I'd favour setting singular.ok=FALSE as the default, but
    >> >> in the absence of that a warning or at least a note. A
    >> >> compromise would be to have a singular.ok option() that
    >> >> would be FALSE out of the box.
    >>
    >> >> Any changes would have to be made very carefully so as
    >> >> not to create chaos.
    >>
    >> > I for one, am too reluctant to want to change the default
    >> > there.
    >>
    >> >> That goes for the points below as well.
    >>
    >> >> (2) coef() and vcov() behave inconsistently, which can be
    >> >> problematic because one often uses them together in code.
    >>
    >> > indeed; and I had agreed on that.  As of today, in R-devel
    >> > only they now behave compatibly.  NEWS entry
    >>
    >> >     • The “default” ("lm" etc) methods of vcov() have
    >> > gained new optional argument complete = TRUE which makes
    >> > the vcov() methods more consistent with the coef() methods
    >> > in the case of singular designs.  The former behavior is
    >> > now achieved by vcov(*, complete=FALSE).
    >>
    >>
    >> >> (3) As you noticed in your second message, lm() has a
    >> >> singular.ok argument and glm() doesn't.
    >>
    >> > and that has been amended even earlier (a bit more than a
    >> > month ago) in R-devel svn rev 73380 with NEWS entry
    >>
    >> >     • glm() and glm.fit get the same singular.ok=TRUE
    >> > argument that lm() has had forever.  As a consequence, in
    >> > glm(*, method = <your_own>), user specified methods need
    >> > to accept a singular.ok argument as well.
    >>
    >> >> I'll take a look at the code for glm() with an eye
    >> >> towards creating a patch, but I'm a bit reluctant to mess
    >> >> with the code for something as important as glm().
    >>
    >> > and as a matter of fact you did send me +- the R code part
    >> > of that change.
    >>
    >> > My current plan is to also add the 'complete = TRUE'
    >> > option to the "basic" coef() methods, such that you also
    >> > have consistent coef(*, complete=FALSE) and vcov(*,
    >> > complete=FALSE) behaviors.
    >>
    >> and indeed I had added the above a bit later.
    >>
    >> However, to my surprise, I have now found that we have a
    >> coef.aov() method -- completely undocumented which behaves *differently*:
    >>
    >> where as the default coef() method which is called for lm(..) results gives *all*
    >> coefficients, and gives  NA  for "aliased" ones, the aov method *drops* the  NA
    >> coefficients  and has done so "forever"  (I've checked R version 1.1.1 of April 14,
    >> 2000).
    >>
    >> vcov() on the other hand has not had a special "aov" method, but treats aov()
    >> and lm() results the same... which means that in R-devel the vcov() method for
    >> an aov() object  uses 'complete=TRUE' and gives NA rows and columns for the
    >> aliased coefficients, whereas  coef.aov()  removes all the NAs  and  gives only
    >> the
    >> "non-aliased" coefficients.   Consequently, in R-devel,
    >> vcov(<aov>) and coef(<aov>)  are *now* incoherent, whereas these two
    >> *were* coherent before the change.

    >> I propose to

    >> 1. continue the strategy to keep coef() back-compatible and
    >> 2. to *document* the "surprising" behavior of coef.aov()
    >> 3. introduce a  vcov.aov()  with complete=FALSE  default
    >> behavior which is compatile to the coef.aov() one [where I'd
    >> also introduce the no-change  'complete=FALSE' argument].

I have now committed the above proposal to R-devel,
svn rev 73692.

This does revert  vcov(<aov>)  default behavior in R-devel to
the R <= 3.4.x behavior...
so  an effect in package-space should rather be beneficial.

Martin Maechler
ETH Zurich

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel