BMA, logistic regression, odds ratio, model reduction etc

BMA, logistic regression, odds ratio, model reduction etc

khosoda
Hi everybody,
I apologize in advance for the long mail.

I have data on 104 patients, consisting of 15 explanatory variables and
one binary outcome (poor/good). The outcome comprises 25 poor results
and 79 good results. I tried to analyze the data with logistic
regression. However, with 15 variables and only 25 events, the events
per variable (EPV) is far below 10 (the usual rule of thumb). Therefore,
I used the R package "BMA" to perform logistic regression with Bayesian
model averaging, hoping to avoid this problem.
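
For concreteness (this arithmetic is not in the original post; it is just
the number behind the EPV statement above):

25 / 15   # EPV with all 15 candidate variables: about 1.7, far below 10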

model 1 (full model):
x1, x2, x3 and x4 are continuous variables; the others are binary.

> x16.bic.glm <- bic.glm(outcome ~ ., data=x16.df,
glm.family="binomial", OR=20, strict=FALSE)
> summary(x16.bic.glm)
(The output below has been cut off at the right edge to save space)

  62  models were selected
 Best  5  models (cumulative posterior probability =  0.3606 ):

                         p!=0    EV         SD        model 1    model2
Intercept                100    -5.1348545  1.652424    -4.4688  -5.15  -5.1536
age                        3.3   0.0001634  0.007258      .
sex                        4.0
   .M                           -0.0243145  0.220314      .
side                      10.8
    .R                           0.0811227  0.301233      .
procedure                 46.9  -0.5356894  0.685148      .      -1.163
symptom                    3.8  -0.0099438  0.129690      .          .
stenosis                   3.4  -0.0003343  0.005254      .
x1                        3.7  -0.0061451  0.144084      .
x2                       100.0   3.1707661  0.892034     3.2221     3.11
x3                        51.3  -0.4577885  0.551466    -0.9154     .
HT                         4.6
  .positive                      0.0199299  0.161769      .          .
DM                         3.3
  .positive                     -0.0019986  0.105910      .          .
IHD                        3.5
   .positive                     0.0077626  0.122593      .          .
smoking                    9.1
       .positive                 0.0611779  0.258402      .          .
hyperlipidemia            16.0
              .positive          0.1784293  0.512058      .          .
x4                         8.2   0.0607398  0.267501      .          .


nVar               2          2          1          3          3
BIC         -376.9082  -376.5588  -376.3094  -375.8468  -374.5582
post prob       0.104      0.087      0.077      0.061      0.032

[Question 1]
Is it OK to calculate the odds ratio and its 95% confidence interval
from "EV" (the posterior mean) and "SD" (the posterior standard
deviation)?
For example, the odds ratio and 95% CI for x2 can be calculated as:
> exp(3.1707661)
[1] 23.82573     -----> odds ratio
> exp(3.1707661+1.96*0.892034)
[1] 136.8866
> exp(3.1707661-1.96*0.892034)
[1] 4.146976
------------------> 95%CI (4.1 to 136.9)
Is this O.K.?
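
For reference (not part of the original post), the same calculation can be
done for every coefficient at once. This sketch assumes the fitted object
exposes the posterior means and standard deviations as the postmean and
postsd components described in the BMA package documentation; check
names(x16.bic.glm) on your own fit:

or.tab <- data.frame(
  OR    = exp(x16.bic.glm$postmean),
  lower = exp(x16.bic.glm$postmean - 1.96 * x16.bic.glm$postsd),
  upper = exp(x16.bic.glm$postmean + 1.96 * x16.bic.glm$postsd))
round(or.tab, 2)   # one row per coefficient; the Intercept row is baseline odds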

[Question 2]
Is it permissible to delete variables with small values of "p!=0" and
"EV", such as age (3.3% and 0.0001634), in order to reduce the number of
explanatory variables, and then rebuild the model without those
variables in a new BMA session?

model 2 (reduced model):
I used the R package "pvclust" to reduce the model. The result suggested
that x1, x2 and x4 belonged to the same cluster, so I kept only x2.
Based on subject-matter knowledge, I also made a simple unweighted sum
by counting the number of clinical features. For the 9 features (sex,
side, HT2, hyperlipidemia, DM, IHD, smoking, symptom, age) the sum
ranges from 0 to 9; I defined this score as ClinicalScore. I then built
a new data set (x6.df) consisting of 5 variables (stenosis, x2, x3,
procedure, and ClinicalScore) and the binary outcome (poor/good).
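
As an illustration only (this is not code from the original post), the
unweighted score could be built along these lines, assuming the nine
features are already coded as 0/1 numeric indicators in x16.df (the
dichotomization of age is not shown here):

risk.factors <- c("sex", "side", "HT2", "hyperlipidemia", "DM",
                  "IHD", "smoking", "symptom", "age")
ClinicalScore <- rowSums(x16.df[, risk.factors])        # each column assumed 0/1; sum ranges 0 to 9
x6.df <- data.frame(x16.df[, c("stenosis", "x2", "x3", "procedure")],
                    ClinicalScore, outcome = x16.df$outcome)

Then, for the alternative BMA session: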

> BMAx6.glm <- bic.glm(postopDWI_HI ~ ., data=x6.df,
glm.family="binomial", OR=20, strict=FALSE)
> summary(BMAx6.glm)
(The output below has been cut off at the right edge to save space)
Call:
bic.glm.formula(f = postopDWI_HI ~ ., data = x6.df, glm.family =
"binomial",     strict = FALSE, OR = 20)


  13  models were selected
 Best  5  models (cumulative posterior probability =  0.7626 ):

                p!=0    EV         SD       model 1    model 2
Intercept       100    -5.6918362  1.81220    -4.4688    -6.3166
stenosis          8.1  -0.0008417  0.00815      .          .
x2              100.0   3.0606165  0.87765     3.2221     3.1154
x3               46.5  -0.3998864  0.52688    -0.9154      .
procedure       49.3   0.5747013  0.70164      .         1.1631
ClinicalScore   27.1   0.0966633  0.19645      .          .


nVar               2          2          1          3          3
BIC         -376.9082  -376.5588  -376.3094  -375.8468  -375.5025
post prob       0.208      0.175      0.154      0.122      0.103

[Question 3]
Am I doing this correctly?
That is, is this kind of model reduction permissible when using BMA?

[Question 4]
I still have 5 variables, which violates the rule of thumb "EPV > 10".
Is it permissible to delete the "stenosis" variable because of its small
"EV"? Or is the model acceptable as is because this is BMA?

Sorry for the long post.

I appreciate your help very much in advance.

--
KH


Re: BMA, logistic regression, odds ratio, model reduction etc

Frank Harrell
Deleting variables is a bad idea unless you make that a formal part of the BMA so that the attempt to delete variables is penalized.  Instead of BMA I recommend simple penalized maximum likelihood estimation (see the lrm function in the rms package) or pre-modeling data reduction that is blinded to the outcome variable.
Frank

Frank Harrell
Department of Biostatistics, Vanderbilt University

Re: BMA, logistic regression, odds ratio, model reduction etc

khosoda
Dear Prof. Harrell,

Thank you very much for your quick advice.
I will try the rms package.

Regarding model reduction, is my model 2 method (clustering and recoding
that are blinded to the outcome) permissible?

Sincerely,

--
KH


--
*************************************************
 Department of Neurosurgery,
 Kobe University Graduate School of Medicine
 Kohkichi Hosoda

 7-5-1 Kusunoki-cho, Chuo-ku, Kobe 650-0017, Japan
     Phone: 078-382-5966
     Fax  : 078-382-5979
     E-mail address
         Office: [hidden email]
         Home  : [hidden email]


Re: BMA, logistic regression, odds ratio, model reduction etc

Frank Harrell
I think it's OK.  You can also use the Hmisc package's varclus function.
Frank
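
For completeness, a minimal sketch of what the varclus approach might look
like on the poster's data (the variable names are taken from the bic.glm
output above; this is not code from the thread):

library(Hmisc)
# Cluster the candidate predictors only, so the reduction stays blinded to the outcome
vc <- varclus(~ age + sex + side + procedure + symptom + stenosis +
                x1 + x2 + x3 + x4 + HT + DM + IHD + smoking + hyperlipidemia,
              data = x16.df)
plot(vc)   # inspect the dendrogram and keep one representative per cluster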

Frank Harrell
Department of Biostatistics, Vanderbilt University

Re: BMA, logistic regression, odds ratio, model reduction etc

khosoda
Thank you for your comment.
I forgot to mention that varclus and pvclust showed similar results for
my data.

BTW, I did not realize rms is a replacement for the Design package.
I appreciate your suggestion.
--
KH



Questions about lrm, validate, pentrace (Re: BMA, logistic regression, odds ratio, model reduction etc)

khosoda
Following the advice, I tried the rms package.
Just to recap: I have data on 104 patients (x6.df), consisting of 5
explanatory variables and one binary outcome (poor/good) (the previous
model 2 strategy). The outcome comprises 25 poor results and 79 good
results, so my events per variable (EPV) is only 5, still well below the
rule of thumb of 10.

My questions are about validate and pentrace in the rms package.
I show some code and results below.
I appreciate anybody's help in advance.

 > x6.lrm <- lrm(outcome ~ stenosis+x1+x2+procedure+ClinicalScore,
data=x6.df, x=T, y=T)

 > x6.lrm
...
Obs            104    LR chi2      29.24    R2      0.367    C       0.816
 negative       79    d.f.             5    g       1.633    Dxy     0.632
 positive       25    Pr(> chi2) <0.0001    gr      5.118    gamma   0.632
max |deriv|  1e-08                          gp      0.237    tau-a   0.233
                                            Brier   0.127

                Coef    S.E.   Wald Z Pr(>|Z|)
Intercept      -5.5328 2.6287 -2.10  0.0353
stenosis       -0.0150 0.0284 -0.53  0.5979
x1              3.0425 0.9100  3.34  0.0008
x2             -0.7534 0.4519 -1.67  0.0955
procedure       1.2085 0.5717  2.11  0.0345
ClinicalScore   0.3762 0.2287  1.65  0.0999

It seems not too bad. Next, validation by bootstrap ...

 > validate(x6.lrm, B=200, bw=F)
           index.orig training    test optimism index.corrected   n
Dxy           0.6324   0.6960  0.5870   0.1091          0.5233 200
R2            0.3668   0.4370  0.3154   0.1216          0.2453 200
Intercept     0.0000   0.0000 -0.2007   0.2007         -0.2007 200
Slope         1.0000   1.0000  0.7565   0.2435          0.7565 200
Emax          0.0000   0.0000  0.0999   0.0999          0.0999 200
D             0.2716   0.3368  0.2275   0.1093          0.1623 200
U            -0.0192  -0.0192  0.0369  -0.0561          0.0369 200
Q             0.2908   0.3560  0.1906   0.1654          0.1254 200
B             0.1272   0.1155  0.1384  -0.0229          0.1501 200
g             1.6328   2.0740  1.4647   0.6093          1.0235 200
gp            0.2367   0.2529  0.2189   0.0341          0.2026 200

The apparent Dxy is 0.63 and the bias-corrected Dxy is 0.52. The maximum
absolute calibration error (Emax) is estimated to be 0.10, and the
optimism in the slope and intercept is also substantial. In all, there
is evidence that I am somewhat overfitting the data, right?
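
A related quick check (not in the original post): the heuristic shrinkage
estimate of van Houwelingen and Le Cessie, computed from the LR chi-square
and degrees of freedom printed by lrm above, points the same way as the
bootstrap slope:

# gamma.hat = (LR chi2 - d.f.) / LR chi2; values well below 1 suggest overfitting
(29.24 - 5) / 29.24   # about 0.83, broadly consistent with the bootstrap slope of ~0.76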

Furthermore, using step-down variable selection ...

 > validate(x6.lrm, B=200, bw=T)

                Backwards Step-down - Original Model

  Deleted        Chi-Sq d.f. P      Residual d.f. P      AIC
  stenosis       0.28   1    0.5979 0.28     1    0.5979 -1.72
  ClinicalScore  2.60   1    0.1068 2.88     2    0.2370 -1.12
  x2             2.86   1    0.0910 5.74     3    0.1252 -0.26

Approximate Estimates after Deleting Factors

              Coef   S.E. Wald Z         P
Intercept  -5.865 1.4136 -4.149 3.336e-05
x1          2.915 0.8685  3.357 7.889e-04
procedure   1.072 0.5590  1.918 5.508e-02

Factors in Final Model

[1] x1         procedure
           index.orig training    test optimism index.corrected   n
Dxy           0.5661   0.6755  0.5559   0.1196          0.4464 200
R2            0.2876   0.4085  0.2784   0.1301          0.1575 200
Intercept     0.0000   0.0000 -0.2459   0.2459         -0.2459 200
Slope         1.0000   1.0000  0.7300   0.2700          0.7300 200
Emax          0.0000   0.0000  0.1173   0.1173          0.1173 200
D             0.2038   0.3130  0.1970   0.1160          0.0877 200
U            -0.0192  -0.0192  0.0382  -0.0574          0.0382 200
Q             0.2230   0.3323  0.1589   0.1734          0.0496 200
B             0.1441   0.1192  0.1452  -0.0261          0.1702 200
g             1.2628   1.9524  1.3222   0.6302          0.6326 199
gp            0.2041   0.2430  0.2043   0.0387          0.1654 199

If I select only two variables (x1 and procedure), bias-corrected Dxy
goes down to 0.45.

[Question 1]
I have an EPV problem. Even so, should I keep the full (5-variable)
model, or can I use the 2-variable model (x1 and procedure) that
validate() with step-down selection produces?

[Question 2]
If I use the 2-variable model, should I refit it, i.e.
x2.lrm <- lrm(postopDWI_HI ~ T1+procedure2, data=x6.df, x=T, y=T),
or keep the values shown above by the validate function?

Next, shrinkage ...

 > pentrace(x6.lrm, seq(0, 5.0, by=0.05))
Best penalty:
penalty         df
    3.05   4.015378

The best penalty is 3.05, so I update the fit with this penalty to
obtain the corresponding penalized model:

 > x6.lrm.pen <- update(x6.lrm, penalty=3.05, x=T, y=T)
 > x6.lrm.pen
.....
Penalty factors

  simple nonlinear interaction nonlinear.interaction
    3.05      3.05        3.05                  3.05
Final penalty on -2 log L
      [,1]
[1,]  3.8

Obs            104    LR chi2      28.18    R2      0.313    C       0.818
 negative       79    d.f.         4.015    g       1.264    Dxy     0.635
 positive       25    Pr(> chi2) <0.0001    gr      3.538    gamma   0.637
max |deriv|  3e-05                          gp      0.201    tau-a   0.234
                                            Brier   0.129

                Coef    S.E.   Wald Z Pr(>|Z|) Penalty Scale
Intercept      -4.7246 2.2429 -2.11  0.0352    0.0000
stenosis       -0.0105 0.0240 -0.44  0.6621   17.8021
x1              2.3605 0.7254  3.25  0.0011    0.6054
x2             -0.5385 0.3653 -1.47  0.1404    1.2851
procedure       0.9247 0.4844  1.91  0.0563    0.8576
ClinicalScore   0.3046 0.1874  1.63  0.1041    2.4779

Arranging the coefficients of the two models side by side, together with
the absolute difference between them:

 > cbind(coef(x6.lrm), coef(x6.lrm.pen), abs(coef(x6.lrm)-coef(x6.lrm.pen)))
                       [,1]        [,2]        [,3]
Intercept      -5.53281808 -4.72464766 0.808170417
stenosis       -0.01496757 -0.01050797 0.004459599
x1              3.04248257  2.36051833 0.681964238
x2             -0.75335619 -0.53854750 0.214808685
procedure       1.20847252  0.92474708 0.283725441
ClinicalScore   0.37623189  0.30457557 0.071656322
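
Tying this back to Question 1 of the first post (a sketch, not from the
original post): per-unit odds ratios implied by the penalized fit can be
read off directly. For the continuous predictors an inter-quartile-range
contrast, e.g. via summary() with a datadist set, may be easier to
interpret.

round(exp(coef(x6.lrm.pen)), 2)   # the Intercept row is baseline odds, not an odds ratio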

[Question 3]
Is this penalized model the one I should present to my colleagues?
I still have the EPV problem; or is the EPV problem acceptable once I
use penalization?

I am still wondering what I can do about the EPV problem.
Collecting new data would take a long time and a huge amount of work...




Re: Questions about lrm, validate, pentrace (Re: BMA, logistic regression, odds ratio, model reduction etc)

Frank Harrell
You've done a lot of good work on this.  Yes, I would say you have moderate overfitting with the first model.  The only thing that saved you from severe overfitting is that there seems to be a signal present [I assume this model was truly pre-specified and was not developed at all by looking at patterns of responses Y].

The use of backwards step-down demonstrated much worse overfitting.  This is in line with what we know about the damage done by stepwise selection methods that do not incorporate shrinkage.  I would throw away the stepwise regression model.  You'll find that the model selected is entirely arbitrary, and you can't use the "selected" variables in any re-fit of the model, i.e., you can't use lrm pretending that the two remaining variables were pre-specified.  Stepwise regression methods only seem to help; when assessed properly we see that this is an illusion.

You are using penalization properly, but you did not print the full table of penalties vs. effective AIC, so we cannot be confident that you penalized enough.  I tend to run pentrace over a very wide range of possible penalties to make sure I've found the global optimum.

Penalization somewhat solves the EPV problem but there is no substitute for getting more data.

You can run validate specifying your final penalty as an argument.

Frank
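
A minimal sketch of the wider pentrace search described above, followed by
bootstrap validation of the penalized fit already built with update()
earlier in the thread (the penalty grid here is only an example):

# Search a much wider penalty grid and inspect the whole table, not just the optimum
pt <- pentrace(x6.lrm, c(0, 0.5, 1, 2, 4, 8, 16, 32, 64, 128))
pt   # see ?pentrace for displaying penalty vs. effective d.f. and AIC (e.g. plot(pt))
# Refit with the chosen penalty (3.05 was chosen above) and validate that fit
x6.lrm.pen <- update(x6.lrm, penalty = 3.05, x = TRUE, y = TRUE)
validate(x6.lrm.pen, B = 200)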


細田弘吉 wrote
According to the advice, I tried rms package.
Just to make sure, I have data of 104 patients (x6.df), which consists
of 5 explanatory variables and one binary outcome (poor/good) (previous
model 2 strategy). The outcome consists of 25 poor results and 79 good
results. Therefore, My events per variable (EPV) is only 5 (much less
than the rule of thumb of 10).

My questions are about validate and pentrace in rms package.
I present some codes and results.
I appreciate anybody's help in advance.

 > x6.lrm <- lrm(outcome ~ stenosis+x1+x2+procedure+ClinicalScore,
data=x6.df, x=T, y=T)

 > x6.lrm
...
Obs  104    LR chi2      29.24    R2       0.367    C       0.816
  negative 79    d.f.         5    g        1.633    Dxy     0.632
  positive 25    Pr(> chi2) <0.0001   gr    5.118    gamma   0.632
max |deriv| 1e-08                    gp    0.237    tau-a   0.233
                                      Brier   0.127

                Coef    S.E.   Wald Z Pr(>|Z|)
Intercept      -5.5328 2.6287 -2.10  0.0353
stenosis       -0.0150 0.0284 -0.53  0.5979
x1              3.0425 0.9100  3.34  0.0008
x2             -0.7534 0.4519 -1.67  0.0955
procedure       1.2085 0.5717  2.11  0.0345
ClinicalScore   0.3762 0.2287  1.65  0.0999

It seems not too bad. Next, validation by bootstrap ...

 > validate(x6.lrm, B=200, bw=F)
           index.orig training    test optimism index.corrected   n
Dxy           0.6324   0.6960  0.5870   0.1091          0.5233 200
R2            0.3668   0.4370  0.3154   0.1216          0.2453 200
Intercept     0.0000   0.0000 -0.2007   0.2007         -0.2007 200
Slope         1.0000   1.0000  0.7565   0.2435          0.7565 200
Emax          0.0000   0.0000  0.0999   0.0999          0.0999 200
D             0.2716   0.3368  0.2275   0.1093          0.1623 200
U            -0.0192  -0.0192  0.0369  -0.0561          0.0369 200
Q             0.2908   0.3560  0.1906   0.1654          0.1254 200
B             0.1272   0.1155  0.1384  -0.0229          0.1501 200
g             1.6328   2.0740  1.4647   0.6093          1.0235 200
gp            0.2367   0.2529  0.2189   0.0341          0.2026 200

The apparent Dxy is 0.63, and bias-corrected Dxy is 0.52. The maximum
absolute error is estimated to be 0.099. The changes in slope and
intercept are also more substantial. In all, there is evidence that I am
somewhat overfitting the data, right?.

Furthermore, using step-down variable selection ...

 > validate(x6.lrm, B=200, bw=T)

                Backwards Step-down - Original Model

  Deleted        Chi-Sq d.f. P      Residual d.f. P      AIC
  stenosis       0.28   1    0.5979 0.28     1    0.5979 -1.72
  ClinicalScore  2.60   1    0.1068 2.88     2    0.2370 -1.12
  x2             2.86   1    0.0910 5.74     3    0.1252 -0.26

Approximate Estimates after Deleting Factors

              Coef   S.E. Wald Z         P
Intercept  -5.865 1.4136 -4.149 3.336e-05
x1          2.915 0.8685  3.357 7.889e-04
procedure   1.072 0.5590  1.918 5.508e-02

Factors in Final Model

[1] x1         procedure
           index.orig training    test optimism index.corrected   n
Dxy           0.5661   0.6755  0.5559   0.1196          0.4464 200
R2            0.2876   0.4085  0.2784   0.1301          0.1575 200
Intercept     0.0000   0.0000 -0.2459   0.2459         -0.2459 200
Slope         1.0000   1.0000  0.7300   0.2700          0.7300 200
Emax          0.0000   0.0000  0.1173   0.1173          0.1173 200
D             0.2038   0.3130  0.1970   0.1160          0.0877 200
U            -0.0192  -0.0192  0.0382  -0.0574          0.0382 200
Q             0.2230   0.3323  0.1589   0.1734          0.0496 200
B             0.1441   0.1192  0.1452  -0.0261          0.1702 200
g             1.2628   1.9524  1.3222   0.6302          0.6326 199
gp            0.2041   0.2430  0.2043   0.0387          0.1654 199

If I select only two variables (x1 and procedure), bias-corrected Dxy
goes down to 0.45.

[Question 1]
I have EPV problem. Even so, should I keep the full model (5-variable
model)? or can I use the 2-variable (x1 and procedure) model which the
validate() with step-down provides?

[Question 2]
If I use 2-variable model, should I do
x2.lrm <- lrm(postopDWI_HI ~ T1+procedure2, data=x6.df, x=T, y=T)?
or keep the value showed above by validate function?

Next, shrinkage ...

 > pentrace(x6.lrm, seq(0, 5.0, by=0.05))
Best penalty:
penalty         df
    3.05   4.015378

The best penalty is 3.05. So, I update it with this penalty to obtain
the corresponding penalized model:

 > x6.lrm.pen <- update(x6.lrm, penalty=3.05, x=T, y=T)
 > x6.lrm.pen
.....
Penalty factors

  simple nonlinear interaction nonlinear.interaction
    3.05      3.05        3.05                  3.05
Final penalty on -2 log L
      [,1]
[1,]  3.8

Obs     104    LR chi2      28.18    R2       0.313    C       0.818
  negative    79    d.f.     4.015    g        1.264    Dxy     0.635
  positive    25   Pr(> chi2) <0.0001 gr       3.538    gamma   0.637
max |deriv| 3e-05                    gp       0.201    tau-a   0.234
                                      Brier    0.129

                Coef    S.E.   Wald Z Pr(>|Z|) Penalty Scale
Intercept      -4.7246 2.2429 -2.11  0.0352    0.0000
stenosis       -0.0105 0.0240 -0.44  0.6621   17.8021
x1              2.3605 0.7254  3.25  0.0011    0.6054
x2             -0.5385 0.3653 -1.47  0.1404    1.2851
procedure       0.9247 0.4844  1.91  0.0563    0.8576
ClinicalScore   0.3046 0.1874  1.63  0.1041    2.4779

Arranging the coefficients of the two models side by side, and listing
the absolute differences between them:

 > cbind(coef(x6.lrm), coef(x6.lrm.pen), abs(coef(x6.lrm)-coef(x6.lrm.pen)))
                       [,1]        [,2]        [,3]
Intercept      -5.53281808 -4.72464766 0.808170417
stenosis       -0.01496757 -0.01050797 0.004459599
x1              3.04248257  2.36051833 0.681964238
x2             -0.75335619 -0.53854750 0.214808685
procedure       1.20847252  0.92474708 0.283725441
ClinicalScore   0.37623189  0.30457557 0.071656322
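
Another way to see the effect of the penalty (a sketch, not part of the original
analysis) is the ratio of penalized to unpenalized coefficients, i.e. how much each
term is shrunk toward zero:

 > round(coef(x6.lrm.pen)[-1]/coef(x6.lrm)[-1], 2)   # drop the intercept
 # e.g. x1: 2.3605/3.0425 is about 0.78; ClinicalScore: 0.3046/0.3762 is about 0.81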

[Question 3]
Is this penalized model the one I should present to my colleagues?
I still have the EPV problem. Or is the EPV problem acceptable if I use penalization?

I am still wondering what I can do about the EPV problem.
Collecting new data would take a long time and a lot of work...
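
One option (a sketch; a later message in this thread does the same thing for the
factor-coded fit) is to bootstrap-validate the penalized fit itself, so that the
optimism estimates reflect the chosen penalty:

 > validate(x6.lrm.pen, B=200)   # x6.lrm.pen was fitted above with x=T, y=T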




Re: Questions about lrm, validate, pentrace

khosoda
Thank you for your quick reply, Prof. Harrell.
According to your advice, I ran pentrace using a very wide range.

 > pentrace.x6factor <- pentrace(x6factor.lrm, seq(0, 100, by=0.5))
 > plot(pentrace.x6factor)

I attached this figure. Then,

 > pentrace.x6factor <- pentrace(x6factor.lrm, seq(0, 10, by=0.05))

This finer search suggests that the best penalty is 2.55, which seems reasonable.

 > x6factor.lrm.pen <- update(x6factor.lrm, penalty=2.55)
 > cbind(coef(x6factor.lrm), coef(x6factor.lrm.pen),
abs(coef(x6factor.lrm)-coef(x6factor.lrm.pen)))
                      [,1]        [,2]        [,3]
Intercept     -4.32434556 -3.86816460 0.456180958
stenosis      -0.01496757 -0.01091755 0.004050025
T1             3.04248257  2.42443034 0.618052225
T2            -0.75335619 -0.57194342 0.181412767
procedure     -1.20847252 -0.82589263 0.382579892
ClinicalScore  0.37623189  0.30524628 0.070985611

 > validate(x6factor.lrm, bw=F, B=200)
           index.orig training    test optimism index.corrected   n
Dxy           0.6324   0.6849  0.5955   0.0894          0.5430 200
R2            0.3668   0.4220  0.3231   0.0989          0.2679 200
Intercept     0.0000   0.0000 -0.1924   0.1924         -0.1924 200
Slope         1.0000   1.0000  0.7796   0.2204          0.7796 200
Emax          0.0000   0.0000  0.0915   0.0915          0.0915 200
D             0.2716   0.3229  0.2339   0.0890          0.1826 200
U            -0.0192  -0.0192  0.0243  -0.0436          0.0243 200
Q             0.2908   0.3422  0.2096   0.1325          0.1582 200
B             0.1272   0.1171  0.1357  -0.0186          0.1457 200
g             1.6328   1.9879  1.4940   0.4939          1.1389 200
gp            0.2367   0.2502  0.2216   0.0286          0.2080 200


 > validate(x6factor.lrm.pen, bw=F, B=200)
           index.orig training    test optimism index.corrected   n
Dxy           0.6375   0.6857  0.6024   0.0833          0.5542 200
R2            0.3145   0.3488  0.3267   0.0221          0.2924 200
Intercept     0.0000   0.0000  0.0882  -0.0882          0.0882 200
Slope         1.0000   1.0000  1.0923  -0.0923          1.0923 200
Emax          0.0000   0.0000  0.0340   0.0340          0.0340 200
D             0.2612   0.2571  0.2370   0.0201          0.2411 200
U            -0.0192  -0.0192 -0.0047  -0.0145         -0.0047 200
Q             0.2805   0.2763  0.2417   0.0346          0.2458 200
B             0.1292   0.1224  0.1355  -0.0132          0.1423 200
g             1.2704   1.3917  1.5019  -0.1102          1.3805 200
gp            0.2020   0.2091  0.2229  -0.0138          0.2158 200

In the penalized model (x6factor.lrm.pen), the apparent Dxy is 0.64 and
the bias-corrected Dxy is 0.55. The maximum absolute error is estimated to
be 0.034, smaller than in the non-penalized model (0.0915 for x6factor.lrm).
The changes in slope and intercept are substantially reduced in the
penalized model. I think the overfitting is reduced at least to some extent.
Should I select this as the final model?
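
A further check that could be added here (a sketch only; calibrate() is in rms but
was not run in this thread) is the bootstrap overfitting-corrected calibration curve
for the penalized fit:

 > cal <- calibrate(x6factor.lrm.pen, B=200)   # uses the x, y stored in the fit
 > plot(cal)                                   # predicted vs. bias-corrected probability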

I have one more question. The "procedure" variable was coded as a 0/1
value in my previous mail. For graphical reasons, I recoded it as a
treat1/treat2 factor. The best penalty then changed from 3.05 to 2.55.
I guess the change from numeric to factor coding caused this reduction
in the penalty. Which setup should I select?

I appreciate your help in advance.

--
KH

(11/04/26 0:21), Frank Harrell wrote:

> You've done a lot of good work on this.  Yes I would say you have moderate
> overfitting with the first model.  The only thing that saved you from having
> severe overfitting is that there seems to be a signal present [I assume
> this model is truly pre-specified and was not developed at all by looking at
> patterns of responses Y.]
>
> The use of backwards stepdown demonstrated much worse overfitting.  This is
> in line with what we know about the damage of stepwise selection methods
> that do not incorporate shrinkage.  I would throw away the stepwise
> regression model.  You'll find that the model selected is entirely
> arbitrary.  And you can't use the "selected" variables in any re-fit of the
> model, i.e., you can't use lrm pretending that the two remaining variables
> were pre-specified.  Stepwise regression methods only seem to help.  When
> assessed properly we see that is an illusion.
>
> You are using penalizing properly but you did not print the full table of
> penalties vs. effective AIC.  We don't have faith that you penalized enough.
> I tend to run pentrace using a very wide range of possible penalties to make
> sure I've found the global optimum.
>
> Penalization somewhat solves the EPV problem but there is no substitute for
> getting more data.
>
> You can run validate specifying your final penalty as an argument.
>
> Frank

--
*************************************************
 Department of Neurosurgery, Kobe University Graduate School of Medicine
 細田 弘吉
 
 7-5-1 Kusunoki-cho, Chuo-ku, Kobe 650-0017, Japan
     Phone: 078-382-5966
     Fax  : 078-382-5979
     E-mail address
         Office: [hidden email]
        Home  : [hidden email]
*************************************************


Attachment: x6factor.pentrace.pdf (109K)

Re: Questions about lrm, validate, pentrace

Frank Harrell
Yes I would select that as the final model.  The difference you saw is caused by different treatment of penalization of factor variables, related to the use of the sum squared differences between the estimate at one category from the average over all categories.  I think that as long as you code it one way consistently and pick the penalty using that coding you are OK.  But if the coefficients of the non-factor variables depend on how the binary predictor is coded, there is a bit more concern.

Frank
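
As a rough numerical illustration of that difference (a sketch only, not the rms
internals, and ignoring the scaling rms applies; the coefficient b is hypothetical):

 > b <- 1.2                                   # hypothetical log odds for the 0/1 coding
 > pen.numeric <- b^2                         # quadratic penalty on the single coefficient
 > beta.cat <- c(0, b)                        # implied category estimates under factor coding
 > pen.factor <- sum((beta.cat - mean(beta.cat))^2)   # squared differences from the category mean; equals b^2/2
 > c(numeric = pen.numeric, factor = pen.factor)

The same coefficient therefore contributes a different amount to the penalty under
the two codings, which is consistent with the optimal penalty constant changing when
procedure was recoded as a factor.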

Frank Harrell
Department of Biostatistics, Vanderbilt University

Re: Questions about lrm, validate, pentrace

khosoda
(11/04/29 22:09), Frank Harrell wrote:
> Yes I would select that as the final model.

Thank you for your comment. I can now be confident about my model.

> The difference you saw is caused
> by different treatment of penalization of factor variables, related to the
> use of the sum squared differences between the estimate at one category from
> the average over all categories.  I think that as long as you code it one
> way consistently and pick the penalty using that coding you are OK.  But if
> the coefficients of the non-factor variables depend on how the binary
> predictor is coded, there is a bit more concern.

Many previous studies have shown that a poor outcome is more frequent
with treat2 than with treat1, so I coded treat1 as 0 and treat2 as 1 in
my first mail. I then went back to the original treat1/treat2 coding in
the more recent mail. According to your answer, I guess I am OK. :-)

Prof. Harrell, your book (Regression Modeling Strategies) and your many
kind comments have helped me a lot. Thank you very much again.

--
KH

>
> Frank
>
>
> 細田弘吉 wrote:
>>
>> Thank you for you quick reply, Prof. Harrell.
>> According to your advice, I ran pentrace using a very wide range.
>>
>>   >  pentrace.x6factor<- pentrace(x6factor.lrm, seq(0, 100, by=0.5))
>>   >  plot(pentrace.x6factor)
>>
>> I attached this figure. Then,
>>
>>   >  pentrace.x6factor<- pentrace(x6factor.lrm, seq(0, 10, by=0.05))
>>
>> It seems reasonable that the best penalty is 2.55.
>>
>>   >  x6factor.lrm.pen<- update(x6factor.lrm, penalty=2.55)
>>   >  cbind(coef(x6factor.lrm), coef(x6factor.lrm.pen),
>> abs(coef(x6factor.lrm)-coef(x6factor.lrm.pen)))
>>                        [,1]        [,2]        [,3]
>> Intercept     -4.32434556 -3.86816460 0.456180958
>> stenosis      -0.01496757 -0.01091755 0.004050025
>> T1             3.04248257  2.42443034 0.618052225
>> T2            -0.75335619 -0.57194342 0.181412767
>> procedure     -1.20847252 -0.82589263 0.382579892
>> ClinicalScore  0.37623189  0.30524628 0.070985611
>>
>>   >  validate(x6factor.lrm, bw=F, B=200)
>>             index.orig training    test optimism index.corrected   n
>> Dxy           0.6324   0.6849  0.5955   0.0894          0.5430 200
>> R2            0.3668   0.4220  0.3231   0.0989          0.2679 200
>> Intercept     0.0000   0.0000 -0.1924   0.1924         -0.1924 200
>> Slope         1.0000   1.0000  0.7796   0.2204          0.7796 200
>> Emax          0.0000   0.0000  0.0915   0.0915          0.0915 200
>> D             0.2716   0.3229  0.2339   0.0890          0.1826 200
>> U            -0.0192  -0.0192  0.0243  -0.0436          0.0243 200
>> Q             0.2908   0.3422  0.2096   0.1325          0.1582 200
>> B             0.1272   0.1171  0.1357  -0.0186          0.1457 200
>> g             1.6328   1.9879  1.4940   0.4939          1.1389 200
>> gp            0.2367   0.2502  0.2216   0.0286          0.2080 200
>>
>>
>>   >  validate(x6factor.lrm.pen, bw=F, B=200)
>>             index.orig training    test optimism index.corrected   n
>> Dxy           0.6375   0.6857  0.6024   0.0833          0.5542 200
>> R2            0.3145   0.3488  0.3267   0.0221          0.2924 200
>> Intercept     0.0000   0.0000  0.0882  -0.0882          0.0882 200
>> Slope         1.0000   1.0000  1.0923  -0.0923          1.0923 200
>> Emax          0.0000   0.0000  0.0340   0.0340          0.0340 200
>> D             0.2612   0.2571  0.2370   0.0201          0.2411 200
>> U            -0.0192  -0.0192 -0.0047  -0.0145         -0.0047 200
>> Q             0.2805   0.2763  0.2417   0.0346          0.2458 200
>> B             0.1292   0.1224  0.1355  -0.0132          0.1423 200
>> g             1.2704   1.3917  1.5019  -0.1102          1.3805 200
>> gp            0.2020   0.2091  0.2229  -0.0138          0.2158 200
>>
>> In the penalized model (x6factor.lrm.pen), the apparent Dxy is 0.64, and
>> bias-corrected Dxy is 0.55. The maximum absolute error is estimated to
>> be 0.034, smaller than non-penalized model (0.0915 in x6factor.lrm) The
>> changes in slope and intercept are substantially reduced in penalized
>> model. I think overfitting is improved at least to some extent. Should I
>> select this as a final model?
>>
>> I have one more question. The "procedure" variable was defined as 0/1
>> value in the previous mail. For some graphical reason, I redefined it as
>> treat1/treat2 value. Then, the best penalty value was changed from 3.05
>> to 2.55. I guess change from numeric to factorial caused this reduction
>> in penalty. Which set up should I select?
>>
>> I appreciate your help in advance.
>>
>> --
>> KH
>>
>> (11/04/26 0:21), Frank Harrell wrote:
>>> You've done a lot of good work on this.  Yes I would say you have
>>> moderate
>>> overfitting with the first model.  The only thing that saved you from
>>> having
>>> severe overfitting is that there seems to be a signal present [I am
>>> assume
>>> this model is truly pre-specified and was not developed at all by looking
>>> at
>>> patterns of responses Y.]
>>>
>>> The use of backwards stepdown demonstrated much worse overfitting.  This
>>> is
>>> in line with what we know about the damage of stepwise selection methods
>>> that do not incorporate shrinkage.  I would throw away the stepwise
>>> regression model.  You'll find that the model selected is entirely
>>> arbitrary.  And you can't use the "selected" variables in any re-fit of
>>> the
>>> model, i.e., you can't use lrm pretending that the two remaining
>>> variables
>>> were pre-specified.  Stepwise regression methods only seem to help.  When
>>> assessed properly we see that is an illusion.
>>>
>>> You are using penalizing properly but you did not print the full table of
>>> penalties vs. effective AIC.  We don't have faith that you penalized
>>> enough.
>>> I tend to run pentrace using a very wide range of possible penalties to
>>> make
>>> sure I've found the global optimum.
>>>
>>> Penalization somewhat solves the EPV problem but there is no substitute
>>> for
>>> getting more data.
>>>
>>> You can run validate specifying your final penalty as an argument.
>>>
>>> Frank
>>>
>>>
>>>
>>> 細田弘吉 wrote:
>>>>
>>>> According to the advice, I tried rms package.
>>>> Just to make sure, I have data of 104 patients (x6.df), which consists
>>>> of 5 explanatory variables and one binary outcome (poor/good) (previous
>>>> model 2 strategy). The outcome consists of 25 poor results and 79 good
>>>> results. Therefore, My events per variable (EPV) is only 5 (much less
>>>> than the rule of thumb of 10).
>>>>
>>>> My questions are about validate and pentrace in rms package.
>>>> I present some codes and results.
>>>> I appreciate anybody's help in advance.
>>>>
>>>>    >   x6.lrm<- lrm(outcome ~ stenosis+x1+x2+procedure+ClinicalScore,
>>>> data=x6.df, x=T, y=T)
>>>>
>>>>    >   x6.lrm
>>>> ...
>>>> Obs  104    LR chi2      29.24    R2       0.367    C       0.816
>>>>     negative 79    d.f.         5    g        1.633    Dxy     0.632
>>>>     positive 25    Pr(>   chi2)<0.0001   gr    5.118    gamma   0.632
>>>> max |deriv| 1e-08                    gp    0.237    tau-a   0.233
>>>>                                         Brier   0.127
>>>>
>>>>                   Coef    S.E.   Wald Z Pr(>|Z|)
>>>> Intercept      -5.5328 2.6287 -2.10  0.0353
>>>> stenosis       -0.0150 0.0284 -0.53  0.5979
>>>> x1              3.0425 0.9100  3.34  0.0008
>>>> x2             -0.7534 0.4519 -1.67  0.0955
>>>> procedure       1.2085 0.5717  2.11  0.0345
>>>> ClinicalScore   0.3762 0.2287  1.65  0.0999
>>>>
>>>> It seems not too bad. Next, validation by bootstrap ...
>>>>
>>>>    >   validate(x6.lrm, B=200, bw=F)
>>>>              index.orig training    test optimism index.corrected   n
>>>> Dxy           0.6324   0.6960  0.5870   0.1091          0.5233 200
>>>> R2            0.3668   0.4370  0.3154   0.1216          0.2453 200
>>>> Intercept     0.0000   0.0000 -0.2007   0.2007         -0.2007 200
>>>> Slope         1.0000   1.0000  0.7565   0.2435          0.7565 200
>>>> Emax          0.0000   0.0000  0.0999   0.0999          0.0999 200
>>>> D             0.2716   0.3368  0.2275   0.1093          0.1623 200
>>>> U            -0.0192  -0.0192  0.0369  -0.0561          0.0369 200
>>>> Q             0.2908   0.3560  0.1906   0.1654          0.1254 200
>>>> B             0.1272   0.1155  0.1384  -0.0229          0.1501 200
>>>> g             1.6328   2.0740  1.4647   0.6093          1.0235 200
>>>> gp            0.2367   0.2529  0.2189   0.0341          0.2026 200
>>>>
>>>> The apparent Dxy is 0.63, and the bias-corrected Dxy is 0.52. The
>>>> maximum absolute calibration error (Emax) is estimated to be 0.10. The
>>>> changes in slope and intercept are also fairly substantial. In all,
>>>> there is evidence that I am somewhat overfitting the data, right?
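As a rough cross-check (a sketch using only the numbers printed above), the
heuristic shrinkage estimate of van Houwelingen and Le Cessie points the same
way as the bootstrap calibration slope:

    # heuristic shrinkage estimate: (LR chi-square - d.f.) / LR chi-square
    lr.chisq <- 29.24    # from the x6.lrm printout above
    df.model <- 5
    (lr.chisq - df.model) / lr.chisq
    # [1] 0.829  -- i.e. roughly 17% shrinkage expected; the bootstrap
    # calibration slope of 0.757 suggests somewhat more (about 24%)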
>>>>
>>>> Furthermore, using step-down variable selection ...
>>>>
>>>>    >   validate(x6.lrm, B=200, bw=T)
>>>>
>>>> Backwards Step-down - Original Model
>>>>
>>>>     Deleted        Chi-Sq d.f. P      Residual d.f. P      AIC
>>>>     stenosis       0.28   1    0.5979 0.28     1    0.5979 -1.72
>>>>     ClinicalScore  2.60   1    0.1068 2.88     2    0.2370 -1.12
>>>>     x2             2.86   1    0.0910 5.74     3    0.1252 -0.26
>>>>
>>>> Approximate Estimates after Deleting Factors
>>>>
>>>>                 Coef   S.E. Wald Z         P
>>>> Intercept  -5.865 1.4136 -4.149 3.336e-05
>>>> x1          2.915 0.8685  3.357 7.889e-04
>>>> procedure   1.072 0.5590  1.918 5.508e-02
>>>>
>>>> Factors in Final Model
>>>>
>>>> [1] x1         procedure
>>>>              index.orig training    test optimism index.corrected   n
>>>> Dxy           0.5661   0.6755  0.5559   0.1196          0.4464 200
>>>> R2            0.2876   0.4085  0.2784   0.1301          0.1575 200
>>>> Intercept     0.0000   0.0000 -0.2459   0.2459         -0.2459 200
>>>> Slope         1.0000   1.0000  0.7300   0.2700          0.7300 200
>>>> Emax          0.0000   0.0000  0.1173   0.1173          0.1173 200
>>>> D             0.2038   0.3130  0.1970   0.1160          0.0877 200
>>>> U            -0.0192  -0.0192  0.0382  -0.0574          0.0382 200
>>>> Q             0.2230   0.3323  0.1589   0.1734          0.0496 200
>>>> B             0.1441   0.1192  0.1452  -0.0261          0.1702 200
>>>> g             1.2628   1.9524  1.3222   0.6302          0.6326 199
>>>> gp            0.2041   0.2430  0.2043   0.0387          0.1654 199
>>>>
>>>> If I select only two variables (x1 and procedure), bias-corrected Dxy
>>>> goes down to 0.45.
>>>>
>>>> [Question 1]
>>>> I have an EPV problem. Even so, should I keep the full (5-variable)
>>>> model, or can I use the 2-variable (x1 and procedure) model that
>>>> validate() with step-down provides?
>>>>
>>>> [Question 2]
>>>> If I use the 2-variable model, should I refit it, i.e.
>>>> x2.lrm <- lrm(postopDWI_HI ~ T1+procedure2, data=x6.df, x=T, y=T),
>>>> or keep the values shown above by the validate function?
>>>>
>>>> Next, shrinkage ...
>>>>
>>>>    >   pentrace(x6.lrm, seq(0, 5.0, by=0.05))
>>>> Best penalty:
>>>> penalty         df
>>>>       3.05   4.015378
>>>>
>>>> The best penalty is 3.05. So, I update the fit with this penalty to
>>>> obtain the corresponding penalized model:
>>>>
>>>>    >   x6.lrm.pen<- update(x6.lrm, penalty=3.05, x=T, y=T)
>>>>    >   x6.lrm.pen
>>>> .....
>>>> Penalty factors
>>>>
>>>>     simple nonlinear interaction nonlinear.interaction
>>>>       3.05      3.05        3.05                  3.05
>>>> Final penalty on -2 log L
>>>>         [,1]
>>>> [1,]  3.8
>>>>
>>>> Obs          104    LR chi2     28.18    R2      0.313    C       0.818
>>>>  negative     79    d.f.        4.015    g       1.264    Dxy     0.635
>>>>  positive     25    Pr(> chi2) <0.0001   gr      3.538    gamma   0.637
>>>> max |deriv| 3e-05                        gp      0.201    tau-a   0.234
>>>>                                          Brier   0.129
>>>>
>>>>                   Coef    S.E.   Wald Z Pr(>|Z|) Penalty Scale
>>>> Intercept      -4.7246 2.2429 -2.11  0.0352    0.0000
>>>> stenosis       -0.0105 0.0240 -0.44  0.6621   17.8021
>>>> x1              2.3605 0.7254  3.25  0.0011    0.6054
>>>> x2             -0.5385 0.3653 -1.47  0.1404    1.2851
>>>> procedure       0.9247 0.4844  1.91  0.0563    0.8576
>>>> ClinicalScore   0.3046 0.1874  1.63  0.1041    2.4779
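If the penalized model is the one to be reported, approximate odds ratios and
confidence intervals can be read off with summary() once a datadist is set up;
a sketch (the specific procedure settings shown are assumptions, not taken
from the original post):

    dd <- datadist(x6.df)
    options(datadist = "dd")
    summary(x6.lrm.pen)   # inter-quartile-range effects by default, with
                          # antilogged (odds-ratio) estimates and 0.95 CIs
    # effect of a specific change, e.g. a hypothetical 0 -> 1 coding:
    summary(x6.lrm.pen, procedure = c(0, 1))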
>>>>
>>>> Arrange the coefficients of the two models side by side, and also list
>>>> the difference between the two:
>>>>
>>>>    >   cbind(coef(x6.lrm), coef(x6.lrm.pen),
>>>> abs(coef(x6.lrm)-coef(x6.lrm.pen)))
>>>>                          [,1]        [,2]        [,3]
>>>> Intercept      -5.53281808 -4.72464766 0.808170417
>>>> stenosis       -0.01496757 -0.01050797 0.004459599
>>>> x1              3.04248257  2.36051833 0.681964238
>>>> x2             -0.75335619 -0.53854750 0.214808685
>>>> procedure       1.20847252  0.92474708 0.283725441
>>>> ClinicalScore   0.37623189  0.30457557 0.071656322
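For what it is worth, the proportional shrinkage in each slope can be computed
directly from the two fits above (a small illustration, not part of the
original post):

    # fraction by which penalization shrank each coefficient (intercept excluded)
    1 - coef(x6.lrm.pen)[-1] / coef(x6.lrm)[-1]
    # e.g. x1: 1 - 2.3605/3.0425 = 0.22, i.e. about 22% shrinkage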
>>>>
>>>> [Question 3]
>>>> Is this penalized model the one I should present to my colleagues?
>>>> I still have an EPV problem. Or is the EPV problem acceptable if I use
>>>> penalization?
>>>>
>>>> I am still wondering what I can do to avoid the EPV problem.
>>>> Collecting new data would take a long time and a huge amount of work...
>>>>
>>>>
>>>> (11/04/22 1:46), [hidden email] wrote:
>>>>> Thank you for your comment.
>>>>> I forgot to mention that varclus and pvclust showed similar results for
>>>>> my data.
>>>>>
>>>>> BTW, I did not realize rms is a replacement for the Design package.
>>>>> I appreciate your suggestion.
>>>>> --
>>>>> KH
>>>>>
>>>>> (11/04/21 8:00), Frank Harrell wrote:
>>>>>> I think it's OK. You can also use the Hmisc package's varclus
>>>>>> function.
>>>>>> Frank
>>>>>>
>>>>>>
>>>>>> 細田弘吉 wrote:
>>>>>>>
>>>>>>> Dear Prof. Harrel,
>>>>>>>
>>>>>>> Thank you very much for your quick advice.
>>>>>>> I will try rms package.
>>>>>>>
>>>>>>> Regarding model reduction, is my model 2 method (clustering and
>>>>>>> recoding
>>>>>>> that are blinded to the outcome) permissible?
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> --
>>>>>>> KH
>>>>>>>
>>>>>>> (11/04/20 22:01), Frank Harrell wrote:
>>>>>>>> Deleting variables is a bad idea unless you make that a formal part
>>>>>>>> of the BMA so that the attempt to delete variables is penalized for.
>>>>>>>> Instead of BMA I recommend simple penalized maximum likelihood
>>>>>>>> estimation (see the lrm function in the rms package) or pre-modeling
>>>>>>>> data reduction that is blinded to the outcome variable.
>>>>>>>> Frank
>>>>>>>>
>>>>>>>>
>>>>>>>> 細田弘吉 wrote:
>>>>>>>>>
>>>>>>>>> Hi everybody,
>>>>>>>>> I apologize for long mail in advance.
>>>>>>>>>
>>>>>>>>> I have data of 104 patients, which consists of 15 explanatory
>>>>>>>>> variables
>>>>>>>>> and one binary outcome (poor/good). The outcome consists of 25 poor
>>>>>>>>> results and 79 good results. I tried to analyze the data with
>>>>>>>>> logistic
>>>>>>>>> regression. However, the 15 variables and 25 events means events
>>>>>>>>> per
>>>>>>>>> variable (EPV) is much less than 10 (rule of thumb). Therefore, I
>>>>>>>>> used R
>>>>>>>>> package, "BMA" to perform logistic regression with BMA to avoid
>>>>>>>>> this
>>>>>>>>> problem.
>>>>>>>>>
>>>>>>>>> model 1 (full model):
>>>>>>>>> x1, x2, x3, x4 are continuous variables and others are binary data.
>>>>>>>>>
>>>>>>>>>> x16.bic.glm<- bic.glm(outcome ~ ., data=x16.df,
>>>>>>>>> glm.family="binomial", OR20, strict=FALSE)
>>>>>>>>>> summary(x16.bic.glm)
>>>>>>>>> (The output below has been cut off at the right edge to save space)
>>>>>>>>>
>>>>>>>>> 62 models were selected
>>>>>>>>> Best 5 models (cumulative posterior probability = 0.3606 ):
>>>>>>>>>
>>>>>>>>> p!=0 EV SD model 1 model2
>>>>>>>>> Intercept 100 -5.1348545 1.652424 -4.4688 -5.15
>>>>>>>>> -5.1536
>>>>>>>>> age 3.3 0.0001634 0.007258 .
>>>>>>>>> sex 4.0
>>>>>>>>> .M -0.0243145 0.220314 .
>>>>>>>>> side 10.8
>>>>>>>>> .R 0.0811227 0.301233 .
>>>>>>>>> procedure 46.9 -0.5356894 0.685148 . -1.163
>>>>>>>>> symptom 3.8 -0.0099438 0.129690 . .
>>>>>>>>> stenosis 3.4 -0.0003343 0.005254 .
>>>>>>>>> x1 3.7 -0.0061451 0.144084 .
>>>>>>>>> x2 100.0 3.1707661 0.892034 3.2221 3.11
>>>>>>>>> x3 51.3 -0.4577885 0.551466 -0.9154 .
>>>>>>>>> HT 4.6
>>>>>>>>> .positive 0.0199299 0.161769 . .
>>>>>>>>> DM 3.3
>>>>>>>>> .positive -0.0019986 0.105910 . .
>>>>>>>>> IHD 3.5
>>>>>>>>> .positive 0.0077626 0.122593 . .
>>>>>>>>> smoking 9.1
>>>>>>>>> .positive 0.0611779 0.258402 . .
>>>>>>>>> hyperlipidemia 16.0
>>>>>>>>> .positive 0.1784293 0.512058 . .
>>>>>>>>> x4 8.2 0.0607398 0.267501 . .
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> nVar               2          2          1          3          3
>>>>>>>>> BIC        -376.9082  -376.5588  -376.3094  -375.8468  -374.5582
>>>>>>>>> post prob      0.104      0.087      0.077      0.061      0.032
>>>>>>>>>
>>>>>>>>> [Question 1]
>>>>>>>>> Is it O.K to calculate odds ratio and its 95% confidence interval
>>>>>>>>> from
>>>>>>>>> "EV" (posterior distribution mean) and“SD”(posterior distribution
>>>>>>>>> standard deviation)?
>>>>>>>>> For example, 95%CI of EV of x2 can be calculated as;
>>>>>>>>>> exp(3.1707661)
>>>>>>>>> [1] 23.82573 ----->   odds ratio
>>>>>>>>>> exp(3.1707661+1.96*0.892034)
>>>>>>>>> [1] 136.8866
>>>>>>>>>> exp(3.1707661-1.96*0.892034)
>>>>>>>>> [1] 4.146976
>>>>>>>>> ------------------>   95%CI (4.1 to 136.9)
>>>>>>>>> Is this O.K.?
>>>>>>>>>
>>>>>>>>> [Question 2]
>>>>>>>>> Is it permissible to delete variables with small value of "p!=0"
>>>>>>>>> and
>>>>>>>>> "EV", such as age (3.3% and 0.0001634) to reduce the number of
>>>>>>>>> explanatory variables and reconstruct new model without those
>>>>>>>>> variables
>>>>>>>>> for new session of BMA?
>>>>>>>>>
>>>>>>>>> model 2 (reduced model):
>>>>>>>>> I used R package, "pvclust", to reduce the model. The result
>>>>>>>>> suggested
>>>>>>>>> x1, x2 and x4 belonged to the same cluster, so I picked up only x2.
>>>>>>>>> Based on subject-matter knowledge, I made a simple unweighted sum by
>>>>>>>>> counting the number of clinical features present. For the 9 features
>>>>>>>>> (sex, side, HT2, hyperlipidemia, DM, IHD, smoking, symptom, age), the
>>>>>>>>> sum ranges from 0 to 9. This score was defined as ClinicalScore.
>>>>>>>>> Consequently, I made up a new data set (x6.df), which consists of 5
>>>>>>>>> variables (stenosis, x2, x3, procedure, and ClinicalScore) and one
>>>>>>>>> binary outcome (poor/good). Then, for an alternative BMA session...
>>>>>>>>>
>>>>>>>>>> BMAx6.glm<- bic.glm(postopDWI_HI ~ ., data=x6.df,
>>>>>>>>> glm.family="binomial", OR=20, strict=FALSE)
>>>>>>>>>> summary(BMAx6.glm)
>>>>>>>>> (The output below has been cut off at the right edge to save space)
>>>>>>>>> Call:
>>>>>>>>> bic.glm.formula(f = postopDWI_HI ~ ., data = x6.df, glm.family =
>>>>>>>>> "binomial", strict = FALSE, OR = 20)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 13 models were selected
>>>>>>>>> Best 5 models (cumulative posterior probability = 0.7626 ):
>>>>>>>>>
>>>>>>>>>                 p!=0    EV          SD        model 1    model 2
>>>>>>>>> Intercept      100     -5.6918362  1.81220    -4.4688    -6.3166
>>>>>>>>> stenosis         8.1   -0.0008417  0.00815     .          .
>>>>>>>>> x2             100.0    3.0606165  0.87765     3.2221     3.1154
>>>>>>>>> x3              46.5   -0.3998864  0.52688    -0.9154     .
>>>>>>>>> procedure       49.3    0.5747013  0.70164     .          1.1631
>>>>>>>>> ClinicalScore   27.1    0.0966633  0.19645     .          .
>>>>>>>>>
>>>>>>>>> nVar                   2          2          1          3          3
>>>>>>>>> BIC            -376.9082  -376.5588  -376.3094  -375.8468  -375.5025
>>>>>>>>> post prob          0.208      0.175      0.154      0.122      0.103
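A sketch of how the unweighted ClinicalScore described above might have been
constructed; the level labels and the age cutoff are hypothetical, since the
actual coding of x16.df is not shown in the thread:

    # each comparison is TRUE/FALSE, so the sum counts the features present (0-9)
    x16.df$ClinicalScore <- with(x16.df,
        (sex == "M") + (side == "R") + (HT2 == "positive") +
        (hyperlipidemia == "positive") + (DM == "positive") +
        (IHD == "positive") + (smoking == "positive") +
        (symptom == "positive") + (age >= 75))   # hypothetical age cutoff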
>>>>>>>>>
>>>>>>>>> [Question 3]
>>>>>>>>> Am I doing this correctly or not?
>>>>>>>>> I mean, is this kind of model reduction permissible for BMA?
>>>>>>>>>
>>>>>>>>> [Question 4]
>>>>>>>>> I still have 5 variables, which violates the rule of thumb "EPV > 10".
>>>>>>>>> Is it permissible to delete the "stenosis" variable because of its
>>>>>>>>> small "EV" value? Or is that O.K. because this is BMA?
>>>>>>>>>
>>>>>>>>> Sorry for long post.
>>>>>>>>>
>>>>>>>>> I appreciate your help very much in advance.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> KH
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -----
>>>>>>>> Frank Harrell
>>>>>>>> Department of Biostatistics, Vanderbilt University
>>>>>>>>
>>>>
>>>>
>>>
>>>
>>> -----
>>> Frank Harrell
>>> Department of Biostatistics, Vanderbilt University
>>>
>>
>>
>> --
>> *************************************************
>>  Department of Neurosurgery, Kobe University Graduate School of Medicine
>>  Kohkichi Hosoda (細田 弘吉)
>>
>>  7-5-1 Kusunoki-cho, Chuo-ku, Kobe 650-0017, Japan
>>       Phone: 078-382-5966
>>       Fax  : 078-382-5979
>>       E-mail address
>>           Office: [hidden email]
>>           Home  : [hidden email]
>> *************************************************
>>
>>
>
>
> -----
> Frank Harrell
> Department of Biostatistics, Vanderbilt University
>


--
*************************************************
 Department of Neurosurgery, Kobe University Graduate School of Medicine
 Kohkichi Hosoda (細田 弘吉)

 7-5-1 Kusunoki-cho, Chuo-ku, Kobe 650-0017, Japan
     Phone: 078-382-5966
     Fax  : 078-382-5979
     E-mail address
         Office: [hidden email]
        Home  : [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.