How to estimate whether overfitting?

How to estimate whether overfitting?

Kevin Hao
1. Is there a criterion for detecting overfitting, e.g. based on R2 and Q2 for the training set and R2 for the test set? For example, in my data R2 = 0.94 for the training set and R2 = 0.70 for the test set. Is that overfitting?
2. Judging from this scatter plot, can one say the model is overfitted?

3. My results were obtained with an SVM. The training and test sets contain 156 and 52 samples, and there are 96 predictors. Can an SVM be used for prediction in this case? Are there too many predictors?

4. From this picture, can you suggest how to improve model performance? Is the picture bad?

5. The picture and data are below.
Thank you!


scatter.jpg

pkc-svm.txt
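
For readers unsure how the three numbers in question 1 relate, here is a minimal sketch of one common way to compute them. Everything in it is an assumption made for illustration: the toy data are made up, a one-predictor least-squares line stands in for the SVM, and Q2 is taken to be the leave-one-out version, 1 - PRESS / SS_tot. It is not the poster's actual calculation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, with SS_tot taken about the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def q_squared_loo(xs, ys):
    """Leave-one-out Q2: each point is predicted by a model fitted without it."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a + b * xs[i])
    return r_squared(ys, preds)


# Toy data, made up for illustration only
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9]
x_test = [1.5, 3.5, 5.5]
y_test = [1.2, 3.9, 5.1]

a, b = fit_line(x_train, y_train)
r2_train = r_squared(y_train, [a + b * x for x in x_train])
q2_train = q_squared_loo(x_train, y_train)
r2_test = r_squared(y_test, [a + b * x for x in x_test])
print(f"R2 train={r2_train:.3f}  Q2 (LOO)={q2_train:.3f}  R2 test={r2_test:.3f}")
```

The training R2 is computed on the same points the model was fitted to, so it is the most optimistic of the three; Q2 and the test R2 both score predictions for points the fitted model did not see.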

Re: How to estimate whether overfitting?

David Winsemius

On May 9, 2010, at 9:20 AM, bbslover wrote:

>
> 1. Is there a criterion for detecting overfitting, e.g. based on R2 and Q2
> for the training set and R2 for the test set? For example, in my data
> R2 = 0.94 for the training set and R2 = 0.70 for the test set. Is that
> overfitting?
> 2. Judging from this scatter plot, can one say the model is overfitted?
>
> 3. My results were obtained with an SVM. The training and test sets contain
> 156 and 52 samples, and there are 96 predictors. Can an SVM be used for
> prediction in this case? Are there too many predictors?
>

I think you need to buy a copy of Hastie, Tibshirani, and Friedman and  
do some self-study of chapters 7 and 12.


> 4. From this picture, can you suggest how to improve model performance?
> Is the picture bad?
>
> 5. The picture and data are below.
> Thank you!
>
> http://n4.nabble.com/file/n2164417/scatter.jpg scatter.jpg
>
> http://n4.nabble.com/file/n2164417/pkc-svm.txt pkc-svm.txt
--

David Winsemius, MD
West Hartford, CT

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: How to estimate whether overfitting?

Kevin Hao
Thanks for your suggestion.
I do have a lot to learn. I will buy that book.

kevin

Re: How to estimate whether overfitting?

Steve Lianoglou-6
In reply to this post by David Winsemius
On Sun, May 9, 2010 at 11:53 AM, David Winsemius <[hidden email]> wrote:

> I think you need to buy a copy of Hastie, Tibshirani, and Friedman and do
> some self-study of chapters 7 and 12.

And you don't even have to buy it before you can start studying since
the PDF is available here:
http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Having a hard cover is always handy, tho ..
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Re: How to estimate whether overfitting?

Frank Harrell
In reply to this post by David Winsemius
On 05/09/2010 10:53 AM, David Winsemius wrote:

>
> On May 9, 2010, at 9:20 AM, bbslover wrote:
>
>> 3. My results were obtained with an SVM. The training and test sets contain
>> 156 and 52 samples, and there are 96 predictors. Can an SVM be used for
>> prediction in this case? Are there too many predictors?

Your test sample is too small by a factor of 100 for split sample
validation to work well.

Frank
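
The reply above is about how noisy a small hold-out estimate can be. A rough way to see that for oneself is sketched below, under loud assumptions: the data are simulated (one predictor, Gaussian noise), a straight-line fit stands in for the SVM, and only the 156/52 split sizes come from the thread. The sketch does not test the "factor of 100" figure; it only reports how much the test-set R2 moves when nothing but the random split changes.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, with SS_tot taken about the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)           # fixed seed so the run is repeatable
n_total, n_test = 208, 52        # 156 training + 52 test samples, as in the thread

# Simulated data (an assumption, not the poster's data set)
xs = [rng.uniform(0.0, 10.0) for _ in range(n_total)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

test_r2 = []
for _ in range(200):             # 200 different random 156/52 splits
    idx = list(range(n_total))
    rng.shuffle(idx)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    y_held_out = [ys[i] for i in test_idx]
    y_predicted = [a + b * xs[i] for i in test_idx]
    test_r2.append(r_squared(y_held_out, y_predicted))

print(f"test R2 over 200 splits: min={min(test_r2):.2f} "
      f"max={max(test_r2):.2f} sd={statistics.stdev(test_r2):.3f}")
```

The underlying model and data never change between iterations, so any spread in the printed values comes purely from which 52 points happened to land in the test set.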

--
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University


Re: How to estimate whether overfitting?

Kevin Hao
In reply to this post by Steve Lianoglou-6
Thank you, I have downloaded it and am studying it.

Re: How to estimate whether overfitting?

Kevin Hao
In reply to this post by Frank Harrell
Many thanks. I can try to use a test set with 100 samples.

Another question: how can I rationally split my data into a training set and a test set (a training set with 108 samples and a test set with 100 samples)?

As far as I know, the test set should have the same distribution as the training set. What method can produce such a split?

And which R packages can split data into training and test sets in a rational way?


If the split is random, it seems that many splits are needed, with the average of the results taken as the final result.

However, I would like to use some method that yields a fixed training set and test set instead of a random split.

The training and test sets should look like this: ideally, the division must be performed such that points representing both the training and test sets are distributed within the whole feature space occupied by the entire dataset, and each point of the test set is close to at least one point of the training set. This approach ensures that the similarity principle can be employed for the output prediction of the test set. Certainly, this condition cannot always be satisfied.

So, in general, which algorithms are commonly used for splitting, and which are more rational? Papers often say that the data set was split randomly. What does "randomly" mean here? Just random selection, or some defined method, e.g. ordering by the output? I would really like to know which package can split data rationally.
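
As an illustration of what "split randomly" usually amounts to, here is a minimal sketch, assuming a simple shuffle of the sample indices with a recorded seed. The function name `random_split` and the seed value are hypothetical; only the sizes 208 and 52 come from this thread.

```python
import random


def random_split(n_samples, n_test, seed):
    """Shuffle the indices 0..n_samples-1 with a seeded generator and cut once.

    Returns (train_indices, test_indices), each sorted for readability.
    """
    rng = random.Random(seed)      # recording the seed makes the split repeatable
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return sorted(idx[n_test:]), sorted(idx[:n_test])


# 208 samples in total, 52 of them held out (the sizes from the first post)
train_idx, test_idx = random_split(208, 52, seed=2010)
print(len(train_idx), len(test_idx))
```

Fixing and reporting the seed before looking at any results is one way to make a "random" split checkable by others, since the same seed always reproduces the same partition.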

Also, if one wants better-looking results, some "tricks" are possible: e.g. one can select the test set again and again, keep the test set with the best results as the final test set, and say that it was selected randomly. But that is not truly random; it is false.

Thank you, and sorry for so many questions, but this has always puzzled me. Up to now I have no good method for rationally splitting my data into a training set and a test set.

Finally, the split into training and test sets should be done before modeling, and it seems it can be based on the features alone (SOM), on the features and the output (the SPXY algorithm; paper: "A method for calibration and validation subset partitioning"), or on the output alone (output order).

But there are often many features to calculate, and some features are all zero or have a low standard deviation (sd < 0.5). Should we delete these features before splitting the whole data set?
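
The filtering step described here could be sketched as follows. This is only an illustration under assumptions: the helper `drop_low_sd_features`, the feature names, and the toy rows are all hypothetical, and the 0.5 cutoff is simply the value mentioned above, which is only meaningful relative to the scale of each feature.

```python
import statistics


def drop_low_sd_features(rows, names, min_sd=0.5):
    """Keep only the columns whose sample standard deviation is >= min_sd.

    rows  : list of samples, each a list of feature values
    names : feature names, one per column
    """
    columns = list(zip(*rows))     # transpose: one tuple per feature
    keep = [j for j, col in enumerate(columns)
            if statistics.stdev(col) >= min_sd]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names


# Toy data, made up for illustration only
names = ["all_zero", "nearly_constant", "informative"]
rows = [
    [0.0, 1.0, 10.0],
    [0.0, 1.2, 12.0],
    [0.0, 0.9, 15.0],
    [0.0, 1.1, 11.0],
]

filtered, kept_names = drop_low_sd_features(rows, names)
print(kept_names)
```

The filter looks only at the predictors, never at the response, which is why it is sometimes applied before any split; whether that is appropriate for a given study is a separate question that this sketch does not settle.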

Then the remaining features are used to split the data, only the training set is used to build the regression model, perform feature selection and do cross-validation, and the independent test set is used only to test the built model. Is that right?

Maybe my thinking about the whole modeling process is not clear, but I think it goes like this:
1) Get samples.
2) Calculate features.
3) Preprocess the calculated features (e.g. remove all-zero features).
4) Rationally split the data into training and test sets (this always puzzles me: how on earth should I split?).
5) Build the model and tune its parameters at the same time, using resampling methods on the training set only, and obtain the final model.
6) Test the model performance on the independent test set (unseen samples).
7) Assess the model. Good or bad? Overfitted? (In general, what counts as overfitting? Can you give me an example? As far as I know, a model is overfitted when it fits the training set well but performs badly on the independent test set. But what is good, and what is bad? With r2 = 0.94 on the training set and r2 = 0.70 on the test set, is the model overfitted? Can the model be accepted? And in general, what kind of model is acceptable?)
8) Conclusion: how good is the model?

That is my thinking, and many questions are waiting for answers.

Thanks,

kevin


Re: How to estimate whether overfitting?

Frank Harrell
On 05/10/2010 12:32 AM, bbslover wrote:

>
> many thanks .  I can try to use test set with 100 samples.
> [...]
>


Kevin: I'm sorry I don't have time to deal with such a long note, but
briefly data splitting is not a good idea no matter how you do it unless
N > perhaps 20,000.  I suggest resampling, e.g., either the bootstrap
with 300 resamples or 50-fold repeats of 10-fold cross-validation.
Among other places these are implemented in my rms package.

Frank
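
For readers who want to see what a bootstrap validation with 300 resamples can look like, here is a minimal sketch of the optimism bootstrap. It is written under stated assumptions and does not use or reproduce the rms package: the data are simulated, a one-predictor least-squares line stands in for the real model, and the function name `optimism_corrected_r2` is hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, with SS_tot taken about the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def optimism_corrected_r2(xs, ys, n_boot=300, seed=0):
    """Return (apparent R2, optimism-corrected R2).

    For each resample: fit on the resample, score it on the resample and on
    the original data; the average difference estimates the optimism.
    """
    rng = random.Random(seed)
    n = len(xs)
    a, b = fit_line(xs, ys)
    apparent = r_squared(ys, [a + b * x for x in xs])
    total_optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        a_b, b_b = fit_line(bx, by)
        r2_on_resample = r_squared(by, [a_b + b_b * x for x in bx])
        r2_on_original = r_squared(ys, [a_b + b_b * x for x in xs])
        total_optimism += r2_on_resample - r2_on_original
    return apparent, apparent - total_optimism / n_boot


# Simulated data (an assumption, not the data from this thread)
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(40)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

apparent, corrected = optimism_corrected_r2(xs, ys)
print(f"apparent R2={apparent:.3f}  optimism-corrected R2={corrected:.3f}")
```

Every observation is used both for fitting and for validation, so no data are set aside. For a real analysis, every modeling step (feature selection, tuning) would have to be repeated inside each resample for the correction to be honest.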

--
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University


Re: How to estimate whether overfitting?

Bert Gunter
In reply to this post by Steve Lianoglou-6
(Near) non-identifiability (especially in nonlinear models, which include
linear mixed effects models, Bayesian hierarchical models, etc.) is
typically a strong clue; usually indicated by software complaints (e.g.
convergence failures, running up against iteration limits, etc.).

However this is sufficient-ish, not necessary: "over-fitting" frequently
occurs even without such overt complaints. It should also be said that,
except for identifiability,  "over-fitting" is not a well-defined
statistical term: it depends on the scientific context.


Bert Gunter
Genentech Nonclinical Biostatistics
 

Re: How to estimate whether overfitting?

Kevin Hao
Thanks for your help. Perhaps my statistics background is too weak; I cannot fully understand what you mean.

Best wishes,
kevin

On 2010-05-11, "Bert Gunter" <[hidden email]> wrote:

>(Near) non-identifiability (especially in nonlinear models, which include
>linear mixed effects models, Bayesian hierarchical models, etc.) is
>typically a strong clue; usually indicated by software complaints (e.g.
>convergence failures, running up against iteration limits, etc.).
>
>However this is sufficient-ish, not necessary: "over-fitting" frequently
>occurs even without such overt complaints. It should also be said that,
>except for identifiability,  "over-fitting" is not a well-defined
>statistical term: it depends on the scientific context.

Re: How to estimate whether overfitting?

Bert Gunter


-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On
Behalf Of [hidden email]
Sent: Tuesday, May 11, 2010 3:23 AM
To: Bert Gunter
Cc: [hidden email]
Subject: Re: [R] How to estimate whether overfitting?

Thanks for your help. Perhaps my statistics background is too weak; I cannot
fully understand what you mean.

-- Then you should consult a local expert. -- Bert

wishes
kevin
