regression modeling

regression modeling

Weiwei Shi
Hi there,
I am looking for a regression modeling approach (such as regression trees) for
a large-scale industry dataset. Can anyone suggest a package, in R or from
other sources, with decent accuracy and scalability? Any
recommendation from experience would be highly appreciated.

Thanks,

Weiwei

--
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: regression modeling

bogdan romocea
There is an aspect, worthy of careful consideration, you don't seem to
be aware of. I'll ask the question for you: How does the
explanatory/predictive potential of a dataset vary as the dataset gets
larger and larger?


Re: regression modeling

Weiwei Shi
I believe that question is not specific to regression modeling. The
relationship between sample size and confidence of prediction is not as
clear in data mining as it is in the traditional statistical approach. My
concern is less that theoretical discussion than a practical one: I am
looking for a good algorithm for a continuous response variable on a large
dataset.

Re: regression modeling

Bert Gunter
May I offer a perhaps contrary perspective on this?

Statistical **theory** tells us that the precision of estimates improves as
sample size increases. However, in practice, this is not always the case.
The reason is that it can take time to collect that extra data, and things
change over time. So the very definition of what one is measuring, the
measurement technology by which it is measured (think about estimating tumor
size or disease incidence or underemployment, for example), the presence or
absence of known or unknown large systematic effects, and so forth may
change in unknown ways. This defeats, or at least complicates, the
fundamental assumption that one is sampling from a (fixed) population or
stable (e.g. homogeneous, stationary) process, so it's no wonder that all
statistical bets are off. Of course, sometimes the necessary information to
account for these issues is present, and appropriate (but often complex)
statistical analyses can be performed. But not always.
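
To make that trade-off concrete, here is a minimal sketch in Python (not R, and not part of the original post). It assumes, purely for illustration, that the level being estimated drifts linearly by a fixed amount `drift` per observation and that the noise is independent with standard deviation `sigma`. The function name and the numbers are hypothetical. Under those assumptions the mean of the last n observations has variance sigma^2 / n but bias drift * (n - 1) / 2 as an estimate of the current level, so the mean squared error eventually grows with n.

```python
def mse_of_window_mean(n, sigma, drift):
    """MSE of the mean of the last n observations, used as an estimate
    of the *current* level of a linearly drifting process.

    Assumptions (illustrative only): independent noise with standard
    deviation sigma, and a level that moves by `drift` per observation.
    """
    variance = sigma ** 2 / n        # shrinks as n grows
    bias = drift * (n - 1) / 2       # grows as older data are included
    return variance + bias ** 2


# Hypothetical values: unit noise, a drift of 0.001 per observation.
for n in (10, 100, 1000, 10000):
    print(n, mse_of_window_mean(n, sigma=1.0, drift=0.001))
```

With these illustrative numbers the error falls between n = 10 and n = 100 and then rises steeply. With `drift=0` it falls monotonically, which is the textbook case.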

Thus, I am suspicious, cynical even, about those who advocate collecting
"all the data" and subjecting the whole vast heterogeneous mess to arcane
and ever more computer intensive (and adjustable parameter ridden) "data
mining" algorithms to "detect trends" or "discover knowledge." To me, it
sounds like a prescription for "turning on all the equipment and waiting to
see what happens" in the science lab instead of performing careful,
well-designed experiments.

I realize, of course, that there are many perfectly legitimate areas of
scientific research, from geophysics to evolutionary biology to sociology,
where one cannot (easily) perform planned experiments. But my point is that
good science demands that in all circumstances, and especially when one
accumulates and attempts to aggregate data taken over spans of time and
space, one needs to beware of oversimplification, including statistical
oversimplification. So interrogate the measurement, be skeptical of
stability, expect inconsistency. While "all models are wrong but some are
useful" (George Box), the second law of thermodynamics tells us that
entropy still rules.

(Needless to say, public or private contrary views are welcome).

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
 
 

Re: regression modeling

Frank Harrell
Bert raises some great points.  Ignoring the important issues of
doing good research and stability in the meaning of data as time marches
on, it is generally true that the larger the sample size the greater the
complexity of the model we can afford to fit, and the better the fit of
the model.  This is the "AIC" school.  The "BIC" school assumes there is
an actual model out there waiting for us, of finite dimension, and the
complexity of our models should not grow very fast as N increases.  I
find the "AIC" approach gives me more accurate predictions.
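
For readers who want the two criteria side by side, here is a small Python sketch of the standard definitions, AIC = 2k - 2 log L and BIC = k log(n) - 2 log L, where k is the number of fitted parameters, n the sample size and log L the maximised log-likelihood. The function names are illustrative and do not come from any package mentioned in this thread. The sketch shows only one thing: BIC's per-parameter penalty log(n) keeps increasing with n while AIC's stays at 2, so BIC lets model complexity grow more slowly.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian (Schwarz) information criterion: k log(n) - 2 log L."""
    return k * math.log(n) - 2 * loglik


def min_loglik_gain(n):
    """Log-likelihood gain one extra parameter must deliver before each
    criterion prefers the larger model. Returns (gain_for_aic, gain_for_bic)."""
    return 1.0, math.log(n) / 2


# The AIC threshold is constant; the BIC threshold rises with n.
for n in (10, 1000, 100000):
    print(n, min_loglik_gain(n))
```

The two thresholds coincide at n = e^2 (about 7.4); for any larger sample BIC is the stricter criterion.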
--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University

Re: regression modeling

John Maindonald
In reply to this post by Weiwei Shi
An interesting example of this is the forest cover data set that is
available from http://www.ics.uci.edu/~mlearn
The proportions of the different cover types change systematically
as one moves through the file.  It seems that distance through the
file is a proxy for the geographical co-ordinates.  Fitting a tree-based
or suchlike model to the total data is not the way to go, unless one
is going to model the pattern of change through the file as part of the
modeling exercise.  In any case, some preliminary exploration of the
data is called for, so that such matters come to attention.  For my
money, the issue is not ease of performing regression with huge data
sets, but ease of data exploration.
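
One simple exploratory check of this kind can be sketched in Python as follows. This is an illustration, not the analysis behind the remarks above: it splits the class labels, kept in file order, into consecutive blocks and reports the class proportions in each block. The function name and the toy labels are hypothetical, and the toy labels are not the forest cover data.

```python
from collections import Counter


def block_proportions(labels, n_blocks):
    """Split `labels` (in file order) into consecutive, near-equal blocks
    and return a list of {class: proportion} dicts, one per block."""
    n = len(labels)
    if n_blocks < 1 or n_blocks > n:
        raise ValueError("n_blocks must be between 1 and len(labels)")
    result = []
    for b in range(n_blocks):
        start = b * n // n_blocks
        stop = (b + 1) * n // n_blocks
        block = labels[start:stop]
        counts = Counter(block)
        result.append({c: counts[c] / len(block) for c in sorted(counts)})
    return result


# Toy labels whose class mix shifts through the "file".
labels = (["type_a"] * 90 + ["type_b"] * 10
          + ["type_a"] * 50 + ["type_b"] * 50
          + ["type_a"] * 10 + ["type_b"] * 90)

for i, props in enumerate(block_proportions(labels, 3)):
    print(i, props)
```

If the rows were in random order, the proportions would be roughly the same in every block. A steady shift from block to block suggests that position in the file carries information and needs to be modeled or at least understood before fitting anything.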

John Maindonald             email: [hidden email]
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

