Short course: Statistical Learning and Data Mining II:
tools for tall and wide data
Trevor Hastie and Robert Tibshirani, Stanford University
Palo Alto, California,
April 3-4, 2006.
This two-day course gives a detailed overview of statistical models for
data mining, inference and prediction. With the rapid developments
in internet technology, genomics, financial risk modeling, and other
high-tech industries, we rely increasingly on data analysis and
statistical models to exploit the vast amounts of data at our fingertips.
This course is the third in a series, and follows our popular past
offerings "Modern Regression and Classification", and "Statistical
Learning and Data Mining".
The two earlier courses are not prerequisites for this new course.
In this course we emphasize the tools useful for tackling modern-day
data analysis problems. We focus on both "tall" data (N>p, where
N=#cases, p=#features) and "wide" data (p>N). The tools include
gradient boosting, SVMs and kernel methods, random forests, lasso and
LARS, ridge regression and GAMs, supervised principal components, and
cross-validation. We also present some interesting case studies in a
variety of application areas. All our examples are developed using the
S language, and most of the procedures we discuss are implemented in
publicly available R packages.