Hello,
I need to analyse a data matrix with dimensions of 30x100. Before analysing the data, however, I need to remove outliers. I have read quite a lot about outlier removal already, and the most common technique seems to be Principal Component Analysis (PCA). However, I think this technique is quite subjective. When is an outlier an outlier? I uploaded an example PCA plot here: http://s14.postimage.org/oknyya1ld/pca.png

Should we treat the green and red dots as outliers already, or only the blue one, which lies outside the 95% confidence interval? It seems very arbitrary how people remove outliers using PCA.

I also thought about fitting a linear model to my data and looking at the distribution of the residuals. However, the problem with linear models is that one can never be sure that the model used is the one that describes the data best. Under model A, for instance, we might treat sample 1 as an outlier, but under a different model B sample 1 might not be an outlier at all.

I had a brief look at k-means clustering as well, but I do not think it is the right approach. Again, how does one decide which cluster is an outlier? It is also known that different cluster analyses can lead to totally different results. So which one should I choose?

Is there any other way to remove outliers from data non-subjectively? I would really appreciate any ideas/comments you might have on this topic.

Cheers
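A rough illustration of why the "outside the 95% region" criterion can feel arbitrary, sketched in Python with the standard library only. It assumes the region in the plot is a normal-theory 95% ellipse for the first two component scores (the plot itself may have been built differently), and it uses simulated, outlier-free scores rather than the poster's data. The function name `mahalanobis_sq` is made up for this sketch.

```python
import math
import random

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Quadratic form with the inverse of the 2x2 covariance matrix.
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my) + sxx * (y - my) ** 2) / det
        for x, y in points
    ]

# Exact 95% quantile of a chi-square distribution with 2 degrees of freedom.
cutoff = -2.0 * math.log(1.0 - 0.95)

# Simulated scores with no contamination at all (hypothetical data).
rng = random.Random(1)
scores = [(rng.gauss(0.0, 3.0), rng.gauss(0.0, 1.0)) for _ in range(2000)]

distances = mahalanobis_sq(scores)
fraction_outside = sum(d > cutoff for d in distances) / len(scores)
print(f"cutoff = {cutoff:.3f}, fraction outside = {fraction_outside:.3f}")
```

Under these assumptions, roughly one clean point in twenty falls outside the ellipse by construction, so lying outside a 95% region is not by itself evidence that a sample is erroneous.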
On Thu, 9 Feb 2012, mails wrote:
> I need to analyse a data matrix with dimensions of 30x100. Before
> analysing the data there is, however, a need to remove outliers from the
> data. I read quite a lot about outlier removal already and I think the
> most common technique for that seems to be Principal Component Analysis
> (PCA). However, I think that this technique is quite subjective. When is
> an outlier an outlier? I uploaded an example PCA plot here:

Those more expert than I will certainly provide answers. What I do with new data is create box-and-whisker plots (I use the lattice package), which define outliers as those data lying more than 1.5 times the interquartile range beyond the first or third quartile.

No one but you can answer your question on when an outlier is an outlier. It depends on your data set and the context of the data. For example, a water chemistry value that far exceeds a regulatory threshold might be meaningful in the context of a one-off excursion (in which case it's not an outlier but a real data point), or it might result from a handling, instrumentation, or analytical error (in which case toss it as an outlier).

Rich

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
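The box-and-whisker rule mentioned in the reply above can be sketched as follows. This is a minimal Python version using only the standard library; the helper names `tukey_fences` and `flag_outliers` and the sample readings are made up for illustration, and the quartile definition used here is not necessarily the one a given plotting package applies, so borderline points may be classified differently.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: the quartiles extended by k times the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one suspicious reading.
readings = [4.1, 4.3, 4.4, 4.6, 4.7, 4.9, 5.0, 5.2, 12.8]

print(tukey_fences(readings))
print(flag_outliers(readings))
```

The rule only flags a value for inspection. Whether the flagged reading is an error or a real excursion still depends on the context of the data, as the reply points out.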
I wonder why it is still standard practice in some circles to search for "outliers" as opposed to using robust/resistant methods.
Here is a great paper with a scientific approach to "outliers":

@Article{fin06cal,
  author  = {Finney, David J.},
  title   = {Calibration guidelines challenge outlier practices},
  journal = {The American Statistician},
  year    = 2006,
  volume  = 60,
  pages   = {309-313},
  annote  = {anticoagulant therapy;bias;causation;ethics;objectivity;outliers;guidelines for treatment of outliers;overview of types of outliers;letter to the editor and reply 61:187 May 2007}
}

Frank
Frank Harrell
Department of Biostatistics, Vanderbilt University
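To make the contrast between searching for outliers and using robust/resistant summaries concrete, here is a small Python sketch (standard library only). The data are invented, the helper name `mad` is an assumption of this sketch, and the constant 1.4826 is the usual factor that makes the median absolute deviation comparable to the standard deviation for normally distributed data.

```python
import statistics

def mad(values, scale=1.4826):
    """Scaled median absolute deviation around the median."""
    med = statistics.median(values)
    return scale * statistics.median([abs(v - med) for v in values])

# Hypothetical clean sample, and the same sample with one gross error.
clean = [9.8, 9.9, 10.0, 10.0, 10.1, 10.2, 10.3]
contaminated = clean[:-1] + [103.0]

for name, data in [("clean", clean), ("contaminated", contaminated)]:
    print(
        f"{name:>12}: mean={statistics.mean(data):.2f} "
        f"sd={statistics.stdev(data):.2f} "
        f"median={statistics.median(data):.2f} "
        f"mad={mad(data):.2f}"
    )
```

In this toy example the mean and standard deviation are dominated by the single bad value, while the median and MAD barely move, and no decision about what counts as an outlier was needed.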
> -----Original Message-----
> From: [hidden email] [mailto:r-help-bounces@r-project.org] On Behalf Of Frank Harrell
> Sent: Thursday, February 09, 2012 9:19 AM
> To: [hidden email]
> Subject: Re: [R] Outlier removal techniques
>
> I wonder why it is still standard practice in some circles to search for
> "outliers" as opposed to using robust/resistant methods.

I would echo what Frank says. I would also add that in the absence of demonstrated measurement/recording errors, there is good reason to "explain" the extreme values as well as the typical values. If a model can't deal with extreme values, then it may be good enough for some purposes, but it is not a "complete" explanation and may fail at the worst time.

I would highly recommend the book "The Black Swan" by Nassim Nicholas Taleb (NOT the ballet story).

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
In reply to this post by Frank Harrell
> -----Original Message-----
> I wonder why it is still standard practice in some circles to
> search for "outliers" as opposed to using robust/resistant methods.

At the risk of extending an old debate and driving us off list topic, here are three possible reasons:

i) Identifying outliers is important when you want to find possible mistakes in measurement or data entry - so irrespective of whether you use robust methods, you probably want to ask questions like 'why has that result been entered as almost exactly 1000 times the value I expected?' (typically a unit error, btw). And although graphical outlier checking is the obvious way to do that, eyeballs see oddity in chance; an outlier test can help you distinguish oddity from chance and save some (arguably) unnecessary follow-up.

ii) Because supervised outlier rejection at around the 99% level performs - for simple problems - about as well as Huber's with c set to 1.5 and is a lot easier to explain to, er, people who don't understand iterative numerical methods.

iii) Because it's written into some international Standards for statistical processing of data (i.e., it's standard practice because it's Standard practice).

iv) Because you can't do robust analysis in Excel*

Not that all these are necessarily _good_ reasons ... ;-)

However, I do NOT understand why schools in the UK teach physics students that outliers should automatically and always be thrown out; that's a much larger leap.

*You can actually; with R or several add-ins. But that is off topic.
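For readers unfamiliar with the iterative method alluded to in point ii), here is a minimal Python sketch. It assumes that "Huber's with c set to 1.5" refers to a Huber M-estimate of location with tuning constant c = 1.5, computed by iterative reweighting with the scale held fixed at the scaled MAD; that is one common variant and not necessarily the one the poster had in mind. The function name `huber_location` and the data are made up for illustration.

```python
import statistics

def huber_location(values, c=1.5, tol=1e-9, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted means.

    The scale is fixed at the scaled MAD (an assumption of this sketch).
    """
    med = statistics.median(values)
    scale = 1.4826 * statistics.median([abs(v - med) for v in values])
    if scale == 0:
        return med  # more than half the values are identical
    mu = med
    for _ in range(max_iter):
        # Full weight within c * scale of the current estimate,
        # weight shrinking with distance beyond that.
        weights = [
            1.0 if abs(v - mu) <= c * scale else c * scale / abs(v - mu)
            for v in values
        ]
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical sample with one gross error (e.g. a unit mistake).
data = [9.8, 9.9, 10.0, 10.0, 10.1, 10.2, 103.0]

print("mean:            ", round(statistics.mean(data), 3))
print("huber (c = 1.5): ", round(huber_location(data), 3))
print("mean w/o 103.0:  ", round(statistics.mean(data[:-1]), 3))
```

On this toy sample the Huber estimate lands close to the mean obtained after rejecting the obvious error, which is in the spirit of the comparison made in point ii), without anyone having to decide which points to delete.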