Suppose I am reading data from a file and the data contains some outliers. I want to know if it is possible in R to automatically detect outliers in a dataset and remove them.

> Original Message
> From: [hidden email] [mailto:[hidden email]] On Behalf Of vikrant
> Sent: Monday, January 18, 2010 10:09 PM
> To: [hidden email]
> Subject: [R] How to detect and exclude outliers in R?

You will need to provide more information. What is your definition of an outlier? And why should those data be removed?

Daniel Nordlund
Bothell, WA USA

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
In reply to this post by vikrant
What makes an outlier an outlier depends on the model. A highly discrepant observation under one model is entirely typical under another.
Even given a model, the criteria for what constitutes an outlier vary by application area and user. Even given all of that, exclusion is only one of many possible actions. Can you be more specific about your model for the data?
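To make the point about model dependence concrete, here is a minimal sketch in base R. It is not taken from the thread; the value of 5 and the choice of a standard normal versus a standard Cauchy model are assumptions made purely for illustration. It compares how surprising the same observation is under each model.

```r
# Two-sided tail probability of an observation lying 5 scale units from the
# centre, evaluated under two different models (both chosen for illustration).
z <- 5

p_normal <- 2 * pnorm(-z)    # standard normal model
p_cauchy <- 2 * pcauchy(-z)  # standard Cauchy (heavy-tailed) model

print(c(normal = p_normal, cauchy = p_cauchy))
```

Under the normal model a value this extreme has a probability below one in a million, while under the Cauchy model it occurs roughly one time in eight, so the same observation is "highly discrepant" or "entirely typical" depending on the model assumed.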
In reply to this post by vikrant
Hi V.S.,
Did you search the R repositories for this issue before asking? Maybe not.

RSiteSearch("outliers")

Bests,
milton
In reply to this post by GlenB
Dear R users group,

I have performed PCA using the function rda in vegan and then used plot(pcaobject). I have a couple of questions:

1) The default plot shows the individual sites (black) and the variables (red). What I want, however, is a plot showing the mean of site groups with bidirectional error bars displaying the standard deviation for those groups (with the variables still plotted in the background).

2) I know how to do this by exporting the scores and loadings to Excel and then using Excel or SigmaPlot to do the graphs; however, I then have an issue with the scaling of the loadings (i.e. the values are so small that they are bunched up at the origin). So here is my second question: can I multiply the loadings by a constant to display them in my plot, and if yes, what is the convention for doing this?

Many thanks,
Paul
In reply to this post by vikrant
fortune("outlier")

Eik Vettorazzi
Institut für Medizinische Biometrie und Epidemiologie
Universitätsklinikum Hamburg-Eppendorf
Martinistr. 52
20246 Hamburg
T ++49/40/741058243
F ++49/40/741057790
In reply to this post by vikrant
I had a similar problem. In my case, I had a large table of data and wanted to find and exclude a single huge value in one column (i.e. remove the entire row). There were thousands of rows of data, and this single value was more than 3x the next value, and at least 30x the typical value. I wanted to see what the effect of removing that one datapoint was, without having to change the underlying data.
This finds and removes that one value. I assume it could be repeated to get rid of more values based on predefined criteria.

First, load the "outliers" package (here "target_column" stands in for the name of the column of interest):

    library(outliers)

    # Gives a logical vector that is FALSE everywhere except at the outlier, which
    # the package documentation defines as: "Finds value with largest difference
    # between it and sample mean, which can be an outlier".
    outlier_tf = outlier(data_full$target_column, logical=TRUE)

    # Finds the row position of the outlier, i.e. where outlier_tf is TRUE.
    find_outlier = which(outlier_tf)

    # Creates a new dataset from the old data with the outlier row removed.
    # The negative index drops that row; a positive index would keep only it.
    data_new = data_full[-find_outlier,]

Guy
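For readers without the "outliers" package, the same idea can be sketched in base R alone. This is an illustrative sketch, not Guy's code: the data frame, its column names, and the 1.5 * IQR boxplot rule used as the "predefined criterion" are all assumptions. Part (a) drops the single value farthest from the column mean; part (b) drops every value outside the boxplot fences. In both cases data_full itself is left unchanged.

```r
# Hypothetical data: nine typical values and one huge value in 'target'.
data_full <- data.frame(
  id     = 1:10,
  target = c(10, 12, 9, 11, 13, 10, 12, 11, 9, 400)
)
x <- data_full$target

# (a) Drop the single row whose value lies farthest from the column mean.
dev      <- abs(x - mean(x))
data_new <- data_full[-which.max(dev), ]

# (b) Drop every row outside the boxplot fences:
#     more than 1.5 * IQR below the first or above the third quartile.
q        <- quantile(x, c(0.25, 0.75))
fence    <- 1.5 * IQR(x)
keep     <- x >= q[[1]] - fence & x <= q[[2]] + fence
data_iqr <- data_full[keep, ]

print(data_new)
print(data_iqr)
```

Whether either rule is appropriate still depends on the model for the data, as discussed earlier in the thread; the fences in (b) are a convention, not a test.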
