CONTENTS DELETED
The author has deleted this message.

On Feb 8, 2011, at 9:11 PM, kirtau wrote:

> I am working on a function that will remove outliers for regression analysis.
> I am stating that a data point is an outlier if its studentized residual is
> above or below 3 and -3, respectively. The code below is what i have thus
> far for the function
>
> x = c(1:20)
> y = c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
> data1 = data.frame(x,y)
>
> rm.outliers = function(dataset,dependent,independent){
>     dataset$predicted = predict(lm(dependent~independent))
>     dataset$stdres = rstudent(lm(dependent~independent))
>     m = 1
>     for(i in 1:length(dataset$stdres)){
>         dataset$outlier_counter[i] = if(dataset$stdres[i] >= 3 |
>             dataset$stdres[i] <= -3) {m} else{0}
>     }
>     j = length(which(dataset$outlier_counter >= 1))
>     while(j>=1){
>         print(dataset[which(dataset$outlier_counter >= 1),])
>         dataset = dataset[which(dataset$outlier_counter == 0),]
>         dataset$predicted = predict(lm(dependent~independent))
>         dataset$stdres = rstudent(lm(dependent~independent))
>         m = m+1
>         for(k in 1:length(dataset$stdres)){
>             dataset$outlier_counter[k] = if(dataset$stdres[k] >= 3 |
>                 dataset$stdres[k] <= -3) {m} else{0}
>         }
>         j = length(which(dataset$outlier_counter >= 1))
>     }
>     return(dataset)
> }
>
> The problem that I run into is that i receive this error when i type
>
> rm.outliers(data1,data1$y,data1$x)
>
> "    x  y predicted   stdres outlier_counter
> 16 16 85  22.98647 24.04862               1
> Error in `$<-.data.frame`(`*tmp*`, "predicted", value = c(0.114285714285714,  :
>   replacement has 20 rows, data has 19"
>
> Note: the outlier_counter variable is used to state which "round" of the
> loop the datapoint was marked as an outlier.
>
> This would be a HUGE help to me and a few buddies who run a lot of different
> regression tests.

The solution is about 3 or 4 lines of code to make the function, but removing outliers like this is simply statistical malpractice. Maybe it's a good thing that R has a shallow learning curve.
--
David Winsemius, MD
West Hartford, CT

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
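The row-count error in the original post has a mechanical cause: `dependent` and `independent` are the full 20-element vectors supplied at the call, so once a row is dropped from `dataset` the refit still returns 20 predictions for a 19-row data frame. A minimal sketch of one way around this, assuming the model is passed as a formula and fitted with `data =`, follows. The names `flag_outliers` and `cutoff` are hypothetical and not from the thread, and the sketch only flags points rather than deleting them, in keeping with the caution above.

```r
# Sketch only: flag, rather than delete, points with large studentized residuals.
# `flag_outliers` and `cutoff` are hypothetical names chosen for illustration.
flag_outliers <- function(formula, data, cutoff = 3) {
  # Fitting with `data =` ties the model to the rows actually passed in,
  # so a subset of the data frame is refitted on that subset.
  fit <- lm(formula, data = data)
  data$predicted <- predict(fit)
  data$stdres <- rstudent(fit)
  data$outlier <- abs(data$stdres) >= cutoff
  data
}

x <- 1:20
y <- c(1, 3, 4, 2, 5, 6, 18, 8, 10, 8, 11, 13, 14, 14, 15, 85, 17, 19, 19, 20)
data1 <- data.frame(x, y)

res <- flag_outliers(y ~ x, data1)
res[res$outlier, ]   # rows whose studentized residual reaches the cutoff
```

Because the flag is just another column, the analyst can inspect the flagged rows before deciding anything about them.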
David,
Please allow me to digress a lot here. You are one of the few (including yours truly!) who use the phrase "shallow learning curve" to indicate difficulty of learning (I assume this is what you meant). I always felt that "steep learning curve" was incorrect. If you plotted the amount of learning on the Y-axis and time on the X-axis, a steep learning curve means that one learns very quickly, but this is just the opposite of what is actually meant.

Best,
Ravi.
____________________________________________________________________

Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 5022619
email: [hidden email]

----- Original Message -----
From: David Winsemius <[hidden email]>
Date: Tuesday, February 8, 2011 10:09 pm
Subject: Re: [R] Removing Outliers Function
To: kirtau <[hidden email]>
Cc: [hidden email]

> The solution is about 3 or 4 lines of code to make the function, but
> removing outliers like this is simply statistical malpractice. Maybe
> it's a good thing that R has a shallow learning curve.
Exactly right. I use the phrase to catch the unwary's attention. I think the effect is properly placed on the y-axis.

IIRC, Ben Bolker (or was it Bert Gunter?) has also commented in the R-help or R-devel pages on this curious inversion of functional meaning.

--
David

On Feb 8, 2011, at 10:36 PM, Ravi Varadhan wrote:

> I always felt that "steep learning curve" was incorrect. If
> you plotted the amount of learning on the Y-axis and time on the
> X-axis, a steep learning curve means that one learns very quickly, but
> this is just the opposite of what is actually meant.
On 02/09/2011 03:43 PM, David Winsemius wrote:
> Exactly right. I use the phrase to catch the unwary's attention. I think
> the effect is properly placed on the y-axis.
>
> IIRC, Ben Bolker (or was it Bert Gunter?) has also commented in the
> R-help or R-devel pages on this curious inversion of functional meaning.

I certainly agree with both of you as a matter of illustration. However, I have heard the phrase (mis)used to indicate a situation in which the learner had to learn a lot quickly, or in the concrete imagery of "a steep learning curve must be hard to climb". Language is a wonderful tool, even if we sometimes break things with it.

Jim
In reply to this post by David Winsemius
CONTENTS DELETED
The author has deleted this message.

On Feb 9, 2011, at 1:25 PM, kirtau wrote:

> I have two questions,
>
> 1) if the solutions is only three or four lines of code is there anyway you
> can share those lines instead of stating that the solution is easy and
> providing no code. I prefer not to use an R-Package but have a "raw
> function".
>
> 2) Can you explain why you feel that this is "statistical malpractice"

You are proposing to systematically distort your data (apparently without even examining it) before conducting an inferential process. The old FLA GIGO ("garbage in, garbage out") is operative here. The data arose from some process in nature and the outliers are just as important as the inliers. If you want methods that are robust to "outliers" you should look at the Robust Statistics Task View:

http://cran.r-project.org/web/views/Robust.html

--
David Winsemius, MD
West Hartford, CT
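As one illustration of the robust alternative suggested above, the sketch below fits the original poster's data with `rlm()` from the MASS package (a recommended package shipped with standard R installations) next to ordinary least squares. Choosing `rlm()` with its default Huber M-estimator is an assumption made here for illustration; the Task View lists many other methods, and nothing in the thread says which one was intended.

```r
# Sketch only: a robust fit keeps every observation but downweights
# points with large residuals instead of deleting them.
library(MASS)  # provides rlm()

x <- 1:20
y <- c(1, 3, 4, 2, 5, 6, 18, 8, 10, 8, 11, 13, 14, 14, 15, 85, 17, 19, 19, 20)
data1 <- data.frame(x, y)

fit_ols <- lm(y ~ x, data = data1)    # ordinary least squares
fit_rob <- rlm(y ~ x, data = data1)   # Huber M-estimator (rlm default)

coef(fit_ols)          # pulled toward the point at x = 16
coef(fit_rob)          # much less affected by it
round(fit_rob$w, 2)    # final IWLS weights; small values mark downweighted points
```

The weights give the analyst something to examine: a point with a very small weight deserves a look at where it came from, not silent removal.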
In reply to this post by kirtau
If you insist ...
1. You are reinventing wheels (poorly).

RSiteSearch("outlier tests",restr="fun")
## RSiteSearch is a handy interface to search facilities on CRAN.
## Go to the site directly for more. Or use Google or other search engines.

will show you that an R package, outliers, already exists that does all the tests you can imagine -- and more.

2. For why this is a BAD idea, you would need to read up on the voluminous literature. Talking to a local statistician might be a better alternative. But here's a hint: AFAIK, the FDA allows no such tests in the submissions of clinical trial data because it would bias the results. (Correction welcome if this statement is wrong).

Cheers,
Bert

--
Bert Gunter
Genentech Nonclinical Biostatistics
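The bias mentioned in point 2 can be seen in a small simulation. The sketch below generates clean data with a known error standard deviation of 1, applies the "drop anything with a studentized residual beyond 3 and refit" rule until nothing more is dropped, and compares the residual standard error before and after. The sample size, seed, number of replications, and the names `trim_fit` and `one_rep` are all assumptions chosen for illustration.

```r
# Sketch only: iterative outlier deletion applied to data with NO contamination.
set.seed(1)  # arbitrary seed, for reproducibility only

# Refit and drop rows with |studentized residual| >= cutoff until none remain.
trim_fit <- function(d, cutoff = 3) {
  repeat {
    fit <- lm(y ~ x, data = d)
    keep <- abs(rstudent(fit)) < cutoff
    keep[is.na(keep)] <- TRUE          # guard against undefined residuals
    if (all(keep)) return(fit)
    d <- d[keep, ]
  }
}

# One simulated data set: true line 1 + 2x, true error SD equal to 1.
one_rep <- function(n = 30) {
  d <- data.frame(x = 1:n)
  d$y <- 1 + 2 * d$x + rnorm(n)
  c(full    = summary(lm(y ~ x, data = d))$sigma,
    trimmed = summary(trim_fit(d))$sigma)
}

sims <- replicate(500, one_rep())
rowMeans(sims)                             # average residual SE, full vs trimmed
mean(sims["trimmed", ] < sims["full", ])   # share of runs where trimming changed it
```

Whenever the rule deletes a point, the residual standard error can only shrink, so standard errors and p-values computed after trimming look better than the data justify.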
In reply to this post by David Winsemius
For your number 2, look at the outliers data set in the TeachingDemos package and run the 1st set of examples. Yes, it uses a different rule than yours, but still a common one. Think about what is happening in the example: doesn't that make you a little nervous about methods that automatically discard "outliers"?

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111

> -----Original Message-----
> From: [hidden email] [mailto:r-help-bounces@r-project.org] On Behalf Of kirtau
> Sent: Wednesday, February 09, 2011 11:06 AM
> To: [hidden email]
> Subject: Re: [R] Removing Outliers Function
>
> I have two questions,
>
> 1) if the solutions is only three or four lines of code is there anyway
> you can share those lines, without disrespecting me further
>
> 2) Can you explain why you feel that this is "statistical malpractice"
In reply to this post by kirtau
To answer part 2: You should read up on statistical distributions and when a sample size is (or isn't) large enough to produce reliable statistical parameters such as mean or variance. I suspect David was implying that your yardstick, based on studentized residual, removes valid samples.

I once wrote a simple bit of code (back when I had to do things in C rather than R :( ) that removed data points that were more than N*sigma off the current fitted data set, where N was 3 or 4. Even that is sloppy, as it doesn't take the sample size or other fit parameters into account, but it's a lot easier than your setup.

Carl

<quote>
From: kirtau <kirtau_at_live.com>
Date: Wed, 09 Feb 2011 10:06:07 -0800 (PST)

I have two questions,

1. if the solutions is only three or four lines of code is there anyway you can share those lines, without disrespecting me further
2. Can you explain why you feel that this is "statistical malpractice"
</quote>
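The remark that a fixed N*sigma rule ignores sample size can be made concrete with a little arithmetic. Assuming normally distributed errors (an assumption made here for illustration), the sketch below computes how many perfectly valid points are expected to fall beyond 3 or 4 sigma for several sample sizes; `p_beyond` is a hypothetical helper name.

```r
# Sketch only: expected number of valid points beyond N sigma, assuming
# normally distributed errors.
p_beyond <- function(N) 2 * pnorm(-N)   # two-sided tail probability

n <- c(20, 100, 1000, 10000)
data.frame(n = n,
           expected_beyond_3 = n * p_beyond(3),
           expected_beyond_4 = n * p_beyond(4))
```

With 1000 observations, about 2.7 are expected to lie beyond 3 sigma by chance alone, so a fixed cutoff that rarely fires on 20 points will routinely discard legitimate data from a large sample.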