|
I'm someone who from time to time comes to R to do applied stats for social science research. I think the R language is excellent--much better than Stata for writing complex statistical programs. I am thrilled that I can do complex stats readily in R--sem, maximum likelihood, bootstrapping, some Bayesian analysis. I wish I could make R my main statistical package, but find that a few stats that are important to my work are difficult to find or produce in R.

Before I list some examples, I recognize that people view R not as a statistical package but rather as a statistical programming environment. That said, however, it seems, from my admittedly limited perspective, that it would be fairly easy to make a few adjustments to R that would make it a lot more practical and friendly for a broader range of people--including people like me who from time to time want to do statistical programming but more often need to run canned procedures. I'm not a statistician, so I don't want to have to learn everything there is to know about common procedures I use, including how to write them from scratch. I want to be able to focus my efforts on more novel problems w/o reinventing the wheel. I would also prefer not to have to work through a couple books on R or S+ to learn how to meet common needs in R. If R were extended a bit in the direction of helping people like me, I wonder whether it would not acquire a much broader audience. Then again, these may just be the rantings of someone not sufficiently familiar w/ R or the community of stat package users--so take my comments w/ a grain of salt.

Some examples of statistics I typically use that are difficult to find and/or produce, or to produce in a usefully formatted way, in R--

Ex. 1) Wald tests of linear hypotheses after max. likelihood or even after a regression. "Wald" does not even appear in my standard R package on a search. There's no comment in the lm help or optim help about what function to use for hypothesis tests. I know that statisticians prefer likelihood ratio tests, but Wald tests are still useful and indeed crucial for first-pass analysis. After searching with Google for some time, I found several Wald functions in various contributed R packages I did not have installed. One confusion was which one would be relevant to my needs. This took some time to resolve. I concluded, perhaps on insufficient evidence, that package car's Wald test would be most helpful. To use it, however, one has to put together a matrix for the hypotheses, which can be arduous for a many-term regression or a complex hypothesis. In comparison, in Stata one simply states the hypothesis in symbolic terms. I also don't know for certain that this function in car will work or work properly w/ various kinds of output, say from lm or from optim. To be sure, I'd need to run time-consuming tests comparing it with Stata output or examine the function's code. In Stata the test is easy to find, and there's no uncertainty about where it can be run or its accuracy. Simply having a comment or "see also" in lm help or mle or optim help pointing the user to the right Wald function would be of enormous help.

Ex. 2) Getting neat output of a regression with Huberized variance matrix. I frequently have to run regressions w/ robust variances. In Stata, one simply adds the word "robust" to the end of the command or "cluster(cluster.variable)" for a cluster-robust error. In R, there are two functions, robcov and hccm. I had to run tests to figure out what the relationship is between them and between them and Stata (robcov w/o cluster gives hccm's hc0; hccm's hc1 is equivalent to Stata's 'robust' w/o cluster; etc.). A single sentence in hccm's help saying something to the effect that statisticians prefer hc3 for most types of data might save me from having to scramble through the statistical literature to try to figure out which of these I should be using. A few sentences on what the differences are between these methods would be even better. Then, there's the problem of output. Given that hc1 or hc3 are preferred for non-clustered data, I'd need to be able to get regression output of the form summary(lm) out of hccm, for any practical use. Getting this, however, would require programming my own function. Huberized t-stats for regressions are a commonplace need; an R oriented a little more toward everyday needs would not require programming them. Also, I'm not sure yet how well any of the existing functions handle missing data.

Ex. 3) I need to do bootstrapping w/ clustered data, again a common statistical need. I wasted a good deal of time reading the help contents of boot and Bootstrap, only to conclude that I'd need to write my own, probably inefficient, function to bootstrap clustered data if I were to use boot. It's odd that boot can't handle this more directly. After more digging, I learned that bootcov in package Design would handle the cluster bootstrap and save the parameters. I wouldn't have found this if I had not needed bootcov for another purpose. Again, maybe a few words in the boot help saying that 'for clustered data, you could use bootcov or program a function in boot' would be very helpful. I still don't know whether I can feed the results of bootcov back into functions in the boot package for further analysis.

My 2 bits for what they're worth,
Peter

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html |
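On Ex. 3, a cluster bootstrap can also be sketched in a few lines of base R by resampling whole clusters rather than individual rows. This is only an illustration under stated assumptions, not what boot or bootcov do internally: the helper name cluster_boot, its arguments, and the simulated data are all hypothetical, and cluster ids are assumed to have no missing values.

```r
# Hypothetical helper: resample whole clusters with replacement and
# recompute a statistic on each resampled data set (base R only).
cluster_boot <- function(data, cluster, statistic, R = 200) {
  # Row indices of each cluster, keyed by cluster id (as character).
  rows_by_id <- split(seq_len(nrow(data)), data[[cluster]], drop = TRUE)
  ids <- names(rows_by_id)
  reps <- lapply(seq_len(R), function(r) {
    draw <- sample(ids, length(ids), replace = TRUE)
    idx <- unlist(rows_by_id[draw], use.names = FALSE)
    statistic(data[idx, , drop = FALSE])
  })
  do.call(rbind, reps)  # one row per bootstrap replicate
}

# Simulated example: 20 clusters of 5 observations with a shared cluster effect.
set.seed(1)
d <- data.frame(id = rep(1:20, each = 5), x = rnorm(100))
d$y <- 1 + 2 * d$x + rep(rnorm(20), each = 5) + rnorm(100)

boot_coefs <- cluster_boot(d, "id", function(dd) coef(lm(y ~ x, data = dd)))
boot_vcov <- cov(boot_coefs)  # cluster-bootstrap covariance of the coefficients
sqrt(diag(boot_vcov))         # bootstrap standard errors
```

The resulting covariance matrix is an ordinary numeric matrix, so it can be handed to any function that accepts a user-supplied coefficient covariance.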
|
Dear Peter,
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf Of Peter Muhlberger
> Sent: Wednesday, January 04, 2006 2:43 PM
> To: rhelp
> Subject: [R] A comment about R:
>
> . . .
>
> Ex. 1) Wald tests of linear hypotheses after max. likelihood or even after a regression. "Wald" does not even appear in my standard R package on a search. There's no comment in the lm help or optim help about what function to use for hypothesis tests. I know that statisticians prefer likelihood ratio tests, but Wald tests are still useful and indeed crucial for first-pass analysis. After searching with Google for some time, I found several Wald functions in various contributed R packages I did not have installed. One confusion was which one would be relevant to my needs. This took some time to resolve. I concluded, perhaps on insufficient evidence, that package car's Wald test would be most helpful. To use it, however, one has to put together a matrix for the hypotheses, which can be arduous for a many-term regression or a complex hypothesis.
> In comparison, in Stata one simply states the hypothesis in symbolic terms. I also don't know for certain that this function in car will work or work properly w/ various kinds of output, say from lm or from optim. To be sure, I'd need to run time-consuming tests comparing it with Stata output or examine the function's code. In Stata the test is easy to find, and there's no uncertainty about where it can be run or its accuracy. Simply having a comment or "see also" in lm help or mle or optim help pointing the user to the right Wald function would be of enormous help.

The reference, I believe, is to the linear.hypothesis() function, which has methods for lm and glm objects. [To see what kinds of objects linear.hypothesis is suitable for, use the command methods(linear.hypothesis).] For lm objects, you get an F-test by default. Note that the Anova() function, also in car, can more conveniently compute Wald tests for certain kinds of hypotheses. More generally, however, I'd be interested in your suggestions for an alternative method of specifying linear hypotheses.

There is currently no method for mle objects, but adding one is a good idea, and I'll do that when I have a chance. (In the meantime, it's very easy to compute Wald tests from the coefficients and the hypothesis and coefficient-covariance matrices. Writing a small function to do so, without the bells and whistles of something like linear.hypothesis(), should not be hard.) Indeed, the ability to do this kind of thing easily is what I see as the primary advantage of working in a statistical computing environment like R -- or Stata.

> Ex. 2) Getting neat output of a regression with Huberized variance matrix.
> I frequently have to run regressions w/ robust variances. In Stata, one simply adds the word "robust" to the end of the command or "cluster(cluster.variable)" for a cluster-robust error. In R, there are two functions, robcov and hccm. I had to run tests to figure out what the relationship is between them and between them and Stata (robcov w/o cluster gives hccm's hc0; hccm's hc1 is equivalent to Stata's 'robust' w/o cluster; etc.). A single sentence in hccm's help saying something to the effect that statisticians prefer hc3 for most types of data might save me from having to scramble through the statistical literature to try to figure out which of these I should be using. A few sentences on what the differences are between these methods would be even better. Then, there's the problem of output. Given that hc1 or hc3 are preferred for non-clustered data, I'd need to be able to get regression output of the form summary(lm) out of hccm, for any practical use. Getting this, however, would require programming my own function.
> Huberized t-stats for regressions are commonplace needs, an R oriented a little toward more everyday needs would not require programming of such needs. Also, I'm not sure yet how well any of the existing functions handle missing data.

I think that we have a philosophical difference here: I don't like giving advice in documentation. An egregious extended example of this, in my opinion, is the SPSS documentation. The hccm() function uses hc3 as the default, which is an implicit recommendation, but more usefully, in my view, points to Long and Ervin's American Statistician paper on the subject, which does give advice and which is quite accessible. As well, and more generally, the car package is associated with a book (my R and S-PLUS Companion to Applied Regression), which gives advice, though, admittedly, tersely in this case.

The Anova() function with argument white=TRUE will give you F-tests corresponding to the t-tests to which you refer (though it will combine df for multiple-df terms in the model). To get the kind of summary you describe, you could use something like

    mysummary <- function(model){
        coef <- coef(model)
        se <- sqrt(diag(hccm(model)))
        t <- coef/se
        p <- 2*pt(abs(t), df=model$df.residual, lower=FALSE)
        table <- cbind(coef, se, t, p)
        rownames(table) <- names(coef)
        colnames(table) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
        table
    }

Again, it's not time-consuming to write simple functions like this for one's own use, and the ability to do so is a strength of R, in my view.

I'm not sure what you mean about handling missing data: functions like hccm(), linear.hypothesis(), and Anova() start with a model object for which missing data have already been handled.

Regards,
John |
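The small Wald-test function alluded to above might look like the following. It is a sketch rather than the code in car: wald_test and its arguments are hypothetical names, it assumes a fitted model with coef(), vcov() and df.residual() methods (an lm fit, for instance), and it refers the statistic to an F distribution.

```r
# Hypothetical helper: Wald test of the linear hypothesis L %*% beta = rhs.
# `vcov.` may be any coefficient covariance matrix (classical, robust, bootstrapped).
wald_test <- function(model, L, rhs = rep(0, nrow(L)), vcov. = vcov(model)) {
  if (!is.matrix(L)) L <- matrix(L, nrow = 1)  # allow a single restriction as a vector
  b <- coef(model)
  d <- L %*% b - rhs                           # departure from the null hypothesis
  q <- nrow(L)                                 # number of restrictions
  W <- drop(t(d) %*% solve(L %*% vcov. %*% t(L), d))
  df2 <- df.residual(model)
  c(F = W / q, df1 = q, df2 = df2,
    p = pf(W / q, q, df2, lower.tail = FALSE))
}

# Simulated example: test that the coefficients of x2 and x3 are jointly zero.
set.seed(42)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3, data = dat)

L <- rbind(c(0, 0, 1, 0),
           c(0, 0, 0, 1))
res <- wald_test(fit, L)
res
```

With the classical covariance matrix this reproduces the usual nested-model F-test; for a maximum-likelihood fit one would more commonly compare W itself with a chi-square distribution on q degrees of freedom.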
|
As John and I seem to have written our replies in parallel, I have added some more clarifying remarks in this mail:

> Note that the Anova() function, also in car, can more conveniently compute Wald tests for certain kinds of hypotheses. More generally, however, I'd be interested in your suggestions for an alternative method of specifying linear hypotheses.

My understanding was that Peter just wants to eliminate various elements from the terms(obj), which is what waldtest() in lmtest supports. If some other way of specifying nested models is required, I'd also be interested in that.

> The Anova() function with argument white=TRUE will give you F-tests corresponding to the t-tests to which you refer (though it will combine df for multiple-df terms in the model). To get the kind of summary you describe, you could use something like
>
> mysummary <- function(model){
>     coef <- coef(model)
>     se <- sqrt(diag(hccm(model)))
>     t <- coef/se
>     p <- 2*pt(abs(t), df=model$df.residual, lower=FALSE)
>     table <- cbind(coef, se, t, p)
>     rownames(table) <- names(coef)
>     colnames(table) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
>     table
> }

This is supported out of the box in coeftest() in lmtest.
Z |
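For readers who want to see what distinguishes the estimators discussed in this thread, here is a base-R sketch of HC0, HC1 and HC3 for an unweighted, full-rank lm fit. The function names hc_vcov and robust_table are hypothetical and the data are simulated; in practice hccm() in car or coeftest() in lmtest, both mentioned above, would be the things to reach for.

```r
# Hypothetical helper: heteroskedasticity-consistent covariance for an lm fit.
# Assumes an unweighted, full-rank model fitted with the default na.action.
hc_vcov <- function(model, type = c("hc3", "hc0", "hc1")) {
  type <- match.arg(type)
  X <- model.matrix(model)
  e <- residuals(model)
  n <- nrow(X)
  k <- ncol(X)
  h <- hatvalues(model)
  w <- switch(type,
              hc0 = e^2,                  # White's original estimator
              hc1 = e^2 * n / (n - k),    # degrees-of-freedom correction
              hc3 = e^2 / (1 - h)^2)      # leverage-adjusted residuals
  bread <- solve(crossprod(X))
  bread %*% crossprod(X * sqrt(w)) %*% bread  # (X'X)^-1 X' diag(w) X (X'X)^-1
}

# Coefficient table in the layout of summary(lm), with robust standard errors.
robust_table <- function(model, type = "hc3") {
  b <- coef(model)
  se <- sqrt(diag(hc_vcov(model, type)))
  tval <- b / se
  p <- 2 * pt(abs(tval), df = df.residual(model), lower.tail = FALSE)
  cbind(Estimate = b, "Std. Error" = se, "t value" = tval, "Pr(>|t|)" = p)
}

# Simulated heteroskedastic data: the error spread grows with |x1|.
set.seed(3)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + dat$x1 + rnorm(n, sd = 0.5 + abs(dat$x1))
fit <- lm(y ~ x1 + x2, data = dat)
robust_table(fit, "hc1")
robust_table(fit, "hc3")
```

HC1 differs from HC0 only by the factor n/(n - k), while HC3 inflates each squared residual according to its leverage, which matters most in small samples with influential observations.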
|
In reply to this post by John Fox
John Fox wrote:
> Dear Peter,
>
>> -----Original Message-----
>> From: [hidden email] [mailto:[hidden email]] On Behalf Of Peter Muhlberger
>> Sent: Wednesday, January 04, 2006 2:43 PM
>> To: rhelp
>> Subject: [R] A comment about R:
>
> . . .
>
>> Ex. 1) Wald tests of linear hypotheses after max. likelihood or even after a regression. "Wald" does not even appear in my standard R package on a search. There's no comment in the lm help or optim help about what function to use for hypothesis tests. I know that statisticians prefer likelihood ratio tests, but Wald tests are still useful and indeed crucial for first-pass analysis. After searching with Google for some time, I found several Wald functions in various contributed R packages I did not have installed. One confusion was which one would be relevant to my needs. This took some time to resolve. I concluded, perhaps on insufficient evidence, that package car's Wald test would be most helpful. To use it, however, one has to put together a matrix for the hypotheses, which can be arduous for a many-term regression or a complex hypothesis.
>> In comparison, in Stata one simply states the hypothesis in symbolic terms. I also don't know for certain that this function in car will work or work properly w/ various kinds of output, say from lm or from optim. To be sure, I'd need to run time-consuming tests comparing it with Stata output or examine the function's code. In Stata the test is easy to find, and there's no uncertainty about where it can be run or its accuracy. Simply having a comment or "see also" in lm help or mle or optim help pointing the user to the right Wald function would be of enormous help.

The Design package's anova.Design and contrast.Design make many Wald tests very easy. contrast() will allow you to test all kinds of hypotheses by stating which differences in predicted values you are interested in.
Frank Harrell

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University |
|
In reply to this post by Achim Zeileis
On 1/5/06 11:27 AM, "Achim Zeileis" <[hidden email]> wrote:
> As John and myself seem to have written our replies in parallel, hence I added some more clarifying remarks in this mail:
>
>> Note that the Anova() function, also in car, can more conveniently compute Wald tests for certain kinds of hypotheses. More generally, however, I'd be interested in your suggestions for an alternative method of specifying linear hypotheses.
>
> My understanding was that Peter just wants to eliminate various elements from the terms(obj) which is what waldtest() in lmtest supports. If some other way of specifying nested models is required, I'd also be interested in that.

My two most immediate problems were a) to test whether a set of coefficients were jointly zero (as Achim suggests, though the complication here is that the varcov matrix is bootstrapped), but also b) to test whether the average of a set of coefficients was equal to zero. At other points in time, I remember having had to test more complex linear hypotheses involving joint combinations of equality, non-zero, and 'averages.'

The Stata interface for linear hypothesis tests is amazingly straightforward. For example, after a regression, I could use the following to test the joint hypothesis that v1=v2 and the average (or sum) of v3 through v5 is zero and .75v6+.25v7 is zero:

    test v1=v2
    test v3+v4+v5=0, accum
    test .75*v6+.25*v7=0, accum

I don't even have to set up a matrix for my test ];-) ! The output would show not merely the joint test of all the hypotheses but the tests along the way, one for each line of commands. I vaguely remember that the hypothesis testing command after an ml run is much the same, and that cross-equation hypothesis tests simply involve adding an equation indicator to the terms. I can get huberized var-cov matrices simply by adding "robust" to the regression command. I believe there's also a command that will huberize a var-cov matrix after the fact. Subsequent hypothesis tests would be on the huberized matrix.

I won't claim to know what's good for R or the R community, but it would be nice for me and perhaps others if there were a comparably straightforward command, as in Stata, that could meet a variety of needs. I need to play w/ the commands that have been suggested to me by you guys recently, but I'm looking at a multitude of commands, none of which I suspect has the flexibility and ease of use of the above Stata commands, at least for the kind of applications I'd like. Perhaps the point of R isn't to serve as a package for a wider set of non-statisticians, but if it wishes to develop in that direction, facilities like this may be helpful.

It's interesting that Achim points out that a function John suggests is already available in R--an indication that even R experts don't have a complete handle on everything in R, even on a relatively straightforward topic like hypothesis tests.

John is no doubt right that editorializing about statistics would be out of place on an R help page. But when I have gone to statistical papers, many have been difficult to access & not very helpful for practical concerns. I'm glad to hear that Long and Ervin's paper is helpful, but there's a goodly list of papers mentioned in help. Perhaps something that would be useful is some way of highlighting on a help page which reference is most helpful for practical concerns?

Again, thanks for all the great input from everyone!
Peter |
|
Peter:
> My two most immediate problems were a) to test whether a set of coefficients were jointly zero (as Achim suggests, though the complication here is that the varcov matrix is bootstrapped), but also b) to test whether the average

This can be tested with both waldtest() and linear.hypothesis() when you've got the bootstrapped vcov estimator of your choice available. This can be conveniently plugged into both functions (either as a vcov matrix or as a function extracting the vcov matrix from the fitted model object). There is some discussion about this in the vignette accompanying the sandwich package.

> of a set of coefficients was equal to zero. At other points in time, I remember having had to test more complex linear hypotheses involving joint combinations of equality, non-zero, and 'averages.' The Stata interface for linear hypothesis tests is amazingly straightforward. For example, after a regression, I could use the following to test the joint hypothesis that v1=v2 and the average (or sum) of v3 through v5 is zero and .75v6+.25v7 is zero:
>
> test v1=v2
> test v3+v4+v5=0, accum
> test .75*v6+.25*v7=0, accum

Mmmh, it should be possible to derive the restriction matrix from this together with the terms structure...I'll think about this.

> I don't even have to set up a matrix for my test ];-) ! The output would show not merely the joint test of all the hypotheses but the tests along the way, one for each line of commands. I vaguely remember the hypothesis testing command after an ml run is much the same and cross-equation hypothesis tests simply involve adding an equation indicator to the terms.
> I can get huberized var-cov matrices simply by adding "robust" to the regression command.

Whether you find this simple or not depends on what you might want to have. Personally, I always find it very limiting if I've only got a switch to choose one or another vcov matrix when there is a multitude of vcov matrices in use in the literature. What if you wanted to use HC3 instead of the HC0 that is offered by Eviews...or HC4...or HAC...or something bootstrapped...or... In my view, this is the strength of many implementations in R: you can make programs very modular so that the user can easily extend the software or re-use it for other purposes. The price you pay for that is that it is not as easy to use as point-and-click software that offers some standard tools. Of course, both sides have advantages and disadvantages.

> I won't claim to know what's good for R or the R community, but it would be nice for me and perhaps others if there were a comparable straightforward command as in Stata that could meet a variety of needs. I need to play w/ the commands that have been suggested to me by you guys recently, but I'm looking at a multitude of commands none of which I suspect have the flexibility and ease of use of the above Stata commands, at least for the kind of applications I'd like. Perhaps the point of R isn't to serve as a package for a wider set of non-statisticians, but if it wishes to develop in that direction, facilities like this may be helpful.

The point of R is hard to determine: R itself does not wish this or that; it is an open source project which is driven by many contributors. If there are people out there who want to use R for social sciences, they are free to contribute to the project. And in this particular case, I think that there has been some activity in the last one or two years aimed at providing tools for econometrics and for quantitative methods in the social and political sciences.

However, you won't be very happy with R when you want R to be Stata. If you want Stata, use it.

> It's interesting that Achim points out that a function John suggests is already available in R--an indication that even R experts don't have a complete handle on everything in R even on a relatively straightforward topic like hypothesis tests.

In fairness to John, this functionality became available rather recently. And it's not surprising that John knows his car package better and that I'm more familiar with my lmtest package. Therefore, it's very natural to think first how you would do a certain task using your own package...in particular given that you specifically asked about car.

> John is no doubt right that editorializing about statistics would be out of place on an R help page. But when I have gone to statistical papers, many have been difficult to access & not very helpful for practical concerns.
> I'm glad to hear that Long and Ervin's paper is helpful, but there's a goodly list of papers mentioned in help.

I would think this to be an advantage, not a drawback. It's the user's responsibility to know what he/she is doing.

Best wishes,
Z |
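Deriving the restriction matrix from named terms, as contemplated above, can be approximated without any parsing by writing each hypothesis as a named vector of weights. The sketch below is an illustration under stated assumptions, not an existing function in car or lmtest: make_restrictions is a hypothetical helper, the variables v1 to v7 are simulated, every hypothesis has a right-hand side of zero, and the Wald statistic is referred to a chi-square distribution.

```r
# Hypothetical helper: build a restriction matrix from named weight vectors.
# Each argument in `...` is one hypothesis of the form sum(weights * coefs) = 0.
make_restrictions <- function(coef_names, ...) {
  specs <- list(...)
  L <- matrix(0, nrow = length(specs), ncol = length(coef_names),
              dimnames = list(names(specs), coef_names))
  for (i in seq_along(specs)) {
    w <- specs[[i]]
    stopifnot(!is.null(names(w)), all(names(w) %in% coef_names))
    L[i, names(w)] <- w
  }
  L
}

# Simulated regression with seven predictors named v1, ..., v7.
set.seed(7)
X <- as.data.frame(matrix(rnorm(200 * 7), ncol = 7,
                          dimnames = list(NULL, paste0("v", 1:7))))
X$y <- rnorm(200)
fit <- lm(y ~ ., data = X)

# The three hypotheses from the Stata example, one row each.
L <- make_restrictions(names(coef(fit)),
                       "v1 = v2"             = c(v1 = 1, v2 = -1),
                       "v3 + v4 + v5 = 0"    = c(v3 = 1, v4 = 1, v5 = 1),
                       ".75*v6 + .25*v7 = 0" = c(v6 = 0.75, v7 = 0.25))

# Joint Wald test; vcov(fit) could be replaced by a robust or bootstrapped matrix.
est <- L %*% coef(fit)
W <- drop(t(est) %*% solve(L %*% vcov(fit) %*% t(L), est))
p_value <- pchisq(W, df = nrow(L), lower.tail = FALSE)
```

Testing the rows cumulatively, as Stata's accum option does, would amount to repeating the last three lines on L[1, , drop = FALSE], then L[1:2, ], then all of L.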
|
In reply to this post by Peter Muhlberger
A few thoughts about R vs SAS:
I started learning SAS 8 years ago at IBM, I believe it was version 6.10.
I started with R 7 months ago.

Learning curve:
I think I can do everything in R after 7 months that I could do in SAS after about 4 years.

Bugs:
I suffered through several SAS version changes, 7.0, 7.1, 7.2, 8.0, 9.0 (I may have misquoted some version numbers). Every version change gave me headaches, as every version release (of an expensive commercially produced software set) had bugs which upset or crashed previously working code. I had code which ran fine under Windows 2000 and terribly under Windows XP. Most bugs I found were noted by SAS, but never fixed.
With R I have encountered very few bugs, except for an occasional crash of R, which I usually ascribe to some bug in Windows XP.

Help:
SAS help was OK. As others have mentioned, there is too much. I even had the set of printed manuals on my desk (stretching 4 feet or so), which were quite impenetrable. I had almost no support from colleagues: even within IBM the number of advanced SAS users was small.
With R this mailing list has been of great help: from almost every issue I copy some program and save it as an "R hint xxxx" file.
--> A REQUEST
I would appreciate a few more program examples in the help pages for some functions. For instance, "?Control" tells me about "if(cond) cons.expr else alt.expr"; however, an example of
  if(i==1) { print("one")
  } else if(i==2) { print("two")
  } else if(i>2) { print("bigger than two") }
at the end of that help section would have been very helpful for me a few months ago.

Functions:
Writing my own functions in SAS was by use of macros, and usually depended heavily on macro substitution. Learning SAS's macro language, especially macro substitution, was very difficult, and it took me years to be able to write complicated functions. The situation in R is quite different. Some functions I have written by dint of copying code from other people's packages, which has been very helpful.
I wanted to generate arbitrary k-values (the k-multiplier of sigma for a given alpha, beta, and N to establish confidence limits around a mean for small populations). I had a table from a years-old microfiche book giving values but wanted to generate my own. I had to find the correct integrals to approximate the k-values and then write two SAS macros which iterated to the desired level of tolerance to generate values. I would guess that there is either an R base function or a package which will do this for me (when I need to start generating AQL tables). Given the utility of these numbers, I was disappointed with SAS.

Data manipulation:
All SAS data is in 2-dimensional datasets, which was very frustrating after having used variables, arrays, and matrices in BASIC, APL, FORTRAN, C, Pascal, and LabVIEW. SAS allows you to access only 1 row of a dataset at a time, which was terribly, horribly, incomprehensibly frustrating. There were so many, many problems I had to solve where I had to work around this SAS paradigm.
In R, I can access all the elements of a matrix/dataframe at once, and I can use >2 dimensional matrices. In fact, the SAS limitations ingrained in me over 7.5 years have sometimes made me forget how easily I can do something in R, such as finding where a value in a column of a dataframe changes:
  # TRUE where the next row's value differs; the last row has no successor, so it is padded with FALSE
  DF$marker <- c(DF[1:(nrow(DF)-1),icol] != DF[2:nrow(DF),icol], FALSE)
This was hard to do in SAS...and even after years it was sometimes buggy, keeping variable values from previous iterations of a SAS program.
One very nice advantage with SAS is that after data is saved in libraries, there is a GUI showing all the libraries and the datasets inside the libraries with sizes and dates. While we can save Rdata objects in an external file, the base package doesn't seem to have the same capabilities as SAS.

Graphics:
SAS graphics were quite mediocre, and generating customized labels was cumbersome. Porting code from one Windows platform to another produced unpredictable and sometimes unworkable results.
It has been easier in R: I anticipate that I will be able to port R Windows code to *NIX and generate the same graphics.

Batch commands:
I am working on porting some of my R code to our *NIX server to generate reports and graphs on a scheduled basis. Although a few at IBM did this with SAS, I would have found doing this fairly daunting.

-Leif

-----------------------------
Leif Kirschenbaum, Ph.D.
Senior Yield Engineer
Reflectivity
[hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
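On the k-values in the post above: a minimal sketch of how such a table might be generated in base R, with no hand-written integration. It rests on an assumption the post does not confirm, namely that the k-values are one-sided normal tolerance factors (with confidence conf, at least a proportion p of the population lies below mean + k * sd). The function name k_factor and the parameter names p and conf are made up for this illustration. If the alpha/beta/N table in the post is defined differently, the formula will differ.

```r
# One-sided normal tolerance factor (assumed interpretation of "k-value").
# Standard noncentral-t result: k = t'(conf; df = n - 1, ncp) / sqrt(n),
# with noncentrality ncp = qnorm(p) * sqrt(n).
k_factor <- function(n, p = 0.90, conf = 0.95) {
  ncp <- qnorm(p) * sqrt(n)
  qt(conf, df = n - 1, ncp = ncp) / sqrt(n)
}

# Small table of factors for several sample sizes (default p and conf).
ns <- c(5, 10, 20, 50)
k_table <- data.frame(n = ns, k = sapply(ns, k_factor))
print(k_table)
```

Because qt() accepts a noncentrality parameter directly, the iteration to a tolerance that needed two SAS macros is handled inside the quantile function. The values should still be checked against a trusted printed table before being used for AQL work.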
|
Leif Kirschenbaum wrote:
> A few thoughts about R vs SAS:
> . . . snip
> Data manipulation:
> All SAS data is in 2-dimensional datasets, which was very frustrating after having used variables, arrays, and matrices in BASIC, APL, FORTRAN, C, Pascal, and LabVIEW. SAS allows you to access only 1 row of a dataset at a time which was terribly horribly incomprehensibly frustrating. There were so many many problems I had to solve where I had to work around this SAS paradigm.
> In R, I can access all the elements of a matrix/dataframe at once, and I can use >2 dimensional matrices.
> . . . snip
>
> -Leif

Leif,

Those are excellent points. I'm especially glad you mentioned data manipulation. I find that R is far ahead of SAS in this respect, although most people are shocked to hear me say that. We are doing all our data manipulation (merging, recoding, etc.) in R for pharmaceutical research. The ability to deal with lists of data frames also helps us a great deal when someone sends us a clinical trial database made of 50 SAS datasets.

Frank

--
Frank E Harrell Jr   Professor and Chair   School of Medicine
Department of Biostatistics   Vanderbilt University
Frank Harrell
Department of Biostatistics, Vanderbilt University |
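The "list of data frames" workflow Frank describes can be sketched in base R roughly as follows. The three tiny tables, their column names and the cholesterol cutoff are invented for illustration and are not taken from any real trial database. In practice each element would be one imported SAS dataset.

```r
# Hypothetical stand-ins for datasets from a trial database.
demo <- data.frame(id = 1:3, sex = c("M", "F", "F"), stringsAsFactors = FALSE)
labs <- data.frame(id = 1:3, chol = c(180, 210, 195))
ae   <- data.frame(id = 2:3, n_events = c(1L, 2L))

# Keep the whole database in one named list of data frames.
trial <- list(demo = demo, labs = labs, ae = ae)

# Apply the same check to every dataset at once.
rows_per_table <- sapply(trial, nrow)
print(rows_per_table)

# Merge all datasets on the shared key, keeping subjects missing from some.
merged <- Reduce(function(a, b) merge(a, b, by = "id", all = TRUE), trial)

# Recoding: missing event counts become 0, sex becomes a labelled factor,
# and a derived flag is added (the cutoff of 200 is arbitrary).
merged$n_events[is.na(merged$n_events)] <- 0L
merged$sex <- factor(merged$sex, levels = c("F", "M"),
                     labels = c("female", "male"))
merged$high_chol <- merged$chol >= 200
print(merged)
```

The same sapply() and Reduce() calls work unchanged whether the list holds 3 data frames or 50, which is presumably much of the appeal.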
|
In reply to this post by Achim Zeileis
On Thursday 05 January 2006 12:13, Achim Zeileis wrote:
> . . . snip
> Whether you find this simple or not depends on what you might want to
> have. Personally, I always find it very limiting if I've only got a switch
> to choose one or another vcov matrix when there is a multitude of vcov
> matrices in use in the literature. What if you would want to do HC3
> instead of the HC(0) that is offered by Eviews...or HC4...or HAC...or
> something bootstrapped...or...
> In my view, this is the strength of many implementations in R: you can make
> programs very modular so that the user can easily extend the software or
> re-use it for other purposes. The price you pay for that is that it is not
> as easy to use as a point-and-click software that offers some standard tools.
> Of course, both sides have advantages or disadvantages.
> . . . snip

Stata's ADO scripting language has the ability to access intermediate steps and local variables used by various commands. These are typically held in memory until they are purged. The difference between Stata and R is more that Stata has been streamlined into an application, the nuts and bolts hidden away, the rivet heads countersunk and polished, so that unless you really need to use them, they aren't visible. It only LOOKS like you are constrained to the readily available results of specific commands. Stata output will tend to look very much like the standard output one becomes accustomed to in undergraduate stat courses.

R assumes you _will_ want access to the nuts and bolts, and don't much care about visible rivets if the system is both accurate and functional. R is much more a programming environment in that sense. It is an important difference.

There is going to be a continuing growth in users of R as companies see cost savings in open source. They will often be people who happily dragged .xls files into SAS or SPSS for analysis and then printed the resulting reports. (Personally, I became a strong believer in statistical analysis packages after receiving a _negative_ variance in Excel once upon a time. I don't see how that could even be possible, but apparently it was a known issue. Some ad hoc experimentation then demonstrated that no spreadsheet was all that precise.)

One place where R and Stata have a great deal in common is in the manner in which graphs and charts are formatted. Stata is perhaps slightly less byzantine, but only slightly. Both systems emphasize flexibility and quality graphics at the price of learning to know what you are doing. That said, you can still do a lot more with R in some areas than Stata, especially in spatial graphics and analysis.

JD
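On the negative variance mentioned above: one plausible mechanism is the one-pass "sum of squares minus squared sum" formula, which loses nearly all its significant digits when the mean is large relative to the spread. This is an assumption. The post does not say which formula the spreadsheet used. The sketch below compares that formula with a two-pass calculation in R. The function names and the data are made up for illustration.

```r
# One-pass "textbook" variance: (sum(x^2) - sum(x)^2 / n) / (n - 1).
# The sums are accumulated in an explicit loop so that each step is
# ordinary double-precision arithmetic.
naive_var <- function(x) {
  n <- length(x)
  s1 <- 0
  s2 <- 0
  for (xi in x) {
    s1 <- s1 + xi
    s2 <- s2 + xi * xi
  }
  (s2 - s1 * s1 / n) / (n - 1)
}

# Two-pass variance: subtract the mean first, then square the deviations.
two_pass_var <- function(x) {
  m <- mean(x)
  sum((x - m)^2) / (length(x) - 1)
}

# Illustrative data: small spread around a very large mean.
# The exact sample variance of these four values is 30.
x <- 1e9 + c(4, 7, 13, 16)

naive <- naive_var(x)
exact <- two_pass_var(x)
print(c(naive = naive, two_pass = exact, builtin = var(x)))
```

The two large sums agree in almost all of their leading digits, so their difference is dominated by rounding error and can come out far from 30, or even below zero. The two-pass form and R's built-in var() avoid this by working with deviations from the mean.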
|
In reply to this post by Frank Harrell
Hi all,
UCLA ATS Statistical Consulting Group has just launched a very interesting paper comparing SPSS, SAS & Stata as statistical packages:
"Perhaps the most notable exception to this discussion is R"
http://www.ats.ucla.edu/stat/technicalreports/
It's interesting reading for this thread.

Best regards
Naji
|
Naji <[hidden email]> writes:
> Hi all,
>
> UCLA ATS Statistical Consulting Group has just launched a very interesting
> paper comparing SPSS, SAS & Stata as statistical packages:
> "Perhaps the most notable exception to this discussion is R"
> http://www.ats.ucla.edu/stat/technicalreports/
> It's interesting reading for this thread.

In fact, if you trace the thread back to its root, this is what started it...

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])                  FAX: (+45) 35327907
