|
I would like to get some idea of which R-packages are popular, and what R is used for in general. Are there any statistics available on which R packages are downloaded often, or is there something like a package-survey? Something similar to http://popcon.debian.org/ maybe? Any tips are welcome!
|
|
This function will show which other packages depend on a particular
package:

> dep <- function(pkg, AP = available.packages()) {
+     pkg <- paste("\\b", pkg, "\\b", sep = "")
+     cat("Depends:", rownames(AP)[grep(pkg, AP[, "Depends"])], "\n")
+     cat("Suggests:", rownames(AP)[grep(pkg, AP[, "Suggests"])], "\n")
+ }
> dep("zoo")
Depends: AER BootPR FinTS PerformanceAnalytics RBloomberg StreamMetabolism TSfame TShistQuote VhayuR dyn dynlm fda fxregime lmtest meboot party quantmod sandwich sde strucchange tripEstimation tseries xts
Suggests: TSMySQL TSPostgreSQL TSSQLite TSdbi TSodbc UsingR Zelig gsubfn playwith pscl tframePlus

On Sat, Mar 7, 2009 at 2:57 PM, Jeroen Ooms <[hidden email]> wrote:
> I would like to get some idea of which R-packages are popular, and what R is
> used for in general. Are there any statistics available on which R packages
> are downloaded often, or is there something like a package-survey? Something
> similar to http://popcon.debian.org/ maybe? Any tips are welcome!
>
> -----
> Jeroen Ooms * Dept. of Methodology and Statistics * Utrecht University
>
> Visit http://www.jeroenooms.com to explore some of my current projects.
>
> --
> View this message in context: http://www.nabble.com/popular-R-packages-tp22391260p22391260.html
> Sent from the R help mailing list archive at Nabble.com.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
|
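A word-boundary regex such as the one in dep() may also match longer names that merely contain the package name followed by a dot. As a minimal sketch, assuming a matrix shaped like the result of available.packages(), the lookup can be done by exact name instead. The helper names parse_field and rev_deps, and the toy matrix with its entries (pkgA, pkgB, zoo.extra), are made up for illustration and are not from the thread.

```r
# Split one DESCRIPTION-style dependency field into bare package names.
parse_field <- function(x) {
  if (is.na(x) || !nzchar(x)) return(character(0))
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  parts <- sub("\\(.*\\)", "", parts)  # drop version constraints like "(>= 2.10)"
  parts <- trimws(parts)
  parts[nzchar(parts)]
}

# Names of packages in `db` whose `field` lists `pkg` exactly.
# `db` is assumed to look like the matrix returned by available.packages().
rev_deps <- function(pkg, db, field = "Depends") {
  hits <- vapply(db[, field], function(x) pkg %in% parse_field(x), logical(1))
  sort(rownames(db)[hits])
}

# Toy database with hypothetical entries, so the example runs offline.
toy <- matrix(
  c("R (>= 2.10), zoo", NA,
    "zoo.extra",        "zoo",
    NA,                 NA),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("pkgA", "pkgB", "zoo"), c("Depends", "Suggests"))
)

depends_on_zoo  <- rev_deps("zoo", toy, "Depends")
suggests_zoo    <- rev_deps("zoo", toy, "Suggests")
```

With a live connection, passing available.packages() as db should give a listing comparable to the one above, although the package set on CRAN has changed since 2009.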
|
In reply to this post by jeroen00ms
When the question "How many R users are there?" arises, the consensus
seems to be that there is no valid method to address it. The thread
"R-business case" from 2004 can be found here:
https://stat.ethz.ch/pipermail/r-help/2004-March/047606.html

I did not see any material revision to that conclusion during the recent discussion of the New York Times article on the R challenge to SAS.

Gmane tracks the volume of r-help activity (I realize this is not what you asked for):
http://www.gmane.org/info.php?group=gmane.comp.lang.r.general

The distribution of R packages is, well ... distributed:
http://cran.r-project.org/mirrors.html

At least one of the participants in the 2004 thread suggested that it would be a "good thing" to track the numbers of downloads by package. I have not heard of any such system being installed in the mirror software, and I see nothing that suggests data gathering in the CRAN Mirror How-to:
http://cran.r-project.org/mirror-howto.html

On the other hand, I am not part of R-core, so you must await a more authoritative opinion, since a 5-year-old thread and amateur speculation are not much of a leg to stand on.

There are lexicographic packages for R. One approach to a de novo analysis would be to do some sort of natural language analysis of the r-help archives, counting either occurrences of package names that are not English words, or close proximity of the words "library" or "package" to package names that overlap the 30,000 common English words. That would have the danger of inflating counts of the packages with the least adequate documentation or a paucity of good worked examples, but there are many readers of this list who suspect that new users don't look at the documentation, so who knows?

--
David Winsemius

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
|
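The text-analysis idea above can be sketched in a much simplified form: instead of a full natural language analysis, count explicit library() or require() calls in message bodies. This assumes the archive has already been read into a character vector, one element per message. The function name count_library_calls and the sample messages are invented for illustration.

```r
# Count library()/require() calls per package across a set of messages.
# Every occurrence is counted, so a message loading a package twice adds 2.
count_library_calls <- function(messages, pkgs) {
  pattern <- "\\b(?:library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?\\s*\\)"
  calls <- unlist(regmatches(messages, gregexpr(pattern, messages, perl = TRUE)))
  found <- sub(pattern, "\\1", calls, perl = TRUE)  # keep only the package name
  tab <- table(factor(found, levels = pkgs))        # names outside `pkgs` are dropped
  sort(setNames(as.integer(tab), names(tab)), decreasing = TRUE)
}

# Invented sample messages.
msgs <- c(
  "library(lme4)\nfit <- lmer(y ~ x + (1 | g), data = d)",
  "require(nlme); library('lme4')",
  "I loaded library(zoo) and library( lme4 ) but got an error",
  "no packages mentioned here"
)
mention_counts <- count_library_calls(msgs, c("lme4", "nlme", "fda"))
```

Counting only explicit calls sidesteps the problem of package names that are also common English words, at the cost of missing mentions in running prose.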
|
I don't think "At least one of the participants in the 2004 thread
suggested that it would be a "good thing" to track the numbers of downloads by package." is reasonable because I download R packages for 2 home computers (laptop & desktop) and 2 at work (1 Linux & 1 Mac). There must be many such cases…

Tom

--
Thomas E Adams
National Weather Service
Ohio River Forecast Center
1901 South State Route 134
Wilmington, OH 45177

EMAIL: [hidden email]

VOICE: 937-383-0528
FAX: 937-383-0033
|
|
I agree with Thomas, over the years I have installed R on at least 5
computers.

BTW: does anyone know how the website statistics of r-project are being analyzed? Since I can't see any "google analytics" or other tracking code in the main website, I am guessing someone might be running some log-file analyzer, but I'd rather hear that than assume.

On Sun, Mar 8, 2009 at 12:45 AM, Thomas Adams <[hidden email]> wrote:
> I don't think "At least one of the participants in the 2004 thread
> suggested that it would be a "good thing" to track the numbers of downloads
> by package." is reasonable because I download R packages for 2 home
> computers (laptop & desktop) and 2 at work (1 Linux & 1 Mac). There must be
> many such cases
>
> Tom

--
My contact information:
Tal Galili
Phone number: 972-50-3373767
FaceBook: Tal Galili
My Blogs: www.talgalili.com www.biostatistics.co.il
|
|
In reply to this post by Thomas Adams
Quite so. It certainly is the case that Dirk Eddelbuettel suggested it
would be very desirable, and I think Dirk's track record speaks for itself. I never said (and I am sure Dirk never intended) that one could take the raw numbers as a basis for blandly asserting that <nnnn> copies of <ttt> package are currently installed. When I update packages, the automated process takes hold and I go for a cup of coffee. At the moment I have only two computers with R installed, and I have not updated any binary packages on Windoze in over a year.

Nonetheless, I do think the relative numbers of package downloads might be interpretable, or at the very least, the basis for discussions over beer.

--
David Winsemius

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
|
|
In reply to this post by Tal Galili
> I agree with Thomas, over the years I have installed R on at least 5
> computers.

I don't see why per-machine statistics would not be useful. If you have installed a package on five machines, you probably use it a lot, and it is more important to you than packages that you installed only once.

Furthermore, I don't think the distribution of packages has to be problematic. I guess downloads are only slightly related to the specific mirror, so download statistics from one of the popular mirrors would do for me. Of course these statistics are never perfect, but they could be informative...
|
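As a rough sketch of what per-mirror download statistics could look like, the function below tallies package downloads from a vector of requested file paths. It assumes the paths have already been extracted from one mirror's web-server log, and that package files follow the usual name_version.tar.gz / .zip / .tgz naming. The function name count_downloads and the sample paths are illustrative only.

```r
# Tally downloads per package from requested file paths.
count_downloads <- function(paths) {
  files <- basename(paths)
  # Keep only files that look like <package>_<version>.<ext>
  is_pkg <- grepl("^[A-Za-z][A-Za-z0-9.]*_[0-9][0-9.-]*\\.(tar\\.gz|zip|tgz)$", files)
  pkgs <- sub("_.*$", "", files[is_pkg])  # package name is the part before "_"
  tab <- table(pkgs)
  sort(setNames(as.integer(tab), names(tab)), decreasing = TRUE)
}

# Invented sample of request paths.
requests <- c(
  "/src/contrib/zoo_1.5-5.tar.gz",
  "/bin/windows/contrib/2.8/zoo_1.5-5.zip",
  "/src/contrib/lme4_0.999375-28.tar.gz",
  "/src/contrib/PACKAGES.gz",
  "/index.html"
)
download_counts <- count_downloads(requests)
```

Source and binary downloads of the same package are pooled here; whether that is appropriate depends on the question being asked.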
|
In reply to this post by Tal Galili
i have kept r installed on more than ten computers during the past few
years, some of them running win + more than one linux distro, all of them having r, most often installed from a separate download.

i know of many cases where students download r for the purpose of a course in statistics -- often an introductory course for students who otherwise have little to do with stats. some of them do it more than once during the semester, and many of them never use r again.

taking into account that basic statistics courses are taught to most university students and that r is surely the most popular free statistical computing environment, download-based usage estimates may be a bit optimistic, unless 'usage' is taken to include 'learn-pass-forget'.

vQ
|
|
I just did RSiteSearch("library(xxx)") with xxx = the names of 6
packages familiar to me, with the following numbers of hits:

hits  package
 169  lme4
 165  nlme
   6  fda
   4  maps
   2  FinTS
   2  DierckxSpline

Software could be written to (1) extract the names of current packages from CRAN and then (2) perform queries similar to this on all such packages and summarize the results. I don't have the time now to write code for this, but I've written similar code before for step (1); it can be found in "scripts/TsayFiles.R" in the "FinTS" package on CRAN. For step (2), Sundar Dorai-Raj wrote code that is included in the preliminary "RSiteSearch" package available from R-Forge via 'install.packages("RSiteSearch", repos="http://r-forge.r-project.org")'.

Code to do this could probably be written (a) in a matter of seconds by many of those in the R Core team or (b) in a matter of hours by virtually any reader of this list using the examples I just cited. And it could provide numbers without a need to convince others to keep download statistics and make them available later.

Hope this helps.
Spencer Graves
|
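The two steps described above can be outlined as follows. This is only a sketch: the function that turns a query into a hit count is left as a caller-supplied argument (count_hits, a hypothetical name), because how to obtain counts programmatically is not shown in the thread. The stand-in counter below simply replays the six figures quoted in the post.

```r
# Rank packages by the number of search hits for "library(<pkg>)".
# `count_hits` is any function mapping a query string to a single number.
rank_packages <- function(pkgs, count_hits) {
  hits <- vapply(
    pkgs,
    function(p) as.numeric(count_hits(sprintf("library(%s)", p))),
    numeric(1)
  )
  out <- data.frame(package = pkgs, hits = unname(hits), stringsAsFactors = FALSE)
  out <- out[order(-out$hits, out$package), ]  # most hits first, ties by name
  rownames(out) <- NULL
  out
}

# Step (1) against a live CRAN mirror would be something like:
#   pkgs <- rownames(available.packages())
# Here the six packages and hit counts quoted in the post are used instead.
quoted <- c(lme4 = 169, nlme = 165, fda = 6, maps = 4, FinTS = 2, DierckxSpline = 2)
replay_counter <- function(query) {
  pkg <- sub("^library\\((.*)\\)$", "\\1", query)
  quoted[[pkg]]
}
ranking <- rank_packages(names(quoted), replay_counter)
```

Keeping the counter separate means the ranking logic can be checked offline and the search backend swapped without touching it.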
|
In reply to this post by jeroen00ms
Hi Spencer,

XLSolutions is currently analyzing r-help archived questions to rank packages for the upcoming R-PLUS 3.3 Professional version and we will be happy to share the outcome with interested parties. Please email [hidden email]

Regards -

Sue Turner
Senior Account Manager
XLSolutions Corporation
North American Division
1700 7th Ave Suite 2100
Seattle, WA 98101
Phone: 206-686-1578
Email: [hidden email]
web: www.xlsolutions-corp.com

--- On Sat, 3/7/09, Spencer Graves <[hidden email]> wrote:
> From: Spencer Graves <[hidden email]>
> Subject: Re: [R] popular R packages
> To: "Wacek Kusnierczyk" <[hidden email]>
> Cc: [hidden email], "Jeroen Ooms" <[hidden email]>, "Thomas Adams" <[hidden email]>
> Date: Saturday, March 7, 2009, 5:22 PM
>
> I just did RSiteSearch("library(xxx)") with xxx = the names of 6
> packages familiar to me, with the following numbers of hits:
>
> hits  package
>  169  lme4
>  165  nlme
>    6  fda
>    4  maps
>    2  FinTS
>    2  DierckxSpline
|
|
In reply to this post by jeroen00ms
Hi all,
I'm kind of amazed at the answers suggested for the relatively simple question, "How many times has each R package been downloaded?". Some have veered off in another direction, like working out how many packages a package depends upon, or whether someone downloads more than one copy. The response about ranking packages by the number of questions asked about them may be interesting, but may not relate very well at all to popularity in terms of downloads. If people were constantly asking questions about one of the packages I maintain, I would be working on the help pages to improve them, not basking in the inferred glory of having a popular package.

There is one way that the download count would be very useful for package maintainers, if no one else. Take as an example the package concord, that has not been maintained for a year or more since the content was merged into the irr package. If I knew that no one downloaded concord any more, I would surely petition those in charge of the archive to remove it or at least transfer it to the package museum. No point in having ever more packages on CRAN if they are never downloaded.

Jim
|
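If per-package download counts were available, the maintainer's check described above would be a small set operation. The sketch below assumes a named vector of counts over some period. The function name never_downloaded and the counts are made up; only the package names concord and irr come from the post, and zoo is added as a third example.

```r
# Packages in `all_pkgs` with no recorded downloads.
# `download_counts` is a named numeric vector; missing names count as zero.
never_downloaded <- function(all_pkgs, download_counts) {
  seen <- names(download_counts)[download_counts > 0]
  sort(setdiff(all_pkgs, seen))
}

all_pkgs <- c("concord", "irr", "zoo")
counts   <- c(irr = 12, zoo = 40)   # hypothetical figures, not real statistics
candidates <- never_downloaded(all_pkgs, counts)
```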
|
In reply to this post by David Winsemius
On Sat, 07 Mar 2009 18:04:24 -0500, David Winsemius wrote:

[ Snip ... ]

> Nonetheless, I do think the relative numbers of package downloads might
> be interpretable, or at the very least, the basis for discussions over
> beer.

*Anything* might be the basis for discussions over beer (obvious corollary to Thermogoddamics' second principle...).

More seriously: I don't think relative numbers of package downloads can be interpreted in any reasonable way, because reasons for package download have a very wide range from curiosity ("what's this?"), fun (think "fortunes"...), to vital need (think lme4 if/when a consensus on denominator DFs can be reached :-)...). What can you infer in good faith from such a mess?

Emmanuel Charpentier
|
|
> More seriously: I don't think relative numbers of package downloads can
> be interpreted in any reasonable way, because reasons for package
> download have a very wide range from curiosity ("what's this?"), fun
> (think "fortunes"...), to vital need (think lme4 if/when a consensus on
> denominator DFs can be reached :-)...). What can you infer in good faith
> from such a mess?

So when we have messy data with measurement error, we should just give up? Doesn't sound very statistical! ;)

Hadley

--
http://had.co.nz/
|
|
On Sun, Mar 8, 2009 at 10:49 AM, hadley wickham <[hidden email]> wrote:
>> More seriously: I don't think relative numbers of package downloads can
>> be interpreted in any reasonable way, because reasons for package
>> download have a very wide range from curiosity ("what's this?"), fun
>> (think "fortunes"...), to vital need (think lme4 if/when a consensus on
>> denominator DFs can be reached :-)...). What can you infer in good faith
>> from such a mess?
>
> So when we have messy data with measurement error, we should just give
> up? Doesn't sound very statistical! ;)

Also, I would think that the rankings would be meaningful, since the factors that cause the absolute numbers to be off would affect all packages equally.
|
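The ranking argument rests on the assumption that the distortion is the same for every package. A toy calculation with entirely hypothetical numbers shows both sides: a common inflation factor leaves the ranking intact, while package-specific factors (the situation the earlier posts worry about) can reorder it.

```r
# Hypothetical "true" usage for three made-up packages.
true_use <- c(pkgA = 500, pkgB = 120, pkgC = 40)

# Case 1: every package's downloads are inflated by the same factor.
common_factor <- 3.5
same_ranking <- identical(rank(true_use * common_factor), rank(true_use))

# Case 2: one package is downloaded far more often per actual user
# (for example out of curiosity), so the factors differ by package.
uneven_factor <- c(pkgA = 1, pkgB = 1, pkgC = 20)
changed_ranking <- !identical(rank(true_use * uneven_factor), rank(true_use))
```

Here same_ranking and changed_ranking both evaluate to TRUE, so whether download rankings are informative depends on how uniform the distortion really is.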
|
In reply to this post by hadley wickham
On 08/03/2009 10:49 AM, hadley wickham wrote:
>> More seriously: I don't think relative numbers of package downloads can
>> be interpreted in any reasonable way, because reasons for package
>> download have a very wide range from curiosity ("what's this?"), fun
>> (think "fortunes"...), to vital need (think lme4 if/when a consensus on
>> denominator DFs can be reached :-)...). What can you infer in good faith
>> from such a mess?
>
> So when we have messy data with measurement error, we should just give
> up? Doesn't sound very statistical! ;)

I think the situation is worse than messy. If a client comes in with data that doesn't address the question they're interested in, I think they are better served to be told that, than to be given an answer that is not actually valid. They should also be told how to design a study that actually does address their question.

You (and others) have mentioned Google Analytics as a possible way to address the quality of data; that's helpful. But analyzing bad data will just give bad conclusions.

Duncan Murdoch
|
|
> I think the situation is worse than messy. If a client comes in with data
> that doesn't address the question they're interested in, I think they are
> better served to be told that, than to be given an answer that is not
> actually valid. They should also be told how to design a study that
> actually does address their question.
>
> You (and others) have mentioned Google Analytics as a possible way to
> address the quality of data; that's helpful. But analyzing bad data will
> just give bad conclusions.

As long as we say 'package Foo is the most downloaded package on CRAN', and not 'package Foo is the most used package for R', we can leave it to the user to decide if the latter conclusion follows from the former. In the absence of actual usage data I would think it a good approximation. Not that I would risk my life on it.

Pop music charts are now based on download counts, but I wouldn't believe they represent the songs that are listened to the most times. Nor would I go so far as to believe they represent the quality of the songs...

Should R have a 'Would you like to tell CRAN every time you do library(foo) so we can do usage counts (no personal data is transmitted blah blah)?' prompt? I don't think so....

Barry
|
|
In reply to this post by Duncan Murdoch
On 08-Mar-09 15:14:03, Duncan Murdoch wrote:
> On 08/03/2009 10:49 AM, hadley wickham wrote:
>>> More seriously : I don't think relative numbers of package downloads
>>> can be interpreted in any reasonable way, because reasons for
>>> package download have a very wide range from curiosity ("what's
>>> this ?"), fun (think "fortunes"...), to vital need (think lme4
>>> if/when a consensus on denominator DFs can be reached :-)...).
>>> What can you infer in good faith from such a mess ?
>>
>> So when we have messy data with measurement error, we should just
>> give up? Doesn't sound very statistical! ;)
>
> I think the situation is worse than messy. If a client comes in with
> data that doesn't address the question they're interested in, I think
> they are better served to be told that, than to be given an answer that
> is not actually valid. They should also be told how to design a study
> that actually does address their question.
>
> You (and others) have mentioned Google Analytics as a possible way to
> address the quality of data; that's helpful. But analyzing bad data
> will just give bad conclusions.
> Duncan Murdoch

The population of R users (which we would need to sample in order to obtain good data) is probably more elusive than a fish population in the ocean -- only partially visible at best, and with an unknown proportion invisible.

At least in Fisheries research, there are long established capture techniques (from trawling to netting to electro-fishing to ... ) which can be deployed, for research purposes, in such a way as to potentially reach all members of a target population, with at least a moderately good approximation to random sampling. What have we for R?

Come to think of it, electro-fishing, ...

Suppose R were released with 2 types of cookie embedded in base R. Each type is randomly configured, when R is first run, to be Active or Inactive (probability of activation to be decided at the design stage ... ). Type 1, if active, on a certain date generates an event which brings it to the notice of R-Core (e.g. by clandestine email or by inducing a bug report). Type 2 acts similarly on a later date. If Type 2 acts, it carries with it information as to whether there was a Type 1 action along with whether, apparently, the Type 1 action "succeeded".

We then have, in effect, an analogue of the Mark-Recapture technique of population estimation (along with the usual questions about equal catchability and so forth).

However, since this sort of thing (which I am not proposing seriously, only for the sake of argument) is undoubtedly unethical (and would do R's reputation no good if it came to light), I tentatively conclude that the population of R users is likely to remain as elusive as ever.

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[hidden email]>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Mar-09 Time: 16:11:44
------------------------------ XFMail ------------------------------
|
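For readers unfamiliar with the Mark-Recapture idea mentioned above, the simplest version (the Lincoln-Petersen estimator) can be sketched in a few lines of R. This is only an illustration of the arithmetic, not something proposed in the thread: the function names, the population size N and the activation probabilities p1 and p2 are all invented for the example, and the two cookie types are assumed to activate independently.

```r
# Lincoln-Petersen estimate of population size from two "capture" occasions.
#   n1: installations reporting on the first date (Type 1 events)
#   n2: installations reporting on the second date (Type 2 events)
#   m : second-date reports that also carry a first-date mark
lincoln_petersen <- function(n1, n2, m) {
  stopifnot(m > 0, m <= min(n1, n2))
  n1 * n2 / m
}

# Toy simulation; N, p1 and p2 are hypothetical values for illustration only.
simulate_estimate <- function(N = 10000, p1 = 0.05, p2 = 0.05) {
  type1 <- runif(N) < p1   # is the Type 1 cookie active on this installation?
  type2 <- runif(N) < p2   # Type 2, assumed independent of Type 1
  lincoln_petersen(sum(type1), sum(type2), sum(type1 & type2))
}

set.seed(1)
simulate_estimate()   # an estimate of N; it varies with the seed
```

With small activation probabilities the number of doubly marked installations m is small, so the estimate is noisy; the "equal catchability" caveat corresponds to the independence assumption in the simulation.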
|
Hi Ted,
Thinking along the lines you suggest, another idea came to mind:

The next time a major release is made (there is one scheduled quite soon, actually), the core team could add a "survey" to the download page of the R base package, asking just one question: "please click here if this is the first computer you are downloading this package for".

This, combined with the fact that when serving a user we can obtain their IP address (which gives geographic information), could give a pretty nice rough estimate of how many "major release downloaders" the R community has.

Tal

--
----------------------------------------------
My contact information:
Tal Galili
Phone number: 972-50-3373767
FaceBook: Tal Galili
My Blogs: www.talgalili.com www.biostatistics.co.il
|
|
In reply to this post by Ted.Harding-2
Is this another discussion of what data might be collected and analyzed, and what could and could not be said if we only had such data? Has anyone but me produced any actual data? If so, I missed it.

Hadley mentioned the 'fortunes' package. My earlier methodology, "RSiteSearch('library(fortunes)')", produced 40 hits for 'fortunes', compared to 169 for 'lme4' and 2 for 'DierckxSpline'.

With anything like this, it would be wise to approach the problem from many different perspectives, recognizing that the strengths of one approach can help improve our understanding of what other analyses say about the question at hand.

Happy Sunday.
Spencer Graves
|
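The search-hit comparison described above can be written down as a small R sketch. The three counts are the ones reported in the message (and will have changed since); the variable names are invented for the example. Note that RSiteSearch() opens the results in a browser rather than returning a count, so the numbers have to be read off the results page by hand.

```r
# Hit counts reported in the message above for RSiteSearch("library(<pkg>)").
hits <- c(fortunes = 40, lme4 = 169, DierckxSpline = 2)

# Build the query strings used to obtain them.
queries <- sprintf("library(%s)", names(hits))

# RSiteSearch() opens a browser page per query; only run it interactively.
if (interactive()) {
  for (q in queries) utils::RSiteSearch(q)
}

# Rank the packages by hits and express each as a share of the total.
ranked <- sort(hits, decreasing = TRUE)
round(ranked / sum(ranked), 2)
```

Like download counts, these shares measure how often a package is mentioned in the searched archives, not how often it is used.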
|
In reply to this post by Barry Rowlingson
On 08/03/2009 12:08 PM, Barry Rowlingson wrote:
>> I think the situation is worse than messy. If a client comes in with data
>> that doesn't address the question they're interested in, I think they are
>> better served to be told that, than to be given an answer that is not
>> actually valid. They should also be told how to design a study that
>> actually does address their question.
>>
>> You (and others) have mentioned Google Analytics as a possible way to
>> address the quality of data; that's helpful. But analyzing bad data will
>> just give bad conclusions.
>
> As long as we say 'package Foo is the most downloaded package on
> CRAN', and not 'package Foo is the most used package for R', we can
> leave it to the user to decide if the latter conclusion follows from
> the former.

But we don't even have that data, since CRAN is distributed across lots of mirrors.

Duncan Murdoch

> In the absence of actual usage data I would think it a
> good approximation. Not that I would risk my life on it.
>
> Pop music charts are now based on download counts, but I wouldn't
> believe they represent the songs that are listened to the most times.
> Nor would I go so far as to believe they represent the quality of the
> songs...
>
> Should R have a 'Would you like to tell CRAN every time you do
> library(foo) so we can do usage counts (no personal data is
> transmitted blah blah) ?'? I don't think so....
>
> Barry
|
|
