|
I vote to 'fortunize' Doug Bates on

Hierarchical data sets: which software to use?

"The widespread use of spreadsheets or SPSS data sets or SAS data sets
which encourage the "single table with a gargantuan number of columns,
most of which are missing data in most cases" approach to organization
of longitudinal data is regrettable."

http://n4.nabble.com/Hierarchical-data-sets-which-software-to-use-td1458477.html#a1470430

--
Peter Ehlers
University of Calgary

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. |
|
Peter Ehlers wrote:
> I vote to 'fortunize' Doug Bates on
>
> Hierarchical data sets: which software to use?
>
> "The widespread use of spreadsheets or SPSS data sets or SAS data sets
> which encourage the "single table with a gargantuan number of columns,
> most of which are missing data in most cases" approach to organization
> of longitudinal data is regrettable."
>
> http://n4.nabble.com/Hierarchical-data-sets-which-software-to-use-td1458477.html#a1470430

Hmm, well, it's not like "long format" data frames (which I actually
think are more common in connection with SAS's PROC MIXED) are much
better. Those tend to replicate base data unnecessarily - "as if rats
change sex with millisecond resolution".

The correct data structure would be a relational database with multiple
levels of tables, but, to my knowledge, no statistical software,
including R, is prepared to deal with data in that form.

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])                  FAX: (+45) 35327907 |
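The "multiple levels of tables" structure described above can be sketched in a few lines. This is only an illustration, not anything proposed in the thread: it uses Python's standard `sqlite3` module as a stand-in for a DBMS, and the table names (`rat`, `weighing`), the view name (`weighing_long`) and all values are made up. The point it tries to show is that a subject-level attribute such as sex is stored once, while the long format is produced on demand by a view.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Subject level: one row per rat, so sex is stored exactly once.
    CREATE TABLE rat (
        rat_id INTEGER PRIMARY KEY,
        sex    TEXT NOT NULL
    );
    -- Observation level: one row per measurement occasion.
    CREATE TABLE weighing (
        rat_id INTEGER NOT NULL REFERENCES rat(rat_id),
        day    INTEGER NOT NULL,
        weight REAL    NOT NULL,
        PRIMARY KEY (rat_id, day)
    );
    -- The long format exists only as a view over the two tables.
    CREATE VIEW weighing_long AS
        SELECT w.rat_id, r.sex, w.day, w.weight
        FROM weighing AS w
        JOIN rat AS r ON r.rat_id = w.rat_id;
""")

# Hypothetical data, invented for this sketch.
con.executemany("INSERT INTO rat VALUES (?, ?)", [(1, "F"), (2, "M")])
con.executemany(
    "INSERT INTO weighing VALUES (?, ?, ?)",
    [(1, 0, 151.0), (1, 7, 199.0), (2, 0, 160.0), (2, 7, 215.0), (2, 14, 260.0)],
)

# Sex is repeated on every row of the result, but only in the query output.
long_rows = con.execute(
    "SELECT rat_id, sex, day, weight FROM weighing_long ORDER BY rat_id, day"
).fetchall()
for row in long_rows:
    print(row)
```

The replication of base data that the long format implies still happens, but only in the rows handed to the analysis, not in what is stored and maintained.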
|
In reply to this post by Peter Ehlers
On Fri, 5 Feb 2010, Peter Ehlers wrote:
> I vote to 'fortunize' Doug Bates on
>
> Hierarchical data sets: which software to use?
>
> "The widespread use of spreadsheets or SPSS data sets or SAS data sets
> which encourage the "single table with a gargantuan number of columns,
> most of which are missing data in most cases" approach to organization
> of longitudinal data is regrettable."
>
> http://n4.nabble.com/Hierarchical-data-sets-which-software-to-use-td1458477.html#a1470430

Thanks, added to the devel-version on R-Forge.
Z |
|
In reply to this post by Peter Dalgaard
Note: this post was motivated more by the "hierarchical data" subject
than by Douglas Bates' aside, but it might be of interest to its
respondents.

On Friday, 5 February 2010 at 21:56 +0100, Peter Dalgaard wrote:
> Peter Ehlers wrote:
> > I vote to 'fortunize' Doug Bates on
> >
> > Hierarchical data sets: which software to use?
> >
> > "The widespread use of spreadsheets or SPSS data sets or SAS data sets
> > which encourage the "single table with a gargantuan number of columns,
> > most of which are missing data in most cases" approach to organization
> > of longitudinal data is regrettable."
> >
> > http://n4.nabble.com/Hierarchical-data-sets-which-software-to-use-td1458477.html#a1470430
>
> Hmm, well, it's not like "long format" data frames (which I actually
> think are more common in connection with SAS's PROC MIXED) are much
> better. Those tend to replicate base data unnecessarily - "as if rats
> change sex with millisecond resolution".

[ Note to Achim Zeileis: the "rats changing sex with millisecond
resolution" quote is well worth a nomination to "fortune" fame; it
seems it is not one already... ]

> The correct data structure
> would be a relational database with multiple levels of tables, but, to
> my knowledge, no statistical software, including R, is prepared to deal
> with data in that form.

Well, I can think of two exceptions:

- BUGS, in its various incarnations (WinBUGS, OpenBUGS, JAGS), does not
require its data to come from the same source. For example, while
programming a hierarchical model (a.k.a. mixed-effect model),
individual level variables may come from one source and various group
level variables may come from other sources. Quite handy: no previous
merge() required. Now, writing (and debugging!) such models in BUGS
is another story...

- SAS has had this concept of "data view" for a long time, its most
useful incarnation being a "data view" of an SQL view. Again, this
avoids the need to actually merge the datasets (which, AFAICR, is a
serious piece of pain in the @$$ in SAS (maybe that's the *real*
etymology of the name?)).

This problem has bugged me for a while. I think that the concept of a
"data view" is right (after all, that's one of the core concepts of SQL
for a reason...), but that implementing it *cleanly* in R is probably
hard work. Using a DBMS for maintaining tables and views and querying
them "just at the right time" does help, but the ability to use these
DBMS data without importing them into R is, AFAIK, currently lacking.

Once upon a time, a very old version of RPgSQL (a Bioconductor package)
aimed at such a representation: it created objects inheriting from
data.frame to represent Postgres-based data, allowing these data to be
used "transparently". This package dropped into oblivion when its
creator and sole maintainer became unable to maintain it further.

As far as I understand it, the DBI specification *might* allow the
creation of such objects, but I am not aware of any driver actually
implementing that.

In fact, there are two elements of a solution to this problem:

a) creation of (abstract) objects representing data collections as data
frames, with the same properties, but not requiring the creation of an
actual data frame. As far as my (very poor) object-oriented knowledge
goes, these objects should, in C++/Python parlance, inherit from
data.frame.

b) creation of objects implementing various realizations of the objects
created in a): DBMS querying, actual data.frame querying (here I'm
thinking of sqldf, which does this in the reverse direction, allowing
R data frames to be queried in SQL. Quite handy...), etc.

I tried my hand once at building such a representation (for
DBMS-deposited data), with partial success (read-only was OK, read-write
was seriously buggy). But my S3 object-oriented code stinks, my Python
is pytiful, and, as a public health measure, I won't even try to
qualify my C++... So I leave implementation to better programmers as an
exercise (a term project, or even a master's thesis subject, is probably
closer to the truth...).

A third, much larger (implementation) element is lacking in this
picture: the algorithms used on these data. SAS is notoriously good (in
some simple cases, such as ordinary regression) at handling datasets
larger than available memory because the algorithms were written
with punched cards (maybe even paper tape) in mind: *one* *sequential*
read of the data was the only *practical* way to go back in those days.
So all the matrices and vectors necessary to the computation
(notionally, X'X and X'Y) were built in memory in *one* step.

Such an organization is probably impossible with most "modern"
algorithms: see Douglas Bates' description of the lmer() algorithms for
a nice, big counter-example, or consider MCMC... But coming closer to
such an organization *seems* possible: see for example biglm.

So I think that data views are a worthy but not-so-easy possible goal
aimed at various data structure problems (including hierarchical data),
but not *the* solution to the data-representation problem in R.

Any thoughts?

Emmanuel Charpentier |
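The one-sequential-read organization mentioned above (building X'X and X'Y in a single pass) can be sketched as follows. This is an illustrative sketch in plain Python, not a description of how SAS or biglm is implemented. The function names `accumulate`, `solve` and `stream` are hypothetical, the data are a toy generator standing in for a file or database cursor, and the normal equations are solved directly for simplicity even though they are numerically less stable than QR-based approaches.

```python
def accumulate(rows, p):
    """One sequential pass: update X'X (p x p) and X'y (length p) row by row."""
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:  # x is one row of the model matrix, y the response
        for i in range(p):
            xty[i] += x[i] * y
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def stream():
    """Toy data source (made up): intercept and t, with y = 1 + 2 * t exactly."""
    for t in range(6):
        yield (1.0, float(t)), 1.0 + 2.0 * t


xtx, xty = accumulate(stream(), p=2)
beta = solve(xtx, xty)
print(beta)
```

Memory use depends only on p, not on the number of rows read, which is why this organization copes with data larger than memory for ordinary regression.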
|
On Sun, Feb 7, 2010 at 2:40 PM, Emmanuel Charpentier
<[hidden email]> wrote:
> Note: this post was motivated more by the "hierarchical data" subject
> than by Douglas Bates' aside, but it might be of interest to its
> respondents.
>
> On Friday, 5 February 2010 at 21:56 +0100, Peter Dalgaard wrote:
>> Peter Ehlers wrote:
>> > I vote to 'fortunize' Doug Bates on
>> >
>> > Hierarchical data sets: which software to use?
>> >
>> > "The widespread use of spreadsheets or SPSS data sets or SAS data sets
>> > which encourage the "single table with a gargantuan number of columns,
>> > most of which are missing data in most cases" approach to organization
>> > of longitudinal data is regrettable."
>> >
>> > http://n4.nabble.com/Hierarchical-data-sets-which-software-to-use-td1458477.html#a1470430
>>
>> Hmm, well, it's not like "long format" data frames (which I actually
>> think are more common in connection with SAS's PROC MIXED) are much
>> better. Those tend to replicate base data unnecessarily - "as if rats
>> change sex with millisecond resolution".
> [ Note to Achim Zeileis: the "rats changing sex with millisecond
> resolution" quote is well worth a nomination to "fortune" fame; it
> seems it is not one already... ]
>
>> The correct data structure
>> would be a relational database with multiple levels of tables, but, to
>> my knowledge, no statistical software, including R, is prepared to deal
>> with data in that form.

I think if you go back to my original reply you will see that my first
suggestion was to use an SQL data base. I didn't mention views (in the
SQL sense) explicitly but those are a natural construction for
organizing longitudinal data. The data can be stored as a set of
normalized tables in a data base but extracted as a data frame in the
long format.

> Well, I can think of two exceptions:
>
> - BUGS, in its various incarnations (WinBUGS, OpenBUGS, JAGS), does not
> require its data to come from the same source. For example, while
> programming a hierarchical model (a.k.a. mixed-effect model),
> individual level variables may come from one source and various group
> level variables may come from other sources. Quite handy: no previous
> merge() required. Now, writing (and debugging!) such models in BUGS
> is another story...
>
> - SAS has had this concept of "data view" for a long time, its most
> useful incarnation being a "data view" of an SQL view. Again, this
> avoids the need to actually merge the datasets (which, AFAICR, is a
> serious piece of pain in the @$$ in SAS (maybe that's the *real*
> etymology of the name?)).
>
> This problem has bugged me for a while. I think that the concept of a
> "data view" is right (after all, that's one of the core concepts of SQL
> for a reason...), but that implementing it *cleanly* in R is probably
> hard work. Using a DBMS for maintaining tables and views and querying
> them "just at the right time" does help, but the ability to use these
> DBMS data without importing them into R is, AFAIK, currently lacking.

I think the issue is more than that. Most model-fitting functions in R
incorporate a formula/data -> model.frame -> model.matrix sequence. The
symbolic analysis to create the model frame can, I think, be applied to
a view and the result stored back in an SQL data base. (I haven't
looked at the code for model.frame in a long time but I think it can be
applied a row at a time or to chunks of rows.) Some auxiliary
information, such as unused factor levels, could be accumulated. The
model.matrix function can be a bit more problematic but it too could be
applied to chunks when generating dense model matrices, with the
summary information from the chunks being accumulated. Updating sparse
model matrices and accumulating summary information is potentially more
time-consuming because you may need to update the form of the summary
matrices as well as the numerical values. Of course in SAS these
changes are easier because their least squares calculations are based
on accumulating summary matrices from the data a row at a time.

I think there are three different cases for fitting models based on
large model matrices. If you have very large n (number of rows) but
small to moderate p (number of columns) in the model matrix then you
can work on chunks of rows using dense matrices to accumulate summary
results. Large n and large p producing a sparse model matrix could be
handled with the sparse.model.matrix function in the Matrix package
because in these cases the model frame and the sparse model matrix tend
to use smaller amounts of storage. When you have a large n and a large
p creating a dense model matrix I think the best course is to buy a
machine with a 64-bit processor and a 64-bit operating system and stuff
it full of memory. It depends on how large p^2 is compared to np
whether you are better off working in chunks of rows or not.

> Once upon a time, a very old version of RPgSQL (a Bioconductor package)
> aimed at such a representation: it created objects inheriting from
> data.frame to represent Postgres-based data, allowing these data to be
> used "transparently". This package dropped into oblivion when its
> creator and sole maintainer became unable to maintain it further.
>
> As far as I understand it, the DBI specification *might* allow the
> creation of such objects, but I am not aware of any driver actually
> implementing that.
>
> In fact, there are two elements of a solution to this problem:
> a) creation of (abstract) objects representing data collections as data
> frames, with the same properties, but not requiring the creation of an
> actual data frame. As far as my (very poor) object-oriented knowledge
> goes, these objects should, in C++/Python parlance, inherit from
> data.frame.
> b) creation of objects implementing various realizations of the objects
> created in a): DBMS querying, actual data.frame querying (here I'm
> thinking of sqldf, which does this in the reverse direction, allowing
> R data frames to be queried in SQL. Quite handy...), etc.
>
> I tried my hand once at building such a representation (for
> DBMS-deposited data), with partial success (read-only was OK, read-write
> was seriously buggy). But my S3 object-oriented code stinks, my Python
> is pytiful, and, as a public health measure, I won't even try to
> qualify my C++... So I leave implementation to better programmers as an
> exercise (a term project, or even a master's thesis subject, is probably
> closer to the truth...).
>
> A third, much larger (implementation) element is lacking in this
> picture: the algorithms used on these data. SAS is notoriously good (in
> some simple cases, such as ordinary regression) at handling datasets
> larger than available memory because the algorithms were written
> with punched cards (maybe even paper tape) in mind: *one* *sequential*
> read of the data was the only *practical* way to go back in those days.
> So all the matrices and vectors necessary to the computation
> (notionally, X'X and X'Y) were built in memory in *one* step.
>
> Such an organization is probably impossible with most "modern"
> algorithms: see Douglas Bates' description of the lmer() algorithms for
> a nice, big counter-example, or consider MCMC... But coming closer to
> such an organization *seems* possible: see for example biglm.

Actually the algorithms for linear mixed models in lmer could be
adapted to work with only the summary matrices. The actual
calculations only reference the sparse model matrix Z for the random
effects in one place and that could be replaced with another
calculation involving Z'Z. The model matrix X for the fixed effects
can be dense or sparse and all the calculations are based on
pre-computed products like Z'X, X'X and X'y.

In the extensions, generalized linear mixed models or nonlinear mixed
models, you can't pre-compute the products because the products of
interest involve weights that vary at each iteration.

> So I think that data views are a worthy but not-so-easy possible goal
> aimed at various data structure problems (including hierarchical data),
> but not *the* solution to the data-representation problem in R.
>
> Any thoughts?
>
> Emmanuel Charpentier |
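The chunked route described above for the "large n, small to moderate p" case can be sketched in two passes over a database cursor: a first pass that accumulates the factor levels actually present, and a second that builds dense model-matrix rows chunk by chunk and keeps only the summary matrices. This is an illustration under stated assumptions, not R's model.frame/model.matrix code: it is written in Python with the standard `sqlite3` module, and the table `obs`, its columns, the helpers `chunks` and `model_rows`, the chunk size and all data are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (grp TEXT, x REAL, y REAL)")
con.executemany("INSERT INTO obs VALUES (?, ?, ?)", [  # made-up data
    ("a", 0.0, 1.0), ("b", 1.0, 4.0), ("a", 2.0, 3.0),
    ("c", 3.0, 9.0), ("b", 4.0, 7.0), ("c", 5.0, 11.0),
])


def chunks(cursor, size):
    """Yield lists of at most `size` rows until the cursor is exhausted."""
    while True:
        block = cursor.fetchmany(size)
        if not block:
            return
        yield block


# Pass 1: accumulate the factor levels that actually occur in the data.
levels = set()
for block in chunks(con.execute("SELECT grp FROM obs"), 2):
    levels.update(g for (g,) in block)
levels = sorted(levels)


def model_rows(block):
    """Dense model-matrix rows: intercept, treatment-coded grp, then x."""
    return [[1.0] + [float(g == lev) for lev in levels[1:]] + [x]
            for g, x, _ in block]


# Pass 2: build the model matrix one chunk at a time, keeping X'X and X'y only.
p = len(levels) + 1  # intercept + (number of levels - 1) dummies + x
xtx = [[0.0] * p for _ in range(p)]
xty = [0.0] * p
for block in chunks(con.execute("SELECT grp, x, y FROM obs"), 2):
    for row, (_, _, y) in zip(model_rows(block), block):
        for i in range(p):
            xty[i] += row[i] * y
            for j in range(p):
                xtx[i][j] += row[i] * row[j]

print(levels)
print(xtx)
print(xty)
```

Only one chunk of model-matrix rows is held in memory at any time; the first pass is what lets every chunk be coded against the same set of columns.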
