Improved Data Aggregation and Summary Statistics in R

Sebastian Martin Krantz
Dear Developers,

Having spent time developing and thinking about how data aggregation and
summary statistics can be enhanced in R, I would like to present my
ideas/efforts in the form of two commands:

The first, which for now I called 'collap', is an upgrade of aggregate that
accommodates and extends the functionality of aggregate in various
respects, most importantly to work with multilevel and multi-type data,
multiple function calls, highly customized aggregation tasks, a much
greater flexibility in the passing of inputs and tidy output.
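
For context, the sketch below shows the kind of base R workflow that such an upgrade would shorten. It uses only aggregate() and merge() from base R and is not the collap code itself. The data frame and its column names (country, gdp, region) are made up for illustration.

```r
# Toy multi-type data: one numeric and one categorical column per country.
# All names here are hypothetical and only serve the illustration.
df <- data.frame(
  country = c("A", "A", "B", "B"),
  gdp     = c(1, 2, 3, 5),
  region  = c("North", "North", "South", "South"),
  stringsAsFactors = FALSE
)
grp <- list(country = df$country)

# Numeric column: one aggregate() call with one function.
num_means <- aggregate(df["gdp"], by = grp, FUN = mean)

# Several statistics at once: FUN returns a vector, so the result
# holds a matrix column rather than separate tidy columns.
multi <- aggregate(df["gdp"], by = grp,
                   FUN = function(x) c(mean = mean(x), sd = sd(x)))

# Categorical column: needs its own call with a different function
# (here simply the first value in each group).
cat_first <- aggregate(df["region"], by = grp, FUN = function(x) x[1])

# The pieces then have to be merged by hand.
result <- merge(num_means, cat_first, by = "country")
print(result)
```

Each column type needs its own call, and the multi-function result arrives as a matrix column, which is the bookkeeping a single aggregation command with tidy output would remove.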

The second function, 'qsu', is an advanced and flexible summary command for
cross-sectional and multilevel (panel) data (i.e. it can provide overall,
between and within entities statistics, and allows for grouping, custom
functions and transformations). It also provides a quick method to compute
and output within-transformed data.
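
To make the overall/between/within distinction concrete, here is a minimal base R sketch on a made-up balanced panel. It is not the qsu implementation. It assumes one common convention, in which the within-transformed data are deviations from each entity's mean with the overall mean added back; the convention used by qsu may differ.

```r
# Hypothetical panel: three entities observed three times each.
panel <- data.frame(
  id = rep(c("a", "b", "c"), each = 3),
  x  = c(1, 2, 3, 4, 6, 8, 10, 10, 10)
)

overall_mean <- mean(panel$x)

# Between: one mean per entity.
between <- tapply(panel$x, panel$id, mean)

# Within: deviation of each observation from its entity mean,
# re-centred on the overall mean (assumed convention, see above).
entity_mean <- ave(panel$x, panel$id, FUN = mean)
within <- panel$x - entity_mean + overall_mean

# Summary table with one row per level of variation.
stats <- rbind(
  overall = c(mean = mean(panel$x), sd = sd(panel$x)),
  between = c(mean = mean(between), sd = sd(between)),
  within  = c(mean = mean(within),  sd = sd(within))
)
print(stats)
```

The vector `within` is the within-transformed series, so the same few lines also show what "output within-transformed data" refers to.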

Both commands are efficiently built from core R, but provide for optional
integration with data.table, which renders them extremely fast on large
datasets. An explanation of the syntax, a demonstration and benchmark
results are provided in the attached vignette.

Since both commands accommodate existing functionality while adding
significant basic functionality, I thought that their addition to the stats
package would be a worthwhile consideration. I would be glad to receive
your feedback.

Best regards,

Sebastian Krantz

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Attachment: collap & qsu vignette.pdf (760K)

Re: Improved Data Aggregation and Summary Statistics in R

Iñaki Ucar
On Wed, 27 Feb 2019 at 09:02, Sebastian Martin Krantz
<[hidden email]> wrote:

> [...]

Looks interesting. Sorry if it's there and I didn't find it: is there
any package implementing these functions so that we can try them?

Iñaki

Re: Improved Data Aggregation and Summary Statistics in R

Joris FA Meys
In reply to this post by Sebastian Martin Krantz
Dear Sebastian,

Initially I was a bit hesitant to think about yet another way to summarize
data, but your illustrations convinced me this is actually a great addition
to the toolset currently available in different R packages. Many of us have
written custom functions to get the required tables for specific data sets,
but this would reduce that effort to simply using the right collap() call.

Like Iñaki, I'm very interested in trying it out if you have the code
available somewhere.

Cheers
Joris

On Wed, Feb 27, 2019 at 9:01 AM Sebastian Martin Krantz <[hidden email]> wrote:

> [...]


--
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

-----------
Biowiskundedagen 2018-2019
http://www.biowiskundedagen.ugent.be/

Re: Improved Data Aggregation and Summary Statistics in R

Sebastian Martin Krantz
Dear Iñaki and Joris,

Thank you for the positive feedback! I had attached a code file to the
post, but apparently it was removed.
I will attach it again to this e-mail; otherwise, both the vignette and
the code can be downloaded from the following link:
https://www.dropbox.com/sh/s0k1tiz7el55g1q/AACpri-nruXjcMwUnNcHoycKa?dl=0

Best,
Sebastian

On Wed, 27 Feb 2019 at 11:14, Joris Meys <[hidden email]> wrote:

> [...]

Re: Improved Data Aggregation and Summary Statistics in R

Duncan Murdoch-2
In reply to this post by Sebastian Martin Krantz
On 26/02/2019 8:25 a.m., Sebastian Martin Krantz wrote:

> [...]

Generally the R Core group is reluctant to incorporate new functions
into the base packages.  Each function that is added adds to their work,
and they already have too much to do.  (I am no longer a member of R
Core, but I don't think things have changed since I retired.)

It is much easier for them if volunteers publish functions themselves,
via contributed packages.

Nowadays GitHub provides a very convenient platform on which you can
develop a package containing your functions.  If other users find bugs
or have suggested improvements, it's very easy for them to send those to
you, and you can make the fixes available immediately.  Once you are
satisfied that it is stable, you can submit it to CRAN, and anyone using
R can easily install it.

If you find the prospect of writing a package daunting, you shouldn't.
It's actually quite easy, especially if you are using RStudio or ESS (or
some other helpful front-end).  Hadley Wickham's book
<http://r-pkgs.had.co.nz/> is a pretty accessible description of a
development strategy.  (It's not the only strategy, but lots of people
use it.)
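
As a rough idea of how little is needed to get started, the sketch below creates a package skeleton with base R's utils::package.skeleton(). This is a different route from the one described in the book linked above, and it is only one possible starting point. The package name "mypkg" and the function group_mean are placeholders, not part of the proposed code.

```r
# Hypothetical function that the package should contain.
pkg_env <- new.env()
pkg_env$group_mean <- function(x, g) tapply(x, g, mean)

# Create the skeleton in a temporary directory so nothing is overwritten.
out_dir <- tempfile("pkgdemo")
dir.create(out_dir)

utils::package.skeleton(
  name        = "mypkg",        # placeholder package name
  list        = "group_mean",   # objects to include
  environment = pkg_env,        # where those objects are looked up
  path        = out_dir
)

# Inspect what was generated (DESCRIPTION, NAMESPACE, R/, man/, ...).
pkg_dir <- file.path(out_dir, "mypkg")
print(list.files(pkg_dir, recursive = TRUE))
```

The generated DESCRIPTION and help-file templates still have to be filled in by hand before the package can be checked and shared.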

Duncan Murdoch

Re: Improved Data Aggregation and Summary Statistics in R

Sebastian Martin Krantz
Thanks to all who gave feedback so far. There is now a version of the
package on GitHub; it can be installed with

remotes::install_github("SebKrantz/collapse")

Further feedback is still very welcome!


On Wed, 27 Feb 2019 at 12:48, Duncan Murdoch <[hidden email]> wrote:

> [...]