Potential improvements of ave?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Potential improvements of ave?

SOEIRO Thomas
Dear all,

I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports.



1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected:

set.seed(1)

df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
                  id2 = sample(1:3, 5e2, TRUE),
                  id3 = sample(1:5, 5e2, TRUE),
                  val = sample(1:300, 5e2, TRUE))

df1$diff <- ave(df1$val,
                df1$id1,
                df1$id2,
                df1$id3,
                FUN = function(i) c(diff(i), 0))

head(df1[order(df1$id1,
               df1$id2,
               df1$id3), ])

But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb):

df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
                  id2 = sample(1:3, 5e2 * 1e4, TRUE),
                  id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
                  val = sample(1:300, 5e2 * 1e4, TRUE))

df2$diff <- ave(df2$val,
                df2$id1,
                df2$id2,
                df2$id3,
                FUN = function(i) c(diff(i), 0))

This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame).
So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized?



2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
Is it relevant/possible to expose the drop argument explicitly?



Thanks,

Thomas
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Potential improvements of ave?

SOEIRO Thomas
The bottleneck of ave is the call to interaction (i.e. not the call to split/lapply).

Therefore, the following code runs as expected (but I may miss something...):

ave2 <- function (x, ..., FUN = mean)
{
    if(missing(...))
        x[] <- FUN(x)
    else {
        #g <- interaction(...)
        g <- paste0(...)
        split(x,g) <- lapply(split(x, g), FUN)
    }
    x
}

df2$diff <- ave2(df2$val,
                 df2$id1,
                 df2$id2,
                 df2$id3,
                 FUN = function(i) c(diff(i), 0))



Of course I can also simply solve my current issue with:

df2$id123 <- paste0(df2$id1,
                    df2$id2,
                    df2$id3)

df2$diff <- ave(df2$val,
                df2$id123,
                FUN = function(i) c(diff(i), 0))



In addition, ave2 also avoid warnings in case of unused levels (see point 2) in my previous message).
________________________________________
De : SOEIRO Thomas
Envoyé : vendredi 12 mars 2021 23:59
À : [hidden email]
Objet : Potential improvements of ave?

Dear all,

I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports.



1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected:

set.seed(1)

df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
                  id2 = sample(1:3, 5e2, TRUE),
                  id3 = sample(1:5, 5e2, TRUE),
                  val = sample(1:300, 5e2, TRUE))

df1$diff <- ave(df1$val,
                df1$id1,
                df1$id2,
                df1$id3,
                FUN = function(i) c(diff(i), 0))

head(df1[order(df1$id1,
               df1$id2,
               df1$id3), ])

But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb):

df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
                  id2 = sample(1:3, 5e2 * 1e4, TRUE),
                  id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
                  val = sample(1:300, 5e2 * 1e4, TRUE))

df2$diff <- ave(df2$val,
                df2$id1,
                df2$id2,
                df2$id3,
                FUN = function(i) c(diff(i), 0))

This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame).
So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized?



2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
Is it relevant/possible to expose the drop argument explicitly?



Thanks,

Thomas

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Potential improvements of ave?

aBBy Spurdle, ⍺XY
In reply to this post by SOEIRO Thomas
Hi Thomas,

These are some great suggestions.
But I can't help but feel there's a much bigger problem here.

Intuitively, the ave function could (or should) sort the data.
Then the indexing step becomes almost trivial, in terms of both time
and space complexity.
And the ave function is not the only example of where a problem
becomes much simpler, if the data is sorted.

Historically, I've never found base R functions user-friendly for
aggregation purposes, or for sorting.
(At least, not by comparison to SQL).

But that's not the main problem.
It would seem preferable to sort the data, only once.
(Rather than sorting it repeatedly, or not at all).

Perhaps, objects such as vectors and data.frame(s) could have a
boolean attribute, to indicate if they're sorted.
Or functions such as ave could have a sorted argument.
In either case, if true, the function assumes the data is sorted and
applies a more efficient algorithm.


B.


On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> wrote:

>
> Dear all,
>
> I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports.
>
>
>
> 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected:
>
> set.seed(1)
>
> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>                   id2 = sample(1:3, 5e2, TRUE),
>                   id3 = sample(1:5, 5e2, TRUE),
>                   val = sample(1:300, 5e2, TRUE))
>
> df1$diff <- ave(df1$val,
>                 df1$id1,
>                 df1$id2,
>                 df1$id3,
>                 FUN = function(i) c(diff(i), 0))
>
> head(df1[order(df1$id1,
>                df1$id2,
>                df1$id3), ])
>
> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb):
>
> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>                   val = sample(1:300, 5e2 * 1e4, TRUE))
>
> df2$diff <- ave(df2$val,
>                 df2$id1,
>                 df2$id2,
>                 df2$id3,
>                 FUN = function(i) c(diff(i), 0))
>
> This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame).
> So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized?
>
>
>
> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
> Is it relevant/possible to expose the drop argument explicitly?
>
>
>
> Thanks,
>
> Thomas
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Potential improvements of ave?

SOEIRO Thomas
Hi Abby,

Thank you for your positive feedback.

I agree for your general comment about sorting.

For ave specifically, ordering may not help because the output must maintain the order of the input (as ave returns only x and not the entiere data.frame).

Thanks,

Thomas
________________________________________
De : Abby Spurdle <[hidden email]>
Envoyé : lundi 15 mars 2021 10:22
À : SOEIRO Thomas
Cc : [hidden email]
Objet : Re: [Rd] Potential improvements of ave?

EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS

Hi Thomas,

These are some great suggestions.
But I can't help but feel there's a much bigger problem here.

Intuitively, the ave function could (or should) sort the data.
Then the indexing step becomes almost trivial, in terms of both time
and space complexity.
And the ave function is not the only example of where a problem
becomes much simpler, if the data is sorted.

Historically, I've never found base R functions user-friendly for
aggregation purposes, or for sorting.
(At least, not by comparison to SQL).

But that's not the main problem.
It would seem preferable to sort the data, only once.
(Rather than sorting it repeatedly, or not at all).

Perhaps, objects such as vectors and data.frame(s) could have a
boolean attribute, to indicate if they're sorted.
Or functions such as ave could have a sorted argument.
In either case, if true, the function assumes the data is sorted and
applies a more efficient algorithm.


B.


On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> wrote:

>
> Dear all,
>
> I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports.
>
>
>
> 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected:
>
> set.seed(1)
>
> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>                   id2 = sample(1:3, 5e2, TRUE),
>                   id3 = sample(1:5, 5e2, TRUE),
>                   val = sample(1:300, 5e2, TRUE))
>
> df1$diff <- ave(df1$val,
>                 df1$id1,
>                 df1$id2,
>                 df1$id3,
>                 FUN = function(i) c(diff(i), 0))
>
> head(df1[order(df1$id1,
>                df1$id2,
>                df1$id3), ])
>
> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb):
>
> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>                   val = sample(1:300, 5e2 * 1e4, TRUE))
>
> df2$diff <- ave(df2$val,
>                 df2$id1,
>                 df2$id2,
>                 df2$id3,
>                 FUN = function(i) c(diff(i), 0))
>
> This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame).
> So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized?
>
>
>
> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ ).
> Is it relevant/possible to expose the drop argument explicitly?
>
>
>
> Thanks,
>
> Thomas
> ______________________________________________
> [hidden email] mailing list
> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Potential improvements of ave?

Gabriel Becker-2
Abby,

Vectors do have an internal mechanism for knowing that they are sorted via
ALTREP (it was one of 2 core motivating features for 'smart vectors' the
other being knowledge about presence of NAs).

Currently I don't think we expose it at the R level, though it is part of
the official C API. I don't know of any plans for this to change, but I
suppose it could. Plus for functions in R itself, we could even use it
without exposing it more widely. A number of functions, including sort
itself, already do this in fact, but more could. I'd be interested in
hearing which functions you think would particularly benefit from this.

~G

On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <[hidden email]>
wrote:

> Hi Abby,
>
> Thank you for your positive feedback.
>
> I agree for your general comment about sorting.
>
> For ave specifically, ordering may not help because the output must
> maintain the order of the input (as ave returns only x and not the entiere
> data.frame).
>
> Thanks,
>
> Thomas
> ________________________________________
> De : Abby Spurdle <[hidden email]>
> Envoyé : lundi 15 mars 2021 10:22
> À : SOEIRO Thomas
> Cc : [hidden email]
> Objet : Re: [Rd] Potential improvements of ave?
>
> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
>
> Hi Thomas,
>
> These are some great suggestions.
> But I can't help but feel there's a much bigger problem here.
>
> Intuitively, the ave function could (or should) sort the data.
> Then the indexing step becomes almost trivial, in terms of both time
> and space complexity.
> And the ave function is not the only example of where a problem
> becomes much simpler, if the data is sorted.
>
> Historically, I've never found base R functions user-friendly for
> aggregation purposes, or for sorting.
> (At least, not by comparison to SQL).
>
> But that's not the main problem.
> It would seem preferable to sort the data, only once.
> (Rather than sorting it repeatedly, or not at all).
>
> Perhaps, objects such as vectors and data.frame(s) could have a
> boolean attribute, to indicate if they're sorted.
> Or functions such as ave could have a sorted argument.
> In either case, if true, the function assumes the data is sorted and
> applies a more efficient algorithm.
>
>
> B.
>
>
> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]>
> wrote:
> >
> > Dear all,
> >
> > I have two questions/suggestions about ave, but I am not sure if it's
> relevant for bug reports.
> >
> >
> >
> > 1) I have performance issues with ave in a case where I didn't expect
> it. The following code runs as expected:
> >
> > set.seed(1)
> >
> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> >                   id2 = sample(1:3, 5e2, TRUE),
> >                   id3 = sample(1:5, 5e2, TRUE),
> >                   val = sample(1:300, 5e2, TRUE))
> >
> > df1$diff <- ave(df1$val,
> >                 df1$id1,
> >                 df1$id2,
> >                 df1$id3,
> >                 FUN = function(i) c(diff(i), 0))
> >
> > head(df1[order(df1$id1,
> >                df1$id2,
> >                df1$id3), ])
> >
> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot
> allocate vector of size 1110.0 Gb):
> >
> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
> >
> > df2$diff <- ave(df2$val,
> >                 df2$id1,
> >                 df2$id2,
> >                 df2$id3,
> >                 FUN = function(i) c(diff(i), 0))
> >
> > This use case does not seem extreme to me (e.g. aggregate et al work
> perfectly on this data.frame).
> > So my question is: Is this expected/intended/reasonable? i.e. Does ave
> need to be optimized?
> >
> >
> >
> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to
> avoid warnings in case of unused levels (
> https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$
> ).
> > Is it relevant/possible to expose the drop argument explicitly?
> >
> >
> >
> > Thanks,
> >
> > Thomas
> > ______________________________________________
> > [hidden email] mailing list
> >
> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Potential improvements of ave?

Martin Maechler
>>>>> Gabriel Becker
>>>>>     on Mon, 15 Mar 2021 15:08:44 -0700 writes:

    > Abby,
    > Vectors do have an internal mechanism for knowing that they are sorted via
    > ALTREP (it was one of 2 core motivating features for 'smart vectors' the
    > other being knowledge about presence of NAs).

    > Currently I don't think we expose it at the R level, though it is part of
    > the official C API. I don't know of any plans for this to change, but I
    > suppose it could. Plus for functions in R itself, we could even use it
    > without exposing it more widely. A number of functions, including sort
    > itself, already do this in fact, but more could. I'd be interested in
    > hearing which functions you think would particularly benefit from this.

Thank you Gabe.

    > ~G

I vaguely remember (from Luke's docs/presentation on ALTREP)
that there are some "missing parts" here.
One of them the not-existing R level functionality, another may be
the C code below R's  is.unsorted()  ... maybe  is.unsorted()
could get a new argument and or be re-written, moving  the NA
handling also to C and have that happen *after* the C code looks
if it's an ALTREP object and if that "knows it's sorted".

Martin


    > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <[hidden email]>
    > wrote:

    >> Hi Abby,
    >>
    >> Thank you for your positive feedback.
    >>
    >> I agree for your general comment about sorting.
    >>
    >> For ave specifically, ordering may not help because the output must
    >> maintain the order of the input (as ave returns only x and not the entiere
    >> data.frame).
    >>
    >> Thanks,
    >>
    >> Thomas
    >> ________________________________________
    >> De : Abby Spurdle <[hidden email]>
    >> Envoyé : lundi 15 mars 2021 10:22
    >> À : SOEIRO Thomas
    >> Cc : [hidden email]
    >> Objet : Re: [Rd] Potential improvements of ave?
    >>
    >> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
    >>
    >> Hi Thomas,
    >>
    >> These are some great suggestions.
    >> But I can't help but feel there's a much bigger problem here.
    >>
    >> Intuitively, the ave function could (or should) sort the data.
    >> Then the indexing step becomes almost trivial, in terms of both time
    >> and space complexity.
    >> And the ave function is not the only example of where a problem
    >> becomes much simpler, if the data is sorted.
    >>
    >> Historically, I've never found base R functions user-friendly for
    >> aggregation purposes, or for sorting.
    >> (At least, not by comparison to SQL).
    >>
    >> But that's not the main problem.
    >> It would seem preferable to sort the data, only once.
    >> (Rather than sorting it repeatedly, or not at all).
    >>
    >> Perhaps, objects such as vectors and data.frame(s) could have a
    >> boolean attribute, to indicate if they're sorted.
    >> Or functions such as ave could have a sorted argument.
    >> In either case, if true, the function assumes the data is sorted and
    >> applies a more efficient algorithm.
    >>
    >>
    >> B.
    >>
    >>
    >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]>
    >> wrote:
    >> >
    >> > Dear all,
    >> >
    >> > I have two questions/suggestions about ave, but I am not sure if it's
    >> relevant for bug reports.
    >> >
    >> >
    >> >
    >> > 1) I have performance issues with ave in a case where I didn't expect
    >> it. The following code runs as expected:
    >> >
    >> > set.seed(1)
    >> >
    >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
    >> >                   id2 = sample(1:3, 5e2, TRUE),
    >> >                   id3 = sample(1:5, 5e2, TRUE),
    >> >                   val = sample(1:300, 5e2, TRUE))
    >> >
    >> > df1$diff <- ave(df1$val,
    >> >                 df1$id1,
    >> >                 df1$id2,
    >> >                 df1$id3,
    >> >                 FUN = function(i) c(diff(i), 0))
    >> >
    >> > head(df1[order(df1$id1,
    >> >                df1$id2,
    >> >                df1$id3), ])
    >> >
    >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot
    >> allocate vector of size 1110.0 Gb):
    >> >
    >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
    >> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
    >> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
    >> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
    >> >
    >> > df2$diff <- ave(df2$val,
    >> >                 df2$id1,
    >> >                 df2$id2,
    >> >                 df2$id3,
    >> >                 FUN = function(i) c(diff(i), 0))
    >> >
    >> > This use case does not seem extreme to me (e.g. aggregate et al work
    >> perfectly on this data.frame).
    >> > So my question is: Is this expected/intended/reasonable? i.e. Does ave
    >> need to be optimized?
    >> >
    >> >
    >> >
    >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to
    >> avoid warnings in case of unused levels (
    >> https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$
    >> ).
    >> > Is it relevant/possible to expose the drop argument explicitly?
    >> >
    >> >
    >> >
    >> > Thanks,
    >> >
    >> > Thomas
    >> > ______________________________________________
    >> > [hidden email] mailing list
    >> >
    >> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
    >>
    >> ______________________________________________
    >> [hidden email] mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >>

    > [[alternative HTML version deleted]]

    > ______________________________________________
    > [hidden email] mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Potential improvements of ave?

Dirk Eddelbuettel

On 16 March 2021 at 10:50, Martin Maechler wrote:
| I vaguely remember (from Luke's docs/presentation on ALTREP)
| that there are some "missing parts" here.
| One of them the not-existing R level functionality, another may be
| the C code below R's  is.unsorted()  ... maybe  is.unsorted()
| could get a new argument and or be re-written, moving  the NA
| handling also to C and have that happen *after* the C code looks
| if it's an ALTREP object and if that "knows it's sorted".

Somewhat related: I recently tried to demonstrate that, say, a 1e2 and a 1e6
vector each with one added NA should have equal "O(1)" time under anyNA(),
yet apparently they do not---so the other magic bit (about knowing NAness)
may not propagate. (That was with R 4.0.4.) I may also have missed a crucial
bit about enabling ALTREPness here, yet the "gist of it" always was (AFAIK)
that it is supposedly seamless.

Dirk

--
https://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Potential improvements of ave?

aBBy Spurdle, ⍺XY
In reply to this post by Gabriel Becker-2
There are some relatively obvious examples:
unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split

Also, many timeseries, graphics and spline functions are dependent on the order.

In the case of data.frame(s), a boolean flag would probably need to be
extended to allow for multiple column sorting, and
ascending/descending options.

On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker <[hidden email]> wrote:

>
> Abby,
>
> Vectors do have an internal mechanism for knowing that they are sorted via ALTREP (it was one of 2 core motivating features for 'smart vectors' the other being knowledge about presence of NAs).
>
> Currently I don't think we expose it at the R level, though it is part of the official C API. I don't know of any plans for this to change, but I suppose it could. Plus for functions in R itself, we could even use it without exposing it more widely. A number of functions, including sort itself, already do this in fact, but more could. I'd be interested in hearing which functions you think would particularly benefit from this.
>
> ~G
>
> On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <[hidden email]> wrote:
>>
>> Hi Abby,
>>
>> Thank you for your positive feedback.
>>
>> I agree for your general comment about sorting.
>>
>> For ave specifically, ordering may not help because the output must maintain the order of the input (as ave returns only x and not the entiere data.frame).
>>
>> Thanks,
>>
>> Thomas
>> ________________________________________
>> De : Abby Spurdle <[hidden email]>
>> Envoyé : lundi 15 mars 2021 10:22
>> À : SOEIRO Thomas
>> Cc : [hidden email]
>> Objet : Re: [Rd] Potential improvements of ave?
>>
>> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
>>
>> Hi Thomas,
>>
>> These are some great suggestions.
>> But I can't help but feel there's a much bigger problem here.
>>
>> Intuitively, the ave function could (or should) sort the data.
>> Then the indexing step becomes almost trivial, in terms of both time
>> and space complexity.
>> And the ave function is not the only example of where a problem
>> becomes much simpler, if the data is sorted.
>>
>> Historically, I've never found base R functions user-friendly for
>> aggregation purposes, or for sorting.
>> (At least, not by comparison to SQL).
>>
>> But that's not the main problem.
>> It would seem preferable to sort the data, only once.
>> (Rather than sorting it repeatedly, or not at all).
>>
>> Perhaps, objects such as vectors and data.frame(s) could have a
>> boolean attribute, to indicate if they're sorted.
>> Or functions such as ave could have a sorted argument.
>> In either case, if true, the function assumes the data is sorted and
>> applies a more efficient algorithm.
>>
>>
>> B.
>>
>>
>> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> wrote:
>> >
>> > Dear all,
>> >
>> > I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports.
>> >
>> >
>> >
>> > 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected:
>> >
>> > set.seed(1)
>> >
>> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>> >                   id2 = sample(1:3, 5e2, TRUE),
>> >                   id3 = sample(1:5, 5e2, TRUE),
>> >                   val = sample(1:300, 5e2, TRUE))
>> >
>> > df1$diff <- ave(df1$val,
>> >                 df1$id1,
>> >                 df1$id2,
>> >                 df1$id3,
>> >                 FUN = function(i) c(diff(i), 0))
>> >
>> > head(df1[order(df1$id1,
>> >                df1$id2,
>> >                df1$id3), ])
>> >
>> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb):
>> >
>> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
>> >
>> > df2$diff <- ave(df2$val,
>> >                 df2$id1,
>> >                 df2$id2,
>> >                 df2$id3,
>> >                 FUN = function(i) c(diff(i), 0))
>> >
>> > This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame).
>> > So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized?
>> >
>> >
>> >
>> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ ).
>> > Is it relevant/possible to expose the drop argument explicitly?
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Thomas
>> > ______________________________________________
>> > [hidden email] mailing list
>> > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Potential improvements of ave?

SOEIRO Thomas
Dear all,

Thank you for your consideration on this topic.

I do not have enough knowledge of R internals to join the discussion about sorting mechanisms. In fact, I did not get how ordering could help for ave as the output must maintain the order of the input (because ave returns only x and not the entiere data.frame).

However, while the proposed workaround (i.e. paste0 instead of interaction, cf https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not solves the "bigger problem" of sorting, it is usable as is and solves the issue. Therefore, what do you think about it? (i.e is it relevant for a patch?)

Thanks,

Thomas


> ________________________________________
> De : Abby Spurdle <[hidden email]>
> Envoyé : lundi 15 mars 2021 10:22
> À : SOEIRO Thomas
> Cc : [hidden email]
> Objet : Re: [Rd] Potential improvements of ave?
>
> Hi Thomas,
>
> These are some great suggestions.
> But I can't help but feel there's a much bigger problem here.
>
> Intuitively, the ave function could (or should) sort the data.
> Then the indexing step becomes almost trivial, in terms of both time
> and space complexity.
> And the ave function is not the only example of where a problem
> becomes much simpler, if the data is sorted.
>
> Historically, I've never found base R functions user-friendly for
> aggregation purposes, or for sorting.
> (At least, not by comparison to SQL).
>
> But that's not the main problem.
> It would seem preferable to sort the data, only once.
> (Rather than sorting it repeatedly, or not at all).
>
> Perhaps, objects such as vectors and data.frame(s) could have a
> boolean attribute, to indicate if they're sorted.
> Or functions such as ave could have a sorted argument.
> In either case, if true, the function assumes the data is sorted and
> applies a more efficient algorithm.
>
>
> B.
>
>
> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> wrote:
>>
>> Dear all,
>>
>> I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports.
>>
>>
>>
>> 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected:
>>
>> set.seed(1)
>>
>> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>>                   id2 = sample(1:3, 5e2, TRUE),
>>                   id3 = sample(1:5, 5e2, TRUE),
>>                   val = sample(1:300, 5e2, TRUE))
>>
>> df1$diff <- ave(df1$val,
>>                 df1$id1,
>>                 df1$id2,
>>                 df1$id3,
>>                 FUN = function(i) c(diff(i), 0))
>>
>> head(df1[order(df1$id1,
>>                df1$id2,
>>                df1$id3), ])
>>
>> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb):
>>
>> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>>                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>>                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>>                   val = sample(1:300, 5e2 * 1e4, TRUE))
>>
>> df2$diff <- ave(df2$val,
>>                 df2$id1,
>>                 df2$id2,
>>                 df2$id3,
>>                 FUN = function(i) c(diff(i), 0))
>>
>> This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame).
>> So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized?
>>
>>
>>
>> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
>> Is it relevant/possible to expose the drop argument explicitly?
>>
>>
>>
>> Thanks,
>>
>> Thomas
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Potential improvements of ave?

Gabriel Becker-2
In reply to this post by aBBy Spurdle, ⍺XY
Hi Abby,

I actually have a patch submitted that does this for unique/duplicated
(only numeric cases I think) but it is, as patches from external
contributors go, quite sizable which means it requires a correspondingly
large amount of an R-core member's time and energy to vet and consider. It
is in the queue, and so, I expect (/hope, provided I didn't make a mistake)
it will be incorporated at some point. (
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17993)

You are correct that the speedups are quite significant for calling
unique/duplicated on large vectors that know they are sorted: Speedup on my
machine for a fairly sizable vector (length 1e7) ranges from about ~10x in
the densely duplicated case up to ~60-70x in the sparsely duplicated case
for duplicated(). For unique() it seems to range from ~10x in the densely
duplicated case to ~15 in the spare case.

I had thought that min and max already did this, but looking now, they
don't seem to by default, thought ALTREP classes themselves do have an
option of setting a min/max method, which would be hit. That does seem like
low-hanging fruit, I agree, though in many cases the slow down from a
single pass over the data to get a min probably isn't earthshattering.

The others do seem like they could benefit as well.

Best,
~G

On Tue, Mar 16, 2021 at 2:54 PM Abby Spurdle <[hidden email]> wrote:

> There are some relatively obvious examples:
> unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split
>
> Also, many timeseries, graphics and spline functions are dependent on the
> order.
>
> In the case of data.frame(s), a boolean flag would probably need to be
> extended to allow for multiple column sorting, and
> ascending/descending options.
>
> On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker <[hidden email]>
> wrote:
> >
> > Abby,
> >
> > Vectors do have an internal mechanism for knowing that they are sorted
> via ALTREP (it was one of 2 core motivating features for 'smart vectors'
> the other being knowledge about presence of NAs).
> >
> > Currently I don't think we expose it at the R level, though it is part
> of the official C API. I don't know of any plans for this to change, but I
> suppose it could. Plus for functions in R itself, we could even use it
> without exposing it more widely. A number of functions, including sort
> itself, already do this in fact, but more could. I'd be interested in
> hearing which functions you think would particularly benefit from this.
> >
> > ~G
> >
> > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <[hidden email]>
> wrote:
> >>
> >> Hi Abby,
> >>
> >> Thank you for your positive feedback.
> >>
> >> I agree for your general comment about sorting.
> >>
> >> For ave specifically, ordering may not help because the output must
> maintain the order of the input (as ave returns only x and not the entiere
> data.frame).
> >>
> >> Thanks,
> >>
> >> Thomas
> >> ________________________________________
> >> De : Abby Spurdle <[hidden email]>
> >> Envoyé : lundi 15 mars 2021 10:22
> >> À : SOEIRO Thomas
> >> Cc : [hidden email]
> >> Objet : Re: [Rd] Potential improvements of ave?
> >>
> >> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
> >>
> >> Hi Thomas,
> >>
> >> These are some great suggestions.
> >> But I can't help but feel there's a much bigger problem here.
> >>
> >> Intuitively, the ave function could (or should) sort the data.
> >> Then the indexing step becomes almost trivial, in terms of both time
> >> and space complexity.
> >> And the ave function is not the only example of where a problem
> >> becomes much simpler, if the data is sorted.
> >>
> >> Historically, I've never found base R functions user-friendly for
> >> aggregation purposes, or for sorting.
> >> (At least, not by comparison to SQL).
> >>
> >> But that's not the main problem.
> >> It would seem preferable to sort the data, only once.
> >> (Rather than sorting it repeatedly, or not at all).
> >>
> >> Perhaps, objects such as vectors and data.frame(s) could have a
> >> boolean attribute, to indicate if they're sorted.
> >> Or functions such as ave could have a sorted argument.
> >> In either case, if true, the function assumes the data is sorted and
> >> applies a more efficient algorithm.
> >>
> >>
> >> B.
> >>
> >>
> >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]>
> wrote:
> >> >
> >> > Dear all,
> >> >
> >> > I have two questions/suggestions about ave, but I am not sure if it's
> relevant for bug reports.
> >> >
> >> >
> >> >
> >> > 1) I have performance issues with ave in a case where I didn't expect
> it. The following code runs as expected:
> >> >
> >> > set.seed(1)
> >> >
> >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> >> >                   id2 = sample(1:3, 5e2, TRUE),
> >> >                   id3 = sample(1:5, 5e2, TRUE),
> >> >                   val = sample(1:300, 5e2, TRUE))
> >> >
> >> > df1$diff <- ave(df1$val,
> >> >                 df1$id1,
> >> >                 df1$id2,
> >> >                 df1$id3,
> >> >                 FUN = function(i) c(diff(i), 0))
> >> >
> >> > head(df1[order(df1$id1,
> >> >                df1$id2,
> >> >                df1$id3), ])
> >> >
> >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot
> allocate vector of size 1110.0 Gb):
> >> >
> >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
> >> >                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
> >> >                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
> >> >                   val = sample(1:300, 5e2 * 1e4, TRUE))
> >> >
> >> > df2$diff <- ave(df2$val,
> >> >                 df2$id1,
> >> >                 df2$id2,
> >> >                 df2$id3,
> >> >                 FUN = function(i) c(diff(i), 0))
> >> >
> >> > This use case does not seem extreme to me (e.g. aggregate et al work
> perfectly on this data.frame).
> >> > So my question is: Is this expected/intended/reasonable? i.e. Does
> ave need to be optimized?
> >> >
> >> >
> >> >
> >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed
> to avoid warnings in case of unused levels (
> https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$
> ).
> >> > Is it relevant/possible to expose the drop argument explicitly?
> >> >
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > Thomas
> >> > ______________________________________________
> >> > [hidden email] mailing list
> >> >
> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$
> >>
> >> ______________________________________________
> >> [hidden email] mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Potential improvements of ave?

Bill Dunlap-2
In reply to this post by SOEIRO Thomas
Your proposed change (roughly, replacing interaction() by
unique(paste())) slows down ave() considerably when there are long
columns with lots of repeated rows.

I think that interaction(drop=TRUE, ...) can be changed to use less
memory and be faster by making a separate branch for drop=TRUE that
uses the following idiom for finding the unique rows in a data.frame:

new.duplicated.data.frame <- function (x, incomparables = FALSE,
fromLast = FALSE, ...)
{
    dup <- !logical(nrow(x)) # all entries considered duplicated until
proven otherwise
    for(column in x) {
        dup <- dup & duplicated(column, incomparables = incomparables,
fromLast = fromLast)
    }
    dup
}

ave() could use the above directly or it could call interaction(drop=TRUE,...).

On Tue, Mar 16, 2021 at 3:50 PM SOEIRO Thomas <[hidden email]> wrote:

>
> Dear all,
>
> Thank you for your consideration on this topic.
>
> I do not have enough knowledge of R internals to join the discussion about sorting mechanisms. In fact, I did not get how ordering could help for ave as the output must maintain the order of the input (because ave returns only x and not the entiere data.frame).
>
> However, while the proposed workaround (i.e. paste0 instead of interaction, cf https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not solves the "bigger problem" of sorting, it is usable as is and solves the issue. Therefore, what do you think about it? (i.e is it relevant for a patch?)
>
> Thanks,
>
> Thomas
>
>
> > ________________________________________
> > De : Abby Spurdle <[hidden email]>
> > Envoyé : lundi 15 mars 2021 10:22
> > À : SOEIRO Thomas
> > Cc : [hidden email]
> > Objet : Re: [Rd] Potential improvements of ave?
> >
> > Hi Thomas,
> >
> > These are some great suggestions.
> > But I can't help but feel there's a much bigger problem here.
> >
> > Intuitively, the ave function could (or should) sort the data.
> > Then the indexing step becomes almost trivial, in terms of both time
> > and space complexity.
> > And the ave function is not the only example of where a problem
> > becomes much simpler, if the data is sorted.
> >
> > Historically, I've never found base R functions user-friendly for
> > aggregation purposes, or for sorting.
> > (At least, not by comparison to SQL).
> >
> > But that's not the main problem.
> > It would seem preferable to sort the data, only once.
> > (Rather than sorting it repeatedly, or not at all).
> >
> > Perhaps, objects such as vectors and data.frame(s) could have a
> > boolean attribute, to indicate if they're sorted.
> > Or functions such as ave could have a sorted argument.
> > In either case, if true, the function assumes the data is sorted and
> > applies a more efficient algorithm.
> >
> >
> > B.
> >
> >
> > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> wrote:
> >>
> >> Dear all,
> >>
> >> I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports.
> >>
> >>
> >>
> >> 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected:
> >>
> >> set.seed(1)
> >>
> >> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> >>                   id2 = sample(1:3, 5e2, TRUE),
> >>                   id3 = sample(1:5, 5e2, TRUE),
> >>                   val = sample(1:300, 5e2, TRUE))
> >>
> >> df1$diff <- ave(df1$val,
> >>                 df1$id1,
> >>                 df1$id2,
> >>                 df1$id3,
> >>                 FUN = function(i) c(diff(i), 0))
> >>
> >> head(df1[order(df1$id1,
> >>                df1$id2,
> >>                df1$id3), ])
> >>
> >> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb):
> >>
> >> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
> >>                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
> >>                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
> >>                   val = sample(1:300, 5e2 * 1e4, TRUE))
> >>
> >> df2$diff <- ave(df2$val,
> >>                 df2$id1,
> >>                 df2$id2,
> >>                 df2$id3,
> >>                 FUN = function(i) c(diff(i), 0))
> >>
> >> This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame).
> >> So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized?
> >>
> >>
> >>
> >> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
> >> Is it relevant/possible to expose the drop argument explicitly?
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Thomas
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel