Dear all,
I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports. 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected: set.seed(1) df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), id2 = sample(1:3, 5e2, TRUE), id3 = sample(1:5, 5e2, TRUE), val = sample(1:300, 5e2, TRUE)) df1$diff <- ave(df1$val, df1$id1, df1$id2, df1$id3, FUN = function(i) c(diff(i), 0)) head(df1[order(df1$id1, df1$id2, df1$id3), ]) But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb): df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), id2 = sample(1:3, 5e2 * 1e4, TRUE), id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), val = sample(1:300, 5e2 * 1e4, TRUE)) df2$diff <- ave(df2$val, df2$id1, df2$id2, df2$id3, FUN = function(i) c(diff(i), 0)) This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame). So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized? 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html). Is it relevant/possible to expose the drop argument explicitly? Thanks, Thomas ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
The bottleneck of ave is the call to interaction (i.e. not the call to split/lapply).
Therefore, the following code runs as expected (but I may miss something...): ave2 <- function (x, ..., FUN = mean) { if(missing(...)) x[] <- FUN(x) else { #g <- interaction(...) g <- paste0(...) split(x,g) <- lapply(split(x, g), FUN) } x } df2$diff <- ave2(df2$val, df2$id1, df2$id2, df2$id3, FUN = function(i) c(diff(i), 0)) Of course I can also simply solve my current issue with: df2$id123 <- paste0(df2$id1, df2$id2, df2$id3) df2$diff <- ave(df2$val, df2$id123, FUN = function(i) c(diff(i), 0)) In addition, ave2 also avoid warnings in case of unused levels (see point 2) in my previous message). ________________________________________ De : SOEIRO Thomas Envoyé : vendredi 12 mars 2021 23:59 À : [hidden email] Objet : Potential improvements of ave? Dear all, I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports. 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected: set.seed(1) df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), id2 = sample(1:3, 5e2, TRUE), id3 = sample(1:5, 5e2, TRUE), val = sample(1:300, 5e2, TRUE)) df1$diff <- ave(df1$val, df1$id1, df1$id2, df1$id3, FUN = function(i) c(diff(i), 0)) head(df1[order(df1$id1, df1$id2, df1$id3), ]) But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb): df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), id2 = sample(1:3, 5e2 * 1e4, TRUE), id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), val = sample(1:300, 5e2 * 1e4, TRUE)) df2$diff <- ave(df2$val, df2$id1, df2$id2, df2$id3, FUN = function(i) c(diff(i), 0)) This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame). So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized? 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html). Is it relevant/possible to expose the drop argument explicitly? Thanks, Thomas ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
In reply to this post by SOEIRO Thomas
Hi Thomas,
These are some great suggestions. But I can't help but feel there's a much bigger problem here. Intuitively, the ave function could (or should) sort the data. Then the indexing step becomes almost trivial, in terms of both time and space complexity. And the ave function is not the only example of where a problem becomes much simpler, if the data is sorted. Historically, I've never found base R functions user-friendly for aggregation purposes, or for sorting. (At least, not by comparison to SQL). But that's not the main problem. It would seem preferable to sort the data, only once. (Rather than sorting it repeatedly, or not at all). Perhaps, objects such as vectors and data.frame(s) could have a boolean attribute, to indicate if they're sorted. Or functions such as ave could have a sorted argument. In either case, if true, the function assumes the data is sorted and applies a more efficient algorithm. B. On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> wrote: > > Dear all, > > I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports. > > > > 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected: > > set.seed(1) > > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), > id2 = sample(1:3, 5e2, TRUE), > id3 = sample(1:5, 5e2, TRUE), > val = sample(1:300, 5e2, TRUE)) > > df1$diff <- ave(df1$val, > df1$id1, > df1$id2, > df1$id3, > FUN = function(i) c(diff(i), 0)) > > head(df1[order(df1$id1, > df1$id2, > df1$id3), ]) > > But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb): > > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), > id2 = sample(1:3, 5e2 * 1e4, TRUE), > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), > val = sample(1:300, 5e2 * 1e4, TRUE)) > > df2$diff <- ave(df2$val, > df2$id1, > df2$id2, > df2$id3, > FUN = function(i) c(diff(i), 0)) > > This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame). > So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized? > > > > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html). > Is it relevant/possible to expose the drop argument explicitly? > > > > Thanks, > > Thomas > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Hi Abby,
Thank you for your positive feedback. I agree for your general comment about sorting. For ave specifically, ordering may not help because the output must maintain the order of the input (as ave returns only x and not the entiere data.frame). Thanks, Thomas ________________________________________ De : Abby Spurdle <[hidden email]> Envoyé : lundi 15 mars 2021 10:22 À : SOEIRO Thomas Cc : [hidden email] Objet : Re: [Rd] Potential improvements of ave? EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS Hi Thomas, These are some great suggestions. But I can't help but feel there's a much bigger problem here. Intuitively, the ave function could (or should) sort the data. Then the indexing step becomes almost trivial, in terms of both time and space complexity. And the ave function is not the only example of where a problem becomes much simpler, if the data is sorted. Historically, I've never found base R functions user-friendly for aggregation purposes, or for sorting. (At least, not by comparison to SQL). But that's not the main problem. It would seem preferable to sort the data, only once. (Rather than sorting it repeatedly, or not at all). Perhaps, objects such as vectors and data.frame(s) could have a boolean attribute, to indicate if they're sorted. Or functions such as ave could have a sorted argument. In either case, if true, the function assumes the data is sorted and applies a more efficient algorithm. B. On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> wrote: > > Dear all, > > I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports. > > > > 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected: > > set.seed(1) > > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), > id2 = sample(1:3, 5e2, TRUE), > id3 = sample(1:5, 5e2, TRUE), > val = sample(1:300, 5e2, TRUE)) > > df1$diff <- ave(df1$val, > df1$id1, > df1$id2, > df1$id3, > FUN = function(i) c(diff(i), 0)) > > head(df1[order(df1$id1, > df1$id2, > df1$id3), ]) > > But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb): > > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), > id2 = sample(1:3, 5e2 * 1e4, TRUE), > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), > val = sample(1:300, 5e2 * 1e4, TRUE)) > > df2$diff <- ave(df2$val, > df2$id1, > df2$id2, > df2$id3, > FUN = function(i) c(diff(i), 0)) > > This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame). > So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized? > > > > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ ). > Is it relevant/possible to expose the drop argument explicitly? > > > > Thanks, > > Thomas > ______________________________________________ > [hidden email] mailing list > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$ ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Abby,
Vectors do have an internal mechanism for knowing that they are sorted via ALTREP (it was one of 2 core motivating features for 'smart vectors' the other being knowledge about presence of NAs). Currently I don't think we expose it at the R level, though it is part of the official C API. I don't know of any plans for this to change, but I suppose it could. Plus for functions in R itself, we could even use it without exposing it more widely. A number of functions, including sort itself, already do this in fact, but more could. I'd be interested in hearing which functions you think would particularly benefit from this. ~G On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <[hidden email]> wrote: > Hi Abby, > > Thank you for your positive feedback. > > I agree for your general comment about sorting. > > For ave specifically, ordering may not help because the output must > maintain the order of the input (as ave returns only x and not the entiere > data.frame). > > Thanks, > > Thomas > ________________________________________ > De : Abby Spurdle <[hidden email]> > Envoyé : lundi 15 mars 2021 10:22 > À : SOEIRO Thomas > Cc : [hidden email] > Objet : Re: [Rd] Potential improvements of ave? > > EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS > > Hi Thomas, > > These are some great suggestions. > But I can't help but feel there's a much bigger problem here. > > Intuitively, the ave function could (or should) sort the data. > Then the indexing step becomes almost trivial, in terms of both time > and space complexity. > And the ave function is not the only example of where a problem > becomes much simpler, if the data is sorted. > > Historically, I've never found base R functions user-friendly for > aggregation purposes, or for sorting. > (At least, not by comparison to SQL). > > But that's not the main problem. > It would seem preferable to sort the data, only once. > (Rather than sorting it repeatedly, or not at all). > > Perhaps, objects such as vectors and data.frame(s) could have a > boolean attribute, to indicate if they're sorted. > Or functions such as ave could have a sorted argument. > In either case, if true, the function assumes the data is sorted and > applies a more efficient algorithm. > > > B. > > > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> > wrote: > > > > Dear all, > > > > I have two questions/suggestions about ave, but I am not sure if it's > relevant for bug reports. > > > > > > > > 1) I have performance issues with ave in a case where I didn't expect > it. The following code runs as expected: > > > > set.seed(1) > > > > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), > > id2 = sample(1:3, 5e2, TRUE), > > id3 = sample(1:5, 5e2, TRUE), > > val = sample(1:300, 5e2, TRUE)) > > > > df1$diff <- ave(df1$val, > > df1$id1, > > df1$id2, > > df1$id3, > > FUN = function(i) c(diff(i), 0)) > > > > head(df1[order(df1$id1, > > df1$id2, > > df1$id3), ]) > > > > But when expanding the data.frame (* 1e4), ave fails (Error: cannot > allocate vector of size 1110.0 Gb): > > > > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), > > id2 = sample(1:3, 5e2 * 1e4, TRUE), > > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), > > val = sample(1:300, 5e2 * 1e4, TRUE)) > > > > df2$diff <- ave(df2$val, > > df2$id1, > > df2$id2, > > df2$id3, > > FUN = function(i) c(diff(i), 0)) > > > > This use case does not seem extreme to me (e.g. aggregate et al work > perfectly on this data.frame). > > So my question is: Is this expected/intended/reasonable? i.e. Does ave > need to be optimized? > > > > > > > > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to > avoid warnings in case of unused levels ( > https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ > ). > > Is it relevant/possible to expose the drop argument explicitly? > > > > > > > > Thanks, > > > > Thomas > > ______________________________________________ > > [hidden email] mailing list > > > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$ > > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
>>>>> Gabriel Becker
>>>>> on Mon, 15 Mar 2021 15:08:44 -0700 writes: > Abby, > Vectors do have an internal mechanism for knowing that they are sorted via > ALTREP (it was one of 2 core motivating features for 'smart vectors' the > other being knowledge about presence of NAs). > Currently I don't think we expose it at the R level, though it is part of > the official C API. I don't know of any plans for this to change, but I > suppose it could. Plus for functions in R itself, we could even use it > without exposing it more widely. A number of functions, including sort > itself, already do this in fact, but more could. I'd be interested in > hearing which functions you think would particularly benefit from this. Thank you Gabe. > ~G I vaguely remember (from Luke's docs/presentation on ALTREP) that there are some "missing parts" here. One of them the not-existing R level functionality, another may be the C code below R's is.unsorted() ... maybe is.unsorted() could get a new argument and or be re-written, moving the NA handling also to C and have that happen *after* the C code looks if it's an ALTREP object and if that "knows it's sorted". Martin > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <[hidden email]> > wrote: >> Hi Abby, >> >> Thank you for your positive feedback. >> >> I agree for your general comment about sorting. >> >> For ave specifically, ordering may not help because the output must >> maintain the order of the input (as ave returns only x and not the entiere >> data.frame). >> >> Thanks, >> >> Thomas >> ________________________________________ >> De : Abby Spurdle <[hidden email]> >> Envoyé : lundi 15 mars 2021 10:22 >> À : SOEIRO Thomas >> Cc : [hidden email] >> Objet : Re: [Rd] Potential improvements of ave? >> >> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS >> >> Hi Thomas, >> >> These are some great suggestions. >> But I can't help but feel there's a much bigger problem here. >> >> Intuitively, the ave function could (or should) sort the data. >> Then the indexing step becomes almost trivial, in terms of both time >> and space complexity. >> And the ave function is not the only example of where a problem >> becomes much simpler, if the data is sorted. >> >> Historically, I've never found base R functions user-friendly for >> aggregation purposes, or for sorting. >> (At least, not by comparison to SQL). >> >> But that's not the main problem. >> It would seem preferable to sort the data, only once. >> (Rather than sorting it repeatedly, or not at all). >> >> Perhaps, objects such as vectors and data.frame(s) could have a >> boolean attribute, to indicate if they're sorted. >> Or functions such as ave could have a sorted argument. >> In either case, if true, the function assumes the data is sorted and >> applies a more efficient algorithm. >> >> >> B. >> >> >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> >> wrote: >> > >> > Dear all, >> > >> > I have two questions/suggestions about ave, but I am not sure if it's >> relevant for bug reports. >> > >> > >> > >> > 1) I have performance issues with ave in a case where I didn't expect >> it. The following code runs as expected: >> > >> > set.seed(1) >> > >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), >> > id2 = sample(1:3, 5e2, TRUE), >> > id3 = sample(1:5, 5e2, TRUE), >> > val = sample(1:300, 5e2, TRUE)) >> > >> > df1$diff <- ave(df1$val, >> > df1$id1, >> > df1$id2, >> > df1$id3, >> > FUN = function(i) c(diff(i), 0)) >> > >> > head(df1[order(df1$id1, >> > df1$id2, >> > df1$id3), ]) >> > >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot >> allocate vector of size 1110.0 Gb): >> > >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), >> > id2 = sample(1:3, 5e2 * 1e4, TRUE), >> > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), >> > val = sample(1:300, 5e2 * 1e4, TRUE)) >> > >> > df2$diff <- ave(df2$val, >> > df2$id1, >> > df2$id2, >> > df2$id3, >> > FUN = function(i) c(diff(i), 0)) >> > >> > This use case does not seem extreme to me (e.g. aggregate et al work >> perfectly on this data.frame). >> > So my question is: Is this expected/intended/reasonable? i.e. Does ave >> need to be optimized? >> > >> > >> > >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to >> avoid warnings in case of unused levels ( >> https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ >> ). >> > Is it relevant/possible to expose the drop argument explicitly? >> > >> > >> > >> > Thanks, >> > >> > Thomas >> > ______________________________________________ >> > [hidden email] mailing list >> > >> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$ >> >> ______________________________________________ >> [hidden email] mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > [[alternative HTML version deleted]] > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
On 16 March 2021 at 10:50, Martin Maechler wrote: | I vaguely remember (from Luke's docs/presentation on ALTREP) | that there are some "missing parts" here. | One of them the not-existing R level functionality, another may be | the C code below R's is.unsorted() ... maybe is.unsorted() | could get a new argument and or be re-written, moving the NA | handling also to C and have that happen *after* the C code looks | if it's an ALTREP object and if that "knows it's sorted". Somewhat related: I recently tried to demonstrate that, say, a 1e2 and a 1e6 vector each with one added NA should have equal "O(1)" time under anyNA(), yet apparently they do not---so the other magic bit (about knowing NAness) may not propagate. (That was with R 4.0.4.) I may also have missed a crucial bit about enabling ALTREPness here, yet the "gist of it" always was (AFAIK) that it is supposedly seamless. Dirk -- https://dirk.eddelbuettel.com | @eddelbuettel | [hidden email] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
In reply to this post by Gabriel Becker-2
There are some relatively obvious examples:
unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split Also, many timeseries, graphics and spline functions are dependent on the order. In the case of data.frame(s), a boolean flag would probably need to be extended to allow for multiple column sorting, and ascending/descending options. On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker <[hidden email]> wrote: > > Abby, > > Vectors do have an internal mechanism for knowing that they are sorted via ALTREP (it was one of 2 core motivating features for 'smart vectors' the other being knowledge about presence of NAs). > > Currently I don't think we expose it at the R level, though it is part of the official C API. I don't know of any plans for this to change, but I suppose it could. Plus for functions in R itself, we could even use it without exposing it more widely. A number of functions, including sort itself, already do this in fact, but more could. I'd be interested in hearing which functions you think would particularly benefit from this. > > ~G > > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <[hidden email]> wrote: >> >> Hi Abby, >> >> Thank you for your positive feedback. >> >> I agree for your general comment about sorting. >> >> For ave specifically, ordering may not help because the output must maintain the order of the input (as ave returns only x and not the entiere data.frame). >> >> Thanks, >> >> Thomas >> ________________________________________ >> De : Abby Spurdle <[hidden email]> >> Envoyé : lundi 15 mars 2021 10:22 >> À : SOEIRO Thomas >> Cc : [hidden email] >> Objet : Re: [Rd] Potential improvements of ave? >> >> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS >> >> Hi Thomas, >> >> These are some great suggestions. >> But I can't help but feel there's a much bigger problem here. >> >> Intuitively, the ave function could (or should) sort the data. >> Then the indexing step becomes almost trivial, in terms of both time >> and space complexity. >> And the ave function is not the only example of where a problem >> becomes much simpler, if the data is sorted. >> >> Historically, I've never found base R functions user-friendly for >> aggregation purposes, or for sorting. >> (At least, not by comparison to SQL). >> >> But that's not the main problem. >> It would seem preferable to sort the data, only once. >> (Rather than sorting it repeatedly, or not at all). >> >> Perhaps, objects such as vectors and data.frame(s) could have a >> boolean attribute, to indicate if they're sorted. >> Or functions such as ave could have a sorted argument. >> In either case, if true, the function assumes the data is sorted and >> applies a more efficient algorithm. >> >> >> B. >> >> >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> wrote: >> > >> > Dear all, >> > >> > I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports. >> > >> > >> > >> > 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected: >> > >> > set.seed(1) >> > >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), >> > id2 = sample(1:3, 5e2, TRUE), >> > id3 = sample(1:5, 5e2, TRUE), >> > val = sample(1:300, 5e2, TRUE)) >> > >> > df1$diff <- ave(df1$val, >> > df1$id1, >> > df1$id2, >> > df1$id3, >> > FUN = function(i) c(diff(i), 0)) >> > >> > head(df1[order(df1$id1, >> > df1$id2, >> > df1$id3), ]) >> > >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb): >> > >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), >> > id2 = sample(1:3, 5e2 * 1e4, TRUE), >> > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), >> > val = sample(1:300, 5e2 * 1e4, TRUE)) >> > >> > df2$diff <- ave(df2$val, >> > df2$id1, >> > df2$id2, >> > df2$id3, >> > FUN = function(i) c(diff(i), 0)) >> > >> > This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame). >> > So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized? >> > >> > >> > >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ ). >> > Is it relevant/possible to expose the drop argument explicitly? >> > >> > >> > >> > Thanks, >> > >> > Thomas >> > ______________________________________________ >> > [hidden email] mailing list >> > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$ >> >> ______________________________________________ >> [hidden email] mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Dear all,
Thank you for your consideration on this topic. I do not have enough knowledge of R internals to join the discussion about sorting mechanisms. In fact, I did not get how ordering could help for ave as the output must maintain the order of the input (because ave returns only x and not the entiere data.frame). However, while the proposed workaround (i.e. paste0 instead of interaction, cf https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not solves the "bigger problem" of sorting, it is usable as is and solves the issue. Therefore, what do you think about it? (i.e is it relevant for a patch?) Thanks, Thomas > ________________________________________ > De : Abby Spurdle <[hidden email]> > Envoyé : lundi 15 mars 2021 10:22 > À : SOEIRO Thomas > Cc : [hidden email] > Objet : Re: [Rd] Potential improvements of ave? > > Hi Thomas, > > These are some great suggestions. > But I can't help but feel there's a much bigger problem here. > > Intuitively, the ave function could (or should) sort the data. > Then the indexing step becomes almost trivial, in terms of both time > and space complexity. > And the ave function is not the only example of where a problem > becomes much simpler, if the data is sorted. > > Historically, I've never found base R functions user-friendly for > aggregation purposes, or for sorting. > (At least, not by comparison to SQL). > > But that's not the main problem. > It would seem preferable to sort the data, only once. > (Rather than sorting it repeatedly, or not at all). > > Perhaps, objects such as vectors and data.frame(s) could have a > boolean attribute, to indicate if they're sorted. > Or functions such as ave could have a sorted argument. > In either case, if true, the function assumes the data is sorted and > applies a more efficient algorithm. > > > B. > > > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> wrote: >> >> Dear all, >> >> I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports. >> >> >> >> 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected: >> >> set.seed(1) >> >> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), >> id2 = sample(1:3, 5e2, TRUE), >> id3 = sample(1:5, 5e2, TRUE), >> val = sample(1:300, 5e2, TRUE)) >> >> df1$diff <- ave(df1$val, >> df1$id1, >> df1$id2, >> df1$id3, >> FUN = function(i) c(diff(i), 0)) >> >> head(df1[order(df1$id1, >> df1$id2, >> df1$id3), ]) >> >> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb): >> >> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), >> id2 = sample(1:3, 5e2 * 1e4, TRUE), >> id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), >> val = sample(1:300, 5e2 * 1e4, TRUE)) >> >> df2$diff <- ave(df2$val, >> df2$id1, >> df2$id2, >> df2$id3, >> FUN = function(i) c(diff(i), 0)) >> >> This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame). >> So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized? >> >> >> >> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html). >> Is it relevant/possible to expose the drop argument explicitly? >> >> >> >> Thanks, >> >> Thomas [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
In reply to this post by aBBy Spurdle, ⍺XY
Hi Abby,
I actually have a patch submitted that does this for unique/duplicated (only numeric cases I think) but it is, as patches from external contributors go, quite sizable which means it requires a correspondingly large amount of an R-core member's time and energy to vet and consider. It is in the queue, and so, I expect (/hope, provided I didn't make a mistake) it will be incorporated at some point. ( https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17993) You are correct that the speedups are quite significant for calling unique/duplicated on large vectors that know they are sorted: Speedup on my machine for a fairly sizable vector (length 1e7) ranges from about ~10x in the densely duplicated case up to ~60-70x in the sparsely duplicated case for duplicated(). For unique() it seems to range from ~10x in the densely duplicated case to ~15 in the spare case. I had thought that min and max already did this, but looking now, they don't seem to by default, thought ALTREP classes themselves do have an option of setting a min/max method, which would be hit. That does seem like low-hanging fruit, I agree, though in many cases the slow down from a single pass over the data to get a min probably isn't earthshattering. The others do seem like they could benefit as well. Best, ~G On Tue, Mar 16, 2021 at 2:54 PM Abby Spurdle <[hidden email]> wrote: > There are some relatively obvious examples: > unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split > > Also, many timeseries, graphics and spline functions are dependent on the > order. > > In the case of data.frame(s), a boolean flag would probably need to be > extended to allow for multiple column sorting, and > ascending/descending options. > > On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker <[hidden email]> > wrote: > > > > Abby, > > > > Vectors do have an internal mechanism for knowing that they are sorted > via ALTREP (it was one of 2 core motivating features for 'smart vectors' > the other being knowledge about presence of NAs). > > > > Currently I don't think we expose it at the R level, though it is part > of the official C API. I don't know of any plans for this to change, but I > suppose it could. Plus for functions in R itself, we could even use it > without exposing it more widely. A number of functions, including sort > itself, already do this in fact, but more could. I'd be interested in > hearing which functions you think would particularly benefit from this. > > > > ~G > > > > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas <[hidden email]> > wrote: > >> > >> Hi Abby, > >> > >> Thank you for your positive feedback. > >> > >> I agree for your general comment about sorting. > >> > >> For ave specifically, ordering may not help because the output must > maintain the order of the input (as ave returns only x and not the entiere > data.frame). > >> > >> Thanks, > >> > >> Thomas > >> ________________________________________ > >> De : Abby Spurdle <[hidden email]> > >> Envoyé : lundi 15 mars 2021 10:22 > >> À : SOEIRO Thomas > >> Cc : [hidden email] > >> Objet : Re: [Rd] Potential improvements of ave? > >> > >> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS > >> > >> Hi Thomas, > >> > >> These are some great suggestions. > >> But I can't help but feel there's a much bigger problem here. > >> > >> Intuitively, the ave function could (or should) sort the data. > >> Then the indexing step becomes almost trivial, in terms of both time > >> and space complexity. > >> And the ave function is not the only example of where a problem > >> becomes much simpler, if the data is sorted. > >> > >> Historically, I've never found base R functions user-friendly for > >> aggregation purposes, or for sorting. > >> (At least, not by comparison to SQL). > >> > >> But that's not the main problem. > >> It would seem preferable to sort the data, only once. > >> (Rather than sorting it repeatedly, or not at all). > >> > >> Perhaps, objects such as vectors and data.frame(s) could have a > >> boolean attribute, to indicate if they're sorted. > >> Or functions such as ave could have a sorted argument. > >> In either case, if true, the function assumes the data is sorted and > >> applies a more efficient algorithm. > >> > >> > >> B. > >> > >> > >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> > wrote: > >> > > >> > Dear all, > >> > > >> > I have two questions/suggestions about ave, but I am not sure if it's > relevant for bug reports. > >> > > >> > > >> > > >> > 1) I have performance issues with ave in a case where I didn't expect > it. The following code runs as expected: > >> > > >> > set.seed(1) > >> > > >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), > >> > id2 = sample(1:3, 5e2, TRUE), > >> > id3 = sample(1:5, 5e2, TRUE), > >> > val = sample(1:300, 5e2, TRUE)) > >> > > >> > df1$diff <- ave(df1$val, > >> > df1$id1, > >> > df1$id2, > >> > df1$id3, > >> > FUN = function(i) c(diff(i), 0)) > >> > > >> > head(df1[order(df1$id1, > >> > df1$id2, > >> > df1$id3), ]) > >> > > >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot > allocate vector of size 1110.0 Gb): > >> > > >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), > >> > id2 = sample(1:3, 5e2 * 1e4, TRUE), > >> > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), > >> > val = sample(1:300, 5e2 * 1e4, TRUE)) > >> > > >> > df2$diff <- ave(df2$val, > >> > df2$id1, > >> > df2$id2, > >> > df2$id3, > >> > FUN = function(i) c(diff(i), 0)) > >> > > >> > This use case does not seem extreme to me (e.g. aggregate et al work > perfectly on this data.frame). > >> > So my question is: Is this expected/intended/reasonable? i.e. Does > ave need to be optimized? > >> > > >> > > >> > > >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed > to avoid warnings in case of unused levels ( > https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ > ). > >> > Is it relevant/possible to expose the drop argument explicitly? > >> > > >> > > >> > > >> > Thanks, > >> > > >> > Thomas > >> > ______________________________________________ > >> > [hidden email] mailing list > >> > > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjUzdLFM1$ > >> > >> ______________________________________________ > >> [hidden email] mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
In reply to this post by SOEIRO Thomas
Your proposed change (roughly, replacing interaction() by
unique(paste())) slows down ave() considerably when there are long columns with lots of repeated rows. I think that interaction(drop=TRUE, ...) can be changed to use less memory and be faster by making a separate branch for drop=TRUE that uses the following idiom for finding the unique rows in a data.frame: new.duplicated.data.frame <- function (x, incomparables = FALSE, fromLast = FALSE, ...) { dup <- !logical(nrow(x)) # all entries considered duplicated until proven otherwise for(column in x) { dup <- dup & duplicated(column, incomparables = incomparables, fromLast = fromLast) } dup } ave() could use the above directly or it could call interaction(drop=TRUE,...). On Tue, Mar 16, 2021 at 3:50 PM SOEIRO Thomas <[hidden email]> wrote: > > Dear all, > > Thank you for your consideration on this topic. > > I do not have enough knowledge of R internals to join the discussion about sorting mechanisms. In fact, I did not get how ordering could help for ave as the output must maintain the order of the input (because ave returns only x and not the entiere data.frame). > > However, while the proposed workaround (i.e. paste0 instead of interaction, cf https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not solves the "bigger problem" of sorting, it is usable as is and solves the issue. Therefore, what do you think about it? (i.e is it relevant for a patch?) > > Thanks, > > Thomas > > > > ________________________________________ > > De : Abby Spurdle <[hidden email]> > > Envoyé : lundi 15 mars 2021 10:22 > > À : SOEIRO Thomas > > Cc : [hidden email] > > Objet : Re: [Rd] Potential improvements of ave? > > > > Hi Thomas, > > > > These are some great suggestions. > > But I can't help but feel there's a much bigger problem here. > > > > Intuitively, the ave function could (or should) sort the data. > > Then the indexing step becomes almost trivial, in terms of both time > > and space complexity. > > And the ave function is not the only example of where a problem > > becomes much simpler, if the data is sorted. > > > > Historically, I've never found base R functions user-friendly for > > aggregation purposes, or for sorting. > > (At least, not by comparison to SQL). > > > > But that's not the main problem. > > It would seem preferable to sort the data, only once. > > (Rather than sorting it repeatedly, or not at all). > > > > Perhaps, objects such as vectors and data.frame(s) could have a > > boolean attribute, to indicate if they're sorted. > > Or functions such as ave could have a sorted argument. > > In either case, if true, the function assumes the data is sorted and > > applies a more efficient algorithm. > > > > > > B. > > > > > > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <[hidden email]> wrote: > >> > >> Dear all, > >> > >> I have two questions/suggestions about ave, but I am not sure if it's relevant for bug reports. > >> > >> > >> > >> 1) I have performance issues with ave in a case where I didn't expect it. The following code runs as expected: > >> > >> set.seed(1) > >> > >> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), > >> id2 = sample(1:3, 5e2, TRUE), > >> id3 = sample(1:5, 5e2, TRUE), > >> val = sample(1:300, 5e2, TRUE)) > >> > >> df1$diff <- ave(df1$val, > >> df1$id1, > >> df1$id2, > >> df1$id3, > >> FUN = function(i) c(diff(i), 0)) > >> > >> head(df1[order(df1$id1, > >> df1$id2, > >> df1$id3), ]) > >> > >> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate vector of size 1110.0 Gb): > >> > >> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), > >> id2 = sample(1:3, 5e2 * 1e4, TRUE), > >> id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), > >> val = sample(1:300, 5e2 * 1e4, TRUE)) > >> > >> df2$diff <- ave(df2$val, > >> df2$id1, > >> df2$id2, > >> df2$id3, > >> FUN = function(i) c(diff(i), 0)) > >> > >> This use case does not seem extreme to me (e.g. aggregate et al work perfectly on this data.frame). > >> So my question is: Is this expected/intended/reasonable? i.e. Does ave need to be optimized? > >> > >> > >> > >> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to avoid warnings in case of unused levels (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html). > >> Is it relevant/possible to expose the drop argument explicitly? > >> > >> > >> > >> Thanks, > >> > >> Thomas > ______________________________________________ > [hidden email] mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel ______________________________________________ [hidden email] mailing list https://stat.ethz.ch/mailman/listinfo/r-devel |
Free forum by Nabble | Edit this page |