Re: Proposed speedup of ifelse

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: Proposed speedup of ifelse

Gabe Becker
Hugh,

(Note I speak for myself only and not for R-core) Thanks for looking into
this. I think it's great to have community members that are interested in
contributing to R and helping it continue to get better.

And I think, and my local experiments bear out, that using anyNA as a
fastpass condition does allow us to get a significant speedup over what's
in there now. To do so, though, I took a somewhat different approach than
your proposal:

ifelse2 = function(test, yes, no) {
    if (is.atomic(test)) {
        if (typeof(test) != "logical")
            storage.mode(test) <- "logical"
        if (length(test) == 1 && is.null(attributes(test))) {
            if (is.na(test))
                return(NA)
            else if (test) {
                if (length(yes) == 1) {
                    yat <- attributes(yes)
                    if (is.null(yat) || (is.function(yes) &&
identical(names(yat),

 "srcref")))
                        return(yes)
                }
            }
            else if (length(no) == 1) {
                nat <- attributes(no)
                if (is.null(nat) || (is.function(no) &&
identical(names(nat),

"srcref")))
                    return(no)
            }
        }
    }
    else test <- if (isS4(test))
                     methods::as(test, "logical")
                 else as.logical(test)
    ## this is to ensure the documented behavior re: attributes of result
    ans <- test
    len = length(ans)
    if(nonas <- !anyNA(test)) {
        ypos = test
        npos = !test
    } else {
        ok <- !(nas <- is.na(test))
        ypos = test & ok
        npos = !test & ok
    }
    if(any(ypos, na.rm = TRUE)) ##equivalent to any(test[ok])
        ans[ypos] = rep(yes, length.out = len)[ypos]
    if(any(npos, na.rm = TRUE)) ##equivalent to any(!test[ok])
        ans[npos] = rep(no, length.out = len)[npos]
    ## This is in the original but I don't see why it's necessary
    ## due to ans being initialized to test. The NAs should already
    ## be there...
    if(!nonas)
        ans[nas] = NA
    ans
}

On my machine, after an initial call to invoke the JIT and get the function
compiled, this is faster at lengths of test 100 and 10000 (with the lengths
of yes and no at 10% of the length of test) by ~1.7x and ~2x respectively
for no NAs and ~1.3x and ~1.6x respectively for 10% NAs.

The key, from what I saw, is to avoid as much &ing and subsetting as we
can.  If there are no NAs none of the test&ok or test[ok] operations do
anything because ok has only TRUEs in it. Even when there are, we want to
do the & once and avoid test[ok].

There are further savings for the NAs present case if I'm correct about the
ans[nas] = NA being redundant and we're able to remove that as well.

I'm happy to submit this as a patch and share credit if that is ok with
you. Let me know.

Best,

On Thu, May 3, 2018 at 9:58 PM, Hugh Parsonage <[hidden email]>
wrote:

> Thanks Radford. I concur with all your points. I've attempted to address
> the issues you raised through the github.io post.  The new method appears
> to be slower for test lengths < 100 and possibly longer lengths (not just <
> 10). Of course length(test) < 100 is very quick, so I simply added this to
> the conditions that cause the old ifelse method to be invoked. I'll leave
> it to R-core to decide whether or not the benefits for longer vectors are
> worth it.
>
>
>
>
>
>
> On Fri, 4 May 2018 at 01:01 Radford Neal <[hidden email]> wrote:
>
> > > I propose a patch to ifelse that leverages anyNA(test) to achieve an
> > > improvement in performance. For a test vector of length 10, the change
> > > nearly halves the time taken and for a test of length 1 million, there
> > > is a tenfold increase in speed. Even for small vectors, the
> > > distributions of timings between the old and the proposed ifelse do
> > > not intersect.
> >
> > For smaller vectors, your results are significantly affected by your
> > invoking the old version via base::ifelse.  You could try defining
> > your new version as new_ifelse, and invoking the old version as just
> > ifelse.  There might still be some issues with the two versions having
> > different context w.r.t environments, and hence looking up functions
> > in different ways.  You could copy the code of the old version and
> > define it in the global environment just like new_ifelse.
> >
> > When using ifelse rather than base::ifelse, it seems the new version
> > is slower for vectors of length 10, but faster for long vectors.
> >
> > Also, I'd use system.time rather than microbenchmark.  The latter will
> > mix invocations of the two functions in a way where it is unclear that
> > garbage collection time will be fairly attributed.  Also, it's a bit
> > silly to plot the distributions of times, which will mostly reflect
> > variations in when garbage collections at various levels occur - just
> > the mean is what is relevant.
> >
> > Regards,
> >
> >    Radford Neal
> >
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>


--
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Proposed speedup of ifelse

Hugh Parsonage
Thanks Gabe, and yes happy for you to submit the patch.

Some thoughts I've had in the interim:

1. Matt Dowle encouraged me to submit a patch as it could improve the
CRAN check farm timings since `ifelse` is presumably used a lot. I
thought the biggest benefit might come by improving the speed when
length(test) is large, though on reflection this may not be the case.
ifelse may be used much more often on small vectors.

2. a[which(x)] seems to be faster than a[x & !is.na(x)] but, again,
not really enough for small x

3. I think the documentation should note that if test is an S4 object
the value of ifelse may not retain its attributes in the first or
second sentence of the "Value" section.


On 8 May 2018 at 12:04, Gabe Becker <[hidden email]> wrote:

> Hugh,
>
> (Note I speak for myself only and not for R-core) Thanks for looking into
> this. I think it's great to have community members that are interested in
> contributing to R and helping it continue to get better.
>
> And I think, and my local experiments bear out, that using anyNA as a
> fastpass condition does allow us to get a significant speedup over what's in
> there now. To do so, though, I took a somewhat different approach than your
> proposal:
>
> ifelse2 = function(test, yes, no) {
>     if (is.atomic(test)) {
>         if (typeof(test) != "logical")
>             storage.mode(test) <- "logical"
>         if (length(test) == 1 && is.null(attributes(test))) {
>             if (is.na(test))
>                 return(NA)
>             else if (test) {
>                 if (length(yes) == 1) {
>                     yat <- attributes(yes)
>                     if (is.null(yat) || (is.function(yes) &&
> identical(names(yat),
>
> "srcref")))
>                         return(yes)
>                 }
>             }
>             else if (length(no) == 1) {
>                 nat <- attributes(no)
>                 if (is.null(nat) || (is.function(no) &&
> identical(names(nat),
>
> "srcref")))
>                     return(no)
>             }
>         }
>     }
>     else test <- if (isS4(test))
>                      methods::as(test, "logical")
>                  else as.logical(test)
>     ## this is to ensure the documented behavior re: attributes of result
>     ans <- test
>     len = length(ans)
>     if(nonas <- !anyNA(test)) {
>         ypos = test
>         npos = !test
>     } else {
>         ok <- !(nas <- is.na(test))
>         ypos = test & ok
>         npos = !test & ok
>     }
>     if(any(ypos, na.rm = TRUE)) ##equivalent to any(test[ok])
>         ans[ypos] = rep(yes, length.out = len)[ypos]
>     if(any(npos, na.rm = TRUE)) ##equivalent to any(!test[ok])
>         ans[npos] = rep(no, length.out = len)[npos]
>     ## This is in the original but I don't see why it's necessary
>     ## due to ans being initialized to test. The NAs should already
>     ## be there...
>     if(!nonas)
>         ans[nas] = NA
>     ans
> }
>
> On my machine, after an initial call to invoke the JIT and get the function
> compiled, this is faster at lengths of test 100 and 10000 (with the lengths
> of yes and no at 10% of the length of test) by ~1.7x and ~2x respectively
> for no NAs and ~1.3x and ~1.6x respectively for 10% NAs.
>
> The key, from what I saw, is to avoid as much &ing and subsetting as we can.
> If there are no NAs none of the test&ok or test[ok] operations do anything
> because ok has only TRUEs in it. Even when there are, we want to do the &
> once and avoid test[ok].
>
> There are further savings for the NAs present case if I'm correct about the
> ans[nas] = NA being redundant and we're able to remove that as well.
>
> I'm happy to submit this as a patch and share credit if that is ok with you.
> Let me know.
>
> Best,
>
> On Thu, May 3, 2018 at 9:58 PM, Hugh Parsonage <[hidden email]>
> wrote:
>>
>> Thanks Radford. I concur with all your points. I've attempted to address
>> the issues you raised through the github.io post.  The new method appears
>> to be slower for test lengths < 100 and possibly longer lengths (not just
>> <
>> 10). Of course length(test) < 100 is very quick, so I simply added this to
>> the conditions that cause the old ifelse method to be invoked. I'll leave
>> it to R-core to decide whether or not the benefits for longer vectors are
>> worth it.
>>
>>
>>
>>
>>
>>
>> On Fri, 4 May 2018 at 01:01 Radford Neal <[hidden email]> wrote:
>>
>> > > I propose a patch to ifelse that leverages anyNA(test) to achieve an
>> > > improvement in performance. For a test vector of length 10, the change
>> > > nearly halves the time taken and for a test of length 1 million, there
>> > > is a tenfold increase in speed. Even for small vectors, the
>> > > distributions of timings between the old and the proposed ifelse do
>> > > not intersect.
>> >
>> > For smaller vectors, your results are significantly affected by your
>> > invoking the old version via base::ifelse.  You could try defining
>> > your new version as new_ifelse, and invoking the old version as just
>> > ifelse.  There might still be some issues with the two versions having
>> > different context w.r.t environments, and hence looking up functions
>> > in different ways.  You could copy the code of the old version and
>> > define it in the global environment just like new_ifelse.
>> >
>> > When using ifelse rather than base::ifelse, it seems the new version
>> > is slower for vectors of length 10, but faster for long vectors.
>> >
>> > Also, I'd use system.time rather than microbenchmark.  The latter will
>> > mix invocations of the two functions in a way where it is unclear that
>> > garbage collection time will be fairly attributed.  Also, it's a bit
>> > silly to plot the distributions of times, which will mostly reflect
>> > variations in when garbage collections at various levels occur - just
>> > the mean is what is relevant.
>> >
>> > Regards,
>> >
>> >    Radford Neal
>> >
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
>
> --
> Gabriel Becker, Ph.D
> Scientist
> Bioinformatics and Computational Biology
> Genentech Research

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel