anyNA() performance on vectors of POSIXct


Harvey Smith
Inside the C implementation of anyNA(), the legacy any(is.na(x)) code path is
used whenever x is an OBJECT().  A vector of POSIXct is an OBJECT(), even
though it is also TYPEOF(x) == REALSXP.  It therefore skips the faster
ITERATE_BY_REGION path, which is typically 5x faster in my testing.
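
Both properties can be checked from R. The following is a minimal sketch; the variable name `x` is only for illustration, and it shows the two answers agreeing for POSIXct, not the speed difference.

```r
# A POSIXct vector is a classed object stored as doubles.
x <- Sys.time() + 1:5
is.object(x)        # TRUE: this is what OBJECT(x) tests in C
typeof(x)           # "double", i.e. REALSXP
class(x)            # "POSIXct" "POSIXt"

# Dropping the class leaves the same numbers, so both calls agree here.
x[2] <- NA
anyNA(x)            # TRUE, via the any(is.na(x)) path
anyNA(unclass(x))   # TRUE, via the plain double path
```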

Is the OBJECT() condition really necessary, or could it be moved after the
switch() for the individual TYPEOF(x) ITERATE_BY_REGION calls?

# Script to demonstrate the performance difference between x as an OBJECT
# and as a plain vector, using unclass()
x.posixct = Sys.time() + 1:1e6
microbenchmark::microbenchmark(
  any(is.na( x.posixct )),
  anyNA( x.posixct ),
  anyNA( unclass(x.posixct) ),
  unit='ms')



static Rboolean anyNA(SEXP call, SEXP op, SEXP args, SEXP env)
{
  SEXP x = CAR(args);
  SEXPTYPE xT = TYPEOF(x);
  Rboolean isList =  (xT == VECSXP || xT == LISTSXP), recursive = FALSE;

  if (isList && length(args) > 1) recursive = asLogical(CADR(args));
  // The condition in question: every classed object takes the slow path,
  // which evaluates any(is.na(x)) so that is.na() methods are dispatched.
  if (OBJECT(x) || (isList && !recursive)) {
    SEXP e0 = PROTECT(lang2(install("is.na"), x));
    SEXP e = PROTECT(lang2(install("any"), e0));
    SEXP res = PROTECT(eval(e, env));
    int ans = asLogical(res);
    UNPROTECT(3);
    return ans == 1; // so NA answer is false.
  }

  // Fast path for plain vectors: scan the data region by region.
  R_xlen_t i, n = xlength(x);
  switch (xT) {
    case REALSXP:
    {
      if(REAL_NO_NA(x)) // x may already be known to contain no NA
        return FALSE;
      ITERATE_BY_REGION(x, xD, i, nbatch, double, REAL, {
        for (int k = 0; k < nbatch; k++)
          if (ISNAN(xD[k]))
            return TRUE;
      });
      break;
    }
    // ... (excerpt ends here; the remaining cases are not shown)


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: anyNA() performance on vectors of POSIXct

Joshua Ulrich
On Wed, May 1, 2019 at 7:45 AM Harvey Smith <[hidden email]> wrote:

>
> Inside of the anyNA() function, it will use the legacy any(is.na()) code if
> x is an OBJECT().  If x is a vector of POSIXct, it will be an OBJECT(), but
> it is also TYPEOF(x) == REALSXP.  Therefore, it will skip the faster
> ITERATE_BY_REGION, which is typically 5x faster in my testing.
>
> Is the OBJECT() condition really necessary, or could it be moved after the
> switch() for the individual TYPEOF(x) ITERATE_BY_REGION calls?
>

I'm interested in this as well, because it causes performance
degradation in xts subsetting:
https://github.com/joshuaulrich/xts/issues/296

Would it be possible to special-case POSIXct, and perhaps other types
defined in base+recommended packages?
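
Until something changes in base R, a package could special-case these classes at the R level. The sketch below is only an illustration: `anyNA_fast` is a hypothetical helper (not part of base R or xts), and it assumes that POSIXct and Date store missing values as NA in the underlying double vector.

```r
# Hypothetical helper, not part of base R or xts.
# For classes known to store missing values as NA in the underlying
# double vector, drop the class so anyNA() takes its fast path.
anyNA_fast <- function(x) {
  if (inherits(x, c("POSIXct", "Date"))) {
    anyNA(unclass(x))
  } else {
    anyNA(x)   # keep normal dispatch for everything else
  }
}

x <- Sys.time() + 1:1e5
anyNA_fast(x)            # FALSE
x[10] <- NA
anyNA_fast(x)            # TRUE
anyNA_fast(Sys.Date())   # FALSE
```

One caveat with this design: a class that extends POSIXct and defines its own is.na() method would also match inherits(), and that method would be bypassed.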

Best,
Josh

--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com
R/Finance 2019 | www.rinfinance.com


Re: anyNA() performance on vectors of POSIXct

Martin Maechler
In reply to this post by Harvey Smith
>>>>> Harvey Smith
>>>>>     on Wed, 1 May 2019 03:20:55 -0400 writes:

    > Inside of the anyNA() function, it will use the legacy any(is.na()) code if
    > x is an OBJECT().  If x is a vector of POSIXct, it will be an OBJECT(), but
    > it is also TYPEOF(x) == REALSXP.  Therefore, it will skip the faster
    > ITERATE_BY_REGION, which is typically 5x faster in my testing.

    > Is the OBJECT() condition really necessary, or could it be moved after the
    > switch() for the individual TYPEOF(x) ITERATE_BY_REGION calls?

 "Necessary?":  yes, in the following sense:

When it was introduced, the idea of anyNA(.) was that it
should be equivalent to (but often faster than)  any(is.na(.)).
anyNA() was only introduced quite recently (*), whereas
many (S3 and S4) classes already had  is.na() methods defined
for them but -- initially at least -- no anyNA() method.

So to ensure  the equivalence    anyNA(x)  ===   any(is.na(x))
for "all" R objects 'x', that OBJECT(.) condition was
important and necessary.
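
A small sketch of why the equivalence depends on that condition. The class "sentinel" and its -999 convention are hypothetical, chosen only to give a class whose is.na() method disagrees with the raw data.

```r
# Hypothetical S3 class in which -999 marks a missing value.
x <- structure(c(1, -999, 3), class = "sentinel")

# The class defines is.na() but no anyNA() method.
is.na.sentinel <- function(x) unclass(x) == -999

anyNA(x)            # TRUE: OBJECT(x) is set, so any(is.na(x)) is evaluated
any(is.na(x))       # TRUE: the equivalent that anyNA() must match
anyNA(unclass(x))   # FALSE: the raw doubles contain no NA
```

If the OBJECT(.) check were simply dropped, the first call would behave like the third and silently change its answer for such classes.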

Still, being the person who added  anyNA() to R,
I'm naturally sympathetic to making it faster in cases such as
"Date" or "POSIXct" objects.

I'd find it ugly to test for these classes specifically in the C code, via
the equivalent of  inherits(., "POSIXct")
  {{ *NOT* via the really wrong  class(.)[[1]] == "POSIXct"
      that I see in some "experts" R code, because that fails
      for all class extensions ! }},
but that may still be an option.
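
To illustrate the difference between the two tests, here is a short sketch; the subclass name "myTime" is made up for the example.

```r
# Hypothetical subclass that extends POSIXct.
x <- Sys.time() + 1:3
class(x) <- c("myTime", class(x))

class(x)                     # "myTime" "POSIXct" "POSIXt"
inherits(x, "POSIXct")       # TRUE: checks the whole class vector
class(x)[[1]] == "POSIXct"   # FALSE: only looks at the first element
```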

Alternatively, one *could* consider changing the API and
declare that for atomic types with a class {i.e. OBJECT(.)},
*if* there is no anyNA() method, anyNA() will use the "atomic"
fast method instead of  any(is.na(.)).

This may break existing code in packages, but the maintainers of
that code could solve the problems by providing  anyNA(.)
methods for their objects.
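
What such a method might look like for a package class, as a rough sketch. The class "sentinel", its -999 convention, and the constructor `new_sentinel` are hypothetical; only the fact that anyNA() dispatches S3 methods comes from R itself.

```r
# Hypothetical S3 class in which -999 marks a missing value.
new_sentinel <- function(v) structure(v, class = "sentinel")

# An explicit anyNA() method. anyNA() is an internal generic, so this is
# dispatched before any default code path is considered.
anyNA.sentinel <- function(x, recursive = FALSE) {
  v <- unclass(x)
  any(is.na(v) | v == -999)
}

anyNA(new_sentinel(c(1, -999, 3)))   # TRUE
anyNA(new_sentinel(c(1, 2, 3)))      # FALSE
```

With a method like this in place, the class keeps its own definition of "missing" whichever default anyNA() uses for classed atomic vectors.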

Other opinions / ideas ?

Martin Maechler
ETH Zurich / R Core Team

   
--
*) in Spring 2013, but too late for R 3.0.0;
   "recently", considering R's history starting with S in the early 1980s



Re: anyNA() performance on vectors of POSIXct

Harvey Smith
I think there was a similar discussion when I raised the issue of
interpreting the sort order for an object versus its underlying type.  In
this anyNA() example it is is.na() for the object versus is.na() for the
type, whereas in the discussion quoted below, which Gabriel Becker raised, it
was the sort ordering.  The two seem to be related: in both, vectors of
POSIXct are handled as objects instead of as the underlying numeric type.

> So looking at this, it is because is.object(x.posixct) returns true, which
> means sort.default does x[order(x, <bla bla bla>)], which ALTREP is not
> currently (and may not ever be?) smart enough to catch on its own and know
> is sorted.
>
> It's true we could add something after that to wrap it in what is called a
> wrapper ALTREP which would know it's sorted, but we don't do that
> currently and I'm not sure we actually should in the general case. I'm not
> convinced it's safe to assume an object class's defined ordering will match
> the ordering of an underlying double/int representation. I believe we ran
> into something similar with deferred string conversions from integers (I
> think, possibly doubles) where the int had sortedness information but that
> wasn't correct for the *character vector* the ALTREP ultimately
> represented.
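
The string-conversion case can be reproduced with ordinary vectors. This is only a sketch of the underlying ordering mismatch, not of ALTREP's deferred conversion itself; the names `ints` and `chr` are illustrative.

```r
ints <- c(1L, 2L, 10L)
is.unsorted(ints)                 # FALSE: sorted as integers

chr <- as.character(ints)         # "1" "2" "10"
# method = "radix" sorts character vectors in the C locale,
# which keeps the result independent of the session's collation.
sort(chr, method = "radix")       # "1" "10" "2"
identical(sort(chr, method = "radix"), chr)   # FALSE
```

So sortedness known for the integer representation does not carry over to the character vector derived from it.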

http://r.789695.n4.nabble.com/unsorted-suggestion-for-performance-improvement-and-ALTREP-support-for-POSIXct-td4754634.html


On Tue, May 21, 2019 at 12:04 PM Martin Maechler <[hidden email]>
wrote:

> [...]