Question re: NA, NaNs in R

Kevin Ushey
Hi R-devel,

I have a question about the differentiation between NA and NaN values
as implemented in R. In arithmetic.c, we have

int R_IsNA(double x)
{
    if (isnan(x)) {
        ieee_double y;
        y.value = x;
        return (y.word[lw] == 1954);
    }
    return 0;
}

ieee_double is just used for type punning so we can check the
low-order word and see if it's equal to 1954; if it is, x is NA, if
it's not, x is NaN (as defined for R_IsNaN).
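
The check described above can be sketched in standalone C without R's headers. This is only an illustration, not R's implementation: from_bits and is_na_sketch are hypothetical names, memcpy stands in for the ieee_double union, and masking a 64-bit integer replaces the endianness-dependent word[lw] index. It assumes 64-bit IEEE doubles stored with the same byte order as 64-bit integers.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "expects 64-bit doubles");

/* Hypothetical helper: build a double from a raw 64-bit pattern. */
static double from_bits(uint64_t bits)
{
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Hypothetical stand-in for R_IsNA: a NaN whose low 32 bits equal 1954. */
static int is_na_sketch(double x)
{
    uint64_t bits;
    if (!isnan(x))
        return 0;
    memcpy(&bits, &x, sizeof bits);
    return (uint32_t)(bits & 0xFFFFFFFFu) == 1954u;
}
```

Since 1954 is 0x7A2, is_na_sketch(from_bits(0x7FF00000000007A2)) is true, while the plain NaN pattern 0x7FF8000000000000 and ordinary numbers are rejected.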

My question is -- I can see a substantial increase in speed (on my
computer, in certain cases) if I replace this check with

int R_IsNA(double x)
{
    return memcmp(
        (char*)(&x),
        (char*)(&NA_REAL),
        sizeof(double)
    ) == 0;
}

IIUC, there is only one bit pattern used to encode R NA values, so
this should be safe. But I would like to be sure:

Is there any guarantee that the different functions in R would return
NA as identical to the bit pattern defined for NA_REAL, for a given
architecture? Similarly for NaN value(s) and R_NaN?

My guess is that it is possible some functions used internally by R
might encode NaN values differently; i.e., setting the lower word to a
value different than 1954 (hence being NaN, but potentially not
identical to R_NaN), or perhaps this is architecture-dependent.
However, NA should be one specific bit pattern (?). And, I wonder if
there is any guarantee that the different functions used in R would
return an NaN value as identical to R_NaN (which appears to be the
'IEEE NaN')?

(interested parties can see + run a simple benchmark from the gist at
https://gist.github.com/kevinushey/8911432)

Thanks,
Kevin

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Question re: NA, NaNs in R

Prof Brian Ripley
There is one NA but multiple NaNs.

And please re-read 'man memcmp': your cast is wrong.


--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: Question re: NA, NaNs in R

Tim Hesterberg-2
This isn't quite what you were asking, but might inform your choice.

R doesn't try to maintain the distinction between NA and NaN when
doing calculations, e.g.:
> NA + NaN
[1] NA
> NaN + NA
[1] NaN
So for the aggregate package, I didn't attempt to treat them differently.

The aggregate package is available at
http://www.timhesterberg.net/r-packages

Here is the inst/doc/missingValues.txt file from that package:

--------------------------------------------------
Copyright 2012 Google Inc. All Rights Reserved.
Author: Tim Hesterberg <[hidden email]>
Distributed under GPL 2 or later.


        Handling of missing values and not-a-numbers.


Here I'll note how this package handles missing values.
I do it the way R handles them, rather than the more strict way that S+ does.

First, for terminology,
  NaN = "not-a-number", e.g. the result of 0/0
  NA  = "missing value" or "true missing value", e.g. survey non-response
  xx  = I'll use this for the union of those, or "missing value of any kind".

For background, at the hardware level there is an IEEE standard that
specifies that certain bit patterns are NaN, and specifies that
operations involving an NaN result in another NaN.

That standard doesn't say anything about missing values, which are
important in statistics.

So what R and S+ do is to pick one of the bit patterns and declare
that to be an NA.  In other words, the NA bit pattern is a subset of
the NaN bit patterns.

At the user level, the reverse seems to hold.
You can assign either NA or NaN to an object.
But:
        is.na(x) returns TRUE for both
        is.nan(x) returns TRUE for NaN and FALSE for NA
Based on that, you'd think that NaN is a subset of NA.
To tell whether something is a true missing value do:
        (is.na(x) & !is.nan(x))

The S+ convention is that any operation involving NA results in an NA;
otherwise any operation involving NaN results in NaN.

The R convention is that any operation involving xx results in an xx;
a missing value of any kind results in another missing value of any
kind.  R considers NA and NaN equivalent for testing purposes:
        all.equal(NA_real_, NaN)
gives TRUE.

Some R functions follow the S+ convention, e.g. the Math2 functions
in src/main/arithmetic.c use this macro:
#define if_NA_Math2_set(y,a,b) \
        if      (ISNA (a) || ISNA (b)) y = NA_REAL; \
        else if (ISNAN(a) || ISNAN(b)) y = R_NaN;
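
As a rough standalone illustration of the S+-style rule the macro encodes (NA wins over NaN, regardless of operand order), here is a sketch in plain C. The names from_bits, na_value, is_na and add_na_first are hypothetical and the NA bit pattern is assumed from the values reported elsewhere in this thread; it is not the code R uses.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "expects 64-bit doubles");

/* Hypothetical helper: build a double from a raw 64-bit pattern. */
static double from_bits(uint64_t bits)
{
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Assumed NA pattern: a NaN whose low word is 1954 (0x7A2). */
static double na_value(void)
{
    return from_bits(UINT64_C(0x7FF00000000007A2));
}

/* NA test: NaN with low 32 bits equal to 1954. */
static int is_na(double x)
{
    uint64_t bits;
    if (!isnan(x))
        return 0;
    memcpy(&bits, &x, sizeof bits);
    return (uint32_t)(bits & 0xFFFFFFFFu) == 1954u;
}

/* Addition with explicit propagation: NA first, then NaN, then the sum. */
static double add_na_first(double a, double b)
{
    if (is_na(a) || is_na(b))
        return na_value();
    if (isnan(a) || isnan(b))
        return NAN;
    return a + b;
}
```

With this rule add_na_first(NA, NaN) and add_na_first(NaN, NA) both give NA, at the cost of two extra tests per operation.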

Other R functions, like the basic arithmetic operations +-/*^,
do not (search for PLUSOP in src/main/arithmetic.c).
They just let the hardware do the calculations.
As a result, you can get odd results like
> is.nan(NA_real_ + NaN)
[1] FALSE
> is.nan(NaN + NA_real_)
[1] TRUE
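
The order dependence can be explored outside R with a small C sketch. It assumes the NA and NaN bit patterns reported elsewhere in this thread; from_bits, low_word and hw_add are hypothetical helper names. Which payload survives is decided by the hardware and compiler, so no particular result is claimed here.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "expects 64-bit doubles");

/* Hypothetical helper: build a double from a raw 64-bit pattern. */
static double from_bits(uint64_t bits)
{
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Low 32 bits of the representation; 1954 marks an R-style NA. */
static uint32_t low_word(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (uint32_t)(bits & 0xFFFFFFFFu);
}

/* Assumed patterns: NA as a NaN with payload 1954, NaN with empty payload. */
static double na_value(void)  { return from_bits(UINT64_C(0x7FF00000000007A2)); }
static double nan_value(void) { return from_bits(UINT64_C(0x7FF8000000000000)); }

/* Plain hardware addition; volatile discourages constant folding. */
static double hw_add(double a, double b)
{
    volatile double va = a, vb = b;
    return va + vb;
}
```

Comparing low_word(hw_add(na_value(), nan_value())) with low_word(hw_add(nan_value(), na_value())) shows which operand's payload the platform kept: 1954 means the result still looks like NA, 0 means it looks like a plain NaN.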

The R help files help(is.na) and help(is.nan) suggest that
computations involving NA and NaN are indeterminate.

It is faster to use the R convention; most operations are just
handled by the hardware, without extra work.

In cases like sum(x, na.rm=TRUE), the help file specifies that both NA
and NaN are removed.





Re: Question re: NA, NaNs in R

Kevin Ushey
Thanks Tim, this is exactly the explanation I was hoping to see. Much
appreciated!


Re: Question re: NA, NaNs in R

Duncan Murdoch-2
In reply to this post by Tim Hesterberg-2
On 10/02/2014 10:21 AM, Tim Hesterberg wrote:
> This isn't quite what you were asking, but might inform your choice.
>
> R doesn't try to maintain the distinction between NA and NaN when
> doing calculations, e.g.:
> > NA + NaN
> [1] NA
> > NaN + NA
> [1] NaN
> So for the aggregate package, I didn't attempt to treat them differently.

This looks like a bug to me.  In 32 bit 3.0.2 and R-patched I see

 > NA + NaN
[1] NA
 > NaN + NA
[1] NA

This seems more reasonable to me.  NA should propagate.  (I can see an
argument for NaN for the answer here, as I can't think of any possible
non-missing value that would give anything else when added to NaN, but
the answer should not depend on the order of operands.)

However, I get the same as you in 64 bit 3.0.2.  All calculations I've
shown are on 64 bit Windows 7.

Duncan Murdoch



Re: Question re: NA, NaNs in R

Kevin Ushey
Also, similarly, to clarify, should there be _one_ unique bit pattern
for R's NA_REAL, or two? Because I see (for a function hex that
produces the hex representation of a number):

> hex(NA_real_)
[1] "7FF00000000007A2"
> hex(NA_real_+1)
[1] "7FF80000000007A2"
> hex(NaN)
[1] "7FF8000000000000"

This is with 64-bit R (on OS X Mavericks, R-devel r64910), as well. I
also noticed in a conversation with Arun (co-author of data.table) that:

On 32-bit R-2.15.3:

NA: 7ff80000000007a2
NaN: 7ff8000000000000

On 64-bit version of R-2.15.3
NA: 7ff00000000007a2
NaN: 7ff8000000000000

Notice that the initial bit pattern is 7ff0, rather than 7ff8, for
64-bit R. Is this intentional?
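
For reference, the 7ff0 and 7ff8 prefixes differ only in bit 51, the most significant fraction bit, which IEEE 754 binary64 uses as the quiet-NaN flag. A small sketch that splits the reported patterns into fields makes this visible; struct nan_fields and decode are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an IEEE 754 binary64 value that matter for NA versus NaN. */
struct nan_fields {
    unsigned sign;      /* bit 63 */
    unsigned exponent;  /* bits 62..52; 0x7FF for NaN and infinity */
    unsigned quiet;     /* bit 51; 1 for a quiet NaN */
    uint32_t low_word;  /* bits 31..0; 1954 for an R-style NA */
};

/* Split a raw 64-bit pattern into the fields above. */
static struct nan_fields decode(uint64_t bits)
{
    struct nan_fields f;
    f.sign = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    f.quiet = (unsigned)((bits >> 51) & 1u);
    f.low_word = (uint32_t)(bits & 0xFFFFFFFFu);
    return f;
}
```

Decoding 0x7FF00000000007A2 and 0x7FF80000000007A2 gives the same exponent and the same low word, 1954; only the quiet flag differs (0 and 1). That matches hex(NA_real_+1) above, where arithmetic on NA yields the quiet form.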

Thanks,
Kevin

(function follows:)

// assume size of double, unsigned long long is the same
SEXP hex(SEXP x) {

  // double is 8 bytes, each byte can be represented by 2 hex chars,
  // so need a str with 16+1 slots
  int n = sizeof(unsigned long long) * 2 + 1;

  unsigned long long *xx = (unsigned long long*) REAL(x);
  char buf[n];
  snprintf(buf, n, "%016llX", *xx);
  SEXP output = PROTECT(allocVector(STRSXP, 1));
  SET_STRING_ELT(output, 0, mkChar(buf));
  UNPROTECT(1);
  return output;
}
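
For readers without an R build at hand, here is a standalone sketch of the same idea that also puts the two tests discussed in this thread side by side. It assumes the two NA patterns shown above; NA_SIGNALING, NA_QUIET, hex_of, same_bytes and low_word_is_1954 are hypothetical names. It copies bytes with memcpy rather than casting the pointer, and works through pointers so the bit patterns are never altered by passing them around as doubles.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "expects 64-bit doubles");

/* The two NA representations reported in the thread, stored by bit pattern. */
static const union { uint64_t u; double d; } NA_SIGNALING = { UINT64_C(0x7FF00000000007A2) };
static const union { uint64_t u; double d; } NA_QUIET     = { UINT64_C(0x7FF80000000007A2) };

/* Hex form of a double; returns a static buffer, so not thread-safe. */
static const char *hex_of(const double *x)
{
    static char buf[17];
    uint64_t bits;
    memcpy(&bits, x, sizeof bits);
    snprintf(buf, sizeof buf, "%016" PRIX64, bits);
    return buf;
}

/* Byte-wise comparison in the spirit of the memcmp proposal (no casts needed). */
static int same_bytes(const double *a, const double *b)
{
    return memcmp(a, b, sizeof(double)) == 0;
}

/* Low-word comparison in the spirit of the existing R_IsNA. */
static int low_word_is_1954(const double *x)
{
    uint64_t bits;
    memcpy(&bits, x, sizeof bits);
    return (uint32_t)(bits & 0xFFFFFFFFu) == 1954u;
}
```

same_bytes(&NA_SIGNALING.d, &NA_QUIET.d) is false while low_word_is_1954 accepts both, so a byte-wise comparison against a single reference value would miss one of the two forms.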


Re: Question re: NA, NaNs in R

Duncan Murdoch-2
On 10/02/2014 1:43 PM, Kevin Ushey wrote:
> Also, similarly, to clarify, should there be _one_ unique bit pattern
> for R's NA_REAL, or two? Because I see (for a function hex that
> produces the hex representation of a number):

I don't think the language definition defines bit patterns, it defines
behaviour.  If both of those bit patterns behave correctly, then it's
fine.  (I see the same patterns as you do in 64 bit R-patched, but only
the 2nd in 32 bit R).

Now, perhaps we should define the bit pattern for NA_real_ to make the
optimization you're looking into a bit easier; I don't know if that's
trivial or hard.
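
One way such a fixed pattern might be enforced is to map every NaN whose low word is 1954 onto a single canonical representation. The sketch below is only an assumption about how that could look: canonicalize_na_bits is a hypothetical name, the choice of canonical pattern is arbitrary, and nothing like it is confirmed to exist in R.

```c
#include <assert.h>
#include <stdint.h>

/* Arbitrary choice of canonical NA pattern for this sketch. */
#define NA_CANONICAL_BITS UINT64_C(0x7FF00000000007A2)

/* Map any NaN with low word 1954 to the canonical pattern; leave the rest. */
static uint64_t canonicalize_na_bits(uint64_t bits)
{
    int exponent_all_ones = ((bits >> 52) & 0x7FFu) == 0x7FFu;
    int fraction_nonzero = (bits & UINT64_C(0x000FFFFFFFFFFFFF)) != 0;
    int low_word_matches = (uint32_t)(bits & 0xFFFFFFFFu) == 1954u;

    if (exponent_all_ones && fraction_nonzero && low_word_matches)
        return NA_CANONICAL_BITS;
    return bits;
}
```

Since arithmetic can produce the other form again (as hex(NA_real_+1) shows), this would presumably have to run after every operation that can return NA, which may cancel out much of the speed gained from a plain byte comparison.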

Duncan Murdoch

>
> > hex(NA_real_)
> [1] "7FF00000000007A2"
> > hex(NA_real_+1)
> [1] "7FF80000000007A2"
> > hex(NaN)
> [1] "7FF8000000000000"
>
> This is with 64-bit R (on OS X Mavericks, R-devel r64910), as well. I
> also noticed in a conversation of Arun (co-author of data.table) that:
>
> On 32-bit R-2.15.3:
>
> NA: 7ff80000000007a2
> NaN: 7ff8000000000000
>
> On 64-bit version of R-2.15.3
> NA: 7ff00000000007a2
> NaN: 7ff8000000000000
>
> Notice that the initial bit pattern is 7ff0, rather than 7ff8, for
> 64-bit R. Is this intentional?
>
> Thanks,
> Kevin
>
> (function follows:)
>
> // assume size of double, unsigned long long is the same
> SEXP hex(SEXP x) {
>
>    // double is 8 bytes, each byte can be represented by 2 hex chars,
>    // so need a str with 16+1 slots
>    int n = sizeof(unsigned long long) * 2 + 1;
>
>    unsigned long long *xx = (unsigned long long*) REAL(x);
>    char buf[n];
>    snprintf(buf, n, "%016llX", *xx);
>    SEXP output = PROTECT(allocVector(STRSXP, 1));
>    SET_STRING_ELT(output, 0, mkChar(buf));
>    UNPROTECT(1);
>    return output;
> }
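
For anyone who wants to reproduce the hex output outside R, a standalone variant might look like the sketch below. It is an assumption-laden rewrite, not Kevin's function: double_to_hex is a hypothetical name, it uses no R API, and it copies the bytes with memcpy instead of reading the double through an unsigned long long pointer, which sidesteps aliasing concerns.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t),
              "this sketch expects 64-bit doubles");

/* Hypothetical helper: write the 16 upper-case hex digits of x, plus the
   terminating NUL, into buf (which must hold at least 17 bytes). */
static void double_to_hex(double x, char buf[17])
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);           /* copy the bytes, no pointer cast */
    snprintf(buf, 17, "%016" PRIX64, bits);   /* zero-padded to 16 digits */
}
```

For example, double_to_hex(1.0, buf) leaves "3FF0000000000000" in buf.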

Re: Question re: NA, NaNs in R

Duncan Murdoch-2
In reply to this post by Kevin Ushey
On 10/02/2014 1:43 PM, Kevin Ushey wrote:

> Also, similarly, to clarify, should there be _one_ unique bit pattern
> for R's NA_REAL, or two? Because I see (for a function hex that
> produces the hex representation of a number):
>
> > hex(NA_real_)
> [1] "7FF00000000007A2"
> > hex(NA_real_+1)
> [1] "7FF80000000007A2"
> > hex(NaN)
> [1] "7FF8000000000000"
>
> This is with 64-bit R (on OS X Mavericks, R-devel r64910), as well. I
> also noticed in a conversation with Arun (co-author of data.table) that:

I've just taken a look at the IEEE Std 754-2008.  The encoding we use
for NA is a "signaling NaN".  The standard says that operations on
signaling NaN values convert them to "quiet NaN" values, with the most
significant bit of the mantissa set.  That's the difference between your
first and second rows.
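
The relationship between the two rows can be checked with integer arithmetic alone. The following is a small sketch, assuming the common convention (used on x86, among others) that a set top fraction bit marks a quiet NaN; the macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define EXP_MASK  UINT64_C(0x7FF0000000000000)  /* 11 exponent bits */
#define FRAC_MASK UINT64_C(0x000FFFFFFFFFFFFF)  /* 52 fraction (mantissa) bits */
#define QUIET_BIT UINT64_C(0x0008000000000000)  /* most significant fraction bit */

static_assert(QUIET_BIT == (UINT64_C(1) << 51),
              "the quiet bit is bit 51 of the 64-bit pattern");

/* NaN: exponent all ones and a non-zero fraction. */
static int bits_are_nan(uint64_t b)
{
    return (b & EXP_MASK) == EXP_MASK && (b & FRAC_MASK) != 0;
}

/* Signaling NaN: a NaN whose quiet bit is clear. */
static int bits_are_signaling(uint64_t b)
{
    return bits_are_nan(b) && (b & QUIET_BIT) == 0;
}

/* What "converting to a quiet NaN" does to the pattern: set the quiet bit
   and leave the rest of the payload alone. */
static uint64_t quieted(uint64_t b)
{
    return b | QUIET_BIT;
}
```

Here bits_are_signaling(0x7FF00000000007A2) is true, and quieted(0x7FF00000000007A2) equals 0x7FF80000000007A2, matching hex(NA_real_) and hex(NA_real_+1) above. The low word, 0x7A2 = 1954, is untouched.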

R doesn't do all this bit twiddling itself; it's done in hardware.  Perhaps we
should have chosen a quiet NaN for NA from the beginning; they stay
quiet when you do operations on them.  However, I think the choice of
bit pattern was made before that behaviour was mandated, and a change
now would be quite disruptive.

Similarly, the standard doesn't say which bit pattern propagates when
you do binary operations on two NaNs.  That's why sometimes NA + NaN is
NA, and sometimes NaN.
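
One way to see which payload a given machine keeps is to add the two patterns in both orders and look at the low word of the result. This is only a sketch: sum_bits and low_word are hypothetical helpers, the two constants are the patterns printed earlier in this thread, and the outcome is expected to vary with hardware and compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t),
              "this sketch expects 64-bit doubles");

#define NA_BITS  UINT64_C(0x7FF00000000007A2)  /* hex(NA_real_) from this thread */
#define NAN_BITS UINT64_C(0x7FF8000000000000)  /* hex(NaN) from this thread */

/* Hypothetical helper: add two doubles given as raw bit patterns and
   return the raw bit pattern of the sum. */
static uint64_t sum_bits(uint64_t a_bits, uint64_t b_bits)
{
    volatile double a, b;   /* volatile keeps the addition at run time */
    double tmp, s;
    uint64_t out;

    memcpy(&tmp, &a_bits, sizeof tmp);
    a = tmp;
    memcpy(&tmp, &b_bits, sizeof tmp);
    b = tmp;
    s = a + b;              /* the hardware decides which NaN comes out */
    memcpy(&out, &s, sizeof out);
    return out;
}

/* Low 32 bits of a pattern: 1954 means the NA payload survived. */
static uint32_t low_word(uint64_t bits)
{
    return (uint32_t)(bits & UINT64_C(0xFFFFFFFF));
}
```

Comparing low_word(sum_bits(NA_BITS, NAN_BITS)) with low_word(sum_bits(NAN_BITS, NA_BITS)) shows whether operand order matters on the machine at hand. Both results are NaNs either way; only the payload can differ.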

Duncan Murdoch




Re: Question re: NA, NaNs in R

Rainer Krug-3
In reply to this post by Duncan Murdoch-2


On 02/10/14, 19:07 , Duncan Murdoch wrote:

> On 10/02/2014 10:21 AM, Tim Hesterberg wrote:
>> This isn't quite what you were asking, but might inform your choice.
>>
>> R doesn't try to maintain the distinction between NA and NaN when
>> doing calculations, e.g.:
>> > NA + NaN
>> [1] NA
>> > NaN + NA
>> [1] NaN
>> So for the aggregate package, I didn't attempt to treat them differently.
>
> This looks like a bug to me.  In 32 bit 3.0.2 and R-patched I see
>
>> NA + NaN
> [1] NA
>> NaN + NA
> [1] NA
But under 3.0.2 patched 64 bit on Mavericks:

> version
               _
platform       x86_64-apple-darwin10.8.0
arch           x86_64
os             darwin10.8.0
system         x86_64, darwin10.8.0
status         Patched
major          3
minor          0.2
year           2014
month          01
day            07
svn rev        64692
language       R
version.string R version 3.0.2 Patched (2014-01-07 r64692)
nickname       Frisbee Sailing
> NA+NaN
[1] NA
> NaN+NA
[1] NaN

>
> This seems more reasonable to me.  NA should propagate.  (I can see an
> argument for NaN for the answer here, as I can't think of any possible
> non-missing value that would give anything else when added to NaN, but
> the answer should not depend on the order of operands.)
>
> However, I get the same as you in 64 bit 3.0.2.  All calculations I've
> shown are on 64 bit Windows 7.
>
> Duncan Murdoch
--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa

Tel :       +33 - (0)9 53 10 27 44
Cell:       +33 - (0)6 85 62 59 98
Fax :       +33 - (0)9 58 10 27 44

Fax (D):    +49 - (0)3 21 21 25 22 44

email:      [hidden email]

Skype:      RMkrug



Re: Question re: NA, NaNs in R

Kevin Ushey
Hi Duncan,

Thanks a ton -- I appreciate your taking the time to investigate this,
and especially even checking into the IEEE standard to clarify.

Cheers,
Kevin
