1954 from NA

1954 from NA

Adrian Dușa
CONTENTS DELETED
The author has deleted this message.

Re: 1954 from NA

dusadrian
In reply to this post by R devel mailing list
On Sun, May 23, 2021 at 4:33 PM brodie gaslam via R-devel <
[hidden email]> wrote:

> I should add, I don't know that you can rely on this
> particular encoding of R's NA.  If I were trying to restore
> an NA from some external format, I would just generate an
> R NA via e.g. NA_real_ in the R session I'm restoring the
> external data into, and not try to hand assemble one.
>

Thanks for your answer, Brodie, especially on Sunday (much appreciated).
The aim is not to reconstruct an NA, but to "tag" an NA (and yes, I was
referring to an NA_real_ of course), as seen in action here:
https://github.com/tidyverse/haven/blob/master/src/tagged_na.c

That code:
- preserves the first part 0x7ff0
- preserves the last part 1954
- adds one additional byte to store (tag) a character provided in the SEXP
vector

That is precisely my understanding: doubles whose high bits are 0x7ff (all
exponent bits set) are NaNs, provided the mantissa is non-zero (a zero
mantissa encodes infinity). My question concerned the additional 1954 in the
low bits: why does it need 32 bits?

The binary value of 1954 is 11110100010, which is represented by 11 bits
occupying at most 2 bytes... So why does it need 4 bytes?
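
To make the layout in question concrete, here is a minimal sketch. It assumes the NA pattern discussed in this thread (high 32-bit word 0x7ff00000, low 32-bit word 1954); the helper names are hypothetical and are not part of R's API. It uses shifts on a 64-bit integer rather than a union, so the result does not depend on byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed bit pattern of NA_real_: high word 0x7ff00000, low word 1954 (0x7A2). */
#define NA_BITS UINT64_C(0x7FF00000000007A2)

static_assert(sizeof(double) == sizeof(uint64_t), "this sketch needs 64-bit doubles");

/* Hypothetical helper: reinterpret a 64-bit pattern as a double. */
static double bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Hypothetical helper: reinterpret a double as its 64-bit pattern. */
static uint64_t double_to_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Upper 32 bits: sign, 11 exponent bits, top 20 mantissa bits. */
static uint32_t high_word(double d)
{
    return (uint32_t)(double_to_bits(d) >> 32);
}

/* Lower 32 bits: the remaining mantissa bits, where 1954 is stored. */
static uint32_t low_word(double d)
{
    return (uint32_t)(double_to_bits(d) & UINT64_C(0xFFFFFFFF));
}
```

With this pattern, 1954 occupies only the bottom 11 bits of the low word; the upper bits of that word are zero, and a 32-bit comparison requires them to stay zero.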

Re. the possible overflow, I am not sure: 0x7ff0 is the decimal 32752, or
the binary 111111111110000.
That is just about enough to fit in the available 16 bits (actually 15 to
leave one for the sign bit), so I don't really understand why it would. And
in any case, the union definition uses an unsigned short which (if my
understanding is correct) should certainly not overflow:

typedef union
{
    double value;
    unsigned short word[4];
} ieee_double;

What is gained with this proposal: 16 additional bits to do something with.
For the moment, only 16 are available (from the lower part of the high 32
bits). If the value 1954 were checked as a short instead of an int, the
other 16 bits would become available. And those bits could be extremely
valuable to tag multi-byte characters, for instance, but also higher
numbers than 32767.
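
The idea can be sketched as follows. This is an illustration only, not haven's code: the helper names are hypothetical, the NA pattern is the one assumed in this thread, and placing the tag byte in bits 32..39 is an assumption of the sketch. The second pair of helpers shows what the proposal would free up (bits 16..31), which is only usable if the NA check compares 16 bits rather than 32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed bit pattern of NA_real_: high word 0x7ff00000, low word 1954. */
#define NA_BITS UINT64_C(0x7FF00000000007A2)

static_assert(sizeof(double) == sizeof(uint64_t), "this sketch needs 64-bit doubles");

static double from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

static uint64_t to_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Hypothetical: store a one-byte tag in bits 32..39; the low 32 bits stay 1954. */
static double make_tagged_na(unsigned char tag)
{
    return from_bits(NA_BITS | ((uint64_t)tag << 32));
}

static unsigned char get_na_tag(double d)
{
    return (unsigned char)((to_bits(d) >> 32) & 0xFF);
}

/* Hypothetical: additionally use bits 16..31, which a 32-bit check
   of the low word would reject but a 16-bit check would accept. */
static double make_wide_tagged_na(unsigned char tag, uint16_t extra)
{
    return from_bits(NA_BITS | ((uint64_t)tag << 32) | ((uint64_t)extra << 16));
}

static uint16_t get_na_extra(double d)
{
    return (uint16_t)((to_bits(d) >> 16) & 0xFFFF);
}
```

Storing and reading back a tag this way involves no floating-point arithmetic, which is the only situation in which the payload can be relied upon.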

Best wishes,
Adrian

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: 1954 from NA

dusadrian
Dear Tomas,

I understand that perfectly, but that is fine.
The payload is not going to be used in any computations anyways, it is
strictly an information carrier that differentiates between different types
of (tagged) NA values.

Having only one NA value in R is extremely limiting for the social
sciences, where multiple missing values may exist, because respondents:
- did not know what to respond, or
- did not want to respond, or perhaps
- the question did not apply in a given situation etc.

All of these need to be captured, stored, and most importantly treated as
if they were regular missing values. Whether the payload might be lost
in computations makes no difference: they were supposed to be "missing
values" anyways.
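
For anyone who wants to see what their own platform does, here is a small probe, offered as a sketch under the NA pattern assumed in this thread. The function names are hypothetical. It adds a number to a NaN and reports the resulting bits; the result is guaranteed to be a NaN, but whether the payload bits match afterwards depends on the compiler and the hardware, so no particular outcome is claimed here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXP_MASK      UINT64_C(0x7FF0000000000000)  /* 11 exponent bits */
#define MANT_MASK     UINT64_C(0x000FFFFFFFFFFFFF)  /* 52 mantissa bits */
#define PAYLOAD_MASK  UINT64_C(0x0007FFFFFFFFFFFF)  /* mantissa without the quiet bit */

static_assert(sizeof(double) == sizeof(uint64_t), "this sketch needs 64-bit doubles");

/* A pattern is a NaN when all exponent bits are set and the mantissa is non-zero. */
static int is_nan_bits(uint64_t bits)
{
    return (bits & EXP_MASK) == EXP_MASK && (bits & MANT_MASK) != 0;
}

/* Hypothetical probe: perform one addition on the NaN and return the result's bits. */
static uint64_t bits_after_add(uint64_t nan_bits, double addend)
{
    double in, out;
    uint64_t out_bits;
    memcpy(&in, &nan_bits, sizeof in);
    volatile double x = in;  /* volatile keeps the compiler from folding the sum */
    out = x + addend;
    memcpy(&out_bits, &out, sizeof out_bits);
    return out_bits;
}

/* Compare payloads, ignoring the sign bit and the quiet bit. */
static int payload_survived(uint64_t before, uint64_t after)
{
    return (before & PAYLOAD_MASK) == (after & PAYLOAD_MASK);
}
```

Calling `payload_survived(bits, bits_after_add(bits, 1.0))` on a tagged pattern shows whether a single addition keeps the tag on the machine at hand; it says nothing about other operations or other machines.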

The original question is how the payload is currently stored: as an
unsigned int of 32 bits, or as an unsigned short of 16 bits. If the R
internals were not affected (and I see no reason why they would be), it
would open up an entire universe for the social sciences that is not
currently available and which all other major statistical packages do offer.

Thank you very much, your attention is greatly appreciated,
Adrian

On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera <[hidden email]>
wrote:

> TLDR: tagging R NAs is not possible.
>
> External software should not depend on how R currently implements NA,
> this may change at any time. Tagging of NA is not supported in R (if it
> were, it would have been documented). It would not be possible to
> implement such tagging reliably with the current implementation of NA in R.
>
> NaN payload propagation is not standardized. Compilers are free to and
> do optimize code not preserving/achieving any specific propagation.
> CPUs/FPUs differ in how they propagate in binary operations, some zero
> the payload on any operation. Virtualized environments, binary
> translations, etc, may not preserve it in any way, either. ?NA has
> disclaimers about this, an NA may become NaN (payload lost) even in
> unary operations and also in binary operations not involving other NaN/NAs.
>
> Writing any new software that depends on anything specific happening
> to the NaN payloads would not be a good idea. One can only
> reliably use the NaN payload bits for storage, that is if one avoids any
> computation at all, avoids passing the values to any external code
> unaware of such tagging (including R), etc. If such software wants any
> NaN to be understood as NA by R, it would have to use the documented R
> API for this (so essentially translating) - but given the problems
> mentioned above, there is really no point in doing that, because such
> NAs become NaNs at any time.
>
> Best
> Tomas
>
> On 5/23/21 9:56 AM, Adrian Dușa wrote:
> > Dear R devs,
> >
> > I am probably missing something obvious, but still trying to understand
> why
> > the 1954 from the definition of an NA has to fill 32 bits when it
> normally
> > doesn't need more than 16.
> >
> > Wouldn't the code below achieve exactly the same thing?
> >
> > typedef union
> > {
> >      double value;
> >      unsigned short word[4];
> > } ieee_double;
> >
> >
> > #ifdef WORDS_BIGENDIAN
> > static CONST int hw = 0;
> > static CONST int lw = 3;
> > #else  /* !WORDS_BIGENDIAN */
> > static CONST int hw = 3;
> > static CONST int lw = 0;
> > #endif /* WORDS_BIGENDIAN */
> >
> >
> > static double R_ValueOfNA(void)
> > {
> >      volatile ieee_double x;
> >      x.word[hw] = 0x7ff0;
> >      x.word[lw] = 1954;
> >      return x.value;
> > }
> >
> > This question has to do with the tagged NA values from package haven, on
> > which I want to improve. Every available bit counts, especially if
> > multi-byte characters are going to be involved.
> >
> > Best wishes,

Re: 1954 from NA

dusadrian
In reply to this post by Tomas Kalibera
On Sun, May 23, 2021 at 10:14 PM Tomas Kalibera <[hidden email]>
wrote:

> [...]
>
> Good, but unfortunately the delineation between computation and
> non-computation is not always transparent. Even if an operation doesn't
> look like "computation" on the high-level, it may internally involve
> computation - so, really, an R NA can become R NaN and vice versa, at any
> point (this is not a "feature", but it is how things are now).
>

I see.
Well, this is a risk we'll have to consider when the time comes. For the
moment, storing some metadata within the payload seems to work.



> [...]
>
> Ok, then I would probably keep the meta-data on the missing values on the
> side to implement such missing values in such code, and treat them
> explicitly in supported operations.
>
> But, in principle, you can use the floating-point NaN payloads, and you
> can pass such values to R. You just need to be prepared that not only
> would you lose your payloads/tags, but also the difference between R NA and R
> NaNs. Thanks to value semantics of R, you would not lose the tags in input
> values with proper reference counts (e.g. marked immutable), because those
> values will not be modified.
>
NaNs are fine of course, but then some (social science?) users might get
confused about the difference between NAs and NaNs, and for this reason
alone I would still like to preserve the 1954 payload.
If at all possible, however, the extra 16 bits from this payload would make
a great deal of difference.

Please forgive my persistence, but would it be possible to use an unsigned
short instead of an unsigned int for the 1954 payload?
That is, if it doesn't break anything, but I don't really see what it
could. The corresponding check function seems to work just fine and it
doesn't need to be changed at all:

int R_IsNA(double x)
{
    if (isnan(x)) {
        ieee_double y;
        y.value = x;
        return (y.word[lw] == 1954);
    }
    return 0;
}
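
One way to see what the change would mean is to put the two checks side by side. The sketch below assumes the NA pattern discussed in this thread and uses hypothetical function names; it is not R's source. The 16-bit check accepts every NaN that the 32-bit check accepts, plus NaNs that carry extra data in bits 16..31 of the low word. Whether that wider acceptance is harmless inside R is not something this sketch can settle.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "this sketch needs 64-bit doubles");

static double from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

static uint64_t to_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Check in the style discussed here: a NaN whose low 32 bits equal 1954. */
static int is_na_32(double x)
{
    return isnan(x) && (to_bits(x) & UINT64_C(0xFFFFFFFF)) == 1954;
}

/* Proposed variant: a NaN whose low 16 bits equal 1954. */
static int is_na_16(double x)
{
    return isnan(x) && (to_bits(x) & UINT64_C(0xFFFF)) == 1954;
}
```

For example, the pattern 0x7FF00000000107A2 (a 1 stored in bit 16) is rejected by `is_na_32` and accepted by `is_na_16`, while the plain pattern 0x7FF00000000007A2 is accepted by both.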

Best wishes,
Adrian


Re: 1954 from NA

dusadrian
In reply to this post by Tomas Kalibera
On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera <[hidden email]>
wrote:

> [...]
>
> For the reasons I explained, I would be against such a change. Keeping the
> data on the side, as also recommended by others on this list, would give
> you a reliable implementation. I don't want to support fragile package
> code building on unspecified R internals, and in this case particularly
> internals that themselves have not stood the test of time, so are at high
> risk of change.
>
I understand, and it makes sense.
We'll have to wait for the R internals to settle (this really is
surprising; I wonder how other software has solved this). In the meantime,
I will probably go ahead with NaNs.

Thank you again,
Adrian


Re: 1954 from NA

dusadrian
Dear Alex,

Thanks for chiming in, I am learning with each new message.
The problem is clear; the solution escapes me, though. I've already tried
the attributes route, and it is going to triple the data size: along with the
additional (logical) variable that specifies which level is missing, one
also needs to store an index so that sorting the data still
maintains the correct information.

One also needs to think about subsetting (subset the attributes as well),
splitting (the same), aggregating multiple datasets (even more attention),
creating custom vectors out of multiple variables... complexity quickly
grows towards infinity.
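
To illustrate the bookkeeping involved, here is a minimal sketch of the "data on the side" approach at the C level. Everything in it is hypothetical (the type, the reason codes, the function names), and it is not how R attributes work: it simply pairs each value with its missingness reason so that the two cannot drift apart when elements are reordered or subsetted, at the cost of extra storage per element.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical reason codes for a survey variable. */
enum missing_reason { NOT_MISSING = 0, DONT_KNOW, REFUSED, NOT_APPLICABLE };

/* The value and its reason are stored together. */
typedef struct {
    double value;                /* NAN when the observation is missing */
    enum missing_reason reason;  /* why it is missing, if it is */
} survey_value;

static survey_value make_missing(enum missing_reason reason)
{
    survey_value v = { NAN, reason };
    return v;
}

/* Subsetting or reordering by index copies value and reason as one unit. */
static void subset_values(const survey_value *in, size_t n_in,
                          const size_t *idx, size_t n_idx,
                          survey_value *out)
{
    for (size_t i = 0; i < n_idx; i++) {
        assert(idx[i] < n_in);  /* indices must be in range */
        out[i] = in[idx[i]];
    }
}

/* Count how many elements carry a given reason. */
static size_t count_reason(const survey_value *v, size_t n,
                           enum missing_reason reason)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i].reason == reason)
            count++;
    return count;
}
```

The same pairing would have to be maintained by every operation that touches the data (splitting, merging, combining variables), which is exactly the growth in complexity described above.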

R factors are nice indeed, but:
- there are numerical variables which can hold multiple missing values (for
instance income)
- factors convert the original questionnaire values: if a missing value was
coded 999, turning that into a factor would convert that value into
something else

I really, and wholeheartedly, do appreciate all advice: but please be
assured that I have been thinking about this for more than 10 years and
still haven't found a satisfactory solution.

Which makes it even more intriguing, since other software like SAS or Stata
solved this decades ago: what is their implementation, and how come
they don't seem to be affected by the new M1 architecture?
When package "haven" introduced the tagged NA values I said: ah-haa... so
that is how it's done... only to learn that implementation is just as
fragile as the R internals.

There really should be a robust solution for this seemingly mundane
problem, but apparently it is far from mundane...

Best wishes,
Adrian


On Mon, May 24, 2021 at 3:29 PM Bertram, Alexander <[hidden email]>
wrote:

> Dear Adrian,
> I just wanted to chime in and underscore Tomas's point: the payload bits of
> IEEE 754 floating point values are no place to store data that you care
> about or need to keep. That is not only related to the R APIs, but also how
> processors handle floating point values and signaling and non-signaling
> NaNs. It is very difficult to reason about when and under which
> circumstances these bits are preserved. I spent a lot of time working on
> Renjin's handling of these values and I can assure you that any such scheme
> will end in tears.
>
> A far, far better option is to use R's attributes to store this kind of
> metadata. This is exactly what this language feature is for. There is
> already a standard 'levels' attribute that holds the labels of factors like
> "Yes", "No", "Refused", "Interviewer error" etc. In the past, I've worked
> on projects where we stored an additional attribute like "missingLevels"
> that stores extra metadata on which levels should be used in which kind of
> analysis. That way, you can preserve all the information, and then write a
> utility function which automatically applies certain logic to a whole
> dataframe just before passing the data to an analysis function. This is
> also important because in surveys like this, different values should be
> excluded at different times. For example, you might want to include all
> responses in a data quality report, but exclude interviewer error and
> refusals when conducting a PCA or fitting a model.
>
> Best,
> Alex
