Suggestion: mkString(NULL) should be NA

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Suggestion: mkString(NULL) should be NA

Jeroen Ooms.
I would like to propose that Rf_mkString(NULL) and Rf_mkChar(NULL)
return NA rather than segfault.

Case: the mkString() and mkChar() functions are convenient to wrap
strings returned by e.g. external C libraries into an R vector.
However sometimes a library returns NULL instead of a string when the
result is unavailable. In some C libraries this can happen
unexpectedly or is even undocumented.

A good R package author always checks results for a null pointer, and
deals with it accordingly. But sometimes we make assumptions. There
was an example in the 'curl' package where a documented version string
was suddenly NULL if libcurl was built with some unusual
configuration. These problems are hard to catch and I don't see any
benefit of segfaulting for such edge cases.

Some packages use a macro like this to protect against such problems:

  #define make_string(x) x ? Rf_mkString(x) : ScalarString(NA_STRING)

But I think it would make sense if this was the default behavior in
Rf_mkString and Rf_mkChar.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Jeroen Ooms.
On Thu, May 12, 2016 at 1:20 PM, Jeroen Ooms <[hidden email]> wrote:
> I would like to propose that Rf_mkString(NULL) and Rf_mkChar(NULL)
> return NA rather than segfault.

An example implementation: https://git.io/vroxm

With this patch, mkChar(NULL), mkCharCE(NULL, ce) would return
NA_STRING rather than segfault at strlen(NULL). This automatically
fixes mkString(NULL) as well which wraps mkChar (See Rinlinedfuns.h).

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Gabriel Becker
Shouldn't Rf_mkString(NULL) return (the c-level equivalent of) character()
rather than the NA_character_? An empty string and NULL aren't the same. It
seems reasonable for Rf_mkChar to give NA_character_ though.

~G

On Tue, May 24, 2016 at 8:42 AM, Jeroen Ooms <[hidden email]>
wrote:

> On Thu, May 12, 2016 at 1:20 PM, Jeroen Ooms <[hidden email]>
> wrote:
> > I would like to propose that Rf_mkString(NULL) and Rf_mkChar(NULL)
> > return NA rather than segfault.
>
> An example implementation: https://git.io/vroxm
>
> With this patch, mkChar(NULL), mkCharCE(NULL, ce) would return
> NA_STRING rather than segfault at strlen(NULL). This automatically
> fixes mkString(NULL) as well which wraps mkChar (See Rinlinedfuns.h).
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



--
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

R devel mailing list
In reply to this post by Jeroen Ooms.
Why should Rf_mkString(NULL) produce NA_STRING instead of ""
(R_BlankString)?  I prefer that passing in a nil pointer would cause
an error instead, as the nil may arise by accident, perhaps a pointer
to freed memory, and I would like to be notified that my code is bad instead
of getting a random NA_STRING.


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, May 24, 2016 at 8:42 AM, Jeroen Ooms <[hidden email]>
wrote:

> On Thu, May 12, 2016 at 1:20 PM, Jeroen Ooms <[hidden email]>
> wrote:
> > I would like to propose that Rf_mkString(NULL) and Rf_mkChar(NULL)
> > return NA rather than segfault.
>
> An example implementation: https://git.io/vroxm
>
> With this patch, mkChar(NULL), mkCharCE(NULL, ce) would return
> NA_STRING rather than segfault at strlen(NULL). This automatically
> fixes mkString(NULL) as well which wraps mkChar (See Rinlinedfuns.h).
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Jeroen Ooms.
In reply to this post by Gabriel Becker
On Tue, May 24, 2016 at 5:59 PM, Gabriel Becker <[hidden email]> wrote:
> Shouldn't Rf_mkString(NULL) return (the c-level equivalent of) character()
> rather than the NA_character_?

No. It should still be safe to assume that mkString() always returns a
character vector of exactly length one. Anything else could lead to
type errors.

> An empty string and NULL aren't the same.

Exactly! So if you pass in an empty C string, you get an empty R
string, and if you pass in a null pointer you get NA.

Rf_mkString(NULL)   <--> NA
Rf_mkString("")   <--> ""

There is no ambiguity, and much better than segfaulting.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Gabriel Becker
On Tue, May 24, 2016 at 9:30 AM, Jeroen Ooms <[hidden email]>
wrote:

> On Tue, May 24, 2016 at 5:59 PM, Gabriel Becker <[hidden email]>
> wrote:
> > Shouldn't Rf_mkString(NULL) return (the c-level equivalent of)
> character()
> > rather than the NA_character_?
>
> No. It should still be safe to assume that mkString() always returns a
> character vector of exactly length one. Anything else could lead to
> type errors.
>


Well the thing is you're passing an invalid pointer, that doesn't point to
a C string, to a constructor expecting a valid const char *. I'm fine with
the contract being that mkString always returns a character vector of
length one, but that doesn't necessarily mean that the function needs to
accept NULL pointers. The contract as I understand it is that if you give
it a C string, it will create a CHARSXP for that string. In this light,
Bill's suggestion that it throw an error seems the most principled
response. I would think you would need to at the very least emit a warning.


>
> > An empty string and NULL aren't the same.
>
> Exactly! So if you pass in an empty C string, you get an empty R
> string, and if you pass in a null pointer you get NA.
>
> Rf_mkString(NULL)   <--> NA
> Rf_mkString("")   <--> ""
>
> There is no ambiguity, and much better than segfaulting.
>

Well, better than segfaulting is not really relevant here. No one is
arguing that it should segfault. The question is what behavior it should
have when it doesn't segfault.

It's true that a C empty string is not the same as NULL, but NULL isn't the
same as NA either. Semantically, for your use-case (which I gather arose
from interactions we had :) ) the NULL means there is no version, while NA
indicates there is a version but we don't know what it is. Imagine an
object class that represents a persons name (first, middle, last). Now take
two people, One has no middle name (and we know that when creating the
object) and another for whom we don't have any information about the middle
name, only first and last were reported. I would expect the first one to
have middle name either NULL or (in a data.frame context) "", while the
second would have NA_character_. In this light, mkString should arguably
generate "". i don't think the fact that there is another way to get "" is
a particularly large problem.

On the other hand, and in support of your position it came up as Michael
Lawrence and I were talking about this that asChar from utils.c  will give
you NA_STRING when you give it R_NilValue. That is a coercion though,
whereas arguably mkString is not. That said, consistency would probably be
good.

~G


--
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Martin Maechler
>>>>> Gabriel Becker <[hidden email]>
>>>>>     on Tue, 24 May 2016 10:30:48 -0700 writes:

    > On Tue, May 24, 2016 at 9:30 AM, Jeroen Ooms <[hidden email]>
    > wrote:

    >> On Tue, May 24, 2016 at 5:59 PM, Gabriel Becker <[hidden email]>
    >> wrote:
    >> > Shouldn't Rf_mkString(NULL) return (the c-level equivalent of)
    >> character()
    >> > rather than the NA_character_?
    >>
    >> No. It should still be safe to assume that mkString() always returns a
    >> character vector of exactly length one. Anything else could lead to
    >> type errors.
    >>


    > Well the thing is you're passing an invalid pointer, that doesn't point to
    > a C string, to a constructor expecting a valid const char *. I'm fine with
    > the contract being that mkString always returns a character vector of
    > length one, but that doesn't necessarily mean that the function needs to
    > accept NULL pointers. The contract as I understand it is that if you give
    > it a C string, it will create a CHARSXP for that string. In this light,
    > Bill's suggestion that it throw an error seems the most principled
    > response. I would think you would need to at the very least emit a warning.

I agree with Jerooen that mkChar() and mkString() may be
used in contexts where they can end up with a NULL and hence
should not segfault... and hence am willing the extra (very
small) penalty of checking for NULL.

    >>
    >> > An empty string and NULL aren't the same.
    >>
    >> Exactly! So if you pass in an empty C string, you get an empty R
    >> string, and if you pass in a null pointer you get NA.
    >>
    >> Rf_mkString(NULL)   <--> NA
    >> Rf_mkString("")   <--> ""
    >>
    >> There is no ambiguity, and much better than segfaulting.

Better than segfaulting, yes, but really agree with Bill (and
Gabe), also for Rf_mkChar(NULL):
I think both functions should give an error in such a case
rather than returning NA_character_

It is an accident of some kind if they got NULL, no?

--
Martin Maechler,
ETH Zurich



    > Well, better than segfaulting is not really relevant here. No one is
    > arguing that it should segfault. The question is what behavior it should
    > have when it doesn't segfault.

    > It's true that a C empty string is not the same as NULL, but NULL isn't the
    > same as NA either. Semantically, for your use-case (which I gather arose
    > from interactions we had :) ) the NULL means there is no version, while NA
    > indicates there is a version but we don't know what it is. Imagine an
    > object class that represents a persons name (first, middle, last). Now take
    > two people, One has no middle name (and we know that when creating the
    > object) and another for whom we don't have any information about the middle
    > name, only first and last were reported. I would expect the first one to
    > have middle name either NULL or (in a data.frame context) "", while the
    > second would have NA_character_. In this light, mkString should arguably
    > generate "". i don't think the fact that there is another way to get "" is
    > a particularly large problem.

    > On the other hand, and in support of your position it came up as Michael
    > Lawrence and I were talking about this that asChar from utils.c  will give
    > you NA_STRING when you give it R_NilValue. That is a coercion though,
    > whereas arguably mkString is not. That said, consistency would probably be
    > good.

    > ~G


    > --
    > Gabriel Becker, PhD
    > Associate Scientist (Bioinformatics)
    > Genentech Research

    > [[alternative HTML version deleted]]

    > ______________________________________________
    > [hidden email] mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Jeroen Ooms.
On Wed, May 25, 2016 at 12:31 PM, Martin Maechler
<[hidden email]> wrote:
> Better than segfaulting, yes, but really agree with Bill (and
> Gabe), also for Rf_mkChar(NULL):
> I think both functions should give an error in such a case
> rather than returning NA_character_
>
> It is an accident of some kind if they got NULL, no?

Not necessarily. A char* of NULL can be a string which is not
initiated or simply unavailable due to configuration.

The example from my original email was in curl package which exposes
the version string of libz that was used to build libcurl:

  mkString(data->libz_version)

This worked on all platforms that I tested. However a user found that
if libcurl was configured --without-libz (which is uncommon) the
libz_version string does not get set by libcurl and is always NULL. I
had not foreseen this and it would lead to a segfault.

I think making mkString() return NA for null strings lead to the most
robust behavior. Raising an exception seems a little harsh to me, as
there is no way the user would be able to recover from this, and there
might not be an actual problem at all.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Michael Lawrence-3
On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <[hidden email]>
wrote:

> On Wed, May 25, 2016 at 12:31 PM, Martin Maechler
> <[hidden email]> wrote:
> > Better than segfaulting, yes, but really agree with Bill (and
> > Gabe), also for Rf_mkChar(NULL):
> > I think both functions should give an error in such a case
> > rather than returning NA_character_
> >
> > It is an accident of some kind if they got NULL, no?
>
> Not necessarily. A char* of NULL can be a string which is not
> initiated or simply unavailable due to configuration.
>
> The example from my original email was in curl package which exposes
> the version string of libz that was used to build libcurl:
>
>   mkString(data->libz_version)
>
> This worked on all platforms that I tested. However a user found that
> if libcurl was configured --without-libz (which is uncommon) the
> libz_version string does not get set by libcurl and is always NULL. I
> had not foreseen this and it would lead to a segfault.
>
> I think making mkString() return NA for null strings lead to the most
> robust behavior. Raising an exception seems a little harsh to me, as
> there is no way the user would be able to recover from this, and there
> might not be an actual problem at all.
>
>
Robust in the sense of no error being thrown, but perhaps only correct by
accident. NULL is not a valid C string --- should functions always return
NA on invalid input? As Gabe mentions, in the cited use case, it's not
clear whether the appropriate value is NA, "", or something else entirely.
Generalization seems risky at this point.


> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Tim Keitt-3
On Wed, May 25, 2016 at 7:22 AM, Michael Lawrence <[hidden email]
> wrote:

> On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <[hidden email]>
> wrote:
>

I'm not disagreeing with what's been said in this thread, but I can't help
but recall that I brought up this exact issue probably 15 years ago and was
told (by Brian, I believe) "don't do that" (pass a null pointer), which was
perfectly fine. The real issue was not the behavior but that it was not
documented or consistent. I've lived by the mantra since that you can never
trust a pointer in R code. User must always check for NULL.

I just wrote my own functions mkXXX_safe that wrap the internals and check
the pointer.

THK

http://www.keittlab.org/

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Tierney, Luke
On Wed, 25 May 2016, Tim Keitt wrote:

> On Wed, May 25, 2016 at 7:22 AM, Michael Lawrence <[hidden email]
>> wrote:
>
>> On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <[hidden email]>
>> wrote:
>>
>
> I'm not disagreeing with what's been said in this thread, but I can't help
> but recall that I brought up this exact issue probably 15 years ago and was
> told (by Brian, I believe) "don't do that" (pass a null pointer), which was
> perfectly fine. The real issue was not the behavior but that it was not
> documented or consistent. I've lived by the mantra since that you can never
> trust a pointer in R code. User must always check for NULL.

In _C_ code.  This is true whether you are calling into the R C API or
any other C library: you as the C programmer need to make sure either
that passing NULL is OK or make sure you don't do that.

I wouldn't object to mkXXX checking for NULL and signaling an error
instead of segfaulting, but good C code calling mkXXX should still
typically do its own check and handle the situation in an appropriate
way.

Best,

luke


>
> I just wrote my own functions mkXXX_safe that wrap the internals and check
> the pointer.
>
> THK
>
> http://www.keittlab.org/
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Tim Keitt-3
http://www.keittlab.org/

On Wed, May 25, 2016 at 10:43 AM, <[hidden email]> wrote:

> On Wed, 25 May 2016, Tim Keitt wrote:
>
> On Wed, May 25, 2016 at 7:22 AM, Michael Lawrence <
>> [hidden email]
>>
>>> wrote:
>>>
>>
>> On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <[hidden email]>
>>> wrote:
>>>
>>>
>> I'm not disagreeing with what's been said in this thread, but I can't help
>> but recall that I brought up this exact issue probably 15 years ago and
>> was
>> told (by Brian, I believe) "don't do that" (pass a null pointer), which
>> was
>> perfectly fine. The real issue was not the behavior but that it was not
>> documented or consistent. I've lived by the mantra since that you can
>> never
>> trust a pointer in R code. User must always check for NULL.
>>
>
> In _C_ code.  This is true whether you are calling into the R C API or
> any other C library: you as the C programmer need to make sure either
> that passing NULL is OK or make sure you don't do that.
>

I agree -- I meant it was "perfectly fine" to remind us we need to check
pointers. Its really a documentation issue.

THK




>
> I wouldn't object to mkXXX checking for NULL and signaling an error
> instead of segfaulting, but good C code calling mkXXX should still
> typically do its own check and handle the situation in an appropriate
> way.
>
> Best,
>
> luke
>
>
>
>> I just wrote my own functions mkXXX_safe that wrap the internals and check
>> the pointer.
>>
>> THK
>>
>> http://www.keittlab.org/
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:   [hidden email]
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion: mkString(NULL) should be NA

Pages, Herve
In reply to this post by Tierney, Luke
Hi,

I tend to agree with the objections expressed earlier. I would only
add that making the NULL pointer semantically equivalent to NA would
introduce a precedent that could lead to some confusion. For example
it would set the expectation that CHAR(Rf_mkChar(NULL)) is NULL,
which is not the case AFAIK. Or that low-level string manipulation
utilities that take a C-string as input (e.g. Rf_reEnc()) accept NULL
and propagate it.

Of course these things can be modified to be consistent with the new
"NULL pointer == NA" paradigm but this might end up being a bigger move
than what it seems at first sight...

Cheers,
H.


On 05/25/2016 08:43 AM, [hidden email] wrote:

> On Wed, 25 May 2016, Tim Keitt wrote:
>
>> On Wed, May 25, 2016 at 7:22 AM, Michael Lawrence
>> <[hidden email]
>>> wrote:
>>
>>> On Wed, May 25, 2016 at 4:23 AM, Jeroen Ooms <[hidden email]>
>>> wrote:
>>>
>>
>> I'm not disagreeing with what's been said in this thread, but I can't
>> help
>> but recall that I brought up this exact issue probably 15 years ago
>> and was
>> told (by Brian, I believe) "don't do that" (pass a null pointer),
>> which was
>> perfectly fine. The real issue was not the behavior but that it was not
>> documented or consistent. I've lived by the mantra since that you can
>> never
>> trust a pointer in R code. User must always check for NULL.
>
> In _C_ code.  This is true whether you are calling into the R C API or
> any other C library: you as the C programmer need to make sure either
> that passing NULL is OK or make sure you don't do that.
>
> I wouldn't object to mkXXX checking for NULL and signaling an error
> instead of segfaulting, but good C code calling mkXXX should still
> typically do its own check and handle the situation in an appropriate
> way.
>
> Best,
>
> luke
>
>
>>
>> I just wrote my own functions mkXXX_safe that wrap the internals and
>> check
>> the pointer.
>>
>> THK
>>
>> http://www.keittlab.org/
>>
>>     [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [hidden email]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel