[patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).

Pavel N. Krivitsky-2
Dear R-devel,

While tracking down some hard-to-reproduce bugs in a package I maintain,
I stumbled on a behavior change between R 2.15.0 and the current R-devel
(or SVN trunk).

In 2.15.0 and earlier, if you passed a 0-length vector of the right
mode (e.g., double(0) or integer(0)) as one of the arguments in a .C()
call with DUP=TRUE (the default), the C routine would be passed NULL
(the C pointer, not R NULL) in the corresponding argument. The current
development version instead passes it a pointer to what appears to be
the memory location immediately following the SEXP that holds the
metadata for the argument. If the argument has length 0, this is often
memory belonging to a different R object. (DUP=FALSE in 2.15.0
appears to have the same behavior as R-devel.)

.C() documentation and Writing R Extensions don't explicitly specify a
behavior for 0-length vectors, so I don't know if this change is
intentional, or whether it was a side-effect of the following news item:

      .C() and .Fortran() do less copying: arguments which are raw,
      logical, integer, real or complex vectors and are unnamed are not
      copied before the call, and (named or not) are not copied after
      the call.  Lists are no longer copied (they are supposed to be
      used read-only in the C code).

Was the change in the empty vector behavior intentional?

It seems to me that standardizing on the behavior of giving the C
routine NULL is safer, more consistent with other memory-related
routines, and more convenient: whereas dereferencing a NULL pointer is
an immediate (and therefore easily traced) segfault, dereferencing an
invalid pointer that is nevertheless in the general memory area
allocated to the program often causes subtle errors down the line;
R_alloc asked to allocate 0 bytes returns NULL, at least on my platform;
and the C routine can easily check if a pointer is NULL, but with the
R-devel behavior, the programmer has to add an explicit way of telling
that an empty vector was passed.

I've attached a small test case (dotC_NULL.* files) that shows the
difference. The C file should be built with R CMD SHLIB, and the R file
calls the functions in the library with a variety of arguments. The
outputs I get from running
R CMD BATCH --no-timing --vanilla --slave dotC_NULL.R
on R 2.15.0, R trunk, and R trunk with my patch (described below) are attached.

The attached patch (dotC_NULL.patch) against the current trunk
(affecting src/main/dotcode.c) restores the old behavior for DUP=TRUE
(i.e., 0-length vector -> NULL pointer) and extends it to the DUP=FALSE
case. It does so by checking if an argument --- if it's of mode raw,
integer, real, or complex --- to a .C() or .Fortran() call has length 0,
and, if so, sets the pointer to be passed to NULL and then skips the
copying of the C routine's changes back to the R object for that
argument. The additional computing cost should be negligible (i.e.,
checking if vector length equals 0 and break-ing out of a switch
statement if so).
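
As a rough illustration of the logic described above, here is a toy model
in plain C. It is not the code in src/main/dotcode.c: the toy_vector type
and both function names are made up for this sketch, and R's real argument
handling is considerably more involved. The idea is only that an empty
vector yields a NULL pointer and that the copy-back step is skipped for it.

```c
#include <stddef.h>

/* Hypothetical stand-in for an R numeric vector: a length plus a data
   area. R's real SEXP layout is different. */
typedef struct {
    size_t length;
    double *data;
} toy_vector;

/* Pointer to hand to the foreign routine: NULL for a 0-length vector,
   otherwise the vector's data area. */
double *toy_arg_pointer(const toy_vector *v)
{
    if (v->length == 0)
        return NULL;
    return v->data;
}

/* Copy the routine's results back into the vector. For a 0-length
   vector (or a NULL source) there is nothing to copy, so return early. */
void toy_copy_back(toy_vector *v, const double *from_c)
{
    if (v->length == 0 || from_c == NULL)
        return;
    for (size_t i = 0; i < v->length; i++)
        v->data[i] = from_c[i];
}
```

The extra work per argument is one length comparison, which is why the
cost is expected to be negligible.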

The patch appears to work, at least for my package, and R CMD check
passes for all recommended packages (on my 64-bit Linux system), but
this is my first time working with R's internals, so handle with care.

                                   Best,
                                   Pavel Krivitsky


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

dotC_NULL.2.15.0.Rout (1K) Download Attachment
dotC_NULL.trunk.Rout (1K) Download Attachment
dotC_NULL.trunk.patched.Rout (1K) Download Attachment
dotC_NULL.R (761 bytes) Download Attachment
dotC_NULL.patch (2K) Download Attachment

Re: [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).

Pavel N. Krivitsky-2
Oops... Forgot to attach the dotC_NULL.c, the C source file for the test
case.

                  Pavel Krivitsky


Re: [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0). --- Missing C file.

Pavel N. Krivitsky-2
Hi,

It looks like I didn't forget to attach it after all, but R-devel strips
C source code files. Remove the ".txt" from the attached file to compile
the test case.

                         Best,
                         Pavel

dotC_NULL.c.txt (277 bytes) Download Attachment

Re: [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).

Prof Brian Ripley
On 04/05/2012 18:42, Pavel N. Krivitsky wrote:

> Dear R-devel,
>
> While tracking down some hard-to-reproduce bugs in a package I maintain,
> I stumbled on a behavior change between R 2.15.0 and the current R-devel
> (or SVN trunk).
>
> In 2.15.0 and earlier, if you passed a 0-length vector of the right
> mode (e.g., double(0) or integer(0)) as one of the arguments in a .C()
> call with DUP=TRUE (the default), the C routine would be passed NULL
> (the C pointer, not R NULL) in the corresponding argument. The current

Where did you get that from?  The documentation says it passes an (e.g.)
double* pointer to a copy of the data area of the R vector.  There is no
change in the documented behaviour ....  Now, of course a zero-length
area can be at any address, but none is stated anywhere that I am aware of.

> development version instead passes it a pointer to what appears to be
> the memory location immediately following the SEXP that holds the
> metadata for the argument. If the argument has length 0, this is often
> memory belonging to a different R object. (DUP=FALSE in 2.15.0
> appears to have the same behavior as R-devel.)
>
> .C() documentation and Writing R Extensions don't explicitly specify a
> behavior for 0-length vectors, so I don't know if this change is
> intentional, or whether it was a side-effect of the following news item:
>
>        .C() and .Fortran() do less copying: arguments which are raw,
>        logical, integer, real or complex vectors and are unnamed are not
>        copied before the call, and (named or not) are not copied after
>        the call.  Lists are no longer copied (they are supposed to be
>        used read-only in the C code).
>
> Was the change in the empty vector behavior intentional?
>
> It seems to me that standardizing on the behavior of giving the C
> routine NULL is safer, more consistent with other memory-related
> routines, and more convenient: whereas dereferencing a NULL pointer is
> an immediate (and therefore easily traced) segfault, dereferencing an

That's not true, in general.

> invalid pointer that is nevertheless in the general memory area
> allocated to the program often causes subtle errors down the line;
> R_alloc asked to allocate 0 bytes returns NULL, at least on my platform;

Again, undocumented and should not be relied on.

> and the C routine can easily check if a pointer is NULL, but with the
> R-devel behavior, the programmer has to add an explicit way of telling
> that an empty vector was passed.

It's no different from any other vector length: it is easy for careless
programmers to read/write off the ends of the allocated area, and this
is why in R-devel we have an option to check for that (and of course
also what valgrind is good at finding in an instrumented version of R).

> I've attached a small test case (dotC_NULL.* files) that shows the
> difference. The C file should be built with R CMD SHLIB, and the R file
> calls the functions in the library with a variety of arguments. Output I
> get from running
> R CMD BATCH --no-timing --vanilla --slave dotC_NULL.R
> on R 2.15.0, R trunk, and R trunk with my patch (described below) are attached.
>
> The attached patch (dotC_NULL.patch) against the current trunk
> (affecting src/main/dotcode.c) restores the old behavior for DUP=TRUE
> (i.e., 0-length vector ->  NULL pointer) and extends it to the DUP=FALSE
> case. It does so by checking if an argument --- if it's of mode raw,
> integer, real, or complex --- to a .C() or .Fortran() call has length 0,
> and, if so, sets the pointer to be passed to NULL and then skips the
> copying of the C routine's changes back to the R object for that
> argument. The additional computing cost should be negligible (i.e.,
> checking if vector length equals 0 and break-ing out of a switch
> statement if so).
>
> The patch appears to work, at least for my package, and R CMD check
> passes for all recommended packages (on my 64-bit Linux system), but
> this is my first time working with R's internals, so handle with care.

That's easy: we will not be changing this.  In particular, the new
checks I refer to above rely on passing the address of an in-process
memory area with guard bytes.
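
To make the guard-byte idea concrete, here is a minimal sketch in plain C.
It is an illustration only, not R's implementation: the function names,
the guard size and the fill pattern are all assumptions. The point it
shows is that even a zero-length payload gets a real, non-NULL address
inside the process, so a stray write through it lands in a guard region
and can be detected after the call.

```c
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN 8      /* assumed guard size, chosen for illustration */
#define GUARD_BYTE 0xFB  /* assumed fill pattern */

/* Allocate nbytes of payload with GUARD_LEN guard bytes on each side.
   Returns a pointer to the payload (non-NULL even when nbytes is 0),
   or NULL if the underlying allocation fails. */
unsigned char *guarded_alloc(size_t nbytes)
{
    unsigned char *raw = malloc(nbytes + 2 * GUARD_LEN);
    if (raw == NULL)
        return NULL;
    memset(raw, GUARD_BYTE, GUARD_LEN);
    memset(raw + GUARD_LEN + nbytes, GUARD_BYTE, GUARD_LEN);
    return raw + GUARD_LEN;
}

/* Return 1 if both guard regions are intact, 0 if either was written to. */
int guards_intact(const unsigned char *payload, size_t nbytes)
{
    const unsigned char *before = payload - GUARD_LEN;
    const unsigned char *after = payload + nbytes;
    for (size_t i = 0; i < GUARD_LEN; i++)
        if (before[i] != GUARD_BYTE || after[i] != GUARD_BYTE)
            return 0;
    return 1;
}

/* Release a buffer obtained from guarded_alloc(). */
void guarded_free(unsigned char *payload)
{
    if (payload != NULL)
        free(payload - GUARD_LEN);
}
```

A check of this kind notices writes outside the payload; it cannot notice
reads, and it depends on the routine receiving the payload address rather
than NULL.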



--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).

Pavel N. Krivitsky-2
Dear Professor Ripley and R-devel,

Thank you for taking the time to look into this. I understand that the
2.15.0 behavior in question was undocumented and ambiguous (at least as
of that release), and that it should not have been relied upon, however
intuitive it seemed. My suggestion is that in the next release, it ought
to be the standard, documented behavior, not just because it's
historical, but because it's more convenient and safer.

From the point of view of programmer convenience, having a 0-length
vector on the R side always map to a NULL pointer on the C side provides
a useful bit of information that the programmer can use, while a
non-NULL pointer to no data isn't useful, and the current R-devel
behavior requires the programmer to pass the information about whether
it's empty through an additional argument (and the number of arguments
has an upper limit). For example, if a procedure implemented in C takes
optional weights, passing a double(0) that was translated to NULL could
be used to signal that there are no weights. Also, while the .Call()
interface allows an R vector passed to it to be resized, the .C() and
.Fortran() interfaces don't, so a 0-length R vector passed via .C() or
.Fortran() can be neither read nor written to, and nothing is lost by
passing it as NULL.
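
A sketch of the optional-weights case I have in mind, written as a
.C()-style routine in plain C. The function name and argument order are
hypothetical, and it only makes sense if a 0-length vector is guaranteed
to arrive as NULL, which is the behavior being proposed and not something
R-devel currently provides.

```c
#include <stddef.h>

/* Hypothetical .C()-style routine: every argument is a pointer and the
   result is written through the last one. A NULL w is taken to mean
   "no weights", i.e. every weight is 1. */
void toy_weighted_sum(const double *x, const int *n,
                      const double *w, double *result)
{
    double total = 0.0;
    for (int i = 0; i < *n; i++)
        total += (w != NULL ? w[i] : 1.0) * x[i];
    *result = total;
}
```

On the R side this would be called with w = double(0) when there are no
weights, with no extra flag or length argument needed.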

On the issue of safety, while dereferencing a NULL pointer is not an
instant segfault on absolutely every system, it is the case for the
overwhelming majority of modern systems on which anyone is likely to run
a recent version of R. For those systems for which it is not the case,
the behavior is no worse than dereferencing a non-NULL pointer to no
data. Moreover, while it's easy to check if a pointer is NULL,
there is no general way to check whether a non-NULL pointer is valid, so
if the 0-length->NULL behavior is made the standard and documented,
package developers may be more likely to make use of it to check.

On the issue of instrumentation and debugging, again, I think it comes
down to programmer convenience. Segmentation faults caused by NULL
dereferencing can be caught and debugged interactively with a debugger
like GDB, while non-NULL memory errors have less predictable
consequences and require tools like the slower and non-interactive
Valgrind. Perhaps R's new guard bytes will change that somewhat, but,
from what I've read, they only check for invalid writes, not invalid
reads, which can cause almost as much trouble. On the other hand, both
trying to read from a NULL pointer and trying to write to it will be
detected on most systems. And, having 0-length vectors be passed as NULL
does not preclude using guard bytes on non-0-length vectors.

To summarize, I think that on both safety and convenience, standardizing
on the 0-length->NULL behavior dominates the 0-length->invalid-pointer
behavior: in each scenario that either of us has brought up so far, it
behaves no worse and often better. My patch does not include changes to
documentation, and, if you like, I am willing to write one that does. If
my patch can be improved in some other way, please let me know and I
will try to improve it.

                            Sincerely,
                            Pavel Krivitsky




Re: [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).

Simon Urbanek
On May 26, 2012, at 1:02 PM, Pavel N. Krivitsky wrote:

> Dear Professor Ripley and R-devel,
>
> Thank you for taking the time to look into this. I understand that the
> 2.15.0 behavior in question was undocumented and ambiguous (at least as
> of that release), and it should not have been relied upon, intuitiveness
> or not. My suggestion is that in the next release, it ought to be the
> standard, documented behavior, not just because it's historical, but
> because it's more convenient and safer.
>

That is bogus - .C is inherently unsafe wrt vector lengths so talking about safety here is IMHO nonsensical. Your "safety" relies on bombing the program - that is arguably much less safe than using the checks that Brian was talking about because they are recoverable. You can argue either way, but there is no winner - the real answer is to use .Call() instead.


> From the point of view of programmer convenience, having a 0-length
> vector on the R side always map to a NULL pointer on the C side provides
> a useful bit of information that the programmer can use, while a
> non-NULL pointer to no data isn't useful, and the current R-devel
> behavior requires the programmer to pass the information about whether
> it's empty through an additional argument (of which there is an upper
> limit). For example, if a procedure implemented in C takes optional
> weights, passing a double(0) that was translated to NULL could be used
> to signal that there are no weights.

That would be just plain wrong use that certainly should not be encouraged - you *have* to pass the length along with any vectors passed to .C (that's why you should not be even thinking of using .C in the first place!) so it is much safer to check that the length you passed is 0 rather than relying on special-casing into NULL pointers.
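
For illustration, a minimal sketch of the length-based version in plain C. The function and argument names are hypothetical, and it assumes the R caller has already checked that the weights length is either 0 or equal to the data length. The routine decides from the length it was given and never inspects the pointer value.

```c
/* Hypothetical .C()-style routine: lengths travel alongside the vectors.
   Assumes the caller guarantees *nw is either 0 (no weights) or *n. */
void toy_weighted_sum_len(const double *x, const int *n,
                          const double *w, const int *nw,
                          double *result)
{
    double total = 0.0;
    int use_weights = (*nw > 0);  /* decided from the length, not the pointer */
    for (int i = 0; i < *n; i++)
        total += (use_weights ? w[i] : 1.0) * x[i];
    *result = total;
}
```

Written this way, the routine behaves the same whether an empty vector arrives as NULL or as some other address, since w is never touched when its length is 0.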

Cheers,
Simon


> Also, while the .Call() interface
> allows an R vector passed to it to be resized, the .C() and .Fortran()
> interfaces don't, so a 0-length R vector passed via .C() or .Fortran()
> can be neither read nor written to, and nothing is lost by passing it as
> NULL.
>
> On the issue of safety, while dereferencing a NULL pointer is not an
> instant segfault on absolutely every system, it is the case for the
> overwhelming majority of modern systems on which anyone is likely to run
> a recent version of R. For those systems for which it is not the case,
> the behavior is no worse than dereferencing a non-NULL pointer to no
> data. On the contrary, while it's easy to check if a pointer is NULL,
> there is no general way to check whether a non-NULL pointer is valid, so
> if the 0-length->NULL behavior is made the standard and documented,
> package developers may be more likely to make use of it to check.
>
> On the issue of instrumentation and debugging, again, I think it comes
> down to programmer convenience. Segmentation faults caused by NULL
> dereferencing can be caught and debugged interactively with a debugger
> like GDB, while non-NULL memory errors have less predictable
> consequences and require tools like the slower and non-interactive
> Valgrind. Perhaps R's new guard bytes will change that somewhat, but,
> from what I've read, they only check for invalid writes, not invalid
> reads, which can cause almost as much trouble. On the other hand, both
> trying to read from a NULL pointer and trying to write to it will be
> detected on most systems. And, having 0-length vectors be passed as NULL
> does not preclude using guard bytes on non-0-length vectors.
>
> To summarize, I think that on both safety and convenience, standardizing
> on the 0-length->NULL behavior dominates the 0-length->invalid-pointer
> behavior: in each scenario that either of us has brought up so far, it
> behaves no worse and often better. My patch does not include changes to
> documentation, and, if you like, I am willing to write one that does. If
> my patch can be improved in some other way, please let me know and I
> will try to improve it.
>
>                            Sincerely,
>                            Pavel Krivitsky
>
>
>
> On Thu, 2012-05-17 at 10:46 +0100, Prof Brian Ripley wrote:
>> On 04/05/2012 18:42, Pavel N. Krivitsky wrote:
>>> Dear R-devel,
>>>
>>> While tracking down some hard-to-reproduce bugs in a package I maintain,
>>> I stumbled on a behavior change between R 2.15.0 and the current R-devel
>>> (or SVN trunk).
>>>
>>> In 2.15.0 and earlier, if you passed an 0-length vector of the right
>>> mode (e.g., double(0) or integer(0)) as one of the arguments in a .C()
>>> call with DUP=TRUE (the default), the C routine would be passed NULL
>>> (the C pointer, not R NULL) in the corresponding argument. The current
>>
>> Where did you get that from?  The documentation says it passes an (e.g.)
>> double* pointer to a copy of the data area of the R vector.  There is no
>> change in the documented behaviour ....  Now, of course a zero-length
>> area can be at any address, but none is stated anywhere that I am aware of.
>>
>>> development version instead passes it a pointer to what appears to be
>>> memory location immediately following the the SEXP that holds the
>>> metadata for the argument. If the argument has length 0, this is often
>>> memory belonging to a different R object. (DUP=FALSE in 2.15.0
>>> appears to have the same behavior as R-devel.)
>>>
>>> .C() documentation and Writing R Extensions don't explicitly specify a
>>> behavior for 0-length vectors, so I don't know if this change is
>>> intentional, or whether it was a side-effect of the following news item:
>>>
>>>       .C() and .Fortran() do less copying: arguments which are raw,
>>>       logical, integer, real or complex vectors and are unnamed are not
>>>       copied before the call, and (named or not) are not copied after
>>>       the call.  Lists are no longer copied (they are supposed to be
>>>       used read-only in the C code).
>>>
>>> Was the change in the empty vector behavior intentional?
>>>
>>> It seems to me that standardizing on the behavior of giving the C
>>> routine NULL is safer, more consistent with other memory-related
>>> routines, and more convenient: whereas dereferencing a NULL pointer is
>>> an immediate (and therefore easily traced) segfault, dereferencing an
>>
>> That's not true, in general.
>>
>>> invalid pointer that is nevertheless in the general memory area
>>> allocated to the program often causes subtle errors down the line;
>>> R_alloc asked to allocate 0 bytes returns NULL, at least on my platform;
>>
>> Again, undocumented and should not be relied on.
>>
>>> and the C routine can easily check if a pointer is NULL, but with the
>>> R-devel behavior, the programmer has to add an explicit way of telling
>>> that an empty vector was passed.
>>
>> It's no different from any other vector length: it is easy for careless
>> programmers to read/write off the ends of the allocated area, and this
>> is why in R-devel we have an option to check for that (and of course
>> also what valgrind is good at finding in an instrumented version of R).
>>
>>> I've attached a small test case (dotC_NULL.* files) that shows the
>>> difference. The C file should be built with R CMD SHLIB, and the R file
>>> calls the functions in the library with a variety of arguments. The outputs I
>>> get from running
>>> R CMD BATCH --no-timing --vanilla --slave dotC_NULL.R
>>> on R 2.15.0, R trunk, and R trunk with my patch (described below) are attached.
>>>
>>> The attached patch (dotC_NULL.patch) against the current trunk
>>> (affecting src/main/dotcode.c) restores the old behavior for DUP=TRUE
>>> (i.e., 0-length vector ->  NULL pointer) and extends it to the DUP=FALSE
>>> case. It does so by checking if an argument --- if it's of mode raw,
>>> integer, real, or complex --- to a .C() or .Fortran() call has length 0,
>>> and, if so, sets the pointer to be passed to NULL and then skips the
>>> copying of the C routine's changes back to the R object for that
>>> argument. The additional computing cost should be negligible (i.e.,
>>> checking if vector length equals 0 and break-ing out of a switch
>>> statement if so).
>>>
>>> The patch appears to work, at least for my package, and R CMD check
>>> passes for all recommended packages (on my 64-bit Linux system), but
>>> this is my first time working with R's internals, so handle with care.
>>
>> That's easy: we will not be changing this.  In particular, the new
>> checks I refer to above rely on passing the address of an in-process
>> memory area with guard bytes.
>>
>>>                                    Best,
>>>                                    Pavel Krivitsky

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).

Dirk Eddelbuettel

On 26 May 2012 at 14:00, Simon Urbanek wrote:
| [...] the real answer is use .Call() instead.

Maybe Kurt could add something to that effect to the R FAQ?

Dirk

--
Dirk Eddelbuettel | [hidden email] | http://dirk.eddelbuettel.com


Re: [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).

Pavel N. Krivitsky-2
In reply to this post by Simon Urbanek
Dear Simon,

On Sat, 2012-05-26 at 14:00 -0400, Simon Urbanek wrote:
> > My suggestion is that in the next release, it ought to be the
> > standard, documented behavior, not just because it's historical, but
> > because it's more convenient and safer.
>
> That is bogus - .C is inherently unsafe wrt vector lengths so talking
> about safety here is IMHO nonsensical. Your "safety" relies on bombing
> the program -

IMHO, not all memory errors are created equal. From the safety
perspective, an error that immediately bombs the program is preferable
to one that corrupts memory and produces subtle problems much later, or
one that reads the wrong memory area and goes into an infinite loop,
allocates gigabytes of RAM, etc.

> that is arguably much less safe than using checks that Brian was
> talking about because they are recoverable.

While the checks are undoubtedly useful for debugging, I don't think the
errors they catch are particularly recoverable in practice. At best, they
tell you that some memory just before or after the allocated area has been
overwritten. They cannot tell you how much memory was overwritten, or
whether R is now in an inconsistent state (which may occur if the write
is off by more than 64 bytes, I believe) and should be restarted
immediately, taking only the time to save the data and history, which is
what a caught segfault in R does anyway, at least on UNIX-alikes.

Furthermore, the guard bytes only trigger after the C routine exits, so
the error is only caught some time after it occurs, which makes
debugging it more difficult. (In contrast, a debugger like GDB can tell
exactly which C statement caused a segmentation fault.)
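
To make the timing point concrete, here is a minimal sketch of the general
guard-byte technique. It is not R's actual implementation: the names
guarded_alloc, guards_intact and guarded_free, the 64-byte guard size and
the fill pattern are all assumptions made for illustration. The caller
gets a non-NULL pointer even for a 0-length payload, and an overrun is
only noticed when the guards are inspected after the routine has returned.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  64    /* assumed guard size, for illustration only */
#define GUARD_BYTE 0xAB  /* arbitrary fill pattern */

/* Allocate n payload bytes with GUARD_LEN guard bytes on each side.
 * Returns a pointer to the payload; it is non-NULL even when n == 0. */
unsigned char *guarded_alloc(size_t n)
{
    unsigned char *base = malloc(n + 2 * GUARD_LEN);
    if (base == NULL)
        return NULL;
    memset(base, GUARD_BYTE, n + 2 * GUARD_LEN);
    return base + GUARD_LEN;
}

/* Return 1 if both guard regions still hold the fill pattern, else 0.
 * This can only be called after the C routine has finished, so it
 * reports that an overrun happened, not which statement caused it. */
int guards_intact(const unsigned char *payload, size_t n)
{
    const unsigned char *lo = payload - GUARD_LEN;
    const unsigned char *hi = payload + n;
    for (size_t i = 0; i < GUARD_LEN; i++)
        if (lo[i] != GUARD_BYTE || hi[i] != GUARD_BYTE)
            return 0;
    return 1;
}

/* Release a block obtained from guarded_alloc(). */
void guarded_free(unsigned char *payload)
{
    assert(payload != NULL);
    free(payload - GUARD_LEN);
}
```

In this sketch a write to payload[0] of a 0-length payload lands in the
upper guard and is flagged by guards_intact() afterwards, whereas a stray
read, or a write beyond the guard, goes unnoticed.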

The one advantage guard bytes might have over NULL (for a 0-length
vector) is that an error caught by a guard byte might allow the
developer to browse (via options(error=recover)) the R function that
made the .C() call, but even that relies on the bug not overwriting more
than a few bytes, and it cannot detect improper reads.

> You can argue either way, but there is no winner - the real answer is
> use .Call() instead.

It seems to me that the 0-length->NULL approach still dominates on the
matter of safety and debugging, with a possible exception in what I am
pretty sure is a relatively rare scenario when the developer has passed
a 0-length vector via .C() _and_ it was written to _and_ the developer
wants to browse (using options(error=recover)) the R code leading up to the
problematic .C() call, rather than browse (via GDB) the C code that
triggered the segfault. In that scenario, the developer can still easily
infer what argument was passed as an empty vector and via what .C()
call. (Standardizing on 0-length->NULL does not preclude putting guard
bytes on non-empty vectors, of course.)

> > From the point of view of programmer convenience, having a 0-length
> > vector on the R side always map to a NULL pointer on the C side provides
> > a useful bit of information that the programmer can use, while a
> > non-NULL pointer to no data isn't useful, and the current R-devel
> > behavior requires the programmer to pass the information about whether
> > it's empty through an additional argument (of which there is an upper
> > limit). For example, if a procedure implemented in C takes optional
> > weights, passing a double(0) that was translated to NULL could be used
> > to signal that there are no weights.
>
> That would be just plain wrong use that certainly should not be
> encouraged - you *have* to pass the length along with any vectors
> passed to .C (that's why you should not be even thinking of using .C
> in the first place!) so it is much safer to check that the length you
> passed is 0 rather than relying on special-casing into NULL pointers.

Not necessarily. In the weighted data scenario, the length of the data
vector would, presumably, be passed in a different argument, and, if
weights exist, their length would be equal to that. The NULL here could be
a binary signal not to use weights.
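
For concreteness, a minimal sketch of the weighted-data scenario, not code
from any package discussed here. The routine name wmean and its argument
list are hypothetical. This version carries the "no weights" signal in an
explicit count (nw), so it behaves the same whatever address is passed for
a 0-length vector; the variant argued for above would drop nw and test
w == NULL instead, which works only if a 0-length vector is guaranteed to
arrive as NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical .C()-style routine: every argument arrives as a pointer.
 * *nw is the number of weights: 0 means "no weights", otherwise it must
 * equal *n. The pointer w itself is never compared with NULL.
 * assert() is used for brevity in this sketch; a real package would
 * report the mismatch through R's error mechanism rather than abort. */
void wmean(double *x, int *n, double *w, int *nw, double *out)
{
    assert(*nw == 0 || *nw == *n);

    double sum = 0.0, wsum = 0.0;
    for (int i = 0; i < *n; i++) {
        double wi = (*nw == 0) ? 1.0 : w[i];  /* unit weights by default */
        sum  += wi * x[i];
        wsum += wi;
    }
    /* An empty or all-zero-weight input yields 0 rather than dividing by 0. */
    *out = (wsum > 0.0) ? sum / wsum : 0.0;
}
```

On the R side the call would pass length(w) alongside w, at the cost of
one extra argument.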

While I understand that the .Call() interface has many advantages
over .C(), .C() remains a simple and convenient interface that doesn't
require the developer to learn too much about R's internals, and, either
way, as long as the .C() interface is not being deprecated, I think that
it ought to be made as safe and as useful as possible.

                                 Best,
                                 Pavel


Re: [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).

Pavel N. Krivitsky-2
In reply to this post by Dirk Eddelbuettel
On Sat, 2012-05-26 at 14:15 -0500, Dirk Eddelbuettel wrote:
> On 26 May 2012 at 14:00, Simon Urbanek wrote:
> | [...] the real answer is use .Call() instead.
>
> Maybe Kurt could add something to that effect to the R FAQ?

Since it looks like the 0-length -> invalid pointer behavior is here to
stay, I want to second this request. It took me a long time to
track the mysterious behavior of my package down to this issue, and I think
it would be helpful for developers in the future to have some documented
behavior, even if the documentation said that the behavior was
undefined.

                      Thanks,
                      Pavel Krivitsky
