
Segfault when mistakenly calling [.data.frame (PR#13487)


Segfault when mistakenly calling [.data.frame (PR#13487)

Quigi
Full_Name: Christian Brechbuehler
Version: 2.7.2, 2.8.1
OS: linux-gnu
Submission from: (NULL) (24.128.51.18)


Calling [.data.frame on an object that's not a data frame, specifically 1:10,
causes a segmentation fault.

Context
=======
We can subscript with a number of different notations:

   > (1:10)[3]
   [1] 3
   > do.call(get("[",pos="package:base"),list(1:10,3))
   [1] 3
   > do.call(get("[.numeric_version",pos="package:base"),list(1:10,3))
   [1] 3

Problem
=======
If we mistakenly believe the object is a data frame (as we did in a much more
complicated real situation), this happens:

   > do.call(get("[.data.frame",pos="package:base"),list(1:10,3))
   Error in NextMethod("[") :
     no calling generic was found: was a method called directly?

    *** caught segfault ***
   address (nil), cause 'unknown'

   Process R:2 segmentation fault (core dumped) at Thu Jan 29 09:26:29 2009

The Error message is appropriate.  But the segmentation fault is unexpected.


Versions
========
I reproduced the problem on R 2.7.2 and 2.8.1.  Details:

> version
               _                                          
platform       x86_64-unknown-linux-gnu                  
arch           x86_64                                    
os             linux-gnu                                  
system         x86_64, linux-gnu                          
status         Patched                                    
major          2                                          
minor          7.2                                        
year           2008                                      
month          09                                        
day            20                                        
svn rev        46776                                      
language       R                                          
version.string R version 2.7.2 Patched (2008-09-20 r46776)

==========================================================

> version
               _                                          
platform       x86_64-unknown-linux-gnu                  
arch           x86_64                                    
os             linux-gnu                                  
system         x86_64, linux-gnu                          
status         Patched                                    
major          2                                          
minor          8.1                                        
year           2009                                      
month          01                                        
day            26                                        
svn rev        47743                                      
language       R                                          
version.string R version 2.8.1 Patched (2009-01-26 r47743)

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: (PR#13487) Segfault when mistakenly calling [.data.frame

Prof Brian Ripley
What did your actual application do?  This seems a very strange thing
to do, and the segfault is in trying to construct the traceback.

Only by using do.call on the object (and not even by name) do I get
this error.  E.g.

> `[.data.frame`(1:10, 3)
Error in NextMethod("[") : object not specified
> do.call("[.data.frame", list(1:10, 3))
Error in NextMethod("[") : object not specified

are fine.

Obviously it would be nice to fix this, but I'd like to understand the
actual circumstances: there is more to it than the subject line.


--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: (PR#13487) Segfault when mistakenly calling [.data.frame

Quigi
On Thu, Jan 29, 2009 at 4:44 PM, Prof Brian Ripley <[hidden email]> wrote:

> What did your actual application do?  This seems a very strange thing to
> do, and the segfault is in trying to construct the traceback.
>
> Only by using do.call on the object (and not even by name) do I get this
> error.  E.g.
>
> > `[.data.frame`(1:10, 3)
> Error in NextMethod("[") : object not specified
> > do.call("[.data.frame", list(1:10, 3))
> Error in NextMethod("[") : object not specified
>
> are fine.
>
> Obviously it would be nice to fix this, but I'd like to understand the
> actual circumstances: there is more to it than the subject line.


Yes, there is more.  For reporting the problem, we tried to pare it down to
a concise, self-contained test case.

My boss was debugging an issue in our R code.  We have our own "[...."
functions, because stock R drops names when subscripting.  To bypass our
now-suspect functions and get the "real" subscripting method, he used "get"
from package:base.  He was examining a large object, and believing it was a
data frame, chose "[.data.frame".  As it turns out, that object was not a
data frame, and he got an unexpected segfault.  I think it was a matrix.
But it doesn't matter -- a vector as in the test case will give the same.
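
A minimal sketch of one way to bypass user-defined "[" methods without
naming a class-specific method at all: base R's .subset() indexes like "["
but performs no method dispatch.  The class "myvec" and its method below are
hypothetical stand-ins for the replacement functions described above, not
the actual code.

```r
# Hypothetical class with its own "[" method (illustration only).
x <- structure(c(a = 1, b = 2, c = 3), class = "myvec")
`[.myvec` <- function(x, i) stop("custom method called")

# .subset() skips dispatch, so the custom method is never invoked.
plain <- .subset(x, 2)
print(plain)

# inherits() confirms the class before any class-specific method is chosen.
print(inherits(x, "data.frame"))
```

Checking inherits(x, "data.frame") first would have shown that
"[.data.frame" was the wrong method to fetch for this object.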

We have since fixed the bug in our replacement subscripting function, so the
issue might not affect us any more.

Thanks,
/Christian




Re: (PR#13487) Segfault when mistakenly calling [.data.frame

Simon Urbanek

On Jan 30, 2009, at 10:30 , Christian Brechbühler wrote:

> My boss was debugging an issue in our R code.  We have our own "[...."
> functions, because stock R drops names when subscripting.

... if you tell it to do so, yes. If you tell it to not do that, it  
won't ... ever tried drop=FALSE ?

Cheers,
S




Re: (PR#13487) Segfault when mistakenly calling [.data.frame

Tony Plate-3
Simon Urbanek wrote:

>
> On Jan 30, 2009, at 10:30 , Christian Brechbühler wrote:
>
>> My boss was debugging an issue in our R code.  We have our own "[...."
>> functions, because stock R drops names when subscripting.
>
> ... if you tell it to do so, yes. If you tell it to not do that, it
> won't ... ever tried drop=FALSE ?

The common situation I have (which might be the same as the OP's) is
wanting to get a vector from a data frame, and having the row names of
the data frame become the names on the vector.

With matrix, the behavior I want is the default behavior, e.g.,
 > x <- cbind(a=c(x=1,y=2,z=3),b=4:6)
 > x
  a b
x 1 4
y 2 5
z 3 6
 > x[,1]
x y z
1 2 3

But with a data frame, subscripting returns a vector with no names:
 > xd <- as.data.frame(x)
 > xd[,1]
[1] 1 2 3

One can use drop=FALSE, but then you've still got a data frame, not a
vector:
 > (xd1 <- xd[,1,drop=FALSE])
  a
x 1
y 2
z 3

The simplest way I know to get a named vector is to use as.matrix on the
one-column dataframe:
 > as.matrix(xd1)[,1]
x y z
1 2 3
 >
(Which works fine except in the case where xd1 has only one row...)
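
A possible alternative that also covers the one-row case, sketched under the
assumption that the row names should simply be copied onto the extracted
column.  The helper name named_col is hypothetical, not part of base R.

```r
x  <- cbind(a = c(x = 1, y = 2, z = 3), b = 4:6)
xd <- as.data.frame(x)

# Hypothetical helper: extract column j and attach the row names as names.
named_col <- function(df, j) stats::setNames(df[[j]], rownames(df))

print(named_col(xd, 1))

# The one-row case keeps its name as well.
print(named_col(xd[1, , drop = FALSE], "a"))
```

Using [[ avoids the detour through as.matrix(), which would also coerce
columns of mixed types to a common type.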

And BTW, am I missing something, or does the behavior of drop() not
conform to the description in ?drop:
 > Value:
 >     If 'x' is an object with a 'dim' attribute (e.g., a matrix or
 >     'array'), then 'drop' returns an object like 'x', but with any
 >     extents of length one removed.  Any accompanying 'dimnames'
 >     attribute is adjusted and returned with 'x': if the result is a
 >     vector the 'names' are taken from the 'dimnames' (if any).  If the
 >     result is a length-one vector, the names are taken from the first
 >     dimension with a dimname.

How is this last sentence consistent with the following behavior?
 > dimnames(x[1,1,drop=F])
[[1]]
[1] "x"

[[2]]
[1] "a"

 > drop(x[1,1,drop=F])
[1] 1
 >
From the description in "Value:" in ?drop, I would have expected the above
result to have the name "x" (the name from the first dimension with a
dimname).
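
Whatever drop() does in this case, the name can be taken from the row
dimnames explicitly.  A small sketch, using a 1x1 matrix built directly
rather than the x defined earlier:

```r
m <- matrix(1, nrow = 1, ncol = 1, dimnames = list("x", "a"))

# Take the name from the first dimension explicitly instead of relying
# on drop() to choose one.
named <- stats::setNames(as.vector(m), rownames(m))
print(named)
```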

 > sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    
 >

-- Tony Plate



(PR#8192) [ subscripting sometimes loses names

Andrew Piskorski
In reply to this post by Simon Urbanek
This (tangential) discussion really should be a separate thread so I
changed the subject line above.

On Fri, Jan 30, 2009 at 11:51:00AM -0500, Simon Urbanek wrote:
> Subject: Re: [Rd] (PR#13487) Segfault when mistakenly calling [.data.frame

> >My boss was debugging an issue in our R code.  We have our own "[...."
> >functions, because stock R drops names when subscripting.
>
> ... if you tell it to do so, yes. If you tell it to not do that, it  
> won't ... ever tried drop=FALSE ?

Simon, no, the drop=FALSE argument has nothing to do with what
Christian was talking about.  The kind of thing he meant is PR# 8192,
"Subject: [ subscripting sometimes loses names":

  http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192

In R, subscripting with "[" USUALLY retains names, but R has various
edge cases where it (IMNSHO) inappropriately discards them.  This
occurs with both .Primitive("[") and "[.data.frame".  This has been
known for years, but I have not yet tried digging into R's
implementation to see where and how the names are actually getting
lost.

Incidentally, versions of S-Plus since approximately S-Plus 6.0 back
in 2001 show similar buggy edge case behavior.  Older versions of
S-Plus, c. S-Plus 3.3 and earlier, had the correct, name preserving
behavior.  I presume that the original Bell Labs S had correct
name-preserving behavior, and then the S-Plus developers broke it
sometime along the way.

--
Andrew Piskorski <[hidden email]>
http://www.piskorski.com/


Re: (PR#8192) [ subscripting sometimes loses names

Duncan Murdoch
On 31/01/2009 7:31 AM, Andrew Piskorski wrote:

> This (tangential) discussion really should be a separate thread so I
> changed the subject line above.
>
> On Fri, Jan 30, 2009 at 11:51:00AM -0500, Simon Urbanek wrote:
>> Subject: Re: [Rd] (PR#13487) Segfault when mistakenly calling [.data.frame
>
>>> My boss was debugging an issue in our R code.  We have our own "[...."
>>> functions, because stock R drops names when subscripting.
>> ... if you tell it to do so, yes. If you tell it to not do that, it  
>> won't ... ever tried drop=FALSE ?
>
> Simon, no, the drop=FALSE argument has nothing to do with what
> Christian was talking about.  The kind of thing he meant is PR# 8192,
> "Subject: [ subscripting sometimes loses names":
>
>   http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192

In that bug report you were asked to provide simple examples, and you
didn't.  I imagine that's why there was no action on it.  It is not that
easy for someone else to actually find the simple example that led you
to print

      $vec.1
BAD  $vec.1[[1]]           $vec.1[[2]]
         a    c <NA>         a  c no
         1    3   NA         1  3 NA

I just tracked this one down, and can put together this simple example:

 > (1:3)["no"]
[1] NA

where I think you would want the name "no" attached to the output.  (Or
maybe your more complicated example is wanted?  You don't explain.)  But
that looks like documented behaviour to me:  according to my reading of
"Indexing by vectors" in the R Language Definition manual, it should
give the same answer as (1:3)[4], and it does.  So it's not a bug, but a
wishlist item.
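
The comparison can be sketched as follows.  The variable names are chosen
here for illustration; the named-vector case mirrors the "a c no" column of
the table above.

```r
unnamed <- 1:3
named   <- c(a = 1, b = 2, c = 3)

# An out-of-range position and an unmatched character subscript both give NA.
by_pos  <- unnamed[4]
by_name <- unnamed["no"]
print(by_pos)
print(by_name)

# On a named vector the unmatched subscript gets an NA name, not "no".
res <- named[c("a", "c", "no")]
print(names(res))
```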

And the other two cases where you list "BAD" behaviour?  I didn't track
them down.

I know you spent a lot of time putting together that bug report; it
seems a shame that it is being ignored because you put in too much:  you
really should simplify it as you were asked to do.

Duncan Murdoch




Re: (PR#8192) [ subscripting sometimes loses names

Peter Dalgaard
Duncan Murdoch wrote:

> On 31/01/2009 7:31 AM, Andrew Piskorski wrote:
>> This (tangential) discussion really should be a separate thread so I
>> changed the subject line above.
>>
>> On Fri, Jan 30, 2009 at 11:51:00AM -0500, Simon Urbanek wrote:
>>> Subject: Re: [Rd] (PR#13487) Segfault when mistakenly calling
>>> [.data.frame
>>
>>>> My boss was debugging an issue in our R code.  We have our own "[...."
>>>> functions, because stock R drops names when subscripting.
>>> ... if you tell it to do so, yes. If you tell it to not do that, it  
>>> won't ... ever tried drop=FALSE ?
>>
>> Simon, no, the drop=FALSE argument has nothing to do with what
>> Christian was talking about.  The kind of thing he meant is PR# 8192,
>> "Subject: [ subscripting sometimes loses names":
>>
>>   http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>
> In that bug report you were asked to provide simple examples, and you
> didn't.  I imagine that's why there was no action on it.  It is not that
> easy for someone else to actually find the simple example that led you
> to print
>
>      $vec.1
> BAD  $vec.1[[1]]           $vec.1[[2]]
>         a    c <NA>         a  c no
>         1    3   NA         1  3 NA
>
> I just tracked this one down, and can put together this simple example:
>
>  > (1:3)["no"]
> [1] NA
>
> where I think you would want the name "no" attached to the output.  (Or
> maybe your more complicated example is wanted?  You don't explain.)  But
> that looks like documented behaviour to me:  according to my reading of
> "Indexing by vectors" in the R Language Definition manual, it should
> give the same answer as (1:3)[4], and it does.  So it's not a bug, but a
> wishlist item.
>
> And the other two cases where you list "BAD" behaviour?  I didn't track
> them down.

I did, and they boil down to variations of

 > data.frame(val=1:3,row.names=letters[1:3])[,1]
[1] 1 2 3

but it's not obvious that the result should be named using the row.names
and (in particular) whether or why it should differ from .....[[1]] and
....$val. Given that for most purposes, extracting the relevant names
would just be unnecessary red tape, I'd say that we can do without it.
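
A short sketch of the three extraction forms side by side, to show that they
currently agree with one another:

```r
d <- data.frame(val = 1:3, row.names = letters[1:3])

by_matrix_style <- d[, 1]
by_list_style   <- d[[1]]
by_dollar       <- d$val

# All three return the bare column, without the row names attached.
print(identical(by_matrix_style, by_list_style))
print(identical(by_list_style, by_dollar))
print(names(by_matrix_style))
```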



--
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])              FAX: (+45) 35327907


Re: (PR#8192) [ subscripting sometimes loses names

Quigi
On Sat, Jan 31, 2009 at 10:13 AM, Peter Dalgaard <[hidden email]> wrote:

> Duncan Murdoch wrote:
>
>> On 31/01/2009 7:31 AM, Andrew Piskorski wrote:
>>
>>> On Fri, Jan 30, 2009 at 11:51:00AM -0500, Simon Urbanek wrote:
>>>
>>>> Subject: Re: [Rd] (PR#13487) Segfault when mistakenly calling
>>>> [.data.frame
>>>>
>>>> ever tried drop=FALSE ?
>>>
>>> Simon, no, the drop=FALSE argument has nothing to do with what
>>> Christian was talking about.  The kind of thing he meant is PR# 8192,
>>> "Subject: [ subscripting sometimes loses names":
>>>
>>>  http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>>
>> In that bug report you were asked to provide simple examples, and you
>> didn't.
>> ...
>> I just tracked this one down, and can put together this simple example:
>>
>>  > (1:3)["no"]
>> [1] NA
>>
>> where I think you would want the name "no" attached to the output.

No, it has nothing to do with indexing by name.  It's about preserving
existing names when subsetting.

>> And the other two cases where you list "BAD" behaviour?  I didn't track
>> them down.
>
> I did, and they boil down to variations of
>
> > data.frame(val=1:3,row.names=letters[1:3])[,1]
> [1] 1 2 3
>
> but it's not obvious that the result should be named using the row.names
> and (in particular) whether or why it should differ from .....[[1]] and
> ....$val. Given that for most purposes, extracting the relevant names would
> just be unnecessary red tape, I'd say that we can do without it.


Compare

> data.frame(val=1:3,row.names=letters[1:3])[,1]
[1] 1 2 3
> as.matrix(data.frame(val=1:3,row.names=letters[1:3]))[,1]
a b c
1 2 3

X[,1] preserves row names if X is a matrix, and loses them if X is a data
frame.  To me, this is ugly and inconsistent.

One might argue that having names and dimnames at all is "red tape", and
wastes memory and computational efficiency -- after all, Fortran arrays had
no names.  But R chose to drag along the names (sometimes), and it can be
very helpful to us humans.  Now R should do it consistently.

/Christian


Re: (PR#8192) [ subscripting sometimes loses names

Wacek Kusnierczyk
Christian Brechbühler wrote:

<snip>
>
>> > data.frame(val=1:3,row.names=letters[1:3])[,1]
>> [1] 1 2 3
>>
>> but it's not obvious that the result should be named using the row.names
>> and (in particular) whether or why it should differ from .....[[1]] and
>> ....$val.

this might be a good argument, if not that [,1] returning a vector
rather than a one-column data frame is already inconsistent (with
[,1:2], for example).  if [,1] were not dropping the data.frame class
and were returning a data frame instead, it would be obvious the result
should use row names.

data.frame(val=1:3,row.names=letters[1:3])[,1,drop=FALSE]

will keep the class and row names, though ?'[' says "drop: For matrices
and arrays.".

it doesn't mean that dropping row names (or dropping dimensions) isn't
useful and handy in specific cases, but this makes it no less
inconsistent.
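
a small sketch of the three cases, with a second column (named "other" here
purely for illustration) added so that [,1:2] can be shown:

```r
d <- data.frame(val = 1:3, other = 4:6, row.names = letters[1:3])

one_col  <- d[, 1]                # drops to a plain vector
two_cols <- d[, 1:2]              # stays a data frame
kept     <- d[, 1, drop = FALSE]  # stays a data frame with its row names

print(class(one_col))
print(class(two_cols))
print(class(kept))
print(rownames(kept))
```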

>> Given that for most purposes, extracting the relevant names would
>> just be unnecessary red tape, I'd say that we can do without it.
>
> Compare
>
> > data.frame(val=1:3,row.names=letters[1:3])[,1]
> [1] 1 2 3
> > as.matrix(data.frame(val=1:3,row.names=letters[1:3]))[,1]
> a b c
> 1 2 3
>
> X[,1] preserves row names if X is a matrix, and loses them if X is a data
> frame.  To me, this is ugly and inconsistent.
>
> One might argue that having names and dimnames at all is "red tape", and
> wastes memory and computational efficiency -- after all, Fortran arrays had
> no names.  But R chose to drag along the names (sometimes), and it can be
> very helpful to us humans.  Now R should do it consistently.
>  

i support this opinion.  whether to have or not to have row names is a
design decision, and both options may be reasonably argued for and
against.  but lack of consistency is seldom any good;  r consistently
lacks consistency.

vQ


Re: (PR#8192) [ subscripting sometimes loses names

Duncan Murdoch
In reply to this post by Quigi
On 31/01/2009 3:26 PM, Christian Brechbühler wrote:

> On Sat, Jan 31, 2009 at 10:13 AM, Peter Dalgaard <[hidden email]> wrote:
>
>> Duncan Murdoch wrote:
>>
>>> On 31/01/2009 7:31 AM, Andrew Piskorski wrote:
>>>
>>>> On Fri, Jan 30, 2009 at 11:51:00AM -0500, Simon Urbanek wrote:
>>>>
>>>>> Subject: Re: [Rd] (PR#13487) Segfault when mistakenly calling
>>>>> [.data.frame
>>>>>
>>>>> ever tried drop=FALSE ?
>>>>
>>>> Simon, no, the drop=FALSE argument has nothing to do with what
>>>> Christian was talking about.  The kind of thing he meant is PR# 8192,
>>>> "Subject: [ subscripting sometimes loses names":
>>>>
>>>>  http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>>>
>>> In that bug report you were asked to provide simple examples, and you
>>> didn't.
>>> ...
>>> I just tracked this one down, and can put together this simple example:
>>>
>>>  > (1:3)["no"]
>>> [1] NA
>>>
>>> where I think you would want the name "no" attached to the output.
>
> No, it has nothing to do with indexing by name.  It's about preserving
> existing names when subsetting.

I think you misread my message.

>
>>> And the other two cases where you list "BAD" behaviour?  I didn't track
>>> them down.
>>
>> I did, and they boil down to variations of
>>
>>> data.frame(val=1:3,row.names=letters[1:3])[,1]
>> [1] 1 2 3
>>
>> but it's not obvious that the result should be named using the row.names
>> and (in particular) whether or why it should differ from .....[[1]] and
>> ....$val. Given that for most purposes, extracting the relevant names would
>> just be unnecessary red tape, I'd say that we can do without it.
>
>
> Compare
>
>> data.frame(val=1:3,row.names=letters[1:3])[,1]
> [1] 1 2 3
>> as.matrix(data.frame(val=1:3,row.names=letters[1:3]))[,1]
> a b c
> 1 2 3
>
> X[,1] preserves row names if X is a matrix, and loses them if X is a data
> frame.  To me, this is ugly and inconsistent.
>
> One might argue that having names and dimnames at all is "red tape", and
> wastes memory and computational efficiency -- after all, Fortran arrays had
> no names.  But R chose to drag along the names (sometimes), and it can be
> very helpful to us humans.  Now R should do it consistently.

In one case you're working with a matrix, and in the other, a data frame.
So perfect consistency is impossible:  matrices and data frames are not
the same.  So it's a matter of deciding how much consistency is worth
pursuing.  Now, it seems nobody thinks this is worth pursuing:  so it
won't get changed.

To get it changed, you should make the change, then investigate what
would break if the change were adopted, what would become slower, etc.
Or convince someone else to do that.  But the fact that you think it's
ugly is probably not convincing.

Duncan Murdoch


Re: (PR#8192) [ subscripting sometimes loses names

Tim Hesterberg-2
In reply to this post by Andrew Piskorski
>...
>Simon, no, the drop=FALSE argument has nothing to do with what
>Christian was talking about.  The kind of thing he meant is PR# 8192,
>"Subject: [ subscripting sometimes loses names":
>
>  http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>
>In R, subscripting with "[" USUALLY retains names, but R has various
>edge cases where it (IMNSHO) inappropriately discards them.  This
>occurs with both .Primitive("[") and "[.data.frame".  This has been
>known for years, but I have not yet tried digging into R's
>implementation to see where and how the names are actually getting
>lost.
>
>Incidentally, versions of S-Plus since approximately S-Plus 6.0 back
>in 2001 show similar buggy edge case behavior.  Older versions of
>S-Plus, c. S-Plus 3.3 and earlier, had the correct, name preserving
>behavior.  I presume that the original Bell Labs S had correct
>name-preserving behavior, and then the S-Plus developers broke it
>sometime along the way.

(Later comments on the thread pointed out the difference between
x[,1] for matrices and data frames.)

I rewrote the S-PLUS data frame code around then, to fix
various inconsistencies and improve efficiency.
This was probably my change, and I would do it again.

Note that the components of a data frame do not have names
attached to them; the row names are a separate object.
Extracting a component vector or matrix from a data frame should not
attach names to the result, because of:
* memory (attaching row names to an object can more than double the
  size of the object),
* speed
* some objects cannot take names, and attaching them could change
  the class and other behavior of an object, and
* the names are usually/often (depending on the user) meaningless,
  artifacts of an early design decision that all data frames have row names.
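
The memory point can be checked roughly with object.size(). This is a sketch only: the row names below are made up for illustration, and the exact ratio will depend on the platform and on the length of the names.

```r
n <- 1000
plain <- numeric(n)

named <- plain
names(named) <- paste0("row", seq_len(n))  # hypothetical unique row names

size_plain <- as.numeric(object.size(plain))
size_named <- as.numeric(object.size(named))

# Ratio of named to unnamed size; varies by platform and name length.
size_named / size_plain
```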

Data frames differ from matrices in two ways that matter here:
* columns in matrices are all the same kind, and are simple objects
  (numeric, etc.), whereas components of data frames can be nearly
  arbitrary objects, and
* row names get added to a data frame whether a user wants them or not,
  whereas row names on a matrix have to be specified.
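
Both differences can be seen in a few lines of base R. This is a small illustrative sketch; the objects m and df and the column name "when" are assumptions made for the example.

```r
m  <- matrix(1:6, nrow = 3)
df <- data.frame(x = 1:3)

# A matrix has row names only if someone specified them.
rownames(m)

# A data frame always reports row names, wanted or not.
rownames(df)

# A data frame component need not be a simple numeric vector.
df$when <- as.Date("2009-01-29") + 0:2

# Forcing the mixed columns into a matrix coerces them to one kind.
typeof(as.matrix(df))
```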

A historical note - unique row names on data frames were a design
decision made when people worked with small data frames, and are
convenient for small data frames.  But they are a problem for large
data frames.  I was writing for all users, not just those with small
data frames and meaningful names.

I like R's 'automatic' row names.  This is a big help working with
huge data frames (and I do this often, at Google).  But this doesn't
go far enough; subscripting and other operations sometimes convert the
automatic names to real names, and check/enforce uniqueness, which is
a big waste of time when working with large data frames.  I'll comment
more on this in a new thread.
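
The conversion described here can be observed with the base helper .row_names_info(), which returns a negative count while the row names are still automatic. A minimal sketch, with df, sub and dup as illustrative names:

```r
df <- data.frame(x = 1:5)
.row_names_info(df)    # negative: row names are automatic

# Subscripting rows turns them into explicitly stored row names.
sub <- df[c(2, 4), , drop = FALSE]
.row_names_info(sub)   # positive: row names are now stored

# Repeating a row forces the uniqueness check and renaming.
dup <- df[c(1, 1, 3), , drop = FALSE]
rownames(dup)
```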

Tim Hesterberg

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

non-duplicate names in data frames

Tim Hesterberg-2
In reply to this post by Andrew Piskorski
I wrote on another thread
(with subject "[ subscripting sometimes loses names"):
>I like R's 'automatic' row names.  This is a big help working with
>huge data frames (and I do this often, at Google).  But this doesn't
>go far enough; subscripting and other operations sometimes convert the
>automatic names to real names, and check/enforce uniqueness, which is
>a big waste of time when working with large data frames.  I'll comment
>more on this in a new thread.

I propose (and have begun writing, in my copious spare time):
* an optional argument to data.frame and other data frame creation code
* resulting in an attribute added to the data.frame
* so that subscripting and other operations on the data frame
  * always keep artificial row names
  * do not have to check for unique row names in the result.

My current thoughts, comments welcome:

Argument name and component name 'dup.row.names'
0 or FALSE or NULL - current, require unique names
1 or TRUE          - duplicates allowed (when subscripting etc.)
2                  - always automatic   (when subscripting etc.)

Option "maxRowNames", default say 10^4
Any data frames with more than this have dup.row.names default to 2.

The name 'dup.row.names' is for consistency with S+; there the options
are NULL, F or T.
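
Note that 'dup.row.names' is only a proposal here, so it cannot be demonstrated. As a rough stand-in for the "always automatic" setting, row names can be reset by hand after subscripting. The sketch below uses only existing base R calls and is not the proposed mechanism; the uniqueness check still runs during the subscripting step.

```r
df  <- data.frame(x = 1:5)
sub <- df[c(1, 1, 3), , drop = FALSE]  # row names made unique here

# Discard the generated names and fall back to automatic ones.
rownames(sub) <- NULL
.row_names_info(sub)                   # negative again: automatic
```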

Tim Hesterberg

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: (PR#8192) [ subscripting sometimes loses names

Quigi
In reply to this post by Tim Hesterberg-2
Andy had written:

> >... The drop=FALSE argument has nothing to do with what
> >Christian was talking about.  The kind of thing he meant is PR# 8192,
> >"Subject: [ subscripting sometimes loses names":
> >
> >  http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>

On Sun, Feb 1, 2009 at 12:25 PM, Tim Hesterberg <[hidden email]> wrote:

> (Later comments on the thread pointed out the difference between
> x[,1] for matrices and data frames.)
>
> I rewrote the S-PLUS data frame code around then, to fix
> various inconsistencies and improve efficiency.
> This was probably my change, and I would do it again.
>
> Note that the components of a data frame do not have names
> attached to them; the row names are a separate object.
> Extracting a component vector or matrix from a data frame should not
> attach names to the result, because of:
> * memory (attaching row names to an object can more than double the
>  size of the object),
> * speed
> * some objects cannot take names, and attaching them could change
>  the class and other behavior of an object, and
> * the names are usually/often (depending on the user) meaningless,
>  artifacts of an early design decision that all data frames have row names.
>
> Data frames differ from matrices in two ways that matter here:
> * columns in matrices are all the same kind, and are simple objects
>  (numeric, etc.), whereas components of data frames can be nearly
>  arbitrary objects, and
> * row names get added to a data frame whether a user wants them or not,
>  whereas row names on a matrix have to be specified.
>
> A historical note - unique row names on data frames were a design
> decision made when people worked with small data frames, and are
> convenient for small data frames.  But they are a problem for large
> data frames.  I was writing for all users, not just those with small
> data frames and meaningful names.
>

Hi Tim,

Thank you for explaining this so carefully.  It's very valuable to hear the
rationale behind a design decision laid out in such detail.  I accept that
yours is the right solution for general use.

In our case, we deal with not too many rows, up to a few thousand, with
meaningful names.  And we mostly use data frames.  Because of our special
situation, we wrote our own "[" methods, which normally do what's right for
us.  That's why, in one debugging session, it was necessary to "get" the
overridden, stock R method from package:base.  In that case, the object
happened to be a matrix, not a data frame, and R got a segmentation fault.
And that's why I submitted the bug report that sparked this discussion.
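
One way to avoid that situation is to check the class before reaching for the stock method. The helper below is a hypothetical sketch (the name stock_df_subset is made up, and it always passes drop = FALSE); it is not what we actually ran.

```r
# Hypothetical helper: call the stock method only on real data frames.
stock_df_subset <- function(x, i, j) {
  if (!is.data.frame(x)) {
    stop("expected a data frame, got an object of class ",
         paste(class(x), collapse = "/"))
  }
  base::`[.data.frame`(x, i, j, drop = FALSE)
}

df  <- data.frame(val = 1:3, row.names = letters[1:3])
res <- stock_df_subset(df, 2, 1)  # one-row, one-column data frame

# A non-data-frame argument now fails with an ordinary R error.
bad <- tryCatch(stock_df_subset(1:10, 3, 1),
                error = function(e) conditionMessage(e))
```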

/Christian


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: (PR#8192) [ subscripting sometimes loses names

Wacek Kusnierczyk
In reply to this post by Peter Dalgaard
it's becoming an old story, but here's a bit to be added.

Peter Dalgaard wrote:

> Duncan Murdoch wrote:
>> On 31/01/2009 7:31 AM, Andrew Piskorski wrote:
>>> This (tangential) discussion really should be a separate thread so I
>>> changed the subject line above.
>>>
>>> On Fri, Jan 30, 2009 at 11:51:00AM -0500, Simon Urbanek wrote:
>>>> Subject: Re: [Rd] (PR#13487) Segfault when mistakenly calling
>>>> [.data.frame
>>>
>>>>> My boss was debugging an issue in our R code.  We have our own
>>>>> "[...."
>>>>> functions, because stock R drops names when subscripting.
>>>> ... if you tell it to do so, yes. If you tell it to not do that,
>>>> it  won't ... ever tried drop=FALSE ?
>>>
>>> Simon, no, the drop=FALSE argument has nothing to do with what
>>> Christian was talking about.  The kind of thing he meant is PR# 8192,
>>> "Subject: [ subscripting sometimes loses names":
>>>
>>>   http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>>
>> In that bug report you were asked to provide simple examples, and you
>> didn't.  I imagine that's why there was no action on it.  It is not
>> that easy for someone else to actually find the simple example that
>> led you to print
>>
>>      $vec.1
>> BAD  $vec.1[[1]]           $vec.1[[2]]
>>         a    c <NA>         a  c no
>>         1    3   NA         1  3 NA
>>
>> I just tracked this one down, and can put together this simple example:
>>
>>  > (1:3)["no"]
>> [1] NA
>>
>> where I think you would want the name "no" attached to the output.
>> (Or maybe your more complicated example is wanted?  You don't
>> explain.)  But that looks like documented behaviour to me:  according
>> to my reading of "Indexing by vectors" in the R Language Definition
>> manual, it should give the same answer as (1:3)[4], and it does.  So
>> it's not a bug, but a wishlist item.
>>
>> And the other two cases where you list "BAD" behaviour?  I didn't
>> track them down.
>
> I did, and they boil down to variations of
>
> > data.frame(val=1:3,row.names=letters[1:3])[,1]
> [1] 1 2 3
>
> but it's not obvious that the result should be named using the
> row.names and (in particular) whether or why it should differ from
> .....[[1]] and ....$val.

once you are saying that, be prepared to explain why it should *not*
differ from [[1]] and $val.  reading ?'[' carefully, you'll find:

"     The most important distinction between '[', '[[' and '$' is that
     the '[' can select more than one element whereas the other two
     select a single element."

that's actually quite enough to justify why [,1] (or rather [, indices],
with an arbitrary vector of indices) should differ from [[1]] and $val.
precisely because:

a) [[index]] and $name are *guaranteed* to return one column (or fail),
so it's reasonable to *always* drop the dimension -- because it will be
done in the case of every successful selection;

b) [, indices] *may* or *may not* return one column in a successful
selection, and now dropping the dimension (and names) depends not on the
type of the indices used (positive numeric, negative numeric, character,
whatever), but on the length of the index vector.
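
a small sketch of point b), using only base R. the helper names pick and keep are made up for illustration.

```r
df <- data.frame(val = 1:3, other = 4:6, row.names = letters[1:3])

# default: whether the result is dropped depends on length(idx)
pick <- function(idx) df[, idx]

# drop = FALSE: the result is a data frame whatever length(idx) is
keep <- function(idx) df[, idx, drop = FALSE]

class(pick(1))    # a bare vector, row names gone
class(pick(1:2))  # a data frame
class(keep(1))    # a data frame, row names kept
```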

why is external consistency of [ (being like [[ and $) when a single
index is used more important than its internal consistency (returning
the same type of data -- a data frame, or a like-dimensioned matrix --
irrespective of the length of the index vector)?

i realize that the issue of drop=FALSE vs drop=TRUE as the default has
been discussed before, but i don't find clear arguments given for the
first option, beyond that it just is so and would break much old code if
it were to be changed.  i'm actually hoping not for this to be changed, but
for users not to be blamed for assuming [,1] returns a data frame with
row names.  it's *not* their fault they are wrong.


> Given that for most purposes, extracting the relevant names would just
> be unnecessary red tape, I'd say that we can do without it.
>

would keeping the dimensions and class be just unnecessary red tape,
too?  can you know what most users' purposes are?

vQ

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel