|
Full_Name: Christian Brechbuehler
Version: 2.7.2, 2.8.1
OS: linux-gnu
Submission from: (NULL) (24.128.51.18)


Calling [.data.frame on an object that's not a data frame, specifically 1:10,
causes segmentation fault.

Context
=======
We can subscript with a number of different notations:

> (1:10)[3]
[1] 3
> do.call(get("[",pos="package:base"),list(1:10,3))
[1] 3
> do.call(get("[.numeric_version",pos="package:base"),list(1:10,3))
[1] 3

Problem
=======
If we mistakenly believe the object is a data frame (as we did in a much more
complicated real situation), this happens:

> do.call(get("[.data.frame",pos="package:base"),list(1:10,3))
Error in NextMethod("[") :
  no calling generic was found: was a method called directly?

 *** caught segfault ***
address (nil), cause 'unknown'

Process R:2 segmentation fault (core dumped) at Thu Jan 29 09:26:29 2009

The Error message is appropriate.  But the segmentation fault is unexpected.


Versions
========
I reproduced the problem on R 2.7.2 and 2.8.1.  Details:

> version
               _
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status         Patched
major          2
minor          7.2
year           2008
month          09
day            20
svn rev        46776
language       R
version.string R version 2.7.2 Patched (2008-09-20 r46776)

==========================================================

> version
               _
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status         Patched
major          2
minor          8.1
year           2009
month          01
day            26
svn rev        47743
language       R
version.string R version 2.8.1 Patched (2009-01-26 r47743)

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
|
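For anyone who needs to call a method such as [.data.frame directly, as in the report above, a guard on the object's class avoids reaching NextMethod() with the wrong kind of object. The following is only a minimal sketch: the helper name safe_df_subset is made up for illustration and is not part of base R, and the guard does nothing about the segfault itself.

```r
# Hypothetical helper (not part of base R): call the data-frame subscripting
# method directly, but only when the object really is a data frame.
safe_df_subset <- function(x, ...) {
  if (!inherits(x, "data.frame")) {
    stop("expected a data frame, got an object of class ",
         paste(class(x), collapse = "/"))
  }
  # Fetch the method from the base namespace, much as the report does
  # with get(..., pos = "package:base").
  method <- get("[.data.frame", envir = baseenv())
  method(x, ...)
}

df <- data.frame(val = 1:3, row.names = letters[1:3])

# Both a row and a column index are supplied here.
sub_df <- safe_df_subset(df, 2, 1)

# A plain vector is rejected with an ordinary R error.
res <- tryCatch(safe_df_subset(1:10, 3),
                error = function(e) conditionMessage(e))
```

With this guard, the mistaken call on 1:10 fails early with a readable message instead of reaching the method body.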
|
What did your actual application do?  This seems a very strange thing
to do, and the segfault is in trying to construct the traceback.

Only by using do.call on the object (and not even by name) do I get
this error.  E.g.

> `[.data.frame`(1:10, 3)
Error in NextMethod("[") : object not specified
> do.call("[.data.frame", list(1:10, 3))
Error in NextMethod("[") : object not specified

are fine.

Obviously it would be nice to fix this, but I'd like to understand the
actual circumstances: there is more to it than the subject line.

-- 
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
|
|
On Thu, Jan 29, 2009 at 4:44 PM, Prof Brian Ripley <[hidden email]> wrote:

> What did your actual application do?  This seems a very strange thing to
> do, and the segfault is in trying to construct the traceback.
>
> Only by using do.call on the object (and not even by name) do I get this
> error.  E.g.
>
> > `[.data.frame`(1:10, 3)
> Error in NextMethod("[") : object not specified
> > do.call("[.data.frame", list(1:10, 3))
> Error in NextMethod("[") : object not specified
>
> are fine.
>
> Obviously it would be nice to fix this, but I'd like to understand the
> actual circumstances: there is more to it than the subject line.

Yes, there is more.  For reporting the problem, we tried to pare it down to
a concise, self-contained test case.

My boss was debugging an issue in our R code.  We have our own "[...."
functions, because stock R drops names when subscripting.  To bypass our
now-suspect functions and get the "real" subscripting method, he used "get"
from package:base.  He was examining a large object, and believing it was a
data frame, chose "[.data.frame".  As it turns out, that object was not a
data frame, and he got an unexpected segfault.  I think it was a matrix.
But it doesn't matter -- a vector as in the test case will give the same.

We have since fixed the bug in our replacement subscripting function, so the
issue might not affect us any more.

Thanks,
/Christian
|
|
On Jan 30, 2009, at 10:30, Christian Brechbühler wrote:

> My boss was debugging an issue in our R code.  We have our own "[...."
> functions, because stock R drops names when subscripting.

... if you tell it to do so, yes. If you tell it to not do that, it
won't ... ever tried drop=FALSE ?

Cheers,
S
|
|
Simon Urbanek wrote:
>> My boss was debugging an issue in our R code.  We have our own "[...."
>> functions, because stock R drops names when subscripting.
>
> ... if you tell it to do so, yes. If you tell it to not do that, it
> won't ... ever tried drop=FALSE ?

wanting to get a vector from a data frame, and having the rownames of the
dataframe become the names on the vector.  With matrix, the behavior I want
is the default behavior, e.g.,

> x <- cbind(a=c(x=1,y=2,z=3),b=4:6)
> x
  a b
x 1 4
y 2 5
z 3 6
> x[,1]
x y z
1 2 3

But with a data frame, subscripting returns a vector with no names:

> xd <- as.data.frame(x)
> xd[,1]
[1] 1 2 3

One can use drop=FALSE, but then you've still got a data frame, not a vector:

> (xd1 <- xd[,1,drop=FALSE])
  a
x 1
y 2
z 3

The simplest way I know to get a named vector is to use as.matrix on the
one-column dataframe:

> as.matrix(xd1)[,1]
x y z
1 2 3
>

(Which works fine except in the case where xd1 has only one row...)

And BTW, am I missing something, or does the behavior of drop() not conform
to the description in ?drop:

> Value:
>      If 'x' is an object with a 'dim' attribute (e.g., a matrix or
>      'array'), then 'drop' returns an object like 'x', but with any
>      extents of length one removed.  Any accompanying 'dimnames'
>      attribute is adjusted and returned with 'x': if the result is a
>      vector the 'names' are taken from the 'dimnames' (if any).  If the
>      result is a length-one vector, the names are taken from the first
>      dimension with a dimname.

How is this last sentence consistent with the following behavior?

> dimnames(x[1,1,drop=F])
[[1]]
[1] "x"

[[2]]
[1] "a"

> drop(x[1,1,drop=F])
[1] 1
>

From the description in "Value:" in ?drop, I would have expected above
result to have the name "x" (the name from the first dimension with a
dimname).

> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
>

-- Tony Plate
|
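One way to get the named vector described above without going through as.matrix() is to attach the row names explicitly. This is a minimal sketch, assuming the column is an ordinary atomic vector; the function name named_column is hypothetical and not part of base R.

```r
# Hypothetical helper (not in base R): extract one column of a data frame
# as a vector whose names are the data frame's row names.
named_column <- function(df, j) {
  stopifnot(is.data.frame(df))
  col <- df[[j]]              # plain column, no names attached
  names(col) <- rownames(df)  # attach the row names explicitly
  col
}

x  <- cbind(a = c(x = 1, y = 2, z = 3), b = 4:6)
xd <- as.data.frame(x)

all_rows <- named_column(xd, "a")
one_row  <- named_column(xd[1, , drop = FALSE], "a")
```

Because the names are set from rownames() rather than recovered through a matrix, the one-row case keeps its name as well.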
|
In reply to this post by Simon Urbanek
This (tangential) discussion really should be a separate thread so I
changed the subject line above.

On Fri, Jan 30, 2009 at 11:51:00AM -0500, Simon Urbanek wrote:
> Subject: Re: [Rd] (PR#13487) Segfault when mistakenly calling [.data.frame

> >My boss was debugging an issue in our R code.  We have our own "[...."
> >functions, because stock R drops names when subscripting.
>
> ... if you tell it to do so, yes. If you tell it to not do that, it
> won't ... ever tried drop=FALSE ?

Simon, no, the drop=FALSE argument has nothing to do with what
Christian was talking about.  The kind of thing he meant is PR# 8192,
"Subject: [ subscripting sometimes loses names":

  http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192

In R, subscripting with "[" USUALLY retains names, but R has various
edge cases where it (IMNSHO) inappropriately discards them.  This
occurs with both .Primitive("[") and "[.data.frame".  This has been
known for years, but I have not yet tried digging into R's
implementation to see where and how the names are actually getting
lost.

Incidentally, versions of S-Plus since approximately S-Plus 6.0 back
in 2001 show similar buggy edge case behavior.  Older versions of
S-Plus, c. S-Plus 3.3 and earlier, had the correct, name preserving
behavior.  I presume that the original Bell Labs S had correct
name-preserving behavior, and then the S-Plus developers broke it
sometime along the way.

-- 
Andrew Piskorski <[hidden email]>
http://www.piskorski.com/
|
|
On 31/01/2009 7:31 AM, Andrew Piskorski wrote:
> This (tangential) discussion really should be a separate thread so I
> changed the subject line above.
>
> On Fri, Jan 30, 2009 at 11:51:00AM -0500, Simon Urbanek wrote:
>> Subject: Re: [Rd] (PR#13487) Segfault when mistakenly calling [.data.frame
>
>>> My boss was debugging an issue in our R code.  We have our own "[...."
>>> functions, because stock R drops names when subscripting.
>> ... if you tell it to do so, yes. If you tell it to not do that, it
>> won't ... ever tried drop=FALSE ?
>
> Simon, no, the drop=FALSE argument has nothing to do with what
> Christian was talking about.  The kind of thing he meant is PR# 8192,
> "Subject: [ subscripting sometimes loses names":
>
>   http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192

In that bug report you were asked to provide simple examples, and you
didn't.  I imagine that's why there was no action on it.  It is not that
easy for someone else to actually find the simple example that led you
to print

$vec.1
BAD $vec.1[[1]]  $vec.1[[2]]
   a    c <NA>   a    c   no
   1    3   NA   1    3   NA

I just tracked this one down, and can put together this simple example:

 > (1:3)["no"]
[1] NA

where I think you would want the name "no" attached to the output.  (Or
maybe your more complicated example is wanted?  You don't explain.)  But
that looks like documented behaviour to me:  according to my reading of
"Indexing by vectors" in the R Language Definition manual, it should
give the same answer as (1:3)[4], and it does.  So it's not a bug, but a
wishlist item.

And the other two cases where you list "BAD" behaviour?  I didn't track
them down.

I know you spent a lot of time putting together that bug report; it
seems a shame that it is being ignored because you put in too much:  you
really should simplify it as you were asked to do.

Duncan Murdoch
|
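The documented behaviour described above is easy to check at the prompt. A short sketch; the variable names are only illustrative.

```r
# A character index that matches no name behaves like an out-of-range index.
unnamed  <- 1:3
by_name  <- unnamed["no"]   # NA
by_index <- unnamed[4]      # NA

# On a vector that does have names, the unmatched element comes back as NA
# with a missing name, not with the name "no".
named <- c(a = 1, b = 2, c = 3)
miss  <- named["no"]
miss_name <- names(miss)    # a missing (NA) character string
```

So attaching "no" to the result would indeed be a change in behaviour rather than a fix to something that contradicts the manual.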
|
Duncan Murdoch wrote:
> In that bug report you were asked to provide simple examples, and you
> didn't.  I imagine that's why there was no action on it.  It is not that
> easy for someone else to actually find the simple example that led you
> to print
>
> $vec.1
> BAD $vec.1[[1]]  $vec.1[[2]]
>    a    c <NA>   a    c   no
>    1    3   NA   1    3   NA
>
> I just tracked this one down, and can put together this simple example:
>
>  > (1:3)["no"]
> [1] NA
>
> where I think you would want the name "no" attached to the output.  (Or
> maybe your more complicated example is wanted?  You don't explain.)  But
> that looks like documented behaviour to me:  according to my reading of
> "Indexing by vectors" in the R Language Definition manual, it should
> give the same answer as (1:3)[4], and it does.  So it's not a bug, but a
> wishlist item.
>
> And the other two cases where you list "BAD" behaviour?  I didn't track
> them down.

I did, and they boil down to variations of

> data.frame(val=1:3,row.names=letters[1:3])[,1]
[1] 1 2 3

but it's not obvious that the result should be named using the row.names
and (in particular) whether or why it should differ from .....[[1]] and
....$val. Given that for most purposes, extracting the relevant names
would just be unnecessary red tape, I'd say that we can do without it.

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])              FAX: (+45) 35327907
|
|
On Sat, Jan 31, 2009 at 10:13 AM, Peter Dalgaard <[hidden email]> wrote:

> Duncan Murdoch wrote:
>
>> On 31/01/2009 7:31 AM, Andrew Piskorski wrote:
>>
>>> On Fri, Jan 30, 2009 at 11:51:00AM -0500, Simon Urbanek wrote:
>>>
>>>> Subject: Re: [Rd] (PR#13487) Segfault when mistakenly calling
>>>> [.data.frame
>>>>
>>>> ever tried drop=FALSE ?
>>>
>>> Simon, no, the drop=FALSE argument has nothing to do with what
>>> Christian was talking about.  The kind of thing he meant is PR# 8192,
>>> "Subject: [ subscripting sometimes loses names":
>>>
>>>  http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>>
>> In that bug report you were asked to provide simple examples, and you
>> didn't.
>> ...
>> I just tracked this one down, and can put together this simple example:
>>
>> > (1:3)["no"]
>> [1] NA
>>
>> where I think you would want the name "no" attached to the output.

No, it has nothing to do with indexing by name.  It's about preserving
existing names when subsetting.  I think you misread my message.

>> And the other two cases where you list "BAD" behaviour?  I didn't track
>> them down.
>
> I did, and they boil down to variations of
>
> > data.frame(val=1:3,row.names=letters[1:3])[,1]
> [1] 1 2 3
>
> but it's not obvious that the result should be named using the row.names
> and (in particular) whether or why it should differ from .....[[1]] and
> ....$val. Given that for most purposes, extracting the relevant names would
> just be unnecessary red tape, I'd say that we can do without it.

Compare

> data.frame(val=1:3,row.names=letters[1:3])[,1]
[1] 1 2 3
> as.matrix(data.frame(val=1:3,row.names=letters[1:3]))[,1]
a b c
1 2 3

X[,1] preserves row names if X is a matrix, and loses them if X is a data
frame.  To me, this is ugly and inconsistent.

One might argue that having names and dimnames at all is "red tape", and
wastes memory and computational efficiency -- after all, Fortran arrays had
no names.  But R chose to drag along the names (sometimes), and it can be
very helpful to us humans.  Now R should do it consistently.

/Christian
|
|
Christian Brechbühler wrote:
<snip>
>>> data.frame(val=1:3,row.names=letters[1:3])[,1]
>> [1] 1 2 3
>>
>> but it's not obvious that the result should be named using the row.names
>> and (in particular) whether or why it should differ from .....[[1]] and
>> ....$val.

this might be a good argument, if not that [,1] returning a vector
rather than a one-column data frame is already inconsistent (with
[,1:2], for example).  if [,1] were not dropping the data.frame class and
were returning a data frame instead, it would be obvious the result
should use row names.

    data.frame(val=1:3,row.names=letters[1:3])[,1,drop=FALSE]

will keep the class and row names, though ?'[' says "drop: For matrices
and arrays.".

it doesn't mean that dropping row names (or dropping dimensions) isn't
useful and handy in specific cases, but this makes it no less
inconsistent.

>> Given that for most purposes, extracting the relevant names would
>> just be unnecessary red tape, I'd say that we can do without it.
>
> Compare
>
> > data.frame(val=1:3,row.names=letters[1:3])[,1]
> [1] 1 2 3
> > as.matrix(data.frame(val=1:3,row.names=letters[1:3]))[,1]
> a b c
> 1 2 3
>
> X[,1] preserves row names if X is a matrix, and loses them if X is a data
> frame.  To me, this is ugly and inconsistent.
>
> One might argue that having names and dimnames at all is "red tape", and
> wastes memory and computational efficiency -- after all, Fortran arrays had
> no names.  But R chose to drag along the names (sometimes), and it can be
> very helpful to us humans.  Now R should do it consistently.

i support this opinion.  whether to have or not to have row names is a
design decision, and both options may be reasonably argued for and
against.  but lack of consistency is seldom any good;  r consistently
lacks consistency.

vQ
|
|
In reply to this post by Quigi
On 31/01/2009 3:26 PM, Christian Brechbühler wrote:
> Compare
>
>> data.frame(val=1:3,row.names=letters[1:3])[,1]
> [1] 1 2 3
>> as.matrix(data.frame(val=1:3,row.names=letters[1:3]))[,1]
> a b c
> 1 2 3
>
> X[,1] preserves row names if X is a matrix, and loses them if X is a data
> frame.  To me, this is ugly and inconsistent.
>
> One might argue that having names and dimnames at all is "red tape", and
> wastes memory and computational efficiency -- after all, Fortran arrays had
> no names.  But R chose to drag along the names (sometimes), and it can be
> very helpful to us humans.  Now R should do it consistently.

In one case you're working with a matrix, and in the other, a dataframe.
So perfect consistency is impossible:  matrices and dataframes are not
the same.  So it's a matter of deciding how much consistency is worth
pursuing.

Now, it seems nobody thinks this is worth pursuing:  so it won't get
changed.  To get it changed, you should make the change, then
investigate what would break if the change were adopted, and what would
become slower, etc.  Or convince someone else to do that.  But the fact
that you think it's ugly is probably not convincing.

Duncan Murdoch
|
|
In reply to this post by Andrew Piskorski
>...
>Simon, no, the drop=FALSE argument has nothing to do with what
>Christian was talking about.  The kind of thing he meant is PR# 8192,
>"Subject: [ subscripting sometimes loses names":
>
>  http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>
>In R, subscripting with "[" USUALLY retains names, but R has various
>edge cases where it (IMNSHO) inappropriately discards them.  This
>occurs with both .Primitive("[") and "[.data.frame".  This has been
>known for years, but I have not yet tried digging into R's
>implementation to see where and how the names are actually getting
>lost.
>
>Incidentally, versions of S-Plus since approximately S-Plus 6.0 back
>in 2001 show similar buggy edge case behavior.  Older versions of
>S-Plus, c. S-Plus 3.3 and earlier, had the correct, name preserving
>behavior.  I presume that the original Bell Labs S had correct
>name-preserving behavior, and then the S-Plus developers broke it
>sometime along the way.

(Later comments on the thread pointed out the difference between
x[,1] for matrices and data frames.)

I rewrote the S-PLUS data frame code around then, to fix
various inconsistencies and improve efficiency.
This was probably my change, and I would do it again.

Note that the components of a data frame do not have names
attached to them; the row names are a separate object.
Extracting a component vector or matrix from a data frame should not
attach names to the result, because of:
* memory (attaching row names to an object can more than double the
  size of the object),
* speed
* some objects cannot take names, and attaching them could change
  the class and other behavior of an object, and
* the names are usually/often (depending on the user) meaningless,
  artifacts of an early design decision that all data frames have row names.

Data frames differ from matrices in two ways that matter here:
* columns in matrices are all the same kind, and are simple objects
  (numeric, etc.), whereas components of data frames can be nearly
  arbitrary objects, and
* row names get added to a data frame whether a user wants them or not,
  whereas row names on a matrix have to be specified.

A historical note - unique row names on data frame were a design
decision made when people worked with small data frames, and are
convenient for small data frames.  But they are a problem for large
data frames.  I was writing for all users, not just those with small
data frames and meaningful names.

I like R's 'automatic' row names.  This is a big help working with
huge data frames (and I do this often, at Google).  But this doesn't
go far enough; subscripting and other operations sometimes convert the
automatic names to real names, and check/enforce uniqueness, which is
a big waste of time when working with large data frames.  I'll comment
more on this in a new thread.

Tim Hesterberg
|
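Two of the points above can be seen in a few lines of R: names add to the size of a vector, and R marks 'automatic' row names internally. This is only a sketch, assuming a reasonably recent R. It relies on .row_names_info(), which is in base R but documented as an internal-use function, and the exact sizes reported by object.size() vary by platform and version.

```r
# Memory: the same numeric vector with and without names attached.
n <- 1e4
plain <- numeric(n)
named <- plain
names(named) <- paste0("row", seq_len(n))

size_plain <- object.size(plain)
size_named <- object.size(named)   # larger, since the names are stored too

# Automatic versus explicit row names on a data frame.
auto <- data.frame(val = 1:3)
expl <- data.frame(val = 1:3, row.names = letters[1:3])

# With its default 'type', .row_names_info() returns the number of rows,
# negated when the row names are 'automatic'.
info_auto <- .row_names_info(auto)
info_expl <- .row_names_info(expl)
```

Comparing size_plain with size_named on a given machine gives a rough feel for the memory argument. The sign of .row_names_info() shows whether a data frame still carries automatic row names.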
|
In reply to this post by Andrew Piskorski
I wrote on another thread
(with subject "[ subscripting sometimes loses names"):

>I like R's 'automatic' row names.  This is a big help working with
>huge data frames (and I do this often, at Google).  But this doesn't
>go far enough; subscripting and other operations sometimes convert the
>automatic names to real names, and check/enforce uniqueness, which is
>a big waste of time when working with large data frames.  I'll comment
>more on this in a new thread.

I propose (and have begun writing, in my copious spare time):
* an optional argument to data.frame and other data frame creation code
* resulting in an attribute added to the data.frame
* so that subscripting and other operations on the data frame
  * always keep artificial row names
  * do not have to check for unique row names in the result.

My current thoughts, comments welcome:

Argument name and component name 'dup.row.names'
    0 or FALSE or NULL - current, require unique names
    1 or TRUE          - duplicates allowed (when subscripting etc.)
    2                  - always automatic (when subscripting etc.)
Option "maxRowNames", default say 10^4
    Any data frames with more than this have dup.row.names default to 2.

The name 'dup.row.names' is for consistency with S+;
there the options are NULL, F or T.

Tim Hesterberg
|
|
In reply to this post by Tim Hesterberg-2
Andy had written:
> >... The drop=FALSE argument has nothing to do with what
> >Christian was talking about. The kind of thing he meant is PR# 8192,
> >"Subject: [ subscripting sometimes loses names":
> >
> > http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>

On Sun, Feb 1, 2009 at 12:25 PM, Tim Hesterberg <[hidden email]> wrote:

> (Later comments on the thread pointed out the difference between
> x[,1] for matrices and data frames.)
>
> I rewrote the S-PLUS data frame code around then, to fix
> various inconsistencies and improve efficiency.
> This was probably my change, and I would do it again.
>
> Note that the components of a data frame do not have names
> attached to them; the row names are a separate object.
> Extracting a component vector or matrix from a data frame should not
> attach names to the result, because of:
> * memory (attaching row names to an object can more than double the
>   size of the object),
> * speed
> * some objects cannot take names, and attaching them could change
>   the class and other behavior of an object, and
> * the names are usually/often (depending on the user) meaningless,
>   artifacts of an early design decision that all data frames have row names.
>
> Data frames differ from matrices in two ways that matter here:
> * columns in matrices are all the same kind, and are simple objects
>   (numeric, etc.), whereas components of data frames can be nearly
>   arbitrary objects, and
> * row names get added to a data frame whether a user wants them or not,
>   whereas row names on a matrix have to be specified.
>
> A historical note - unique row names on data frames were a design
> decision made when people worked with small data frames, and are
> convenient for small data frames. But they are a problem for large
> data frames. I was writing for all users, not just those with small
> data frames and meaningful names.
>

Hi Tim,

Thank you for explaining this so carefully. It's very valuable to hear the rationale behind a design decision.
I accept that yours is the right solution for general use. In our case, we deal with not too many rows, up to a few thousand, with meaningful names. And we mostly use data frames.

Because of our special situation, we wrote our own "[" methods, which normally do what's right for us. That's why, in one debugging session, it was necessary to "get" the overridden, stock R method from package:base. In that case, the object happened to be a matrix, not a data frame, and R got a segmentation fault. And that's why I submitted the bug report that sparked this discussion.

/Christian
|
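One way to avoid the situation described above is to check the class before treating an object as a data frame. The following is a minimal sketch, not the code used in the application discussed here: the helper `subset_rows_df` is hypothetical, and it goes through normal `[` dispatch instead of calling the method function directly.

```r
# Hypothetical helper: refuse to proceed unless x really is a data frame.
subset_rows_df <- function(x, i) {
  if (!is.data.frame(x)) {
    stop("expected a data frame, got an object of class ",
         paste(class(x), collapse = "/"))
  }
  # Normal dispatch on "[", keeping the data frame shape and row names.
  x[i, , drop = FALSE]
}

m  <- matrix(1:6, nrow = 2)
df <- as.data.frame(m)

ok  <- subset_rows_df(df, 1)     # a one-row data frame
res <- tryCatch(subset_rows_df(m, 1),
                error = function(e) conditionMessage(e))
print(res)                       # the error message, not a crash
```

With a guard like this, a matrix passed by mistake produces an ordinary R error that can be caught, and the method is never invoked on an object of the wrong class.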
|
In reply to this post by Peter Dalgaard
it's becoming an old story, but here's a bit to be added.
Peter Dalgaard wrote:
> Duncan Murdoch wrote:
>> On 31/01/2009 7:31 AM, Andrew Piskorski wrote:
>>> This (tangential) discussion really should be a separate thread so I
>>> changed the subject line above.
>>>
>>> On Fri, Jan 30, 2009 at 11:51:00AM -0500, Simon Urbanek wrote:
>>>> Subject: Re: [Rd] (PR#13487) Segfault when mistakenly calling
>>>> [.data.frame
>>>
>>>>> My boss was debugging an issue in our R code. We have our own
>>>>> "[...." functions, because stock R drops names when subscripting.
>>>> ... if you tell it to do so, yes. If you tell it to not do that,
>>>> it won't ... ever tried drop=FALSE ?
>>>
>>> Simon, no, the drop=FALSE argument has nothing to do with what
>>> Christian was talking about. The kind of thing he meant is PR# 8192,
>>> "Subject: [ subscripting sometimes loses names":
>>>
>>> http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>>
>> In that bug report you were asked to provide simple examples, and you
>> didn't. I imagine that's why there was no action on it. It is not
>> that easy for someone else to actually find the simple example that
>> led you to print
>>
>> $vec.1
>> BAD $vec.1[[1]] $vec.1[[2]]
>> a c <NA> a c no
>> 1 3 NA 1 3 NA
>>
>> I just tracked this one down, and can put together this simple example:
>>
>> > (1:3)["no"]
>> [1] NA
>>
>> where I think you would want the name "no" attached to the output.
>> (Or maybe your more complicated example is wanted? You don't
>> explain.) But that looks like documented behaviour to me: according
>> to my reading of "Indexing by vectors" in the R Language Definition
>> manual, it should give the same answer as (1:3)[4], and it does. So
>> it's not a bug, but a wishlist item.
>>
>> And the other two cases where you list "BAD" behaviour? I didn't
>> track them down.
>
> I did, and they boil down to variations of
>
> > data.frame(val=1:3,row.names=letters[1:3])[,1]
> [1] 1 2 3
>
> but it's not obvious that the result should be named using the
> row.names and (in particular) whether or why it should differ from
> .....[[1]] and ....$val.

once you are saying that, be prepared to explain why it should *not* differ from [[1]] and $val. reading ?'[' carefully, you'll find:

" The most important distinction between '[', '[[' and '$' is that the '[' can select more than one element whereas the other two select a single element."

that's actually quite enough to justify why [,1] (or rather [, indices], with an arbitrary vector of indices) should differ from [[1]] and $val. precisely because:

a) [[index]] and $name are *guaranteed* to return one column (or fail), so it's reasonable to *always* drop the dimension -- because it will be done in the case of every successful selection;

b) [, indices] *may* or *may not* return one column in a successful selection, and now dropping the dimension (and names) depends not on the type of the indices used (positive numeric, negative numeric, character, whatever), but on the length of the index vector.

why is external consistency of [ (being like [[ and $ when a single index is used) more important than its internal consistency (returning the same type of data -- a data frame, or a like-dimensioned matrix -- irrespective of the length of the index vector)?

i realize that the issue of drop=FALSE vs drop=TRUE as the default has been discussed before, but i don't find clear arguments given for the first option, beyond that it just is so and would break much old code if it were to be changed.

i'm actually hoping not for this to be changed, but for users not to be blamed for assuming [,1] returns a data frame with row names. it's *not* their fault they are wrong.

> Given that for most purposes, extracting the relevant names would just
> be unnecessary red tape, I'd say that we can do without it.
>

would keeping the dimensions and class be just unnecessary red tape, too? can you know what most users' purposes are?

vQ
|
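The behaviour argued about in this exchange can be checked directly. The sketch below extends the `data.frame(val=1:3, row.names=letters[1:3])` example from the thread with a second column, `other`, added only for illustration; all variable names are assumptions of the example, not part of any poster's code.

```r
df <- data.frame(val = 1:3, other = 4:6, row.names = letters[1:3])

one  <- df[, 1]                  # one column: plain vector, row names dropped
two  <- df[, 1:2]                # two columns: still a data frame
kept <- df[, 1, drop = FALSE]    # one column kept as a data frame with row names

# With a single index, [, 1] agrees with [[1]] and $val ...
same_as_double <- identical(one, df[[1]])
same_as_dollar <- identical(one, df$val)

# ... so the result type of [, j] depends on the length of j.
is.data.frame(one)               # FALSE
is.data.frame(two)               # TRUE

# If a named vector is wanted, the names can be attached explicitly.
named <- setNames(df[[1]], rownames(df))
print(named)
```

Passing `drop = FALSE` gives the type-stable result asked for in point b), and `setNames()` covers the case where a named vector is actually needed.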
