Deprecating partial matching in $.data.frame

Deprecating partial matching in $.data.frame

Peter Dalgaard-2
Allowing partial matching on $-extraction has always been a source of accidents. Recently, someone who shall remain nameless tried names(mydata) <- "d^2" followed by mydata$d^2.
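
For illustration, a minimal sketch of how that accident plays out, assuming a toy data frame named mydata with made-up values. Because `$` binds more tightly than `^`, mydata$d^2 is parsed as (mydata$d)^2, and mydata$d then partially matches the column named "d^2".

```r
# Toy data frame; the values are made up for illustration.
mydata <- data.frame(x = c(1, 2, 3))
names(mydata) <- "d^2"

# Parsed as (mydata$d)^2: "d" partially matches "d^2", then the column is squared.
# suppressWarnings() hides the partial-match warning in R versions that emit one.
accident <- suppressWarnings(mydata$d^2)

# Backticks refer to the column by its full, non-syntactic name.
intended <- mydata$`d^2`

print(accident)
print(intended)
```

The first result is the square of the stored column rather than the column itself, and nothing in the syntax hints at the difference.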

As variables in a data frame are generally considered similar to variables in, say, the global environment, it seems strange that foo$bar can give you the content of foo$bartender.
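
A small sketch of the foo$bar case, assuming a hypothetical one-column data frame. It contrasts `$`, which falls back to partial matching, with `[[`, which matches names exactly by default.

```r
# Hypothetical data frame with a single column called "bartender".
foo <- data.frame(bartender = c(10, 20, 30))

# `$` falls back to partial matching: "bar" is a unique prefix of "bartender".
# suppressWarnings() keeps the sketch quiet in R versions that warn here.
via_dollar <- suppressWarnings(foo$bar)

# `[[` matches exactly by default, so the same lookup finds nothing.
via_brackets <- foo[["bar"]]

print(via_dollar)
print(is.null(via_brackets))
```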

In R-devel (i.e., *not* R-3.0.0 beta, but 3.1.0-to-be) partial matches now give a warning.

Of course, it is inevitable that lazy programmers will have been using code like

> anova(fit1)$P
[1] 0.0008866369           NA
Warning message:
In `$.data.frame`(anova(fit1), P) : Name partially matched in data frame

and will now get the warning during package checks. The warning can always be removed by spelling out the column name, as in

> anova(fit1)$`Pr(>F)`
[1] 0.0008866369           NA

or by explicitly specifying a partial match with

> anova(fit1)[["P", exact=FALSE]]
[1] 0.0008866369           NA
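
The two alternatives can be tried side by side. This is only a sketch: the model behind fit1 is not given in the post, so a fit to the built-in `cars` data stands in for it here.

```r
# Stand-in model (an assumption; the original fit1 is not shown in the post).
fit1 <- lm(dist ~ speed, data = cars)
tab <- anova(fit1)

# Full column name, in backticks because "Pr(>F)" is not a syntactic name.
p_full <- tab$`Pr(>F)`

# Explicit, intentional partial match.
p_partial <- tab[["P", exact = FALSE]]

# Default exact lookup with `[[` returns NULL rather than matching "Pr(>F)".
p_exact <- tab[["P"]]

print(identical(p_full, p_partial))
print(is.null(p_exact))
```

Either spelling documents the intent in the code itself, which the bare anova(fit1)$P does not.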


--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Deprecating partial matching in $.data.frame

hadley wickham
On Wed, Mar 20, 2013 at 7:28 AM, peter dalgaard <[hidden email]> wrote:
> Allowing partial matching on $-extraction has always been a source of accidents. Recently, someone who shall remain nameless tried names(mydata) <- "d^2" followed by mydata$d^2.
>
> As variables in a data frame are generally considered similar to variables in, say, the global environment, it seems strange that foo$bar can give you the content of foo$bartender.
>
> In R-devel (i.e., *not* R-3.0.0 beta, but 3.1.0-to-be) partial matches now give a warning.

Just for data frames, or also for lists?

I think this is a fantastic change, but I do worry a little that it is
going to generate warnings for a _lot_ of existing code.

Hadley

--
Chief Scientist, RStudio
http://had.co.nz/


Re: Deprecating partial matching in $.data.frame

William Dunlap
In reply to this post by Peter Dalgaard-2
Will you be doing the same for attribute names?

  > options(prompt=with(version, paste0(language,"-",major,".",minor,"> ")))
  R-2.15.3> x <- structure(17, AnAttr="an attribute", Abcd="a b c d")
  R-2.15.3> attr(x, "A")
  NULL
  R-2.15.3> attr(x, "An")
  [1] "an attribute"
  R-2.15.3> attr(x, "Ab")
  [1] "a b c d"
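
For comparison, attr() takes an `exact` argument that switches partial matching off for a single call. A minimal sketch reusing the object from the transcript above:

```r
x <- structure(17, AnAttr = "an attribute", Abcd = "a b c d")

# Default lookup: exact match first, then a unique partial match.
partial <- attr(x, "An")

# exact = TRUE disables partial matching for this call.
strict <- attr(x, "An", exact = TRUE)

# Full names are found either way.
full <- attr(x, "AnAttr", exact = TRUE)

print(partial)
print(is.null(strict))
print(full)
```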

How will you deal with the common idiom of using is.null(x$n)
to see if x has a component named "n"?  One would not want
a warning if x had a component called "nn".
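
A short sketch of that idiom on a hypothetical list with only an "nn" component, next to two lookups that do not rely on partial matching:

```r
# Hypothetical object that has "nn" but no "n".
x <- list(nn = 1)

# The idiom concludes that "n" is present, because x$n partially matches "nn".
idiom_says_present <- !is.null(suppressWarnings(x$n))

# Exact alternatives: test the names directly, or index with `[[`.
really_present <- "n" %in% names(x)
exact_value <- x[["n"]]

print(idiom_says_present)
print(really_present)
print(is.null(exact_value))
```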

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com



Re: Deprecating partial matching in $.data.frame

Peter Dalgaard-2
In reply to this post by hadley wickham

On Mar 20, 2013, at 16:23 , Hadley Wickham wrote:

> On Wed, Mar 20, 2013 at 7:28 AM, peter dalgaard <[hidden email]> wrote:
>> Allowing partial matching on $-extraction has always been a source of accidents. Recently, someone who shall remain nameless tried names(mydata) <- "d^2" followed by mydata$d^2.
>>
>> As variables in a data frame are generally considered similar to variables in, say, the global environment, it seems strange that foo$bar can give you the content of foo$bartender.
>>
>> In R-devel (i.e., *not* R-3.0.0 beta, but 3.1.0-to-be) partial matches now give a warning.
>
> Just for data frames, or also for lists?

Just for data frames, at least for now. For lists, there are just too many uses of chisq.test()$exp etc. (I nearly wrote t.test()$p, but that doesn't actually work!)
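
To make the two cases concrete, here is a small sketch with made-up inputs (a 2x2 table of invented counts and the built-in `sleep` data). "exp" is a unique prefix among the components of a chisq.test() result, whereas "p" is a prefix of both "parameter" and "p.value" in a t.test() result, so the lookup is ambiguous and yields NULL.

```r
# Invented 2x2 counts and a one-sample test on built-in data, for illustration only.
ct <- chisq.test(matrix(c(12, 5, 7, 9), nrow = 2))
tt <- t.test(sleep$extra)

# Unique prefix: finds the `expected` component.
expected_counts <- suppressWarnings(ct$exp)

# Ambiguous prefix: `$` gives NULL instead of picking one candidate.
ambiguous <- suppressWarnings(tt$p)

# The component names that start with "p".
print(grep("^p", names(tt), value = TRUE))
print(is.null(ambiguous))
```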

>
> I think this is a fantastic change, but I do worry a little that it is
> going to generate warnings for a _lot_ of existing code.

We'll see about that, but I expect it not to be all that bad. In general purpose code, you need to have a situation where the data frame has known column names, and the one that you want is sufficiently awkward to type.  The p-value column in anova is about the only realistic scenario that I can come up with. The ones in, e.g., summary.lm are in a matrix, not a data frame.  
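
The distinction between the two kinds of table can be checked directly. A sketch, assuming a model fitted to the built-in `cars` data: the anova table is a data frame, while the coefficient table from summary.lm is a numeric matrix, where `$` is not available at all and columns are selected by their full name.

```r
fit <- lm(dist ~ speed, data = cars)

tab <- anova(fit)             # data frame: `$` applies
coefs <- coef(summary(fit))   # numeric matrix: `$` does not apply

print(is.data.frame(tab))
print(is.matrix(coefs))

# Matrix columns are selected by their full name.
p_values <- coefs[, "Pr(>|t|)"]

# `$` on a matrix is an error, so there is nothing to partially match.
dollar_fails <- inherits(tryCatch(coefs$P, error = function(e) e), "error")
print(dollar_fails)
```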


Re: Deprecating partial matching in $.data.frame

Peter Dalgaard-2
In reply to this post by William Dunlap

On Mar 20, 2013, at 16:59 , William Dunlap wrote:

> Will you be doing the same for attribute names?

Not at this point.

>
>> options(prompt=with(version, paste0(language,"-",major,".",minor,"> ")))
>  R-2.15.3> x <- structure(17, AnAttr="an attribute", Abcd="a b c d")
>  R-2.15.3> attr(x, "A")
>  NULL
>  R-2.15.3> attr(x, "An")
>  [1] "an attribute"
>  R-2.15.3> attr(x, "Ab")
>  [1] "a b c d"
>
> How will you deal with the common idiom of using is.null(x$n)
> to see if x has a component named "n"?  One would not want
> a warning if x had a component called "nn".

Why not? If you were looking for x$n, you're getting the wrong answer.



Re: Deprecating partial matching in $.data.frame

Milan Bouchet-Valat
In reply to this post by Peter Dalgaard-2
On Wednesday, 20 March 2013 at 17:16 +0100, peter dalgaard wrote:

> On Mar 20, 2013, at 16:23 , Hadley Wickham wrote:
>
> > On Wed, Mar 20, 2013 at 7:28 AM, peter dalgaard <[hidden email]> wrote:
> >> Allowing partial matching on $-extraction has always been a source of accidents. Recently, someone who shall remain nameless tried names(mydata) <- "d^2" followed by mydata$d^2.
> >>
> >> As variables in a data frame are generally considered similar to variables in, say, the global environment, it seems strange that foo$bar can give you the content of foo$bartender.
> >>
> >> In R-devel (i.e., *not* R-3.0.0 beta, but 3.1.0-to-be) partial matches now give a warning.
> >
> > Just for data frames, or also for lists?
>
> Just for data frames, at least for now. For lists, there are just too
> many uses of chisq.test()$exp etc. (I nearly wrote t.test()$p, but
> that doesn't actually work!)
I also think this is a very good idea, but special-casing data frames is
going to create some confusion in that already complex area. Wouldn't it
make more sense to aim at fixing both lists and data frames in the same
R release?

In a first phase, R CMD check could report errors when partial matching
is detected, but normal R use would not warn: this would leave some time
for package maintainers to fix their code (I guess R CMD check could
enable the warnings as an option while running package tests if
detecting them from static code parsing is not possible). Then, in a
second phase, warnings would be enabled by default for lists and data
frames.
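
An opt-in along these lines can already be tried by hand with R's existing warnPartialMatchDollar option. The sketch below only illustrates that option on a hypothetical list; it is not the R CMD check mechanism proposed above.

```r
# Turn on warnings for partial matching with `$`, remembering the old setting.
old <- options(warnPartialMatchDollar = TRUE)

# Hypothetical result object.
res <- list(p.value = 0.03)

# Each call yields "warned" if `$` signals a warning, else the extracted value.
partial_outcome <- tryCatch(res$p.val, warning = function(w) "warned")
exact_outcome <- tryCatch(res$p.value, warning = function(w) "warned")

# Restore the previous setting.
options(old)

print(partial_outcome)
print(exact_outcome)
```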


My two cents


Re: Deprecating partial matching in $.data.frame

hadley wickham
In reply to this post by Peter Dalgaard-2
On Wed, Mar 20, 2013 at 11:26 AM, peter dalgaard <[hidden email]> wrote:
>
> On Mar 20, 2013, at 16:59 , William Dunlap wrote:
>
>> Will you be doing the same for attribute names?
>
> Not at this point.

It would be really nice to have consistent behaviour across argument
names, attributes, lists and data frames, at least for R CMD check.
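
For argument names and attributes, R has the analogous options warnPartialMatchArgs and warnPartialMatchAttr. A minimal sketch of switching them on, with a hypothetical function scale_by() that exists only to trigger argument matching:

```r
old <- options(warnPartialMatchArgs = TRUE, warnPartialMatchAttr = TRUE)

# Hypothetical function, used only to exercise argument matching.
scale_by <- function(value, times = 2) value * times

x <- structure(17, AnAttr = "an attribute")

# "val" is a partial match for the argument "value".
arg_outcome <- tryCatch(scale_by(val = 3), warning = function(w) "warned")

# "An" is a partial match for the attribute "AnAttr".
attr_outcome <- tryCatch(attr(x, "An"), warning = function(w) "warned")

# Restore the previous settings.
options(old)

print(arg_outcome)
print(attr_outcome)
```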

Hadley


Re: Deprecating partial matching in $.data.frame

Rainer M Krug-6

On 20/03/13 17:58, Hadley Wickham wrote:

> On Wed, Mar 20, 2013 at 11:26 AM, peter dalgaard <[hidden email]> wrote:
>>
>> On Mar 20, 2013, at 16:59 , William Dunlap wrote:
>>
>>> Will you be doing the same for attribute names?
>>
>> Not at this point.
>
> It would be really nice to have consistent behaviour across argument names, attributes, lists
> and data frames, at least for R CMD check.

I agree with Hadley that consistency is quite important. This is especially true for data.frames
and lists, as this concerns the data itself, and not names or attributes of the data.

I would very much like to see warnings for *all* partial matching, at least at the level of R CMD
check, so that they can be ironed out before warnings are given to the user in the next stage (as
mentioned by Milan).

I was bitten at least once by a bug, which cost me quite some time to figure out, caused by
partial completion and would very much like to see it go (or at least have the option to show
warnings if it occurs).

Cheers,

Rainer




--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys.
(Germany)

Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa

Tel :       +33 - (0)9 53 10 27 44
Cell:       +33 - (0)6 85 62 59 98
Fax :       +33 - (0)9 58 10 27 44

Fax (D):    +49 - (0)3 21 21 25 22 44

email:      [hidden email]

Skype:      RMkrug

Re: Deprecating partial matching in $.data.frame

Peter Dalgaard-2

On Mar 21, 2013, at 09:25 , Rainer M Krug wrote:

> On 20/03/13 17:58, Hadley Wickham wrote:
>> On Wed, Mar 20, 2013 at 11:26 AM, peter dalgaard <[hidden email]> wrote:
>>>
>>> On Mar 20, 2013, at 16:59 , William Dunlap wrote:
>>>
>>>> Will you be doing the same for attribute names?
>>>
>>> Not at this point.
>>
>> It would be really nice to have consistent behaviour across argument names, attributes, lists
>> and data frames, at least for R CMD check.
>
> I agree with Hadley that consistency is quite important. This is especially true for data.frames
> and lists, as this concerns the data itself, and not names or attributes of the data.

Well, maybe consistency is important, but partial matching never worked for $-extraction in environments, so the current change could be considered mainly a nudge of data frames in the direction of environments. After all, both can be thought of as collections of named objects.

General lists are a somewhat different issue. They often, formally or informally, represent classed objects with a defined set of names, typically obtained as return values from functions. Since the names are known, people will have used the expedient of abbreviating them. This can happen with data frames as well, but less commonly, since it is in general unsafe to rely on column names being uniquely defined by any particular prefix.

I.e., deprecating partial matching for lists opens a rather larger can of worms, and might require more extensive code revisions. Also, the performance hit of a runtime check for partial matching might be more important for lists than it is for data frames. It could be worth it to implement an R CMD check warning as you suggest, but perhaps not just now.

-Peter


Re: Deprecating partial matching in $.data.frame

Hervé Pagès
Hi,

Maybe a compromise would be to just issue a warning without
deprecating? That way people who want to do anova(fit1)$P can
still do it. When working interactively, it's certainly convenient
(serious code however should probably stay away from partial matching).

And so you keep the semantics consistent with lists because yes,
consistency is important. data.frame inherits from list so any
operation that works on a list is expected to work on a data.frame,
preferably the same way (otherwise it will always be a BIG surprise
to the user/programmer). For example if I have to maintain someone
else's code and see something like:

     bar <- x$bar

and I know that 'x' is a list that contains atomic vectors of the
same length, I could have some good reasons to want to use a
data.frame instead of a list. And I would assume it's safe to
modify the code by adding the following line earlier in it:

    x <- as.data.frame(x)

But with the proposed change to $.data.frame, I cannot make this
kind of assumption anymore...
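
One way to keep such a refactoring safe regardless of how `$` behaves is to look components up by exact name. The helper below, get_exact(), is hypothetical (not part of base R) and is only a sketch of the idea, with made-up data:

```r
# Hypothetical helper: exact-name lookup that fails loudly on a missing name.
get_exact <- function(x, name) {
  if (!name %in% names(x)) {
    stop("no component named '", name, "'")
  }
  x[[name]]
}

# Made-up list of equal-length atomic vectors, and its data frame counterpart.
x <- list(bartender = c(1, 2, 3), tips = c(4, 5, 6))
df <- as.data.frame(x)

# Same result for the list and for the data frame.
same <- identical(get_exact(x, "tips"), get_exact(df, "tips"))

# An abbreviated name is an error in both cases, never a silent partial match.
list_fails <- inherits(tryCatch(get_exact(x, "bar"), error = function(e) e), "error")
df_fails <- inherits(tryCatch(get_exact(df, "bar"), error = function(e) e), "error")

print(c(same, list_fails, df_fails))
```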

My two cents

H.


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [hidden email]
Phone:  (206) 667-5791
Fax:    (206) 667-1319


Re: Deprecating partial matching in $.data.frame

Peter Dalgaard-2

On Mar 22, 2013, at 05:57 , Hervé Pagès wrote:

> Hi,
>
> Maybe a compromise would be to just issue a warning without
> deprecating? That way people who want to do anova(fit1)$P can
> still do it. When working interactively, it's certainly convenient
> (serious code however should probably stay away from partial matching).

That's what it does. Issuing a warning when users do X is pretty much equivalent to deprecating X.

>
> And so you keep the semantics consistent with lists because yes,
> consistency is important. data.frame inherits from list so any
> operation that works on a list is expected to work on a data.frame,
> preferably the same way (otherwise it will always be a BIG surprise
> to the user/programmer). For example if I have to maintain someone
> else's code and see something like:
>
>    bar <- x$bar
>
> and I know that 'x' is a list that contains atomic vectors of the
> same length, I could have some good reasons to want to use a
> data.frame instead of a list. And I would assume it's safe to
> modify the code by adding the following line earlier in it:
>
>   x <- as.data.frame(x)
>
> But with the proposed change to $.data.frame, I cannot make this
> kind of assumption anymore...


No, but it's only a real problem if the component is not actually called "bar". You could make the same point for environments, but they never allowed partial matching:

> e <- as.environment(list(barbaric=666))
> e$bar
NULL
> e$barbaric
[1] 666
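
For a side-by-side view, the sketch below puts the same single component into a list, a data frame and an environment and applies the same abbreviated lookup to each. suppressWarnings() is only there to keep the output quiet in R versions that warn on partial matches.

```r
comp <- list(barbaric = 666)

# Lists: `$` partially matches.
as_list <- suppressWarnings(comp$bar)

# Data frames: `$` also partially matches, with a warning in some R versions.
as_df <- suppressWarnings(as.data.frame(comp)$bar)

# Environments: `$` never partially matches.
as_env <- as.environment(comp)$bar

print(as_list)
print(as_df)
print(is.null(as_env))
```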





Re: Deprecating partial matching in $.data.frame

Hervé Pagès
Hi,

On 03/22/2013 01:31 AM, peter dalgaard wrote:

>
> On Mar 22, 2013, at 05:57 , Hervé Pagès wrote:
>
>> Hi,
>>
>> Maybe a compromise would be to just issue a warning without
>> deprecating? That way people who want to do anova(fit1)$P can
>> still do it. When working interactively, it's certainly convenient
>> (serious code however should probably stay away from partial matching).
>
> That's what it does. Issuing a warning when users do X is pretty much equivalent to deprecating X.

For now, yes. But you won't keep it deprecated forever, right?

>
>>
>> And so you keep the semantics consistent with lists because yes,
>> consistency is important. data.frame inherits from list so any
>> operation that works on a list is expected to work on a data.frame,
>> preferably the same way (otherwise it will always be a BIG surprise
>> to the user/programmer). For example if I have to maintain someone
>> else's code and see something like:
>>
>>     bar <- x$bar
>>
>> and I know that 'x' is a list that contains atomic vectors of the
>> same length, I could have some good reasons to want to use a
>> data.frame instead of a list. And I would assume it's safe to
>> modify the code by adding the following line earlier in it:
>>
>>    x <- as.data.frame(x)
>>
>> But with the proposed change to $.data.frame, I cannot make this
>> kind of assumption anymore...
>
>
> No, but it's only a real problem if the component is not actually called "bar". You could make the same point for environments, but they never allowed partial matching:

A data.frame is a list, not an environment. It would be silly for
me as a programmer to assume that replacing environment 'x' by
data.frame 'x' won't break the code.

H.
