Duplicate column names created by base::merge() when by.x has the same name as a column in y

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

Duplicate column names created by base::merge() when by.x has the same name as a column in y

Scott Ritchie-2
Hi,

I was unable to find a bug report for this with a cursory search, but would
like clarification if this is intended or unavoidable behaviour:

```{r}
# Create example data.frames
parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
                      sex=c("F", "M", "F", "M"),
                      age=c(41, 43, 36, 51))
children <- data.frame(parent=c("Sarah", "Max", "Qin"),
                       name=c("Oliver", "Sebastian", "Kai-lee"),
                       sex=c("M", "M", "F"),
                       age=c(5,8,7))

# Merge() creates a duplicated "name" column:
merge(parents, children, by.x = "name", by.y = "parent")
```

Output:
```
   name sex.x age.x      name sex.y age.y
1   Max     M    43 Sebastian     M     8
2   Qin     F    36   Kai-lee     F     7
3 Sarah     F    41    Oliver     M     5
Warning message:
In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
  column name ‘name’ is duplicated in the result
```

Kind Regards,

Scott Ritchie

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

frederik
Hi Scott,

It seems like reasonable behavior to me. What result would you expect?
That the second "name" should be called "name.y"?

The "merge" documentation says:

    If the columns in the data frames not used in merging have any
    common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by default)
    appended to try to make the names of the result unique.

Since the first "name" column was used in merging, leaving both
without a suffix seems consistent with the documentation...

Frederick

On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:

> Hi,
>
> I was unable to find a bug report for this with a cursory search, but would
> like clarification if this is intended or unavoidable behaviour:
>
> ```{r}
> # Create example data.frames
> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
>                       sex=c("F", "M", "F", "M"),
>                       age=c(41, 43, 36, 51))
> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
>                        name=c("Oliver", "Sebastian", "Kai-lee"),
>                        sex=c("M", "M", "F"),
>                        age=c(5,8,7))
>
> # Merge() creates a duplicated "name" column:
> merge(parents, children, by.x = "name", by.y = "parent")
> ```
>
> Output:
> ```
>    name sex.x age.x      name sex.y age.y
> 1   Max     M    43 Sebastian     M     8
> 2   Qin     F    36   Kai-lee     F     7
> 3 Sarah     F    41    Oliver     M     5
> Warning message:
> In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
>   column name ‘name’ is duplicated in the result
> ```
>
> Kind Regards,
>
> Scott Ritchie
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Scott Ritchie-2
Hi Frederick,

I would expect that any duplicate names in the resulting data.frame would
have the suffixes appended to them, regardless of whether or not they are
used as the join key. So in my example I would expect "names.x" and
"names.y" to indicate their source data.frame.

While careful reading of the documentation reveals this is not the case, I
would argue the intent of the suffixes functionality should equally be
applied to this type of case.

If you agree this would be useful, I'm happy to write a patch for
merge.data.frame that will add suffixes in this case - I intend to do the
same for merge.data.table in the data.table package where I initially
encountered the edge case.

Best,

Scott

On 17 February 2018 at 03:53, <[hidden email]> wrote:

> Hi Scott,
>
> It seems like reasonable behavior to me. What result would you expect?
> That the second "name" should be called "name.y"?
>
> The "merge" documentation says:
>
>     If the columns in the data frames not used in merging have any
>     common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by default)
>     appended to try to make the names of the result unique.
>
> Since the first "name" column was used in merging, leaving both
> without a suffix seems consistent with the documentation...
>
> Frederick
>
> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
> > Hi,
> >
> > I was unable to find a bug report for this with a cursory search, but
> would
> > like clarification if this is intended or unavoidable behaviour:
> >
> > ```{r}
> > # Create example data.frames
> > parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> >                       sex=c("F", "M", "F", "M"),
> >                       age=c(41, 43, 36, 51))
> > children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> >                        name=c("Oliver", "Sebastian", "Kai-lee"),
> >                        sex=c("M", "M", "F"),
> >                        age=c(5,8,7))
> >
> > # Merge() creates a duplicated "name" column:
> > merge(parents, children, by.x = "name", by.y = "parent")
> > ```
> >
> > Output:
> > ```
> >    name sex.x age.x      name sex.y age.y
> > 1   Max     M    43 Sebastian     M     8
> > 2   Qin     F    36   Kai-lee     F     7
> > 3 Sarah     F    41    Oliver     M     5
> > Warning message:
> > In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
> >   column name ‘name’ is duplicated in the result
> > ```
> >
> > Kind Regards,
> >
> > Scott Ritchie
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > [hidden email] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Scott Ritchie-2
The attached patch.diff will make merge.data.frame() append the suffixes to
columns with common names between by.x and names(y).

Best,

Scott Ritchie

On 17 February 2018 at 11:15, Scott Ritchie <[hidden email]> wrote:

> Hi Frederick,
>
> I would expect that any duplicate names in the resulting data.frame would
> have the suffixes appended to them, regardless of whether or not they are
> used as the join key. So in my example I would expect "names.x" and
> "names.y" to indicate their source data.frame.
>
> While careful reading of the documentation reveals this is not the case, I
> would argue the intent of the suffixes functionality should equally be
> applied to this type of case.
>
> If you agree this would be useful, I'm happy to write a patch for
> merge.data.frame that will add suffixes in this case - I intend to do the
> same for merge.data.table in the data.table package where I initially
> encountered the edge case.
>
> Best,
>
> Scott
>
> On 17 February 2018 at 03:53, <[hidden email]> wrote:
>
>> Hi Scott,
>>
>> It seems like reasonable behavior to me. What result would you expect?
>> That the second "name" should be called "name.y"?
>>
>> The "merge" documentation says:
>>
>>     If the columns in the data frames not used in merging have any
>>     common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by default)
>>     appended to try to make the names of the result unique.
>>
>> Since the first "name" column was used in merging, leaving both
>> without a suffix seems consistent with the documentation...
>>
>> Frederick
>>
>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
>> > Hi,
>> >
>> > I was unable to find a bug report for this with a cursory search, but
>> would
>> > like clarification if this is intended or unavoidable behaviour:
>> >
>> > ```{r}
>> > # Create example data.frames
>> > parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
>> >                       sex=c("F", "M", "F", "M"),
>> >                       age=c(41, 43, 36, 51))
>> > children <- data.frame(parent=c("Sarah", "Max", "Qin"),
>> >                        name=c("Oliver", "Sebastian", "Kai-lee"),
>> >                        sex=c("M", "M", "F"),
>> >                        age=c(5,8,7))
>> >
>> > # Merge() creates a duplicated "name" column:
>> > merge(parents, children, by.x = "name", by.y = "parent")
>> > ```
>> >
>> > Output:
>> > ```
>> >    name sex.x age.x      name sex.y age.y
>> > 1   Max     M    43 Sebastian     M     8
>> > 2   Qin     F    36   Kai-lee     F     7
>> > 3 Sarah     F    41    Oliver     M     5
>> > Warning message:
>> > In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
>> >   column name ‘name’ is duplicated in the result
>> > ```
>> >
>> > Kind Regards,
>> >
>> > Scott Ritchie
>> >
>> >       [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > [hidden email] mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>> >
>>
>
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

patch.diff (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

frederik
Hi Scott,

Thanks for the patch. I'm not really involved in R development; it
will be up to someone in the R core team to apply it. I would hazard
to say that even if correct (I haven't checked), it will not be
applied because the change might break existing code. For example it
seems like reasonable code might easily assume that a column with the
same name as "by.x" exists in the output of 'merge'. That's just my
best guess... I don't participate on here often.

Cheers,

Frederick

On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:

> The attached patch.diff will make merge.data.frame() append the suffixes to
> columns with common names between by.x and names(y).
>
> Best,
>
> Scott Ritchie
>
> On 17 February 2018 at 11:15, Scott Ritchie <[hidden email]> wrote:
>
> > Hi Frederick,
> >
> > I would expect that any duplicate names in the resulting data.frame would
> > have the suffixes appended to them, regardless of whether or not they are
> > used as the join key. So in my example I would expect "names.x" and
> > "names.y" to indicate their source data.frame.
> >
> > While careful reading of the documentation reveals this is not the case, I
> > would argue the intent of the suffixes functionality should equally be
> > applied to this type of case.
> >
> > If you agree this would be useful, I'm happy to write a patch for
> > merge.data.frame that will add suffixes in this case - I intend to do the
> > same for merge.data.table in the data.table package where I initially
> > encountered the edge case.
> >
> > Best,
> >
> > Scott
> >
> > On 17 February 2018 at 03:53, <[hidden email]> wrote:
> >
> >> Hi Scott,
> >>
> >> It seems like reasonable behavior to me. What result would you expect?
> >> That the second "name" should be called "name.y"?
> >>
> >> The "merge" documentation says:
> >>
> >>     If the columns in the data frames not used in merging have any
> >>     common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by default)
> >>     appended to try to make the names of the result unique.
> >>
> >> Since the first "name" column was used in merging, leaving both
> >> without a suffix seems consistent with the documentation...
> >>
> >> Frederick
> >>
> >> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
> >> > Hi,
> >> >
> >> > I was unable to find a bug report for this with a cursory search, but
> >> would
> >> > like clarification if this is intended or unavoidable behaviour:
> >> >
> >> > ```{r}
> >> > # Create example data.frames
> >> > parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> >> >                       sex=c("F", "M", "F", "M"),
> >> >                       age=c(41, 43, 36, 51))
> >> > children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> >> >                        name=c("Oliver", "Sebastian", "Kai-lee"),
> >> >                        sex=c("M", "M", "F"),
> >> >                        age=c(5,8,7))
> >> >
> >> > # Merge() creates a duplicated "name" column:
> >> > merge(parents, children, by.x = "name", by.y = "parent")
> >> > ```
> >> >
> >> > Output:
> >> > ```
> >> >    name sex.x age.x      name sex.y age.y
> >> > 1   Max     M    43 Sebastian     M     8
> >> > 2   Qin     F    36   Kai-lee     F     7
> >> > 3 Sarah     F    41    Oliver     M     5
> >> > Warning message:
> >> > In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
> >> >   column name ‘name’ is duplicated in the result
> >> > ```
> >> >
> >> > Kind Regards,
> >> >
> >> > Scott Ritchie
> >> >
> >> >       [[alternative HTML version deleted]]
> >> >
> >> > ______________________________________________
> >> > [hidden email] mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >> >
> >>
> >
> >

> Index: src/library/base/R/merge.R
> ===================================================================
> --- src/library/base/R/merge.R (revision 74264)
> +++ src/library/base/R/merge.R (working copy)
> @@ -157,6 +157,15 @@
>          }
>  
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to these
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[1L]))
> +            names(x)[match(dupe.keyx, names(x), 0L)] <- paste(dupe.keyx, suffixes[1L], sep="")
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Duncan Murdoch-2
On 17/02/2018 6:36 PM, [hidden email] wrote:
> Hi Scott,
>
> Thanks for the patch. I'm not really involved in R development; it
> will be up to someone in the R core team to apply it. I would hazard
> to say that even if correct (I haven't checked), it will not be
> applied because the change might break existing code. For example it
> seems like reasonable code might easily assume that a column with the
> same name as "by.x" exists in the output of 'merge'. That's just my
> best guess... I don't participate on here often.


I think you're right.  If I were still a member of R Core, I would want
to test this against all packages on CRAN and Bioconductor, and since
that test takes a couple of days to run on my laptop, I'd probably never
get around to it.

There are lots of cases where "I would have done that differently", but
most of them are far too much trouble to change now that R is more than
20 years old.  And in many cases it will turn out that the way R does it
actually does make more sense than the way I would have done it.

Duncan Murdoch

>
> Cheers,
>
> Frederick
>
> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
>> The attached patch.diff will make merge.data.frame() append the suffixes to
>> columns with common names between by.x and names(y).
>>
>> Best,
>>
>> Scott Ritchie
>>
>> On 17 February 2018 at 11:15, Scott Ritchie <[hidden email]> wrote:
>>
>>> Hi Frederick,
>>>
>>> I would expect that any duplicate names in the resulting data.frame would
>>> have the suffixes appended to them, regardless of whether or not they are
>>> used as the join key. So in my example I would expect "names.x" and
>>> "names.y" to indicate their source data.frame.
>>>
>>> While careful reading of the documentation reveals this is not the case, I
>>> would argue the intent of the suffixes functionality should equally be
>>> applied to this type of case.
>>>
>>> If you agree this would be useful, I'm happy to write a patch for
>>> merge.data.frame that will add suffixes in this case - I intend to do the
>>> same for merge.data.table in the data.table package where I initially
>>> encountered the edge case.
>>>
>>> Best,
>>>
>>> Scott
>>>
>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
>>>
>>>> Hi Scott,
>>>>
>>>> It seems like reasonable behavior to me. What result would you expect?
>>>> That the second "name" should be called "name.y"?
>>>>
>>>> The "merge" documentation says:
>>>>
>>>>      If the columns in the data frames not used in merging have any
>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by default)
>>>>      appended to try to make the names of the result unique.
>>>>
>>>> Since the first "name" column was used in merging, leaving both
>>>> without a suffix seems consistent with the documentation...
>>>>
>>>> Frederick
>>>>
>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
>>>>> Hi,
>>>>>
>>>>> I was unable to find a bug report for this with a cursory search, but
>>>> would
>>>>> like clarification if this is intended or unavoidable behaviour:
>>>>>
>>>>> ```{r}
>>>>> # Create example data.frames
>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
>>>>>                        sex=c("F", "M", "F", "M"),
>>>>>                        age=c(41, 43, 36, 51))
>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
>>>>>                         name=c("Oliver", "Sebastian", "Kai-lee"),
>>>>>                         sex=c("M", "M", "F"),
>>>>>                         age=c(5,8,7))
>>>>>
>>>>> # Merge() creates a duplicated "name" column:
>>>>> merge(parents, children, by.x = "name", by.y = "parent")
>>>>> ```
>>>>>
>>>>> Output:
>>>>> ```
>>>>>     name sex.x age.x      name sex.y age.y
>>>>> 1   Max     M    43 Sebastian     M     8
>>>>> 2   Qin     F    36   Kai-lee     F     7
>>>>> 3 Sarah     F    41    Oliver     M     5
>>>>> Warning message:
>>>>> In merge.data.frame(parents, children, by.x = "name", by.y = "parent") :
>>>>>    column name ‘name’ is duplicated in the result
>>>>> ```
>>>>>
>>>>> Kind Regards,
>>>>>
>>>>> Scott Ritchie
>>>>>
>>>>>        [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> [hidden email] mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>
>>>>
>>>
>>>
>
>> Index: src/library/base/R/merge.R
>> ===================================================================
>> --- src/library/base/R/merge.R (revision 74264)
>> +++ src/library/base/R/merge.R (working copy)
>> @@ -157,6 +157,15 @@
>>           }
>>  
>>           if(has.common.nms) names(y) <- nm.y
>> +        ## If by.x %in% names(y) then duplicate column names still arise,
>> +        ## apply suffixes to these
>> +        dupe.keyx <- intersect(nm.by, names(y))
>> +        if(length(dupe.keyx)) {
>> +          if(nzchar(suffixes[1L]))
>> +            names(x)[match(dupe.keyx, names(x), 0L)] <- paste(dupe.keyx, suffixes[1L], sep="")
>> +          if(nzchar(suffixes[2L]))
>> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
>> +        }
>>           nm <- c(names(x), names(y))
>>           if(any(d <- duplicated(nm)))
>>               if(sum(d) > 1L)
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Scott Ritchie-2
Thanks Duncan and Frederick,

I suspected as much - there doesn't appear to be any reason why conflicts
between by.x and names(y) shouldn't and cannot be checked, but I can see
how this might be more trouble than its worth given it potentially may
break downstream packages (i.e. any cases where this occurs but they expect
the name of the key column(s) to remain the same).

Best,

Scott

On 18 February 2018 at 11:48, Duncan Murdoch <[hidden email]>
wrote:

> On 17/02/2018 6:36 PM, [hidden email] wrote:
>
>> Hi Scott,
>>
>> Thanks for the patch. I'm not really involved in R development; it
>> will be up to someone in the R core team to apply it. I would hazard
>> to say that even if correct (I haven't checked), it will not be
>> applied because the change might break existing code. For example it
>> seems like reasonable code might easily assume that a column with the
>> same name as "by.x" exists in the output of 'merge'. That's just my
>> best guess... I don't participate on here often.
>>
>
>
> I think you're right.  If I were still a member of R Core, I would want to
> test this against all packages on CRAN and Bioconductor, and since that
> test takes a couple of days to run on my laptop, I'd probably never get
> around to it.
>
> There are lots of cases where "I would have done that differently", but
> most of them are far too much trouble to change now that R is more than 20
> years old.  And in many cases it will turn out that the way R does it
> actually does make more sense than the way I would have done it.
>
> Duncan Murdoch
>
>
>
>> Cheers,
>>
>> Frederick
>>
>> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
>>
>>> The attached patch.diff will make merge.data.frame() append the suffixes
>>> to
>>> columns with common names between by.x and names(y).
>>>
>>> Best,
>>>
>>> Scott Ritchie
>>>
>>> On 17 February 2018 at 11:15, Scott Ritchie <[hidden email]>
>>> wrote:
>>>
>>> Hi Frederick,
>>>>
>>>> I would expect that any duplicate names in the resulting data.frame
>>>> would
>>>> have the suffixes appended to them, regardless of whether or not they
>>>> are
>>>> used as the join key. So in my example I would expect "names.x" and
>>>> "names.y" to indicate their source data.frame.
>>>>
>>>> While careful reading of the documentation reveals this is not the
>>>> case, I
>>>> would argue the intent of the suffixes functionality should equally be
>>>> applied to this type of case.
>>>>
>>>> If you agree this would be useful, I'm happy to write a patch for
>>>> merge.data.frame that will add suffixes in this case - I intend to do
>>>> the
>>>> same for merge.data.table in the data.table package where I initially
>>>> encountered the edge case.
>>>>
>>>> Best,
>>>>
>>>> Scott
>>>>
>>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
>>>>
>>>> Hi Scott,
>>>>>
>>>>> It seems like reasonable behavior to me. What result would you expect?
>>>>> That the second "name" should be called "name.y"?
>>>>>
>>>>> The "merge" documentation says:
>>>>>
>>>>>      If the columns in the data frames not used in merging have any
>>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by default)
>>>>>      appended to try to make the names of the result unique.
>>>>>
>>>>> Since the first "name" column was used in merging, leaving both
>>>>> without a suffix seems consistent with the documentation...
>>>>>
>>>>> Frederick
>>>>>
>>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I was unable to find a bug report for this with a cursory search, but
>>>>>>
>>>>> would
>>>>>
>>>>>> like clarification if this is intended or unavoidable behaviour:
>>>>>>
>>>>>> ```{r}
>>>>>> # Create example data.frames
>>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
>>>>>>                        sex=c("F", "M", "F", "M"),
>>>>>>                        age=c(41, 43, 36, 51))
>>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
>>>>>>                         name=c("Oliver", "Sebastian", "Kai-lee"),
>>>>>>                         sex=c("M", "M", "F"),
>>>>>>                         age=c(5,8,7))
>>>>>>
>>>>>> # Merge() creates a duplicated "name" column:
>>>>>> merge(parents, children, by.x = "name", by.y = "parent")
>>>>>> ```
>>>>>>
>>>>>> Output:
>>>>>> ```
>>>>>>     name sex.x age.x      name sex.y age.y
>>>>>> 1   Max     M    43 Sebastian     M     8
>>>>>> 2   Qin     F    36   Kai-lee     F     7
>>>>>> 3 Sarah     F    41    Oliver     M     5
>>>>>> Warning message:
>>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
>>>>>> "parent") :
>>>>>>    column name ‘name’ is duplicated in the result
>>>>>> ```
>>>>>>
>>>>>> Kind Regards,
>>>>>>
>>>>>> Scott Ritchie
>>>>>>
>>>>>>        [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> [hidden email] mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>> Index: src/library/base/R/merge.R
>>> ===================================================================
>>> --- src/library/base/R/merge.R  (revision 74264)
>>> +++ src/library/base/R/merge.R  (working copy)
>>> @@ -157,6 +157,15 @@
>>>           }
>>>             if(has.common.nms) names(y) <- nm.y
>>> +        ## If by.x %in% names(y) then duplicate column names still
>>> arise,
>>> +        ## apply suffixes to these
>>> +        dupe.keyx <- intersect(nm.by, names(y))
>>> +        if(length(dupe.keyx)) {
>>> +          if(nzchar(suffixes[1L]))
>>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
>>> paste(dupe.keyx, suffixes[1L], sep="")
>>> +          if(nzchar(suffixes[2L]))
>>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
>>> paste(dupe.keyx, suffixes[2L], sep="")
>>> +        }
>>>           nm <- c(names(x), names(y))
>>>           if(any(d <- duplicated(nm)))
>>>               if(sum(d) > 1L)
>>>
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Gabriel Becker
It seems like there is a way that is backwards compatible-ish in the sense
mentioned and still has the (arguably, but a good argument I think) better
behavior:

if by.x is 'name', (AND by.y is not also 'name'), then x's 'name' column is
called name and y's 'name' column (not used int he merge) is changed to
name.y.

Now of course this would still change output, but it would change it to
something I think would be better, while retaining the 'merge columns
retain their exact names' mechanic as documented.

~G

On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <[hidden email]>
wrote:

> Thanks Duncan and Frederick,
>
> I suspected as much - there doesn't appear to be any reason why conflicts
> between by.x and names(y) shouldn't and cannot be checked, but I can see
> how this might be more trouble than its worth given it potentially may
> break downstream packages (i.e. any cases where this occurs but they expect
> the name of the key column(s) to remain the same).
>
> Best,
>
> Scott
>
> On 18 February 2018 at 11:48, Duncan Murdoch <[hidden email]>
> wrote:
>
> > On 17/02/2018 6:36 PM, [hidden email] wrote:
> >
> >> Hi Scott,
> >>
> >> Thanks for the patch. I'm not really involved in R development; it
> >> will be up to someone in the R core team to apply it. I would hazard
> >> to say that even if correct (I haven't checked), it will not be
> >> applied because the change might break existing code. For example it
> >> seems like reasonable code might easily assume that a column with the
> >> same name as "by.x" exists in the output of 'merge'. That's just my
> >> best guess... I don't participate on here often.
> >>
> >
> >
> > I think you're right.  If I were still a member of R Core, I would want
> to
> > test this against all packages on CRAN and Bioconductor, and since that
> > test takes a couple of days to run on my laptop, I'd probably never get
> > around to it.
> >
> > There are lots of cases where "I would have done that differently", but
> > most of them are far too much trouble to change now that R is more than
> 20
> > years old.  And in many cases it will turn out that the way R does it
> > actually does make more sense than the way I would have done it.
> >
> > Duncan Murdoch
> >
> >
> >
> >> Cheers,
> >>
> >> Frederick
> >>
> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
> >>
> >>> The attached patch.diff will make merge.data.frame() append the
> suffixes
> >>> to
> >>> columns with common names between by.x and names(y).
> >>>
> >>> Best,
> >>>
> >>> Scott Ritchie
> >>>
> >>> On 17 February 2018 at 11:15, Scott Ritchie <[hidden email]>
> >>> wrote:
> >>>
> >>> Hi Frederick,
> >>>>
> >>>> I would expect that any duplicate names in the resulting data.frame
> >>>> would
> >>>> have the suffixes appended to them, regardless of whether or not they
> >>>> are
> >>>> used as the join key. So in my example I would expect "names.x" and
> >>>> "names.y" to indicate their source data.frame.
> >>>>
> >>>> While careful reading of the documentation reveals this is not the
> >>>> case, I
> >>>> would argue the intent of the suffixes functionality should equally be
> >>>> applied to this type of case.
> >>>>
> >>>> If you agree this would be useful, I'm happy to write a patch for
> >>>> merge.data.frame that will add suffixes in this case - I intend to do
> >>>> the
> >>>> same for merge.data.table in the data.table package where I initially
> >>>> encountered the edge case.
> >>>>
> >>>> Best,
> >>>>
> >>>> Scott
> >>>>
> >>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
> >>>>
> >>>> Hi Scott,
> >>>>>
> >>>>> It seems like reasonable behavior to me. What result would you
> expect?
> >>>>> That the second "name" should be called "name.y"?
> >>>>>
> >>>>> The "merge" documentation says:
> >>>>>
> >>>>>      If the columns in the data frames not used in merging have any
> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by
> default)
> >>>>>      appended to try to make the names of the result unique.
> >>>>>
> >>>>> Since the first "name" column was used in merging, leaving both
> >>>>> without a suffix seems consistent with the documentation...
> >>>>>
> >>>>> Frederick
> >>>>>
> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I was unable to find a bug report for this with a cursory search,
> but
> >>>>>>
> >>>>> would
> >>>>>
> >>>>>> like clarification if this is intended or unavoidable behaviour:
> >>>>>>
> >>>>>> ```{r}
> >>>>>> # Create example data.frames
> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> >>>>>>                        sex=c("F", "M", "F", "M"),
> >>>>>>                        age=c(41, 43, 36, 51))
> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> >>>>>>                         name=c("Oliver", "Sebastian", "Kai-lee"),
> >>>>>>                         sex=c("M", "M", "F"),
> >>>>>>                         age=c(5,8,7))
> >>>>>>
> >>>>>> # Merge() creates a duplicated "name" column:
> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
> >>>>>> ```
> >>>>>>
> >>>>>> Output:
> >>>>>> ```
> >>>>>>     name sex.x age.x      name sex.y age.y
> >>>>>> 1   Max     M    43 Sebastian     M     8
> >>>>>> 2   Qin     F    36   Kai-lee     F     7
> >>>>>> 3 Sarah     F    41    Oliver     M     5
> >>>>>> Warning message:
> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
> >>>>>> "parent") :
> >>>>>>    column name ‘name’ is duplicated in the result
> >>>>>> ```
> >>>>>>
> >>>>>> Kind Regards,
> >>>>>>
> >>>>>> Scott Ritchie
> >>>>>>
> >>>>>>        [[alternative HTML version deleted]]
> >>>>>>
> >>>>>> ______________________________________________
> >>>>>> [hidden email] mailing list
> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >> Index: src/library/base/R/merge.R
> >>> ===================================================================
> >>> --- src/library/base/R/merge.R  (revision 74264)
> >>> +++ src/library/base/R/merge.R  (working copy)
> >>> @@ -157,6 +157,15 @@
> >>>           }
> >>>             if(has.common.nms) names(y) <- nm.y
> >>> +        ## If by.x %in% names(y) then duplicate column names still
> >>> arise,
> >>> +        ## apply suffixes to these
> >>> +        dupe.keyx <- intersect(nm.by, names(y))
> >>> +        if(length(dupe.keyx)) {
> >>> +          if(nzchar(suffixes[1L]))
> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> >>> paste(dupe.keyx, suffixes[1L], sep="")
> >>> +          if(nzchar(suffixes[2L]))
> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> >>> paste(dupe.keyx, suffixes[2L], sep="")
> >>> +        }
> >>>           nm <- c(names(x), names(y))
> >>>           if(any(d <- duplicated(nm)))
> >>>               if(sum(d) > 1L)
> >>>
> >>
> >> ______________________________________________
> >> [hidden email] mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >>
> >
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



--
Gabriel Becker, PhD
Scientist (Bioinformatics)
Genentech Research

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Scott Ritchie-2
Thanks Gabriel,

I think your suggested approach is 100% backwards compatible

Currently in the case of duplicate column names only the first can be
indexed by its name. This will always be the column appearing in by.x,
meaning the column in y with the same name cannot be accessed. Appending
".y" (suffixes[2L]) to this column means it can now be accessed, while
keeping the current behaviour of making the key columns always accessible
by using the names provided to by.x.

I've attached a new patch that has this behaviour.

Best,

Scott

On 19 February 2018 at 05:08, Gabriel Becker <[hidden email]> wrote:

> It seems like there is a way that is backwards compatible-ish in the sense
> mentioned and still has the (arguably, but a good argument I think) better
> behavior:
>
> if by.x is 'name', (AND by.y is not also 'name'), then x's 'name' column
> is called name and y's 'name' column (not used int he merge) is changed to
> name.y.
>
> Now of course this would still change output, but it would change it to
> something I think would be better, while retaining the 'merge columns
> retain their exact names' mechanic as documented.
>
> ~G
>
> On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <[hidden email]>
> wrote:
>
>> Thanks Duncan and Frederick,
>>
>> I suspected as much - there doesn't appear to be any reason why conflicts
>> between by.x and names(y) shouldn't and cannot be checked, but I can see
>> how this might be more trouble than its worth given it potentially may
>> break downstream packages (i.e. any cases where this occurs but they
>> expect
>> the name of the key column(s) to remain the same).
>>
>> Best,
>>
>> Scott
>>
>> On 18 February 2018 at 11:48, Duncan Murdoch <[hidden email]>
>> wrote:
>>
>> > On 17/02/2018 6:36 PM, [hidden email] wrote:
>> >
>> >> Hi Scott,
>> >>
>> >> Thanks for the patch. I'm not really involved in R development; it
>> >> will be up to someone in the R core team to apply it. I would hazard
>> >> to say that even if correct (I haven't checked), it will not be
>> >> applied because the change might break existing code. For example it
>> >> seems like reasonable code might easily assume that a column with the
>> >> same name as "by.x" exists in the output of 'merge'. That's just my
>> >> best guess... I don't participate on here often.
>> >>
>> >
>> >
>> > I think you're right.  If I were still a member of R Core, I would want
>> to
>> > test this against all packages on CRAN and Bioconductor, and since that
>> > test takes a couple of days to run on my laptop, I'd probably never get
>> > around to it.
>> >
>> > There are lots of cases where "I would have done that differently", but
>> > most of them are far too much trouble to change now that R is more than
>> 20
>> > years old.  And in many cases it will turn out that the way R does it
>> > actually does make more sense than the way I would have done it.
>> >
>> > Duncan Murdoch
>> >
>> >
>> >
>> >> Cheers,
>> >>
>> >> Frederick
>> >>
>> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
>> >>
>> >>> The attached patch.diff will make merge.data.frame() append the
>> suffixes
>> >>> to
>> >>> columns with common names between by.x and names(y).
>> >>>
>> >>> Best,
>> >>>
>> >>> Scott Ritchie
>> >>>
>> >>> On 17 February 2018 at 11:15, Scott Ritchie <[hidden email]>
>> >>> wrote:
>> >>>
>> >>> Hi Frederick,
>> >>>>
>> >>>> I would expect that any duplicate names in the resulting data.frame
>> >>>> would
>> >>>> have the suffixes appended to them, regardless of whether or not they
>> >>>> are
>> >>>> used as the join key. So in my example I would expect "names.x" and
>> >>>> "names.y" to indicate their source data.frame.
>> >>>>
>> >>>> While careful reading of the documentation reveals this is not the
>> >>>> case, I
>> >>>> would argue the intent of the suffixes functionality should equally
>> be
>> >>>> applied to this type of case.
>> >>>>
>> >>>> If you agree this would be useful, I'm happy to write a patch for
>> >>>> merge.data.frame that will add suffixes in this case - I intend to do
>> >>>> the
>> >>>> same for merge.data.table in the data.table package where I initially
>> >>>> encountered the edge case.
>> >>>>
>> >>>> Best,
>> >>>>
>> >>>> Scott
>> >>>>
>> >>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
>> >>>>
>> >>>> Hi Scott,
>> >>>>>
>> >>>>> It seems like reasonable behavior to me. What result would you
>> expect?
>> >>>>> That the second "name" should be called "name.y"?
>> >>>>>
>> >>>>> The "merge" documentation says:
>> >>>>>
>> >>>>>      If the columns in the data frames not used in merging have any
>> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by
>> default)
>> >>>>>      appended to try to make the names of the result unique.
>> >>>>>
>> >>>>> Since the first "name" column was used in merging, leaving both
>> >>>>> without a suffix seems consistent with the documentation...
>> >>>>>
>> >>>>> Frederick
>> >>>>>
>> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
>> >>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I was unable to find a bug report for this with a cursory search,
>> but
>> >>>>>>
>> >>>>> would
>> >>>>>
>> >>>>>> like clarification if this is intended or unavoidable behaviour:
>> >>>>>>
>> >>>>>> ```{r}
>> >>>>>> # Create example data.frames
>> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
>> >>>>>>                        sex=c("F", "M", "F", "M"),
>> >>>>>>                        age=c(41, 43, 36, 51))
>> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
>> >>>>>>                         name=c("Oliver", "Sebastian", "Kai-lee"),
>> >>>>>>                         sex=c("M", "M", "F"),
>> >>>>>>                         age=c(5,8,7))
>> >>>>>>
>> >>>>>> # Merge() creates a duplicated "name" column:
>> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
>> >>>>>> ```
>> >>>>>>
>> >>>>>> Output:
>> >>>>>> ```
>> >>>>>>     name sex.x age.x      name sex.y age.y
>> >>>>>> 1   Max     M    43 Sebastian     M     8
>> >>>>>> 2   Qin     F    36   Kai-lee     F     7
>> >>>>>> 3 Sarah     F    41    Oliver     M     5
>> >>>>>> Warning message:
>> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
>> >>>>>> "parent") :
>> >>>>>>    column name ‘name’ is duplicated in the result
>> >>>>>> ```
>> >>>>>>
>> >>>>>> Kind Regards,
>> >>>>>>
>> >>>>>> Scott Ritchie
>> >>>>>>
>> >>>>>>        [[alternative HTML version deleted]]
>> >>>>>>
>> >>>>>> ______________________________________________
>> >>>>>> [hidden email] mailing list
>> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >> Index: src/library/base/R/merge.R
>> >>> ===================================================================
>> >>> --- src/library/base/R/merge.R  (revision 74264)
>> >>> +++ src/library/base/R/merge.R  (working copy)
>> >>> @@ -157,6 +157,15 @@
>> >>>           }
>> >>>             if(has.common.nms) names(y) <- nm.y
>> >>> +        ## If by.x %in% names(y) then duplicate column names still
>> >>> arise,
>> >>> +        ## apply suffixes to these
>> >>> +        dupe.keyx <- intersect(nm.by, names(y))
>> >>> +        if(length(dupe.keyx)) {
>> >>> +          if(nzchar(suffixes[1L]))
>> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
>> >>> paste(dupe.keyx, suffixes[1L], sep="")
>> >>> +          if(nzchar(suffixes[2L]))
>> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
>> >>> paste(dupe.keyx, suffixes[2L], sep="")
>> >>> +        }
>> >>>           nm <- c(names(x), names(y))
>> >>>           if(any(d <- duplicated(nm)))
>> >>>               if(sum(d) > 1L)
>> >>>
>> >>
>> >> ______________________________________________
>> >> [hidden email] mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >>
>> >>
>> >
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
>
> --
> Gabriel Becker, PhD
> Scientist (Bioinformatics)
> Genentech Research
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

patch.diff (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

frederik
Hi Scott,

I think that's a good idea and I tried your patch on my copy of the
repository. But it looks to me like the recent patch is identical to
the previous one, can you confirm this?

Frederick

On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:

> Thanks Gabriel,
>
> I think your suggested approach is 100% backwards compatible
>
> Currently in the case of duplicate column names only the first can be
> indexed by its name. This will always be the column appearing in by.x,
> meaning the column in y with the same name cannot be accessed. Appending
> ".y" (suffixes[2L]) to this column means it can now be accessed, while
> keeping the current behaviour of making the key columns always accessible
> by using the names provided to by.x.
>
> I've attached a new patch that has this behaviour.
>
> Best,
>
> Scott
>
> On 19 February 2018 at 05:08, Gabriel Becker <[hidden email]> wrote:
>
> > It seems like there is a way that is backwards compatible-ish in the sense
> > mentioned and still has the (arguably, but a good argument I think) better
> > behavior:
> >
> > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name' column
> > is called name and y's 'name' column (not used int he merge) is changed to
> > name.y.
> >
> > Now of course this would still change output, but it would change it to
> > something I think would be better, while retaining the 'merge columns
> > retain their exact names' mechanic as documented.
> >
> > ~G
> >
> > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <[hidden email]>
> > wrote:
> >
> >> Thanks Duncan and Frederick,
> >>
> >> I suspected as much - there doesn't appear to be any reason why conflicts
> >> between by.x and names(y) shouldn't and cannot be checked, but I can see
> >> how this might be more trouble than its worth given it potentially may
> >> break downstream packages (i.e. any cases where this occurs but they
> >> expect
> >> the name of the key column(s) to remain the same).
> >>
> >> Best,
> >>
> >> Scott
> >>
> >> On 18 February 2018 at 11:48, Duncan Murdoch <[hidden email]>
> >> wrote:
> >>
> >> > On 17/02/2018 6:36 PM, [hidden email] wrote:
> >> >
> >> >> Hi Scott,
> >> >>
> >> >> Thanks for the patch. I'm not really involved in R development; it
> >> >> will be up to someone in the R core team to apply it. I would hazard
> >> >> to say that even if correct (I haven't checked), it will not be
> >> >> applied because the change might break existing code. For example it
> >> >> seems like reasonable code might easily assume that a column with the
> >> >> same name as "by.x" exists in the output of 'merge'. That's just my
> >> >> best guess... I don't participate on here often.
> >> >>
> >> >
> >> >
> >> > I think you're right.  If I were still a member of R Core, I would want
> >> to
> >> > test this against all packages on CRAN and Bioconductor, and since that
> >> > test takes a couple of days to run on my laptop, I'd probably never get
> >> > around to it.
> >> >
> >> > There are lots of cases where "I would have done that differently", but
> >> > most of them are far too much trouble to change now that R is more than
> >> 20
> >> > years old.  And in many cases it will turn out that the way R does it
> >> > actually does make more sense than the way I would have done it.
> >> >
> >> > Duncan Murdoch
> >> >
> >> >
> >> >
> >> >> Cheers,
> >> >>
> >> >> Frederick
> >> >>
> >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
> >> >>
> >> >>> The attached patch.diff will make merge.data.frame() append the
> >> suffixes
> >> >>> to
> >> >>> columns with common names between by.x and names(y).
> >> >>>
> >> >>> Best,
> >> >>>
> >> >>> Scott Ritchie
> >> >>>
> >> >>> On 17 February 2018 at 11:15, Scott Ritchie <[hidden email]>
> >> >>> wrote:
> >> >>>
> >> >>> Hi Frederick,
> >> >>>>
> >> >>>> I would expect that any duplicate names in the resulting data.frame
> >> >>>> would
> >> >>>> have the suffixes appended to them, regardless of whether or not they
> >> >>>> are
> >> >>>> used as the join key. So in my example I would expect "names.x" and
> >> >>>> "names.y" to indicate their source data.frame.
> >> >>>>
> >> >>>> While careful reading of the documentation reveals this is not the
> >> >>>> case, I
> >> >>>> would argue the intent of the suffixes functionality should equally
> >> be
> >> >>>> applied to this type of case.
> >> >>>>
> >> >>>> If you agree this would be useful, I'm happy to write a patch for
> >> >>>> merge.data.frame that will add suffixes in this case - I intend to do
> >> >>>> the
> >> >>>> same for merge.data.table in the data.table package where I initially
> >> >>>> encountered the edge case.
> >> >>>>
> >> >>>> Best,
> >> >>>>
> >> >>>> Scott
> >> >>>>
> >> >>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
> >> >>>>
> >> >>>> Hi Scott,
> >> >>>>>
> >> >>>>> It seems like reasonable behavior to me. What result would you
> >> expect?
> >> >>>>> That the second "name" should be called "name.y"?
> >> >>>>>
> >> >>>>> The "merge" documentation says:
> >> >>>>>
> >> >>>>>      If the columns in the data frames not used in merging have any
> >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by
> >> default)
> >> >>>>>      appended to try to make the names of the result unique.
> >> >>>>>
> >> >>>>> Since the first "name" column was used in merging, leaving both
> >> >>>>> without a suffix seems consistent with the documentation...
> >> >>>>>
> >> >>>>> Frederick
> >> >>>>>
> >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
> >> >>>>>
> >> >>>>>> Hi,
> >> >>>>>>
> >> >>>>>> I was unable to find a bug report for this with a cursory search,
> >> but
> >> >>>>>>
> >> >>>>> would
> >> >>>>>
> >> >>>>>> like clarification if this is intended or unavoidable behaviour:
> >> >>>>>>
> >> >>>>>> ```{r}
> >> >>>>>> # Create example data.frames
> >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> >> >>>>>>                        sex=c("F", "M", "F", "M"),
> >> >>>>>>                        age=c(41, 43, 36, 51))
> >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> >> >>>>>>                         name=c("Oliver", "Sebastian", "Kai-lee"),
> >> >>>>>>                         sex=c("M", "M", "F"),
> >> >>>>>>                         age=c(5,8,7))
> >> >>>>>>
> >> >>>>>> # Merge() creates a duplicated "name" column:
> >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
> >> >>>>>> ```
> >> >>>>>>
> >> >>>>>> Output:
> >> >>>>>> ```
> >> >>>>>>     name sex.x age.x      name sex.y age.y
> >> >>>>>> 1   Max     M    43 Sebastian     M     8
> >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
> >> >>>>>> 3 Sarah     F    41    Oliver     M     5
> >> >>>>>> Warning message:
> >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
> >> >>>>>> "parent") :
> >> >>>>>>    column name ‘name’ is duplicated in the result
> >> >>>>>> ```
> >> >>>>>>
> >> >>>>>> Kind Regards,
> >> >>>>>>
> >> >>>>>> Scott Ritchie
> >> >>>>>>
> >> >>>>>>        [[alternative HTML version deleted]]
> >> >>>>>>
> >> >>>>>> ______________________________________________
> >> >>>>>> [hidden email] mailing list
> >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >> >>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >> Index: src/library/base/R/merge.R
> >> >>> ===================================================================
> >> >>> --- src/library/base/R/merge.R  (revision 74264)
> >> >>> +++ src/library/base/R/merge.R  (working copy)
> >> >>> @@ -157,6 +157,15 @@
> >> >>>           }
> >> >>>             if(has.common.nms) names(y) <- nm.y
> >> >>> +        ## If by.x %in% names(y) then duplicate column names still
> >> >>> arise,
> >> >>> +        ## apply suffixes to these
> >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
> >> >>> +        if(length(dupe.keyx)) {
> >> >>> +          if(nzchar(suffixes[1L]))
> >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> >> >>> paste(dupe.keyx, suffixes[1L], sep="")
> >> >>> +          if(nzchar(suffixes[2L]))
> >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> >> >>> paste(dupe.keyx, suffixes[2L], sep="")
> >> >>> +        }
> >> >>>           nm <- c(names(x), names(y))
> >> >>>           if(any(d <- duplicated(nm)))
> >> >>>               if(sum(d) > 1L)
> >> >>>
> >> >>
> >> >> ______________________________________________
> >> >> [hidden email] mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >> >>
> >> >>
> >> >
> >>
> >>         [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> [hidden email] mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >
> >
> >
> > --
> > Gabriel Becker, PhD
> > Scientist (Bioinformatics)
> > Genentech Research
> >

> Index: src/library/base/R/merge.R
> ===================================================================
> --- src/library/base/R/merge.R (revision 74264)
> +++ src/library/base/R/merge.R (working copy)
> @@ -157,6 +157,15 @@
>          }
>  
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to these
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[1L]))
> +            names(x)[match(dupe.keyx, names(x), 0L)] <- paste(dupe.keyx, suffixes[1L], sep="")
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)

> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Scott Ritchie-2
Hi Frederick,

It looks like I didn't overwrite the patch.diff file after the last edits.
Here's the correct patch (attached and copied below):

Index: src/library/base/R/merge.R
===================================================================
--- src/library/base/R/merge.R (revision 74280)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,14 @@
         }

         if(has.common.nms) names(y) <- nm.y
+        ## If by.x %in% names(y) then duplicate column names still arise,
+        ## apply suffixes to just y - this keeps backwards compatibility
+        ## when referring to by.x in the resulting data.frame
+        dupe.keyx <- intersect(nm.by, names(y))
+        if(length(dupe.keyx)) {
+          if(nzchar(suffixes[2L]))
+            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
suffixes[2L], sep="")
+        }
         nm <- c(names(x), names(y))
         if(any(d <- duplicated(nm)))
             if(sum(d) > 1L)

Best,

Scott

On 21 February 2018 at 08:23, <[hidden email]> wrote:

> Hi Scott,
>
> I think that's a good idea and I tried your patch on my copy of the
> repository. But it looks to me like the recent patch is identical to
> the previous one, can you confirm this?
>
> Frederick
>
> On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
> > Thanks Gabriel,
> >
> > I think your suggested approach is 100% backwards compatible
> >
> > Currently in the case of duplicate column names only the first can be
> > indexed by its name. This will always be the column appearing in by.x,
> > meaning the column in y with the same name cannot be accessed. Appending
> > ".y" (suffixes[2L]) to this column means it can now be accessed, while
> > keeping the current behaviour of making the key columns always accessible
> > by using the names provided to by.x.
> >
> > I've attached a new patch that has this behaviour.
> >
> > Best,
> >
> > Scott
> >
> > On 19 February 2018 at 05:08, Gabriel Becker <[hidden email]>
> wrote:
> >
> > > It seems like there is a way that is backwards compatible-ish in the
> sense
> > > mentioned and still has the (arguably, but a good argument I think)
> better
> > > behavior:
> > >
> > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name'
> column
> > > is called name and y's 'name' column (not used int he merge) is
> changed to
> > > name.y.
> > >
> > > Now of course this would still change output, but it would change it to
> > > something I think would be better, while retaining the 'merge columns
> > > retain their exact names' mechanic as documented.
> > >
> > > ~G
> > >
> > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <[hidden email]>
> > > wrote:
> > >
> > >> Thanks Duncan and Frederick,
> > >>
> > >> I suspected as much - there doesn't appear to be any reason why
> conflicts
> > >> between by.x and names(y) shouldn't and cannot be checked, but I can
> see
> > >> how this might be more trouble than its worth given it potentially may
> > >> break downstream packages (i.e. any cases where this occurs but they
> > >> expect
> > >> the name of the key column(s) to remain the same).
> > >>
> > >> Best,
> > >>
> > >> Scott
> > >>
> > >> On 18 February 2018 at 11:48, Duncan Murdoch <
> [hidden email]>
> > >> wrote:
> > >>
> > >> > On 17/02/2018 6:36 PM, [hidden email] wrote:
> > >> >
> > >> >> Hi Scott,
> > >> >>
> > >> >> Thanks for the patch. I'm not really involved in R development; it
> > >> >> will be up to someone in the R core team to apply it. I would
> hazard
> > >> >> to say that even if correct (I haven't checked), it will not be
> > >> >> applied because the change might break existing code. For example
> it
> > >> >> seems like reasonable code might easily assume that a column with
> the
> > >> >> same name as "by.x" exists in the output of 'merge'. That's just my
> > >> >> best guess... I don't participate on here often.
> > >> >>
> > >> >
> > >> >
> > >> > I think you're right.  If I were still a member of R Core, I would
> want
> > >> to
> > >> > test this against all packages on CRAN and Bioconductor, and since
> that
> > >> > test takes a couple of days to run on my laptop, I'd probably never
> get
> > >> > around to it.
> > >> >
> > >> > There are lots of cases where "I would have done that differently",
> but
> > >> > most of them are far too much trouble to change now that R is more
> than
> > >> 20
> > >> > years old.  And in many cases it will turn out that the way R does
> it
> > >> > actually does make more sense than the way I would have done it.
> > >> >
> > >> > Duncan Murdoch
> > >> >
> > >> >
> > >> >
> > >> >> Cheers,
> > >> >>
> > >> >> Frederick
> > >> >>
> > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
> > >> >>
> > >> >>> The attached patch.diff will make merge.data.frame() append the
> > >> suffixes
> > >> >>> to
> > >> >>> columns with common names between by.x and names(y).
> > >> >>>
> > >> >>> Best,
> > >> >>>
> > >> >>> Scott Ritchie
> > >> >>>
> > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
> [hidden email]>
> > >> >>> wrote:
> > >> >>>
> > >> >>> Hi Frederick,
> > >> >>>>
> > >> >>>> I would expect that any duplicate names in the resulting
> data.frame
> > >> >>>> would
> > >> >>>> have the suffixes appended to them, regardless of whether or not
> they
> > >> >>>> are
> > >> >>>> used as the join key. So in my example I would expect "names.x"
> and
> > >> >>>> "names.y" to indicate their source data.frame.
> > >> >>>>
> > >> >>>> While careful reading of the documentation reveals this is not
> the
> > >> >>>> case, I
> > >> >>>> would argue the intent of the suffixes functionality should
> equally
> > >> be
> > >> >>>> applied to this type of case.
> > >> >>>>
> > >> >>>> If you agree this would be useful, I'm happy to write a patch for
> > >> >>>> merge.data.frame that will add suffixes in this case - I intend
> to do
> > >> >>>> the
> > >> >>>> same for merge.data.table in the data.table package where I
> initially
> > >> >>>> encountered the edge case.
> > >> >>>>
> > >> >>>> Best,
> > >> >>>>
> > >> >>>> Scott
> > >> >>>>
> > >> >>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
> > >> >>>>
> > >> >>>> Hi Scott,
> > >> >>>>>
> > >> >>>>> It seems like reasonable behavior to me. What result would you
> > >> expect?
> > >> >>>>> That the second "name" should be called "name.y"?
> > >> >>>>>
> > >> >>>>> The "merge" documentation says:
> > >> >>>>>
> > >> >>>>>      If the columns in the data frames not used in merging have
> any
> > >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by
> > >> default)
> > >> >>>>>      appended to try to make the names of the result unique.
> > >> >>>>>
> > >> >>>>> Since the first "name" column was used in merging, leaving both
> > >> >>>>> without a suffix seems consistent with the documentation...
> > >> >>>>>
> > >> >>>>> Frederick
> > >> >>>>>
> > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
> > >> >>>>>
> > >> >>>>>> Hi,
> > >> >>>>>>
> > >> >>>>>> I was unable to find a bug report for this with a cursory
> search,
> > >> but
> > >> >>>>>>
> > >> >>>>> would
> > >> >>>>>
> > >> >>>>>> like clarification if this is intended or unavoidable
> behaviour:
> > >> >>>>>>
> > >> >>>>>> ```{r}
> > >> >>>>>> # Create example data.frames
> > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> > >> >>>>>>                        sex=c("F", "M", "F", "M"),
> > >> >>>>>>                        age=c(41, 43, 36, 51))
> > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> > >> >>>>>>                         name=c("Oliver", "Sebastian",
> "Kai-lee"),
> > >> >>>>>>                         sex=c("M", "M", "F"),
> > >> >>>>>>                         age=c(5,8,7))
> > >> >>>>>>
> > >> >>>>>> # Merge() creates a duplicated "name" column:
> > >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
> > >> >>>>>> ```
> > >> >>>>>>
> > >> >>>>>> Output:
> > >> >>>>>> ```
> > >> >>>>>>     name sex.x age.x      name sex.y age.y
> > >> >>>>>> 1   Max     M    43 Sebastian     M     8
> > >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
> > >> >>>>>> 3 Sarah     F    41    Oliver     M     5
> > >> >>>>>> Warning message:
> > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
> > >> >>>>>> "parent") :
> > >> >>>>>>    column name ‘name’ is duplicated in the result
> > >> >>>>>> ```
> > >> >>>>>>
> > >> >>>>>> Kind Regards,
> > >> >>>>>>
> > >> >>>>>> Scott Ritchie
> > >> >>>>>>
> > >> >>>>>>        [[alternative HTML version deleted]]
> > >> >>>>>>
> > >> >>>>>> ______________________________________________
> > >> >>>>>> [hidden email] mailing list
> > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>
> > >> >>>>
> > >> >>>>
> > >> >> Index: src/library/base/R/merge.R
> > >> >>> ============================================================
> =======
> > >> >>> --- src/library/base/R/merge.R  (revision 74264)
> > >> >>> +++ src/library/base/R/merge.R  (working copy)
> > >> >>> @@ -157,6 +157,15 @@
> > >> >>>           }
> > >> >>>             if(has.common.nms) names(y) <- nm.y
> > >> >>> +        ## If by.x %in% names(y) then duplicate column names
> still
> > >> >>> arise,
> > >> >>> +        ## apply suffixes to these
> > >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
> > >> >>> +        if(length(dupe.keyx)) {
> > >> >>> +          if(nzchar(suffixes[1L]))
> > >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
> > >> >>> +          if(nzchar(suffixes[2L]))
> > >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
> > >> >>> +        }
> > >> >>>           nm <- c(names(x), names(y))
> > >> >>>           if(any(d <- duplicated(nm)))
> > >> >>>               if(sum(d) > 1L)
> > >> >>>
> > >> >>
> > >> >> ______________________________________________
> > >> >> [hidden email] mailing list
> > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >> >>
> > >> >>
> > >> >
> > >>
> > >>         [[alternative HTML version deleted]]
> > >>
> > >> ______________________________________________
> > >> [hidden email] mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >>
> > >
> > >
> > >
> > > --
> > > Gabriel Becker, PhD
> > > Scientist (Bioinformatics)
> > > Genentech Research
> > >
>
> > Index: src/library/base/R/merge.R
> > ===================================================================
> > --- src/library/base/R/merge.R        (revision 74264)
> > +++ src/library/base/R/merge.R        (working copy)
> > @@ -157,6 +157,15 @@
> >          }
> >
> >          if(has.common.nms) names(y) <- nm.y
> > +        ## If by.x %in% names(y) then duplicate column names still
> arise,
> > +        ## apply suffixes to these
> > +        dupe.keyx <- intersect(nm.by, names(y))
> > +        if(length(dupe.keyx)) {
> > +          if(nzchar(suffixes[1L]))
> > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> paste(dupe.keyx, suffixes[1L], sep="")
> > +          if(nzchar(suffixes[2L]))
> > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> paste(dupe.keyx, suffixes[2L], sep="")
> > +        }
> >          nm <- c(names(x), names(y))
> >          if(any(d <- duplicated(nm)))
> >              if(sum(d) > 1L)
>
> > ______________________________________________
> > [hidden email] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

patch.diff (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

frederik
Hi Scott,

I tried the new patch and can confirm that it has the advertised
behavior on a couple of test cases. I think it makes sense to apply
it, because any existing code which refers to a second duplicate
data.frame column by name is already broken, while if the reference is
by numerical index then changing the column name shouldn't break it.

I don't know if you need to update the documentation as part of your
patch, or if whoever applies it would be happy to do that. Somebody
from R core want to weigh in on this?

I attach a file with the test example from your original email as well
as a second test case I added with two "by" columns.

Thanks,

Frederick

On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:

> Hi Frederick,
>
> It looks like I didn't overwrite the patch.diff file after the last edits.
> Here's the correct patch (attached and copied below):
>
> Index: src/library/base/R/merge.R
> ===================================================================
> --- src/library/base/R/merge.R (revision 74280)
> +++ src/library/base/R/merge.R (working copy)
> @@ -157,6 +157,14 @@
>          }
>
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to just y - this keeps backwards compatibility
> +        ## when referring to by.x in the resulting data.frame
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
> suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)
>
> Best,
>
> Scott
>
> On 21 February 2018 at 08:23, <[hidden email]> wrote:
>
> > Hi Scott,
> >
> > I think that's a good idea and I tried your patch on my copy of the
> > repository. But it looks to me like the recent patch is identical to
> > the previous one, can you confirm this?
> >
> > Frederick
> >
> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
> > > Thanks Gabriel,
> > >
> > > I think your suggested approach is 100% backwards compatible
> > >
> > > Currently in the case of duplicate column names only the first can be
> > > indexed by its name. This will always be the column appearing in by.x,
> > > meaning the column in y with the same name cannot be accessed. Appending
> > > ".y" (suffixes[2L]) to this column means it can now be accessed, while
> > > keeping the current behaviour of making the key columns always accessible
> > > by using the names provided to by.x.
> > >
> > > I've attached a new patch that has this behaviour.
> > >
> > > Best,
> > >
> > > Scott
> > >
> > > On 19 February 2018 at 05:08, Gabriel Becker <[hidden email]>
> > wrote:
> > >
> > > > It seems like there is a way that is backwards compatible-ish in the
> > sense
> > > > mentioned and still has the (arguably, but a good argument I think)
> > better
> > > > behavior:
> > > >
> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name'
> > column
> > > > is called name and y's 'name' column (not used int he merge) is
> > changed to
> > > > name.y.
> > > >
> > > > Now of course this would still change output, but it would change it to
> > > > something I think would be better, while retaining the 'merge columns
> > > > retain their exact names' mechanic as documented.
> > > >
> > > > ~G
> > > >
> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <[hidden email]>
> > > > wrote:
> > > >
> > > >> Thanks Duncan and Frederick,
> > > >>
> > > >> I suspected as much - there doesn't appear to be any reason why
> > conflicts
> > > >> between by.x and names(y) shouldn't and cannot be checked, but I can
> > see
> > > >> how this might be more trouble than its worth given it potentially may
> > > >> break downstream packages (i.e. any cases where this occurs but they
> > > >> expect
> > > >> the name of the key column(s) to remain the same).
> > > >>
> > > >> Best,
> > > >>
> > > >> Scott
> > > >>
> > > >> On 18 February 2018 at 11:48, Duncan Murdoch <
> > [hidden email]>
> > > >> wrote:
> > > >>
> > > >> > On 17/02/2018 6:36 PM, [hidden email] wrote:
> > > >> >
> > > >> >> Hi Scott,
> > > >> >>
> > > >> >> Thanks for the patch. I'm not really involved in R development; it
> > > >> >> will be up to someone in the R core team to apply it. I would
> > hazard
> > > >> >> to say that even if correct (I haven't checked), it will not be
> > > >> >> applied because the change might break existing code. For example
> > it
> > > >> >> seems like reasonable code might easily assume that a column with
> > the
> > > >> >> same name as "by.x" exists in the output of 'merge'. That's just my
> > > >> >> best guess... I don't participate on here often.
> > > >> >>
> > > >> >
> > > >> >
> > > >> > I think you're right.  If I were still a member of R Core, I would
> > want
> > > >> to
> > > >> > test this against all packages on CRAN and Bioconductor, and since
> > that
> > > >> > test takes a couple of days to run on my laptop, I'd probably never
> > get
> > > >> > around to it.
> > > >> >
> > > >> > There are lots of cases where "I would have done that differently",
> > but
> > > >> > most of them are far too much trouble to change now that R is more
> > than
> > > >> 20
> > > >> > years old.  And in many cases it will turn out that the way R does
> > it
> > > >> > actually does make more sense than the way I would have done it.
> > > >> >
> > > >> > Duncan Murdoch
> > > >> >
> > > >> >
> > > >> >
> > > >> >> Cheers,
> > > >> >>
> > > >> >> Frederick
> > > >> >>
> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
> > > >> >>
> > > >> >>> The attached patch.diff will make merge.data.frame() append the
> > > >> suffixes
> > > >> >>> to
> > > >> >>> columns with common names between by.x and names(y).
> > > >> >>>
> > > >> >>> Best,
> > > >> >>>
> > > >> >>> Scott Ritchie
> > > >> >>>
> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
> > [hidden email]>
> > > >> >>> wrote:
> > > >> >>>
> > > >> >>> Hi Frederick,
> > > >> >>>>
> > > >> >>>> I would expect that any duplicate names in the resulting
> > data.frame
> > > >> >>>> would
> > > >> >>>> have the suffixes appended to them, regardless of whether or not
> > they
> > > >> >>>> are
> > > >> >>>> used as the join key. So in my example I would expect "names.x"
> > and
> > > >> >>>> "names.y" to indicate their source data.frame.
> > > >> >>>>
> > > >> >>>> While careful reading of the documentation reveals this is not
> > the
> > > >> >>>> case, I
> > > >> >>>> would argue the intent of the suffixes functionality should
> > equally
> > > >> be
> > > >> >>>> applied to this type of case.
> > > >> >>>>
> > > >> >>>> If you agree this would be useful, I'm happy to write a patch for
> > > >> >>>> merge.data.frame that will add suffixes in this case - I intend
> > to do
> > > >> >>>> the
> > > >> >>>> same for merge.data.table in the data.table package where I
> > initially
> > > >> >>>> encountered the edge case.
> > > >> >>>>
> > > >> >>>> Best,
> > > >> >>>>
> > > >> >>>> Scott
> > > >> >>>>
> > > >> >>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
> > > >> >>>>
> > > >> >>>> Hi Scott,
> > > >> >>>>>
> > > >> >>>>> It seems like reasonable behavior to me. What result would you
> > > >> expect?
> > > >> >>>>> That the second "name" should be called "name.y"?
> > > >> >>>>>
> > > >> >>>>> The "merge" documentation says:
> > > >> >>>>>
> > > >> >>>>>      If the columns in the data frames not used in merging have
> > any
> > > >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’ by
> > > >> default)
> > > >> >>>>>      appended to try to make the names of the result unique.
> > > >> >>>>>
> > > >> >>>>> Since the first "name" column was used in merging, leaving both
> > > >> >>>>> without a suffix seems consistent with the documentation...
> > > >> >>>>>
> > > >> >>>>> Frederick
> > > >> >>>>>
> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie wrote:
> > > >> >>>>>
> > > >> >>>>>> Hi,
> > > >> >>>>>>
> > > >> >>>>>> I was unable to find a bug report for this with a cursory
> > search,
> > > >> but
> > > >> >>>>>>
> > > >> >>>>> would
> > > >> >>>>>
> > > >> >>>>>> like clarification if this is intended or unavoidable
> > behaviour:
> > > >> >>>>>>
> > > >> >>>>>> ```{r}
> > > >> >>>>>> # Create example data.frames
> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> > > >> >>>>>>                        sex=c("F", "M", "F", "M"),
> > > >> >>>>>>                        age=c(41, 43, 36, 51))
> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> > > >> >>>>>>                         name=c("Oliver", "Sebastian",
> > "Kai-lee"),
> > > >> >>>>>>                         sex=c("M", "M", "F"),
> > > >> >>>>>>                         age=c(5,8,7))
> > > >> >>>>>>
> > > >> >>>>>> # Merge() creates a duplicated "name" column:
> > > >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
> > > >> >>>>>> ```
> > > >> >>>>>>
> > > >> >>>>>> Output:
> > > >> >>>>>> ```
> > > >> >>>>>>     name sex.x age.x      name sex.y age.y
> > > >> >>>>>> 1   Max     M    43 Sebastian     M     8
> > > >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
> > > >> >>>>>> 3 Sarah     F    41    Oliver     M     5
> > > >> >>>>>> Warning message:
> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
> > > >> >>>>>> "parent") :
> > > >> >>>>>>    column name ‘name’ is duplicated in the result
> > > >> >>>>>> ```
> > > >> >>>>>>
> > > >> >>>>>> Kind Regards,
> > > >> >>>>>>
> > > >> >>>>>> Scott Ritchie
> > > >> >>>>>>
> > > >> >>>>>>        [[alternative HTML version deleted]]
> > > >> >>>>>>
> > > >> >>>>>> ______________________________________________
> > > >> >>>>>> [hidden email] mailing list
> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >> Index: src/library/base/R/merge.R
> > > >> >>> ============================================================
> > =======
> > > >> >>> --- src/library/base/R/merge.R  (revision 74264)
> > > >> >>> +++ src/library/base/R/merge.R  (working copy)
> > > >> >>> @@ -157,6 +157,15 @@
> > > >> >>>           }
> > > >> >>>             if(has.common.nms) names(y) <- nm.y
> > > >> >>> +        ## If by.x %in% names(y) then duplicate column names
> > still
> > > >> >>> arise,
> > > >> >>> +        ## apply suffixes to these
> > > >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
> > > >> >>> +        if(length(dupe.keyx)) {
> > > >> >>> +          if(nzchar(suffixes[1L]))
> > > >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
> > > >> >>> +          if(nzchar(suffixes[2L]))
> > > >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
> > > >> >>> +        }
> > > >> >>>           nm <- c(names(x), names(y))
> > > >> >>>           if(any(d <- duplicated(nm)))
> > > >> >>>               if(sum(d) > 1L)
> > > >> >>>
> > > >> >>
> > > >> >> ______________________________________________
> > > >> >> [hidden email] mailing list
> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > > >> >>
> > > >> >>
> > > >> >
> > > >>
> > > >>         [[alternative HTML version deleted]]
> > > >>
> > > >> ______________________________________________
> > > >> [hidden email] mailing list
> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Gabriel Becker, PhD
> > > > Scientist (Bioinformatics)
> > > > Genentech Research
> > > >
> >
> > > Index: src/library/base/R/merge.R
> > > ===================================================================
> > > --- src/library/base/R/merge.R        (revision 74264)
> > > +++ src/library/base/R/merge.R        (working copy)
> > > @@ -157,6 +157,15 @@
> > >          }
> > >
> > >          if(has.common.nms) names(y) <- nm.y
> > > +        ## If by.x %in% names(y) then duplicate column names still
> > arise,
> > > +        ## apply suffixes to these
> > > +        dupe.keyx <- intersect(nm.by, names(y))
> > > +        if(length(dupe.keyx)) {
> > > +          if(nzchar(suffixes[1L]))
> > > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> > paste(dupe.keyx, suffixes[1L], sep="")
> > > +          if(nzchar(suffixes[2L]))
> > > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> > paste(dupe.keyx, suffixes[2L], sep="")
> > > +        }
> > >          nm <- c(names(x), names(y))
> > >          if(any(d <- duplicated(nm)))
> > >              if(sum(d) > 1L)
> >
> > > ______________________________________________
> > > [hidden email] mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >

> Index: src/library/base/R/merge.R
> ===================================================================
> --- src/library/base/R/merge.R (revision 74280)
> +++ src/library/base/R/merge.R (working copy)
> @@ -157,6 +157,14 @@
>          }
>  
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to just y - this keeps backwards compatibility
> +        ## when referring to by.x in the resulting data.frame
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

ritchie-testcase (733 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Gabriel Becker
Hi all,

For the record this approach isnt 100% backwards compatible, because
names(mergeddf) will e incompatibly different. Thatx why i claimed
bakcwards compatable-ish

That said its still worth considering imho because of the reasons stated
(and honestly one particular simple reading of the docs might suggest that
this was thr intended behavior all along). Im not a member of Rcore through
so i cant do said considering myself.

Best,
~G

On Feb 20, 2018 7:15 PM, <[hidden email]> wrote:

Hi Scott,

I tried the new patch and can confirm that it has the advertised
behavior on a couple of test cases. I think it makes sense to apply
it, because any existing code which refers to a second duplicate
data.frame column by name is already broken, while if the reference is
by numerical index then changing the column name shouldn't break it.

I don't know if you need to update the documentation as part of your
patch, or if whoever applies it would be happy to do that. Somebody
from R core want to weigh in on this?

I attach a file with the test example from your original email as well
as a second test case I added with two "by" columns.

Thanks,

Frederick

On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:

> Hi Frederick,
>
> It looks like I didn't overwrite the patch.diff file after the last edits.
> Here's the correct patch (attached and copied below):
>
> Index: src/library/base/R/merge.R
> ===================================================================
> --- src/library/base/R/merge.R (revision 74280)
> +++ src/library/base/R/merge.R (working copy)
> @@ -157,6 +157,14 @@
>          }
>
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to just y - this keeps backwards compatibility
> +        ## when referring to by.x in the resulting data.frame
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
> suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)
>
> Best,
>
> Scott
>
> On 21 February 2018 at 08:23, <[hidden email]> wrote:
>
> > Hi Scott,
> >
> > I think that's a good idea and I tried your patch on my copy of the
> > repository. But it looks to me like the recent patch is identical to
> > the previous one, can you confirm this?
> >
> > Frederick
> >
> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
> > > Thanks Gabriel,
> > >
> > > I think your suggested approach is 100% backwards compatible
> > >
> > > Currently in the case of duplicate column names only the first can be
> > > indexed by its name. This will always be the column appearing in by.x,
> > > meaning the column in y with the same name cannot be accessed.
Appending
> > > ".y" (suffixes[2L]) to this column means it can now be accessed, while
> > > keeping the current behaviour of making the key columns always
accessible

> > > by using the names provided to by.x.
> > >
> > > I've attached a new patch that has this behaviour.
> > >
> > > Best,
> > >
> > > Scott
> > >
> > > On 19 February 2018 at 05:08, Gabriel Becker <[hidden email]>
> > wrote:
> > >
> > > > It seems like there is a way that is backwards compatible-ish in the
> > sense
> > > > mentioned and still has the (arguably, but a good argument I think)
> > better
> > > > behavior:
> > > >
> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name'
> > column
> > > > is called name and y's 'name' column (not used int he merge) is
> > changed to
> > > > name.y.
> > > >
> > > > Now of course this would still change output, but it would change
it to
> > > > something I think would be better, while retaining the 'merge
columns
> > > > retain their exact names' mechanic as documented.
> > > >
> > > > ~G
> > > >
> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <
[hidden email]>
> > > > wrote:
> > > >
> > > >> Thanks Duncan and Frederick,
> > > >>
> > > >> I suspected as much - there doesn't appear to be any reason why
> > conflicts
> > > >> between by.x and names(y) shouldn't and cannot be checked, but I
can
> > see
> > > >> how this might be more trouble than its worth given it potentially
may
> > > >> break downstream packages (i.e. any cases where this occurs but
they

> > > >> expect
> > > >> the name of the key column(s) to remain the same).
> > > >>
> > > >> Best,
> > > >>
> > > >> Scott
> > > >>
> > > >> On 18 February 2018 at 11:48, Duncan Murdoch <
> > [hidden email]>
> > > >> wrote:
> > > >>
> > > >> > On 17/02/2018 6:36 PM, [hidden email] wrote:
> > > >> >
> > > >> >> Hi Scott,
> > > >> >>
> > > >> >> Thanks for the patch. I'm not really involved in R development;
it
> > > >> >> will be up to someone in the R core team to apply it. I would
> > hazard
> > > >> >> to say that even if correct (I haven't checked), it will not be
> > > >> >> applied because the change might break existing code. For
example
> > it
> > > >> >> seems like reasonable code might easily assume that a column
with
> > the
> > > >> >> same name as "by.x" exists in the output of 'merge'. That's
just my
> > > >> >> best guess... I don't participate on here often.
> > > >> >>
> > > >> >
> > > >> >
> > > >> > I think you're right.  If I were still a member of R Core, I
would
> > want
> > > >> to
> > > >> > test this against all packages on CRAN and Bioconductor, and
since
> > that
> > > >> > test takes a couple of days to run on my laptop, I'd probably
never
> > get
> > > >> > around to it.
> > > >> >
> > > >> > There are lots of cases where "I would have done that
differently",
> > but
> > > >> > most of them are far too much trouble to change now that R is
more
> > than
> > > >> 20
> > > >> > years old.  And in many cases it will turn out that the way R
does

> > it
> > > >> > actually does make more sense than the way I would have done it.
> > > >> >
> > > >> > Duncan Murdoch
> > > >> >
> > > >> >
> > > >> >
> > > >> >> Cheers,
> > > >> >>
> > > >> >> Frederick
> > > >> >>
> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
> > > >> >>
> > > >> >>> The attached patch.diff will make merge.data.frame() append the
> > > >> suffixes
> > > >> >>> to
> > > >> >>> columns with common names between by.x and names(y).
> > > >> >>>
> > > >> >>> Best,
> > > >> >>>
> > > >> >>> Scott Ritchie
> > > >> >>>
> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
> > [hidden email]>
> > > >> >>> wrote:
> > > >> >>>
> > > >> >>> Hi Frederick,
> > > >> >>>>
> > > >> >>>> I would expect that any duplicate names in the resulting
> > data.frame
> > > >> >>>> would
> > > >> >>>> have the suffixes appended to them, regardless of whether or
not
> > they
> > > >> >>>> are
> > > >> >>>> used as the join key. So in my example I would expect
"names.x"

> > and
> > > >> >>>> "names.y" to indicate their source data.frame.
> > > >> >>>>
> > > >> >>>> While careful reading of the documentation reveals this is not
> > the
> > > >> >>>> case, I
> > > >> >>>> would argue the intent of the suffixes functionality should
> > equally
> > > >> be
> > > >> >>>> applied to this type of case.
> > > >> >>>>
> > > >> >>>> If you agree this would be useful, I'm happy to write a patch
for
> > > >> >>>> merge.data.frame that will add suffixes in this case - I
intend

> > to do
> > > >> >>>> the
> > > >> >>>> same for merge.data.table in the data.table package where I
> > initially
> > > >> >>>> encountered the edge case.
> > > >> >>>>
> > > >> >>>> Best,
> > > >> >>>>
> > > >> >>>> Scott
> > > >> >>>>
> > > >> >>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
> > > >> >>>>
> > > >> >>>> Hi Scott,
> > > >> >>>>>
> > > >> >>>>> It seems like reasonable behavior to me. What result would
you
> > > >> expect?
> > > >> >>>>> That the second "name" should be called "name.y"?
> > > >> >>>>>
> > > >> >>>>> The "merge" documentation says:
> > > >> >>>>>
> > > >> >>>>>      If the columns in the data frames not used in merging
have
> > any
> > > >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’
by
> > > >> default)
> > > >> >>>>>      appended to try to make the names of the result unique.
> > > >> >>>>>
> > > >> >>>>> Since the first "name" column was used in merging, leaving
both
> > > >> >>>>> without a suffix seems consistent with the documentation...
> > > >> >>>>>
> > > >> >>>>> Frederick
> > > >> >>>>>
> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie
wrote:

> > > >> >>>>>
> > > >> >>>>>> Hi,
> > > >> >>>>>>
> > > >> >>>>>> I was unable to find a bug report for this with a cursory
> > search,
> > > >> but
> > > >> >>>>>>
> > > >> >>>>> would
> > > >> >>>>>
> > > >> >>>>>> like clarification if this is intended or unavoidable
> > behaviour:
> > > >> >>>>>>
> > > >> >>>>>> ```{r}
> > > >> >>>>>> # Create example data.frames
> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
> > > >> >>>>>>                        sex=c("F", "M", "F", "M"),
> > > >> >>>>>>                        age=c(41, 43, 36, 51))
> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
> > > >> >>>>>>                         name=c("Oliver", "Sebastian",
> > "Kai-lee"),
> > > >> >>>>>>                         sex=c("M", "M", "F"),
> > > >> >>>>>>                         age=c(5,8,7))
> > > >> >>>>>>
> > > >> >>>>>> # Merge() creates a duplicated "name" column:
> > > >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
> > > >> >>>>>> ```
> > > >> >>>>>>
> > > >> >>>>>> Output:
> > > >> >>>>>> ```
> > > >> >>>>>>     name sex.x age.x      name sex.y age.y
> > > >> >>>>>> 1   Max     M    43 Sebastian     M     8
> > > >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
> > > >> >>>>>> 3 Sarah     F    41    Oliver     M     5
> > > >> >>>>>> Warning message:
> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
> > > >> >>>>>> "parent") :
> > > >> >>>>>>    column name ‘name’ is duplicated in the result
> > > >> >>>>>> ```
> > > >> >>>>>>
> > > >> >>>>>> Kind Regards,
> > > >> >>>>>>
> > > >> >>>>>> Scott Ritchie
> > > >> >>>>>>
> > > >> >>>>>>        [[alternative HTML version deleted]]
> > > >> >>>>>>
> > > >> >>>>>> ______________________________________________
> > > >> >>>>>> [hidden email] mailing list
> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >> Index: src/library/base/R/merge.R
> > > >> >>> ============================================================
> > =======
> > > >> >>> --- src/library/base/R/merge.R  (revision 74264)
> > > >> >>> +++ src/library/base/R/merge.R  (working copy)
> > > >> >>> @@ -157,6 +157,15 @@
> > > >> >>>           }
> > > >> >>>             if(has.common.nms) names(y) <- nm.y
> > > >> >>> +        ## If by.x %in% names(y) then duplicate column names
> > still
> > > >> >>> arise,
> > > >> >>> +        ## apply suffixes to these
> > > >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
> > > >> >>> +        if(length(dupe.keyx)) {
> > > >> >>> +          if(nzchar(suffixes[1L]))
> > > >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
> > > >> >>> +          if(nzchar(suffixes[2L]))
> > > >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
> > > >> >>> +        }
> > > >> >>>           nm <- c(names(x), names(y))
> > > >> >>>           if(any(d <- duplicated(nm)))
> > > >> >>>               if(sum(d) > 1L)
> > > >> >>>
> > > >> >>
> > > >> >> ______________________________________________
> > > >> >> [hidden email] mailing list
> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > > >> >>
> > > >> >>
> > > >> >
> > > >>
> > > >>         [[alternative HTML version deleted]]
> > > >>
> > > >> ______________________________________________
> > > >> [hidden email] mailing list
> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Gabriel Becker, PhD
> > > > Scientist (Bioinformatics)
> > > > Genentech Research
> > > >
> >
> > > Index: src/library/base/R/merge.R
> > > ===================================================================
> > > --- src/library/base/R/merge.R        (revision 74264)
> > > +++ src/library/base/R/merge.R        (working copy)
> > > @@ -157,6 +157,15 @@
> > >          }
> > >
> > >          if(has.common.nms) names(y) <- nm.y
> > > +        ## If by.x %in% names(y) then duplicate column names still
> > arise,
> > > +        ## apply suffixes to these
> > > +        dupe.keyx <- intersect(nm.by, names(y))
> > > +        if(length(dupe.keyx)) {
> > > +          if(nzchar(suffixes[1L]))
> > > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
> > paste(dupe.keyx, suffixes[1L], sep="")
> > > +          if(nzchar(suffixes[2L]))
> > > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> > paste(dupe.keyx, suffixes[2L], sep="")
> > > +        }
> > >          nm <- c(names(x), names(y))
> > >          if(any(d <- duplicated(nm)))
> > >              if(sum(d) > 1L)
> >
> > > ______________________________________________
> > > [hidden email] mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >

> Index: src/library/base/R/merge.R
> ===================================================================
> --- src/library/base/R/merge.R        (revision 74280)
> +++ src/library/base/R/merge.R        (working copy)
> @@ -157,6 +157,14 @@
>          }
>
>          if(has.common.nms) names(y) <- nm.y
> +        ## If by.x %in% names(y) then duplicate column names still arise,
> +        ## apply suffixes to just y - this keeps backwards compatibility
> +        ## when referring to by.x in the resulting data.frame
> +        dupe.keyx <- intersect(nm.by, names(y))
> +        if(length(dupe.keyx)) {
> +          if(nzchar(suffixes[2L]))
> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
suffixes[2L], sep="")
> +        }
>          nm <- c(names(x), names(y))
>          if(any(d <- duplicated(nm)))
>              if(sum(d) > 1L)

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Martin Maechler
>>>>> Gabriel Becker <[hidden email]>
>>>>>     on Wed, 21 Feb 2018 07:11:44 -0800 writes:

    > Hi all,
    > For the record this approach isnt 100% backwards compatible, because
    > names(mergeddf) will e incompatibly different. Thatx why i claimed
    > bakcwards compatable-ish

exactly.

    > That said its still worth considering imho because of the reasons stated
    > (and honestly one particular simple reading of the docs might suggest that
    > this was thr intended behavior all along). Im not a member of Rcore through
    > so i cant do said considering myself.

I agree with Scott, Frederik and you that this changes seems
worth considering.
As Duncan Murdoch has mentioned, this alone may not be
sufficient.

In addition to your proposed patch (which I have simplified, not
using intersection() but working with underlying  match()
directly), it is little work to introduce an extra argument, I'm
calling  'no.dups = TRUE'  which when set to false would mirror
current R's behavior... and documenting it, then also documents the
new behavior (to some extent).

My plan is to commit it soonish ;-)
Martin

    > Best,
    > ~G

    > On Feb 20, 2018 7:15 PM, <[hidden email]> wrote:

    > Hi Scott,

    > I tried the new patch and can confirm that it has the advertised
    > behavior on a couple of test cases. I think it makes sense to apply
    > it, because any existing code which refers to a second duplicate
    > data.frame column by name is already broken, while if the reference is
    > by numerical index then changing the column name shouldn't break it.

    > I don't know if you need to update the documentation as part of your
    > patch, or if whoever applies it would be happy to do that. Somebody
    > from R core want to weigh in on this?

    > I attach a file with the test example from your original email as well
    > as a second test case I added with two "by" columns.

    > Thanks,

    > Frederick

    > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
    >> Hi Frederick,
    >>
    >> It looks like I didn't overwrite the patch.diff file after the last edits.
    >> Here's the correct patch (attached and copied below):
    >>
    >> Index: src/library/base/R/merge.R
    >> ===================================================================
    >> --- src/library/base/R/merge.R (revision 74280)
    >> +++ src/library/base/R/merge.R (working copy)
    >> @@ -157,6 +157,14 @@
    >> }
    >>
    >> if(has.common.nms) names(y) <- nm.y
    >> +        ## If by.x %in% names(y) then duplicate column names still arise,
    >> +        ## apply suffixes to just y - this keeps backwards compatibility
    >> +        ## when referring to by.x in the resulting data.frame
    >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> +        if(length(dupe.keyx)) {
    >> +          if(nzchar(suffixes[2L]))
    >> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
    >> suffixes[2L], sep="")
    >> +        }
    >> nm <- c(names(x), names(y))
    >> if(any(d <- duplicated(nm)))
    >> if(sum(d) > 1L)
    >>
    >> Best,
    >>
    >> Scott
    >>
    >> On 21 February 2018 at 08:23, <[hidden email]> wrote:
    >>
    >> > Hi Scott,
    >> >
    >> > I think that's a good idea and I tried your patch on my copy of the
    >> > repository. But it looks to me like the recent patch is identical to
    >> > the previous one, can you confirm this?
    >> >
    >> > Frederick
    >> >
    >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
    >> > > Thanks Gabriel,
    >> > >
    >> > > I think your suggested approach is 100% backwards compatible
    >> > >
    >> > > Currently in the case of duplicate column names only the first can be
    >> > > indexed by its name. This will always be the column appearing in by.x,
    >> > > meaning the column in y with the same name cannot be accessed.
    > Appending
    >> > > ".y" (suffixes[2L]) to this column means it can now be accessed, while
    >> > > keeping the current behaviour of making the key columns always
    > accessible
    >> > > by using the names provided to by.x.
    >> > >
    >> > > I've attached a new patch that has this behaviour.
    >> > >
    >> > > Best,
    >> > >
    >> > > Scott
    >> > >
    >> > > On 19 February 2018 at 05:08, Gabriel Becker <[hidden email]>
    >> > wrote:
    >> > >
    >> > > > It seems like there is a way that is backwards compatible-ish in the
    >> > sense
    >> > > > mentioned and still has the (arguably, but a good argument I think)
    >> > better
    >> > > > behavior:
    >> > > >
    >> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name'
    >> > column
    >> > > > is called name and y's 'name' column (not used int he merge) is
    >> > changed to
    >> > > > name.y.
    >> > > >
    >> > > > Now of course this would still change output, but it would change
    > it to
    >> > > > something I think would be better, while retaining the 'merge
    > columns
    >> > > > retain their exact names' mechanic as documented.
    >> > > >
    >> > > > ~G
    >> > > >
    >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <
    > [hidden email]>
    >> > > > wrote:
    >> > > >
    >> > > >> Thanks Duncan and Frederick,
    >> > > >>
    >> > > >> I suspected as much - there doesn't appear to be any reason why
    >> > conflicts
    >> > > >> between by.x and names(y) shouldn't and cannot be checked, but I
    > can
    >> > see
    >> > > >> how this might be more trouble than its worth given it potentially
    > may
    >> > > >> break downstream packages (i.e. any cases where this occurs but
    > they
    >> > > >> expect
    >> > > >> the name of the key column(s) to remain the same).
    >> > > >>
    >> > > >> Best,
    >> > > >>
    >> > > >> Scott
    >> > > >>
    >> > > >> On 18 February 2018 at 11:48, Duncan Murdoch <
    >> > [hidden email]>
    >> > > >> wrote:
    >> > > >>
    >> > > >> > On 17/02/2018 6:36 PM, [hidden email] wrote:
    >> > > >> >
    >> > > >> >> Hi Scott,
    >> > > >> >>
    >> > > >> >> Thanks for the patch. I'm not really involved in R development;
    > it
    >> > > >> >> will be up to someone in the R core team to apply it. I would
    >> > hazard
    >> > > >> >> to say that even if correct (I haven't checked), it will not be
    >> > > >> >> applied because the change might break existing code. For
    > example
    >> > it
    >> > > >> >> seems like reasonable code might easily assume that a column
    > with
    >> > the
    >> > > >> >> same name as "by.x" exists in the output of 'merge'. That's
    > just my
    >> > > >> >> best guess... I don't participate on here often.
    >> > > >> >>
    >> > > >> >
    >> > > >> >
    >> > > >> > I think you're right.  If I were still a member of R Core, I
    > would
    >> > want
    >> > > >> to
    >> > > >> > test this against all packages on CRAN and Bioconductor, and
    > since
    >> > that
    >> > > >> > test takes a couple of days to run on my laptop, I'd probably
    > never
    >> > get
    >> > > >> > around to it.
    >> > > >> >
    >> > > >> > There are lots of cases where "I would have done that
    > differently",
    >> > but
    >> > > >> > most of them are far too much trouble to change now that R is
    > more
    >> > than
    >> > > >> 20
    >> > > >> > years old.  And in many cases it will turn out that the way R
    > does
    >> > it
    >> > > >> > actually does make more sense than the way I would have done it.
    >> > > >> >
    >> > > >> > Duncan Murdoch
    >> > > >> >
    >> > > >> >
    >> > > >> >
    >> > > >> >> Cheers,
    >> > > >> >>
    >> > > >> >> Frederick
    >> > > >> >>
    >> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
    >> > > >> >>
    >> > > >> >>> The attached patch.diff will make merge.data.frame() append the
    >> > > >> suffixes
    >> > > >> >>> to
    >> > > >> >>> columns with common names between by.x and names(y).
    >> > > >> >>>
    >> > > >> >>> Best,
    >> > > >> >>>
    >> > > >> >>> Scott Ritchie
    >> > > >> >>>
    >> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
    >> > [hidden email]>
    >> > > >> >>> wrote:
    >> > > >> >>>
    >> > > >> >>> Hi Frederick,
    >> > > >> >>>>
    >> > > >> >>>> I would expect that any duplicate names in the resulting
    >> > data.frame
    >> > > >> >>>> would
    >> > > >> >>>> have the suffixes appended to them, regardless of whether or
    > not
    >> > they
    >> > > >> >>>> are
    >> > > >> >>>> used as the join key. So in my example I would expect
    > "names.x"
    >> > and
    >> > > >> >>>> "names.y" to indicate their source data.frame.
    >> > > >> >>>>
    >> > > >> >>>> While careful reading of the documentation reveals this is not
    >> > the
    >> > > >> >>>> case, I
    >> > > >> >>>> would argue the intent of the suffixes functionality should
    >> > equally
    >> > > >> be
    >> > > >> >>>> applied to this type of case.
    >> > > >> >>>>
    >> > > >> >>>> If you agree this would be useful, I'm happy to write a patch
    > for
    >> > > >> >>>> merge.data.frame that will add suffixes in this case - I
    > intend
    >> > to do
    >> > > >> >>>> the
    >> > > >> >>>> same for merge.data.table in the data.table package where I
    >> > initially
    >> > > >> >>>> encountered the edge case.
    >> > > >> >>>>
    >> > > >> >>>> Best,
    >> > > >> >>>>
    >> > > >> >>>> Scott
    >> > > >> >>>>
    >> > > >> >>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
    >> > > >> >>>>
    >> > > >> >>>> Hi Scott,
    >> > > >> >>>>>
    >> > > >> >>>>> It seems like reasonable behavior to me. What result would
    > you
    >> > > >> expect?
    >> > > >> >>>>> That the second "name" should be called "name.y"?
    >> > > >> >>>>>
    >> > > >> >>>>> The "merge" documentation says:
    >> > > >> >>>>>
    >> > > >> >>>>>      If the columns in the data frames not used in merging
    > have
    >> > any
    >> > > >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and ‘".y"’
    > by
    >> > > >> default)
    >> > > >> >>>>>      appended to try to make the names of the result unique.
    >> > > >> >>>>>
    >> > > >> >>>>> Since the first "name" column was used in merging, leaving
    > both
    >> > > >> >>>>> without a suffix seems consistent with the documentation...
    >> > > >> >>>>>
    >> > > >> >>>>> Frederick
    >> > > >> >>>>>
    >> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie
    > wrote:
    >> > > >> >>>>>
    >> > > >> >>>>>> Hi,
    >> > > >> >>>>>>
    >> > > >> >>>>>> I was unable to find a bug report for this with a cursory
    >> > search,
    >> > > >> but
    >> > > >> >>>>>>
    >> > > >> >>>>> would
    >> > > >> >>>>>
    >> > > >> >>>>>> like clarification if this is intended or unavoidable
    >> > behaviour:
    >> > > >> >>>>>>
    >> > > >> >>>>>> ```{r}
    >> > > >> >>>>>> # Create example data.frames
    >> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin", "Lex"),
    >> > > >> >>>>>>                        sex=c("F", "M", "F", "M"),
    >> > > >> >>>>>>                        age=c(41, 43, 36, 51))
    >> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max", "Qin"),
    >> > > >> >>>>>>                         name=c("Oliver", "Sebastian",
    >> > "Kai-lee"),
    >> > > >> >>>>>>                         sex=c("M", "M", "F"),
    >> > > >> >>>>>>                         age=c(5,8,7))
    >> > > >> >>>>>>
    >> > > >> >>>>>> # Merge() creates a duplicated "name" column:
    >> > > >> >>>>>> merge(parents, children, by.x = "name", by.y = "parent")
    >> > > >> >>>>>> ```
    >> > > >> >>>>>>
    >> > > >> >>>>>> Output:
    >> > > >> >>>>>> ```
    >> > > >> >>>>>>     name sex.x age.x      name sex.y age.y
    >> > > >> >>>>>> 1   Max     M    43 Sebastian     M     8
    >> > > >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
    >> > > >> >>>>>> 3 Sarah     F    41    Oliver     M     5
    >> > > >> >>>>>> Warning message:
    >> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name", by.y =
    >> > > >> >>>>>> "parent") :
    >> > > >> >>>>>>    column name ‘name’ is duplicated in the result
    >> > > >> >>>>>> ```
    >> > > >> >>>>>>
    >> > > >> >>>>>> Kind Regards,
    >> > > >> >>>>>>
    >> > > >> >>>>>> Scott Ritchie
    >> > > >> >>>>>>
    >> > > >> >>>>>>        [[alternative HTML version deleted]]
    >> > > >> >>>>>>
    >> > > >> >>>>>> ______________________________________________
    >> > > >> >>>>>> [hidden email] mailing list
    >> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > > >> >>>>>>
    >> > > >> >>>>>>
    >> > > >> >>>>>
    >> > > >> >>>>
    >> > > >> >>>>
    >> > > >> >> Index: src/library/base/R/merge.R
    >> > > >> >>> ============================================================
    >> > =======
    >> > > >> >>> --- src/library/base/R/merge.R  (revision 74264)
    >> > > >> >>> +++ src/library/base/R/merge.R  (working copy)
    >> > > >> >>> @@ -157,6 +157,15 @@
    >> > > >> >>>           }
    >> > > >> >>>             if(has.common.nms) names(y) <- nm.y
    >> > > >> >>> +        ## If by.x %in% names(y) then duplicate column names
    >> > still
    >> > > >> >>> arise,
    >> > > >> >>> +        ## apply suffixes to these
    >> > > >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
    >> > > >> >>> +        if(length(dupe.keyx)) {
    >> > > >> >>> +          if(nzchar(suffixes[1L]))
    >> > > >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
    >> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
    >> > > >> >>> +          if(nzchar(suffixes[2L]))
    >> > > >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
    >> > > >> >>> +        }
    >> > > >> >>>           nm <- c(names(x), names(y))
    >> > > >> >>>           if(any(d <- duplicated(nm)))
    >> > > >> >>>               if(sum(d) > 1L)
    >> > > >> >>>
    >> > > >> >>
    >> > > >> >> ______________________________________________
    >> > > >> >> [hidden email] mailing list
    >> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > > >> >>
    >> > > >> >>
    >> > > >> >
    >> > > >>
    >> > > >>         [[alternative HTML version deleted]]
    >> > > >>
    >> > > >> ______________________________________________
    >> > > >> [hidden email] mailing list
    >> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> > > >>
    >> > > >
    >> > > >
    >> > > >
    >> > > > --
    >> > > > Gabriel Becker, PhD
    >> > > > Scientist (Bioinformatics)
    >> > > > Genentech Research
    >> > > >
    >> >
    >> > > Index: src/library/base/R/merge.R
    >> > > ===================================================================
    >> > > --- src/library/base/R/merge.R        (revision 74264)
    >> > > +++ src/library/base/R/merge.R        (working copy)
    >> > > @@ -157,6 +157,15 @@
    >> > >          }
    >> > >
    >> > >          if(has.common.nms) names(y) <- nm.y
    >> > > +        ## If by.x %in% names(y) then duplicate column names still
    >> > arise,
    >> > > +        ## apply suffixes to these
    >> > > +        dupe.keyx <- intersect(nm.by, names(y))
    >> > > +        if(length(dupe.keyx)) {
    >> > > +          if(nzchar(suffixes[1L]))
    >> > > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
    >> > paste(dupe.keyx, suffixes[1L], sep="")
    >> > > +          if(nzchar(suffixes[2L]))
    >> > > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> > paste(dupe.keyx, suffixes[2L], sep="")
    >> > > +        }
    >> > >          nm <- c(names(x), names(y))
    >> > >          if(any(d <- duplicated(nm)))
    >> > >              if(sum(d) > 1L)
    >> >
    >> > > ______________________________________________
    >> > > [hidden email] mailing list
    >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >
    >> >

    >> Index: src/library/base/R/merge.R
    >> ===================================================================
    >> --- src/library/base/R/merge.R        (revision 74280)
    >> +++ src/library/base/R/merge.R        (working copy)
    >> @@ -157,6 +157,14 @@
    >> }
    >>
    >> if(has.common.nms) names(y) <- nm.y
    >> +        ## If by.x %in% names(y) then duplicate column names still arise,
    >> +        ## apply suffixes to just y - this keeps backwards compatibility
    >> +        ## when referring to by.x in the resulting data.frame
    >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> +        if(length(dupe.keyx)) {
    >> +          if(nzchar(suffixes[2L]))
    >> +            names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx,
    > suffixes[2L], sep="")
    >> +        }
    >> nm <- c(names(x), names(y))
    >> if(any(d <- duplicated(nm)))
    >> if(sum(d) > 1L)

    > [[alternative HTML version deleted]]

    > ______________________________________________
    > [hidden email] mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Scott Ritchie-2
Thanks Martin!

Can you clarify the functionality of the 'no.dups' argument so I can change
my patch to `data.table:::merge.data.table` accordingly?

- When `no.dups=TRUE` will the suffix to the by.x column name? Or will it
take the functionality of the second functionality where only the column in
y has the suffix added?
- When `no.dups=FALSE` will the output be the same as it currently (no
suffix added to either column)? Or will add the suffix to the column in y?

Best,

Scott

On 22 February 2018 at 22:31, Martin Maechler <[hidden email]>
wrote:

> >>>>> Gabriel Becker <[hidden email]>
> >>>>>     on Wed, 21 Feb 2018 07:11:44 -0800 writes:
>
>     > Hi all,
>     > For the record this approach isnt 100% backwards compatible, because
>     > names(mergeddf) will e incompatibly different. Thatx why i claimed
>     > bakcwards compatable-ish
>
> exactly.
>
>     > That said its still worth considering imho because of the reasons
> stated
>     > (and honestly one particular simple reading of the docs might
> suggest that
>     > this was thr intended behavior all along). Im not a member of Rcore
> through
>     > so i cant do said considering myself.
>
> I agree with Scott, Frederik and you that this changes seems
> worth considering.
> As Duncan Murdoch has mentioned, this alone may not be
> sufficient.
>
> In addition to your proposed patch (which I have simplified, not
> using intersection() but working with underlying  match()
> directly), it is little work to introduce an extra argument, I'm
> calling  'no.dups = TRUE'  which when set to false would mirror
> current R's behavior... and documenting it, then also documents the
> new behavior (to some extent).
>
> My plan is to commit it soonish ;-)
> Martin
>
>     > Best,
>     > ~G
>
>     > On Feb 20, 2018 7:15 PM, <[hidden email]> wrote:
>
>     > Hi Scott,
>
>     > I tried the new patch and can confirm that it has the advertised
>     > behavior on a couple of test cases. I think it makes sense to apply
>     > it, because any existing code which refers to a second duplicate
>     > data.frame column by name is already broken, while if the reference
> is
>     > by numerical index then changing the column name shouldn't break it.
>
>     > I don't know if you need to update the documentation as part of your
>     > patch, or if whoever applies it would be happy to do that. Somebody
>     > from R core want to weigh in on this?
>
>     > I attach a file with the test example from your original email as
> well
>     > as a second test case I added with two "by" columns.
>
>     > Thanks,
>
>     > Frederick
>
>     > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
>     >> Hi Frederick,
>     >>
>     >> It looks like I didn't overwrite the patch.diff file after the last
> edits.
>     >> Here's the correct patch (attached and copied below):
>     >>
>     >> Index: src/library/base/R/merge.R
>     >> ===================================================================
>     >> --- src/library/base/R/merge.R (revision 74280)
>     >> +++ src/library/base/R/merge.R (working copy)
>     >> @@ -157,6 +157,14 @@
>     >> }
>     >>
>     >> if(has.common.nms) names(y) <- nm.y
>     >> +        ## If by.x %in% names(y) then duplicate column names still
> arise,
>     >> +        ## apply suffixes to just y - this keeps backwards
> compatibility
>     >> +        ## when referring to by.x in the resulting data.frame
>     >> +        dupe.keyx <- intersect(nm.by, names(y))
>     >> +        if(length(dupe.keyx)) {
>     >> +          if(nzchar(suffixes[2L]))
>     >> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> paste(dupe.keyx,
>     >> suffixes[2L], sep="")
>     >> +        }
>     >> nm <- c(names(x), names(y))
>     >> if(any(d <- duplicated(nm)))
>     >> if(sum(d) > 1L)
>     >>
>     >> Best,
>     >>
>     >> Scott
>     >>
>     >> On 21 February 2018 at 08:23, <[hidden email]> wrote:
>     >>
>     >> > Hi Scott,
>     >> >
>     >> > I think that's a good idea and I tried your patch on my copy of
> the
>     >> > repository. But it looks to me like the recent patch is identical
> to
>     >> > the previous one, can you confirm this?
>     >> >
>     >> > Frederick
>     >> >
>     >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
>     >> > > Thanks Gabriel,
>     >> > >
>     >> > > I think your suggested approach is 100% backwards compatible
>     >> > >
>     >> > > Currently in the case of duplicate column names only the first
> can be
>     >> > > indexed by its name. This will always be the column appearing
> in by.x,
>     >> > > meaning the column in y with the same name cannot be accessed.
>     > Appending
>     >> > > ".y" (suffixes[2L]) to this column means it can now be
> accessed, while
>     >> > > keeping the current behaviour of making the key columns always
>     > accessible
>     >> > > by using the names provided to by.x.
>     >> > >
>     >> > > I've attached a new patch that has this behaviour.
>     >> > >
>     >> > > Best,
>     >> > >
>     >> > > Scott
>     >> > >
>     >> > > On 19 February 2018 at 05:08, Gabriel Becker <
> [hidden email]>
>     >> > wrote:
>     >> > >
>     >> > > > It seems like there is a way that is backwards compatible-ish
> in the
>     >> > sense
>     >> > > > mentioned and still has the (arguably, but a good argument I
> think)
>     >> > better
>     >> > > > behavior:
>     >> > > >
>     >> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's
> 'name'
>     >> > column
>     >> > > > is called name and y's 'name' column (not used int he merge)
> is
>     >> > changed to
>     >> > > > name.y.
>     >> > > >
>     >> > > > Now of course this would still change output, but it would
> change
>     > it to
>     >> > > > something I think would be better, while retaining the 'merge
>     > columns
>     >> > > > retain their exact names' mechanic as documented.
>     >> > > >
>     >> > > > ~G
>     >> > > >
>     >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <
>     > [hidden email]>
>     >> > > > wrote:
>     >> > > >
>     >> > > >> Thanks Duncan and Frederick,
>     >> > > >>
>     >> > > >> I suspected as much - there doesn't appear to be any reason
> why
>     >> > conflicts
>     >> > > >> between by.x and names(y) shouldn't and cannot be checked,
> but I
>     > can
>     >> > see
>     >> > > >> how this might be more trouble than its worth given it
> potentially
>     > may
>     >> > > >> break downstream packages (i.e. any cases where this occurs
> but
>     > they
>     >> > > >> expect
>     >> > > >> the name of the key column(s) to remain the same).
>     >> > > >>
>     >> > > >> Best,
>     >> > > >>
>     >> > > >> Scott
>     >> > > >>
>     >> > > >> On 18 February 2018 at 11:48, Duncan Murdoch <
>     >> > [hidden email]>
>     >> > > >> wrote:
>     >> > > >>
>     >> > > >> > On 17/02/2018 6:36 PM, [hidden email] wrote:
>     >> > > >> >
>     >> > > >> >> Hi Scott,
>     >> > > >> >>
>     >> > > >> >> Thanks for the patch. I'm not really involved in R
> development;
>     > it
>     >> > > >> >> will be up to someone in the R core team to apply it. I
> would
>     >> > hazard
>     >> > > >> >> to say that even if correct (I haven't checked), it will
> not be
>     >> > > >> >> applied because the change might break existing code. For
>     > example
>     >> > it
>     >> > > >> >> seems like reasonable code might easily assume that a
> column
>     > with
>     >> > the
>     >> > > >> >> same name as "by.x" exists in the output of 'merge'.
> That's
>     > just my
>     >> > > >> >> best guess... I don't participate on here often.
>     >> > > >> >>
>     >> > > >> >
>     >> > > >> >
>     >> > > >> > I think you're right.  If I were still a member of R Core,
> I
>     > would
>     >> > want
>     >> > > >> to
>     >> > > >> > test this against all packages on CRAN and Bioconductor,
> and
>     > since
>     >> > that
>     >> > > >> > test takes a couple of days to run on my laptop, I'd
> probably
>     > never
>     >> > get
>     >> > > >> > around to it.
>     >> > > >> >
>     >> > > >> > There are lots of cases where "I would have done that
>     > differently",
>     >> > but
>     >> > > >> > most of them are far too much trouble to change now that R
> is
>     > more
>     >> > than
>     >> > > >> 20
>     >> > > >> > years old.  And in many cases it will turn out that the
> way R
>     > does
>     >> > it
>     >> > > >> > actually does make more sense than the way I would have
> done it.
>     >> > > >> >
>     >> > > >> > Duncan Murdoch
>     >> > > >> >
>     >> > > >> >
>     >> > > >> >
>     >> > > >> >> Cheers,
>     >> > > >> >>
>     >> > > >> >> Frederick
>     >> > > >> >>
>     >> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie
> wrote:
>     >> > > >> >>
>     >> > > >> >>> The attached patch.diff will make merge.data.frame()
> append the
>     >> > > >> suffixes
>     >> > > >> >>> to
>     >> > > >> >>> columns with common names between by.x and names(y).
>     >> > > >> >>>
>     >> > > >> >>> Best,
>     >> > > >> >>>
>     >> > > >> >>> Scott Ritchie
>     >> > > >> >>>
>     >> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
>     >> > [hidden email]>
>     >> > > >> >>> wrote:
>     >> > > >> >>>
>     >> > > >> >>> Hi Frederick,
>     >> > > >> >>>>
>     >> > > >> >>>> I would expect that any duplicate names in the resulting
>     >> > data.frame
>     >> > > >> >>>> would
>     >> > > >> >>>> have the suffixes appended to them, regardless of
> whether or
>     > not
>     >> > they
>     >> > > >> >>>> are
>     >> > > >> >>>> used as the join key. So in my example I would expect
>     > "names.x"
>     >> > and
>     >> > > >> >>>> "names.y" to indicate their source data.frame.
>     >> > > >> >>>>
>     >> > > >> >>>> While careful reading of the documentation reveals this
> is not
>     >> > the
>     >> > > >> >>>> case, I
>     >> > > >> >>>> would argue the intent of the suffixes functionality
> should
>     >> > equally
>     >> > > >> be
>     >> > > >> >>>> applied to this type of case.
>     >> > > >> >>>>
>     >> > > >> >>>> If you agree this would be useful, I'm happy to write a
> patch
>     > for
>     >> > > >> >>>> merge.data.frame that will add suffixes in this case - I
>     > intend
>     >> > to do
>     >> > > >> >>>> the
>     >> > > >> >>>> same for merge.data.table in the data.table package
> where I
>     >> > initially
>     >> > > >> >>>> encountered the edge case.
>     >> > > >> >>>>
>     >> > > >> >>>> Best,
>     >> > > >> >>>>
>     >> > > >> >>>> Scott
>     >> > > >> >>>>
>     >> > > >> >>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
>     >> > > >> >>>>
>     >> > > >> >>>> Hi Scott,
>     >> > > >> >>>>>
>     >> > > >> >>>>> It seems like reasonable behavior to me. What result
> would
>     > you
>     >> > > >> expect?
>     >> > > >> >>>>> That the second "name" should be called "name.y"?
>     >> > > >> >>>>>
>     >> > > >> >>>>> The "merge" documentation says:
>     >> > > >> >>>>>
>     >> > > >> >>>>>      If the columns in the data frames not used in
> merging
>     > have
>     >> > any
>     >> > > >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and
> ‘".y"’
>     > by
>     >> > > >> default)
>     >> > > >> >>>>>      appended to try to make the names of the result
> unique.
>     >> > > >> >>>>>
>     >> > > >> >>>>> Since the first "name" column was used in merging,
> leaving
>     > both
>     >> > > >> >>>>> without a suffix seems consistent with the
> documentation...
>     >> > > >> >>>>>
>     >> > > >> >>>>> Frederick
>     >> > > >> >>>>>
>     >> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie
>     > wrote:
>     >> > > >> >>>>>
>     >> > > >> >>>>>> Hi,
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> I was unable to find a bug report for this with a
> cursory
>     >> > search,
>     >> > > >> but
>     >> > > >> >>>>>>
>     >> > > >> >>>>> would
>     >> > > >> >>>>>
>     >> > > >> >>>>>> like clarification if this is intended or unavoidable
>     >> > behaviour:
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> ```{r}
>     >> > > >> >>>>>> # Create example data.frames
>     >> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin",
> "Lex"),
>     >> > > >> >>>>>>                        sex=c("F", "M", "F", "M"),
>     >> > > >> >>>>>>                        age=c(41, 43, 36, 51))
>     >> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max",
> "Qin"),
>     >> > > >> >>>>>>                         name=c("Oliver", "Sebastian",
>     >> > "Kai-lee"),
>     >> > > >> >>>>>>                         sex=c("M", "M", "F"),
>     >> > > >> >>>>>>                         age=c(5,8,7))
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> # Merge() creates a duplicated "name" column:
>     >> > > >> >>>>>> merge(parents, children, by.x = "name", by.y =
> "parent")
>     >> > > >> >>>>>> ```
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> Output:
>     >> > > >> >>>>>> ```
>     >> > > >> >>>>>>     name sex.x age.x      name sex.y age.y
>     >> > > >> >>>>>> 1   Max     M    43 Sebastian     M     8
>     >> > > >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
>     >> > > >> >>>>>> 3 Sarah     F    41    Oliver     M     5
>     >> > > >> >>>>>> Warning message:
>     >> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name",
> by.y =
>     >> > > >> >>>>>> "parent") :
>     >> > > >> >>>>>>    column name ‘name’ is duplicated in the result
>     >> > > >> >>>>>> ```
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> Kind Regards,
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> Scott Ritchie
>     >> > > >> >>>>>>
>     >> > > >> >>>>>>        [[alternative HTML version deleted]]
>     >> > > >> >>>>>>
>     >> > > >> >>>>>> ______________________________________________
>     >> > > >> >>>>>> [hidden email] mailing list
>     >> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>     >> > > >> >>>>>>
>     >> > > >> >>>>>>
>     >> > > >> >>>>>
>     >> > > >> >>>>
>     >> > > >> >>>>
>     >> > > >> >> Index: src/library/base/R/merge.R
>     >> > > >> >>> ==============================
> ==============================
>     >> > =======
>     >> > > >> >>> --- src/library/base/R/merge.R  (revision 74264)
>     >> > > >> >>> +++ src/library/base/R/merge.R  (working copy)
>     >> > > >> >>> @@ -157,6 +157,15 @@
>     >> > > >> >>>           }
>     >> > > >> >>>             if(has.common.nms) names(y) <- nm.y
>     >> > > >> >>> +        ## If by.x %in% names(y) then duplicate column
> names
>     >> > still
>     >> > > >> >>> arise,
>     >> > > >> >>> +        ## apply suffixes to these
>     >> > > >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
>     >> > > >> >>> +        if(length(dupe.keyx)) {
>     >> > > >> >>> +          if(nzchar(suffixes[1L]))
>     >> > > >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
>     >> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
>     >> > > >> >>> +          if(nzchar(suffixes[2L]))
>     >> > > >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
>     >> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
>     >> > > >> >>> +        }
>     >> > > >> >>>           nm <- c(names(x), names(y))
>     >> > > >> >>>           if(any(d <- duplicated(nm)))
>     >> > > >> >>>               if(sum(d) > 1L)
>     >> > > >> >>>
>     >> > > >> >>
>     >> > > >> >> ______________________________________________
>     >> > > >> >> [hidden email] mailing list
>     >> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>     >> > > >> >>
>     >> > > >> >>
>     >> > > >> >
>     >> > > >>
>     >> > > >>         [[alternative HTML version deleted]]
>     >> > > >>
>     >> > > >> ______________________________________________
>     >> > > >> [hidden email] mailing list
>     >> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
>     >> > > >>
>     >> > > >
>     >> > > >
>     >> > > >
>     >> > > > --
>     >> > > > Gabriel Becker, PhD
>     >> > > > Scientist (Bioinformatics)
>     >> > > > Genentech Research
>     >> > > >
>     >> >
>     >> > > Index: src/library/base/R/merge.R
>     >> > > ============================================================
> =======
>     >> > > --- src/library/base/R/merge.R        (revision 74264)
>     >> > > +++ src/library/base/R/merge.R        (working copy)
>     >> > > @@ -157,6 +157,15 @@
>     >> > >          }
>     >> > >
>     >> > >          if(has.common.nms) names(y) <- nm.y
>     >> > > +        ## If by.x %in% names(y) then duplicate column names
> still
>     >> > arise,
>     >> > > +        ## apply suffixes to these
>     >> > > +        dupe.keyx <- intersect(nm.by, names(y))
>     >> > > +        if(length(dupe.keyx)) {
>     >> > > +          if(nzchar(suffixes[1L]))
>     >> > > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
>     >> > paste(dupe.keyx, suffixes[1L], sep="")
>     >> > > +          if(nzchar(suffixes[2L]))
>     >> > > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
>     >> > paste(dupe.keyx, suffixes[2L], sep="")
>     >> > > +        }
>     >> > >          nm <- c(names(x), names(y))
>     >> > >          if(any(d <- duplicated(nm)))
>     >> > >              if(sum(d) > 1L)
>     >> >
>     >> > > ______________________________________________
>     >> > > [hidden email] mailing list
>     >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
>     >> >
>     >> >
>
>     >> Index: src/library/base/R/merge.R
>     >> ===================================================================
>     >> --- src/library/base/R/merge.R        (revision 74280)
>     >> +++ src/library/base/R/merge.R        (working copy)
>     >> @@ -157,6 +157,14 @@
>     >> }
>     >>
>     >> if(has.common.nms) names(y) <- nm.y
>     >> +        ## If by.x %in% names(y) then duplicate column names still
> arise,
>     >> +        ## apply suffixes to just y - this keeps backwards
> compatibility
>     >> +        ## when referring to by.x in the resulting data.frame
>     >> +        dupe.keyx <- intersect(nm.by, names(y))
>     >> +        if(length(dupe.keyx)) {
>     >> +          if(nzchar(suffixes[2L]))
>     >> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
> paste(dupe.keyx,
>     > suffixes[2L], sep="")
>     >> +        }
>     >> nm <- c(names(x), names(y))
>     >> if(any(d <- duplicated(nm)))
>     >> if(sum(d) > 1L)
>
>     > [[alternative HTML version deleted]]
>
>     > ______________________________________________
>     > [hidden email] mailing list
>     > https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Duplicate column names created by base::merge() when by.x has the same name as a column in y

Martin Maechler
>>>>> Scott Ritchie <[hidden email]>
>>>>>     on Fri, 23 Feb 2018 12:32:41 +1100 writes:

    > Thanks Martin!
    > Can you clarify the functionality of the 'no.dups' argument so I can change
    > my patch to `data.table:::merge.data.table` accordingly?

    > - When `no.dups=TRUE` will the suffix to the by.x column name? Or will it
    > take the functionality of the second functionality where only the column in
    > y has the suffix added?
    > - When `no.dups=FALSE` will the output be the same as it currently (no
    > suffix added to either column)? Or will add the suffix to the column in y?

I had started from your patch... and worked from there.
So, there's no need (and use) to provide another one.

I also needed to update the man page, add a regression test, add
an entry to NEWS.Rd ...

Just wait until I commit..
Martin




    > Best,

    > Scott

    > On 22 February 2018 at 22:31, Martin Maechler <[hidden email]>
    > wrote:

    >> >>>>> Gabriel Becker <[hidden email]>
    >> >>>>>     on Wed, 21 Feb 2018 07:11:44 -0800 writes:
    >>
    >> > Hi all,
    >> > For the record this approach isnt 100% backwards compatible, because
    >> > names(mergeddf) will e incompatibly different. Thatx why i claimed
    >> > bakcwards compatable-ish
    >>
    >> exactly.
    >>
    >> > That said its still worth considering imho because of the reasons
    >> stated
    >> > (and honestly one particular simple reading of the docs might
    >> suggest that
    >> > this was thr intended behavior all along). Im not a member of Rcore
    >> through
    >> > so i cant do said considering myself.
    >>
    >> I agree with Scott, Frederik and you that this changes seems
    >> worth considering.
    >> As Duncan Murdoch has mentioned, this alone may not be
    >> sufficient.
    >>
    >> In addition to your proposed patch (which I have simplified, not
    >> using intersection() but working with underlying  match()
    >> directly), it is little work to introduce an extra argument, I'm
    >> calling  'no.dups = TRUE'  which when set to false would mirror
    >> current R's behavior... and documenting it, then also documents the
    >> new behavior (to some extent).
    >>
    >> My plan is to commit it soonish ;-)
    >> Martin
    >>
    >> > Best,
    >> > ~G
    >>
    >> > On Feb 20, 2018 7:15 PM, <[hidden email]> wrote:
    >>
    >> > Hi Scott,
    >>
    >> > I tried the new patch and can confirm that it has the advertised
    >> > behavior on a couple of test cases. I think it makes sense to apply
    >> > it, because any existing code which refers to a second duplicate
    >> > data.frame column by name is already broken, while if the reference
    >> is
    >> > by numerical index then changing the column name shouldn't break it.
    >>
    >> > I don't know if you need to update the documentation as part of your
    >> > patch, or if whoever applies it would be happy to do that. Somebody
    >> > from R core want to weigh in on this?
    >>
    >> > I attach a file with the test example from your original email as
    >> well
    >> > as a second test case I added with two "by" columns.
    >>
    >> > Thanks,
    >>
    >> > Frederick
    >>
    >> > On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote:
    >> >> Hi Frederick,
    >> >>
    >> >> It looks like I didn't overwrite the patch.diff file after the last
    >> edits.
    >> >> Here's the correct patch (attached and copied below):
    >> >>
    >> >> Index: src/library/base/R/merge.R
    >> >> ===================================================================
    >> >> --- src/library/base/R/merge.R (revision 74280)
    >> >> +++ src/library/base/R/merge.R (working copy)
    >> >> @@ -157,6 +157,14 @@
    >> >> }
    >> >>
    >> >> if(has.common.nms) names(y) <- nm.y
    >> >> +        ## If by.x %in% names(y) then duplicate column names still
    >> arise,
    >> >> +        ## apply suffixes to just y - this keeps backwards
    >> compatibility
    >> >> +        ## when referring to by.x in the resulting data.frame
    >> >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> >> +        if(length(dupe.keyx)) {
    >> >> +          if(nzchar(suffixes[2L]))
    >> >> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> paste(dupe.keyx,
    >> >> suffixes[2L], sep="")
    >> >> +        }
    >> >> nm <- c(names(x), names(y))
    >> >> if(any(d <- duplicated(nm)))
    >> >> if(sum(d) > 1L)
    >> >>
    >> >> Best,
    >> >>
    >> >> Scott
    >> >>
    >> >> On 21 February 2018 at 08:23, <[hidden email]> wrote:
    >> >>
    >> >> > Hi Scott,
    >> >> >
    >> >> > I think that's a good idea and I tried your patch on my copy of
    >> the
    >> >> > repository. But it looks to me like the recent patch is identical
    >> to
    >> >> > the previous one, can you confirm this?
    >> >> >
    >> >> > Frederick
    >> >> >
    >> >> > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
    >> >> > > Thanks Gabriel,
    >> >> > >
    >> >> > > I think your suggested approach is 100% backwards compatible
    >> >> > >
    >> >> > > Currently in the case of duplicate column names only the first
    >> can be
    >> >> > > indexed by its name. This will always be the column appearing
    >> in by.x,
    >> >> > > meaning the column in y with the same name cannot be accessed.
    >> > Appending
    >> >> > > ".y" (suffixes[2L]) to this column means it can now be
    >> accessed, while
    >> >> > > keeping the current behaviour of making the key columns always
    >> > accessible
    >> >> > > by using the names provided to by.x.
    >> >> > >
    >> >> > > I've attached a new patch that has this behaviour.
    >> >> > >
    >> >> > > Best,
    >> >> > >
    >> >> > > Scott
    >> >> > >
    >> >> > > On 19 February 2018 at 05:08, Gabriel Becker <
    >> [hidden email]>
    >> >> > wrote:
    >> >> > >
    >> >> > > > It seems like there is a way that is backwards compatible-ish
    >> in the
    >> >> > sense
    >> >> > > > mentioned and still has the (arguably, but a good argument I
    >> think)
    >> >> > better
    >> >> > > > behavior:
    >> >> > > >
    >> >> > > > if by.x is 'name', (AND by.y is not also 'name'), then x's
    >> 'name'
    >> >> > column
    >> >> > > > is called name and y's 'name' column (not used int he merge)
    >> is
    >> >> > changed to
    >> >> > > > name.y.
    >> >> > > >
    >> >> > > > Now of course this would still change output, but it would
    >> change
    >> > it to
    >> >> > > > something I think would be better, while retaining the 'merge
    >> > columns
    >> >> > > > retain their exact names' mechanic as documented.
    >> >> > > >
    >> >> > > > ~G
    >> >> > > >
    >> >> > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie <
    >> > [hidden email]>
    >> >> > > > wrote:
    >> >> > > >
    >> >> > > >> Thanks Duncan and Frederick,
    >> >> > > >>
    >> >> > > >> I suspected as much - there doesn't appear to be any reason
    >> why
    >> >> > conflicts
    >> >> > > >> between by.x and names(y) shouldn't and cannot be checked,
    >> but I
    >> > can
    >> >> > see
    >> >> > > >> how this might be more trouble than its worth given it
    >> potentially
    >> > may
    >> >> > > >> break downstream packages (i.e. any cases where this occurs
    >> but
    >> > they
    >> >> > > >> expect
    >> >> > > >> the name of the key column(s) to remain the same).
    >> >> > > >>
    >> >> > > >> Best,
    >> >> > > >>
    >> >> > > >> Scott
    >> >> > > >>
    >> >> > > >> On 18 February 2018 at 11:48, Duncan Murdoch <
    >> >> > [hidden email]>
    >> >> > > >> wrote:
    >> >> > > >>
    >> >> > > >> > On 17/02/2018 6:36 PM, [hidden email] wrote:
    >> >> > > >> >
    >> >> > > >> >> Hi Scott,
    >> >> > > >> >>
    >> >> > > >> >> Thanks for the patch. I'm not really involved in R
    >> development;
    >> > it
    >> >> > > >> >> will be up to someone in the R core team to apply it. I
    >> would
    >> >> > hazard
    >> >> > > >> >> to say that even if correct (I haven't checked), it will
    >> not be
    >> >> > > >> >> applied because the change might break existing code. For
    >> > example
    >> >> > it
    >> >> > > >> >> seems like reasonable code might easily assume that a
    >> column
    >> > with
    >> >> > the
    >> >> > > >> >> same name as "by.x" exists in the output of 'merge'.
    >> That's
    >> > just my
    >> >> > > >> >> best guess... I don't participate on here often.
    >> >> > > >> >>
    >> >> > > >> >
    >> >> > > >> >
    >> >> > > >> > I think you're right.  If I were still a member of R Core,
    >> I
    >> > would
    >> >> > want
    >> >> > > >> to
    >> >> > > >> > test this against all packages on CRAN and Bioconductor,
    >> and
    >> > since
    >> >> > that
    >> >> > > >> > test takes a couple of days to run on my laptop, I'd
    >> probably
    >> > never
    >> >> > get
    >> >> > > >> > around to it.
    >> >> > > >> >
    >> >> > > >> > There are lots of cases where "I would have done that
    >> > differently",
    >> >> > but
    >> >> > > >> > most of them are far too much trouble to change now that R
    >> is
    >> > more
    >> >> > than
    >> >> > > >> 20
    >> >> > > >> > years old.  And in many cases it will turn out that the
    >> way R
    >> > does
    >> >> > it
    >> >> > > >> > actually does make more sense than the way I would have
    >> done it.
    >> >> > > >> >
    >> >> > > >> > Duncan Murdoch
    >> >> > > >> >
    >> >> > > >> >
    >> >> > > >> >
    >> >> > > >> >> Cheers,
    >> >> > > >> >>
    >> >> > > >> >> Frederick
    >> >> > > >> >>
    >> >> > > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie
    >> wrote:
    >> >> > > >> >>
    >> >> > > >> >>> The attached patch.diff will make merge.data.frame()
    >> append the
    >> >> > > >> suffixes
    >> >> > > >> >>> to
    >> >> > > >> >>> columns with common names between by.x and names(y).
    >> >> > > >> >>>
    >> >> > > >> >>> Best,
    >> >> > > >> >>>
    >> >> > > >> >>> Scott Ritchie
    >> >> > > >> >>>
    >> >> > > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
    >> >> > [hidden email]>
    >> >> > > >> >>> wrote:
    >> >> > > >> >>>
    >> >> > > >> >>> Hi Frederick,
    >> >> > > >> >>>>
    >> >> > > >> >>>> I would expect that any duplicate names in the resulting
    >> >> > data.frame
    >> >> > > >> >>>> would
    >> >> > > >> >>>> have the suffixes appended to them, regardless of
    >> whether or
    >> > not
    >> >> > they
    >> >> > > >> >>>> are
    >> >> > > >> >>>> used as the join key. So in my example I would expect
    >> > "names.x"
    >> >> > and
    >> >> > > >> >>>> "names.y" to indicate their source data.frame.
    >> >> > > >> >>>>
    >> >> > > >> >>>> While careful reading of the documentation reveals this
    >> is not
    >> >> > the
    >> >> > > >> >>>> case, I
    >> >> > > >> >>>> would argue the intent of the suffixes functionality
    >> should
    >> >> > equally
    >> >> > > >> be
    >> >> > > >> >>>> applied to this type of case.
    >> >> > > >> >>>>
    >> >> > > >> >>>> If you agree this would be useful, I'm happy to write a
    >> patch
    >> > for
    >> >> > > >> >>>> merge.data.frame that will add suffixes in this case - I
    >> > intend
    >> >> > to do
    >> >> > > >> >>>> the
    >> >> > > >> >>>> same for merge.data.table in the data.table package
    >> where I
    >> >> > initially
    >> >> > > >> >>>> encountered the edge case.
    >> >> > > >> >>>>
    >> >> > > >> >>>> Best,
    >> >> > > >> >>>>
    >> >> > > >> >>>> Scott
    >> >> > > >> >>>>
    >> >> > > >> >>>> On 17 February 2018 at 03:53, <[hidden email]> wrote:
    >> >> > > >> >>>>
    >> >> > > >> >>>> Hi Scott,
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> It seems like reasonable behavior to me. What result
    >> would
    >> > you
    >> >> > > >> expect?
    >> >> > > >> >>>>> That the second "name" should be called "name.y"?
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> The "merge" documentation says:
    >> >> > > >> >>>>>
    >> >> > > >> >>>>>      If the columns in the data frames not used in
    >> merging
    >> > have
    >> >> > any
    >> >> > > >> >>>>>      common names, these have ‘suffixes’ (‘".x"’ and
    >> ‘".y"’
    >> > by
    >> >> > > >> default)
    >> >> > > >> >>>>>      appended to try to make the names of the result
    >> unique.
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> Since the first "name" column was used in merging,
    >> leaving
    >> > both
    >> >> > > >> >>>>> without a suffix seems consistent with the
    >> documentation...
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> Frederick
    >> >> > > >> >>>>>
    >> >> > > >> >>>>> On Fri, Feb 16, 2018 at 09:08:29AM +1100, Scott Ritchie
    >> > wrote:
    >> >> > > >> >>>>>
    >> >> > > >> >>>>>> Hi,
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> I was unable to find a bug report for this with a
    >> cursory
    >> >> > search,
    >> >> > > >> but
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>> would
    >> >> > > >> >>>>>
    >> >> > > >> >>>>>> like clarification if this is intended or unavoidable
    >> >> > behaviour:
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> ```{r}
    >> >> > > >> >>>>>> # Create example data.frames
    >> >> > > >> >>>>>> parents <- data.frame(name=c("Sarah", "Max", "Qin",
    >> "Lex"),
    >> >> > > >> >>>>>>                        sex=c("F", "M", "F", "M"),
    >> >> > > >> >>>>>>                        age=c(41, 43, 36, 51))
    >> >> > > >> >>>>>> children <- data.frame(parent=c("Sarah", "Max",
    >> "Qin"),
    >> >> > > >> >>>>>>                         name=c("Oliver", "Sebastian",
    >> >> > "Kai-lee"),
    >> >> > > >> >>>>>>                         sex=c("M", "M", "F"),
    >> >> > > >> >>>>>>                         age=c(5,8,7))
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> # Merge() creates a duplicated "name" column:
    >> >> > > >> >>>>>> merge(parents, children, by.x = "name", by.y =
    >> "parent")
    >> >> > > >> >>>>>> ```
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> Output:
    >> >> > > >> >>>>>> ```
    >> >> > > >> >>>>>>     name sex.x age.x      name sex.y age.y
    >> >> > > >> >>>>>> 1   Max     M    43 Sebastian     M     8
    >> >> > > >> >>>>>> 2   Qin     F    36   Kai-lee     F     7
    >> >> > > >> >>>>>> 3 Sarah     F    41    Oliver     M     5
    >> >> > > >> >>>>>> Warning message:
    >> >> > > >> >>>>>> In merge.data.frame(parents, children, by.x = "name",
    >> by.y =
    >> >> > > >> >>>>>> "parent") :
    >> >> > > >> >>>>>>    column name ‘name’ is duplicated in the result
    >> >> > > >> >>>>>> ```
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> Kind Regards,
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> Scott Ritchie
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>>        [[alternative HTML version deleted]]
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>> ______________________________________________
    >> >> > > >> >>>>>> [hidden email] mailing list
    >> >> > > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>>
    >> >> > > >> >>>>>
    >> >> > > >> >>>>
    >> >> > > >> >>>>
    >> >> > > >> >> Index: src/library/base/R/merge.R
    >> >> > > >> >>> ==============================
    >> ==============================
    >> >> > =======
    >> >> > > >> >>> --- src/library/base/R/merge.R  (revision 74264)
    >> >> > > >> >>> +++ src/library/base/R/merge.R  (working copy)
    >> >> > > >> >>> @@ -157,6 +157,15 @@
    >> >> > > >> >>>           }
    >> >> > > >> >>>             if(has.common.nms) names(y) <- nm.y
    >> >> > > >> >>> +        ## If by.x %in% names(y) then duplicate column
    >> names
    >> >> > still
    >> >> > > >> >>> arise,
    >> >> > > >> >>> +        ## apply suffixes to these
    >> >> > > >> >>> +        dupe.keyx <- intersect(nm.by, names(y))
    >> >> > > >> >>> +        if(length(dupe.keyx)) {
    >> >> > > >> >>> +          if(nzchar(suffixes[1L]))
    >> >> > > >> >>> +            names(x)[match(dupe.keyx, names(x), 0L)] <-
    >> >> > > >> >>> paste(dupe.keyx, suffixes[1L], sep="")
    >> >> > > >> >>> +          if(nzchar(suffixes[2L]))
    >> >> > > >> >>> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> >> > > >> >>> paste(dupe.keyx, suffixes[2L], sep="")
    >> >> > > >> >>> +        }
    >> >> > > >> >>>           nm <- c(names(x), names(y))
    >> >> > > >> >>>           if(any(d <- duplicated(nm)))
    >> >> > > >> >>>               if(sum(d) > 1L)
    >> >> > > >> >>>
    >> >> > > >> >>
    >> >> > > >> >> ______________________________________________
    >> >> > > >> >> [hidden email] mailing list
    >> >> > > >> >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> > > >> >>
    >> >> > > >> >>
    >> >> > > >> >
    >> >> > > >>
    >> >> > > >>         [[alternative HTML version deleted]]
    >> >> > > >>
    >> >> > > >> ______________________________________________
    >> >> > > >> [hidden email] mailing list
    >> >> > > >> https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> > > >>
    >> >> > > >
    >> >> > > >
    >> >> > > >
    >> >> > > > --
    >> >> > > > Gabriel Becker, PhD
    >> >> > > > Scientist (Bioinformatics)
    >> >> > > > Genentech Research
    >> >> > > >
    >> >> >
    >> >> > > Index: src/library/base/R/merge.R
    >> >> > > ============================================================
    >> =======
    >> >> > > --- src/library/base/R/merge.R        (revision 74264)
    >> >> > > +++ src/library/base/R/merge.R        (working copy)
    >> >> > > @@ -157,6 +157,15 @@
    >> >> > >          }
    >> >> > >
    >> >> > >          if(has.common.nms) names(y) <- nm.y
    >> >> > > +        ## If by.x %in% names(y) then duplicate column names
    >> still
    >> >> > arise,
    >> >> > > +        ## apply suffixes to these
    >> >> > > +        dupe.keyx <- intersect(nm.by, names(y))
    >> >> > > +        if(length(dupe.keyx)) {
    >> >> > > +          if(nzchar(suffixes[1L]))
    >> >> > > +            names(x)[match(dupe.keyx, names(x), 0L)] <-
    >> >> > paste(dupe.keyx, suffixes[1L], sep="")
    >> >> > > +          if(nzchar(suffixes[2L]))
    >> >> > > +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> >> > paste(dupe.keyx, suffixes[2L], sep="")
    >> >> > > +        }
    >> >> > >          nm <- c(names(x), names(y))
    >> >> > >          if(any(d <- duplicated(nm)))
    >> >> > >              if(sum(d) > 1L)
    >> >> >
    >> >> > > ______________________________________________
    >> >> > > [hidden email] mailing list
    >> >> > > https://stat.ethz.ch/mailman/listinfo/r-devel
    >> >> >
    >> >> >
    >>
    >> >> Index: src/library/base/R/merge.R
    >> >> ===================================================================
    >> >> --- src/library/base/R/merge.R        (revision 74280)
    >> >> +++ src/library/base/R/merge.R        (working copy)
    >> >> @@ -157,6 +157,14 @@
    >> >> }
    >> >>
    >> >> if(has.common.nms) names(y) <- nm.y
    >> >> +        ## If by.x %in% names(y) then duplicate column names still
    >> arise,
    >> >> +        ## apply suffixes to just y - this keeps backwards
    >> compatibility
    >> >> +        ## when referring to by.x in the resulting data.frame
    >> >> +        dupe.keyx <- intersect(nm.by, names(y))
    >> >> +        if(length(dupe.keyx)) {
    >> >> +          if(nzchar(suffixes[2L]))
    >> >> +            names(y)[match(dupe.keyx, names(y), 0L)] <-
    >> paste(dupe.keyx,
    >> > suffixes[2L], sep="")
    >> >> +        }
    >> >> nm <- c(names(x), names(y))
    >> >> if(any(d <- duplicated(nm)))
    >> >> if(sum(d) > 1L)
    >>
    >> > [[alternative HTML version deleted]]
    >>
    >> > ______________________________________________
    >> > [hidden email] mailing list
    >> > https://stat.ethz.ch/mailman/listinfo/r-devel
    >>

    > [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel