`merge()` not consistent in how it treats list columns

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

`merge()` not consistent in how it treats list columns

Antoine Fabri
Dear R-devel,

When trying to merge 2 data frames by an "id" column, with this column a
character in one of them, and a list of character in the other, merge
behaves differently depending which is given first.

Example :

```
df1 <- data.frame(a=1)
df2 <- data.frame(b=2)
df1$id <- "ID"
df2$id <- list("ID")

# these print in a similar way, so the upcoming error will be hard to
diagnose
df1
#>   a id
#> 1 1 ID
df2
#>   b id
#> 1 2 ID

# especially as this works well, df2$id is treated as an atomic vector
merge(df1, df2)
#>   id a b
#> 1 ID 1 2

# But this fails with a cryptic error message
merge(df2, df1)
#> Error in sort.list(bx[m$xi]): 'x' must be atomic for 'sort.list', method
"shell" and "quick"
#> Have you called 'sort' on a list?
```

I believe that if we let it work one way it should work the other, and that
if it works neither an explicit error  mentioning how we can't join by list
column would be helpful.

Many thanks and happy new year to all the R community,

Antoine

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: `merge()` not consistent in how it treats list columns

R devel mailing list
Antoine,

Have you considered converting the non-list to a list explicitly so this
does not matter?

For a long time, few people used lists in this context, albeit in the
tidyverse it is now better supported and probably more common.

This is an area many have found annoying when you have implicit conversions.
What if one ID field was character and the other was numeric? In some
languages the conversion always goes to character (as in R) but in some it
might go numeric in one direction and in some it may refuse and demand you
convert it yourself.

Do you suggest that a unique solution exists for complex cases so that the
software should know you want to convert a vector to list? What if one side
is a list containing a list containing a list, many levels deep and the
other has no or fewer or more levels. Is it obvious to take the deepest case
and change all others to match? Do you lose things in the process?

When things may not work, sure you can suggest someone change, but you can
consider it as a case where YOU should make sure the types are compatible
before a merge.



-----Original Message-----
From: R-devel <[hidden email]> On Behalf Of Antoine Fabri
Sent: Saturday, January 2, 2021 2:16 PM
To: R-devel <[hidden email]>
Subject: [Rd] `merge()` not consistent in how it treats list columns

Dear R-devel,

When trying to merge 2 data frames by an "id" column, with this column a
character in one of them, and a list of character in the other, merge
behaves differently depending which is given first.

Example :

```
df1 <- data.frame(a=1)
df2 <- data.frame(b=2)
df1$id <- "ID"
df2$id <- list("ID")

# these print in a similar way, so the upcoming error will be hard to
diagnose
df1
#>   a id
#> 1 1 ID
df2
#>   b id
#> 1 2 ID

# especially as this works well, df2$id is treated as an atomic vector
merge(df1, df2)
#>   id a b
#> 1 ID 1 2

# But this fails with a cryptic error message merge(df2, df1) #> Error in
sort.list(bx[m$xi]): 'x' must be atomic for 'sort.list', method "shell" and
"quick"
#> Have you called 'sort' on a list?
```

I believe that if we let it work one way it should work the other, and that
if it works neither an explicit error  mentioning how we can't join by list
column would be helpful.

Many thanks and happy new year to all the R community,

Antoine

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: `merge()` not consistent in how it treats list columns

Gabriel Becker-2
In reply to this post by Antoine Fabri
Hi Antoine,


On Sat, Jan 2, 2021 at 11:16 AM Antoine Fabri <[hidden email]>
wrote:

> Dear R-devel,
>
> When trying to merge 2 data frames by an "id" column, with this column a
> character in one of them, and a list of character in the other, merge
> behaves differently depending which is given first.
>
> Example :
>
> ```
> df1 <- data.frame(a=1)
> df2 <- data.frame(b=2)
> df1$id <- "ID"
> df2$id <- list("ID")
>
> # these print in a similar way, so the upcoming error will be hard to
> diagnose
> df1
> #>   a id
> #> 1 1 ID
> df2
> #>   b id
> #> 1 2 ID
>
> # especially as this works well, df2$id is treated as an atomic vector
> merge(df1, df2)
> #>   id a b
> #> 1 ID 1 2
>

Well, sure but that is because it happens to be a list with each element
having length one. In which case, it really should not have been a list at
all, and the fact that it was seems a deeper problem that should likely be
resolved instead of treating the symptom, in my opinion.

 > df1 <- data.frame(a=1)

> df2 <- data.frame(b=2)

> df1$id <- "ID"

> df2$id <- list(c("ID", "ID2"))

> merge(df1, df2)

[1] id a  b

<0 rows> (or 0-length row.names)


Thats probably not what you wanted it to do, right? Or maybe it is, it
depends, right?. And therein lies the rub.


I have to be honest, as a developer, I really wish this, even in your
example case, threw an error. Anything else just looks to me like a
debugging nightmare looming in the wings waiting to strike.





> # But this fails with a cryptic error message
> merge(df2, df1)
> #> Error in sort.list(bx[m$xi]): 'x' must be atomic for 'sort.list', method
> "shell" and "quick"
> #> Have you called 'sort' on a list?
> ```
>
> I believe that if we let it work one way it should work the other, and that
> if it works neither an explicit error  mentioning how we can't join by list
> column would be helpful.
>

There's no reason (in principle) you wouldn't be able to join by a list
column, they should just both have to be list columns, in my ideal (but
admittedly unlikely) world.  Id rather the atomic-vector/list mismatch case
throw an error, myself.


Now I kind of doubt we can change the behavior that works now, but as Avi
points out, I think this is something that is complicated and case specific
enough that it really ought to be your job as the coder to take care of
what should happen when you try to merge on columns that are fundamentally
different types.


Plus, having an id column as a list, unless it was really explicitly
intentional, seems very likely to be a bug to me. (I mean id column in the
you want to use it to merge things, not the fact that itw as called "id",
though admittedly those are likely to go together...

Best,
~G


>
> Many thanks and happy new year to all the R community,
>
> Antoine
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: `merge()` not consistent in how it treats list columns

Antoine Fabri
Hi Gabe,


>  [... ]
> Well, sure but that is because it happens to be a list with each element
> having length one. In which case, it really should not have been a list at
> all, and the fact that it was seems a deeper problem that should likely be
> resolved instead of treating the symptom, in my opinion.
>

I wouldn't mind it explicitly failing on the ground that you don't join a
list column on a character column, and I wouldn't mind it succeeding
either, because it's consistent with `c("a", "b") == list("a", "b")`  and
`c("a", "b") %in% list("a", "b")` returning `c(TRUE, TRUE)`. But I feel
strongly that it shouldn't behave differently depending on which data frame
is provided first to the function, and I do think that if we do make it an
error, it is worth making it understandable.


>
>  > df1 <- data.frame(a=1)
>
> > df2 <- data.frame(b=2)
>
> > df1$id <- "ID"
>
> > df2$id <- list(c("ID", "ID2"))
>
> > merge(df1, df2)
>
> [1] id a  b
>
> <0 rows> (or 0-length row.names)
>
>
> Thats probably not what you wanted it to do, right? Or maybe it is, it
> depends, right?. And therein lies the rub.
>
>
> I have to be honest, as a developer, I really wish this, even in your
> example case, threw an error. Anything else just looks to me like a
> debugging nightmare looming in the wings waiting to strike.
>
>
>
What I did wrong in my real case, to provide context, is compute `df2$id <-
lapply(x, fun)`, which was a mistake, but looked ok when printing, `vapply`
solved the issue, `sapply` would still have been problematic because
`df2$id` would be an emply list for a `x` of length 0.

After correcting my mistake I tried to isolate the error and had trouble
reproducing it with my simple case because I was inverting both data frames
argument. This is how the inconsistency +  cryptic message caused me more
trouble than I think it should have.

Imagine that I can have production code work for years with `merge(df1,
df2)`, maybe not written by me, I change it to `merge(df2, df1)` for some
reason and all breaks loose with `Error in sort.list(bx[m$xi]): 'x' must be
atomic for 'sort.list', method "shell" and "quick"`. If I'm not familiar
with list columns and that they can print just like character columns I
might have a rough day.

Here's another oddity that I think is worth fixing :

df1 <- data.frame(a=1, id = "ID")
df3 <- data.frame(c=character(), id = list())
merge(df3, df1)
#> [1] x[FALSE, ] a          id
#> <0 lignes> (ou 'row.names' de longueur nulle)
merge(df1, df3)
#> [1] a          id         y[FALSE, ]
#> <0 lignes> (ou 'row.names' de longueur nulle)

 [...]
>


> There's no reason (in principle) you wouldn't be able to join by a list
> column, they should just both have to be list columns, in my ideal (but
> admittedly unlikely) world.  Id rather the atomic-vector/list mismatch case
> throw an error, myself.
>

The doc does say that "This is intended to work with data frames with
vector-like columns" in a note at the bottom, so anything we do is
consistent with the doc, and fine by me if it fails (that's how {dplyr}
joins work), but let the order of the data frames not matter. A warning is
another option.


> Now I kind of doubt we can change the behavior that works now, but as Avi
> points out, I think this is something that is complicated and case specific
> enough that it really ought to be your job as the coder to take care of
> what should happen when you try to merge on columns that are fundamentally
> different types.
>

Well yes, one can always say it's the developer's fault, but we all
appreciate a software that guides us toward the light. List columns are not
a rare thing at all anymore and using an `lapply` call instad of `sapply`
or `vapply` is probably not a rare mistake. And again, the inconsistency is
wrong in any case.

I'll read other answers when I get the digest.

Thanks,

Antoine

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel