

Hi all,
Apologies if this has been asked before (a quick google didn't find it for
me), and I know this is a case of behaving as documented, but it's so
unintuitive (to me at least) that I figured I'd bring it up here anyway. I
figure it's probably not going to be changed, but I'm happy to submit a
patch if this is something R-core feels can/should change.
So I recently got bitten by the fact that
> nrow(rbind(character(), character()))
[1] 2
I was checking whether the result of an rbind call had more than one row,
and that unexpectedly returned TRUE, causing all sorts of shenanigans
downstream as I'm sure you can imagine.
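For reference, a minimal sketch of the surprise described above, using only base R. The variable name `res` is purely illustrative.

```r
# Binding two zero-length character vectors yields a 2 x 0 matrix,
# so a "more than one row" check succeeds even though there is no data.
res <- rbind(character(), character())

dim(res)       # 2 0
nrow(res) > 1  # TRUE
length(res)    # 0
```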
Now I know that from ?rbind
> For ‘cbind’ (‘rbind’), vectors of zero length (including ‘NULL’)
> are ignored unless the result would have zero rows (columns), for
> S compatibility. (Zero-extent matrices do not occur in S3 and are
> not ignored in R.)
But there's a couple of things here. First, for the row-bind case this
reads as "if there would be zero columns, the vectors will not be
ignored". This wording implies to me that not ignoring the vectors is a
remedy to the "problem" of the potential for a zero-column return, but
that's not the case. The result still has 0 columns, it just does not also
have zero rows. So even if the behavior is not changed, perhaps this
wording can be massaged for clarity?
The other issue, which I admit is likely a problem with my intuition, but
which I don't think I'm alone in having, is that even if I can't have a 0x0
matrix (which is what I'd prefer) I would have expected/preferred a 1x0
matrix, the reasoning being that if we must avoid a 0x0 return value, we
would do the minimum required to avoid it, which is to not ignore the first
length-0 vector, to ensure a non-zero-extent matrix, but then ignore the
remaining ones as they contain information for 0 new rows.
Of course I can program around this now that I know the behavior, but
again, it's so unintuitive (even for someone with a fairly well developed
intuition for R's sometimes "quirky" behavior) that I figured I'd bring it
up.
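One possible way to program around it, sketched under the assumption that "more than one row" should only count when the result has at least one column. The helper name `has_multiple_rows` is hypothetical and not part of base R.

```r
# Hypothetical guard: treat a zero-column result as having no usable rows.
has_multiple_rows <- function(m) {
  ncol(m) > 0 && nrow(m) > 1
}

has_multiple_rows(rbind(character(), character()))  # FALSE
has_multiple_rows(rbind(c("a", "b"), c("c", "d")))  # TRUE
has_multiple_rows(rbind(c("a", "b"), character()))  # FALSE (only one row is kept)
```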
Thoughts?
Best,
~G
The existing behaviour seems intuitive to me. I would consider these
invariants for n vector x_i's each with size m:
* nrow(rbind(x_1, x_2, ..., x_n)) equals n
* ncol(rbind(x_1, x_2, ..., x_n)) equals m
Additionally, wouldn't you expect rbind(x_1[i], x_2[i]) to equal
rbind(x_1, x_2)[, i, drop = FALSE] ?
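A rough way to check these invariants in base R, as a sketch only. The vectors `x_1` and `x_2` and the index sets are made-up examples; the comparison is on shape, because the two sides differ in row names.

```r
x_1 <- 1:3
x_2 <- 4:6

# Try a full index, a single index, and an empty index.
for (i in list(1:3, 2L, integer(0))) {
  lhs <- rbind(x_1[i], x_2[i])
  rhs <- rbind(x_1, x_2)[, i, drop = FALSE]
  # Compare dimensions only; rhs carries row names, lhs does not.
  stopifnot(identical(dim(lhs), dim(rhs)))
}

dim(rbind(x_1[integer(0)], x_2[integer(0)]))  # 2 0
```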
Hadley
Hi Hadley,
Thanks for the counterpoint. Response below.
On Thu, May 16, 2019 at 1:59 PM Hadley Wickham wrote:
> The existing behaviour seems intuitive to me. I would consider these
> invariants for n vector x_i's each with size m:
>
> * nrow(rbind(x_1, x_2, ..., x_n)) equals n
>
Personally, no I wouldn't. I would consider m==0 a degenerate case, where
there is no data, but I personally find matrices (or data.frames) with rows
but no columns a very strange concept. The converse is not true, I
understand the utility of columns but no rows, particularly in the
data.frame case, but rows with no columns are observations we didn't
observe anything about. Strange, imho.
Also, I know that you said *each with size m*, but the generalization would
be
for n vectors with m = max(length(x_i))
nrow(rbind(x_1, ..., x_n)) = n
And that is the behavior now as documented, but *only* when length(x_i) > 0
for all i (or, currently, when m == 0, so all vectors are length 0).
> nrow(rbind(1:5, numeric()))
[1] 1
So that is where I was coming from. Length-zero vectors don't add rows
because they contain no observed information.
I do see where you're coming from, but it does make interrogating
nrow(rbind(x_1, ..., x_n)) NOT mean (give me the number of observations
for which I have data), which is what it means in non-degenerate contexts,
and that seems pretty important too.
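To make the contrast concrete, here is a small sketch comparing nrow() with a count of the inputs that actually carry data. The lists `args` and `empty` are invented for illustration.

```r
# Mixed case: the zero-length vector is ignored, so nrow() matches
# the number of inputs that carry data.
args <- list(1:5, numeric(), 6:10)
sum(lengths(args) > 0)        # 2
nrow(do.call(rbind, args))    # 2

# All-empty case: no input carries data, yet nrow() counts every input.
empty <- list(numeric(), numeric())
sum(lengths(empty) > 0)       # 0
nrow(do.call(rbind, empty))   # 2
```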
Robin does also have an interesting point below about argument names, but
I'll leave that for another mail.
Best,
~G
Hi Gabe,
ncol(data.frame(aa=c("a", "b", "c"), AA=c("A", "B", "C")))
# [1] 2
ncol(data.frame(aa="a", AA="A"))
# [1] 2
ncol(data.frame(aa=character(0), AA=character(0)))
# [1] 2
ncol(cbind(aa=c("a", "b", "c"), AA=c("A", "B", "C")))
# [1] 2
ncol(cbind(aa="a", AA="A"))
# [1] 2
ncol(cbind(aa=character(0), AA=character(0)))
# [1] 2
nrow(rbind(aa=c("a", "b", "c"), AA=c("A", "B", "C")))
# [1] 2
nrow(rbind(aa="a", AA="A"))
# [1] 2
nrow(rbind(aa=character(0), AA=character(0)))
# [1] 2
hmmm... not sure why ncol(cbind(aa=character(0), AA=character(0))) or
nrow(rbind(aa=character(0), AA=character(0))) should do anything
different from what they do.
In my experience, and more generally speaking, the desire to treat
0-length vectors as a special case that deviates from the
non-zero-length case has never been productive.
H.
Hi Herve,
Inline.
Hi Gabriel
> Personally, no I wouldn't. I would consider m==0 a degenerate case, where
> there is no data, but I personally find matrices (or data.frames) with rows
> but no columns a very strange concept.
This distinction between matrices and data.frames is the crux in this case.
From the dimensional modelling point of view, a matrix can have non-zero
rows and zero columns, but a data.frame (assuming it maps to a database
table structure) should never have non-zero rows and zero columns.
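For context, a short sketch showing that base R itself can represent both shapes; whether a data.frame ought to be used that way is the modelling question raised here. The object names `m` and `df` are illustrative.

```r
# A matrix with three rows and no columns.
m <- matrix(numeric(), nrow = 3, ncol = 0)
dim(m)   # 3 0

# A data.frame with three rows (row names only) and no columns.
df <- data.frame(row.names = 1:3)
dim(df)  # 3 0
```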
This kind of issue was raised before in our issue tracker:
https://github.com/Rdatatable/data.table/issues/2422
You should find that discussion useful.
Best,
Jan Gorecki
Herve Pages wrote:
> In my experience, and more generally speaking, the desire to treat
> 0-length vectors as a special case that deviates from the
> non-zero-length case has never been productive.
Good idea.
Gabriel Becker wrote:
> > nrow(rbind(aa = c("a", "b", "c"), AA = character()))
> [1] 1
> By rights of the invariance that you and Hadley are advocating, as far as
> I understand it, the last should give 2 rows, one of which is all NAs,
> rather than giving only one row as it currently does (and, I assume?,
> always has).
I think, ideally, this example should generate an error or a warning.
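A minimal sketch of what such a check could look like as a user-level wrapper. The name `rbind_strict` and its error message are hypothetical; this is not an existing or proposed base R function.

```r
# Hypothetical wrapper: refuse to bind when zero-length and
# non-zero-length arguments are mixed, instead of silently dropping rows.
rbind_strict <- function(...) {
  lens <- lengths(list(...))
  if (any(lens == 0) && any(lens > 0)) {
    stop("zero-length argument mixed with non-zero-length arguments")
  }
  rbind(...)
}

nrow(rbind_strict(aa = c("a", "b", "c"), AA = c("A", "B", "C")))  # 2
# rbind_strict(aa = c("a", "b", "c"), AA = character())  # signals an error
```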
>>>>> Gabriel Becker
>>>>>     on Thu, 16 May 2019 15:47:57 -0700 writes:
> Hi Hadley,
> Thanks for the counterpoint. Response below.
> On Thu, May 16, 2019 at 1:59 PM Hadley Wickham wrote:
>> The existing behaviour seems intuitive to me. I would consider these
>> invariants for n vector x_i's each with size m:
>>
>> * nrow(rbind(x_1, x_2, ..., x_n)) equals n
>>
> Personally, no I wouldn't. I would consider m==0 a degenerate case, where
> there is no data, but I personally find matrices (or data.frames) with rows
> but no columns a very strange concept. The converse is not true, I
> understand the utility of columns but no rows, particularly in the
> data.frame case, but rows with no columns are observations we didn't
> observe anything about. Strange, imho.
Gabe, here I have to very strongly disagree.
Matrices (and higher-order arrays) are definitely always meant to
behave "symmetrically" / "uniformly" with respect to all of their dimensions.
We (and the S developers before us) have always taken a lot of
care trying to ensure that this is true.
So for the matrix case, if rows and columns behaved differently
that would be a bug "by definition".
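The row/column symmetry can be seen directly in the cases discussed in this thread. The following is only an illustrative sketch in base R.

```r
# All-empty inputs: rbind and cbind give mirror-image shapes.
dim(rbind(character(), character()))  # 2 0
dim(cbind(character(), character()))  # 0 2

# Mixed inputs: the zero-length vector is ignored in both directions.
dim(rbind(1:5, numeric()))  # 1 5
dim(cbind(1:5, numeric()))  # 5 1
```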
Of course there's one place where this uniformity / symmetry
must be violated: in the coercion from and to atomic vectors.
There, 'by column' (generalized for arrays to "earlier dimensions vary faster
than later ones") has been chosen, not least because this had
been adopted for Fortran (first, AFAIK) and all related ABIs
dealing with matrix-vector arithmetic, for very good (numerical,
performance, known convention) reasons that made it possible to know how
fast numerical linear algebra should be implemented.
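As a small illustration of the 'by column' convention in base R (the objects `m` and `a` are made up for this sketch):

```r
# A matrix is filled column by column from its underlying vector.
m <- matrix(1:6, nrow = 2)
m[2, 1]       # 2
as.vector(m)  # 1 2 3 4 5 6

# For arrays, earlier dimensions vary faster than later ones.
a <- array(1:24, dim = c(2, 3, 4))
a[2, 1, 1]    # 2
a[1, 2, 1]    # 3
a[1, 1, 2]    # 7
```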
Martin
Hi Martin,
Thanks for chiming in. Responses inline.
>>>>> Gabriel Becker
>>>>>     on Fri, 17 May 2019 01:06:11 -0700 writes:
> Hi Martin,
> Thanks for chiming in. Responses inline.
