

The duplicated() function gives TRUE if an item in a vector (or row in a
matrix, etc.) is a duplicate of an earlier item. But what I would like
to know is which item does it duplicate?
For example,
v <- c("a", "b", "b", "a")
duplicated(v)
returns
[1] FALSE FALSE TRUE TRUE
What I want is a fast way to calculate
[1] NA NA 2 1
or (equally useful to me)
[1] 1 2 2 1
The result should have the property that if result[i] == j, then v[i] ==
v[j], at least for i != j.
Does this already exist somewhere, or is it easy to write?
Duncan Murdoch
What about as.integer(factor(v, levels = unique(v)))?
I recall very clearly when I realized the power of this feature of
factor(), but I've not seen it discussed much.
Cheers, Mike.
> match(v, unique(v))
[1] 1 2 2 1
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
 Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
Hi,
On 11/12/18 17:08, Duncan Murdoch wrote:
> Does this already exist somewhere, or is it easy to write?
I generally use match() for that:
> v <- c("a", "b", "b", "a")
> match(v, v)
[1] 1 2 2 1
H.
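
The NA form from the original question can be derived from the same match() call. This is only a minimal sketch: the helper name first_dup_index is made up for illustration, and it assumes an atomic vector for which match() and duplicated() agree on what counts as equal.

```r
# Hypothetical helper: NA for first occurrences, otherwise the index
# of the first element that the item duplicates.
first_dup_index <- function(v) {
  idx <- match(v, v)                  # position of the first element equal to v[i]
  idx[!duplicated(v)] <- NA_integer_  # blank out the first occurrences
  idx
}

v <- c("a", "b", "b", "a")
first_dup_index(v)
# [1] NA NA  2  1
```

Both calls are vectorized, so no tapply() or explicit loop is needed.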

It is not clear to me what you want for the general case. Perhaps:
> v <- letters[c(2,2,1,2,1,1)]
> wh <- tapply(seq_along(v), factor(v), '[', 1)
> w <- wh[match(v, v[wh])]
> w
b b a b a a
1 1 3 1 3 3
> ## and if you want NA's for the first occurrences of unique values
> ## of course:
> w[wh] <- NA
> w
 b  b  a  b  a  a
NA  1 NA  1  3  3
I'd like to see a cleverer solution that vectorizes and avoids the
tapply(), though.
Cheers,
Bert
"I'd like to see a cleverer solution that vectorizes..."
and Herve provided it.
Bert Gunter
Hi
A similar result (with different numerical values) could be achieved by making v a factor.
> v <- letters[c(2,2,1,2,1,1)]
> vf <- factor(v)
> as.numeric(vf)
[1] 2 2 1 2 1 1
Cheers
Petr
>>>>> PIKAL Petr
>>>>> on Tue, 13 Nov 2018 08:42:22 +0000 writes:
> Hi
> similar result (with different numerical values) could
> be achieved by making v a factor.
Yes, as was already remarked by Michael Sumner.
But really the power is in match(): it is called *twice* by factor().
Martin
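
As a small check of the point above, the factor-based codes and the direct match() call can be compared on the example vector. This is a sketch for a character vector only; the variable names are illustrative.

```r
v <- c("b", "b", "a", "b", "a", "a")

# Codes via factor(), with levels in order of first appearance
codes_factor <- as.integer(factor(v, levels = unique(v)))

# The same codes computed directly with match()
codes_match <- match(v, unique(v))

identical(codes_factor, codes_match)
# [1] TRUE
```

Calling match() directly skips building the factor object, which is the reason to prefer it when only the integer codes are wanted.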
On 13/11/2018 12:35 AM, Pages, Herve wrote:
> Hi,
> I generally use match() for that:
>
> > v <- c("a", "b", "b", "a")
>
> > match(v, v)
>
> [1] 1 2 2 1
Yes, this is perfect. Thanks to you (and the private answer I received
that suggested the same).
Duncan Murdoch
You also asked about doing this for the rows of a matrix. unique() gives
the unique rows, but match() operates per element, not per row.
You can use split() to convert the matrix to a list of rows, which helps.
> m <- cbind( A=c(i=5,ii=5,iii=5,iv=4,v=4,vi=4), B=c(2,3,2,2,2,2) )
> unique(m)
   A B
i  5 2
ii 5 3
iv 4 2
> match(m, unique(m)) # bad
[1] 1 1 1 3 3 3 4 5 4 4 4 4
> asRows <- function(x) split(x, seq_len(NROW(x))) # convert to list of rows
> match(asRows(m), unique(asRows(m)))
[1] 1 2 1 3 3 3
For data.frames, unique() works on rows but match() works on columns, and
converting to a list of rows does not quite work, because unique() looks at
the row names. A modification of asRows works around that:
> d <- data.frame(m)
> unique(d)
   A B
i  5 2
ii 5 3
iv 4 2
> match(d, unique(d))
[1] NA NA
> asRows <- function(x) lapply(split(x, seq_len(NROW(x))), as.list)
> match(asRows(d), unique(asRows(d)))
[1] 1 2 1 3 3 3
Is this the sort of issue that Hadley's vectors package is addressing?
Bill Dunlap
TIBCO Software
wdunlap tibco.com
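
A different way to get row indices for both matrices and data frames is to paste each row into one string key and call match() on the keys. This is not the approach from the message above, only a rough sketch under stated assumptions: the helper name row_match is invented, the separator "\r" is assumed not to occur in the data, and values are compared after conversion to character, so doubles that differ only beyond the printed precision would be treated as equal.

```r
# Hypothetical helper: for each row, the index of the first identical row.
row_match <- function(x) {
  cols <- as.list(as.data.frame(x, stringsAsFactors = FALSE))
  # unname() so that a column called "sep" cannot clash with paste()'s argument
  key <- do.call(paste, c(unname(cols), sep = "\r"))
  match(key, key)
}

m <- cbind(A = c(5, 5, 5, 4, 4, 4), B = c(2, 3, 2, 2, 2, 2))
row_match(m)
# [1] 1 2 1 4 4 4
row_match(data.frame(m))
# [1] 1 2 1 4 4 4
```

The result here is in the match(v, v) form (row numbers of first occurrences); using match(key, unique(key)) inside the helper instead would give the 1 2 1 3 3 3 numbering shown above.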
On 13/11/2018 12:58 PM, William Dunlap wrote:
> You also asked about doing this for the rows of a matrix. unique() give
> the unique rows but match operates on a per element, not per row,
> basis. You can use split, which operates on rows of a matrix, to help.
Thanks! That's very nice.
>
> Is this the sort of issue that Hadley's vectors package is addressing?
I don't know; hopefully someone else will respond...
Duncan Murdoch
