union of two sets are smaller than one set?

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

union of two sets are smaller than one set?

Martin Møller Skarbiniks Pedersen
This is really puzzling me and when I try to make a small example
everything works like expected.

The problem:

I got these two large vectors of strings.

> str(s1)
 chr [1:766608] "0.dk" ...
> str(s2)
 chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...

And I need to create the union-set of s1 and s2.
I expect the size of the union-set to be between 766608 and 766608+59387.
However it is 681193 which is less that number of elements in s1!

> length(base::union(s1, s2))
[1] 681193

Any hints?

Regards
Martin

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: union of two sets are smaller than one set?

Duncan Murdoch-2
On 31/01/2021 3:57 p.m., Martin Møller Skarbiniks Pedersen wrote:

> This is really puzzling me and when I try to make a small example
> everything works like expected.
>
> The problem:
>
> I got these two large vectors of strings.
>
>> str(s1)
>   chr [1:766608] "0.dk" ...
>> str(s2)
>   chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...
>
> And I need to create the union-set of s1 and s2.
> I expect the size of the union-set to be between 766608 and 766608+59387.
> However it is 681193 which is less that number of elements in s1!
>
>> length(base::union(s1, s2))
> [1] 681193
>
> Any hints?

I imagine unique(s1) is shorter than s1.  The union function is the same as

unique(c(s1, s2))

for your data.  (The only difference is if s1 or s2 is named:  the names
are dropped.)

Duncan Murdoch

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: union of two sets are smaller than one set?

Enrico Schumann-2
In reply to this post by Martin Møller Skarbiniks Pedersen
On Sun, 31 Jan 2021, Martin Møller Skarbiniks Pedersen writes:

> This is really puzzling me and when I try to make a small example
> everything works like expected.
>
> The problem:
>
> I got these two large vectors of strings.
>
>> str(s1)
>  chr [1:766608] "0.dk" ...
>> str(s2)
>  chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...
>
> And I need to create the union-set of s1 and s2.
> I expect the size of the union-set to be between 766608 and 766608+59387.
> However it is 681193 which is less that number of elements in s1!
>
>> length(base::union(s1, s2))
> [1] 681193
>
> Any hints?
>
> Regards
> Martin
>

Duplicates?

kind regards
    Enrico

--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: union of two sets are smaller than one set?

R help mailing list-2
In reply to this post by Martin Møller Skarbiniks Pedersen
Martin,

You did not say your two starting objects were already sets. You said they
were vectors of strings. It may well be that your strings included
duplicates. For example, If I read in lots of text with a blank line between
paragraphs, I would have lots of seemingly empty and identical parts. Just
converting that into a set would shrink it.

You have not said how you created or processed your initial two vectors. It
is also possible parts were sort of DELETED as in removing the string
pointed to by some entry but leaving a null pointer of sorts which would
leave the length of the vector longer than the useful contents.

Your strings seem to be what may be filenames. Are they unique, especially
if they are files in different folders/directories?

There are many ways to check, but using your method, try this:

length(base::union(s1, s1))

-----Original Message-----
From: R-help <[hidden email]> On Behalf Of Martin Møller
Skarbiniks Pedersen
Sent: Sunday, January 31, 2021 3:57 PM
To: R mailing list <[hidden email]>
Subject: [R] union of two sets are smaller than one set?

This is really puzzling me and when I try to make a small example everything
works like expected.

The problem:

I got these two large vectors of strings.

> str(s1)
 chr [1:766608] "0.dk" ...
> str(s2)
 chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...

And I need to create the union-set of s1 and s2.
I expect the size of the union-set to be between 766608 and 766608+59387.
However it is 681193 which is less that number of elements in s1!

> length(base::union(s1, s2))
[1] 681193

Any hints?

Regards
Martin

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.