problem with duplicated function


curtburk
Hello everyone,

I have two very large data frames (~1 million rows x 5 columns), two of
whose columns are lat/long coordinates. The data frames are named 'data07'
and 'data08'. data08 has a few more sampling points than data07, so I want
to subset data08 so that it has the same number of data points as data07,
using the unique lat/long coordinates.

Here are the associated data structures:

str(data07)
'data.frame':   969109 obs. of  5 variables:
 $ cell    : int  710228 715545 720690 720824 695611 700490 700626 705371 705507 710363 ...
 $ prN     : int  288 276 286 304 258 257 264 272 286 316 ...
 $ Location: Factor w/ 32 levels " ","Blacks_Fork",..: 24 24 24 24 24 24 24 24 24 24 ...
 $ Xcor    : num  -111 -111 -111 -111 -111 ...
 $ Ycor    : num  41.7 41.7 41.7 41.7 41.8 ...

str(data08)
'data.frame':   969810 obs. of  5 variables:
 $ cell    : int  705528 710321 710456 715677 720762 720896 699953 700635 700771 705664 ...
 $ prN     : int  293 281 299 278 276 266 282 255 287 280 ...
 $ Location: Factor w/ 31 levels "Blacks_Fork",..: 23 23 23 23 23 23 23 23 23 23 ...
 $ Xcor    : num  -111 -111 -111 -111 -111 ...
 $ Ycor    : num  41.8 41.7 41.7 41.7 41.7 ...

I've tried using the following code to do this:

tt <- rbind(data07, data08)

# marks all duplicate rows in data08 from the last 2 cols,
# which correspond to the lat/long
tt.dup <- duplicated(tt[,4:5])

# remove all data07 entries (first n)
tt.dup <- tt.dup[-seq_len(nrow(data07))]

# index only TRUE/duplicated elements from data08
test=ddata08[tt.dup, ]

When I run the code 'tt.dup' is FALSE for all entries, which I know isn't
true.

Here's a small subset of the data so that you can see exactly where there
are duplicates

data07[1:10,]
                 cell prN Location     Xcor    Ycor
710229 *710228 288     Sage -111.044 41.7403*
715546 *715545 276     Sage -111.044 41.7245*
720691 *720690 286     Sage -111.044 41.7131*
720825 *720824 304     Sage -111.044 41.7109*
695612 695611 258     Sage -111.043 41.7766
700491 700490 257     Sage -111.043 41.7653
700627 700626 264     Sage -111.043 41.7630
705372 705371 272     Sage -111.043 41.7517
705508 705507 286     Sage -111.043 41.7495
710364 710363 316     Sage -111.043 41.7381

 data08[1:10,]
                 cell prN Location     Xcor    Ycor
705529 705528 293     Sage -111.044 41.7517
710322 *710321 281     Sage -111.044 41.7403*
710457 710456 299     Sage -111.044 41.7381
715678 *715677 278     Sage -111.044 41.7245*
720763 *720762 276     Sage -111.044 41.7131*
720897 *720896 266     Sage -111.044 41.7109*
699954 699953 282     Sage -111.043 41.7767
700636 700635 255     Sage -111.043 41.7653
700772 700771 287     Sage -111.043 41.7631
705665 705664 280     Sage -111.043 41.7495


If anyone has any suggestions as to where I might be going wrong I'd
greatly appreciate it.

Thank you




--
Curtis Burkhalter
Postdoctoral Research Associate, Audubon Rockies

https://sites.google.com/site/curtisburkhalter/

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: problem with duplicated function

Bert Gunter
I have NOT looked at your code in detail -- I might have if you had
used dput() to make available small subsets of your data frames that
exhibited the problems. However, the following, from ?duplicated,
sounds like it may be relevant:

"When used on a data frame with more than one column, or an array or
matrix when comparing dimensions of length greater than one, this
tests for identity of character representations. This will catch
people who unwisely rely on exact equality of floating-point numbers!"

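One way to sidestep exact floating-point equality, sketched here on the assumption that the coordinates are only meaningful to about four decimal places, is to round each coordinate to a fixed precision, paste the pair into a text key, and match keys with `%in%`. The toy data frames `a` and `b`, the helper `coord_key()`, and the choice of four digits are illustrative only and do not come from the original post.

```r
# Hypothetical toy data: the first Ycor in b differs from a only by tiny noise.
a <- data.frame(Xcor = c(-111.044, -111.043), Ycor = c(41.7403, 41.7766))
b <- data.frame(Xcor = c(-111.044, -111.043, -111.044),
                Ycor = c(41.7403 + 1e-9, 41.7767, 41.7517))

# Exact comparison of the two "41.7403" values fails because of the noise.
exact <- b$Ycor[1] == a$Ycor[1]

# Build a text key from coordinates formatted to a fixed number of decimals.
# The default of 4 digits is an assumption, not something stated in the post.
coord_key <- function(d, digits = 4) {
  paste(formatC(d$Xcor, format = "f", digits = digits),
        formatC(d$Ycor, format = "f", digits = digits),
        sep = "_")
}

# Keep the rows of b whose rounded coordinates also occur in a.
in_a <- coord_key(b) %in% coord_key(a)
b_matched <- b[in_a, ]
```

Rounding can still separate two nearly equal values that fall on opposite sides of a rounding boundary, so the precision should reflect the real resolution of the coordinates.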
Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll





Re: problem with duplicated function

Jeff Newmiller
In reply to this post by curtburk
You are going wrong in a few places: posting using HTML format, not using dput to share your data sample, and comparing floating point numbers for equality.

HTML email is stripped to plain text on this list so we don't see what you see. In addition, HTML formatting corrupts code, so we cannot even run it.

The dput function is highly recommended for making reproducible examples. [1]
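As a rough illustration of what sharing data with dput looks like, the sketch below builds a three-row sample from values printed in the original post, writes it out with `dput()`, and rebuilds it from the resulting text. The object names `sample07`, `txt` and `restored` are made up for this example; on the real data something like `dput(head(data07, 10))` would be the likely call.

```r
# Hypothetical three-row sample using values shown in the original post.
sample07 <- data.frame(
  cell = c(710228L, 715545L, 720690L),
  prN  = c(288L, 276L, 286L),
  Xcor = c(-111.044, -111.044, -111.044),
  Ycor = c(41.7403, 41.7245, 41.7131)
)

# dput() prints R code that rebuilds the object; this text goes in the email.
txt <- capture.output(dput(sample07))
cat(txt, sep = "\n")

# A reader recreates the data by evaluating the pasted text.
restored <- eval(parse(text = txt))
same <- isTRUE(all.equal(restored, sample07))
```

Unlike a printed table, the dput text keeps column types and does not depend on how the mail client renders whitespace.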

FAQ 7.31 warns against expecting floating point numbers that appear the same when printed to actually be equal. This advice actually applies to all programming languages.
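A minimal sketch of the FAQ 7.31 issue, using the standard textbook values rather than anything from the poster's data: two numbers can print identically and still fail an exact comparison, while `all.equal()` compares within a small tolerance.

```r
x <- 0.1 + 0.2

print(x)                # displays 0.3 at the default number of digits
print(x, digits = 17)   # more digits reveal that x is not exactly 0.3

exact_equal <- x == 0.3                    # FALSE
near_equal  <- isTRUE(all.equal(x, 0.3))   # TRUE: equal within tolerance
```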

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.


Re: problem with duplicated function

Rolf Turner
In reply to this post by curtburk
On 25/05/15 09:34, Curtis Burkhalter wrote:

> I've tried using the following code to accomplish my problem:
>
> tt <- rbind(data07, data08)
>
> tt.dup <- duplicated(tt[,4:5]) # marks all duplicate rows in data08 from
>                                # the last 2 cols, which correspond to the
>                                # lat/long


I get tt.dup to be:

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[13] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE

>
> tt.dup <- tt.dup[-seq_len(nrow(data07))] # remove all data07 entries (first n)

This just throws away the first 10 entries of tt.dup, leaving

 [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE

>
> test=ddata08[tt.dup, ] # index only TRUE/duplicated elements from data08
        ^

This leaves the c(2,4,5,6,8,10) entries of data08.
>
> When I run the code 'tt.dup' is FALSE for all entries, which I know isn't
> true.

Only 4 of the entries of tt.dup are FALSE; 6 are TRUE.  I don't
understand why you think that they are all FALSE.

Perhaps your subsets do not accurately reflect the actual nature of your
data.
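The run described above can be reproduced roughly as follows, with the two ten-row subsets typed in by hand from the printed tables in the original post. Those printed values are rounded for display, so this only shows how the code behaves on the subsets as posted; the full-precision values in the real data frames may well differ.

```r
# First ten rows of each data frame, typed in from the printed subsets.
data07 <- data.frame(
  cell = c(710228L, 715545L, 720690L, 720824L, 695611L,
           700490L, 700626L, 705371L, 705507L, 710363L),
  prN  = c(288L, 276L, 286L, 304L, 258L, 257L, 264L, 272L, 286L, 316L),
  Location = "Sage",
  Xcor = c(rep(-111.044, 4), rep(-111.043, 6)),
  Ycor = c(41.7403, 41.7245, 41.7131, 41.7109, 41.7766,
           41.7653, 41.7630, 41.7517, 41.7495, 41.7381)
)
data08 <- data.frame(
  cell = c(705528L, 710321L, 710456L, 715677L, 720762L,
           720896L, 699953L, 700635L, 700771L, 705664L),
  prN  = c(293L, 281L, 299L, 278L, 276L, 266L, 282L, 255L, 287L, 280L),
  Location = "Sage",
  Xcor = c(rep(-111.044, 6), rep(-111.043, 4)),
  Ycor = c(41.7517, 41.7403, 41.7381, 41.7245, 41.7131,
           41.7109, 41.7767, 41.7653, 41.7631, 41.7495)
)

tt <- rbind(data07, data08)
tt.dup <- duplicated(tt[, 4:5])            # TRUE where an earlier row has the same lat/long
tt.dup <- tt.dup[-seq_len(nrow(data07))]   # keep only the flags for the data08 rows
test <- data08[tt.dup, ]                   # note: data08, not ddata08 as typed in the post

which(tt.dup)   # 2 4 5 6 8 10
```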

cheers,

Rolf Turner



--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
Home phone: +64-9-480-4619
