help with duplicates

Chris Anderson-14
I have a large dataset that contains duplicate records. How do I identify and remove them?


Chris Anderson

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: help with duplicates

Henrique Dallazuanna
Try this:

d <- data.frame(a = c(1, 1, 2, 3), b = c(10, 10, 9, 8))
unique(d)
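
A minimal sketch of how the same toy data frame can be used both to identify and to remove the repeated rows. The names dup and d_unique are made up for illustration; duplicated() and unique() are base R.

```r
# Same toy data frame as above: rows 1 and 2 are identical.
d <- data.frame(a = c(1, 1, 2, 3), b = c(10, 10, 9, 8))

# duplicated() flags every row that repeats an earlier row.
dup <- duplicated(d)

# Identify: show the rows that would be dropped.
print(d[dup, ])

# Remove: keep only the first occurrence of each row.
d_unique <- d[!dup, ]

# unique(d) should give the same rows.
print(identical(d_unique, unique(d)))
```

Passing fromLast = TRUE to duplicated() keeps the last occurrence of each repeated row instead of the first.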



On Fri, Jun 5, 2009 at 1:38 PM, Chris Anderson <[hidden email]> wrote:

> I have a large dataset that contain duplicate records. How do I identify
> and remove duplicate records?

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O


Re: help with duplicates

Jim Porzak
Chris,

How large is large? How many columns?

"Duplicate" across all columns or just some?

Henrique gave you a simple R answer. Perhaps doing it in SQL would be more efficient?
E.g.:

SELECT DISTINCT
       <stuff>
  FROM <somewhere>;


HTH,
Jim Porzak
TGN.com
San Francisco, CA
www.linkedin.com/in/jimporzak
use R! Group SF: www.meetup.com/R-Users/


On Fri, Jun 5, 2009 at 9:38 AM, Chris Anderson <[hidden email]> wrote:

> I have a large dataset that contain duplicate records. How do I identify
> and remove duplicate records?

Re: help with duplicates

Peter Dalgaard
Chris Anderson wrote:

> I have a large dataset that contain duplicate records. How do I identify and remove duplicate records?
>

Here's one way:

> aq <- airquality[sample(NROW(airquality), replace=TRUE),]
> any(duplicated(aq))
[1] TRUE
> which(duplicated(aq))
 [1]   2  15  34  44  45  47  49  50  52  53  65  75  76  78  83  86  88  90  91
[20]  94  96  98  99 100 103 104 107 108 110 111 112 114 117 119 120 121 122 124
[39] 125 126 127 129 130 132 133 135 137 140 141 143 145 146 147 151 152
> aqs <- subset(aq,!duplicated(aq))
> any(duplicated(aqs))
[1] FALSE
> dim(aqs)
[1] 98  6
> dim(aq)
[1] 153   6

For data frames with many columns you might want to think more carefully
about how you recognize duplicates, and maybe use only a subset of columns.
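
A minimal sketch of the subset-of-columns idea, assuming two records count as duplicates when they share the same Month and Day. That choice of key, the seed, and the names key, dup_key and aq_by_key are only illustrative assumptions.

```r
# airquality ships with base R (datasets package).
set.seed(1)  # arbitrary seed, only so the resample is repeatable
aq <- airquality[sample(NROW(airquality), replace = TRUE), ]

# Assumed key: a record is identified by Month and Day alone.
key <- c("Month", "Day")

# Flag rows whose key columns repeat an earlier row.
dup_key <- duplicated(aq[key])

# Keep the first row seen for each Month/Day combination.
aq_by_key <- aq[!dup_key, ]

# Check that no Month/Day pair occurs twice in the result.
print(any(duplicated(aq_by_key[key])))
```

Only the key columns are compared, so rows that agree on the key but differ elsewhere are still treated as duplicates, and the first one seen is kept.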

--
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - ([hidden email])              FAX: (+45) 35327907
