Quantcast

deduplication

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

deduplication

Epi-schnier
Colleagues,

I am trying to de-duplicate a large (long) database (approx 1mil records) of diagnostic tests. Individuals in the database can have up-to 25 observations, but most will have only one. IDs for de-duplication (names, sex, lab number...) are patchy. In a first step, I am using Andreas Borg's excellent record linkage package (), that leaves me with a list of 'pairs' looking very much like this:
id1<-c(4,17,9,1,1,1,3,3,6,15,1,1,1,1,3,3,3,3,4,4,4,5,5,12,9,9,10,10)
id2<-c(8,18,10,3,6,7,6,7,7,16,4,5,12,18,4,5,12,18,5,12,18,12,18,18,15,16,15,16)
id<-data.frame(cbind(id1,id2))
where a pair means that the records belong to the same individual (e.g., record 4 and record 8; 17 and 18...). My problem now is to get a list with all records that belong to the same person (in the example, obervations 1,3,4,5,6,7,8,12, 17 and 18 are all from the same person). The problem is to find the link between 1 and 8 (only through 1 and 4 and 4 and 8) and the link between 1 and 17 (through 18). I can do it in my head, but I am missing the code that would work its way through too many records.  

Any clever ideas?
(using R 2.10.1 on Windows XP)

Thanks,

Christian

 
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: deduplication

cybaea
Maybe something like the following will get you started:

library("igraph")
g <- graph.data.frame(id, directed=FALSE)
neighborhood(g, +Inf)

There is perhaps a more efficient way, but I hope this helps a little.

Allan.



On 03/06/10 14:14, Epi-schnier wrote:

> Colleagues,
>
> I am trying to de-duplicate a large (long) database (approx 1mil records) of
> diagnostic tests. Individuals in the database can have up-to 25
> observations, but most will have only one. IDs for de-duplication (names,
> sex, lab number...) are patchy. In a first step, I am using Andreas Borg's
> excellent record linkage package (), that leaves me with a list of 'pairs'
> looking very much like this:
> id1<-c(4,17,9,1,1,1,3,3,6,15,1,1,1,1,3,3,3,3,4,4,4,5,5,12,9,9,10,10)
> id2<-c(8,18,10,3,6,7,6,7,7,16,4,5,12,18,4,5,12,18,5,12,18,12,18,18,15,16,15,16)
> id<-data.frame(cbind(id1,id2))
> where a pair means that the records belong to the same individual (e.g.,
> record 4 and record 8; 17 and 18...). My problem now is to get a list with
> all records that belong to the same person (in the example, obervations
> 1,3,4,5,6,7,8,12, 17 and 18 are all from the same person). The problem is to
> find the link between 1 and 8 (only through 1 and 4 and 4 and 8) and the
> link between 1 and 17 (through 18). I can do it in my head, but I am missing
> the code that would work its way through too many records.
>
> Any clever ideas?
> (using R 2.10.1 on Windows XP)
>
> Thanks,
>
> Christian
>
>
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: deduplication

Wu Gong
In reply to this post by Epi-schnier
Please try this

## Import data
id1<-c(4,17,9,1,1,1,3,3,6,15,1,1,1,1,3,3,3,3,4,4,4,5,5,12,9,9,10,10)
id2<-c(8,18,10,3,6,7,6,7,7,16,4,5,12,18,4,5,12,18,5,12,18,12,18,18,15,16,15,16)
id<-data.frame(id1 = id1, id2 = id2)

## Create same structure table
id <- id0 <- unique(id)
leng <- nrow(id)

n <- 0
repeat {
        if (n == leng) {break}
        n <- 0
        id <- id[order(-id$id1, -id$id2),]
        for (i in 1:leng) {
                if (id$id1[i] == id$id2[i]) {
                n <- n+1
                next }
                smal <- min(id[i,])
                larg <- max(id[i,])
                id$id2[which(id$id2 == larg)] <- smal
                id$id1[which(id$id1 == larg)] <- smal
                }}

## Create results
tab <- table(as.matrix(id0), as.matrix(id[order(as.numeric(rownames(id))),]))
res <- list()
for (i in 1:ncol(tab)) {
        res[[i]] <- rownames(tab[(tab[,i] != 0),])}
res
Loading...