extract duplications from list


extract duplications from list

frymor
Hi,

I have a list of 40 data.frames.

I would like to identify duplicated entries in the whole list, not only in
one specific data.frame, but in all 40.

Here is my list:

> myList
[[1]]
            X        NAME  MEM.SHIP
1 FBgn0000008 FBgn0000008 0.9304502
2 FBgn0000014 FBgn0000014 1.0000000
3 FBgn0000028 FBgn0000028 1.0000000
4 FBgn0000109 FBgn0000109 1.0000000
5 FBgn0000114 FBgn0000114 0.4839886
6 FBgn0000120 FBgn0000120 1.0000000

[[2]]
            X        NAME  MEM.SHIP
1 FBgn0000251 FBgn0000251 0.3138650
2 FBgn0001168 FBgn0001168 0.8995011
3 FBgn0001941 FBgn0001941 0.7485548
4 FBgn0003053 FBgn0000028 0.4426997
5 FBgn0003159 FBgn0003159 0.4843226
6 FBgn0000120 FBgn0003162 0.6556290


I would like to know whether there are duplicated entries in the first
and/or second column across all data.frames. In this list I have two
duplications: one is FBgn0000120, in row 6 of both data.frames, and the
second is FBgn0000028, in row 3 of df1 and row 4 of df2.


Is there a way to do this? With unique() I don't get any results, and I
cannot convert the list into a data.frame, as the number of items in each df
is different.

Thanks
Assa

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: extract duplications from list

jholtman
Here is one way of doing it:


##########################
> files <- list(file1 = "            X        NAME  MEM.SHIP
+ 1 FBgn0000008 FBgn0000008 0.9304502
+ 2 FBgn0000014 FBgn0000014 1.0000000
+ 3 FBgn0000028 FBgn0000028 1.0000000
+ 4 FBgn0000109 FBgn0000109 1.0000000
+ 5 FBgn0000114 FBgn0000114 0.4839886
+ 6 FBgn0000120 FBgn0000120 1.0000000",
+ file2 = "            X        NAME  MEM.SHIP
+ 1 FBgn0000251 FBgn0000251 0.3138650
+ 2 FBgn0001168 FBgn0001168 0.8995011
+ 3 FBgn0001941 FBgn0001941 0.7485548
+ 4 FBgn0003053 FBgn0000028 0.4426997
+ 5 FBgn0003159 FBgn0003159 0.4843226
+ 6 FBgn0000120 FBgn0003162 0.6556290")
>
> # read in all the "files" (dummies in this case)
> # append file name
> allFiles <- do.call(rbind, lapply(names(files), function(.name){
+     input <- read.table(text = files[[.name]], as.is = TRUE)
+     input$file <- .name
+     input  # return value
+ }))
>
> # function to mark all duplicate entries
> allDup <-
+ function (value)
+ {
+     duplicated(value) | duplicated(value, fromLast = TRUE)
+ }
> allFiles$col1 <- allDup(allFiles$X)
> allFiles$col2 <- allDup(allFiles$NAME)
> allFiles
             X        NAME  MEM.SHIP  file  col1  col2
1  FBgn0000008 FBgn0000008 0.9304502 file1 FALSE FALSE
2  FBgn0000014 FBgn0000014 1.0000000 file1 FALSE FALSE
3  FBgn0000028 FBgn0000028 1.0000000 file1 FALSE  TRUE
4  FBgn0000109 FBgn0000109 1.0000000 file1 FALSE FALSE
5  FBgn0000114 FBgn0000114 0.4839886 file1 FALSE FALSE
6  FBgn0000120 FBgn0000120 1.0000000 file1  TRUE FALSE
11 FBgn0000251 FBgn0000251 0.3138650 file2 FALSE FALSE
21 FBgn0001168 FBgn0001168 0.8995011 file2 FALSE FALSE
31 FBgn0001941 FBgn0001941 0.7485548 file2 FALSE FALSE
41 FBgn0003053 FBgn0000028 0.4426997 file2 FALSE  TRUE
51 FBgn0003159 FBgn0003159 0.4843226 file2 FALSE FALSE
61 FBgn0000120 FBgn0003162 0.6556290 file2  TRUE FALSE
>
############################################


Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
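
The same idea can be applied directly to a list of data.frames rather than to text "files". The following is only a sketch under some assumptions: the data are a trimmed subset of the example from the original post, the two data.frames deliberately have different numbers of rows, and the names tagged, allRows, df.index and dupRows are made up for illustration. rbind() stacks data.frames by row, so differing row counts should not be a problem as long as the column names match.

```r
# Trimmed subset of the thread's example; note the different row counts
myList <- list(
  data.frame(X = c("FBgn0000008", "FBgn0000028", "FBgn0000120"),
             NAME = c("FBgn0000008", "FBgn0000028", "FBgn0000120"),
             MEM.SHIP = c(0.9304502, 1, 1),
             stringsAsFactors = FALSE),
  data.frame(X = c("FBgn0003053", "FBgn0000120"),
             NAME = c("FBgn0000028", "FBgn0003162"),
             MEM.SHIP = c(0.4426997, 0.6556290),
             stringsAsFactors = FALSE)
)

# Tag each row with the position of its data.frame in the list, then stack
tagged <- Map(function(df, i) cbind(df, df.index = i),
              myList, seq_along(myList))
allRows <- do.call(rbind, tagged)

# Mark every occurrence of a duplicated value, including the first one
allDup <- function(value) {
  duplicated(value) | duplicated(value, fromLast = TRUE)
}

# Keep only rows whose X or NAME value occurs more than once in that column
dupRows <- allRows[allDup(allRows$X) | allDup(allRows$NAME), ]
dupRows
```

The df.index column records which list element each duplicated row came from, which may be easier to scan than the full table when there are 40 data.frames.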


On Mon, May 19, 2014 at 8:48 AM, Assa Yeroslaviz <[hidden email]> wrote:

> [...]

Re: extract duplications from list

arun kirshna
In reply to this post by frymor
Hi,

You may try:

myList <- list(structure(list(X = c("FBgn0000008", "FBgn0000014", "FBgn0000028",
"FBgn0000109", "FBgn0000114", "FBgn0000120"), NAME = c("FBgn0000008",
"FBgn0000014", "FBgn0000028", "FBgn0000109", "FBgn0000114", "FBgn0000120"
), MEM.SHIP = c(0.9304502, 1, 1, 1, 0.4839886, 1)), .Names = c("X",
"NAME", "MEM.SHIP"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6")), structure(list(X = c("FBgn0000251",
"FBgn0001168", "FBgn0001941", "FBgn0003053", "FBgn0003159", "FBgn0000120"
), NAME = c("FBgn0000251", "FBgn0001168", "FBgn0001941", "FBgn0000028",
"FBgn0003159", "FBgn0003162"), MEM.SHIP = c(0.313865, 0.8995011,
0.7485548, 0.4426997, 0.4843226, 0.655629)), .Names = c("X",
"NAME", "MEM.SHIP"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6")))

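library(data.table)
# stack all data.frames in the list into one data.table
dt1 <- rbindlist(myList)
# flag every occurrence of a duplicated value, including the first
fun1 <- function(val) duplicated(val) | duplicated(val, fromLast = TRUE)
# add logical columns Col1 and Col2 flagging duplicates in columns 1 and 2
dt1[, paste0("Col", 1:2) := lapply(.SD, fun1), .SDcols = 1:2]
dt1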

A.K.
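
A possible variation, sketched here under assumptions, keeps track of which data.frame each row came from and returns only the duplicated rows. It relies on the idcol argument of rbindlist(), which is available in more recent data.table versions than the one current when this thread was written. The data are a trimmed subset of the original example, and the names df.index and dups are hypothetical.

```r
library(data.table)

# Trimmed subset of the thread's example; the row counts differ on purpose
myList <- list(
  data.frame(X = c("FBgn0000008", "FBgn0000028", "FBgn0000120"),
             NAME = c("FBgn0000008", "FBgn0000028", "FBgn0000120"),
             MEM.SHIP = c(0.9304502, 1, 1),
             stringsAsFactors = FALSE),
  data.frame(X = c("FBgn0003053", "FBgn0000120"),
             NAME = c("FBgn0000028", "FBgn0003162"),
             MEM.SHIP = c(0.4426997, 0.6556290),
             stringsAsFactors = FALSE)
)

# idcol adds a column holding the position of each data.frame in the list
dt <- rbindlist(myList, idcol = "df.index")

# Flag every occurrence of a duplicated value, including the first one
fun1 <- function(val) duplicated(val) | duplicated(val, fromLast = TRUE)
dt[, c("Col1", "Col2") := lapply(.SD, fun1), .SDcols = c("X", "NAME")]

# Keep only rows flagged in either column
dups <- dt[Col1 | Col2]
dups
```

Selecting the columns by name in .SDcols, rather than by position, is a deliberate choice here because idcol inserts a new first column.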



