Tried this now .. I guess using nomatch? ..
dt_ind_fin_landscape[dt_ind_dev_landscape, nomatch=NA] And then filtering for NAs? _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
Hi Santosh,
On Wed, May 4, 2011 at 10:12 AM, sant <[hidden email]> wrote: > Tried this now .. I guess using nomatch? .. > dt_ind_fin_landscape[dt_ind_dev_landscape, nomatch=NA] > And then filtering for NAs? The context of your message seems to be lost, and I don't see it in (my) mailing list archives. Are you asking a question, or answering one? If you're asking a question -- what is it? :-) -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
Hi,
On Wed, May 4, 2011 at 12:23 PM, Santosh Srinivas <[hidden email]> wrote: > Hi Steve, > > Sorry ... strange problem .. Dont know why that happened. > > http://groups.google.com/group/datatable/browse_thread/thread/51a0387e95d37feb It looks like your first email was sent to the @googlegroups.com address (I didn't even know we had that setup), and the second one came through the @lists.r-forge.r-project. So (I guess) the first didn't come through because it was sent to the wrong(?) list -- anyway, in the future you should send to the @lists.r-forge... one. > I had the question and my attempt to answer before someone says go read the > manual :) It looks like the answer you offered is reasonable, though. In short -- the question was "How can I quickly tell which (keyed) rows are in one data.table vs. another)". As you mentioned, you can do this by joining using `[` -- in order to do this easily, you could ensure that each data.table has a column that isn't in the other. For example, if you have data like so: R> dt1 <- data.table(a=1:10, b=letters[1:10], key="a,b") R> dt2 <- data.table(a=c(1, 3, 5, 10), b=letters[c(1, 3, 5, 10)], key="a,b") Doing either `dt1[dt2]` or `dt2[dt1]` doesn't get you anywhere too fast (especially if one is just a subset of the other (like dt2 is to dt1): R> dt1[dt2] a b [1,] 1 a [2,] 3 c [3,] 5 e [4,] 10 j R> dt2[dt1] a b [1,] 1 a [2,] 2 b [3,] 3 c [4,] 4 d [5,] 5 e [6,] 6 f [7,] 7 g [8,] 8 h [9,] 9 i [10,] 10 j Adding some 'dummy' columns may help: R> dt1$in.1 <- TRUE R> dt2$in.2 <- TRUE Then you can (easily) ask which rows are in dt1 that aren't in dt2: R> dt2[dt1] ## nomatch=NA is the default a b in.2 in.1 [1,] 1 a TRUE TRUE [2,] 2 b NA TRUE [3,] 3 c TRUE TRUE [4,] 4 d NA TRUE [5,] 5 e TRUE TRUE [6,] 6 f NA TRUE [7,] 7 g NA TRUE [8,] 8 h NA TRUE [9,] 9 i NA TRUE [10,] 10 j TRUE TRUE ## or more email friendly format: R> which(is.na(dt2[dt1]$in.2)) [1] 2 4 6 7 8 9 Which are the rows in dt1 that aren't in dt2 HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
The original post from Santosh came through as a BCC. I guess
GoogleGroups did the BCC. Will need to do more investigation. > Which are the rows in dt1 that aren't in dt2 Another option may be a 'not join'; e.g., X[-X[Y,which=TRUE]] or seq(1,nrow(X))[-X[Y,which=TRUE]] Will add something to docs/wiki re 'not joins'. Matthew On Wed, 2011-05-04 at 13:00 -0400, Steve Lianoglou wrote: > Hi, > > On Wed, May 4, 2011 at 12:23 PM, Santosh Srinivas > <[hidden email]> wrote: > > Hi Steve, > > > > Sorry ... strange problem .. Dont know why that happened. > > > > http://groups.google.com/group/datatable/browse_thread/thread/51a0387e95d37feb > > It looks like your first email was sent to the @googlegroups.com > address (I didn't even know we had that setup), and the second one > came through the @lists.r-forge.r-project. > > So (I guess) the first didn't come through because it was sent to the > wrong(?) list -- anyway, in the future you should send to the > @lists.r-forge... one. > > > I had the question and my attempt to answer before someone says go read the > > manual :) > > It looks like the answer you offered is reasonable, though. > > In short -- the question was "How can I quickly tell which (keyed) > rows are in one data.table vs. another)". > > As you mentioned, you can do this by joining using `[` -- in order to > do this easily, you could ensure that each data.table has a column > that isn't in the other. > > For example, if you have data like so: > > > R> dt1 <- data.table(a=1:10, b=letters[1:10], key="a,b") > R> dt2 <- data.table(a=c(1, 3, 5, 10), b=letters[c(1, 3, 5, 10)], key="a,b") > > Doing either `dt1[dt2]` or `dt2[dt1]` doesn't get you anywhere too > fast (especially if one is just a subset of the other (like dt2 is to > dt1): > > R> dt1[dt2] > a b > [1,] 1 a > [2,] 3 c > [3,] 5 e > [4,] 10 j > > R> dt2[dt1] > a b > [1,] 1 a > [2,] 2 b > [3,] 3 c > [4,] 4 d > [5,] 5 e > [6,] 6 f > [7,] 7 g > [8,] 8 h > [9,] 9 i > [10,] 10 j > > Adding some 'dummy' columns may help: > > R> dt1$in.1 <- TRUE > R> dt2$in.2 <- TRUE > > Then you can (easily) ask which rows are in dt1 that aren't in dt2: > > R> dt2[dt1] ## nomatch=NA is the default > a b in.2 in.1 > [1,] 1 a TRUE TRUE > [2,] 2 b NA TRUE > [3,] 3 c TRUE TRUE > [4,] 4 d NA TRUE > [5,] 5 e TRUE TRUE > [6,] 6 f NA TRUE > [7,] 7 g NA TRUE > [8,] 8 h NA TRUE > [9,] 9 i NA TRUE > [10,] 10 j TRUE TRUE > > ## or more email friendly format: > R> which(is.na(dt2[dt1]$in.2)) > [1] 2 4 6 7 8 9 > > Which are the rows in dt1 that aren't in dt2 > > HTH, > -steve > _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
Read again and realised the 'not join' (although correct) solves a
different problem; i.e., finding the rows in dt2 that no row in dt1 matches to, not the rows in dt1 that don't match to any row in dt2. So, second attempt ... Create 2 tables Steve L created : > dt1 = data.table(a=1:10,b=letters[1:10],key="a,b") > dt2 = data.table(a=c(1,3,5,10),b=letters[c(1,3,5,10)],key="a,b") then (without dummy columns needed, so faster and more convenient) : > which(is.na(dt2[dt1,which=TRUE])) [1] 2 4 6 7 8 9 Which are the rows in dt1 that aren't in dt2. This works in this case because when a row in dt1 matches to a row in (this) dt2, it only matches to one row in dt2 since this dt2's key is unique, and, this join is using all the columns in dt2's key. To robustly cope with multiple matches : > which(is.na(dt2[dt1,which=TRUE,mult="first"])) [1] 2 4 6 7 8 9 Testing that by creating a dup in dt2 : > dt2 = data.table(a=c(1,3,5,5,10),b=letters[c(1,3,5,5,10)],key="a,b") > dt2 a b [1,] 1 a [2,] 3 c [3,] 5 e [4,] 5 e [5,] 10 j > which(is.na(dt2[dt1,which=TRUE])) [1] 2 4 7 8 9 10 # unintended result > which(is.na(dt2[dt1,which=TRUE,mult="first"])) [1] 2 4 6 7 8 9 # intended result (same as before) > Admittedly, that's ugly, but it should be fast and scale. Perhaps there is scope for a new 'whichna' argument then that does that directly : > dt2[dt1,whichna=TRUE] [1] 2 4 6 7 8 9 > and maybe a new 'not' argument : > dt2 = data.table(a=c(1,3,5,5,10,11),b=letters[c(1,3,5,5,10,11)],key="a,b") > dt2 a b [1,] 1 a [2,] 3 c [3,] 5 e [4,] 5 e [5,] 10 j [6,] 11 k # extra row not in dt1 > dt2[-dt2[dt1,which=TRUE,nomatch=0]] # not join a b [1,] 11 k > dt2[dt1,not=TRUE] # more user friendly a b [1,] 11 k > seq(1,nrow(dt2))[-dt2[dt1,which=TRUE,nomatch=0]] # which not [1] 6 > dt2[dt1,not=TRUE,which=TRUE] # more user friendly [1] 6 Setting both not=TRUE and whichna=TRUE wouldn't make sense and would be prevented. Thoughts anyone? Matthew On Sat, 2011-05-07 at 08:03 +0100, Matthew Dowle wrote: > The original post from Santosh came through as a BCC. I guess > GoogleGroups did the BCC. Will need to do more investigation. > > > Which are the rows in dt1 that aren't in dt2 > Another option may be a 'not join'; e.g., > X[-X[Y,which=TRUE]] > or > seq(1,nrow(X))[-X[Y,which=TRUE]] > > Will add something to docs/wiki re 'not joins'. > > Matthew > > > On Wed, 2011-05-04 at 13:00 -0400, Steve Lianoglou wrote: > > Hi, > > > > On Wed, May 4, 2011 at 12:23 PM, Santosh Srinivas > > <[hidden email]> wrote: > > > Hi Steve, > > > > > > Sorry ... strange problem .. Dont know why that happened. > > > > > > http://groups.google.com/group/datatable/browse_thread/thread/51a0387e95d37feb > > > > It looks like your first email was sent to the @googlegroups.com > > address (I didn't even know we had that setup), and the second one > > came through the @lists.r-forge.r-project. > > > > So (I guess) the first didn't come through because it was sent to the > > wrong(?) list -- anyway, in the future you should send to the > > @lists.r-forge... one. > > > > > I had the question and my attempt to answer before someone says go read the > > > manual :) > > > > It looks like the answer you offered is reasonable, though. > > > > In short -- the question was "How can I quickly tell which (keyed) > > rows are in one data.table vs. another)". > > > > As you mentioned, you can do this by joining using `[` -- in order to > > do this easily, you could ensure that each data.table has a column > > that isn't in the other. > > > > For example, if you have data like so: > > > > > > R> dt1 <- data.table(a=1:10, b=letters[1:10], key="a,b") > > R> dt2 <- data.table(a=c(1, 3, 5, 10), b=letters[c(1, 3, 5, 10)], key="a,b") > > > > Doing either `dt1[dt2]` or `dt2[dt1]` doesn't get you anywhere too > > fast (especially if one is just a subset of the other (like dt2 is to > > dt1): > > > > R> dt1[dt2] > > a b > > [1,] 1 a > > [2,] 3 c > > [3,] 5 e > > [4,] 10 j > > > > R> dt2[dt1] > > a b > > [1,] 1 a > > [2,] 2 b > > [3,] 3 c > > [4,] 4 d > > [5,] 5 e > > [6,] 6 f > > [7,] 7 g > > [8,] 8 h > > [9,] 9 i > > [10,] 10 j > > > > Adding some 'dummy' columns may help: > > > > R> dt1$in.1 <- TRUE > > R> dt2$in.2 <- TRUE > > > > Then you can (easily) ask which rows are in dt1 that aren't in dt2: > > > > R> dt2[dt1] ## nomatch=NA is the default > > a b in.2 in.1 > > [1,] 1 a TRUE TRUE > > [2,] 2 b NA TRUE > > [3,] 3 c TRUE TRUE > > [4,] 4 d NA TRUE > > [5,] 5 e TRUE TRUE > > [6,] 6 f NA TRUE > > [7,] 7 g NA TRUE > > [8,] 8 h NA TRUE > > [9,] 9 i NA TRUE > > [10,] 10 j TRUE TRUE > > > > ## or more email friendly format: > > R> which(is.na(dt2[dt1]$in.2)) > > [1] 2 4 6 7 8 9 > > > > Which are the rows in dt1 that aren't in dt2 > > > > HTH, > > -steve > > > > > _______________________________________________ > datatable-help mailing list > [hidden email] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
Powered by Nabble | Edit this page |