Hi all,

I'm really new to data.table and something really simple has me stumped.

Let's say I have the following (but much bigger, so timing matters):

    dt = data.table(x=1:100,y=1:2,z=1:4,key="y,z")

From the documentation, I understand that

    dt[J(1,3)]

is significantly faster than

    dt[y==1 & z==3]

and I could do

    dt[J(1)]

instead of

    dt[y==1]

but is there any way to do

    dt[z==3]

faster? I want to do something like

    dt[J( ,3)]

but I know that doesn't make sense. Is it because z is not the primary key that I can't seem to figure out how to use J to do this? Since it isn't sorted on z anyway, I doubt I can get a speed-up, right?

_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Hi Chris,

Welcome.

That's a 'secondary key'. FR#1007 is to build in secondary keys:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1007&group_id=240&atid=978

In the meantime you can do a 'manual' secondary key:

    idx = dt[,list(z,y,i=1:nrow(dt))]
    setkey(idx,z,y)
    dt[idx[J(3),i]$i]

Btw, that [,i]$i is ugly and I'm coming around to the idea of making that more consistent, as requested by a previous poster (can't find it now).

I'm thinking it should work like this:

    DT[c("a","b"),j]
    # always returns a vector, for consistency, even though two groups are
    # joined via the mult="all" default and the correspondence might be
    # lost. Then mult="first" and mult="last" would return a vector too,
    # consistent with mult="all".

    DT[c("a","b"),list(j)]
    # list() needed to retain the group columns and return a data.table
    # rather than a vector. Same type (i.e. data.table) returned for all
    # values of mult.

Would that be better? Throwing that out to all.

Btw, posting from googlegroups does work then, that's good. This thread should be mirrored in all places; it shouldn't matter where you post from or to, but if there are any problems please let me know, as it's the first time.

Matthew

On Tue, 2011-07-12 at 16:21 -0700, Chris Neff wrote:
> but is there any way to do
>
>     dt[z==3]
>
> faster?

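The 'manual' secondary key described above can be sketched against a current data.table release roughly as follows. This is a minimal sketch, not the code from the message: the position column is called `pos` instead of `i` (a hypothetical name, chosen only to avoid confusion with the `i` argument), and it assumes a version in which `idx[J(3L), pos]` returns a plain integer vector, so the trailing `$i` is not needed.

```r
library(data.table)

# Table from the original question, keyed by (y, z).
dt <- data.table(x = 1:100, y = 1:2, z = 1:4, key = "y,z")

# Small lookup table: one row per row of dt, recording that row's position.
# 'pos' is an illustrative column name, not part of data.table.
idx <- dt[, list(z, y, pos = seq_len(nrow(dt)))]

# Key the lookup table with z first, so z alone can be binary searched.
setkey(idx, z, y)

# Binary search idx for z == 3, then pull the matching rows of dt by position.
rows <- idx[J(3L), pos]
res  <- dt[rows]

# dt itself has not been re-keyed.
key(dt)
```

The cost is one extra table holding the two key columns plus an integer position column, which has to be rebuilt whenever the rows of dt are reordered, added, or removed.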
Hi Matt,
I thought multiple key indexing was already supported in data.table(). For example:

    dt1 <- data.table(x=1:100, y=1:2, z=1:4, key="y, z")
    dt2 <- data.table(a=100:1, b=2:1, c=4:1, key="b, c")

    dt2[dt1]
    #       b c  a x
    #  [1,] 1 1 97 1
    #  [2,] 1 1 93 1
    #  [3,] 1 1 89 1
    #  [4,] 1 1 85 1
    #  [5,] 1 1 81 1
    #  [6,] 1 1 77 1
    #  [7,] 1 1 73 1
    #  [8,] 1 1 69 1
    #  [9,] 1 1 65 1
    # [10,] 1 1 61 1
    # First 10 rows of 2500 printed.

    dt1[J(c(1, 1))]
    # y  x z
    # 1  1 1
    # 1  5 1
    # 1  9 1
    # 1 13 1
    # 1 17 1
    # 1 21 1
    # 1 25 1
    # First 7 rows printed.

    dt1[c(1, 1)]
    #      x y z
    # [1,] 1 1 1
    # [2,] 1 1 1

That seems to work as expected. I don't get why you and Chris are suggesting a fix is necessary. Could you maybe clarify for the rest of us?

Many thanks,
--Mel.

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Matthew Dowle
Sent: Tuesday, July 12, 2011 8:54 PM
To: Chris Neff
Cc: [hidden email]
Subject: Re: [datatable-help] Select from second key but not first

[...]

Hi,
Those are single keys of multiple columns. Given dt keyed by (y,z), Chris wanted to binary search on z without changing the key, and without copying the whole table just to put a different key on it, if I understand correctly.

The suggestion about a consistent return type for a single-name j was unrelated to secondary keys per se; it just jogged my memory.

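To illustrate the distinction: a single key on (y, z) supports binary search on y alone, or on y and z together, but it does not help with a lookup on z alone. The sketch below assumes a later data.table release that provides `setindex()`, `indices()` and the `on=` argument, none of which are mentioned in this thread; under that assumption, an index on z can be added without changing the key or copying the table.

```r
library(data.table)

dt <- data.table(x = 1:100, y = 1:2, z = 1:4, key = "y,z")

# One key made of two columns: look up the leading column, or both columns.
by_y  <- dt[J(1L)]       # y == 1
by_yz <- dt[J(1L, 3L)]   # y == 1 and z == 3

# Add a secondary index on z; the key and the row order are left alone.
setindex(dt, z)

# Look up z == 3 through 'on' rather than through the key.
by_z <- dt[J(3L), on = "z"]

key(dt)      # still the (y, z) key
indices(dt)  # includes the index on z
```

Compared with the hand-built lookup table, this keeps the index attached to dt itself, so there is no separate object to keep in sync.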