|
Hi,
I have a data.table (10,000k rows) with 20 (factor) fields, and I need to filter the data according to some of them. I use this data.table inside a function, and I don't know in advance which fields I'll use to filter the data and to sum.

So, for example, consider a data.table (named dt_data) with 20 fields, named f1, f2, ..., f20.

I use this approach: I set the key on the field I have to use, for example f2. Then I filter the data and use the result to do some computations.

Subsequently, from these computations, I discover which fields I have to filter on next, for example f4 and f5. Now I set the key of dt_data to (f4, f5), and so on.

I use this approach because I don't know whether it's possible to set the key on all fields f1, f2, ..., f20 in advance and then use only some of them!

Is there a better way to use data.table?

thanks
jj
_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
You don't necessarily have to use keys at all. When you aggregate and give the by columns, they don't have to be key columns of the data.table. This is called an "ad-hoc by". On its own it is slightly slower than a keyed by, but my intuition says that it isn't really any slower once you count the time spent setting the key.

When you add a key, you sort the table by those fields, and you incur a time cost for that. If you consistently work with the same keys, you may make up for that cost later on. But for multiple different groupings, the ad-hoc by is probably faster. Do some timings to see. Some simple ones I did show that the sort alone is slower than the ad-hoc by.

On 25 August 2011 11:05, Jean Jacques Dureau <[hidden email]> wrote:
> [...] |
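A minimal sketch of the kind of timing comparison Chris suggests. The table here is made up for illustration (two factor columns and a numeric column `v`, and only one million rows so it runs quickly); it is not JJ's data, and the timings will depend on the machine, the data.table version, and the shape of the data.

```r
library(data.table)

set.seed(1)
n <- 1e6  # assumption: far fewer rows than in the question, to keep the run short
dt_data <- data.table(
  f1 = factor(sample(letters[1:5],  n, replace = TRUE)),
  f2 = factor(sample(letters[1:10], n, replace = TRUE)),
  v  = runif(n)
)

# Ad hoc by: no key is set, the grouping columns are simply named in `by`.
t_adhoc <- system.time(
  res_adhoc <- dt_data[, list(total = sum(v)), by = list(f1, f2)]
)

# Keyed by: pay for the sort first, then group on the key columns.
dt_keyed <- copy(dt_data)  # copy so the original row order is left untouched
t_setkey <- system.time(setkey(dt_keyed, f1, f2))
t_keyed  <- system.time(
  res_keyed <- dt_keyed[, list(total = sum(v)), by = key(dt_keyed)]
)

# Elapsed seconds for each step; compare adhoc against setkey + keyed.
print(c(adhoc  = t_adhoc[["elapsed"]],
        setkey = t_setkey[["elapsed"]],
        keyed  = t_keyed[["elapsed"]]))
```

The fair comparison for a one-off grouping is `adhoc` against `setkey + keyed`; the key only pays off if the same key is reused several times.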
|
JJ,
Yes, Chris is spot on.
Keyed by should be faster when the size of each group is large; e.g., a 1 billion row data.table of 1,000 groups. See FAQ 3.3 for why.
However, in your example, ad hoc by does seem more appropriate.
Matthew

On Thu, 2011-08-25 at 11:17 -0400, Chris Neff wrote:
> [...] |
|
Also, it makes a difference whether or not the groups happen to be contiguous in the table.

Try creating a large table with large groups, where each group is scattered throughout the table non-contiguously. Time an ad hoc by. Then set a key, remove the key, and time the ad hoc by again. The 2nd ad hoc by should be much faster. Then set the key again and time a keyed by; it should be faster still.

Does that illustrate what's going on?

On Thu, 2011-08-25 at 19:18 +0100, Matthew Dowle wrote:
> [...] |
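The experiment Matthew describes could be sketched roughly as follows. The table size, the number of groups, and the column names `g` and `v` are assumptions chosen for illustration. How large the differences are, and whether they show up at all, may vary with the data.table version and the hardware.

```r
library(data.table)

set.seed(42)
n        <- 2e6   # assumption: large enough to time, small enough to run quickly
n_groups <- 100   # few groups, so each group is large

# Group ids are drawn at random, so each group's rows are scattered
# throughout the table rather than stored contiguously.
DT <- data.table(g = sample.int(n_groups, n, replace = TRUE), v = runif(n))

# 1. Ad hoc by on scattered groups.
t_scattered <- system.time(r1 <- DT[, list(total = sum(v)), by = g])

# 2. Set a key (this sorts the rows by g), then remove it. The rows stay
#    sorted, so the groups are now contiguous even though the table is unkeyed.
setkey(DT, g)
setkey(DT, NULL)
t_contiguous <- system.time(r2 <- DT[, list(total = sum(v)), by = g])

# 3. Set the key again and group on the key column.
setkey(DT, g)
t_keyed <- system.time(r3 <- DT[, list(total = sum(v)), by = key(DT)])

print(c(scattered  = t_scattered[["elapsed"]],
        contiguous = t_contiguous[["elapsed"]],
        keyed      = t_keyed[["elapsed"]]))
```

All three calls compute the same group totals; only the physical row order and the presence of a key differ between them.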
|
Dear Chris and Matthew,
thanks for the fantastic explanation that you gave me!

I am very satisfied with the processing time of the "ad hoc by". I just wanted to confirm that, working without setting keys on the data.table, I was still making use of the potential of this library! So, you confirmed to me that my approach is not wrong.

I noticed, in fact, that with 7,000k rows, 7 factors to group by (f1, ..., f7) and a variable to sum (f8), I get:

processing time of DT[, sum(f8), by = ("f1, f2, f3, f4, f5, f6, f7")]
<
processing time of setkey(DT, f1, f2, f3, f4, f5, f6, f7) and DT[, sum(f8), by = key(DT)]

Thank you very much, and I congratulate the developers! Considering that statisticians are increasingly working with large data, I think it's one of the most interesting R libraries!

jj

2011/8/25 Matthew Dowle <[hidden email]>:
> [...] |
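For the situation in the original question, where the fields to filter and group on are only known at run time inside a function, one possible shape is sketched below. The helper `sum_by` and its arguments are hypothetical names invented for this example, not part of data.table; the sketch only relies on `by` accepting a character vector of column names and on `.SDcols` selecting the column to sum.

```r
library(data.table)

# Hypothetical helper: sum `value_col` within the groups defined by
# `group_cols`, optionally keeping only rows where `filter_col` is in `keep`.
# No key is set at any point.
sum_by <- function(dt, group_cols, value_col, filter_col = NULL, keep = NULL) {
  if (!is.null(filter_col)) {
    rows <- dt[[filter_col]] %in% keep  # plain vector scan on one column
    dt <- dt[rows]
  }
  # `by` takes the column names as a character vector (ad hoc by);
  # .SDcols restricts .SD to the single column being summed.
  dt[, lapply(.SD, sum), by = group_cols, .SDcols = value_col]
}

# Tiny made-up table using the thread's column names.
DT <- data.table(
  f1 = factor(c("a", "a", "b", "b", "a")),
  f2 = factor(c("x", "y", "x", "y", "x")),
  f8 = c(1, 2, 3, 4, 5)
)

by_f1_f2 <- sum_by(DT, c("f1", "f2"), "f8")
by_f2_a  <- sum_by(DT, "f2", "f8", filter_col = "f1", keep = "a")
print(by_f1_f2)
print(by_f2_a)
```

Because the grouping columns are passed as a character vector, the same function can be called again with a different set of fields (for example f4 and f5) without re-sorting the table in between.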
