best way to set keys, when you don't know in advance which fields you will use

jj.dureau
Hi,
I have a data.table (10,000k rows) with 20 factor fields, and I
need to filter the data according to some of them.
I use this data.table inside a function, and I don't know in advance
which fields I'll use to filter the data and to sum.

So, for example, consider a data.table (named dt_data) with 20 fields,
named f1, f2, ..., f20.

My approach is this: I set the key on the field I have to use, for
example f2. Then I filter the data and use it to do some
computations.

From these computations, I then discover which fields I have
to filter on next, for example f4 and f5. So I set the key of dt_data
to (f4, f5), and so on.

I use this approach because I don't know whether it's possible to set
the key on all fields f1, f2, ..., f20 in advance and then use only
some of them.

Is there a better way to use data.table?

Thanks,

jj
_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: best way to set keys, when you don't know in advance which fields you will use

caneff
You don't have to use keys at all.  When you aggregate, the columns
you pass to by don't have to be key columns of the data.table.  This
is called an "ad hoc by".  On its own it is slightly slower than a
keyed by, but my intuition says it isn't really any slower once you
count the time spent setting the key.
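
As a rough illustration of an ad hoc by with columns chosen at run time, here is a minimal sketch. The table is a small stand-in for dt_data, and the `value` column and the helpers `sum_by` and `filter_sum` are made-up names for this example, not part of data.table.

```r
library(data.table)

# Small stand-in for dt_data: three factor fields and one numeric column.
set.seed(1)
n <- 1000
dt_data <- data.table(
  f1 = factor(sample(c("a", "b"), n, replace = TRUE)),
  f2 = factor(sample(c("x", "y", "z"), n, replace = TRUE)),
  f3 = factor(sample(c("u", "v"), n, replace = TRUE)),
  value = runif(n)
)

# Hypothetical helper: sum `value` by whichever columns are chosen at run time.
# No key is set; `by` accepts a character vector of column names.
sum_by <- function(dt, by_cols) {
  dt[, list(total = sum(value)), by = by_cols]
}

# Hypothetical helper: filter on one run-time column first, then group.
# The filter is a plain logical vector, so it needs no key either.
filter_sum <- function(dt, filter_col, filter_val, by_cols) {
  dt[dt[[filter_col]] == filter_val, list(total = sum(value)), by = by_cols]
}

res1 <- sum_by(dt_data, "f2")
res2 <- sum_by(dt_data, c("f1", "f3"))
res3 <- filter_sum(dt_data, "f2", "x", c("f1", "f3"))
```

The same table can be grouped by different fields on each call without any setkey in between.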

When you add a key, the table is sorted by those fields, and you
incur a time cost for that.  If you consistently work with the same
keys, you may make up that cost later on.  But for multiple different
groupings, the ad hoc by is probably faster.  Do some timings to see.
In some simple ones I ran, the sort alone was slower than the ad hoc
by.
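
One way such a timing could be set up is sketched below. The table size, column names and number of groups are assumptions chosen to keep the run short, and the result will vary with the data.table version, the data and the machine.

```r
library(data.table)

set.seed(42)
n <- 1e6  # assumed size, smaller than the 10,000k rows in the question
DT <- data.table(
  f1 = factor(sample(letters[1:10], n, replace = TRUE)),
  f2 = factor(sample(LETTERS[1:10], n, replace = TRUE)),
  v  = runif(n)
)

# Ad hoc by: no key, grouping columns passed directly.
t_adhoc <- system.time(
  res_adhoc <- DT[, list(total = sum(v)), by = c("f1", "f2")]
)

# Keyed by: work on a copy so DT stays unkeyed, and include the sort
# done by setkeyv() in the measured time.
DT_keyed <- copy(DT)
t_keyed <- system.time({
  setkeyv(DT_keyed, c("f1", "f2"))
  res_keyed <- DT_keyed[, list(total = sum(v)), by = key(DT_keyed)]
})

print(c(adhoc = t_adhoc[["elapsed"]], setkey_plus_by = t_keyed[["elapsed"]]))
```

Timing the key step together with the grouping is the point here: the sort is only worth paying for if the same key is reused afterwards.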

On 25 August 2011 11:05, Jean Jacques Dureau <[hidden email]> wrote:

> [...]

Re: best way to set keys, when you don't know in advance which fields you will use

Matthew Dowle
JJ,
Yes, Chris is spot on.
A keyed by should be faster when each group is large; e.g., a
1 billion row data.table of 1,000 groups. See FAQ 3.3 for why.
However, in your example, an ad hoc by does seem more appropriate.
Matthew

On Thu, 2011-08-25 at 11:17 -0400, Chris Neff wrote:

> [...]

Re: best way to set keys, when you don't know in advance which fields you will use

Matthew Dowle
Also, it makes a difference whether or not the groups happen to be
contiguous in the table.

Try creating a large table with large groups, where each group is
scattered throughout the table non-contiguously.  Time an ad hoc by.
Then set a key, remove the key, and time the ad hoc by again.  The
second ad hoc by should be much faster.  Then set the key again and
time a keyed by; it should be faster still.

Does that illustrate what's going on?
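
The experiment described above might be sketched as follows. The table size, the number of groups and the column names `g` and `v` are assumptions for this example, and the timings will vary with the data.table version and the machine, so no particular ordering is guaranteed.

```r
library(data.table)

set.seed(7)
n_groups <- 100
n <- 1e6  # assumed size; increase it to make the differences easier to see
# Each row gets a random group id, so every group is scattered through the table.
DT <- data.table(g = sample(n_groups, n, replace = TRUE), v = runif(n))

# 1. Ad hoc by on scattered groups.
t1 <- system.time(r1 <- DT[, sum(v), by = g])[["elapsed"]]

# 2. Set a key (this sorts the rows), remove it, and repeat the ad hoc by.
#    The rows stay sorted, so each group is now contiguous.
setkey(DT, g)
setkey(DT, NULL)
t2 <- system.time(r2 <- DT[, sum(v), by = g])[["elapsed"]]

# 3. Set the key again and do a keyed by.
setkey(DT, g)
t3 <- system.time(r3 <- DT[, sum(v), by = g])[["elapsed"]]

print(c(adhoc_scattered = t1, adhoc_contiguous = t2, keyed = t3))
```

All three calls compute the same group sums; only the physical layout of the rows and the presence of the key change between them.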
 

On Thu, 2011-08-25 at 19:18 +0100, Matthew Dowle wrote:

> [...]

Re: best way to set keys, when you don't know in advance which fields you will use

jj.dureau
Dear Chris and Matthew,
thanks for the fantastic explanation you gave me!

I am very satisfied with the processing time of the "ad hoc by". I
just wanted to confirm that, working without setting any keys on the
data.table, I was still using the full potential of this library. So
you have confirmed that my approach is not wrong.

In fact, I noticed that with 7,000k rows, 7 factors to group by (f1,
..., f7) and one variable to sum, I get:

 processing time of DT[, sum(f8), by = "f1,f2,f3,f4,f5,f6,f7"]
 <
 processing time of setkey(DT, f1, f2, f3, f4, f5, f6, f7) and
 DT[, sum(f8), by = key(DT)]

Thank you very much, and congratulations to the developers! Considering
that statisticians are increasingly working with large data, I
think it's one of the most interesting R libraries!

jj

2011/8/25 Matthew Dowle <[hidden email]>:

> [...]