Quantcast

translating SQL statements into data.table operations

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

translating SQL statements into data.table operations

Nick Switanek
I've recently stumbled across data.table, Matthew Dowle's package. I'm
impressed by the speed of the package in handling operations with large
data.frames, but am a bit overwhelmed with the syntax. I'd like to express
the SQL statement below using data.table operations rather than sqldf (which
was incredibly slow for a small subset of my financial data) or
import/export with a DBMS, but I haven't been able to figure out how to do
it. I would be grateful for your suggestions.

nick



My aim is to join events (trades) from two datasets ("edt" and "cdt") where,
for the same stock, the events in one dataset occur between 15 and 75 days
before the other, and within the same time window. I can only see how to
express the "WHERE e.SYMBOL = c.SYMBOL" part in data.table syntax. I'm also
at a loss at whether I can express the remainder using data.table's
%between% operator or not.

ctqm <- sqldf("SELECT e.*,
                 c.DATE 'DATEctrl',
                 c.TIME 'TIMEctrl',
                 c.PRICE 'PRICEctrl',
                 c.SIZE 'SIZEctrl'

                 FROM edt e, ctq c

                 WHERE e.SYMBOL = c.SYMBOL AND
                       julianday(e.DATE) - julianday(c.DATE) BETWEEN 15 AND
75 AND
                       strftime('%H:%M:%S',c.TIME) BETWEEN
strftime('%H:%M:%S',e.BEGTIME) AND strftime('%H:%M:%S',e.ENDTIME)")

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: translating SQL statements into data.table operations

Gabor Grothendieck
Note that, in general, you can speed up joins, even within sqldf, by
adding indexes to your tables and ensuring that your select statement
is written in such a way that the indexes are used.  See example 4i on
the sqldf home page.

On Wed, Mar 24, 2010 at 4:51 PM, Nick Switanek <[hidden email]> wrote:

> I've recently stumbled across data.table, Matthew Dowle's package. I'm
> impressed by the speed of the package in handling operations with large
> data.frames, but am a bit overwhelmed with the syntax. I'd like to express
> the SQL statement below using data.table operations rather than sqldf (which
> was incredibly slow for a small subset of my financial data) or
> import/export with a DBMS, but I haven't been able to figure out how to do
> it. I would be grateful for your suggestions.
>
> nick
>
>
>
> My aim is to join events (trades) from two datasets ("edt" and "cdt") where,
> for the same stock, the events in one dataset occur between 15 and 75 days
> before the other, and within the same time window. I can only see how to
> express the "WHERE e.SYMBOL = c.SYMBOL" part in data.table syntax. I'm also
> at a loss at whether I can express the remainder using data.table's
> %between% operator or not.
>
> ctqm <- sqldf("SELECT e.*,
>                 c.DATE 'DATEctrl',
>                 c.TIME 'TIMEctrl',
>                 c.PRICE 'PRICEctrl',
>                 c.SIZE 'SIZEctrl'
>
>                 FROM edt e, ctq c
>
>                 WHERE e.SYMBOL = c.SYMBOL AND
>                       julianday(e.DATE) - julianday(c.DATE) BETWEEN 15 AND
> 75 AND
>                       strftime('%H:%M:%S',c.TIME) BETWEEN
> strftime('%H:%M:%S',e.BEGTIME) AND strftime('%H:%M:%S',e.ENDTIME)")
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: translating SQL statements into data.table operations

Matthew Dowle
In reply to this post by Nick Switanek

Nick,

Good question,  but just sent to the wrong place. The posting guide asks you
to contact the package maintainer first before posting to r-help only if you
don't hear back. I guess one reason for that is that if questions about all
2000+ packages were sent to r-help, then r-help's traffic could go through
the roof.  Another reason could be that some (i.e. maybe many, maybe few)
package maintainers don't actually monitor r-help and might miss any
messages you post here.  I only saw this one thanks to google alerts.

Since I'm writing anyway ... are you using the latest version on r-forge
which has the very fast grouping? Have you set multi-column keys on both edt
and cdt and tried edt[cdt,roll=TRUE] syntax ?  We'll help you off list to
climb the learning curve quickly. We are working on FAQs and a vignette and
they should be ready soon too.

Please do follow up with us (myself and Tom Short cc'd are the main
developers) off list and one of us will be happy to help further.

Matthew


"Nick Switanek" <[hidden email]> wrote in message
news:[hidden email]...

> I've recently stumbled across data.table, Matthew Dowle's package. I'm
> impressed by the speed of the package in handling operations with large
> data.frames, but am a bit overwhelmed with the syntax. I'd like to express
> the SQL statement below using data.table operations rather than sqldf
> (which
> was incredibly slow for a small subset of my financial data) or
> import/export with a DBMS, but I haven't been able to figure out how to do
> it. I would be grateful for your suggestions.
>
> nick
>
>
>
> My aim is to join events (trades) from two datasets ("edt" and "cdt")
> where,
> for the same stock, the events in one dataset occur between 15 and 75 days
> before the other, and within the same time window. I can only see how to
> express the "WHERE e.SYMBOL = c.SYMBOL" part in data.table syntax. I'm
> also
> at a loss at whether I can express the remainder using data.table's
> %between% operator or not.
>
> ctqm <- sqldf("SELECT e.*,
>                 c.DATE 'DATEctrl',
>                 c.TIME 'TIMEctrl',
>                 c.PRICE 'PRICEctrl',
>                 c.SIZE 'SIZEctrl'
>
>                 FROM edt e, ctq c
>
>                 WHERE e.SYMBOL = c.SYMBOL AND
>                       julianday(e.DATE) - julianday(c.DATE) BETWEEN 15 AND
> 75 AND
>                       strftime('%H:%M:%S',c.TIME) BETWEEN
> strftime('%H:%M:%S',e.BEGTIME) AND strftime('%H:%M:%S',e.ENDTIME)")
>
> [[alternative HTML version deleted]]
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Loading...