aggregate() runs out of memory


aggregate() runs out of memory

Sam Steingold
I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17 columns).
I want to get the result of
table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)
alas, aggregate has been running for ~30 minutes, RSS is 14G, VIRT is
24.3G, and no end in sight.
both V1 and V2 are characters (not factors).
Is there anything I could do to speed this up?
Thanks.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://www.PetitionOnline.com/tap12009/
http://dhimmi.com http://think-israel.org http://iris.org.il
WinWord 6.0 UNinstall: Not enough disk space to uninstall WinWord


Re: aggregate() runs out of memory

Steve Lianoglou-6
Hi,

On Fri, Sep 14, 2012 at 3:26 PM, Sam Steingold <[hidden email]> wrote:
> I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17 columns).
> I want to get the result of
> table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)
> alas, aggregate has been running for ~30 minutes, RSS is 14G, VIRT is
> 24.3G, and no end in sight.
> both V1 and V2 are characters (not factors).
> Is there anything I could do to speed this up?
> Thanks.

You might find you'll get a lot of mileage out of data.table when
working with such large data.frames ...

To get something close to what you're after, you can try:

R> library(data.table)
R> Z <- as.data.table(Z)
R> setkeyv(Z, 'V2')
R> agg <- Z[, list(count=.N), by='V2']

From here you might:

R> tab1 <- table(agg$count)
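
For a self-contained illustration, here's the same recipe end-to-end on
a toy stand-in for Z (data invented here; V1 and V2 are characters, per
your description):

library(data.table)
## Toy stand-in for Z: V2 holds the grouping ids (2 x's, 3 y's, 1 z)
Z <- data.frame(V1 = letters[1:6],
                V2 = c("x", "x", "y", "y", "y", "z"),
                stringsAsFactors = FALSE)
Z <- as.data.table(Z)
setkeyv(Z, 'V2')
agg <- Z[, list(count=.N), by='V2']   # rows per V2 group: 2, 3, 1
table(agg$count)                      # distribution of group sizes
## 1 2 3
## 1 1 1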

I think that'll get you where you want to be ... I'm ashamed to say I
haven't really done much with aggregate, since I've mostly used plyr
and data.table, so I might be missing your end goal -- a reproducible
example with a small data.frame from you would help here (for me at
least).

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Re: aggregate() runs out of memory

William Dunlap
Using data.table will probably speed lots of things up, but also note that
  aggregate(x, FUN=length, by)$x
is a slow way to compute
  table(by).
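
Concretely, on toy data (invented here to illustrate the equivalence):

Z <- data.frame(V1 = letters[1:6],
                V2 = c("x", "x", "y", "y", "y", "z"),
                stringsAsFactors = FALSE)
## The aggregate() route: count rows per V2 group, then tabulate the counts
table(aggregate(Z$V1, FUN = length, by = list(id = Z$V2))$x)
## 1 2 3
## 1 1 1
## The direct route: one pass over V2, same distribution of group sizes
table(table(Z$V2))
## 1 2 3
## 1 1 1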

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com




Re: aggregate() runs out of memory

djmuseR
In reply to this post by Sam Steingold
Hi:

This should give you some idea of what Steve is talking about:

library(data.table)
dt <- data.table(x = sample(100000, 10000000, replace = TRUE),
                  y = rnorm(10000000), key = "x")
dt[, .N, by = x]
system.time(dt[, .N, by = x])

...on my system (dual core, 8 GB RAM, running Win7 64-bit):
> system.time(dt[, .N, by = x])
   user  system elapsed
   0.12    0.02    0.14

.N is an optimized function to find the number of rows of each data subset.
Much faster than aggregate(). It might take a little longer because you
have more columns that suck up space, but you get the idea. It's also about
5-6 times faster if you set a key variable in the data table than if you
don't.
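
A sketch of that last comparison, with a hypothetical unkeyed copy dtu
(exact timings are machine-dependent):

dtu <- data.table(x = sample(100000, 10000000, replace = TRUE),
                  y = rnorm(10000000))   # same data, but no key set
system.time(dtu[, .N, by = x])           # unkeyed grouping
setkey(dtu, x)                           # sort once and set the key
system.time(dtu[, .N, by = x])           # keyed grouping: typically much faster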

Dennis



Re: aggregate() runs out of memory

Steve Lianoglou-6
Hi,

On Fri, Sep 14, 2012 at 4:26 PM, Dennis Murphy <[hidden email]> wrote:

> Hi:
>
> This should give you some idea of what Steve is talking about:
>
> library(data.table)
> dt <- data.table(x = sample(100000, 10000000, replace = TRUE),
>                   y = rnorm(10000000), key = "x")
> dt[, .N, by = x]
> system.time(dt[, .N, by = x])
>
> ...on my system (dual core, 8 GB RAM, running Win7 64-bit):
>> system.time(dt[, .N, by = x])
>    user  system elapsed
>    0.12    0.02    0.14
>
> .N is an optimized function to find the number of rows of each data subset.
> Much faster than aggregate(). It might take a little longer because you
> have more columns that suck up space, but you get the idea. It's also about
> 5-6 times faster if you set a key variable in the data table than if you
> don't.

Well done, sir! (One slight critique: .N isn't a function, it's just a
variable that is reset within each by-subset/group.)

Also, don't forget to use the .SDcols parameter in [.data.table if you
plan on only using a subset of the columns inside your "by" computation.
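
For example, reusing the `dt` from Dennis's message quoted above (a
sketch; .SD is the per-group subset of the non-grouping columns):

## Average y within each x group, touching only the y column:
dt[, lapply(.SD, mean), by = x, .SDcols = "y"]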

There's lots of documentation in the package (`?data.table`) and in
the vignettes/FAQ to help you tweak your usage, if you decide to take
the data.table route.

HTH,
-steve




--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Re: aggregate() runs out of memory

Sam Steingold
In reply to this post by Steve Lianoglou-6
Thanks Steve,
what is the analogue of .N for min and max?
i.e., what is data.table's version of
aggregate(infl$delay,by=list(infl$share.id),FUN=min)
aggregate(infl$delay,by=list(infl$share.id),FUN=max)
thanks!
Sam.




--
Sam Steingold <http://sds.podval.org> <http://www.childpsy.net/>


Re: aggregate() runs out of memory

Steve Lianoglou-6
Hi,

On Mon, Nov 19, 2012 at 1:25 PM, Sam Steingold <[hidden email]> wrote:
> Thanks Steve,
> what is the analogue of .N for min and max?
> i.e., what is data.table's version of
> aggregate(infl$delay,by=list(infl$share.id),FUN=min)
> aggregate(infl$delay,by=list(infl$share.id),FUN=max)
> thanks!

It would be helpful if I could see a bit of your table (like
`head(infl)`, if it's not too big), but anyway: there's no special
analogue of .N for min/max -- you just use min() and max() directly.

For instance, if you want the min and max of `delay` within each group
defined by `share.id`, and let's assume `infl` is a data.frame, you
can do something like so:

R> infl <- as.data.table(infl)
R> setkey(infl, share.id)
R> result <- infl[, list(min=min(delay), max=max(delay)), by="share.id"]
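
On a small invented `infl`, that gives:

R> infl <- data.table(share.id = rep(c("a", "b", "c"), times = c(2, 3, 1)),
                      delay = c(5, 1, 7, 2, 9, 4))
R> setkey(infl, share.id)
R> infl[, list(min=min(delay), max=max(delay)), by="share.id"]
   share.id min max
1:        a   1   5
2:        b   2   9
3:        c   4   4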

HTH,

-steve


--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Re: aggregate() runs out of memory

David Winsemius
In reply to this post by Sam Steingold

On Nov 19, 2012, at 1:25 PM, Sam Steingold wrote:

> Thanks Steve,
> what is the analogue of .N for min and max?

?seq

> i.e., what is the data.table's version of
> aggregate(infl$delay,by=list(infl$share.id),FUN=min)

> aggregate(infl$delay,by=list(infl$share.id),FUN=max)

> DT[, list(max(v)), by=x]
   x V1
1: a  3
2: b  6
3: c  9
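
(DT isn't shown above; a definition that reproduces that output,
presumably along the lines of the ?data.table examples, would be:)

DT <- data.table(x = rep(c("a", "b", "c"), each = 3), v = 1:9)
DT[, list(max(v)), by=x]   # min works the same way: DT[, list(min(v)), by=x]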



David Winsemius, MD
Alameda, CA, USA


Re: aggregate() runs out of memory

Sam Steingold
In reply to this post by Steve Lianoglou-6
Hi,

> * Steve Lianoglou <[hidden email]> [2012-11-19 13:30:03 -0800]:
>
> For instance, if you want the min and max of `delay` within each group
> defined by `share.id`, and let's assume `infl` is a data.frame, you
> can do something like so:
>
>> R> infl <- as.data.table(infl)
> R> setkey(infl, share.id)
> R> result <- infl[, list(min=min(delay), max=max(delay)), by="share.id"]

perfect, thanks.
alas, the resulting table does not contain the share.id column.
do I need to add something like "id=unique(share.id)" to the list?
also, if there is a field in the original table infl which only depends
on share.id, how do I add this unique value to the summary?
it appears that "count=unique(country)" in list() does what I need, but
it slows down the process.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://openvotingconsortium.org http://jihadwatch.org
http://thereligionofpeace.com http://palestinefacts.org http://dhimmi.com
Why use Windows, when there are Doors?


Re: aggregate() runs out of memory

Steve Lianoglou-6
Hi Sam,

On Mon, Nov 26, 2012 at 3:13 PM, Sam Steingold <[hidden email]> wrote:

> Hi,
>
>> [snip]
>
> perfect, thanks.
> alas, the resulting table does not contain the share.id column.
> do I need to add something like "id=unique(share.id)" to the list?
> also, if there is a field in the original table infl which only depends
> on share.id, how do I add this unique value to the summary?
> it appears that "country=unique(country)" in list() does what I need,
> but it slows down the process.

Hmm ... I think it should be there, but I'm having a hard time
remembering what you want.

Could you please copy paste the output of `dput(head(infl, 20))` as
well as an approximation of what the result is that you want.

It will make it easier for us to talk more concretely about how to get
what you want.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Re: aggregate() runs out of memory

Sam Steingold
hi Steve,

> * Steve Lianoglou <[hidden email]> [2012-11-26 16:08:59 -0500]:
> On Mon, Nov 26, 2012 at 3:13 PM, Sam Steingold <[hidden email]> wrote:
>> [snip]
>
> Hmm ... I think it should be there, but I'm having a hard time
> remembering what you want.
>
> Could you please copy paste the output of `(head(infl, 20))` as
> well as an approximation of what the result is that you want.

this prints all the levels for all the factor columns and takes
megabytes.

--8<---------------cut here---------------start------------->8---
> f <- data.frame(id=rep(1:3,4),country=rep(6:8,4),delay=1:12)
> f
   id country delay
1   1       6     1
2   2       7     2
3   3       8     3
4   1       6     4
5   2       7     5
6   3       8     6
7   1       6     7
8   2       7     8
9   3       8     9
10  1       6    10
11  2       7    11
12  3       8    12
> f <- as.data.table(f)
> setkey(f,id)
> delays <- f[,list(min=min(delay),max=max(delay),count=.N,country=unique(country)),by="id"]
> delays
   id min max count country
1:  1   1  10     4       6
2:  2   2  11     4       7
3:  3   3  12     4       8
--8<---------------cut here---------------end--------------->8---

this is still too slow, apparently because of unique.
how do I speed it up?

Thanks.

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://iris.org.il
http://ffii.org http://pmw.org.il http://mideasttruth.com
Programming is like sex: one mistake and you have to support it for a lifetime.


Re: aggregate() runs out of memory

Steve Lianoglou-6
Hi,

On Mon, Nov 26, 2012 at 4:57 PM, Sam Steingold <[hidden email]> wrote:
[snip]
>> Could you please copy paste the output of `(head(infl, 20))` as
>> well as an approximation of what the result is that you want.

Don't know how "dput" got clipped in your reply from the quoted text I
wrote, but I actually asked for `dput(head(infl, 20))`.

The dput makes a world of difference because I can easily copy/paste
the output into R and get a working table.

> this prints all the levels for all the factor columns and takes
> megabytes.

Try using droplevels, e.g.:

R> dput(droplevels(head(infl, 20)))


> --8<---------------cut here---------------start------------->8---
>> f <- data.frame(id=rep(1:3,4),country=rep(6:8,4),delay=1:12)
>> f
>    id country delay
> 1   1       6     1
> 2   2       7     2
> 3   3       8     3
> 4   1       6     4
> 5   2       7     5
> 6   3       8     6
> 7   1       6     7
> 8   2       7     8
> 9   3       8     9
> 10  1       6    10
> 11  2       7    11
> 12  3       8    12
>> f <- as.data.table(f)
>> setkey(f,id)
>> delays <- f[,list(min=min(delay),max=max(delay),count=.N,country=unique(country)),by="id"]
>> delays
>    id min max count country
> 1:  1   1  10     4       6
> 2:  2   2  11     4       7
> 3:  3   3  12     4       8
> --8<---------------cut here---------------end--------------->8---
>
> this is still too slow, apparently because of unique.
> how do I speed it up?

I think I'm missing something.

Your call to `min(delay)` and `max(delay)` will return the minimum and
maximum delays within the particular "id" you are grouping by. I guess
there must be several values for "country" within each "id" group --
do you really want the same min and max values to be replicated as
many times as there are unique "country"s?

Do you perhaps want to iterate over a combo of id and country?

Anyway: if you don't use `unique` inside your calculation, I guess it
goes significantly faster, like so:

R> result <- f[, list(min=min(delay), max=max(delay), count=.N, country=country[1L]), by="share.id"]

If that's bearable, and you really want it the way you suggest (or, at
least, the way I'm interpreting it), I wonder if this two-step would
be faster?

R> setkeyv(f, c('share.id', 'country'))
R> r1 <- f[, list(min=min(delay), max=max(delay), count=.N), by='share.id']
R> result <- unique(f)[r1]  ## I think

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Re: aggregate() runs out of memory

Sam Steingold
Hi,

> * Steve Lianoglou <[hidden email]> [2012-11-26 17:32:21 -0500]:
>
>> [snip]
>
> I think I'm missing something.
>
> Your call to `min(delay)` and `max(delay)` will return the minimum and
> maximum delays within the particular "id" you are grouping by. I guess
> there must be several values for "country" within each "id" group --
> do you really want the same min and max values to be replicated as
> many times as there are unique "country"s?

there is precisely one country for each id.
i.e., unique(country) is the same as country[1].
thanks a lot for the suggestion!

> R> result <- f[, list(min=min(delay), max=max(delay), count=.N, country=country[1L]), by="share.id"]

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://thereligionofpeace.com http://pmw.org.il
http://honestreporting.com http://americancensorship.org
Why do you never call me back after I scream that I will never talk to you again?!


Re: aggregate() runs out of memory

Steve Lianoglou-6
On Monday, November 26, 2012, Sam Steingold wrote:
[snip]

>
> there is precisely one country for each id.
> i.e., unique(country) is the same as country[1].
> thanks a lot for the suggestion!
>
> > R> result <- f[, list(min=min(delay), max=max(delay), count=.N, country=country[1L]), by="share.id"]


And is it performant?

It just occurred to me that this is even better:

R> setkeyv(f, c("share.id", "delay"))
R> result <- f[, list(min=delay[1L], max=delay[.N], count=.N, country=country[1L]), by="share.id"]
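
As a quick check on the toy `f` from earlier in the thread (which uses
`id` where the real data has `share.id`):

R> f <- data.table(id=rep(1:3,4), country=rep(6:8,4), delay=1:12)
R> setkeyv(f, c("id", "delay"))
R> f[, list(min=delay[1L], max=delay[.N], count=.N, country=country[1L]), by="id"]
   id min max count country
1:  1   1  10     4       6
2:  2   2  11     4       7
3:  3   3  12     4       8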





--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Re: aggregate() runs out of memory

Sam Steingold
> * Steve Lianoglou <[hidden email]> [2012-11-26 19:47:25 -0500]:
>
> On Monday, November 26, 2012, Sam Steingold wrote:
> [snip]
>
>>
>> there is precisely one country for each id.
>> i.e., unique(country) is the same as country[1].
>> thanks a lot for the suggestion!
>>
>> > R> result <- f[, list(min=min(delay), max=max(delay), count=.N, country=country[1L]), by="share.id"]
>
>
> And is it performant?

acceptable.

> It just occurred to me that this is even better:
>
> R> setkeyv(f, c("share.id", "delay"))
> R> result <- f[, list(min=delay[1L], max=delay[.N], count=.N, country=country[1L]), by="share.id"]
>

this assumes that delays are sorted (like in my example)
which, in reality, they are not.
thanks for your help!

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://honestreporting.com
http://americancensorship.org http://memri.org http://www.memritv.org
Illiterate?  Write today, for free help!


Re: aggregate() runs out of memory

Steve Lianoglou-6
Hi,

On Tue, Nov 27, 2012 at 11:29 AM, Sam Steingold <[hidden email]> wrote:
>> * Steve Lianoglou <[hidden email]> [2012-11-26 19:47:25 -0500]:
[snip]

>> It just occurred to me that this is even better:
>>
>> R> setkeyv(f, c("share.id", "delay"))
>> R> result <- f[, list(min=delay[1L], max=delay[.N], count=.N, country=country[1L]), by="share.id"]
>>
>
> this assumes that delays are sorted (like in my example)
> which, in reality, they are not.
> thanks for your help!

When you include "delay" in the call to `setkeyv` as I did above, it
sorts low to high w/in each "share.id" group.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


Re: aggregate() runs out of memory

Sam Steingold
> * Steve Lianoglou <[hidden email]> [2012-11-27 12:53:23 -0500]:
> On Tue, Nov 27, 2012 at 11:29 AM, Sam Steingold <[hidden email]> wrote:
>>> * Steve Lianoglou <[hidden email]> [2012-11-26 19:47:25 -0500]:
> [snip]
>>> It just occurred to me that this is even better:
>>>
>>> R> setkeyv(f, c("share.id", "delay"))
>>> R> result <- f[, list(min=delay[1L], max=delay[.N], count=.N, country=country[1L]), by="share.id"]
>>>
>>
>> this assumes that delays are sorted (like in my example)
>> which, in reality, they are not.
>
> When you include "delay" in the call to `setkeyv` as I did above, it
> sorts low to high w/in each "share.id" group.

Ah, but then I would have to _sort_ (~n*log(n)) by delay within each ID
group, while all I care about is min/max (~n).

thanks again!

--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://think-israel.org http://truepeace.org
http://thereligionofpeace.com http://mideasttruth.com http://www.memritv.org
If You Want Breakfast In Bed, Sleep In the Kitchen.
