SLOW split() function


SLOW split() function

ivo welch
dear R experts:  apologies for all my speed and memory questions.  I
have a bet with my coauthors that I can make R reasonably efficient
through R-appropriate programming techniques.  this is not just for
kicks, but for work.  for benchmarking, my [3 year old] Mac Pro has
2.8GHz Xeons, 16GB of RAM, and R 2.13.1.

right now, it seems that 'split()' is why I am losing my bet.  (split
is an integral component of *apply() and by(), so I need split() to be
fast.  its resulting list can then be fed, e.g., to mclapply().)  I
made up an example to illustrate my ills:

    library(data.table)
    N <- 1000
    T <- N*10
    d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
    setkey(d, "key"); gc() ## force a garbage collection
    cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")
    print(system.time( s<-split(d, d$key) ))

My ordered input data table (or data frame; doesn't make a difference)
is 114MB in size.  it takes about a second to create.  split() only
needs to reshape it.  this simple operation takes almost 5 minutes on
my computer.

with a larger data set, this time explodes further.

am I doing something wrong?  is there an alternative to split()?

sincerely,

/iaw

----
Ivo Welch ([hidden email])


Re: SLOW split() function

jholtman
instead of splitting the entire data frame, split the indices and then use these to access your data: try

system.time(s <- split(seq(nrow(d)), d$key))

this should be faster and less memory intensive.  you can then use the indices to access each subset:

result <- lapply(s, function(.indx) {
    sum(d$someCol[.indx])  # 'someCol' stands in for whatever column you need
})

Sent from my iPad


Re: SLOW split() function

djmuseR
In reply to this post by ivo welch
I tried this:

library(data.table)
N <- 1000
T <- N*10
d <- data.table(gp = rep(1:T, rep(N,T)), val = rnorm(N*T), key = 'gp')
dim(d)
[1] 10000000        2

# On my humble 8Gb system,
> system.time(l <- d[, split(val, gp)])
   user  system elapsed
   4.15    0.09    4.27

I wouldn't be surprised if there were a much faster way to do this
operation in data.table since split() is a data frame operation. This
is about as fast as Jim Holtman's suggestion:

system.time(s <- split(seq_len(nrow(d)), d$gp))
   user  system elapsed
   4.15    0.09    4.29
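
As a quick sanity check (just a sketch), the index-based split should
reproduce the same groups of values as splitting the vector directly:

s2 <- lapply(s, function(.indx) d$val[.indx])
identical(s2, split(d$val, d$gp))   # should be TRUE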

HTH,
Dennis


Re: SLOW split() function

William Dunlap
In reply to this post by jholtman
The following avoids the overhead of data.frame methods
(and assumes the data.frame doesn't include matrices
or other data.frames) and relies on split(vector, factor)
quickly splitting a vector into a list of vectors.
For a 10^6-row by 10-column data.frame split into 10^5
groups, this took 14.1 seconds, while split() took 658.7 s.
Both returned the same thing.

Perhaps something based on this idea would help your
parallelized by().

mysplit.data.frame <-
function (x, f, drop = FALSE, ...)
{
    f <- as.factor(f)
    ## split each column separately (fast split.default on vectors)
    tmp <- lapply(x, function(xi) split(xi, f, drop = drop, ...))
    ## split the row names the same way, to reattach per group below
    rn <- split(rownames(x), f, drop = drop, ...)
    ## flatten to one list of column chunks, named by group level
    tmp <- unlist(unname(tmp), recursive = FALSE)
    ## regroup the chunks by level, preserving the original level order
    tmp <- split(tmp, factor(names(tmp), levels = unique(names(tmp))))
    ## reassemble each group into a data.frame without data.frame() overhead
    tmp <- lapply(setNames(seq_along(tmp), names(tmp)), function(i) {
        t <- tmp[[i]]
        names(t) <- names(x)
        attr(t, "row.names") <- rn[[i]]
        class(t) <- "data.frame"
        t
    })
    tmp
}
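
For reference, a usage sketch along the lines of the timing above
(hypothetical data, not the original benchmark):

df <- data.frame(matrix(rnorm(1e7), ncol = 10))   # 10^6 rows, 10 columns
f  <- sample(1e5, 1e6, replace = TRUE)            # 10^5 groups
system.time(s1 <- mysplit.data.frame(df, f))
system.time(s2 <- split(df, f))                   # the slow data.frame method
all.equal(s1, s2)   # per the note above, both should return the same thing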

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


Re: SLOW split() function

ivo welch
thank you, everyone.  this was very helpful to my specific task and
understanding.  for the benefit of future googlers, I thought I would
post some experiments and results here.

ultimately, I need to do a by() on an irregular matrix, and I now know
how to speed up by() on a single-core, and then again on a multi-core
machine.

library(data.table)
N <- 1000*1000
d <- data.table(data.frame(key = as.integer(runif(N, min = 1, max = N/10)),
                           x = rnorm(N), y = rnorm(N)))  # irregular group sizes
setkey(d, "key"); gc() ## sort and force a garbage collection

cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")

cat("\nStandard by() Function:\n")
print(system.time(all.1 <- by(d, d$key, function(d) coef(lm(y ~ x, data=d)))))

cat("\n\nPreSplit Function [aka Jim H]\n\t(a) Splitting Operation:\n")
print(system.time(si <- split(seq(nrow(d)), d$key)))
cat("\n\t(b) Regressions:\n")
print(system.time(all.2 <- lapply(si, function(.indx) {
    coef(lm(d$y[.indx] ~ d$x[.indx])) })))
print(system.time(all.2b <- lapply(si, function(.indx) {
    coef(lm(y ~ x, data=d[.indx,])) })))

cat("\n\nNaive Split Data Frame\n\t(a) Splitting Operation:\n")
print(system.time(ds <- split(d, d$key)))
cat("\n\t(b) Regressions:\n")
print(system.time(all.3a <- lapply(ds, function(ds) { coef(lm(ds$y ~ ds$x)) })))
print(system.time(all.3b <- lapply(ds, function(ds) {
    coef(lm(y ~ x, data=ds)) })))

the first and the last ways (all.1 and all.3) are "naive" ways of
doing this; each takes about 400-500 seconds on a Mac Air, core i5.
Jim's suggestion (all.2) cuts this roughly in half, by making the
split itself take almost no time.

and now, with multicore:

library(multicore)
print(system.time(all.4 <- mclapply(si, function(.indx) {
    coef(lm(y ~ x, data=d[.indx,])) })))

on my dual-core (quad-thread) i5, all four pseudo-cores become busy,
and the time roughly halves again, from 230 seconds to 120 seconds.


maybe the by() function should use Jim's approach, and multicore
should provide an mcby().  of course, now that I know how to do this
quickly by hand, this is not so important for me.  but it may help
some other novices.
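
for concreteness, a minimal mcby() along those lines might look like
this (an untested sketch, not an existing multicore function; assumes
a data frame and a FUN that takes the group subset):

mcby <- function(data, INDICES, FUN, ...) {
    ## split row indices once (cheap), then process the groups in parallel
    si <- split(seq_len(nrow(data)), INDICES)
    mclapply(si, function(.indx) FUN(data[.indx, , drop = FALSE], ...))
}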

thanks again everybody.

regards,

/iaw

----
Ivo Welch ([hidden email])





Re: SLOW split() function

Joshua Wiley-2
As another followup, given that you are doing numerous regression
models and (I presume) working with finance/stock data that is
strictly numeric (no need for special contrast coding, etc.), you can
substantially reduce the time spent estimating the coefficients.  A
simple way is to use lm.fit directly instead of lm.  For lm.fit, you
pass the y and x (design) matrices directly.  This skips a good deal
of overhead.  Here is one naive way; I imagine more speedups could be
gained by incorporating the intercept (the 1 vector) into d instead of
cbind()ing it.  The catch is that lm.fit requires matrices, not data
tables, so what you gain may be lost in having to do an extra
conversion.  In any case, here are the times on my system for the two
options (note I used N = 1000 * 100 because I am presently on a
glorified netbook).

> print(system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
+ x, data=d[.indx,])) })))
   user  system elapsed
  69.00    0.00   69.56

> print(system.time(all.2c <- lapply(si, function(.indx) { coef(lm.fit(y = d[.indx, y], x = cbind(1, d[.indx, x]))) })))
   user  system elapsed
  37.83    0.03   38.36

the column names for the coefficients will not be the same as from lm,
but the estimates should be identical.  While this is not recommended
in typical usage, in an application like regressions on rolling time
windows, etc., where you know the data are not changing, I think it
makes sense to bypass lm's clever inspection of your data and choice
of methods, and go straight to passing the design matrix.  Since you do
not need residuals, variances, etc., it may be possible to speed this
up even more, perhaps bypassing dqrls altogether.

Cheers,

Josh

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


Re: SLOW split() function

ivo welch
thanks, josh.  in my posting example, I did not need anything except
coefficients.  (when this is the case, I usually do not even use
lm.fit; I eliminate all missing obs first and then use
solve(crossprod(X), crossprod(X, y)) with X <- cbind(1, x).  this is
pretty fast.)
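
spelled out, that trick is just the normal equations (a sketch,
assuming complete cases and a well-conditioned design):

X <- cbind(1, x)                            # design matrix with intercept
b <- solve(crossprod(X), crossprod(X, y))   # solves X'X b = X'y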

alas, I will need to figure out how to get coefficient standard errors
faster in this case.  summary.lm() is really slow.

regards,

/iaw
----
Ivo Welch ([hidden email])
http://www.ivo-welch.info/
J. Fred Weston Professor of Finance
Anderson School at UCLA, C519






Re: SLOW split() function

Joshua Wiley-2
I do not know if stripping down functions is generally recommended,
but it is not too difficult to do if you know that you can make
assumptions.  Here is an example (I also found a fast way to convert
the data table to a matrix, again if some assumptions can be made).
Using the stripped-down function, you can get coefficients and
standard errors in less time than you can get just coefficients using
the default lm.  It is hugely less flexible.

Cheers,

Josh

##########################################
library(data.table)

## stripped-down lm and summary.lm (for standard errors)
minimal.lm <- function(y, x) {
  ## fast data.table -> matrix conversion (assumes all-numeric columns)
  dims <- dim(x)
  x <- unlist(x, FALSE, FALSE)
  dim(x) <- dims
  obj <- lm.fit(x = x, y = y)
  ## residual variance
  resvar <- sum(obj$residuals^2)/obj$df.residual
  p <- obj$rank
  ## (X'X)^-1 from the R factor of the QR decomposition, as summary.lm does
  R <- .Call("La_chol2inv", x = obj$qr$qr[1L:p, 1L:p, drop = FALSE],
             size = p, PACKAGE = "base")
  ## pick out the diagonal of R directly from the underlying vector
  m <- min(dim(R))
  d <- c(R)[1L + 0L:(m - 1L) * (dim(R)[1L] + 1L)]
  se <- sqrt(d * resvar)
  cbind(coef = obj$coefficients, se)
}

N <- 1000*100
d <- data.table(data.frame(key = as.integer(runif(N, min = 1, max = N/10)),
                           x = rnorm(N), y = rnorm(N)))  # irregular group sizes
## add intercept column
d$int <- 1L
setkey(d, "key"); gc() ## sort and force a garbage collection

cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")
print(system.time(si <- split(seq(nrow(d)), d$key)))

cat("\n\t(b) Regressions:\n")
## using lm
print(system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
x, data=d[.indx,])) })))
## using minimal.lm---faster and gives standard errors
print(system.time(all.2c <- lapply(si, function(.indx) { minimal.lm(y
= d[.indx, y], x = d[.indx, list(int, x)]) })))

#### Timings on my system ####
> print(system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
+ x, data=d[.indx,])) })))
   user  system elapsed
  67.87    0.01   68.46
> print(system.time(all.2c <- lapply(si, function(.indx) { minimal.lm(y = d[.indx, y], x = d[.indx, list(int, x)]) })))
   user  system elapsed
  47.72    0.00   48.00
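
A quick spot check (a sketch; the names differ, but the estimates
themselves should agree with lm):

## e.g., for the first group:
all.equal(unname(all.2b[[1]]), unname(all.2c[[1]][, "coef"]))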

######################################################################

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


Re: SLOW split() function

Thomas Lumley-2
In reply to this post by ivo welch
On Wed, Oct 12, 2011 at 4:56 AM, ivo welch <[hidden email]> wrote:
> thanks, josh.  in my posting example, I did not need anything except
> coefficients.  (when this is the case, I usually do not even use
> lm.fit; I eliminate all missing obs first and then use
> solve(crossprod(X), crossprod(X, y)) with X <- cbind(1, x).  this is
> pretty fast.)

solve(cbind(1,x), y) should be even faster, and more numerically stable,
 [and less likely to make certain people want to cast you into the
outer darkness, where there is SAS and gnashing of teeth]

> alas, I will need to figure out how to get coefficient standard errors
> faster in this case.  summary.lm() is really slow.

The code from summary.lm that actually computes the standard errors is
fairly efficient; you could extract that.
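
Something like the following (a quick sketch of the relevant part of
summary.lm, assuming a full-rank, unweighted fit):

fit <- lm.fit(x = cbind(1, x), y = y)
p <- fit$rank
## (X'X)^-1 via the R factor of the QR decomposition, as summary.lm does
XtXinv <- chol2inv(fit$qr$qr[1:p, 1:p, drop = FALSE])
resvar <- sum(fit$residuals^2) / fit$df.residual
se <- sqrt(diag(XtXinv) * resvar)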

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland


Re: SLOW split() function

Matthew Dowle
In reply to this post by Joshua Wiley-2
Using Josh's nice example, data.table's built-in 'by' (optimised
grouping) yields a sixfold speedup (100 seconds down to 15 on
my netbook).

> system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
+ x, data=d[.indx,])) }))
   user  system elapsed
144.501   0.300 145.525

> system.time(all.2c <- lapply(si, function(.indx) { minimal.lm(y
+ = d[.indx, y], x = d[.indx, list(int, x)]) }))
   user  system elapsed
100.819   0.084 101.552

> system.time(all.2d <- d[,minimal.lm2(y=y, x=cbind(int, x)),by=key])
   user  system elapsed
 15.269   0.012  15.323   # 6 times faster

> head(all.2c)
$`1`
        coef        se
x1 0.5152438 0.6277254
x2 0.5621320 0.5754560

$`2`
        coef       se
x1 0.2228235 0.312918
x2 0.3312261 0.261529

$`3`
         coef        se
x1 -0.1972439 0.4674000
x2 -0.1674313 0.4479957

$`4`
          coef        se
x1 -0.13915746 0.2729158
x2 -0.03409833 0.2212416

$`5`
           coef        se
x1  0.007969786 0.2389103
x2 -0.083776526 0.2046823

$`6`
          coef        se
x1 -0.58576454 0.5677619
x2 -0.07249539 0.5009013

> head(all.2d)
     key       coef        V2
[1,]   1  0.5152438 0.6277254
[2,]   1  0.5621320 0.5754560
[3,]   2  0.2228235 0.3129180
[4,]   2  0.3312261 0.2615290
[5,]   3 -0.1972439 0.4674000
[6,]   3 -0.1674313 0.4479957

> minimal.lm2   # slightly modified version of Josh's
function(y, x) {
  obj <- lm.fit(x = x, y = y)
  resvar <- sum(obj$residuals^2)/obj$df.residual
  p <- obj$rank
  R <- .Call("La_chol2inv", x = obj$qr$qr[1L:p, 1L:p, drop = FALSE],
             size = p, PACKAGE = "base")
  m <- min(dim(R))
  d <- c(R)[1L + 0L:(m - 1L) * (dim(R)[1L] + 1L)]
  se <- sqrt(d * resvar)
  list(coef = obj$coefficients, se)
}

Re: SLOW split() function

Joshua Wiley-2
Very nice!  I am quite impressed at how flexible data.table is.




--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/
