

I'm trying to speed up a script that otherwise takes days to handle larger
data sets. So, is there a way to completely vectorize or paralellize the
following script:
*# kfold cross validation*
df < trees # a data frame 'trees' from R.
df < df[sample(nrow(df)), ] # randomly shuffles the data.
k < 10 # Number of folds. Note k=nrow(df) in the leaveoneout cross
validation.
folds < cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
unique numbers for k equally size folds.
df$ID < folds # adds fold IDs.
df[paste("pred", 1:3, sep="")] < NA # adds multiple columns "pred1"
"pred2" "pred3" to speed up the following loop.
library(mgcv)
for(i in 1:k) {
# looping for different models:
m1 < gam(Volume ~ s(Height), data=df, subset=(ID != i))
m2 < gam(Volume ~ s(Girth), data=df, subset=(ID != i))
m3 < gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
# looping for predictions:
df[df$ID==i, "pred1"] < predict(m1, df[df$ID==i, ], type="response")
df[df$ID==i, "pred2"] < predict(m2, df[df$ID==i, ], type="response")
df[df$ID==i, "pred3"] < predict(m3, df[df$ID==i, ], type="response")
}
# calculating residuals:
df$res1 < with(df, Volume  pred1)
df$res2 < with(df, Volume  pred2)
df$res3 < with(df, Volume  pred3)
Model < paste("m", 1:3, sep="") # creates a vector of model names.
# creating a vector of meansquare errors (MSE):
MSE < with(df, c(
sum(res1^2) / nrow(df),
sum(res2^2) / nrow(df),
sum(res3^2) / nrow(df)
))
model.mse < data.frame(Model, MSE) # creates a data frame of model names
and meansquare errors.
model.mse < model.mse[order(model.mse$MSE), ] # rearranges the previous
data frame in order of increasing meansquare errors.
I'd appreciate any help. This code takes several days if run on >=30,000
different GAM models and 3 predictors. Could you please help with
rewriting the script into sapply() or foreach()/doParallel format?
Thanks
Lexo
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


See inline.
> On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili < [hidden email]> wrote:
>
> I'm trying to speed up a script that otherwise takes days to handle larger
> data sets. So, is there a way to completely vectorize or paralellize the
> following script:
>
> *# kfold cross validation*
>
> df < trees # a data frame 'trees' from R.
> df < df[sample(nrow(df)), ] # randomly shuffles the data.
> k < 10 # Number of folds. Note k=nrow(df) in the leaveoneout cross
> validation.
> folds < cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
> unique numbers for k equally size folds.
> df$ID < folds # adds fold IDs.
> df[paste("pred", 1:3, sep="")] < NA # adds multiple columns "pred1"
> "pred2" "pred3" to speed up the following loop.
>
> library(mgcv)
>
Rprof()
replicate(100, {
> for(i in 1:k) {
> # looping for different models:
> m1 < gam(Volume ~ s(Height), data=df, subset=(ID != i))
> m2 < gam(Volume ~ s(Girth), data=df, subset=(ID != i))
> m3 < gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
>
> # looping for predictions:
> df[df$ID==i, "pred1"] < predict(m1, df[df$ID==i, ], type="response")
> df[df$ID==i, "pred2"] < predict(m2, df[df$ID==i, ], type="response")
> df[df$ID==i, "pred3"] < predict(m3, df[df$ID==i, ], type="response")
> }
>
})
Rprof(NULL)
summaryRprof()
## read ?Rprof to get a sense of what it does
## read the summary to determine where time is being spent.
## the result was surprising to me. YMMV.
## there may be redundancies that you can eliminate by
##  doing the setup within gam() one time and saving it
##  calling the worker functions by modifying the setup
## in a loop or function and saving the results
> # calculating residuals:
> df$res1 < with(df, Volume  pred1)
> df$res2 < with(df, Volume  pred2)
> df$res3 < with(df, Volume  pred3)
>
> Model < paste("m", 1:3, sep="") # creates a vector of model names.
>
> # creating a vector of meansquare errors (MSE):
> MSE < with(df, c(
> sum(res1^2) / nrow(df),
> sum(res2^2) / nrow(df),
> sum(res3^2) / nrow(df)
> ))
>
> model.mse < data.frame(Model, MSE) # creates a data frame of model names
> and meansquare errors.
> model.mse < model.mse[order(model.mse$MSE), ] # rearranges the previous
> data frame in order of increasing meansquare errors.
>
> I'd appreciate any help. This code takes several days if run on >=30,000
> different GAM models and 3 predictors. Could you please help with
> rewriting the script into sapply() or foreach()/doParallel format?
>
This is something you should learn to do. It is pretty standard practice. Use the body of your for loop as the body of a function, add arguments, and create a suitable return value. The something like
lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...)
should work. If it does, then parallel::mclapply(...) should also work.
HTH,
Chuck
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Charles writes about saving execution time by eliminating redundancies.
If you see redundancies related to calling a timeconsuming function
multiple times with the same arguments, a very easy way to speed up your
program is to memoise the functions using the package memoise.
HTH,
Eric
On Wed, Jan 23, 2019 at 8:34 PM Berry, Charles < [hidden email]> wrote:
> See inline.
>
> > On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili <
> [hidden email]> wrote:
> >
> > I'm trying to speed up a script that otherwise takes days to handle
> larger
> > data sets. So, is there a way to completely vectorize or paralellize the
> > following script:
> >
> > *# kfold cross validation*
> >
> > df < trees # a data frame 'trees' from R.
> > df < df[sample(nrow(df)), ] # randomly shuffles the data.
> > k < 10 # Number of folds. Note k=nrow(df) in the leaveoneout cross
> > validation.
> > folds < cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
> > unique numbers for k equally size folds.
> > df$ID < folds # adds fold IDs.
> > df[paste("pred", 1:3, sep="")] < NA # adds multiple columns "pred1"
> > "pred2" "pred3" to speed up the following loop.
> >
> > library(mgcv)
> >
>
> Rprof()
>
> replicate(100, {
>
>
> > for(i in 1:k) {
> > # looping for different models:
> > m1 < gam(Volume ~ s(Height), data=df, subset=(ID != i))
> > m2 < gam(Volume ~ s(Girth), data=df, subset=(ID != i))
> > m3 < gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
> >
> > # looping for predictions:
> > df[df$ID==i, "pred1"] < predict(m1, df[df$ID==i, ], type="response")
> > df[df$ID==i, "pred2"] < predict(m2, df[df$ID==i, ], type="response")
> > df[df$ID==i, "pred3"] < predict(m3, df[df$ID==i, ], type="response")
> > }
> >
>
> })
>
> Rprof(NULL)
>
> summaryRprof()
>
> ## read ?Rprof to get a sense of what it does
>
> ## read the summary to determine where time is being spent.
>
> ## the result was surprising to me. YMMV.
>
> ## there may be redundancies that you can eliminate by
> ##  doing the setup within gam() one time and saving it
> ##  calling the worker functions by modifying the setup
> ## in a loop or function and saving the results
>
>
> > # calculating residuals:
> > df$res1 < with(df, Volume  pred1)
> > df$res2 < with(df, Volume  pred2)
> > df$res3 < with(df, Volume  pred3)
> >
> > Model < paste("m", 1:3, sep="") # creates a vector of model names.
> >
> > # creating a vector of meansquare errors (MSE):
> > MSE < with(df, c(
> > sum(res1^2) / nrow(df),
> > sum(res2^2) / nrow(df),
> > sum(res3^2) / nrow(df)
> > ))
> >
> > model.mse < data.frame(Model, MSE) # creates a data frame of model names
> > and meansquare errors.
> > model.mse < model.mse[order(model.mse$MSE), ] # rearranges the previous
> > data frame in order of increasing meansquare errors.
> >
> > I'd appreciate any help. This code takes several days if run on >=30,000
> > different GAM models and 3 predictors. Could you please help with
> > rewriting the script into sapply() or foreach()/doParallel format?
> >
>
> This is something you should learn to do. It is pretty standard practice.
> Use the body of your for loop as the body of a function, add arguments, and
> create a suitable return value. The something like
>
> lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...)
>
> should work. If it does, then parallel::mclapply(...) should also work.
>
> HTH,
>
> Chuck
>
>
> ______________________________________________
> [hidden email] mailing list  To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/rhelp> PLEASE do read the posting guide
> http://www.Rproject.org/postingguide.html> and provide commented, minimal, selfcontained, reproducible code.
>
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.


Before posting on the rhelp list I did run Rprof(). In my posting I asked
for help with rewriting the specific script into sapply() or
foreach()/doParallel format.
Thanks anyway for your time and suggestions,
Lexo
On Thu, Jan 24, 2019 at 12:38 AM Eric Berger < [hidden email]> wrote:
> Charles writes about saving execution time by eliminating redundancies.
> If you see redundancies related to calling a timeconsuming function
> multiple times with the same arguments, a very easy way to speed up your
> program is to memoise the functions using the package memoise.
>
> HTH,
> Eric
>
>
>
>
> On Wed, Jan 23, 2019 at 8:34 PM Berry, Charles < [hidden email]> wrote:
>
>> See inline.
>>
>> > On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili <
>> [hidden email]> wrote:
>> >
>> > I'm trying to speed up a script that otherwise takes days to handle
>> larger
>> > data sets. So, is there a way to completely vectorize or paralellize the
>> > following script:
>> >
>> > *# kfold cross validation*
>> >
>> > df < trees # a data frame 'trees' from R.
>> > df < df[sample(nrow(df)), ] # randomly shuffles the data.
>> > k < 10 # Number of folds. Note k=nrow(df) in the leaveoneout cross
>> > validation.
>> > folds < cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
>> > unique numbers for k equally size folds.
>> > df$ID < folds # adds fold IDs.
>> > df[paste("pred", 1:3, sep="")] < NA # adds multiple columns "pred1"
>> > "pred2" "pred3" to speed up the following loop.
>> >
>> > library(mgcv)
>> >
>>
>> Rprof()
>>
>> replicate(100, {
>>
>>
>> > for(i in 1:k) {
>> > # looping for different models:
>> > m1 < gam(Volume ~ s(Height), data=df, subset=(ID != i))
>> > m2 < gam(Volume ~ s(Girth), data=df, subset=(ID != i))
>> > m3 < gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
>> >
>> > # looping for predictions:
>> > df[df$ID==i, "pred1"] < predict(m1, df[df$ID==i, ], type="response")
>> > df[df$ID==i, "pred2"] < predict(m2, df[df$ID==i, ], type="response")
>> > df[df$ID==i, "pred3"] < predict(m3, df[df$ID==i, ], type="response")
>> > }
>> >
>>
>> })
>>
>> Rprof(NULL)
>>
>> summaryRprof()
>>
>> ## read ?Rprof to get a sense of what it does
>>
>> ## read the summary to determine where time is being spent.
>>
>> ## the result was surprising to me. YMMV.
>>
>> ## there may be redundancies that you can eliminate by
>> ##  doing the setup within gam() one time and saving it
>> ##  calling the worker functions by modifying the setup
>> ## in a loop or function and saving the results
>>
>>
>> > # calculating residuals:
>> > df$res1 < with(df, Volume  pred1)
>> > df$res2 < with(df, Volume  pred2)
>> > df$res3 < with(df, Volume  pred3)
>> >
>> > Model < paste("m", 1:3, sep="") # creates a vector of model names.
>> >
>> > # creating a vector of meansquare errors (MSE):
>> > MSE < with(df, c(
>> > sum(res1^2) / nrow(df),
>> > sum(res2^2) / nrow(df),
>> > sum(res3^2) / nrow(df)
>> > ))
>> >
>> > model.mse < data.frame(Model, MSE) # creates a data frame of model
>> names
>> > and meansquare errors.
>> > model.mse < model.mse[order(model.mse$MSE), ] # rearranges the previous
>> > data frame in order of increasing meansquare errors.
>> >
>> > I'd appreciate any help. This code takes several days if run on >=30,000
>> > different GAM models and 3 predictors. Could you please help with
>> > rewriting the script into sapply() or foreach()/doParallel format?
>> >
>>
>> This is something you should learn to do. It is pretty standard practice.
>> Use the body of your for loop as the body of a function, add arguments, and
>> create a suitable return value. The something like
>>
>> lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...)
>>
>> should work. If it does, then parallel::mclapply(...) should also work.
>>
>> HTH,
>>
>> Chuck
>>
>>
>> ______________________________________________
>> [hidden email] mailing list  To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/rhelp>> PLEASE do read the posting guide
>> http://www.Rproject.org/postingguide.html>> and provide commented, minimal, selfcontained, reproducible code.
>>
>
[[alternative HTML version deleted]]
______________________________________________
[hidden email] mailing list  To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/rhelpPLEASE do read the posting guide http://www.Rproject.org/postingguide.htmland provide commented, minimal, selfcontained, reproducible code.

