Problem parallelizing across cores


Problem parallelizing across cores

spottiswoode
Hi All,

I have a piece of well-optimized R code for text analysis running under
Linux on an AWS instance.  The code first loads a number of packages and
some needed data; the actual analysis is done by a function called, say,
f(string).  I would like to parallelize calls to this function across the
instance's 8 cores to increase throughput.  I have looked at the doParallel
and future packages but am not clear how to do this.  Any method that
starts a fresh R instance each time the function is called will not work
for me, since the time to load the packages and data is comparable to the
execution time of the function, leaving no speed-up.  I therefore need to
keep a number of R processes running continuously, so that the data loading
happens only once, when the processes first start, and thereafter f(string)
is ready to run in each of them.  I hope I have put this clearly.
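Roughly, the shape of my script is as below.  The package and data names
here are placeholders, and f() is a toy stand-in for the real,
short-running analysis function:

```r
## Placeholders for the real setup: the actual packages, data, and f()
## are not shown here.
# library(somePackage)            # the slow one-time package loads go here
bigData <- c(cat = 1, dog = 2)    # stand-in for the slow one-time data load

f <- function(string) {
  words <- strsplit(tolower(string), "\\s+")[[1]]
  sum(bigData[words], na.rm = TRUE)  # toy "analysis" using the loaded data
}

## This serial loop over incoming strings is what I want spread
## across the 8 cores:
results <- vapply(c("cat dog", "dog", "bird"), f, numeric(1))
```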

I’d much appreciate any suggestions.  Thanks in advance,

James Spottiswoode



______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Problem parallelizing across cores

Bert Gunter-2
I would suggest that you search on "parallel computing" at the Rseek.org
site. That brings up many relevant hits, including, of course, the CRAN
High-Performance and Parallel Computing task view.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Aug 28, 2019 at 3:18 PM James Spottiswoode <[hidden email]> wrote:

> [original message quoted in full; trimmed]


Re: Problem parallelizing across cores

James Spottiswoode


Hi Bert,

Thanks for your advice.  Actually I've already done this and have checked
out the doParallel and future packages.  The trouble with doParallel is
that it forks R processes which spend a lot of time loading data and
packages, whereas my function runs in about 100 ms, so the parallelization
doesn't help.  The future package keeps its children running, but I haven't
figured out how to get it to work in my application.

Best -- James
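A sketch of the kind of persistent-worker setup meant here, using future
together with the future.apply package; f() and the data below are toy
stand-ins for the real ones, which are not shown in this thread:

```r
library(future)
library(future.apply)

bigData <- c(cat = 1, dog = 2)  # stand-in for the real data
f <- function(string) {
  sum(bigData[strsplit(string, "\\s+")[[1]]], na.rm = TRUE)
}

## multisession starts the background R sessions once and keeps them
## alive, so they are reused across calls rather than respawned each
## time; packages attached in a worker stay attached for later calls.
plan(multisession, workers = 2)  # use 8 on the real instance

## future ships the needed globals (here f and bigData) to the workers
## automatically:
res <- future_lapply(c("cat dog", "dog"), f)

plan(sequential)  # shuts the workers down when done
unlist(res)
```

Note that globals are still shipped per call, so this helps most when the
dominant one-time cost is package loading rather than moving large data.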
> On Aug 28, 2019, at 3:39 PM, Bert Gunter <[hidden email]> wrote:
>
> [quoted text trimmed]

James Spottiswoode
Applied Mathematics & Statistics
(310) 270 6220
jamesspottiswoode Skype
[hidden email]



Re: Problem parallelizing across cores

Jeff Newmiller
In reply to this post by spottiswoode
Your first option is always to compute results serially; when the
computation time is long compared to session overhead and data I/O, you can
consider parallel computing. First lay out your independent units of work
as a sequence, then allocate segments of that sequence to your workers,
each of which performs its sub-sequence serially so as to minimize the
overhead penalty. So yes, you absolutely can use a method that starts new
instances of R ("SNOW"): the workers are started once and then reused for
every segment. On Linux you also have forking, which has lower overhead,
but not zero, so exactly the same logic applies. But if you cut your serial
computations into pieces that are too small, you cannot make good use of
the available processing resources, as you have already observed.
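A sketch of that pattern with the base parallel package: start the workers
once, do the one-time loading in each worker, then hand the workers
segments of the work. f() and the data here are toy stand-ins for the real
ones, which were not posted:

```r
library(parallel)

## Persistent PSOCK workers, started once (use 8 on the real instance;
## on Linux, makeForkCluster() instead inherits the parent session's
## already-loaded packages and data):
cl <- makeCluster(2)

## One-time setup, paid once per worker, not once per call:
invisible(clusterEvalQ(cl, {
  # library(somePackage)          # the slow package loads go here
  bigData <- c(cat = 1, dog = 2)  # stand-in for the slow data load
  f <- function(string) {
    sum(bigData[strsplit(string, "\\s+")[[1]]], na.rm = TRUE)
  }
  NULL
}))

## parLapplyLB splits the inputs into segments and load-balances them
## across the already-running workers; f is looked up on each worker:
strings <- c("cat dog", "dog", "cat")
res <- unlist(parLapplyLB(cl, strings, function(s) f(s)))

stopCluster(cl)
```

Each call after startup then only pays the cost of shipping the strings
and collecting the results.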

However, your lack of a reproducible example is a strong indicator that you
are not really asking a question about R... so do some reading and focus
your next question on R or the base R parallel package, per the Posting
Guide. (Do read that... posting HTML is a good way for your message to get
scrambled before we see it.) Wide-ranging discussions of computer science
and HPC hardware constraints are off topic here.

On August 28, 2019 11:06:57 AM PDT, James Spottiswoode <[hidden email]> wrote:

> [original message quoted in full; trimmed]

--
Sent from my phone. Please excuse my brevity.
