Speed-up/Cache loadNamespace()


Speed-up/Cache loadNamespace()

Mario Annau
Dear all,

In our current setting we have our packages stored on a (rather slow)
network drive and need to invoke short R scripts (using Rscript) in a
timely manner. Most of a script's runtime is spent on package loading
via library() (or loadNamespace(), to be precise).

Is there a way to cache the package namespaces as listed in
loadedNamespaces() and load them into memory before the script is executed?

My first simplistic attempt was to serialize the environment output
from loadNamespace() to a file and load it before the script is started.
However, loading the object automatically also loads all the referenced
namespaces (from the slow network share) which is undesirable for this use
case.

Cheers,
Mario


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Speed-up/Cache loadNamespace()

Duncan Murdoch
On 19/07/2020 11:50 a.m., Mario Annau wrote:

> [...]

I don't think there is, but I doubt it would help much.
loadNamespace() will be slow if loading the package is slow, and you
can't avoid doing that at least once.  (If you call loadNamespace()
twice on the same package, the second call does nothing and is really
quick.)

I think the only saving you might get is the effort of merging various
tables (e.g. the ones for dispatching S3 and S4 methods), and I wouldn't
think that would take a really substantial amount of time.

One thing you could do is to create a library on a faster drive, and
install the minimal set of packages there.  Then if that library comes
first in .libPaths(), you'll never hit the slow network drive.
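A minimal sketch of this setup, assuming a POSIX shell; the library path and package name below are placeholders for illustration, not anything prescribed in the thread:

```shell
# Create a package library on fast local disk.
FASTLIB="${TMPDIR:-/tmp}/fast-rlib"
mkdir -p "$FASTLIB"

# One-time step: install the minimal set of packages into it, e.g.
#   Rscript -e 'install.packages("data.table", lib = "'"$FASTLIB"'")'

# Per run: putting the local library first in R_LIBS makes it the first
# entry of .libPaths(), so lookups hit local disk before the share, e.g.
#   R_LIBS="$FASTLIB" Rscript myscript.R

echo "local library: $FASTLIB"
```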

Duncan Murdoch


Re: Speed-up/Cache loadNamespace()

Hugh Parsonage
In reply to this post by Mario Annau
My advice would be to avoid the network in one of the following ways:

1. Store installed packages on your local drive
2. Copy the installed packages to a tempdir on your local drive each time
the script is executed
3. Keep an R session running in perpetuity and source the scripts within
that everlasting session
4. Rewrite your scripts to use base R only.

I suspect this solution list is exhaustive.

On Mon, 20 Jul 2020 at 1:50 am, Mario Annau <[hidden email]> wrote:

> [...]


Re: Speed-up/Cache loadNamespace()

Mario Annau
Thanks for the quick responses. As you both suggested, storing the packages
on a local drive is feasible, but it comes with a size restriction I wanted
to avoid. I'll keep this in mind as plan B.
@Hugh: 2. would impose even greater slowdowns and 4. is just not feasible.
However, 3. sounds interesting - how would this work in a Linux environment?

Thank you,
Mario


Am So., 19. Juli 2020 um 20:11 Uhr schrieb Hugh Parsonage <
[hidden email]>:

> [...]


Re: Speed-up/Cache loadNamespace()

Simon Urbanek
Mario,

On Unix, if you use Rserve you can pre-load all packages in the server (via the eval config directive, or by running Rserve::run.Rserve() from a session that has everything loaded) and all client connections will have the packages already loaded and available* immediately. You could replace the Rscript call with a very tiny Rserve client program which just calls source(""). I can give you more details if you're interested.

Cheers,
Simon


* - there are some packages that are inherently incompatible with fork(), e.g. you cannot fork a Java JVM or open connections.


> On Jul 20, 2020, at 6:47 AM, Mario Annau <[hidden email]> wrote:
>
> [...]


Re: Speed-up/Cache loadNamespace()

Dirk Eddelbuettel
In reply to this post by Mario Annau

On 19 July 2020 at 20:47, Mario Annau wrote:
| Am So., 19. Juli 2020 um 20:11 Uhr schrieb Hugh Parsonage <
| [hidden email]>:
| > 3. Keep an R session running in perpetuity and source the scripts within
| > that everlasting session
| However, 3. sounds interesting - how would this work in a Linux environment?

Rserve by Simon has been available for close to 20 years. There isn't
much in terms of fancy docs, but it has been widely used. In essence, R
runs "headless" and you connect to it (think "telnet" or "ssh", but
programmatically), fire off requests and get results with zero startup
latency.  But it is more work to build the access layer.

And Rserve is also underneath RestRserve, which allows you to query a
running server via REST / modern web stack tech. (Think "plumber", but
in C++ and faster / more scalable.)

Lastly, there is Jeroen's OpenCPU.

Dirk

--
https://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]


Re: Speed-up/Cache loadNamespace()

Tobias Verbeke
----- Original Message -----
> From: "Dirk Eddelbuettel" <[hidden email]>
> To: "Mario Annau" <[hidden email]>
> Cc: "[hidden email]" <[hidden email]>
> Sent: Sunday, July 19, 2020 10:09:24 PM
> Subject: Re: [Rd] Speed-up/Cache loadNamespace()

> [...]
>
> Lastly, there is Jeroen's OpenCPU.

Or... lastly, the R Service Bus which has been used in production since 2010 and got a maintenance release (6.4.0) last week:

https://rservicebus.io/

For REST (both asynchronous and synchronous APIs are available), you can start here: https://rservicebus.io/api/introduction/

Best,
Tobias


Re: Speed-up/Cache loadNamespace()

Abby Spurdle
In reply to this post by Mario Annau
It's possible to run R (or a C parent process) as a background process
via a named pipe, and then write script files to the named pipe.
However, the details depend on what shell you use.

The last time I tried (which was a long time ago), I created a small C
program to run R, read from the named pipe within C, and wrote its
contents to R's standard input.

It might be possible to do this without the C program.
I haven't checked.


On Mon, Jul 20, 2020 at 3:50 AM Mario Annau <[hidden email]> wrote:

>
> Dear all,
>
> in our current setting we have our packages stored on a (rather slow)
> network drive and need to invoke short R scripts (using RScript) in a
> timely manner. Most of the script's runtime is spent with package loading
> using library() (or loadNamespace to be precise).
>
> Is there a way to cache the package namespaces as listed in
> loadedNamespaces() and load them into memory before the script is executed?
>
> My first simplistic attempt was to serialize the environment output
> from loadNamespace() to a file and load it before the script is started.
> However, loading the object automatically also loads all the referenced
> namespaces (from the slow network share) which is undesirable for this use
> case.
>
> Cheers,
> Mario
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


Re: Speed-up/Cache loadNamespace()

Gábor Csárdi
On Mon, Jul 20, 2020 at 9:15 AM Abby Spurdle <[hidden email]> wrote:
>
> It's possible to run R (or a c parent process) as a background process
> via a named pipe, and then write script files to the named pipe.
> However, the details depend on what shell you use.

I would use screen or tmux for this, if this is an R process that you
want to interact with, and you want to keep it running after a SIGHUP.

Gabor

[...]


Re: Speed-up/Cache loadNamespace()

Sokol Serguei
In reply to this post by Abby Spurdle
On 20/07/2020 10:15, Abby Spurdle wrote:

> [...]
For testing purposes, you can do:

- in shell 1:
  mkfifo rpipe
  exec 3>rpipe  # without this trick, Rscript will end after the first
                # "echo" below, or at the end of your first script

- in shell 2:
  Rscript rpipe

- in shell 3:
  echo "print('hello')" > rpipe
  echo "print('hello again')" > rpipe

Then in shell 2, you will see the output:
[1] "hello"
[1] "hello again"
etc.

If your R scripts contain stop() or q('yes'), or hit any other error, the
Rscript process will end. A kind of watchdog can be set up to relaunch it
automatically if needed. Another way to stop the Rscript process is to
kill the "exec 3>rpipe" one; you can find its PID with "fuser rpipe".
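The watchdog idea can be sketched in shell. Here `worker` is a dummy stand-in for `Rscript rpipe` that pretends to crash on its first two runs, purely so the sketch is runnable as-is:

```shell
# Relaunch a worker command whenever it exits abnormally.
attempts=0
worker() {
  attempts=$((attempts + 1))
  # Stand-in for `Rscript rpipe`: fail on the first two runs.
  [ "$attempts" -ge 3 ]
}

until worker; do
  echo "worker exited; relaunching (attempt $attempts)" >&2
  # A real watchdog would back off here, e.g. sleep 1.
done
echo "worker stayed up after $attempts attempts"
```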

Best,
Serguei.

> [...]


--
Serguei Sokol
Ingenieur de recherche INRAE

Cellule mathématiques
TBI, INSA/INRAE UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04

tel: +33 5 61 55 98 49
email: [hidden email]
http://www.toulouse-biotechnology-institute.fr/


Re: Speed-up/Cache loadNamespace()

Abby Spurdle
Thank you Serguei and Gabor.
Great suggestions.

> If your R scripts contain "stop()" or "q('yes')" or any other error, it
> will end the Rscript process. Kind of watch-dog can be set for automatic
> relaunching if needed.

It should be possible to change the error handling behavior.
From within R:

    options(error = function() NULL)

Or something better...

Also, it may be desirable to wipe the global environment (or parts of
it), after each script:

    remove(list = ls(envir = .GlobalEnv, all.names = TRUE))


Re: Speed-up/Cache loadNamespace()

Gabriel Becker
Mario, Abby, et al.

Note that there is no fully safe way of unloading packages which register
methods (as answered by Luke Tierney here:
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16644), which makes
running arbitrary different scripts in a single long-lived R session
pretty iffy over the long term. Even switchr (which tries hard to support
something based on this) only gets "pretty close".

If the scripts are always the same (up to bugfixes, etc) and most
importantly require the same loaded packages then the above won't be an
issue, of course. Just something to be aware of when planning something
like this.

Best,
~G

On Mon, Jul 20, 2020 at 2:59 PM Abby Spurdle <[hidden email]> wrote:

> [...]
