Re: [R-pkg-devel] Run garbage collector when too many open files

Re: [R-pkg-devel] Run garbage collector when too many open files

Jan van der Laan
Dear Uwe,

(When replying to your message, I sent the reply to r-devel and not
r-package-devel, as Martin Maechler suggested that this thread would be
a better fit for r-devel.)

Thanks. In the example below I used rm() explicitly, but in general
users wouldn't do that.

One of the reasons for the large number of file handles is that
sometimes unnamed temporary objects are created. For example:

 > library(ldat)
 > library(lvec)
 >
 > a <- lvec(10, "integer")
OPENFILE '/tmp/RtmpVqkDsw/file32145169fb06/lvec3214753f2af0'
 > b <- as_rvec(a[1:3])
OPENFILE '/tmp/RtmpVqkDsw/file32145169fb06/lvec32146a50f383'
OPENFILE '/tmp/RtmpVqkDsw/file32145169fb06/lvec3214484b652c'
 > print(b)
[1] 0 0 0
 >
 >
 > gc()
CLOSEFILE '/tmp/RtmpVqkDsw/file32145169fb06/lvec3214484b652c'
CLOSEFILE '/tmp/RtmpVqkDsw/file32145169fb06/lvec32146a50f383'
           used (Mb) gc trigger (Mb) max used (Mb)
Ncells  796936 42.6    1442291 77.1  1168576 62.5
Vcells 1519523 11.6    4356532 33.3  4740854 36.2


For debugging, I log when files are opened and closed. The call a[1:3]
(which creates a slice of a) creates two temporary objects [1]. These
are only deleted when I explicitly call gc(), or at some other
unpredictable moment.

I hope this illustrates the problem better.
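The lifetime issue can be reproduced without ldat/lvec at all. A minimal sketch in base R (make_handle is a hypothetical illustration, not package code) ties a temporary file to an object with reg.finalizer, the same mechanism the package uses for its handles, so the file is only removed once the collector actually runs:

```r
# Illustration only: tie a temp file's lifetime to an R object via a
# finalizer; the file survives rm() and disappears only at the next gc().
make_handle <- function() {
  path <- tempfile()
  file.create(path)
  env <- new.env()
  env$path <- path
  # The finalizer runs only when 'env' is garbage collected
  reg.finalizer(env, function(e) unlink(e$path))
  env
}

h <- make_handle()
p <- h$path
rm(h)            # object unreachable, but the file still exists...
file.exists(p)   # TRUE until the collector runs
gc()
file.exists(p)   # FALSE: the finalizer ran during collection
```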


Best,
Jan


[1] One improvement would be to create fewer temporary files; often these
contain very little information that would be better kept in memory. But
that is only a partial solution.
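The footnote's idea of keeping small objects in memory could be sketched as a simple size threshold (make_vec and the "filevec" class are hypothetical, not ldat/lvec code; a plain file stands in for a memory map):

```r
# Hypothetical sketch: back small vectors with plain memory (no file
# handle), and only large ones with a temporary file.
make_vec <- function(n, threshold = 1e6) {
  if (n <= threshold) {
    integer(n)                      # in-memory vector, no file involved
  } else {
    path <- tempfile()
    writeBin(integer(n), path)      # file-backed stand-in for a memory map
    structure(list(path = path, n = n), class = "filevec")
  }
}
```

This removes file handles for the common small-slice case, but as noted above it is only a partial solution: large objects still need files.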




On 07-08-18 15:24, Uwe Ligges wrote:

> Why not add functionality that allows deleting the object and running cleanup code?
>
> Best,
> Uwe Ligges
>
>
>
> On 07.08.2018 14:26, Jan van der Laan wrote:
>>
>>
>> In my package I open handles to temporary files from C++; handles to
>> them are returned to R through vptr objects. The files are deleted
>> when the corresponding R object is deleted and the garbage collector
>> runs:
>>
>> a <- lvec(10, "integer")
>> rm(a)
>>
>> Then, when the garbage collector runs, the file is deleted. However, on
>> some platforms (probably with lower limits on the maximum number of
>> file handles a process can have open), I run into the problem that the
>> garbage collector doesn't run often enough. In this case that means
>> that another package of mine using this package generates an error
>> when its tests are run.
>>
>> The simplest solution is to add some calls to gc() in my tests. But a
>> more general/automatic solution would be nice.
>>
>> I thought about something along the lines of
>>
>> robust_lvec <- function(...) {
>>    tryCatch({
>>      lvec(...)
>>    }, error = function(e) {
>>      gc()
>>      lvec(...) # duplicated code
>>    })
>> }
>>
>> i.e., try to open a file; when that fails, call the garbage collector
>> and try again. However, this introduces duplicated code (in this case
>> only one line, but it can be more), and it doesn't help if another
>> function tries to open a file.
>>
>> Is there a better solution?
>>
>> Thanks!
>>
>> Jan
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-package-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [R-pkg-devel] Run garbage collector when too many open files

luke-tierney
In R 3.5 and later you should not need to call gc() -- that should
happen automatically within the connections code.

Nevertheless, I would recommend redesigning your approach to avoid
hanging onto open file connections as these are a scarce resource.
You can keep around your temporary files without having them open and
only open/close them on access, with the close run in an on.exit or a
tryCatch/finally clause.
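That access pattern, sketched in plain R (read_chunk and write_chunk are hypothetical helpers; the package would do the equivalent at the C++ level), keeps only the path around and holds a handle just for the duration of one access, with on.exit guaranteeing the close even on error:

```r
# Sketch of open-on-access: store only the file path; open a connection
# per access and close it again via on.exit, even if an error occurs.
read_chunk <- function(path, n) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  readBin(con, what = "integer", n = n)
}

write_chunk <- function(path, values) {
  con <- file(path, open = "ab")   # append in binary mode
  on.exit(close(con), add = TRUE)
  writeBin(as.integer(values), con)
}
```

Between accesses no file handle is held at all, so the process-wide limit is never approached no matter how many objects exist.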

Best,

luke


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

Re: Run garbage collector when too many open files

Jan van der Laan
Dear Luke,


Thanks. See below


On 07-08-18 17:07, [hidden email] wrote:
> In R 3.5 and later you should not need to gc() -- that should happen
> automatically within the connections code.

Could you elaborate on what has changed in R 3.5? As far as I can tell
my problem also occurs in R 3.5 (my computer is still on 3.4.4, but I
assume the Solaris CRAN machine isn't). And what do you mean by 'the
connections code'? Is there something I can or should do to have the
garbage collector be a bit more aggressive in cleaning up my mess?

>
> Nevertheless, I would recommend redesigning your approach to avoid
> hanging onto open file connections as these are a scarce resource.
> You can keep around your temporary files without having them open and
> only open/close them on access, with the close run in an on.exit or a
> tryCatch/finally clause.

I am afraid that this will have a large performance penalty. The files
in question are memory-mapped files from which code will be reading and
writing continuously in most cases. Of course, there will probably be
objects that are not used for long periods of time that could be
temporarily closed, but it will be difficult for the package to detect
which objects those are. I would have to write my own 'garbage
collector'.


Best,
Jan



Re: Run garbage collector when too many open files

luke-tierney
On Tue, 7 Aug 2018, Jan van der Laan wrote:

> Dear Luke,
>
>
> Thanks. See below
>
>
> On 07-08-18 17:07, [hidden email] wrote:
>> In R 3.5 and later you should not need to gc() -- that should happen
>> automatically within the connections code.
>
> Could you elaborate on what has changed in R 3.5? As far as I can tell my
> problem also occurs in R 3.5 (my computer is still on 3.4.4; but I assume the
> solaris CRAN machine isn't). And what do you mean with 'the connections
> code'? Is there something I can so/should do to have the garbage collector be
> a bit more aggressive in cleaning up my mess?

If you are not opening files through R connections, then this is not relevant.

If you are opening files on your own via C or C++ level calls then it
is a good idea to run gc if there is a failure -- that is what the
connections code does. I would put this logic at a low level, inside
lvec and such, before signaling an error. But this should be a
fall-back. You might be starving other libraries of file handles.
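Applied generically at the R level, the retry-after-gc fallback can be written once instead of duplicating each call as in the robust_lvec example earlier in the thread (with_gc_retry is a hypothetical helper, not part of any package):

```r
# Evaluate an expression; on error, run the garbage collector once
# (which may release stale handles via finalizers) and retry.
with_gc_retry <- function(expr) {
  expr <- substitute(expr)         # capture the call unevaluated
  env <- parent.frame()
  tryCatch(eval(expr, env), error = function(e) {
    gc()
    eval(expr, env)                # retry once after collection
  })
}

# Usage: robust_lvec <- function(...) with_gc_retry(lvec(...))
```

As Luke suggests, the real place for this logic is at a low level inside the package's own open calls, so every caller benefits; a wrapper like this only guards the calls it is applied to.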

>> Nevertheless, I would recommend redesigning your approach to avoid
>> hanging onto open file connections as these are a scarce resource.
>> You can keep around your temporary files without having them open and
>> only open/close them on access, with the close run in an on.exit or a
>> tryCatch/finally clause.
>
> I am afraid that this will have a large performance penalty.

Have you done enough profiling to be sure this is true, in particular
for realistic usage, not small toy examples? This would be the
cleanest design. You could also maintain a small cache of open files,
but that is more work to implement.
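The "small cache of open files" could look roughly like this (a hypothetical sketch, ignoring locking and error handling): cap the number of simultaneously open connections and evict the least recently used one when the cap is hit.

```r
# Hypothetical sketch: keep at most 'max_open' connections open,
# closing the least recently used one when a new file must be opened.
open_files <- list()   # path -> connection, ordered oldest to newest
max_open   <- 8

get_con <- function(path) {
  if (!is.null(open_files[[path]])) {
    con <- open_files[[path]]
    open_files[[path]] <<- NULL        # drop so it can be re-added as newest
  } else {
    if (length(open_files) >= max_open) {
      close(open_files[[1]])           # evict least recently used
      open_files[[1]] <<- NULL
    }
    con <- file(path, open = "r+b")
  }
  open_files[[path]] <<- con           # append as most recently used
  con
}
```

This keeps hot files mapped while bounding handle usage, at the cost of the extra bookkeeping Luke mentions.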

Best,

luke
