stopping finalizers

9 messages

Thomas Lumley-2
Is there some way to prevent finalizers running during a section of code?

I have a package that includes R objects linked to database tables.  To
maintain the call-by-value semantics, tables are copied rather than
modified, and the extra tables are removed by finalizers during garbage
collection.

However, if the garbage collection occurs in the middle of processing
another SQL query (which is relatively likely, since that's where the
memory allocations are) there are problems with the database interface.

Since the guarantees for the finalizer are "at most once, not before the
object is out of scope" it seems harmless to be able to prevent finalizers
from running during a particular code block, but I can't see any way to do
it.

Suggestions?

    -thomas


--
Thomas Lumley
Professor of Biostatistics
University of Auckland


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: stopping finalizers

luke-tierney
It might help if you could be more specific about what the issue is --
if they are out of scope why does it matter whether the finalizers
run?

Generically two approaches I can think of:

     you keep track of when it is safe to fully run your finalizers and have
     your finalizers put the objects on a linked list if it isn't safe to
     run the finalizer now and clear the list each time you make a new one

     keep track of your objects with a weak list and turn them into strong
     references before your calls, then drop the list after.

I'm pretty sure we don't have a mechanism for temporarily suspending
running the finalizers but it is probably fairly easy to add if that
is the only option.

I might be able to think of other options with more details on the
issue.

Best,

luke

On Tue, 12 Feb 2013, Thomas Lumley wrote:


--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


Re: stopping finalizers

Antonio, Fabio Di Narzo
In reply to this post by Thomas Lumley-2
I'm not sure I got your problem right, but you can keep a named copy
of your not-to-be-finalized object as long as it needs to be around,
so it doesn't go out of scope too early.

Something like:
local({
   dontFinalizeMe <- obj  # a strong reference keeps 'obj' alive for this block
   ##
   ## code which creates copies, overwrites, and indirectly uses 'obj'
   ##
})

hth,
--
Antonio, Fabio Di Narzo,
Biostatistician
Mount Sinai School of Medicine, NY.

2013/2/12 Thomas Lumley <[hidden email]>:



Re: stopping finalizers

Thomas Lumley-2
In reply to this post by luke-tierney
Luke,

We're actually adopting the first of your generic approaches.

As a more concrete description:

There are R objects representing survey data sets, with the data stored in
a database table.  The subset() method, when applied to these objects,
creates a new table indicating which rows of the data table are in the
subset -- we don't modify the original table, because that breaks the
call-by-value semantics. When the subset object in R goes out of scope, we
need to delete the extra database table.

 I have been doing this with a finalizer on an environment that's part of
the subset object in R.   This all worked fine with JDBC, but the native
database interface requires that all communications with the database come
in send/receive pairs. Since R is single-threaded, this would normally not
be any issue. However, since garbage collection can happen at any time, it
is possible that the send part of the finalizer query "drop table
_sbs_whatever" comes between the send and receive of some other query, and
the database connection then falls over.   So, I'm happy for the finalizer
to run at any time except during a small critical section of R code.

In this particular case the finalizer only issues "drop table" queries, and
it doesn't need to know if they succeed, so we can keep a lock in the
database connection and just store any "drop table" queries that arrive
during a database operation for later execution.   More generally, though,
the fact that no R operation is atomic with respect to garbage collection
seems to make it a bit difficult to use finalizers -- if you need a
finalizer, it will often be in order to access and free some external
resource, which is when the race conditions can matter.
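The queue-behind-a-lock idea Thomas describes can be sketched at R level. This is a minimal illustration, not code from the package: all names (`.db_state`, `drop_table`, `with_db_lock`, `make_subset_handle`) are hypothetical, and `cat()` stands in for the real driver call.

```r
# Hypothetical sketch: defer "drop table" requests during a critical section.

.db_state <- new.env()
.db_state$busy <- FALSE          # TRUE while a send/receive pair is in flight
.db_state$pending <- character() # tables whose DROP a finalizer deferred

drop_table <- function(name) {
  if (.db_state$busy) {
    # a query is mid-flight: queue the drop instead of talking to the database
    .db_state$pending <- c(.db_state$pending, name)
  } else {
    cat("DROP TABLE", name, "\n")  # stand-in for the real database call
  }
}

with_db_lock <- function(expr) {
  .db_state$busy <- TRUE
  on.exit({
    .db_state$busy <- FALSE
    # flush drops that finalizers queued during the critical section
    pending <- .db_state$pending
    .db_state$pending <- character()
    for (nm in pending) drop_table(nm)
  })
  force(expr)
}

make_subset_handle <- function(table_name) {
  e <- new.env()
  e$table <- table_name
  reg.finalizer(e, function(env) drop_table(env$table))
  e
}
```

A finalizer firing inside with_db_lock() then merely appends to the queue; the actual DROP runs only after the send/receive pair has completed.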

What I was envisaging was something like

without_gc(expr)

to evaluate expr with the memory manager set to allocate memory (or attempt
to do so) without garbage collection.  Even better would be if gc could
run, but weak references were temporarily treated as strong so that garbage
without finalizers would be collected but finalizers didn't get triggered.
 Using this facility would be inefficient, because it would allocate more
memory than necessary and would also mess with the tuning of the garbage
collector,  but when communicating with other programs it seems it would be
very useful to have some way of running an R code block and knowing that no
other R code block would run during it (user interrupts are another issue,
but they can be caught, and in any case I'm happy to fail when the user
presses CTRL-C).
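For concreteness, use of the proposed primitive might look like this; note that neither without_gc() nor dbSend()/dbReceive() exist -- they are placeholders for the proposal and for the native driver's paired calls.

```r
# hypothetical API: without_gc() is the proposed primitive, not a real function
res <- without_gc({
  dbSend(con, "SELECT count(*) FROM _sbs_1")  # send half of the pair
  dbReceive(con)   # receive half; no finalizer can fire in between
})
```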

     -thomas




On Fri, Feb 15, 2013 at 12:53 AM, <[hidden email]> wrote:




--
Thomas Lumley
Professor of Biostatistics
University of Auckland


Re: stopping finalizers

Simon Urbanek
I would argue that addressing this at a generic R level is the wrong place (almost -- see below), because this is about *specific* finalizers. You don't mind running other people's finalizers, because they don't mess up your connection. Therefore I'd argue that the approach should primarily live in your DB driver, synchronizing the calls -- which is what you did by queueing.

In a sense a more general approach won't be any different: you will need at least a list of the *specific* objects whose finalizers should be deferred, and it lives either in your driver or in R. I'd argue the design is much easier in the driver, because you know which finalizers to register, whereas R has no concept of finalizer "classes" to group them. The driver also knows whether finalizers need to be deferred and, if they do, can easily process the deferred ones when you get out of the critical section.

Because of this difference between specific finalizers and GC in general, any R-side solution that doesn't register such finalizers in a special way will be inherently wasteful -- as you pointed out in the discussion below -- and thus I'd say more dangerous than helpful, because R could then lock itself out of memory even when that is unnecessary.

So IMHO the necessary practical way to solve this at R level (if someone wanted to spend the time) would be to create "bags" of finalizers and use those when defining critical regions, something like (pseudocode)

add_fin_bag("myDB", obj1)
# ...
add_fin_bag("myDB", obj2)

critical_fin_section_begin("myDB")
# ... here no finalizers for objects in the bag can be fired
critical_fin_section_end("myDB")

Technically, the simple solution would be to simply preserve the bag in the critical region. However, this would not guarantee that the finalizers get fired at the end of the section even if gc occurred. I suspect it would be harder to guarantee that (other than running gc explicitly or performing explicit de-allocation after the finalizer was detected to be scheduled but not fired).

Cheers,
Simon


On Feb 14, 2013, at 3:12 PM, Thomas Lumley wrote:



Re: stopping finalizers

hadley wickham
In reply to this post by Thomas Lumley-2
> There are R objects representing survey data sets, with the data stored in
> a database table.  The subset() method, when applied to these objects,
> creates a new table indicating which rows of the data table are in the
> subset -- we don't modify the original table, because that breaks the
> call-by-value semantics. When the subset object in R goes out of scope, we
> need to delete the extra database table.

Isn't subset slightly too early to do this?  It would be more efficient
for subset to return an object that creates the table only when you
first attempt to modify it.

Hadley

--
Chief Scientist, RStudio
http://had.co.nz/


Re: stopping finalizers

Thomas Lumley-2
On Fri, Feb 15, 2013 at 3:35 PM, Hadley Wickham <[hidden email]> wrote:


The subset table isn't a copy of the subset, it contains the unique key and
an indicator column showing whether the element is in the subset.  I need
this even if the subset is never modified, so that I can join it to the
main table and use it in SQL 'where' conditions to get computations for the
right subset of the data.
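Concretely, such a subset table might be used along these lines. The table and column names here are invented for illustration (the thread only says that subset tables are named like _sbs_whatever):

```sql
-- 'survey' is the main data table; '_sbs_1' was created by subset() and holds
-- the unique key plus an indicator marking membership in the subset.
SELECT SUM(d.wt) AS weighted_total
FROM survey AS d
JOIN _sbs_1 AS s ON d.key = s.key
WHERE s.in_subset = 1;
```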

 The whole point of this new sqlsurvey package is that most of the
aggregation operations happen in the database rather than in R, which is
faster for very large data tables.  The use case is things like the
American Community Survey and the Nationwide Emergency Department
Subsample, with millions or tens of millions of records and quite a lot of
variables.  At this scale, loading stuff into memory isn't feasible on
commodity desktops and laptops, and even on computers with enough memory,
the database (MonetDB) is faster.

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland


Re: stopping finalizers

hadley wickham
> The subset table isn't a copy of the subset, it contains the unique key and
> an indicator column showing whether the element is in the subset.  I need
> this even if the subset is never modified, so that I can join it to the main
> table and use it in SQL 'where' conditions to get computations for the right
> subset of the data.

Cool - Is that faster than storing a column that just contains the
include indices?

>  The whole point of this new sqlsurvey package is that most of the
> aggregation operations happen in the database rather than in R, which is
> faster for very large data tables.  The use case is things like the American
> Community Survey and the Nationwide Emergency Department Subsample, with
> millions or tens of millions of records and quite a lot of variables.  At
> this scale, loading stuff into memory isn't feasible on commodity desktops
> and laptops, and even on computers with enough memory, the database
> (MonetDB) is faster.

Have you done any comparisons of MonetDB vs SQLite? I'm interested to
know how much faster it is. I'm working on a package
(https://github.com/hadley/dplyr) that compiles R data manipulation
expressions into other backends (e.g. SQL), and have been wondering if
it's worth supporting a column store like MonetDB.

Hadley

--
Chief Scientist, RStudio
http://had.co.nz/


Re: stopping finalizers

Thomas Lumley-2
On Sat, Feb 16, 2013 at 2:32 PM, Hadley Wickham <[hidden email]> wrote:

> > The subset table isn't a copy of the subset, it contains the unique key and
> > an indicator column showing whether the element is in the subset.  I need
> > this even if the subset is never modified, so that I can join it to the
> > main table and use it in SQL 'where' conditions to get computations for
> > the right subset of the data.
>
> Cool - Is that faster than storing a column that just contains the
> include indices?

I haven't tried -- I'm writing this so it doesn't modify database tables,
partly for safety and partly to reduce the privileges it needs to run.


> >  The whole point of this new sqlsurvey package is that most of the
> > aggregation operations happen in the database rather than in R, which is
> > faster for very large data tables.  The use case is things like the
> > American Community Survey and the Nationwide Emergency Department
> > Subsample, with millions or tens of millions of records and quite a lot
> > of variables.  At this scale, loading stuff into memory isn't feasible
> > on commodity desktops and laptops, and even on computers with enough
> > memory, the database (MonetDB) is faster.
>
> Have you done any comparisons of monetdb vs sqlite - I'm interested to
> know how much faster it is. I'm working on a package
> (https://github.com/hadley/dplyr) that compiles R data manipulation
> expressions into (e.g. SQL), and have been wondering if it's worth
> considering a column-store like monetdb.


It's enormously faster than SQLite for databases slightly larger than
physical memory.  I don't have measurements, but I started this project
using SQLite and it just wasn't fast enough to be worthwhile.  My guess is
that for the JOIN and SELECT SUM() ... GROUP BY operations I'm using it's
perhaps ten times faster.

For moderate-sized databases it's competitive with analysis in memory even
if you ignore the data loading time.

For example, using a data set already in memory, with 18000 records and 96
variables:

> system.time(svymean(~BPXSAR+BPXDAR,subset(dhanes,RIAGENDR==2)))
   user  system elapsed
   0.09    0.01    0.10

Using MonetDB:

> system.time(svymean(~bpxsar+bpxdar,se=TRUE,subset(sqhanes,riagendr==2)))
   user  system elapsed
  0.020   0.001   0.108

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
