Recycling memory with a small free list

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Recycling memory with a small free list

Nathan Kurz
I'm trying to improve the performance of the update loop within a
logistic regression, and am struggling against the overhead of memory
allocation and garbage collection.   The main issue I'd like to solve
is with assignments inside of loops like this:

reweight = function(iter, w, Q) {
  for (i in 1:iter) {
    wT = w * Q
  }
}

If the matrix Q is large I can get a significant gain in performance
if I can reuse the same allocation for wT rather than making a new
allocation on each loop iteration.

While I can solve this problem in this case with an Rcpp function that
does the modifications in place, this seems like a case where a more
general solution might work well.   While normally we don't know
whether an allocation has gone out of scope, the left-hand-side of an
assignment is a frequent exception.  Instead of simply letting the LHS
become unreferenced, has the possibility of adding a small free-list
been considered?

That is, before the RHS is executed, the LHS allocation would be added
to a small fixed length list of available space which is checked
before future allocations.   If the same size is requested before the
next garbage collection, the allocation is short-circuited and the
allocation is reused.   This list could be very small, possibly even
only a single entry.  Entries would only be put on the list if they
have no other references.   If a garbage collection is triggered, the
list would be emptied and the contents collected.

While my understanding of R's memory strategy is still limited, it
looks like this would be a reasonably straightforward patch that could
touch only the assignment operator, the vector allocator, and the
collection routine.   Overall memory usage should go down, and in the
cases I'm considering, the performance improvement would be large.
Measurement details are here:
http://stackoverflow.com/questions/28532493/reusing-existing-memory-for-blas-operations-in-r

Are there downsides or difficulties to this approach that I'm missing?

Nathan Kurz
[hidden email]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Recycling memory with a small free list

Radford Neal
> ... with assignments inside of loops like this:
>
> reweight = function(iter, w, Q) {
>   for (i in 1:iter) {
>     wT = w * Q
>   }
> }
> ... before the RHS is executed, the LHS allocation would be added
> to a small fixed length list of available space which is checked
> before future allocations.   If the same size is requested before the
> next garbage collection, the allocation is short-circuited and the
> allocation is reused.   This list could be very small, possibly even
> only a single entry.  Entries would only be put on the list if they
> have no other references.

Reusing the LHS storage immediately isn't possible in general, because
evaluation of the RHS might produce an error, in which case the LHS
variable is supposed to be unchanged.  Detecting special cases where
there is guaranteed to be no error, or at least no error after the
first modification to newly allocated memory, might be too
complicated.  

Putting the LHS storage on a small free list for later reuse (only
after the old value of the variable will definitely be replaced) seems
more promising (then one would need only two copies for examples such
as above, with them being used in alternate iterations).  However,
there's a danger of getting carried away and essentially rewriting
malloc.  To avoid this, one might try just calling "free" on the
no-longer-needed object, letting "malloc" then figure out when it can
be re-used.  Unfortunately, that seems not to be safe, because it's
posslble that there is a reference to the no-longer-needed object on
the PROTECT stack, even though no one should actually be looking at
it any more.

In the current version of pqR (see pqR-project.org), modifications are
(often) done in place for statements such as w = w * Q, but not
curretly when the LHS variable does not appear on the RHS.

Regards,

    Radford Neal

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Recycling memory with a small free list

Nathan Kurz
On Wed, Feb 18, 2015 at 7:19 AM, Radford Neal <[hidden email]> wrote:

>> ... with assignments inside of loops like this:
>>
>> reweight = function(iter, w, Q) {
>>   for (i in 1:iter) {
>>     wT = w * Q
>>   }
>> }
>> ... before the RHS is executed, the LHS allocation would be added
>> to a small fixed length list of available space which is checked
>> before future allocations.   If the same size is requested before the
>> next garbage collection, the allocation is short-circuited and the
>> allocation is reused.   This list could be very small, possibly even
>> only a single entry.  Entries would only be put on the list if they
>> have no other references.

Here's an article about the benefits of this approach in Go that might
explain better than I was able:
https://blog.cloudflare.com/recycling-memory-buffers-in-go/
Their charts explain the goal very clearly: stabilize at a smaller
amount of memory to reduce churn, which improves performance in a
myriad of ways.

> Reusing the LHS storage immediately isn't possible in general, because
> evaluation of the RHS might produce an error, in which case the LHS
> variable is supposed to be unchanged.

What's the guarantee R actually makes?  What's an example of the use
case where this behaviour would be required? More generally, can one
not assume "a = NULL; a = func()" is equivalent to "a = func()" unless
func() references 'a' or has it as an argument?  Or is the difficulty
that there is no way to know in advance it if will be referenced?

> Detecting special cases where
> there is guaranteed to be no error, or at least no error after the
> first modification to newly allocated memory, might be too
> complicated.

Yes, if required, the complexity of guaranteeing this might  well rule
out the approach I suggested.

> Putting the LHS storage on a small free list for later reuse (only
> after the old value of the variable will definitely be replaced) seems
> more promising (then one would need only two copies for examples such
> as above, with them being used in alternate iterations).

OK, let's consider that potentially easier option instead:  do nothing
immediately, but add a small queue for recycling from which the
temporary might be drawn.   It has slightly worse cache behavior, but
should handle most of the issues with memory churn.

> However,
> there's a danger of getting carried away and essentially rewriting
> malloc.  To avoid this, one might try just calling "free" on the
> no-longer-needed object, letting "malloc" then figure out when it can
> be re-used.

Yes, I think that's what I was anticipating:  add a free() equivalent
that does nothing if the object has multiple references/names, but
adds the object to small fixed size "free list" if it does not.
Perhaps this is only for certain types or for objects above a certain
size.

When requesting memory, allocvector() or perhaps R_alloc() does a
quick check of that "free list" to see if it has anything of the exact
requested size.  If it does, it short circuits and recycles it.  If it
doesn't, normal allocation takes place.

The "free list" is stored as two small fixed size arrays containing
size/address pairs.   Searching is done linearly using code that
optimizes to SIMD comparisons.   For 4/8/16 slots overhead of the
search should be unmeasurably fast.

The key to the approach would be keeping it simple, and realizing that
the goal is only to get the lowest hanging fruit:  repeated
assignments of large arrays used in a loop.  If it's complex, skip it
--- the behavior will be no worse than current.

By the way, what's happening with Luke's refcnt patches?  From the
outside, they seem like a great improvement.
http://homepage.stat.uiowa.edu/~luke/talks/dsc2014.pdf
http://developer.r-project.org/Refcnt.html
Are they slated to become the standard approach?  Are they going to be dropped?
 Will both approaches be kept in parallel?

> Unfortunately, that seems not to be safe, because it's
> possible that there is a reference to the no-longer-needed object on
> the PROTECT stack, even though no one should actually be looking at
> it any more.

Can you explain this case?   I don't think I understand it.

> In the current version of pqR (see pqR-project.org), modifications are
> (often) done in place for statements such as w = w * Q, but not
> curretly when the LHS variable does not appear on the RHS.

Yes, I looked at it earlier, and was excited to see that Luke had
ported half of your approach to standard R:
https://github.com/wch/r-source/blob/trunk/src/main/arithmetic.h#L65

But only the RHS temporary variables optimizations made it over. Your
LHS "w = w * Q" optimizations did not, but I didn't see any discussion
of why.   Was
it attempted and issues were found?

I think what I'm suggesting is complementary to that.   Direct reuse
is best if it can be detected, but recycling will provide more
opportunities for optimization.  Of course, what I'm suggesting is
always quite obvious, and I presume it's part what he includes in the
slide in his talk that mentions "Explore releasing memory when
reference count drops to zero".

--nate

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Recycling memory with a small free list

Radford Neal
Radford Neal:
> > there's a danger of getting carried away and essentially rewriting
> > malloc.  To avoid this, one might try just calling "free" on the
> > no-longer-needed object, letting "malloc" then figure out when it can
> > be re-used.

Nathan Kurz:
> Yes, I think that's what I was anticipating:  add a free() equivalent...

Radford Neal:
> > Unfortunately, that seems not to be safe, because it's
> > possible that there is a reference to the no-longer-needed object on
> > the PROTECT stack, even though no one should actually be looking at
> > it any more.

Nathan Kurz:
> Can you explain this case?   I don't think I understand it.


My comment about it not being safe was referring to actually calling
the "free" function in the standard C library, not a "free equivalent".  
Except for small objects, R calls "free" when it concludes that it no
longer needs an object, having allocated space for it earlier with
"malloc".  After "free" is called, R had better not do anything like try
to mark it in the garbage collection phase...

Keeping a free list apart from the one maintained by malloc/free would
I think have to be how it is done, hence my comment about ending up
rewriting malloc/free.  But it may not be too hard to restrain oneself
and only do the simplest things.

   Radford Neal

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Recycling memory with a small free list

Tierney, Luke
In reply to this post by Nathan Kurz
On Wed, 18 Feb 2015, Nathan Kurz wrote:

> On Wed, Feb 18, 2015 at 7:19 AM, Radford Neal <[hidden email]> wrote:
>>> ... with assignments inside of loops like this:
>>>
>>> reweight = function(iter, w, Q) {
>>>   for (i in 1:iter) {
>>>     wT = w * Q
>>>   }
>>> }
>>> ... before the RHS is executed, the LHS allocation would be added
>>> to a small fixed length list of available space which is checked
>>> before future allocations.   If the same size is requested before the
>>> next garbage collection, the allocation is short-circuited and the
>>> allocation is reused.   This list could be very small, possibly even
>>> only a single entry.  Entries would only be put on the list if they
>>> have no other references.
>
> Here's an article about the benefits of this approach in Go that might
> explain better than I was able:
> https://blog.cloudflare.com/recycling-memory-buffers-in-go/
> Their charts explain the goal very clearly: stabilize at a smaller
> amount of memory to reduce churn, which improves performance in a
> myriad of ways.

Thanks -- will have a look.

>> Reusing the LHS storage immediately isn't possible in general, because
>> evaluation of the RHS might produce an error, in which case the LHS
>> variable is supposed to be unchanged.
>
> What's the guarantee R actually makes?  What's an example of the use
> case where this behaviour would be required? More generally, can one
> not assume "a = NULL; a = func()" is equivalent to "a = func()" unless
> func() references 'a' or has it as an argument?  Or is the difficulty
> that there is no way to know in advance it if will be referenced?
>
>> Detecting special cases where
>> there is guaranteed to be no error, or at least no error after the
>> first modification to newly allocated memory, might be too
>> complicated.
>
> Yes, if required, the complexity of guaranteeing this might  well rule
> out the approach I suggested.
>
>> Putting the LHS storage on a small free list for later reuse (only
>> after the old value of the variable will definitely be replaced) seems
>> more promising (then one would need only two copies for examples such
>> as above, with them being used in alternate iterations).
>
> OK, let's consider that potentially easier option instead:  do nothing
> immediately, but add a small queue for recycling from which the
> temporary might be drawn.   It has slightly worse cache behavior, but
> should handle most of the issues with memory churn.
>
>> However,
>> there's a danger of getting carried away and essentially rewriting
>> malloc.  To avoid this, one might try just calling "free" on the
>> no-longer-needed object, letting "malloc" then figure out when it can
>> be re-used.
>
> Yes, I think that's what I was anticipating:  add a free() equivalent
> that does nothing if the object has multiple references/names, but
> adds the object to small fixed size "free list" if it does not.
> Perhaps this is only for certain types or for objects above a certain
> size.
>
> When requesting memory, allocvector() or perhaps R_alloc() does a
> quick check of that "free list" to see if it has anything of the exact
> requested size.  If it does, it short circuits and recycles it.  If it
> doesn't, normal allocation takes place.
>
> The "free list" is stored as two small fixed size arrays containing
> size/address pairs.   Searching is done linearly using code that
> optimizes to SIMD comparisons.   For 4/8/16 slots overhead of the
> search should be unmeasurably fast.
>
> The key to the approach would be keeping it simple, and realizing that
> the goal is only to get the lowest hanging fruit:  repeated
> assignments of large arrays used in a loop.  If it's complex, skip it
> --- the behavior will be no worse than current.
>
> By the way, what's happening with Luke's refcnt patches?  From the
> outside, they seem like a great improvement.
> http://homepage.stat.uiowa.edu/~luke/talks/dsc2014.pdf
> http://developer.r-project.org/Refcnt.html
> Are they slated to become the standard approach?  Are they going to be dropped?
> Will both approaches be kept in parallel?

The approach can be enabled in R-devel by defining a preprocessor
variable.  It's about 90% of where it needs to be to become the
default. I had to put work on hold for a while but will be getting
back to it soon. It's too late to turn on for 3.2.0 due in April, but
I'm hopeful of switching to reference counting in R-devel by August or
so.

>
>> Unfortunately, that seems not to be safe, because it's
>> possible that there is a reference to the no-longer-needed object on
>> the PROTECT stack, even though no one should actually be looking at
>> it any more.
>
> Can you explain this case?   I don't think I understand it.
>
>> In the current version of pqR (see pqR-project.org), modifications are
>> (often) done in place for statements such as w = w * Q, but not
>> curretly when the LHS variable does not appear on the RHS.
>
> Yes, I looked at it earlier, and was excited to see that Luke had
> ported half of your approach to standard R:
> https://github.com/wch/r-source/blob/trunk/src/main/arithmetic.h#L65
>
> But only the RHS temporary variables optimizations made it over. Your
> LHS "w = w * Q" optimizations did not, but I didn't see any discussion
> of why.   Was
> it attempted and issues were found?
>
> I think what I'm suggesting is complementary to that.   Direct reuse
> is best if it can be detected, but recycling will provide more
> opportunities for optimization.  Of course, what I'm suggesting is
> always quite obvious, and I presume it's part what he includes in the
> slide in his talk that mentions "Explore releasing memory when
> reference count drops to zero".

This is part of the missing 10% of things I 'd like to explore before
going live. Releasing large (malloc'ed) objects with reference counts
that hit zero back to the malloc system is probably not to hard to get
right. Holding onto these objects in a free list might be worth
looking into, but as Radford suggests a good malloc may be good enough
at doing that already.

Best,

luke

>
> --nate
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Recycling memory with a small free list

R devel mailing list
If you link to tcmalloc instead of the default malloc on your system, the
performance of large allocations should improve.  On unix machines you
don't even need to recompile -- you can do this with LD_PRELOAD.  The
downside is that you'll almost certainly end up with higher average memory
usage.as tcmalloc never returns memory to the OS.

It would also be worth checking what jemalloc does with large allocations.


It may well be worth tweaking the way that large allocations are handled in
R -- most allocation libraries assume that large allocations are infrequent
and that you won't be frequently requesting the same sized memory block.
Those assumptions don't hold in R.  On the other hand, I don't see much
benefit to R having it's own logic for handling small allocations, as most
malloc implementations handle those extremely efficiently.

Karl

On Thu, Feb 19, 2015 at 10:15 AM, <[hidden email]> wrote:

> On Wed, 18 Feb 2015, Nathan Kurz wrote:
>
>  On Wed, Feb 18, 2015 at 7:19 AM, Radford Neal <[hidden email]>
>> wrote:
>>
>>> ... with assignments inside of loops like this:
>>>>
>>>> reweight = function(iter, w, Q) {
>>>>   for (i in 1:iter) {
>>>>     wT = w * Q
>>>>   }
>>>> }
>>>> ... before the RHS is executed, the LHS allocation would be added
>>>> to a small fixed length list of available space which is checked
>>>> before future allocations.   If the same size is requested before the
>>>> next garbage collection, the allocation is short-circuited and the
>>>> allocation is reused.   This list could be very small, possibly even
>>>> only a single entry.  Entries would only be put on the list if they
>>>> have no other references.
>>>>
>>>
>> Here's an article about the benefits of this approach in Go that might
>> explain better than I was able:
>> https://blog.cloudflare.com/recycling-memory-buffers-in-go/
>> Their charts explain the goal very clearly: stabilize at a smaller
>> amount of memory to reduce churn, which improves performance in a
>> myriad of ways.
>>
>
> Thanks -- will have a look.
>
>
>  Reusing the LHS storage immediately isn't possible in general, because
>>> evaluation of the RHS might produce an error, in which case the LHS
>>> variable is supposed to be unchanged.
>>>
>>
>> What's the guarantee R actually makes?  What's an example of the use
>> case where this behaviour would be required? More generally, can one
>> not assume "a = NULL; a = func()" is equivalent to "a = func()" unless
>> func() references 'a' or has it as an argument?  Or is the difficulty
>> that there is no way to know in advance it if will be referenced?
>>
>>  Detecting special cases where
>>> there is guaranteed to be no error, or at least no error after the
>>> first modification to newly allocated memory, might be too
>>> complicated.
>>>
>>
>> Yes, if required, the complexity of guaranteeing this might  well rule
>> out the approach I suggested.
>>
>>  Putting the LHS storage on a small free list for later reuse (only
>>> after the old value of the variable will definitely be replaced) seems
>>> more promising (then one would need only two copies for examples such
>>> as above, with them being used in alternate iterations).
>>>
>>
>> OK, let's consider that potentially easier option instead:  do nothing
>> immediately, but add a small queue for recycling from which the
>> temporary might be drawn.   It has slightly worse cache behavior, but
>> should handle most of the issues with memory churn.
>>
>>  However,
>>> there's a danger of getting carried away and essentially rewriting
>>> malloc.  To avoid this, one might try just calling "free" on the
>>> no-longer-needed object, letting "malloc" then figure out when it can
>>> be re-used.
>>>
>>
>> Yes, I think that's what I was anticipating:  add a free() equivalent
>> that does nothing if the object has multiple references/names, but
>> adds the object to small fixed size "free list" if it does not.
>> Perhaps this is only for certain types or for objects above a certain
>> size.
>>
>> When requesting memory, allocvector() or perhaps R_alloc() does a
>> quick check of that "free list" to see if it has anything of the exact
>> requested size.  If it does, it short circuits and recycles it.  If it
>> doesn't, normal allocation takes place.
>>
>> The "free list" is stored as two small fixed size arrays containing
>> size/address pairs.   Searching is done linearly using code that
>> optimizes to SIMD comparisons.   For 4/8/16 slots overhead of the
>> search should be unmeasurably fast.
>>
>> The key to the approach would be keeping it simple, and realizing that
>> the goal is only to get the lowest hanging fruit:  repeated
>> assignments of large arrays used in a loop.  If it's complex, skip it
>> --- the behavior will be no worse than current.
>>
>> By the way, what's happening with Luke's refcnt patches?  From the
>> outside, they seem like a great improvement.
>> http://homepage.stat.uiowa.edu/~luke/talks/dsc2014.pdf
>> http://developer.r-project.org/Refcnt.html
>> Are they slated to become the standard approach?  Are they going to be
>> dropped?
>> Will both approaches be kept in parallel?
>>
>
> The approach can be enabled in R-devel by defining a preprocessor
> variable.  It's about 90% of where it needs to be to become the
> default. I had to put work on hold for a while but will be getting
> back to it soon. It's too late to turn on for 3.2.0 due in April, but
> I'm hopeful of switching to reference counting in R-devel by August or
> so.
>
>
>>  Unfortunately, that seems not to be safe, because it's
>>> possible that there is a reference to the no-longer-needed object on
>>> the PROTECT stack, even though no one should actually be looking at
>>> it any more.
>>>
>>
>> Can you explain this case?   I don't think I understand it.
>>
>>  In the current version of pqR (see pqR-project.org), modifications are
>>> (often) done in place for statements such as w = w * Q, but not
>>> curretly when the LHS variable does not appear on the RHS.
>>>
>>
>> Yes, I looked at it earlier, and was excited to see that Luke had
>> ported half of your approach to standard R:
>> https://github.com/wch/r-source/blob/trunk/src/main/arithmetic.h#L65
>>
>> But only the RHS temporary variables optimizations made it over. Your
>> LHS "w = w * Q" optimizations did not, but I didn't see any discussion
>> of why.   Was
>> it attempted and issues were found?
>>
>> I think what I'm suggesting is complementary to that.   Direct reuse
>> is best if it can be detected, but recycling will provide more
>> opportunities for optimization.  Of course, what I'm suggesting is
>> always quite obvious, and I presume it's part what he includes in the
>> slide in his talk that mentions "Explore releasing memory when
>> reference count drops to zero".
>>
>
> This is part of the missing 10% of things I 'd like to explore before
> going live. Releasing large (malloc'ed) objects with reference counts
> that hit zero back to the malloc system is probably not to hard to get
> right. Holding onto these objects in a free list might be worth
> looking into, but as Radford suggests a good malloc may be good enough
> at doing that already.
>
> Best,
>
> luke
>
>
>> --nate
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:   [hidden email]
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel