cbind() & rbind() for S4 objects -- 'Matrix' package changes

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

cbind() & rbind() for S4 objects -- 'Matrix' package changes

Martin Maechler
As some of you may have seen / heard in the past,
it is not possible to make cbind() and rbind() into proper S4
generic functions, since their first formal argument is '...'.
[ BTW: S3-methods for these of course only dispatch on the first
  argument which is also not really satisfactory in the context
  of many possible matrix classes.]

For this reason, after quite some discussion on R-core (and
maybe a bit on R-devel) about the options, since R-2.2.0 we have
had S4 generic functions cbind2() and rbind2() (and default methods)
in R's "methods" which are a version of cbind() and
rbind() respectively for two arguments (x,y)
 {and fixed 'deparse.level = 0' : the argument names are 'x' and 'y' and
  hence don't make sense to be used to construct column-names or
  row-names for rbind(), respectively.}

We have been defining methods for cbind2() and rbind2()
for the 'Matrix' classes in late summer 2005 as well.  So far so
good.

In addition, [see also  help(cbind2) ],
we have defined cbind() and rbind() functions which recursively call
cbind2() and rbind2(), more or less following John Chambers
proposal of dealing with such "(...)" argument functions.
These new recursively defined cbind() / rbind() functions
however have typically remained invisible in the methods package
[you can see them via  methods:::cbind  or  methods:::rbind ]
and have been ``activated'' --- replacing  base::cbind / rbind ---
only via an explicit or implicit call to
     methods:::bind_activation(TRUE)

One reason I didn't dare to make them the default was that I
noticed they didn't behave identically to cbind() / rbind() in
all cases, though IIRC the rare difference was only in the dimnames
returned; further, being entirely written in R, and recursive,
they were slower than the mostly C-based fast  cbind() / rbind()
functions.

As some Bioconductor developers have recently found,
these versions of cbind() and rbind() that have been
automagically activated by loading the  Matrix package
can have a detrimental effect in some extreme cases,
e.g. when using
     do.call(cbind, list_of_length_1000)
because of the recursion and the many many calls to the S4
generic, each time searching for method dispatch ...
For the bioconductor applications and potentially for others using cbind() /
rbind() extensively, this can lead to unacceptable performance
loss just because loading 'Matrix' currently calls
     methods:::bind_activation(TRUE)

For this reason, we plan to refrain from doing this activation
on loading of Matrix, but propose to

1)  define and export
            cBind <- methods:::cbind
            rBind <- methods:::cbind

    also do this for R-2.5.0 so that other useRs / packages
    can start cBind() / rBind() in their code when they want to
    have something that can become properly object-oriented

Possibly --- and this is the big  RFC (request for comments) ---

2) __ for 'Matrix' only __ also
   define and export
            cbind <- methods:::cbind
            rbind <- methods:::cbind

I currently see the possibilities of doing
 either '1)'
 or     '1) and 2)'
 or less likely  '2) alone'

and like to get your feedback on this.

"1)" alone would have the considerable drawback for current
  Matrix useRs that their code / scripts which has been using
  cbind() and rbind() for "Matrix" (and "matrix" and "numeric")
  objects no longer works, but needs to be changed to use
        rBind() and cBind()  *instead*

As soon as "2)" is done (in conjunction with "1)" or not),
those who need a very fast but non-OO version of cbind() / rbind()
need to call  base::cbind() or  base::rbind()  explicitly.
This however would not be necessary for packages with a NAMESPACE
since these import 'base' automagically and hence would use
base::cbind() automagically {unless they also import(Matrix)}.

We are quite interested in your feedback!

Martin Maechler and Doug Bates <[hidden email]>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: cbind() & rbind() for S4 objects -- 'Matrix' package changes

Luke Tierney
On Tue, 20 Mar 2007, Martin Maechler wrote:

> As some of you may have seen / heard in the past,
> it is not possible to make cbind() and rbind() into proper S4
> generic functions, since their first formal argument is '...'.
> [ BTW: S3-methods for these of course only dispatch on the first
>  argument which is also not really satisfactory in the context
>  of many possible matrix classes.]
>
> For this reason, after quite some discussion on R-core (and
> maybe a bit on R-devel) about the options, since R-2.2.0 we have
> had S4 generic functions cbind2() and rbind2() (and default methods)
> in R's "methods" which are a version of cbind() and
> rbind() respectively for two arguments (x,y)
> {and fixed 'deparse.level = 0' : the argument names are 'x' and 'y' and
>  hence don't make sense to be used to construct column-names or
>  row-names for rbind(), respectively.}
>
> We have been defining methods for cbind2() and rbind2()
> for the 'Matrix' classes in late summer 2005 as well.  So far so
> good.
>
> In addition, [see also  help(cbind2) ],
> we have defined cbind() and rbind() functions which recursively call
> cbind2() and rbind2(), more or less following John Chambers
> proposal of dealing with such "(...)" argument functions.
> These new recursively defined cbind() / rbind() functions
> however have typically remained invisible in the methods package
> [you can see them via  methods:::cbind  or  methods:::rbind ]
> and have been ``activated'' --- replacing  base::cbind / rbind ---
> only via an explicit or implicit call to
>     methods:::bind_activation(TRUE)
>
> One reason I didn't dare to make them the default was that I
> noticed they didn't behave identically to cbind() / rbind() in
> all cases, though IIRC the rare difference was only in the dimnames
> returned; further, being entirely written in R, and recursive,
> they were slower than the mostly C-based fast  cbind() / rbind()
> functions.
>
> As some Bioconductor developers have recently found,
> these versions of cbind() and rbind() that have been
> automagically activated by loading the  Matrix package
> can have a detrimental effect in some extreme cases,
> e.g. when using
>     do.call(cbind, list_of_length_1000)
> because of the recursion and the many many calls to the S4
> generic, each time searching for method dispatch ...
> For the bioconductor applications and potentially for others using cbind() /
> rbind() extensively, this can lead to unacceptable performance
> loss just because loading 'Matrix' currently calls
>     methods:::bind_activation(TRUE)

The recursion part is potentially problematic because stack space
limitations will cause this to fail for "relatively" short
list_of_length_1000, but that should be easily curable by rewriting
methods:::cbind and methods:::rbind to use iteration rather than
recursion. This might also help a little with efficiency by avoiding
call overhead.  It would be interesting to know how much of the
performance hit is dispatch overhead and how much closure call
overhead.  If it's dispatch overhead then it may be worth figuring out
some way of handling this with internal dispatch at the C level (at
the cost of maintaining the C level stuff).

My initial reaction to scanning the methods:::cbind code is that it is
doing too much, at least too much R-level work, but I haven't thought
it through carefully.

> For this reason, we plan to refrain from doing this activation
> on loading of Matrix, but propose to
>
> 1)  define and export
>    cBind <- methods:::cbind
>    rBind <- methods:::cbind
>
>    also do this for R-2.5.0 so that other useRs / packages
>    can start cBind() / rBind() in their code when they want to
>    have something that can become properly object-oriented

In mackage methods?

> Possibly --- and this is the big  RFC (request for comments) ---
>
> 2) __ for 'Matrix' only __ also
>   define and export
>    cbind <- methods:::cbind
>    rbind <- methods:::cbind
>
> I currently see the possibilities of doing
> either '1)'
> or     '1) and 2)'
> or less likely  '2) alone'
>
> and like to get your feedback on this.
>
> "1)" alone would have the considerable drawback for current
>  Matrix useRs that their code / scripts which has been using
>  cbind() and rbind() for "Matrix" (and "matrix" and "numeric")
>  objects no longer works, but needs to be changed to use
> rBind() and cBind()  *instead*
>
> As soon as "2)" is done (in conjunction with "1)" or not),
> those who need a very fast but non-OO version of cbind() / rbind()
> need to call  base::cbind() or  base::rbind()  explicitly.
> This however would not be necessary for packages with a NAMESPACE
> since these import 'base' automagically and hence would use
> base::cbind() automagically {unless they also import(Matrix)}.
>
> We are quite interested in your feedback!

Either one seems cleaner to me than having loading of one package
result in mucking about in the internals of another.

If we are thinking of these as long term solutions then I think having
different names is cleaner, so 1) but not 2).  If we are thinking of
this as a transition towards making base::cbind and base::rbind
support S4 dispatch via cbind2/rbind2 (assuming this can be done
efficiently) then there may be some merit to 2) to minimize the need
for code rewriting.

It might be worth experimenting with having .Internal(cbind(...))
check its arguments and call methods:::cbind if (Methods is loaded
and) any of the arguments are S4 -- as the S4 property is now cheap to
determine that may be very low cost especially if done after the
object bits have been checked with positive result.

Best,

luke

> Martin Maechler and Doug Bates <[hidden email]>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: cbind() & rbind() for S4 objects -- 'Matrix' package changes

Martin Maechler
Thank you, Luke, for your feedback,

   (more inline below)

>>>>> "Luke" == Luke Tierney <[hidden email]>
>>>>>     on Tue, 20 Mar 2007 20:27:18 -0500 (CDT) writes:

    Luke> On Tue, 20 Mar 2007, Martin Maechler wrote:

    >> As some of you may have seen / heard in the past,
    >> it is not possible to make cbind() and rbind() into proper S4
    >> generic functions, since their first formal argument is '...'.
    >> [ BTW: S3-methods for these of course only dispatch on the first
    >> argument which is also not really satisfactory in the context
    >> of many possible matrix classes.]
    >>
    >> For this reason, after quite some discussion on R-core (and
    >> maybe a bit on R-devel) about the options, since R-2.2.0 we have
    >> had S4 generic functions cbind2() and rbind2() (and default methods)
    >> in R's "methods" which are a version of cbind() and
    >> rbind() respectively for two arguments (x,y)
    >> {and fixed 'deparse.level = 0' : the argument names are 'x' and 'y' and
    >> hence don't make sense to be used to construct column-names or
    >> row-names for rbind(), respectively.}
    >>
    >> We have been defining methods for cbind2() and rbind2()
    >> for the 'Matrix' classes in late summer 2005 as well.  So far so
    >> good.
    >>
    >> In addition, [see also  help(cbind2) ],
    >> we have defined cbind() and rbind() functions which recursively call
    >> cbind2() and rbind2(), more or less following John Chambers
    >> proposal of dealing with such "(...)" argument functions.
    >> These new recursively defined cbind() / rbind() functions
    >> however have typically remained invisible in the methods package
    >> [you can see them via  methods:::cbind  or  methods:::rbind ]
    >> and have been ``activated'' --- replacing  base::cbind / rbind ---
    >> only via an explicit or implicit call to
    >> methods:::bind_activation(TRUE)
    >>
    >> One reason I didn't dare to make them the default was that I
    >> noticed they didn't behave identically to cbind() / rbind() in
    >> all cases, though IIRC the rare difference was only in the dimnames
    >> returned; further, being entirely written in R, and recursive,
    >> they were slower than the mostly C-based fast  cbind() / rbind()
    >> functions.
    >>
    >> As some Bioconductor developers have recently found,
    >> these versions of cbind() and rbind() that have been
    >> automagically activated by loading the  Matrix package
    >> can have a detrimental effect in some extreme cases,
    >> e.g. when using
    >> do.call(cbind, list_of_length_1000)
    >> because of the recursion and the many many calls to the S4
    >> generic, each time searching for method dispatch ...
    >> For the bioconductor applications and potentially for others using cbind() /
    >> rbind() extensively, this can lead to unacceptable performance
    >> loss just because loading 'Matrix' currently calls
    >> methods:::bind_activation(TRUE)

    Luke> The recursion part is potentially problematic because stack space
    Luke> limitations will cause this to fail for "relatively" short
    Luke> list_of_length_1000, but that should be easily curable by rewriting
    Luke> methods:::cbind and methods:::rbind to use iteration rather than
    Luke> recursion.

Yes, thank you, Luke,  something I should have at least tried
doing but didn't get to.

    Luke> This might also help a little with efficiency by avoiding
    Luke> call overhead.  It would be interesting to know how much of the
    Luke> performance hit is dispatch overhead and how much closure call
    Luke> overhead.  If it's dispatch overhead then it may be worth figuring out
    Luke> some way of handling this with internal dispatch at the C level (at
    Luke> the cost of maintaining the C level stuff).

    Luke> My initial reaction to scanning the methods:::cbind code is that it is
    Luke> doing too much, at least too much R-level work, but I haven't thought
    Luke> it through carefully.

I can well understand that reaction..
The code started up quite small and easily understandable ...
until I started trying to emulate the "standard"  cbind() /
rbind() behavior, notably about automatic colnames / rownames
creation. When delving into that the code got the current
partial messyness....

    >> For this reason, we plan to refrain from doing this activation
    >> on loading of Matrix, but propose to
    >>
    >> 1)  define and export
    >> cBind <- methods:::cbind
    >> rBind <- methods:::cbind
    >>
    >> also do this for R-2.5.0 so that other useRs / packages
    >> can start cBind() / rBind() in their code when they want to
    >> have something that can become properly object-oriented

    Luke> In mackage methods?

yes, inside methods, such that in fact there  cBind <- cbind
(plus namespace export) would be sufficient.

In the mean time I'm less sure if this is desirable;
but at least this would ``expose'' the code and have it used
and consequently hopefully made more efficient (and hopefully
even simplified).

    >> Possibly --- and this is the big  RFC (request for comments) ---
    >>
    >> 2) __ for 'Matrix' only __ also
    >> define and export
    >> cbind <- methods:::cbind
    >> rbind <- methods:::cbind
    >>
    >> I currently see the possibilities of doing
    >> either '1)'
    >> or     '1) and 2)'
    >> or less likely  '2) alone'

    >> and like to get your feedback on this.

    >> "1)" alone would have the considerable drawback for current
    >> Matrix useRs that their code / scripts which has been using
    >> cbind() and rbind() for "Matrix" (and "matrix" and "numeric")
    >> objects no longer works, but needs to be changed to use
    >> rBind() and cBind()  *instead*
    >>
    >> As soon as "2)" is done (in conjunction with "1)" or not),
    >> those who need a very fast but non-OO version of cbind() / rbind()
    >> need to call  base::cbind() or  base::rbind()  explicitly.
    >> This however would not be necessary for packages with a NAMESPACE
    >> since these import 'base' automagically and hence would use
    >> base::cbind() automagically {unless they also import(Matrix)}.
    >>
    >> We are quite interested in your feedback!

    Luke> Either one seems cleaner to me than having loading of one package
    Luke> result in mucking about in the internals of another.

I agree {{the reason I had chosen the unclean approach was the
          hope that methods:::cbind could be improved to soon replace
          base::cbind -- and so "Matrix" would just do something
          that would happen more universally in the future anyway}}

    Luke> If we are thinking of these as long term solutions then I think having
    Luke> different names is cleaner, so 1) but not 2).  If we are thinking of
    Luke> this as a transition towards making base::cbind and base::rbind
    Luke> support S4 dispatch via cbind2/rbind2 (assuming this can be done
    Luke> efficiently) then there may be some merit to 2) to minimize the need
    Luke> for code rewriting.

Yes, exactly.  
My originally intent was strongly in the direction of making
base::cbind support S4 via cbind2() and rbind2() -- and one will
continue to be able to experiment with that after calling
methods:::bind_activation(TRUE).
The problem I had underestimated is
    Luke> assuming this can be done efficiently

    Luke> It might be worth experimenting with having .Internal(cbind(...))
    Luke> check its arguments and call methods:::cbind if (Methods is loaded
    Luke> and) any of the arguments are S4 -- as the S4 property is now cheap to
    Luke> determine that may be very low cost especially if done after the
    Luke> object bits have been checked with positive result.

Very nice idea ... currently I don't have the time to do it, so,
unless another R-core member (or an avid R-devel reader) jumps
in, I don't see this possible for R 2.5.0 (and hence the 2.5.x
series).

Too bad that nobody else (on R-devel) seems interested enough.
In the mean time, I'm proposing to implement '2)'
which will give

>>    > library(Matrix)
>>    Loading required package: lattice
>>
>>    Attaching package: 'Matrix'
>>
>>
>>   The following object(s) are masked from package:base :
>>
>>    cbind,
>>    rbind

every time Matrix is loaded, but that seems appropriate to me.

The question remains if it wasn't worth to do '1)' as well
{not in Matrix, but the 'methods' package} such that people can
more easily get experience with such a  cbind2/rbind2 - based
version of cbind/rbind.

Martin


    Luke> Best,

    Luke> luke

    >> Martin Maechler and Doug Bates <[hidden email]>

    Luke> --
    Luke> Luke Tierney
    Luke> Chair, Statistics and Actuarial Science
    Luke> Ralph E. Wareham Professor of Mathematical Sciences
    Luke> University of Iowa                  Phone:             319-335-3386
    Luke> Department of Statistics and        Fax:               319-335-3017
    Luke> Actuarial Science
    Luke> 241 Schaeffer Hall                  email:      [hidden email]
    Luke> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel