maximum matrix size


R devel mailing list
I am now getting the occasional complaint about survival routines that are not able to
handle big data.   I looked in the manuals to try to update my understanding of the max
vector size, max matrix, max data set, etc., but it is either not there or I missed it (the
latter more likely).   Is it still .Machine$integer.max for everything?   Will that
change?   Found where?

I am going to need to go through the survival package and put specific checks in front of
some or all of my .Call() statements, in order to give a sensible message whenever a
boundary is struck.  A well-meaning person just posted a suggested "bug fix" to the github
source of one routine where my .C call allocates a scratch vector, suggesting  "resid =
double( as.double(n) *nvar)" to prevent an "NA produced by integer overflow" message,  in
the code below.   A fix is obviously not quite that easy :-)

         resid <- .C(Ccoxscore, as.integer(n),
                 as.integer(nvar),
                 as.double(y),
                 x = as.double(x),
                 as.integer(newstrat),
                 as.double(score),
                 as.double(weights[ord]),
                 as.integer(method == 'efron'),
                 resid = double(n * nvar),
                 double(2 * nvar))$resid

Terry T.



______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: maximum matrix size

Henrik Bengtsson
On Tue, Oct 2, 2018 at 9:43 AM Therneau, Terry M., Ph.D. via R-devel
<[hidden email]> wrote:
>
> [...] Is it still .Machine$integer.max for everything?   Will that
> change?   Found where?

FWIW, this is the reference I've decided to follow for matrixStats:

"* For now, keep 2^31-1 limit on matrix rows and columns."

from Slide 5 in Luke Tierney's 'Some new developments for the R
engine', June 24, 2012
(http://homepage.stat.uiowa.edu/~luke/talks/purdue12.pdf).

/Henrik



Re: maximum matrix size

Peter Langfelder
Does this help a little?

https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Long-vectors

One thing I seem to remember but cannot find a reference for is that
long vectors can only be passed to .Call, not to .C()/.Fortran(). I
remember rewriting the .C() calls in my WGCNA package as .Call for this
very reason, but perhaps the restriction has been removed.

Peter


Re: maximum matrix size

R devel mailing list
That is indeed helpful; reading the sections around it largely answered my questions.
Rinternals.h has the definitions

#define allocMatrix Rf_allocMatrix
SEXP Rf_allocMatrix(SEXPTYPE, int, int);
#define allocVector Rf_allocVector
SEXP Rf_allocVector(SEXPTYPE, R_xlen_t);

which answer the further question of what to expect inside C routines invoked by .Call.

It looks like the internal C routines for coxph work on large matrices by pure serendipity
(nrow and ncol each less than 2^31 but with the product > 2^31), but residuals.coxph
fails with an allocation error on the same data.  A slight change and it could just as
easily have led to a hard crash.    Sigh...   I'll need to do a complete code review.
I've been converting .C routines to .Call as convenient; this will force conversion of
many of the rest as a side effect (20 done, 23 to go).  As a statistician my overall
response is "haven't they ever heard of sampling?"  But as I said earlier, it isn't just
one user.

Terry T.

On 10/02/2018 12:22 PM, Peter Langfelder wrote:

> Does this help a little?
>
> https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Long-vectors
> [...]


