[patch] Support many columns in model.matrix

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[patch] Support many columns in model.matrix

R devel mailing list
Generating a model matrix with very large numbers of columns overflows
the stack and/or runs very slowly, due to the implementation of
TrimRepeats().

This patch modifies it to use Rf_duplicated() to find the duplicates.
This makes the running time linear in the number of columns and
eliminates the recursive function calls.

Thanks

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

stats_model.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [patch] Support many columns in model.matrix

Martin Maechler
>>>>> Karl Millar via R-devel <[hidden email]>
>>>>>     on Fri, 26 Feb 2016 15:58:20 -0800 writes:

    > Generating a model matrix with very large numbers of
    > columns overflows the stack and/or runs very slowly, due
    > to the implementation of TrimRepeats().

    > This patch modifies it to use Rf_duplicated() to find the
    > duplicates.  This makes the running time linear in the
    > number of columns and eliminates the recursive function
    > calls.

Thank you, Karl.
I've committed this (very slightly modified) to R-devel,

(also after looking for a an example that runs on a non-huge
 computer and shows the difference) :

nF <- 11 ; set.seed(1)
lff <- setNames(replicate(nF, as.factor(rpois(128, 1/4)), simplify=FALSE), letters[1:nF])
str(dd <- as.data.frame(lff)); prod(sapply(dd, nlevels))
## 'data.frame': 128 obs. of  11 variables:
##  $ a: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 2 2 1 1 1 ...
##  $ b: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 2 1 1 1 ...
##  $ c: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 2 1 1 ...
##  $ d: Factor w/ 3 levels "0","1","2": 1 1 2 2 1 2 1 1 2 1 ...
##  $ e: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 2 1 ...
##  $ f: Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 2 1 2 ...
##  $ g: Factor w/ 4 levels "0","1","2","3": 2 1 1 2 1 3 1 1 1 1 ...
##  $ h: Factor w/ 4 levels "0","1","2","4": 1 1 1 1 2 1 1 1 1 1 ...
##  $ i: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ j: Factor w/ 3 levels "0","1","2": 1 2 3 1 1 1 1 1 1 1 ...
##  $ k: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##
## [1] 139968

system.time(mff <- model.matrix(~ . ^ 11, dd, contrasts = list(a = "contr.helmert")))
##  user  system elapsed
## 0.255   0.033   0.287  --- *with* the patch on my desktop (16 GB)
## 1.489   0.031   1.522  --- for R-patched (i.e. w/o the patch)

> dim(mff)
[1]    128 139968
> object.size(mff)
154791504 bytes

---

BTW: These example would gain tremendously if I finally got
     around to provide

   model.matrix(........, sparse = TRUE)

which would then produce a Matrix-package sparse matrix.

Even for this somewhat small case, a sparse matrix is a factor
of 13.5 x smaller :

> s1 <- object.size(mff); s2 <- object.size(M <- Matrix::Matrix(mff)); as.vector( s1/s2 )
[1] 13.47043

I'm happy to collaborate with you on adding such a (C level)
interface to sparse matrices for this case.

Martin Maechler

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: [patch] Support many columns in model.matrix

R devel mailing list
Thanks.

Couldn't you implement model.matrix(..., sparse = TRUE)  with a small
amount of R code similar to MatrixModels::model.Matrix ?

On Mon, Feb 29, 2016 at 10:01 AM, Martin Maechler
<[hidden email]> wrote:

>>>>>> Karl Millar via R-devel <[hidden email]>
>>>>>>     on Fri, 26 Feb 2016 15:58:20 -0800 writes:
>
>     > Generating a model matrix with very large numbers of
>     > columns overflows the stack and/or runs very slowly, due
>     > to the implementation of TrimRepeats().
>
>     > This patch modifies it to use Rf_duplicated() to find the
>     > duplicates.  This makes the running time linear in the
>     > number of columns and eliminates the recursive function
>     > calls.
>
> Thank you, Karl.
> I've committed this (very slightly modified) to R-devel,
>
> (also after looking for a an example that runs on a non-huge
>  computer and shows the difference) :
>
> nF <- 11 ; set.seed(1)
> lff <- setNames(replicate(nF, as.factor(rpois(128, 1/4)), simplify=FALSE), letters[1:nF])
> str(dd <- as.data.frame(lff)); prod(sapply(dd, nlevels))
> ## 'data.frame':        128 obs. of  11 variables:
> ##  $ a: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 2 2 1 1 1 ...
> ##  $ b: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 2 1 1 1 ...
> ##  $ c: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 2 1 1 ...
> ##  $ d: Factor w/ 3 levels "0","1","2": 1 1 2 2 1 2 1 1 2 1 ...
> ##  $ e: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 2 1 ...
> ##  $ f: Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 2 1 2 ...
> ##  $ g: Factor w/ 4 levels "0","1","2","3": 2 1 1 2 1 3 1 1 1 1 ...
> ##  $ h: Factor w/ 4 levels "0","1","2","4": 1 1 1 1 2 1 1 1 1 1 ...
> ##  $ i: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
> ##  $ j: Factor w/ 3 levels "0","1","2": 1 2 3 1 1 1 1 1 1 1 ...
> ##  $ k: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> ##
> ## [1] 139968
>
> system.time(mff <- model.matrix(~ . ^ 11, dd, contrasts = list(a = "contr.helmert")))
> ##  user  system elapsed
> ## 0.255   0.033   0.287  --- *with* the patch on my desktop (16 GB)
> ## 1.489   0.031   1.522  --- for R-patched (i.e. w/o the patch)
>
>> dim(mff)
> [1]    128 139968
>> object.size(mff)
> 154791504 bytes
>
> ---
>
> BTW: These example would gain tremendously if I finally got
>      around to provide
>
>    model.matrix(........, sparse = TRUE)
>
> which would then produce a Matrix-package sparse matrix.
>
> Even for this somewhat small case, a sparse matrix is a factor
> of 13.5 x smaller :
>
>> s1 <- object.size(mff); s2 <- object.size(M <- Matrix::Matrix(mff)); as.vector( s1/s2 )
> [1] 13.47043
>
> I'm happy to collaborate with you on adding such a (C level)
> interface to sparse matrices for this case.
>
> Martin Maechler

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: [patch] Support many columns in model.matrix

Martin Maechler
>>>>> Karl Millar <[hidden email]>
>>>>>     on Mon, 29 Feb 2016 10:22:51 -0800 writes:

    > Thanks.
    > Couldn't you implement model.matrix(..., sparse = TRUE)  with a small
    > amount of R code similar to MatrixModels::model.Matrix ?

yes, and basically call R level  Matrix::sparse.model.matrix()
[[ or even just mention the latter on the help page for
   model.matrix() ]].

Thank you, Karl

    > On Mon, Feb 29, 2016 at 10:01 AM, Martin Maechler
    > <[hidden email]> wrote:
    >>>>>>> Karl Millar via R-devel <[hidden email]>
    >>>>>>> on Fri, 26 Feb 2016 15:58:20 -0800 writes:
    >>
    >> > Generating a model matrix with very large numbers of
    >> > columns overflows the stack and/or runs very slowly, due
    >> > to the implementation of TrimRepeats().
    >>
    >> > This patch modifies it to use Rf_duplicated() to find the
    >> > duplicates.  This makes the running time linear in the
    >> > number of columns and eliminates the recursive function
    >> > calls.
    >>
    >> Thank you, Karl.
    >> I've committed this (very slightly modified) to R-devel,
    >>
    >> (also after looking for a an example that runs on a non-huge
    >> computer and shows the difference) :
    >>
    >> nF <- 11 ; set.seed(1)
    >> lff <- setNames(replicate(nF, as.factor(rpois(128, 1/4)), simplify=FALSE), letters[1:nF])
    >> str(dd <- as.data.frame(lff)); prod(sapply(dd, nlevels))
    >> ## 'data.frame':        128 obs. of  11 variables:
    >> ##  $ a: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 2 2 1 1 1 ...
    >> ##  $ b: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 2 1 1 1 ...
    >> ##  $ c: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 2 1 1 ...
    >> ##  $ d: Factor w/ 3 levels "0","1","2": 1 1 2 2 1 2 1 1 2 1 ...
    >> ##  $ e: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 2 1 ...
    >> ##  $ f: Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 2 1 2 ...
    >> ##  $ g: Factor w/ 4 levels "0","1","2","3": 2 1 1 2 1 3 1 1 1 1 ...
    >> ##  $ h: Factor w/ 4 levels "0","1","2","4": 1 1 1 1 2 1 1 1 1 1 ...
    >> ##  $ i: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
    >> ##  $ j: Factor w/ 3 levels "0","1","2": 1 2 3 1 1 1 1 1 1 1 ...
    >> ##  $ k: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
    >> ##
    >> ## [1] 139968
    >>
    >> system.time(mff <- model.matrix(~ . ^ 11, dd, contrasts = list(a = "contr.helmert")))
    >> ##  user  system elapsed
    >> ## 0.255   0.033   0.287  --- *with* the patch on my desktop (16 GB)
    >> ## 1.489   0.031   1.522  --- for R-patched (i.e. w/o the patch)
    >>
    >>> dim(mff)
    >> [1]    128 139968
    >>> object.size(mff)
    >> 154791504 bytes
    >>
    >> ---
    >>
    >> BTW: These example would gain tremendously if I finally got
    >> around to provide
    >>
    >> model.matrix(........, sparse = TRUE)
    >>
    >> which would then produce a Matrix-package sparse matrix.
    >>
    >> Even for this somewhat small case, a sparse matrix is a factor
    >> of 13.5 x smaller :
    >>
    >>> s1 <- object.size(mff); s2 <- object.size(M <- Matrix::Matrix(mff)); as.vector( s1/s2 )
    >> [1] 13.47043
    >>
    >> I'm happy to collaborate with you on adding such a (C level)
    >> interface to sparse matrices for this case.
    >>
    >> Martin Maechler

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel