summary( prcomp(*, tol = .) ) -- and 'rank.'


Following from the R-help thread of March 22 on "Memory usage in prcomp",

I've started looking into adding an optional   'rank.'  argument
to prcomp  allowing to more efficiently get only a few PCs
instead of the full p PCs, say when p = 1000 and you know you
only want 5 PCs.

  (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)

As it was mentioned, we already have an optional 'tol' argument
which allows *not* to choose all PCs.

When I do that,
say

     C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov.matrix and its root
     all.equal(S, crossprod(C))
     set.seed(17)
     X <- matrix(rnorm(32000), 1000, 32)
     Z <- X %*% C  ## ==>  cov(Z) ~=  C'C = S
     all.equal(cov(Z), S, tol = 0.08)
     pZ <- prcomp(Z, tol = 0.1)
     summary(pZ) # only ~14 PCs (out of 32)

I get for the last line, the   summary.prcomp(.)  call :

> summary(pZ) # only ~14 PCs (out of 32)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
                           PC9    PC10    PC11    PC12    PC13   PC14
Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
>

which computes the *proportions* as if there were only 14 PCs in
total (but there were 32 originally).

I would think that the summary should  or could in addition show
the usual  "proportion of variance explained"  like result which
does involve all 32  variances or std.dev.s ... which are
returned from the svd() anyway, even in the case when I use my
new 'rank.' argument which only returns a "few" PCs instead of
all.

Would you think the current  summary() output is good enough or
rather misleading?

I think I would want to see (possibly in addition) proportions
with respect to the full variance and not just to the variance
of those few components selected.

Opinions?

Martin Maechler
ETH Zurich

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
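
To make the difference between the two kinds of proportions concrete, here is a minimal sketch (not part of the original post) that reuses the example data above and computes both: proportions relative to the variance of the retained components only, and proportions relative to the total variance of all 32 variables. It assumes the default prcomp() settings (centering, no scaling), so that the total variance equals the sum of the column variances of Z. The names var_kept, total_var, prop_kept and prop_full are made up for illustration.

```r
## Same simulated data as in the post
S <- toeplitz(.9 ^ (0:31))
C <- chol(S)
set.seed(17)
X <- matrix(rnorm(32000), 1000, 32)
Z <- X %*% C

pZ <- prcomp(Z, tol = 0.1)
k  <- ncol(pZ$rotation)              # number of PCs actually retained

## Variances of the retained PCs; indexing by seq_len(k) works whether
## or not 'sdev' itself has been truncated to the retained components
var_kept <- pZ$sdev[seq_len(k)]^2

## Total variance of the centered data = sum of *all* 32 PC variances
total_var <- sum(apply(Z, 2, var))

prop_kept <- var_kept / sum(var_kept)  # relative to retained PCs only
prop_full <- var_kept / total_var      # relative to the full variance

round(rbind("Prop. (retained only)"  = prop_kept,
            "Prop. (full variance)"  = prop_full,
            "Cumul. (full variance)" = cumsum(prop_full)), 4)
```

With proportions taken against the full variance, the cumulative row stays below 1 whenever components have been dropped, which is the information the truncated summary hides.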

Re: summary( prcomp(*, tol = .) ) -- and 'rank.'

Martin,

I fully agree.  This becomes an issue when you have big matrices.

(Note that there are awesome methods for actually only computing a small
number of PCs (unlike your code which uses svd which gets all of them);
these are available in various CRAN packages).

Best,
Kasper

On Thu, Mar 24, 2016 at 1:09 PM, Martin Maechler <[hidden email]> wrote:

> Following from the R-help thread of March 22 on "Memory usage in prcomp",
>
> I've started looking into adding an optional   'rank.'  argument
> to prcomp  allowing to more efficiently get only a few PCs
> instead of the full p PCs, say when p = 1000 and you know you
> only want 5 PCs.
>
>   (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)
>
> As it was mentioned, we already have an optional 'tol' argument
> which allows *not* to choose all PCs.
>
> When I do that,
> say
>
>      C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov.matrix and its root
>      all.equal(S, crossprod(C))
>      set.seed(17)
>      X <- matrix(rnorm(32000), 1000, 32)
>      Z <- X %*% C  ## ==>  cov(Z) ~=  C'C = S
>      all.equal(cov(Z), S, tol = 0.08)
>      pZ <- prcomp(Z, tol = 0.1)
>      summary(pZ) # only ~14 PCs (out of 32)
>
> I get for the last line, the   summary.prcomp(.)  call :
>
> > summary(pZ) # only ~14 PCs (out of 32)
> Importance of components:
>                           PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
> Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
> Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
> Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
>                            PC9    PC10    PC11    PC12    PC13   PC14
> Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
> Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
> Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
> >
>
> which computes the *proportions* as if there were only 14 PCs in
> total (but there were 32 originally).
>
> I would think that the summary should  or could in addition show
> the usual  "proportion of variance explained"  like result which
> does involve all 32  variances or std.dev.s ... which are
> returned from the svd() anyway, even in the case when I use my
> new 'rank.' argument which only returns a "few" PCs instead of
> all.
>
> Would you think the current  summary() output is good enough or
> rather misleading?
>
> I think I would want to see (possibly in addition) proportions
> with respect to the full variance and not just to the variance
> of those few components selected.
>
> Opinions?
>
> Martin Maechler
> ETH Zurich
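
As a rough illustration of what "only computing a small number of PCs" can look like, here is a minimal base-R sketch of orthogonal (subspace) iteration. This is a generic textbook technique chosen for illustration; it is not claimed to be what any particular CRAN package does, and the function name top_pcs and its iters argument are hypothetical. It assumes the leading eigenvalues are reasonably well separated, and it never forms the full decomposition: each step only multiplies the centered data by a p x k block.

```r
## Hypothetical helper: leading k principal components via orthogonal
## (subspace) iteration on the centered data matrix
top_pcs <- function(x, k, iters = 200) {
  xc <- scale(x, center = TRUE, scale = FALSE)
  p  <- ncol(xc)
  ## random orthonormal p x k starting block
  Q <- qr.Q(qr(matrix(rnorm(p * k), p, k)))
  for (i in seq_len(iters)) {
    ## one step: apply t(xc) %*% xc to Q, then re-orthonormalize
    Q <- qr.Q(qr(crossprod(xc, xc %*% Q)))
  }
  ## standard deviations of the projections onto the k directions
  sdev <- sqrt(colSums((xc %*% Q)^2) / (nrow(xc) - 1))
  list(sdev = sdev, rotation = Q)
}

## Usage on the example data from the thread
C <- chol(toeplitz(.9 ^ (0:31)))
set.seed(17)
X <- matrix(rnorm(32000), 1000, 32)
Z <- X %*% C

fit  <- top_pcs(Z, k = 3)
full <- prcomp(Z)$sdev[1:3]          # reference values from the full SVD
all.equal(fit$sdev, full, tolerance = 1e-6)
```

Note that a truncated method of this kind returns only k variances, so proportions relative to the full variance would have to use the total variance computed separately, for example as sum(apply(Z, 2, var)).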