

Following up on the R-help thread of March 22 on "Memory usage in prcomp",
I've started looking into adding an optional 'rank.' argument
to prcomp, allowing one to more efficiently get only a few PCs
instead of the full p PCs, say when p = 1000 and you know you
only want 5 PCs.
( https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html )
As was mentioned there, we already have an optional 'tol' argument
which allows one *not* to choose all PCs.
When I do that,
say
C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov. matrix and its root
all.equal(S, crossprod(C))
set.seed(17)
X <- matrix(rnorm(32000), 1000, 32)
Z <- X %*% C  ## ==> cov(Z) ~= C'C = S
all.equal(cov(Z), S, tol = 0.08)
pZ <- prcomp(Z, tol = 0.1)
summary(pZ) # only ~14 PCs (out of 32)
For the last line, the summary.prcomp(.) call, I get:
> summary(pZ) # only ~14 PCs (out of 32)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
Cumulative Proportion 0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
Cumulative Proportion 0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
>
which computes the *proportions* as if there were only 14 PCs in
total (but there were 32 originally).
I would think that the summary should or could in addition show
the usual "proportion of variance explained"-like result, which
does involve all 32 variances or std.dev.s ... which are
returned from the svd() anyway, even in the case when I use my
new 'rank.' argument which only returns a "few" PCs instead of
all.
Would you think the current summary() output is good enough or
rather misleading?
I think I would want to see (possibly in addition) proportions
with respect to the full variance and not just to the variance
of those few components selected.
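For concreteness, a minimal version of what I mean, done "by hand"
on the example above -- relying only on the fact that the total
variance of Z equals the sum of all 32 PC variances, so it works
even when the returned 'sdev' has been truncated:

## proportions w.r.t. the *full* variance, not just the 14 kept PCs:
totVar <- sum(apply(Z, 2, var)) # = sum of all 32 eigenvalues of cov(Z)
propFull <- pZ$sdev^2 / totVar  # 14 proportions, summing to < 1
cumsum(propFull)                # no longer reaches 1 at PC14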
Opinions?
Martin Maechler
ETH Zurich


Martin, I fully agree. This becomes an issue when you have big matrices.
(Note that there are awesome methods for actually only computing a small
number of PCs (unlike your code, which uses svd and hence computes all of
them); these are available in various CRAN packages.)
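E.g., a minimal sketch with the 'irlba' package (just one of the CRAN
options I mean, and assuming Z from your example):

library(irlba)
Zc <- scale(Z, center = TRUE, scale = FALSE) # center, as prcomp() does
sv <- irlba(Zc, nv = 5)                      # partial SVD: 5 components only
sv$d / sqrt(nrow(Zc) - 1)                    # sdevs of the 5 leading PCs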
Best,
Kasper


I agree with Kasper, this is a 'big' issue. Does your method of taking only
the first n PCs reduce the load on memory?
The new addition to the summary looks like a good idea, but Proportion of
Variance as you describe it may be confusing to new users. Am I correct in
saying that Proportion of Variance describes the amount of variance with
respect to the number of components the user chooses to show? So if I only
choose one, I will explain 100% of the variance? I think showing 'Total
Proportion of Variance' is important if that is the case.
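A quick illustration of that worry, using Martin's Z from above -- if I
read the current 'tol' behaviour right (components are dropped when their
sdev is at most tol times the first sdev):

p1 <- prcomp(Z, tol = 0.9) # here only PC1 survives
summary(p1)                # reports PC1 as 100% of the variance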
Regards,
Steve Bronder
Website: stevebronder.com
Phone: 4127191282
Email: [hidden email]


As I see it, the display showing the first p << n PCs adding up to 100% of the variance is plainly wrong.
I suspect it comes about via a mental short-circuit: If we try to control p using a tolerance, then that amounts to saying that the remaining PCs are effectively zero-variance, but that is (usually) not the intention at all.
The common case is that the remainder terms have a roughly _constant_, smallish variance and are interpreted as noise. Of course the magnitude of the noise is important information.
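To make the "noise" reading concrete, a small sketch (assuming Z from
Martin's example, and a full-rank fit so that all 32 sdevs are available):

pZ.full <- prcomp(Z)                        # all 32 PCs, no 'tol'
k <- 14                                     # components retained above
sigma2.hat <- mean(pZ.full$sdev[-(1:k)]^2)  # avg variance of the remainder
sigma2.hat                                  # the "magnitude of the noise"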
pd
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email] Priv: [hidden email]


> On 25 Mar 2016, at 10:41 am, peter dalgaard < [hidden email]> wrote:
>
> The common case is that the remainder terms have a roughly _constant_, smallish variance and are interpreted as noise. Of course the magnitude of the noise is important information.
>
But then you should use Factor Analysis, which has that concept of “noise” (unlike PCA).
Cheers, Jari Oksanen


> On 25 Mar 2016, at 10:08 , Jari Oksanen < [hidden email]> wrote:
>
> But then you should use Factor Analysis, which has that concept of “noise” (unlike PCA).
Actually, FA has a slightly different concept of noise. PCA can be interpreted as a purely technical operation, but also as an FA variant with the same variance for all components.
Specifically, FA is
Sigma = LL' + Psi
with Psi a diagonal matrix. If Psi = sigma^2 I, then L can be determined (up to rotation) as the first p components of PCA. (This is used in ML algorithms for FA, since it allows you to concentrate the likelihood into a function of Psi.)
Methods like PC regression are not very specific about the model, but the underlying line of thought is that PCs with small variances are "uninformative", so that you can make do with only the first handful of regressors. I tend to interpret "uninformative" as "noise-like" in these contexts.
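For the record, a sketch of that Psi = sigma^2 I special case (the
probabilistic-PCA solution of Tipping & Bishop; my reconstruction, using
Z from Martin's example, nothing that exists in stats):

q   <- 5
pc  <- prcomp(Z)
lam <- pc$sdev^2                          # eigenvalues of cov(Z)
s2  <- mean(lam[-(1:q)])                  # isotropic noise variance
L   <- pc$rotation[, 1:q] %*% diag(sqrt(lam[1:q] - s2))
Sigma.hat <- tcrossprod(L) + s2 * diag(ncol(Z))  # = L L' + sigma^2 I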
pd
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email] Priv: [hidden email]


> On 25 Mar 2016, at 11:45 am, peter dalgaard < [hidden email]> wrote:
>
> Specifically, FA is
>
> Sigma = LL' + Psi
>
> with Psi a diagonal matrix. If Psi = sigma^2 I, then L can be determined (up to rotation) as the first p components of PCA. (This is used in ML algorithms for FA, since it allows you to concentrate the likelihood into a function of Psi.)
If I remember correctly, we took a correlation matrix and replaced the diagonal elements with variable “communalities” < 1 estimated by some trick, and then fed that matrix into PCA and called the result FA. A more advanced way was to do this iteratively: take the first few axes of PCA/FA, calculate the diagonal elements from them, and re-feed these into PCA. It was done like that because algorithms & computers were not strong enough for real FA. Now they are, and I think it would be better to treat PCA like PCA, at least in the default output of the standard stats::summary function. So summary should show the proportion of total variance (for people who think this is a cool thing to know) instead of showing a proportion of an unspecified part of the variance.
Cheers, Jari Oksanen (who now switches to listening to today’s Passion instead of continuing with PCA)


>>>>> peter dalgaard < [hidden email]>
>>>>>     on Fri, 25 Mar 2016 09:41:00 +0100 writes:

    > As I see it, the display showing the first p << n PCs
    > adding up to 100% of the variance is plainly wrong.
Thank you, Peter, Kasper, Steve.
@Kasper, I've known about *approximate* first-few-PC methods.
(Are there also exact ones which are more efficient than those
based on LAPACK's DGESDD?)
... and so indeed, prcomp() will not be the method of choice for
really large problems. Still, of course, we should try to "do our best".
Given your sentiments, notably Peter's, I now envisage making the
non-back-compatible change to *summary.prcomp()*: compute the
"importances" based on all proportions up to 'p' (= 100%).
What I think would be nice is for the print.summary.prcomp()
method to only show the first 'k' (my notation), i.e., those
which were chosen by 'tol' and/or 'rank.'.
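Roughly along these lines -- a sketch only, where the helper name and
arguments are mine and not the final code:

## importance from *all* p variances; print only the first k:
impSketch <- function(object, k, totVar = sum(object$sdev^2)) {
    v <- object$sdev^2
    imp <- rbind("Standard deviation"     = object$sdev,
                 "Proportion of Variance" = v / totVar,
                 "Cumulative Proportion"  = cumsum(v) / totVar)
    round(imp[, 1:k, drop = FALSE], 5)
}
impSketch(prcomp(Z), k = 5) # proportions w.r.t. all 32, showing 5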
Martin

