Limited number of principal components in PCA


William Armstrong
Hi all,

I am attempting to run PCA on a matrix (nrow=66, ncol=84) using 'prcomp' (stats package).  My data (referred to as 'Q' in the code below) are peak instantaneous discharges (rows) at separate river streamflow gaging stations (columns).  I am attempting to use PCA to identify regions that vary together.

I am entering the following command:

test_pca_Q<-prcomp(~.,data=Q,scale.=TRUE,retx=FALSE,na.action=na.omit)

It is outputting 54 'standard deviation' numbers (which are the sqrt(eigenvalues) with respect to a given PC, am I correct?), and 54 sets of 'rotation' numbers, which are the variable loadings with respect to a given PC.

I have two questions:

1.) Why is it only outputting 54 PCs and standard deviations?  If I have 84 variables, isn't the maximum number of PCs I can create 84 as well?

2.) Can I now use the 'rotation' values to find clusters of gages that are acting together, or is there another step I must take?

Thank you very much for your insight.

Billy
Re: Limited number of principal components in PCA

Joshua Wiley-2
Hi Billy,

Can you provide your data?  You could attach it as a text file or
provide it by pasting the output of:

dput(Q)

into an email.  It would help if we could reproduce what you are
doing.  You might also consider a list or forum that is more
statistics-oriented than R-help, as your questions are more about
the statistics than the software itself (but still, if you give us
data, you will probably get farther).

Cheers,

Josh

--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
https://joshuawiley.com/

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Limited number of principal components in PCA

David Carlson
Providing the data will help, but the first thing I noted is that you have more columns (variables) than rows (cases). PCA will return at most (the number of columns) or (the number of rows - 1) components, whichever is less. Having 84 columns and 66 rows means you can get no more than 65 components. If the variables are highly correlated, you will get fewer components, and that probably explains the reduction to 54. I would guess the variables are highly correlated and the first eigenvalue is very large.
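
A minimal sketch of the component limit described above, using simulated data rather than the poster's file (the matrix X and its dimensions are assumptions made purely for illustration). With fewer rows than columns, prcomp() reports as many standard deviations as there are rows, but centering removes one dimension, so the last one is numerically zero:

```r
set.seed(1)
n <- 10                                # rows (cases), hypothetical
p <- 20                                # columns (variables), hypothetical
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

pca <- prcomp(X, scale. = TRUE)

n_returned <- length(pca$sdev)         # number of standard deviations reported
n_nonzero  <- sum(pca$sdev > 1e-8)     # components with non-negligible variance

print(n_returned)
print(n_nonzero)
```

This is why a limit of (rows - 1) informative components and a printout containing (rows) values are consistent with each other.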

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352



Re: Limited number of principal components in PCA

William Armstrong
David and Josh,

Thank you for the suggestions.  I have attached a file ('q_values.txt') that contains the values of the 'Q' variable.

David -- I am attempting an 'S' mode PCA, where the columns are actually the cases (different stream gaging stations) and the rows are the variables (the maximum flow at each station for a given year).  I think the format you are referring to is 'R' mode, but I was under the impression that R (the program, not the PCA mode) could handle the analyses in either format.  Am I mistaken?

My first eigenvalue is:

> unrotated_pca_q$sdev[1]^2
[1] 17.77812

Does that value seem large enough to explain the reduction in principal components from 65 to 54?

Also, the loadings on the first PC are not particularly high:

 > max(abs(unrotated_pca_q$rotation[1:84]))
[1] 0.1794776

Does that suggest that maybe the data are not very highly correlated?

Thank you both very much for your help.

Billy

q_values.txt
Re: Limited number of principal components in PCA

Michael Dewey
At 19:07 04/08/2011, William Armstrong wrote:

>My first eigenvalue is:
>
> > unrotated_pca_q$sdev[1]^2
>[1] 17.77812
>
>Does that value seem large enough to explain the reduction in principal
>components from 65 to 54?

Try running
table(complete.cases(q_values))
substituting whatever you named the object that holds q_values.txt.

Does that help?

Moral: when R does not do what you thought you told it to do, it may
still have done what you told it to do.
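
One way to act on this hint, sketched on simulated data (q_demo and the positions of its missing values are invented for illustration; they are not the contents of q_values.txt). Beyond counting complete rows, it can help to see which rows and columns carry the NAs:

```r
set.seed(42)
# Hypothetical stand-in for the posted data: 66 rows, 84 columns
q_demo <- as.data.frame(matrix(rnorm(66 * 84), nrow = 66, ncol = 84))
for (i in 1:12) q_demo[i, i] <- NA            # one NA in each of the first 12 rows

n_complete <- sum(complete.cases(q_demo))     # rows with no NA at all
bad_rows   <- which(!complete.cases(q_demo))  # rows that na.omit would drop
na_per_col <- colSums(is.na(q_demo))          # missing values per column

print(table(complete.cases(q_demo)))
print(bad_rows)
print(names(na_per_col[na_per_col > 0]))      # columns containing at least one NA
```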




Michael Dewey
[hidden email]
http://www.aghmed.fsnet.co.uk/home.html

Re: Limited number of principal components in PCA

Joshua Wiley-2
Hi Billy,

Thanks for posting your data.  Okay, first off as Michael pointed out:

> table(complete.cases(Q))

FALSE  TRUE
   12    54

shows that of the 66 rows in your data set, only 54 are complete.
That means when you use na.action = na.omit, you are actually only
passing a data frame with 54 rows.  Second, prcomp() will not return
more components than observations.  Think of it this way: it is like
trying to connect 54 data points with 84 lines; 53 lines already fit
perfectly (a straight line connects two points), and you are trying
to go way past that.
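
The effect of na.action = na.omit on the number of components can be reproduced on simulated data. This is only a sketch: Q_demo is a made-up 66 x 84 data frame with a missing value in each of 12 rows, chosen to mimic the counts above, not the actual streamflow data.

```r
set.seed(123)
# Hypothetical data with the same shape as the posted matrix
Q_demo <- as.data.frame(matrix(rnorm(66 * 84), nrow = 66, ncol = 84))
for (i in 1:12) Q_demo[i, i] <- NA     # 12 incomplete rows, 54 complete rows

pca_demo <- prcomp(~ ., data = Q_demo, scale. = TRUE, retx = FALSE,
                   na.action = na.omit)

n_sdev  <- length(pca_demo$sdev)       # one standard deviation per returned PC
rot_dim <- dim(pca_demo$rotation)      # variables x returned PCs

print(n_sdev)
print(rot_dim)
```

The rotation matrix still has one row per variable; only the number of columns (components) is capped by the rows that survive na.omit.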

I have not heard (this is just an indicator of my ignorance, not their
lack of existence) of 'S' vs. 'R' mode PCA, but if you want the
columns to be the cases, just t()ranspose the data frame.

> table(complete.cases(t(Q)))
FALSE  TRUE
    9    75

So there are still only 75 possible observations to work with due to
missingness, but that is enough for the 66 variables:

test_pca_Q2 <- prcomp(~ ., data = data.frame(t(Q)), scale. = TRUE, retx = FALSE,
  na.action = na.omit)

> length(test_pca_Q2$sdev)
[1] 66

so there are 66 SDs for the 66 principal components.
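
The transposition step can be tried on simulated data as well. In this sketch Q_demo is again hypothetical, with gaps placed in 9 columns so that the counts resemble those above. Note that t() returns a matrix, which is why it is wrapped in data.frame() before going to the formula interface:

```r
set.seed(7)
# Hypothetical 66 x 84 data frame with a missing value in each of 9 columns
Q_demo <- as.data.frame(matrix(rnorm(66 * 84), nrow = 66, ncol = 84))
for (i in 1:9) Q_demo[i, i] <- NA

Qt <- data.frame(t(Q_demo))            # now 84 rows (stations) x 66 columns (years)
n_complete_t <- sum(complete.cases(Qt))

pca_t <- prcomp(~ ., data = Qt, scale. = TRUE, retx = FALSE,
                na.action = na.omit)
n_sdev_t <- length(pca_t$sdev)

print(dim(Qt))
print(n_complete_t)
print(n_sdev_t)
```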

Regarding what the 'sdev' values are, they are the square roots of the
eigenvalues of the correlation matrix (in your case, since you
scaled).  You can see this below:

## first ten (1:10) square roots of the eigen values of the correlation matrix
## of the complete cases of the transposed data set 'Q'
> sqrt(eigen(cor(na.omit(t(Q))))$values)[1:10]
 [1] 7.3267465 2.0349335 1.2913823 1.0750288 0.9035650 0.8301671 0.7370896
 [8] 0.7132530 0.6196836 0.5396176
## sdev from prcomp()
> test_pca_Q2$sdev[1:10]
 [1] 7.3267465 2.0349335 1.2913823 1.0750288 0.9035650 0.8301671 0.7370896
 [8] 0.7132530 0.6196836 0.5396176
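
The same check can be run by anyone without the attached file. The sketch below uses the built-in USArrests data set (an assumption of convenience, unrelated to the streamflow data) and also converts the squared sdev values into the proportion of variance each component explains:

```r
pca <- prcomp(USArrests, scale. = TRUE)
ev  <- eigen(cor(USArrests))$values    # eigenvalues of the correlation matrix

# sdev from prcomp() should match the square roots of the eigenvalues
same <- isTRUE(all.equal(pca$sdev, sqrt(ev)))

# With scaled data the eigenvalues sum to the number of variables,
# so each eigenvalue divided by that total is a proportion of variance
total_var <- sum(pca$sdev^2)
prop_var  <- pca$sdev^2 / total_var

print(same)
print(total_var)
print(round(prop_var, 3))
```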

You can also try the principal() function in package "psych".  It has
a lot of nice options, and I tend to use it for all this sort of
stuff.

Cheers,

Josh

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/
