
Correlation of huge matrix saved as binary file

Correlation of huge matrix saved as binary file

Bryo
Hi,

I have a 900,000,000 x 9,000 matrix where I need to calculate the correlation between all pairs of columns (the smaller dimension), producing a 9,000 x 9,000 correlation matrix. The matrix is too large to load into R, so it is saved as a binary file. To access the data in the file I use mmap and some api functions (to get all values in one row, one column, or one particular value). I'm looking for advice on how to calculate the correlation matrix. Right now my approach is something like this (toy code):

corr.matrix <- matrix(NA_real_, nrow = 9000, ncol = 9000)

for (i in 1:8999) {            # stop at 8999 so (i+1):9000 never runs backwards
  for (j in (i + 1):9000) {
    # i1 <- ... getting the index of item (i) in a second file
    # i2 <- ... getting the index of item (j)
    g1 <- api$getCol(i1)
    g2 <- api$getCol(i2)
    corr.matrix[i, j] <- cor(g1, g2)
  }
}

This will work, but it will take forever. Any advice on how this can be done more efficiently? I'm running on a 2.6.18 Linux system, with R version 2.11.1.

Thanks!

Re: Correlation of huge matrix saved as binary file

plangfelder
I don't think you can speed it up by a whole lot... but you can try a
few things, especially if you don't have missing data in the matrix
(which you probably don't). The main question is what takes most of
the time: the api calls or the cor() call itself? If it's cor(), here's
what you can try:

1. Pre-standardize the entire input matrix, i.e. scale each column to
mean = 0 and sum of squares = 1. Save the standardized matrix (or make
sure it's available to the api). Since your matrix only has 9000
columns, this should not take extremely long.

2. Instead of calculating correlations, simply calculate sum(g1*g2):
if g1 and g2 are standardized as above, their correlation equals sum(g1*g2).

3. Instead of calculating the correlations one-by-one, calculate them
in small blocks (if you have enough memory and you run a 64-bit R).
With 900M rows you will only be able to fit a 900M x 2 matrix into a
single R object, but if you have two such standardized matrices loaded
as g1 and g2, you can get their (2x2) correlation matrix as
t(g1) %*% g2. You can then use this 2x2 matrix to fill the appropriate
entries of the result matrix.

4. Use one of the multi-threading packages (multicore comes to mind
but there are others) to parallelize your code. If you have 8
available cores, you can expect a nearly 8x speedup.
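Putting points 1-3 together, a minimal in-memory sketch (a small random matrix stands in for the mmap'ed data; with the real file, each block of standardized columns would be assembled from api$getCol calls):

```r
## Toy stand-in for the 900M x 9000 on-disk matrix.
set.seed(1)
x <- matrix(rnorm(1000 * 6), nrow = 1000, ncol = 6)

## Point 1: standardize each column to mean 0 and sum of squares 1.
xs <- scale(x, center = TRUE, scale = FALSE)
xs <- sweep(xs, 2, sqrt(colSums(xs^2)), "/")

## Points 2-3: for columns standardized this way, cor(g1, g2) equals
## sum(g1 * g2), so a whole block of correlations is one cross-product.
block1 <- xs[, 1:3]
block2 <- xs[, 4:6]
corr.block <- t(block1) %*% block2   # a 3 x 3 block of the full result

## Agrees with cor() computed directly on the raw columns.
all.equal(as.vector(corr.block), as.vector(cor(x[, 1:3], x[, 4:6])))
```

Each such block then fills the corresponding rows and columns of the 9000 x 9000 result, so the number of expensive operations scales with the number of blocks rather than the number of column pairs.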

All in all, this will probably still take forever, but should be one
or two orders of magnitude faster than your current code :)

HTH,

Peter

On Fri, Mar 2, 2012 at 2:50 PM, Bryo <[hidden email]> wrote:

> [original message quoted in full]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Correlation of huge matrix saved as binary file

Thomas Lumley-2
On Sat, Mar 3, 2012 at 2:36 PM, Peter Langfelder
<[hidden email]> wrote:

> 3. Instead of calculating the correlations one-by-one, calculate them
> in small blocks (if you have enough memory and you run a 64-bit R).
> With 900M rows, you will only be able to put a 900Mx2 into an R
> object, but if you have two such standardized matrices loaded in g1,
> g2, you can get their (2x2) correlation matrix by t(g1) %*% g2. This
> 2x2 matrix you can use to fill the appropriate components of the
> result matrix.

Or split it the other way. Compute the contribution of all 9000
variables on, say, 50k observations and store it. Repeat for all
18,000 such chunks of the 900M rows, accumulating the sums and
cross-products, then scale to a correlation at the end.
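A toy sketch of that chunked accumulation (again a small in-memory matrix stands in for the file; with the real data each chunk would be read via the mmap api). Accumulate the column sums and the cross-product matrix chunk by chunk, then form the covariance and correlation once at the end:

```r
set.seed(1)
n <- 1000; p <- 5                    # stand-ins for 900M rows and 9000 columns
x <- matrix(rnorm(n * p), n, p)

chunks <- split(1:n, ceiling((1:n) / 200))   # 5 chunks of 200 rows each

s  <- numeric(p)                     # running column sums
cp <- matrix(0, p, p)                # running cross-product t(x) %*% x
for (idx in chunks) {
  xc <- x[idx, , drop = FALSE]       # one chunk "read from the file"
  s  <- s + colSums(xc)
  cp <- cp + crossprod(xc)
}

## Covariance from the accumulated sums, then scale to correlation.
covmat <- (cp - outer(s, s) / n) / (n - 1)
cormat <- cov2cor(covmat)

all.equal(cormat, cor(x))
```

Each chunk only needs 50k x 9000 doubles in memory at a time, and the final 9000 x 9000 covariance and correlation matrices are small.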

    -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
