I don't think you can speed it up by a whole lot, but you can try a
few things, especially if you don't have missing data in the matrix
(which you probably don't). The main question is what takes most of
the time: the api calls or the cor() call? If it's cor(), here's what
you can try:

1. Pre-standardize the entire input matrix, i.e. scale each
column to mean=0 and sum of squares=1. Save the standardized matrix
(or make sure it's available to the api). Since your matrix only has 9000
columns, this should not take extremely long.

2. Instead of calling cor(), simply calculate sum(g1*g2):
if g1 and g2 are standardized as above, their correlation equals sum(g1*g2).

3. Instead of calculating the correlations one by one, calculate them
in small blocks (if you have enough memory and you run a 64-bit R).
With 900M rows, you will only be able to put a 900Mx2 matrix into an R
object, but if you have two such standardized matrices loaded in g1 and
g2, you can get their (2x2) cross-correlation matrix by t(g1) %*% g2. You
can use this 2x2 matrix to fill the appropriate entries of the
result matrix.

4. Use one of the multi-threading packages (multicore comes to mind,
but there are others) to parallelize your code. If you have 8
available cores, you can expect a nearly 8x speedup, as long as the
api calls are not the bottleneck.
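
Steps 1 to 3 can be sketched on a toy matrix as follows. This is only an illustration under assumptions: the small in-memory matrix x stands in for the binary file, z stands in for the saved standardized copy, and get_block() is a made-up helper playing the role of the api calls. None of these names come from the original code.

```r
# Toy stand-in for the memory-mapped file: 200 rows, 6 columns.
set.seed(1)
x <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)

# Step 1: centre a column and scale it to unit sum of squares.
standardize <- function(v) {
  v <- v - mean(v)
  v / sqrt(sum(v^2))
}

# Done once up front; z stands in for the saved standardized matrix.
z <- apply(x, 2, standardize)

# Hypothetical helper replacing the api calls: returns a rows x block matrix.
get_block <- function(cols) z[, cols, drop = FALSE]

p <- ncol(x)
block_size <- 2  # with 900M rows, about 2 columns fit in one R object
blocks <- split(seq_len(p), ceiling(seq_len(p) / block_size))

cor_matrix <- matrix(NA_real_, nrow = p, ncol = p)
for (a in seq_along(blocks)) {
  g1 <- get_block(blocks[[a]])
  for (b in a:length(blocks)) {
    g2 <- if (b == a) g1 else get_block(blocks[[b]])
    # Steps 2 and 3: for standardized columns, t(g1) %*% g2 holds the
    # correlations between the columns of g1 and the columns of g2.
    r <- crossprod(g1, g2)
    cor_matrix[blocks[[a]], blocks[[b]]] <- r
    cor_matrix[blocks[[b]], blocks[[a]]] <- t(r)
  }
}
```

On the toy data the result agrees with cor(x) up to rounding. Each (a, b) block pair is independent of the others, so for step 4 the same list of pairs could be handed to a parallel apply, for example mclapply() from the parallel package (which absorbed multicore).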

All in all, this will probably still take forever, but it should be one
or two orders of magnitude faster than your current code :)
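
To put "forever" in perspective, here is a rough count of the work involved. It is a back-of-the-envelope sketch, and it assumes the values are stored as 8-byte doubles, which the original post does not state; the variable names are mine.

```r
n_rows <- 9e8    # rows of the matrix (from the original post)
n_cols <- 9000   # columns of the matrix (from the original post)

# Number of distinct column pairs, i.e. correlations to compute.
n_pairs <- choose(n_cols, 2)

# Size of one column, ASSUMING 8-byte doubles.
bytes_per_col <- n_rows * 8

# One multiply-add per row for every pair.
multiply_adds <- n_pairs * n_rows

# Data volume if every pair re-reads both of its columns,
# as the one-pair-at-a-time loop does.
bytes_read <- 2 * n_pairs * bytes_per_col

cat(sprintf("pairs: %.0f\n", n_pairs))
cat(sprintf("GB per column: %.1f\n", bytes_per_col / 1e9))
cat(sprintf("multiply-adds: %.2e\n", multiply_adds))
cat(sprintf("bytes read: %.2e\n", bytes_read))
```

Under that assumption a single column is about 7.2 GB, so the number of times each column has to be fetched matters at least as much as the arithmetic itself.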

HTH,

Peter

On Fri, Mar 2, 2012 at 2:50 PM, Bryo <[hidden email]> wrote:

> Hi,
>
> I have a 900,000,000*9,000 matrix where I need to calculate the correlation
> between all entries along the smaller dimension, thus creating a 9k*9k
> correlation matrix. This matrix is too big to be loaded into R, and is saved
> as a binary file. To access the data in the file I use mmap and some
> api-functions (to get all values in one row, one column, or one particular
> value). I'm looking for some advice on how to calculate the correlation
> matrix. Right now my approach is to do something similar to this (toy code):

>

> # numeric result matrix, filled in above the diagonal
> cor.matrix <- matrix(NA_real_, ncol=9000, nrow=9000)
>
> for (i in 1:8999) {
>   for (j in (i+1):9000) {
>     # i1=... getting the index of item (i) in a second file
>     # i2=....getting the index of item (j)
>     g1=api$getCol(i1)
>     g2=api$getCol(i2)
>     cor.matrix[i,j]=cor(g1,g2)
>   }
> }

>

> This will work, but will take forever. Any advice for how this can be done
> more efficiently? I'm running on a 2.6.18 Linux system, with R version
> R-2.11.1.

>

> Thanks!

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.