# dist function in R is very slow

## dist function in R is very slow

Dear R developers,

I am visualising high-dimensional genomic data, and for this purpose I need to compute pairwise distances between many points in a high-dimensional space (say I have a matrix of 5,000 rows and 20,000 columns, so the result is a 5,000 x 5,000 matrix, or its upper triangle). Computing this in R takes many hours (I am doing it on a Linux server with more than 100 GB of RAM, so memory is not the problem). When I write the matrix to disk, read it and compute the distances in C, write them to disk and read them back into R, it takes 10-15 minutes (and I did not spend much time optimising my C code).

The question is why the R function is so slow. I understand that it calls C (or C++) to compute the distances. My suspicion is that the transposed matrix is passed to C, so each distance is computed between two columns of the matrix; since C stores matrices by rows, this is very inefficient and causes many cache misses (my first C implementation was like this, and I had to stop the run after an hour when it failed to complete).

If my suspicion is correct, is it possible to rewrite the dist function so that it works faster on large matrices?

Best regards,
Moshe Olshansky
Monash University

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
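For reference, the computation in question is just base R's dist() applied to the rows of the data matrix; a scaled-down sketch of the setup (sizes reduced so it runs in seconds; the real matrix is 5,000 x 20,000):

```r
# Scaled-down stand-in for the 5,000 x 20,000 genomic matrix.
set.seed(42)
x <- matrix(rnorm(200 * 500), nrow = 200, ncol = 500)

d <- dist(x)          # pairwise Euclidean distances between the 200 rows
length(d)             # "dist" object stores the lower triangle: 200 * 199 / 2 values
dm <- as.matrix(d)    # expand to the full symmetric 200 x 200 matrix if needed
```

At the original scale, the 5,000 x 5,000 result holds 5,000 * 4,999 / 2 (about 12.5 million) distances, each requiring a pass over 20,000 coordinates.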

## Re: dist function in R is very slow

> On 17 Jun 2017, at 08:47, Moshe Olshansky via R-devel <[hidden email]> wrote:
>
> The question is why the R function is so slow? [...] My suspicion is that the transposed matrix is passed to C and so each time a distance between two columns of a matrix is computed, and since C stores matrices by rows it is very inefficient and causes many cache misses [...]

There are two main reasons for the relatively low speed of the built-in dist() function: (i) it operates on row vectors, which leads to many cache misses because matrices are stored by column in R (as you guessed); (ii) the function takes care to handle missing values correctly, which adds a relatively expensive test and conditional branch to each iteration of the inner loop.

A faster implementation, which omits the NA test and can compute distances between column vectors, is available as dist.matrix() in the "wordspace" package. However, it cannot be used with matrices that might contain NAs (and it does not warn about such arguments).

If you want the best possible speed, use cosine similarity (or, equivalently, angular distance). The underlying cross product is very efficient with a suitable BLAS implementation.

Best,
Stefan
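Stefan's cosine/BLAS suggestion can be sketched in a few lines of base R (a minimal illustration, not the wordspace implementation; the function names here are ad hoc): normalise each row to unit length, and a single tcrossprod() call, delegated to BLAS, yields all pairwise cosine similarities at once.

```r
# Cosine similarity between rows: sim[i, j] = <x_i, x_j> / (||x_i|| * ||x_j||).
cosine_sim <- function(m) {
  mn <- m / sqrt(rowSums(m^2))   # unit-normalise each row (vector recycles down columns)
  tcrossprod(mn)                 # mn %*% t(mn) in one BLAS-backed call
}

# Angular distance: acos of the similarity, clamped against rounding error.
angular_dist <- function(m) {
  s <- cosine_sim(m)
  acos(pmin(pmax(s, -1), 1))
}
```

With an optimised BLAS (OpenBLAS, MKL, ...), the tcrossprod() call dominates the runtime and is far faster than looping over pairs of rows.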

## Re: dist function in R is very slow

Hi Stefan,

Thank you very much for pointing me to the wordspace package. It does the job a bit faster than my C code, but is 100 times more convenient.

By the way, since the tcrossprod function in the Matrix package is so fast, the Euclidean distance can be computed very quickly:

```r
euc_dist <- function(m) {
  mtm <- Matrix::tcrossprod(m)
  sq <- rowSums(m * m)
  sqrt(outer(sq, sq, "+") - 2 * mtm)
}
```

It takes less than 50 seconds for my (dense) matrix of 5,054 rows and 12,803 columns, while dist.matrix() with method = "euclidean" takes almost 10 minutes (which is still orders of magnitude faster than dist()).

> From: Stefan Evert <[hidden email]>
> Sent: Sunday, 18 June 2017
> Subject: Re: [Rd] dist function in R is very slow
> [...]
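A quick sanity check of the tcrossprod() identity above against base dist() (a sketch using plain base-R tcrossprod() rather than the Matrix package; the helper name is ad hoc). One caveat: rounding can make outer(sq, sq, "+") - 2 * mtm marginally negative for near-identical rows, producing NaNs from sqrt(), so clamping with pmax(..., 0) is a prudent addition:

```r
# Euclidean distance via the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>,
# clamped to guard against small negative values from rounding.
euc_dist_safe <- function(m) {
  mtm <- tcrossprod(m)          # Gram matrix m %*% t(m), BLAS-backed
  sq  <- rowSums(m * m)         # squared row norms
  sqrt(pmax(outer(sq, sq, "+") - 2 * mtm, 0))
}

set.seed(1)
m <- matrix(rnorm(60), nrow = 12)                  # 12 points in 5 dimensions
max(abs(euc_dist_safe(m) - as.matrix(dist(m))))    # agrees up to rounding error
```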