Performance of .C and .Call functions vs. native R code


Performance of .C and .Call functions vs. native R code

Alireza Mahani
Hello,

I am in the process of writing an R extension for parallelized MCMC, with heavy use of compiled code (C++). I have been getting my feet wet by implementing a simple matrix-vector multiplication function in C++ (which calls the BLAS level-2 function dgemv) and comparing it to the '%*%' operator in R (which apparently calls the BLAS level-3 function dgemm).

Interestingly, I cannot replicate the performance of the native R operator using either '.C' or '.Call'. The relative times are 17 (R), 30 (.C), and 26 (.Call); in other words, the native R operator is roughly 1.5x faster than my compiled code. Can you explain why this is? From testing, I strongly suspect that the BLAS function itself does not take the bulk of the time; rather, data transfer and other overhead associated with the '.C' and '.Call' interfaces seem to be the main issue. Are there ways to reach the performance of native R code in this case?
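
(For reference, a minimal sketch of what a dgemv-based implementation can look like, using the BLAS declarations R ships in R_ext/BLAS.h. This is illustrative, not the poster's benchmarked code, which appears later in the thread and uses a hand-written loop. The function name matvecBLAS is hypothetical; on current versions of R the call would additionally append a hidden character-length argument, FCONE, for the "N" flag.)

/* matvecBLAS.c -- illustrative sketch: y = A %*% x via the level-2
   BLAS routine dgemv, as declared in R_ext/BLAS.h */
#include <R.h>
#include <Rinternals.h>
#include <R_ext/BLAS.h>

SEXP matvecBLAS(SEXP A, SEXP x)
{
    int n = length(x), m = length(A) / n, ione = 1;
    double one = 1.0, zero = 0.0;
    SEXP y = PROTECT(allocVector(REALSXP, m));
    /* "N": use A as stored (column-major), no transpose;
       lda = m because A has m rows */
    F77_CALL(dgemv)("N", &m, &n, &one, REAL(A), &m,
                    REAL(x), &ione, &zero, REAL(y), &ione);
    UNPROTECT(1);
    return y;
}

After R CMD SHLIB, this would be invoked as .Call("matvecBLAS", A, x), exactly like the examples later in the thread.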

Thank you,
Alireza Mahani

Re: Performance of .C and .Call functions vs. native R code

Jeffrey Ryan
The .Call overhead isn't the issue. If you'd like some insight into what you are doing wrong (and right), you need to provide code with which the list can reproduce your timings.

This is outlined in the posting guide as well.

Best,
Jeff




Re: Performance of .C and .Call functions vs. native R code

Alireza Mahani
(I am using a Linux machine.)

Jeff,

In creating reproducible results, I have partially answered my own question. I have attached two scripts, 'mvMultiply.r' and 'mvMultiply.cc'. Please copy both files into a directory of your choice, then run 'Rscript mvMultiply.r' in that directory for each of the four combinations of the boolean parameters 'INCLUDE_DATAPREP' and 'ROWMAJOR'. (The variable 'diffVec' is there to verify that the two methods produce the same output vector.)

Below are the results that I get, along with discussion (tR and tCall are in sec):

INCLUDE_DATAPREP  ROWMAJOR      tR   tCall
F                 F         13.536  13.875
F                 T         13.824  14.299
T                 F         13.688  18.167
T                 T         13.982  30.730

Interpretation: the execution time for the .Call line itself is nearly identical to that of the R operator '%*%'. The overhead comes from the two data-preparation lines required by the .Call method:

A <- t(A)                          (~12 sec total, or ~12 µs per call)
dim(A) <- dim(A)[1] * dim(A)[2]    (~4 sec total, or ~4 µs per call)

While the first line can be avoided by providing a transpose option in the C++ function (as the BLAS API does), I wonder whether the second line can be avoided as well, aside from the obvious option of rewriting the R scripts to use vectors instead of matrices. That, however, would defeat one of the primary advantages of using R, namely succinct, high-level code. In particular, if one passes several matrices into a .Call function, the overhead from matrix-to-vector transformations can add up. To summarize, my questions are:

1- Do the above results seem reasonable to you? Is there a similar penalty in R's '%*%' operator for transforming matrices to vectors before calling BLAS functions?
2- Are there techniques for reducing the above overhead for developers looking to augment their R code with compiled code?

Regards,
Alireza

---------------------------------------
# mvMultiply.r
# Compare performance of matrix-vector multiplication in R (the '%*%'
# operator) vs. compiled code (via the .Call interface).
# y [m x 1] = A [m x n] %*% x [n x 1]

rm(list = ls())
system("R CMD SHLIB mvMultiply.cc")
dyn.load("mvMultiply.so")

INCLUDE_DATAPREP <- FALSE  # time the data-preparation lines as part of .Call?
ROWMAJOR <- FALSE          # does the C++ function treat A as row-major or column-major?

m <- 100
n <- 10
N <- 1000000

diffVec <- array(0, dim = N)
tR <- 0.0
tCall <- 0.0
for (i in 1:N) {

    A <- runif(m * n); dim(A) <- c(m, n)
    x <- runif(n)

    t1 <- proc.time()[3]
    y1 <- A %*% x
    tR <- tR + proc.time()[3] - t1

    if (INCLUDE_DATAPREP) {
        t1 <- proc.time()[3]
    }
    if (ROWMAJOR) {
        # R stores matrices in column-major order, so if the C++ function
        # expects row-major format we must transpose A before flattening it
        A <- t(A)
    }
    dim(A) <- dim(A)[1] * dim(A)[2]
    if (!INCLUDE_DATAPREP) {
        t1 <- proc.time()[3]
    }
    y2 <- .Call("matvecMultiply", as.double(A), as.double(x), as.logical(c(ROWMAJOR)))
    tCall <- tCall + proc.time()[3] - t1

    diffVec[i] <- max(abs(y2 - y1))
}
cat("Data prep time for '.Call' included: ", INCLUDE_DATAPREP, "\n")
cat("C++ function expects row-major matrix: ", ROWMAJOR, "\n")
cat("Time - Using '%*%' operator in R: ", tR, "sec\n")
cat("Time - Using '.Call' function: ", tCall, "sec\n")
cat("Maximum difference between methods: ", max(diffVec), "\n")

dyn.unload("mvMultiply.so")
---------------------------------------
// mvMultiply.cc
#include <R.h>
#include <Rinternals.h>

extern "C" {

// y = A %*% x, where A is passed as a flat m*n vector and x has length n;
// 'rowmajor' selects the storage order in which A is interpreted
SEXP matvecMultiply(SEXP A, SEXP x, SEXP rowmajor) {
    double *rA = REAL(A), *rx = REAL(x), *ry;
    int *rrm = LOGICAL(rowmajor);
    int n = length(x);
    int m = length(A) / n;
    SEXP y;
    PROTECT(y = allocVector(REALSXP, m));
    ry = REAL(y);
    for (int i = 0; i < m; i++) {
        ry[i] = 0.0;
        for (int j = 0; j < n; j++) {
            if (rrm[0] == 1) {
                ry[i] += rA[i * n + j] * rx[j];  // row-major indexing
            } else {
                ry[i] += rA[j * m + i] * rx[j];  // column-major indexing
            }
        }
    }
    UNPROTECT(1);
    return y;
}

}

Re: Performance of .C and .Call functions vs. native R code

Gabriel Becker
On Thu, Jul 14, 2011 at 8:21 AM, Alireza Mahani <[hidden email]> wrote:

> [...]
> A <- t(A)                          (~12 sec total, or ~12 µs per call)
> dim(A) <- dim(A)[1] * dim(A)[2]    (~4 sec total, or ~4 µs per call)

AFAIK R stores matrices as plain vectors internally anyway, and the dim attribute just tells it where the various elements sit, so I'm not sure that second line is needed at all.

I have attached a tiny piece of C code which verifies this.
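
(The attachment itself is not preserved in this archive. A minimal reconstruction consistent with the call shown in the output below might look as follows; the function name R_MatVecTest and its arguments are taken from that output, while the body is an assumption.)

/* matvectest.c -- hypothetical reconstruction;
   build with R CMD SHLIB matvectest.c */
#include <R.h>
#include <Rinternals.h>

/* Compare the underlying storage of a numeric vector and of a matrix
   built from it: if R stores matrices as plain vectors in column-major
   order, the first n doubles must be bitwise identical. */
SEXP R_MatVecTest(SEXP vec, SEXP mat, SEXP n)
{
    double *rv = REAL(vec), *rm = REAL(mat);
    int len = INTEGER(n)[0];
    SEXP ans = PROTECT(allocVector(LGLSXP, 1));
    LOGICAL(ans)[0] = TRUE;
    for (int i = 0; i < len; i++) {
        if (rv[i] != rm[i]) {
            LOGICAL(ans)[0] = FALSE;
            break;
        }
    }
    UNPROTECT(1);
    return ans;
}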

The output I get from that is:

> dyn.load("/home/gmbecker/gabe/matvectest.so")
> vec = 1.1:8.1
> mat = matrix(vec, ncol = 4)
> .Call("R_MatVecTest", vec, mat, 8L)
[1] TRUE


Note that if you create the matrix with byrow = TRUE, the underlying storage may not match the original vector.

Hope that helps,
Gabe




--
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis


Re: Performance of .C and .Call functions vs. native R code

Alireza Mahani
You are absolutely right, Gabe! I removed the line 'dim(A) <- dim(A)[1] * dim(A)[2]' and my code still executes properly. As you said, matrices are stored internally as one-dimensional, column-major arrays; R merely exposes them differently by attaching a dim attribute, but as far as C/C++ code is concerned there is no distinction.
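
(A minimal version of that check, assuming the matvecMultiply function from mvMultiply.cc above has been compiled and loaded:)

# The dim attribute is invisible to REAL() on the C side, so the matrix
# can be passed to .Call as-is, with no transpose or flattening step.
A <- matrix(runif(100 * 10), 100, 10)
x <- runif(10)
y2 <- .Call("matvecMultiply", A, x, FALSE)
all.equal(drop(A %*% x), y2)   # TRUE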

Many thanks,
Alireza

Re: Performance of .C and .Call functions vs. native R code

Douglas Bates
On Thu, Jul 14, 2011 at 10:21 AM, Alireza Mahani
<[hidden email]> wrote:

> [...]

I realize that you are just beginning to use the .C and .Call
interfaces and your example is therefore a simple one.  However, if
you plan to continue with such development it is worthwhile learning
some of the tools available.  I think one of the most important is
the "inline" package, which can take a C or C++ code segment as a text
string and go through all the steps of creating and loading a
.Call'able compiled function.

Second, if you are going to use C++ (the code you show could be C code
as it doesn't use any C++ extensions) then you should look at the Rcpp
package written by Dirk Eddelbuettel and Romain Francois which allows
for comparatively painless interfacing of R objects and C++ objects.
The Rcpp-devel list, which I have copied on this reply, is for
questions related to that system.  The inline package allows for
various "plugin" constructions to wrap your code in the appropriate
headers and point the compiler to the locations of header files and
libraries.  There are two extensions to Rcpp for numerical linear
algebra in C++, RcppArmadillo and RcppEigen.  I show the use of
RcppEigen here.

Third, there are several packages in R that do the busy work of
benchmarking expressions and neatly formatting the results.  I use
the rbenchmark package.

Putting all these together yields the enclosed script and results.

In Eigen, a MatrixXd object is the equivalent of R's numeric matrix
(similarly MatrixXi for integer and MatrixXcd for complex) and a
VectorXd object is the equivalent of a numeric vector.  A "mapped"
matrix or vector is one that uses the storage allocated by R, thereby
avoiding a copy operation (similar to your accessing elements of the
arrays through the pointer returned by REAL()).  To adhere to R's
functional programming semantics it is a good idea to declare such
objects as const.  The 'as' and 'wrap' functions are provided by Rcpp
with extensions in RcppEigen to the Eigen classes in the development
version.  In the released versions of Rcpp and RcppEigen we use
intermediate Rcpp objects. These functions have the advantage of
checking the types of R objects being passed.  The Eigen code for
matrix multiplication will check the consistency of dimensions in the
operation.
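
(The script and results referred to here were attachments, listed at the end of this message. The following is a minimal sketch along the lines described above, assuming the inline, RcppEigen, and rbenchmark packages; it uses the released-version idiom of going through intermediate Rcpp objects. The function name matvecEigen and the benchmark settings are illustrative.)

library(inline)      # cxxfunction()
library(RcppEigen)   # provides the "RcppEigen" plugin for inline
library(rbenchmark)

src <- '
// Mapped Eigen objects reuse the storage R allocated (no copy);
// note the constructor order: data pointer first, then dimensions.
const Eigen::Map<Eigen::MatrixXd> A(REAL(A_), Rf_nrows(A_), Rf_ncols(A_));
const Eigen::Map<Eigen::VectorXd> x(REAL(x_), Rf_length(x_));
Eigen::VectorXd y = A * x;   // Eigen checks dimensional consistency
return Rcpp::wrap(y);
'
matvecEigen <- cxxfunction(signature(A_ = "matrix", x_ = "numeric"),
                           src, plugin = "RcppEigen")

m <- 100; n <- 10
A <- matrix(runif(m * n), m, n)
x <- runif(n)
stopifnot(all.equal(drop(A %*% x), matvecEigen(A, x)))

benchmark(R = A %*% x, eigen = matvecEigen(A, x),
          replications = 100000,
          columns = c("test", "replications", "elapsed", "relative"))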

Given the size of the matrix you are working with it is not surprising
that interpretation overhead and checking will be a large part of the
elapsed time, hence the relative differences between different methods
of doing the numerical calculation will be small.  The operation of
multiplying a 100 x 10 matrix by a 10-vector involves "only" 1000
floating point operations.  Furthermore, each element of the matrix is
used only once so sophisticated methods of manipulating cache contents
won't buy you much.  These benchmark results are from a system that
uses Atlas BLAS (basic linear algebra subroutines); other systems will
provide different results.  Interestingly, I found on some systems
using R's BLAS, which are not accelerated, the R code is closer in
speed to the code using Eigen.  An example is given in the second
version of the output.


matvec_Rout.txt (3K) Download Attachment
matvec_Rout_2.txt (3K) Download Attachment

Re: Performance of .C and .Call functions vs. native R code

Douglas Bates
I just saw that I left a syntax error in the .R and the first
_Rout.txt files.  Notice that in the second _Rout.txt file the
arguments in the constructors for the MMatrixXd and the MVectorXd
appear in a different order than in the .R and the first _Rout.txt
files.  The correct order has the pointer first, then the dimensions.
For the first _Rout.txt file this part of the code is not used.

On Tue, Jul 19, 2011 at 10:00 AM, Douglas Bates <[hidden email]> wrote:

> [...]


Re: Performance of .C and .Call functions vs. native R code

Alireza Mahani
Prof. Bates,

It looks like you read my mind! I am working on an R package for high-performance MCMC estimation of a class of Hierarchical Bayesian models often used in quantitative marketing; it would essentially be a parallelized version of Peter Rossi's bayesm package. While I have made good progress in parallelizing the most mathematically demanding part of the algorithm, namely slice sampling of the low-level coefficients, I have realized that putting the entire code together while minimizing bugs is a big challenge in C/C++/CUDA environments. I have therefore decided to follow the more logical path of first developing the code logic in R, and then porting it, function by function, to compiled code.

The tools you mention seem to be exactly what I need to go through this incremental, test-oriented development process with relatively little pain.

I'm not sure whether this is what you had in mind when suggesting those tools, so please let me know if I'm misinterpreting your comments, or if I should be aware of other tools beyond the ones you mentioned.

Many thanks,
Alireza