accelerating matrix multiply


accelerating matrix multiply

Cohn, Robert S
I am using R to multiply some large (30k x 30k double) matrices on a 64-core machine (Xeon Phi). I added some timers to src/main/array.c to see where the time is going. All of the time is spent in the matprod function; most of that is in dgemm, but 15 seconds is spent in matprod in code that checks for NaNs.

> system.time (C <- B %*% A)
nancheck: wall time 15.240282s
   dgemm: wall time 43.111064s
 matprod: wall time 58.351572s
    user   system  elapsed
2710.154   20.999   58.398

The NaN checking code is not being vectorized because of the early exit when NaN is detected:

        /* Don't trust the BLAS to handle NA/NaNs correctly: PR#4582
         * The test is only O(n) here.
         */
        for (R_xlen_t i = 0; i < NRX*ncx; i++)
            if (ISNAN(x[i])) {have_na = TRUE; break;}
        if (!have_na)
            for (R_xlen_t i = 0; i < NRY*ncy; i++)
                if (ISNAN(y[i])) {have_na = TRUE; break;}

I tried deleting the 'break'. By inspecting the generated assembly, I verified that the loop was not vectorized before but now is. Total time goes down:

system.time (C <- B %*% A)
nancheck: wall time  1.898667s
   dgemm: wall time 43.913621s
 matprod: wall time 45.812468s
    user   system  elapsed
2727.877   20.723   45.859

The break accelerates the case where a NaN is present, at the expense of the far more common case where there is none. Moreover, if a NaN is detected, matprod skips dgemm and uses its own matrix multiply instead, which makes the NaN check time insignificant in that case anyway, so I doubt the early exit provides any real benefit.

I was a little surprised that the O(n) NaN check is costly compared to the O(n**2) dgemm that follows. I think the reason is that the NaN check is single-threaded and not vectorized, while my machine can do 2048 floating point ops per cycle once you account for the cores, dual issue, 8-way SIMD, and fused multiply-add, so the constant factor is significant even for large matrices.

Would you consider deleting the breaks? I can submit a patch if that will help. Thanks.

Robert

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: accelerating matrix multiply

Radford Neal
> From: "Cohn, Robert S" <[hidden email]>
>
> I am using R to multiply some large (30k x 30k double) matrices on a
>  64 core machine (xeon phi).  I added some timers to
>  src/main/array.c to see where the time is going. All of the time is
>  being spent in the matprod function, most of that time is spent in
>  dgemm. 15 seconds is in matprod in some code that is checking if
>  there are NaNs.
>
> The NaN checking code is not being vectorized...

This can be a problem with big matrices when lots of cores are used
for the actual multiply, but it is even more of a problem when at least
one of the matrices is small (e.g., a vector-matrix multiply), in which
case the NaN check can dominate, slowing the operation by up to a
factor of about ten.

I pointed this problem out over six years ago, and provided a
patch that greatly speeds up many matrix multiplies (see
http://www.cs.utoronto.ca/~radford/R-mods.html).  But this
improvement has not been incorporated into R Core versions of R.

Since then, a more elaborate solution to the problem of NaN checks has
been incorporated into my pqR version of R (see pqR-project.org).  The
documentation on this approach can be found with help("%*%") if you're
running pqR, or you can just look at the source for this help file in
the pqR source code repository, at

https://github.com/radfordneal/pqR/blob/Release-2016-10-24/src/library/base/man/matmult.Rd

    Radford


Re: accelerating matrix multiply

Martin Maechler
In reply to this post by Cohn, Robert S
>>>>> Cohn, Robert S <[hidden email]>
>>>>>     on Sat, 7 Jan 2017 16:41:42 +0000 writes:

> [quoted message elided]

Thank you, Robert, for bringing the issue up ("again", possibly).
Within R Core, some have seen somewhat similar timings on some
platforms (gcc), but much less dramatic differences, e.g., on
macOS with clang.

As seen in the source code you cite above, the current
implementation was triggered by a nasty BLAS bug, which likewise
showed up only on some platforms, possibly depending on the
runtime libraries in addition to the compilers used.

Do you have R code (including set.seed(.) if relevant) showing
how to generate the large square matrices you mentioned at the
beginning, so that we can get reproducible benchmarks?

With best regards,
Martin Maechler


Re: accelerating matrix multiply

Cohn, Robert S
> Do you have R code (including set.seed(.) if relevant) to show on how to generate
> the large square matrices you've mentioned in the beginning?  So we get to some
> reproducible benchmarks?


Hi Martin,

Here is the program I used. I generate only two random numbers and reuse them to make the benchmark run faster. Let me know if there is anything I can do to help: alternate benchmarks, tests, or experiments with compilers other than icc.

MKL LAPACK behavior is undefined for NaNs, so I left the check in and just made it more efficient on a CPU with SIMD. Thanks for looking at this.

set.seed (1)
m <- 30000
n <- 30000
A <- matrix (runif(2),nrow=m,ncol=n)
B <- matrix (runif(2),nrow=m,ncol=n)
print(typeof(A[1,2]))
print(A[1,2])

# Matrix multiply
system.time (C <- B %*% A)
system.time (C <- B %*% A)
system.time (C <- B %*% A)


Re: accelerating matrix multiply

Tomas Kalibera

Hi Robert,

thanks for the report and your suggestions on how to make the NaN
checks faster.

Based on my experiments, it seems the "break" in the loop can actually
have a positive impact on performance even in the common case where we
don't have NaNs. With gcc on Linux (Core i7), where isnan is inlined,
the "break" version uses a conditional jump while the "nobreak" version
uses a conditional move. The conditional jump is faster because it
benefits from branch prediction. Neither of the two versions is
vectorized (only scalar SSE instructions are used).

How do you run R on the Xeon Phi? Do you offload the NaN checks to the
Phi coprocessor? In my experiments so far, without offloading to the
Phi, icc could vectorize the "nobreak" version, but its performance was
the same as the "break" version.

For my experiments I extracted NaN checks into a function. This was the
"break" version (same performance as the current code):

static __attribute__ ((noinline)) Rboolean hasNA(double *x, R_xlen_t n) {
   for (R_xlen_t i = 0; i < n; i++)
     if (ISNAN(x[i])) return TRUE;
   return FALSE;
}

And this was the "nobreak" version:

static __attribute__ ((noinline)) Rboolean hasNA(double *x, R_xlen_t n) {
   Rboolean has = FALSE;
   for (R_xlen_t i = 0; i < n; i++)
     if (ISNAN(x[i])) has = TRUE;
   return has;
}

Thanks,
Tomas

On 01/11/2017 02:28 PM, Cohn, Robert S wrote:

>> [quoted message elided]

Re: accelerating matrix multiply

Cohn, Robert S
Hi Tomas,

Can you share the full code for your benchmark, compiler options, and performance results so that I can try to reproduce them? There are a lot of variables that can affect the results. Private email is fine if it is too much for the mailing list.

I am measuring on Knights Landing (KNL), which was released in November. KNL is not a coprocessor, so no offload is necessary; R executes directly on the Phi, which looks like an ordinary multi-core machine with 64 cores.

Robert


Re: accelerating matrix multiply

Tomas Kalibera
Hi Robert,

I've run more experiments (and yes, the code is probably too long for
the list). The tradeoffs are platform dependent: the "nobreak" version
is slower than "break" on a Core i7 (i7-3610QM), faster on an Opteron
(6282), and about the same on Xeons (E5-2640, E5-2670, though it has
been seen to be slower there for big vectors).

It may be hard to get a universally better version. Still, a version
that performs fastest on all the platforms I checked, sometimes by a
lot (about 2x faster than the default), is

Rboolean hasNaN_pairsum(double *x, R_xlen_t n)
{
     if ((n&1) != 0 && ISNAN(x[0]))
         return TRUE;
     for (R_xlen_t i = n&1; i < n; i += 2)
         if (ISNAN(x[i]+x[i+1])) /* may also return TRUE for +-Inf */
             return TRUE;
     return FALSE;
}

It may also return TRUE when some elements are Inf, but that is
safe/conservative for this purpose, and the MKL disclaimer actually
suggests we should be checking for Inf anyway.
This version is from pqR (except that pqR would also check the
individual arguments of the sum if the sum is found to be NaN).
Does it perform well on Knights Landing?

Best
Tomas


On 01/16/2017 06:32 PM, Cohn, Robert S wrote:

> [quoted message elided]