

I am writing a wrapper function in C++ that calls a GPU kernel. My array type for the GPU kernel is float, so I would like my wrapper function to receive float arrays from R. I understand that I can use 'as.single' in R to copy a double-precision vector from R in single-precision format while using the '.C' interface, but is there any way to do something similar for '.Call'? Given that the latter passes pointers from R to C/C++, I'm guessing this may be impossible, but I wanted to double-check. If you can suggest a solution, a small code sample would be much appreciated.
The reason I prefer '.Call' to '.C' is that the former passes pointers and therefore incurs less data-transfer overhead than the latter, which copies data. Is this argument controversial?
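[For reference, a minimal sketch of the '.C' route discussed here. The function and argument names are illustrative, not from any real package; with `.C` plus `as.single()` on the R side, R converts the double vector to single precision, so the C side can take a plain `float*` without any R headers in the signature:]

```c
/* Hypothetical .C-style entry point (names are illustrative).
 * Called from R as:
 *   .C("gpu_kernel_wrapper", x = as.single(x), n = as.integer(length(x)))
 * R copies the double vector into single precision, so this side
 * sees a plain float* -- no R API is needed in the signature. */
void gpu_kernel_wrapper(float *x, int *n)
{
    /* stand-in for the GPU kernel launch: double every element */
    for (int i = 0; i < *n; i++)
        x[i] = 2.0f * x[i];
}
```

On the R side, the `$x` component of the `.C` result would hold the modified values, converted back to double on return.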
Alireza


On 18/07/2011 11:52 AM, Alireza Mahani wrote:
> I am writing a wrapper function in C++ that calls a GPU kernel. My array type
> for the GPU kernel is float, so I would like my wrapper function to receive
> float arrays from R. I understand that I can use 'as.single' in R to copy a
> double-precision vector from R in single-precision format while using the
> '.C' interface, but is there any way to do something similar for '.Call'?
> Given that the latter passes pointers from R to C/C++, I'm guessing this may
> be impossible, but I wanted to double-check. If you can suggest a solution,
> a small code sample would be much appreciated.
>
> The reason I prefer '.Call' to '.C' is that the former passes pointers
> and therefore incurs less data-transfer overhead than the latter,
> which copies data. Is this argument controversial?
R has no native type holding singles. It exports a double to a single
vector in .C if you ask it to, but that's not available in .Call.
You'll need to do the copying yourself.
Duncan Murdoch
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Duncan,
Thank you for your reply. This is a rather unfortunate limitation, because for large data sizes there is a significant difference between the performance of '.C' and '.Call'. I will have to do some tests to see what sort of penalty I incur for copying from double to float inside my C++ code (so I can use '.Call'). I won't be able to use memcpy(); rather, I will have to use an explicit loop with casting. Is there a more efficient option?
Alireza


On Jul 18, 2011, at 6:15 PM, Alireza Mahani wrote:
> Duncan,
>
> Thank you for your reply. This is a rather unfortunate limitation, because for large data sizes there is a significant difference between the performance of '.C' and '.Call'.
I think you may have missed the main point -- R does NOT have any objects in "float" (single-precision) representation, because that is insufficient precision, so one way or another you'll have to do something like
SEXP foo(SEXP bar) {
    double *d = REAL(bar);
    int i, n = LENGTH(bar);
    float *f = (float*) R_alloc(n, sizeof(float));
    for (i = 0; i < n; i++) f[i] = (float) d[i];
    /* ... continue with floats, then return a SEXP result ... */
}
There is simply no other way as, again, there are no floats anywhere in R. This has nothing to do with .C/.Call; either way, floats will need to be created from the input vectors.
If you make up your own stuff, you could use raw vectors to store floats, as long as only your own functions operate on them (or external pointers if you want to keep track of the memory yourself).
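[The raw-vector idea above can be sketched outside R like this; the helper names are illustrative, and inside an actual package the byte buffer would be the payload of a RAWSXP of length n * sizeof(float) rather than a malloc'd block:]

```c
#include <stdlib.h>
#include <string.h>

/* Pack doubles into a byte buffer holding floats -- the same layout a
 * raw vector of length n * sizeof(float) would have. Only your own C
 * functions interpret these bytes as floats; R would just see an
 * opaque raw vector. */
unsigned char *floats_to_raw(const double *d, int n)
{
    unsigned char *raw = malloc((size_t) n * sizeof(float));
    for (int i = 0; i < n; i++) {
        float v = (float) d[i];                        /* narrowing copy */
        memcpy(raw + (size_t) i * sizeof(float), &v, sizeof v);
    }
    return raw;  /* caller frees (in R, the GC would own the RAWSXP) */
}

/* Read element i back as a float (memcpy keeps it alias-safe). */
float raw_get(const unsigned char *raw, int i)
{
    float v;
    memcpy(&v, raw + (size_t) i * sizeof(float), sizeof v);
    return v;
}
```

The one-time narrowing copy is unavoidable either way; the raw-vector trick only saves the repeated back-and-forth conversion across multiple calls.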
> I will have to do some tests to see what
> sort of penalty I incur for copying from double to float inside my C++ code
> (so I can use '.Call'). I won't be able to use memcpy(); rather, I will have
> to use an explicit loop with casting. Is there a more efficient option?
>
I'm not aware of any; if you use floats, you incur a penalty regardless (one for the additional storage, another for the conversion).
One of the reasons that GPU processing has not caught much traction in statistical computing is that it is (practically) limited to single-precision computations (and we even need quad precision occasionally). Although some GPUs support double precision, they are still not fast enough to be a real threat to CPUs. (I'm talking about generic usability -- for very specialized tasks GPUs can deliver big speedups, but those require low-level exploitation of the architecture.)
Cheers,
Simon


Simon,
Thank you for elaborating on the limitations of R in handling float types. I think I'm pretty much there with you.
As for the insufficiency of single-precision math (and hence limitations of GPU), my personal take so far has been that double precision becomes crucial when some sort of error accumulation occurs. For example, in differential equations where boundary values are integrated to arrive at interior values, etc. On the other hand, in my personal line of work (Hierarchical Bayesian models for quantitative marketing), we have so much inherent uncertainty and noise at so many levels in the problem (and no significant error accumulation sources) that the single vs double precision issue is often inconsequential for us. So I think it really depends on the field as well as the nature of the problem.
Regards,
Alireza


On Mon, 18 Jul 2011, Alireza Mahani wrote:
> Simon,
>
> Thank you for elaborating on the limitations of R in handling float types. I
> think I'm pretty much there with you.
>
> As for the insufficiency of single-precision math (and hence limitations of
> GPU), my personal take so far has been that double precision becomes crucial
> when some sort of error accumulation occurs. For example, in differential
> equations where boundary values are integrated to arrive at interior values,
> etc. On the other hand, in my personal line of work (Hierarchical Bayesian
> models for quantitative marketing), we have so much inherent uncertainty and
> noise at so many levels in the problem (and no significant error
> accumulation sources) that single vs double precision issue is often
> inconsequential for us. So I think it really depends on the field as well as
> the nature of the problem.
The main reason to use only double precision in R was that on modern
CPUs double-precision calculations are as fast as single-precision
ones, and with 64-bit CPUs they are a single access. So the extra
precision comes more or less for free. You also underestimate the
extent to which the stability of commonly used algorithms relies on double
precision. (There are stable single-precision versions, but they are
no longer commonly used. And as Simon said, in some cases stability
is ensured by using extra precision where available.)
I disagree slightly with Simon on GPUs: I am told by local experts
that the double-precision on the latest GPUs (those from the last year
or so) is perfectly usable. See the performance claims on
http://en.wikipedia.org/wiki/Nvidia_Tesla of about 50% of the SP
performance in DP.

Brian D. Ripley, [hidden email]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595


"Prof Brian Ripley" < [hidden email]> wrote in message
news: [hidden email]...
> On Mon, 18 Jul 2011, Alireza Mahani wrote:
>
>> Simon,
>>
>> Thank you for elaborating on the limitations of R in handling float
>> types. I
>> think I'm pretty much there with you.
>>
>> As for the insufficiency of single-precision math (and hence limitations
>> of
>> GPU), my personal take so far has been that double precision becomes
>> crucial
>> when some sort of error accumulation occurs. For example, in differential
>> equations where boundary values are integrated to arrive at interior
>> values,
>> etc. On the other hand, in my personal line of work (Hierarchical
>> Bayesian
>> models for quantitative marketing), we have so much inherent uncertainty
>> and
>> noise at so many levels in the problem (and no significant error
>> accumulation sources) that single vs double precision issue is often
>> inconsequential for us. So I think it really depends on the field as well
>> as
>> the nature of the problem.
>
> The main reason to use only double precision in R was that on modern CPUs
> double-precision calculations are as fast as single-precision ones, and
> with 64-bit CPUs they are a single access.
> So the extra precision comes more or less for free.
But isn't it much more of the 'less free' when large data sets are
considered? If a double matrix takes 3GB, it's 1.5GB in single.
That might alleviate the dreaded out-of-memory error for some
users in some circumstances. On 64-bit, 50GB reduces to 25GB,
and that might make the difference between getting
something done, or not. If single were appropriate, of course.
For GPU too, I/O often dominates, if I understand correctly.
For space reasons, is there any possibility of R supporting single
precision (and single-bit logicals, to reduce memory for logicals by
a factor of 32)? I guess there might be complaints from users using
single inappropriately (or worse, not realising we have an unstable
result due to single).
Matthew


On Jul 19, 2011, at 7:48 AM, Matthew Dowle wrote:
>
> "Prof Brian Ripley" < [hidden email]> wrote in message
> news: [hidden email]...
>> On Mon, 18 Jul 2011, Alireza Mahani wrote:
>>
>>> Simon,
>>>
>>> Thank you for elaborating on the limitations of R in handling float
>>> types. I
>>> think I'm pretty much there with you.
>>>
>>> As for the insufficiency of single-precision math (and hence limitations
>>> of
>>> GPU), my personal take so far has been that double precision becomes
>>> crucial
>>> when some sort of error accumulation occurs. For example, in differential
>>> equations where boundary values are integrated to arrive at interior
>>> values,
>>> etc. On the other hand, in my personal line of work (Hierarchical
>>> Bayesian
>>> models for quantitative marketing), we have so much inherent uncertainty
>>> and
>>> noise at so many levels in the problem (and no significant error
>>> accumulation sources) that single vs double precision issue is often
>>> inconsequential for us. So I think it really depends on the field as well
>>> as
>>> the nature of the problem.
>>
>> The main reason to use only double precision in R was that on modern CPUs
>> double-precision calculations are as fast as single-precision ones, and
>> with 64-bit CPUs they are a single access.
>> So the extra precision comes more or less for free.
>
> But isn't it much more of the 'less free' when large data sets are considered? If a double matrix takes 3GB, it's 1.5GB in single.
> That might alleviate the dreaded out-of-memory error for some users in some circumstances. On 64-bit, 50GB reduces to 25GB
I'd like to see your 50GB matrix in R ;) - you can't have a float matrix bigger than 8GB, and for doubles it is 16GB, so you don't gain anything in scalability. IMHO memory is not a strong case these days when hundreds of GB of RAM are affordable...
Also, you would not complain about pointers going from 4 to 8 bytes on 64-bit, thus doubling your memory use for string vectors...
Cheers,
Simon
> and that might make the difference between getting
> something done, or not. If single were appropriate, of course.
> For GPU too, I/O often dominates, if I understand correctly.
>
> For space reasons, is there any possibility of R supporting single
> precision (and single-bit logicals, to reduce memory for logicals by
> a factor of 32)? I guess there might be complaints from users using
> single inappropriately (or worse, not realising we have an unstable
> result due to single).
>
> Matthew
>
>> You also underestimate the extent to which stability of commonly used
>> algorithms relies on double precision. (There are stable single-precision
>> versions, but they are no longer commonly used. And as Simon said, in
>> some cases stability is ensured by using extra precision where available.)
>>
>> I disagree slightly with Simon on GPUs: I am told by local experts that
>> the double-precision on the latest GPUs (those from the last year or so)
>> is perfectly usable. See the performance claims on
>> http://en.wikipedia.org/wiki/Nvidia_Tesla of about 50% of the SP
>> performance in DP.
>>
>>>
>>> Regards,
>>> Alireza
>>>
>>>


On 11-07-19 7:48 AM, Matthew Dowle wrote:
>
> "Prof Brian Ripley"< [hidden email]> wrote in message
> news: [hidden email]...
>> On Mon, 18 Jul 2011, Alireza Mahani wrote:
>>
>>> Simon,
>>>
>>> Thank you for elaborating on the limitations of R in handling float
>>> types. I
>>> think I'm pretty much there with you.
>>>
>>> As for the insufficiency of single-precision math (and hence limitations
>>> of
>>> GPU), my personal take so far has been that double precision becomes
>>> crucial
>>> when some sort of error accumulation occurs. For example, in differential
>>> equations where boundary values are integrated to arrive at interior
>>> values,
>>> etc. On the other hand, in my personal line of work (Hierarchical
>>> Bayesian
>>> models for quantitative marketing), we have so much inherent uncertainty
>>> and
>>> noise at so many levels in the problem (and no significant error
>>> accumulation sources) that single vs double precision issue is often
>>> inconsequential for us. So I think it really depends on the field as well
>>> as
>>> the nature of the problem.
>>
>> The main reason to use only double precision in R was that on modern CPUs
>> double-precision calculations are as fast as single-precision ones, and
>> with 64-bit CPUs they are a single access.
>> So the extra precision comes more or less for free.
>
> But isn't it much more of the 'less free' when large data sets are
> considered? If a double matrix takes 3GB, it's 1.5GB in single.
> That might alleviate the dreaded out-of-memory error for some
> users in some circumstances. On 64-bit, 50GB reduces to 25GB,
> and that might make the difference between getting
> something done, or not. If single were appropriate, of course.
> For GPU too, I/O often dominates, if I understand correctly.
>
> For space reasons, is there any possibility of R supporting single
> precision (and single-bit logicals, to reduce memory for logicals by
> a factor of 32)? I guess there might be complaints from users using
> single inappropriately (or worse, not realising we have an unstable
> result due to single).
You can do any of this using external pointers now. That will remind
you that every single function that operates on such objects needs to be
rewritten.
It's a huge amount of work, benefiting very few people. I don't think
anyone in R Core will do it.
Duncan Murdoch


"Duncan Murdoch" < [hidden email]> wrote in message
news: [hidden email]...
> On 11-07-19 7:48 AM, Matthew Dowle wrote:
>>
>> "Prof Brian Ripley"< [hidden email]> wrote in message
>> news: [hidden email]...
>>> On Mon, 18 Jul 2011, Alireza Mahani wrote:
>>>
>>>> Simon,
>>>>
>>>> Thank you for elaborating on the limitations of R in handling float
>>>> types. I
>>>> think I'm pretty much there with you.
>>>>
>>>> As for the insufficiency of single-precision math (and hence
>>>> limitations
>>>> of
>>>> GPU), my personal take so far has been that double precision becomes
>>>> crucial
>>>> when some sort of error accumulation occurs. For example, in
>>>> differential
>>>> equations where boundary values are integrated to arrive at interior
>>>> values,
>>>> etc. On the other hand, in my personal line of work (Hierarchical
>>>> Bayesian
>>>> models for quantitative marketing), we have so much inherent
>>>> uncertainty
>>>> and
>>>> noise at so many levels in the problem (and no significant error
>>>> accumulation sources) that single vs double precision issue is often
>>>> inconsequential for us. So I think it really depends on the field as
>>>> well
>>>> as
>>>> the nature of the problem.
>>>
>>> The main reason to use only double precision in R was that on modern
>>> CPUs
>>> double-precision calculations are as fast as single-precision ones, and
>>> with 64-bit CPUs they are a single access.
>>> So the extra precision comes more or less for free.
>>
>> But isn't it much more of the 'less free' when large data sets are
>> considered? If a double matrix takes 3GB, it's 1.5GB in single.
>> That might alleviate the dreaded out-of-memory error for some
>> users in some circumstances. On 64-bit, 50GB reduces to 25GB,
>> and that might make the difference between getting
>> something done, or not. If single were appropriate, of course.
>> For GPU too, I/O often dominates, if I understand correctly.
>>
>> For space reasons, is there any possibility of R supporting single
>> precision (and single-bit logicals, to reduce memory for logicals by
>> a factor of 32)? I guess there might be complaints from users using
>> single inappropriately (or worse, not realising we have an unstable
>> result due to single).
>
> You can do any of this using external pointers now. That will remind you
> that every single function that operates on such objects needs to be
> rewritten.
>
> It's a huge amount of work, benefiting very few people. I don't think
> anyone in R Core will do it.
Ok, thanks for the responses.
Matthew


On Jul 19, 2011, at 2:26 AM, Prof Brian Ripley wrote:
> On Mon, 18 Jul 2011, Alireza Mahani wrote:
>
>> Simon,
>>
>> Thank you for elaborating on the limitations of R in handling float types. I
>> think I'm pretty much there with you.
>>
>> As for the insufficiency of single-precision math (and hence limitations of
>> GPU), my personal take so far has been that double precision becomes crucial
>> when some sort of error accumulation occurs. For example, in differential
>> equations where boundary values are integrated to arrive at interior values,
>> etc. On the other hand, in my personal line of work (Hierarchical Bayesian
>> models for quantitative marketing), we have so much inherent uncertainty and
>> noise at so many levels in the problem (and no significant error
>> accumulation sources) that single vs double precision issue is often
>> inconsequential for us. So I think it really depends on the field as well as
>> the nature of the problem.
>
> The main reason to use only double precision in R was that on modern CPUs double-precision calculations are as fast as single-precision ones, and with 64-bit CPUs they are a single access. So the extra precision comes more or less for free. You also underestimate the extent to which stability of commonly used algorithms relies on double precision. (There are stable single-precision versions, but they are no longer commonly used. And as Simon said, in some cases stability is ensured by using extra precision where available.)
>
> I disagree slightly with Simon on GPUs: I am told by local experts that the double-precision on the latest GPUs (those from the last year or so) is perfectly usable. See the performance claims on http://en.wikipedia.org/wiki/Nvidia_Tesla of about 50% of the SP performance in DP.
>
That would be good news. Unfortunately those seem to be still targeted at a specialized market and are not really graphics cards in the traditional sense. Although this is sort of required for the purpose, it removes the benefit of ubiquity. So, yes, I agree with you that it may be an interesting way forward, but I fear it's too much of a niche to be widely supported. I may want to ask our GPU specialists here to see if they have any around so I could revisit our OpenCL R benchmarks. Last time we abandoned our OpenCL R plans exactly due to the lack of speed in double precision.
Thanks,
Simon


"Duncan Murdoch" < [hidden email]> wrote in message
news: [hidden email]...
> On 11-07-19 7:48 AM, Matthew Dowle wrote:
>>
>> "Prof Brian Ripley"< [hidden email]> wrote in message
>> news: [hidden email]...
>>> On Mon, 18 Jul 2011, Alireza Mahani wrote:
>>>
>>>> Simon,
>>>>
>>>> Thank you for elaborating on the limitations of R in handling float
>>>> types. I
>>>> think I'm pretty much there with you.
>>>>
>>>> As for the insufficiency of single-precision math (and hence
>>>> limitations
>>>> of
>>>> GPU), my personal take so far has been that double precision becomes
>>>> crucial
>>>> when some sort of error accumulation occurs. For example, in
>>>> differential
>>>> equations where boundary values are integrated to arrive at interior
>>>> values,
>>>> etc. On the other hand, in my personal line of work (Hierarchical
>>>> Bayesian
>>>> models for quantitative marketing), we have so much inherent
>>>> uncertainty
>>>> and
>>>> noise at so many levels in the problem (and no significant error
>>>> accumulation sources) that single vs double precision issue is often
>>>> inconsequential for us. So I think it really depends on the field as
>>>> well
>>>> as
>>>> the nature of the problem.
>>>
>>> The main reason to use only double precision in R was that on modern
>>> CPUs
>>> double-precision calculations are as fast as single-precision ones, and
>>> with 64-bit CPUs they are a single access.
>>> So the extra precision comes more or less for free.
>>
>> But isn't it much more of the 'less free' when large data sets are
>> considered? If a double matrix takes 3GB, it's 1.5GB in single.
>> That might alleviate the dreaded out-of-memory error for some
>> users in some circumstances. On 64-bit, 50GB reduces to 25GB,
>> and that might make the difference between getting
>> something done, or not. If single were appropriate, of course.
>> For GPU too, I/O often dominates, if I understand correctly.
>>
>> For space reasons, is there any possibility of R supporting single
>> precision (and single-bit logicals, to reduce memory for logicals by
>> a factor of 32)? I guess there might be complaints from users using
>> single inappropriately (or worse, not realising we have an unstable
>> result due to single).
>
> You can do any of this using external pointers now. That will remind you
> that every single function that operates on such objects needs to be
> rewritten.
>
> It's a huge amount of work, benefiting very few people. I don't think
> anyone in R Core will do it.
>
> Duncan Murdoch
I've been informed off-list about the 'bit' package, which seems
great and answers my parenthetic complaint (at least).
http://cran.r-project.org/web/packages/bit/index.html
Matthew


On Jul 19, 2011, at 12:56 PM, Simon Urbanek wrote:
>
> On Jul 19, 2011, at 2:26 AM, Prof Brian Ripley wrote:
>
>> On Mon, 18 Jul 2011, Alireza Mahani wrote:
>>
>>> Simon,
>>>
>>> Thank you for elaborating on the limitations of R in handling float types. I
>>> think I'm pretty much there with you.
>>>
>>> As for the insufficiency of single-precision math (and hence limitations of
>>> GPU), my personal take so far has been that double precision becomes crucial
>>> when some sort of error accumulation occurs. For example, in differential
>>> equations where boundary values are integrated to arrive at interior values,
>>> etc. On the other hand, in my personal line of work (Hierarchical Bayesian
>>> models for quantitative marketing), we have so much inherent uncertainty and
>>> noise at so many levels in the problem (and no significant error
>>> accumulation sources) that single vs double precision issue is often
>>> inconsequential for us. So I think it really depends on the field as well as
>>> the nature of the problem.
>>
>> The main reason to use only double precision in R was that on modern CPUs double-precision calculations are as fast as single-precision ones, and with 64-bit CPUs they are a single access. So the extra precision comes more or less for free. You also underestimate the extent to which stability of commonly used algorithms relies on double precision. (There are stable single-precision versions, but they are no longer commonly used. And as Simon said, in some cases stability is ensured by using extra precision where available.)
>>
>> I disagree slightly with Simon on GPUs: I am told by local experts that the double-precision on the latest GPUs (those from the last year or so) is perfectly usable. See the performance claims on http://en.wikipedia.org/wiki/Nvidia_Tesla of about 50% of the SP performance in DP.
>>
>
> That would be good news. Unfortunately those seem to be still targeted at a specialized market and are not really graphics cards in the traditional sense. Although this is sort of required for the purpose, it removes the benefit of ubiquity. So, yes, I agree with you that it may be an interesting way forward, but I fear it's too much of a niche to be widely supported. I may want to ask our GPU specialists here to see if they have any around so I could revisit our OpenCL R benchmarks. Last time we abandoned our OpenCL R plans exactly due to the lack of speed in double precision.
>
A quick update -- it turns out we have a few Tesla/Fermi machines here, so I ran some very quick benchmarks on them. The test case was the same as for the original OpenCL comparisons posted here a while ago when Apple introduced it: dnorm on long vectors:
64M, single:
- GPU: total 4894.1 ms, compute 234.5 ms, compile 4565.7 ms, real 328.3 ms
- CPU: total 2290.8 ms
64M, double:
- GPU: total 5448.4 ms, compute 634.1 ms, compile 4636.4 ms, real 812.0 ms
- CPU: total 2415.8 ms
128M, single:
- GPU: total 5843.7 ms, compute 469.2 ms, compile 5040.5 ms, real 803.1 ms
- CPU: total 4568.9 ms
128M, double:
- GPU: total 6042.8 ms, compute 1093.9 ms, compile 4583.3 ms, real 1459.5 ms
- CPU: total 4946.8 ms
The CPU times are based on a dual Xeon X5690 machine (12 cores @ 3.47GHz) using OpenMP, but are very approximate, because there were two other jobs running on the machine -- still, they should be a good ballpark figure. The GPU times are from a Tesla S2050 using OpenCL, addressed as one device, so presumably comparable to the performance of one Tesla M2050.
The figures to compare are GPU.real (which is computation + host memory I/O) and CPU.total, because we can assume that we can compile the kernel in advance, but you can't save on the memory transfer (unless you find a good way to chain calls, which is not realistic in R).
So the good news is that the new GPUs fulfill their promise: double precision is only about twice as slow as single precision. They also scale approximately linearly -- the real time for 64M doubles is almost the same as for 128M singles. And they outperform the CPUs, although not by an order of magnitude.
The double-precision support is very good news, and even though we are still using GPUs in a suboptimal manner, they are faster than the CPUs. The only practical drawback is that using OpenCL requires serious work; it's not as easy as slapping OMP pragmas on existing code. Also, the HPC Teslas are quite expensive, so I don't expect to see them in desktops anytime soon. However, for people who are thinking about big computation, it may be an interesting way to go. Given that it's not mainstream, I don't expect core R to have OpenCL support just yet, but it may be worth keeping in mind for the future as we design the parallelization framework in R.
Cheers,
Simon


On 08/05/2011 08:36 PM, Simon Urbanek wrote:
>
> On Jul 19, 2011, at 12:56 PM, Simon Urbanek wrote:
>
>>
>> On Jul 19, 2011, at 2:26 AM, Prof Brian Ripley wrote:
>>
>>> On Mon, 18 Jul 2011, Alireza Mahani wrote:
>>>
>>>> Simon,
>>>>
>>>> Thank you for elaborating on the limitations of R in handling float types. I
>>>> think I'm pretty much there with you.
>>>>
>>>> As for the insufficiency of single-precision math (and hence limitations of
>>>> GPU), my personal take so far has been that double precision becomes crucial
>>>> when some sort of error accumulation occurs. For example, in differential
>>>> equations where boundary values are integrated to arrive at interior values,
>>>> etc. On the other hand, in my personal line of work (Hierarchical Bayesian
>>>> models for quantitative marketing), we have so much inherent uncertainty and
>>>> noise at so many levels in the problem (and no significant error
>>>> accumulation sources) that single vs double precision issue is often
>>>> inconsequential for us. So I think it really depends on the field as well as
>>>> the nature of the problem.
>>>
>>> The main reason to use only double precision in R was that on modern CPUs double-precision calculations are as fast as single-precision ones, and with 64-bit CPUs they are a single access. So the extra precision comes more or less for free. You also underestimate the extent to which stability of commonly used algorithms relies on double precision. (There are stable single-precision versions, but they are no longer commonly used. And as Simon said, in some cases stability is ensured by using extra precision where available.)
>>>
>>> I disagree slightly with Simon on GPUs: I am told by local experts that the double-precision on the latest GPUs (those from the last year or so) is perfectly usable. See the performance claims on http://en.wikipedia.org/wiki/Nvidia_Tesla of about 50% of the SP performance in DP.
>>>
>>
>> That would be good news. Unfortunately those seem to be still targeted at a specialized market and are not really graphics cards in the traditional sense. Although this is sort of required for the purpose, it removes the benefit of ubiquity. So, yes, I agree with you that it may be an interesting way forward, but I fear it's too much of a niche to be widely supported. I may want to ask our GPU specialists here to see if they have any around so I could revisit our OpenCL R benchmarks. Last time we abandoned our OpenCL R plans exactly due to the lack of speed in double precision.
>>
>
> A quick update -- it turns out we have a few Tesla/Fermi machines here, so I ran some very quick benchmarks on them. The test case was the same as for the original OpenCL comparisons posted here a while ago when Apple introduced it: dnorm on long vectors:
>
> 64M, single:
> - GPU: total 4894.1 ms, compute 234.5 ms, compile 4565.7 ms, real 328.3 ms
> - CPU: total 2290.8 ms
>
> 64M, double:
> - GPU: total 5448.4 ms, compute 634.1 ms, compile 4636.4 ms, real 812.0 ms
> - CPU: total 2415.8 ms
>
> 128M, single:
> - GPU: total 5843.7 ms, compute 469.2 ms, compile 5040.5 ms, real 803.1 ms
> - CPU: total 4568.9 ms
>
> 128M, double:
> - GPU: total 6042.8 ms, compute 1093.9 ms, compile 4583.3 ms, real 1459.5 ms
> - CPU: total 4946.8 ms
>
> The CPU times are based on a dual Xeon X5690 machine (12 cores @ 3.47GHz) using OpenMP, but are very approximate, because there were two other jobs running on the machine -- still, they should be a good ballpark figure. The GPU times are from a Tesla S2050 using OpenCL, addressed as one device, so presumably comparable to the performance of one Tesla M2050.
> The figures to compare are GPU.real (which is computation + host memory I/O) and CPU.total, because we can assume that we can compile the kernel in advance, but you can't save on the memory transfer (unless you find a good way to chain calls which is not realistic in R).
>
> So the good news is that the new GPUs fulfill their promise: double precision is only twice as slow as single precision. They also scale approximately linearly; the real time for 64M double is almost the same as for 128M single. And they outperform the CPUs, although not by an order of magnitude.
>
> The double-precision support is very good news, and even though we are still using GPUs in a suboptimal manner, they are faster than the CPUs. The only practical drawback is that using OpenCL requires serious work; it's not as easy as slapping OpenMP pragmas on existing code. Also, the HPC Teslas are quite expensive, so I don't expect to see them in desktops anytime soon. However, for people thinking about big computation, it may be an interesting way to go. Given that it's not mainstream, I don't expect core R to have OCL support just yet, but it may be worth keeping in mind for the future as we design the parallelization framework in R.
+1. Chip vendors nowadays also offer CPU runtimes for executing OpenCL
code on common x86 multi-core CPUs (e.g. the Opteron series or the
Core i7 family), so it may become more ubiquitous soon.
Best,
Tobias
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


I have created a small package called OpenCL which allows the use of OpenCL kernels in R. It supports both single and double precision and an arbitrary number of input arguments. The kernel in the ?oclRun example is very close to what I used for the testing below (obviously you won't be able to run fair single-precision tests, because R needs to convert both input and output vectors to/from double precision).
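[Editor's note: the kernel in the ?oclRun example is essentially a single-precision dnorm. A rough sketch of what such a kernel looks like follows; the argument layout and names here are a reconstruction, not taken from the package, so consult ?oclRun for the exact calling convention it expects.]

```c
/* Sketch of a single-precision dnorm kernel in OpenCL C. The argument
 * layout (output buffer first, then the element count and inputs) is an
 * assumption; check ?oclRun for the convention the package expects. */
__kernel void dnorm(__global float *output,
                    const unsigned int count,
                    __global const float *input,
                    const float mu, const float sigma)
{
    unsigned int i = get_global_id(0);
    if (i < count) {
        float z = (input[i] - mu) / sigma;
        output[i] = exp(-0.5f * z * z) / (sigma * sqrt(2.0f * 3.14159265f));
    }
}
```

This is kernel source only; it would be passed to the package as a string and compiled for the device at run time, which is what the "compile" column in the benchmarks above measures.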
Its home is at http://rforge.net/OpenCL and, CRAN deo volente, it may appear on CRAN soon.
Cheers,
Simon
On Aug 5, 2011, at 2:36 PM, Simon Urbanek wrote:
>
> On Jul 19, 2011, at 12:56 PM, Simon Urbanek wrote:
>
>>
>> On Jul 19, 2011, at 2:26 AM, Prof Brian Ripley wrote:
>>
>>> On Mon, 18 Jul 2011, Alireza Mahani wrote:
>>>
>>>> Simon,
>>>>
>>>> Thank you for elaborating on the limitations of R in handling float types. I
>>>> think I'm pretty much there with you.
>>>>
>>>> As for the insufficiency of single-precision math (and hence the
>>>> limitations of GPUs), my personal take so far has been that double
>>>> precision becomes crucial when some sort of error accumulation occurs,
>>>> for example in differential equations where boundary values are
>>>> integrated to arrive at interior values. On the other hand, in my
>>>> personal line of work (hierarchical Bayesian models for quantitative
>>>> marketing), we have so much inherent uncertainty and noise at so many
>>>> levels of the problem (and no significant sources of error
>>>> accumulation) that the single- vs. double-precision issue is often
>>>> inconsequential for us. So I think it really depends on the field as
>>>> well as the nature of the problem.
> [...]
>
> Cheers,
> Simon
>
>

