Peculiar timing result

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Peculiar timing result

Douglas Bates-2
I have been timing a particular model fit using lmer on several
different computers and came up with a peculiar result - the model fit
is considerably slower on a dual-core Athlon 64 using Goto's
multithreaded BLAS than on a single-core processor.

Here is the timing on a single-core Athlon 64 3000+ running under
today's R-devel with version 0.995-5 of the Matrix package.

> library(Matrix)
> data(star, package = 'mlmRev')
> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star, control = list(nit=0,grad=0,msV=1)))
  0      241720.:  1.16440 0.335239  0.00000  1.78732 0.867209 0.382318  0.00000
  1      239722.:  1.94952 5.00000e-10 0.00933767  1.65999 0.858003
0.341520 0.00908757
  2      239580.:  1.95924 0.0884059 0.00933767  1.65308 0.857487
0.339296 0.00954718
  3      239215.:  2.60877 0.0765848 0.0177699  1.45739 0.843562
0.275100 0.0236849
  4      239204.:  2.62582 0.106670 0.0239698  1.41976 0.841086
0.261033 0.0267073
  5      239176.:  2.63149 0.0787924 0.0367185  1.37952 0.838527
0.245076 0.0301134
  6      239141.:  2.64949 0.108534 0.0594586  1.28846 0.832543
0.208404 0.0375456
  7      239049.:  2.64794 0.0789214 0.121782  1.10436 0.819711
0.126101 0.0524965
  8      239004.:  2.66044 0.117957 0.181505 0.932202 0.798982
0.0718116 0.0628958
  9      238944.:  2.66310 0.0819653 0.334477 0.631735 0.740855
0.258359 0.0806590
 10      238893.:  2.72626 0.0975205 0.653432 0.703912 0.666067
0.109922 0.201809
 11      238892.:  2.74381 0.111146 0.666155 0.693544 0.662000 0.104060 0.207591
 12      238888.:  2.75052 0.0990238 0.689199 0.694588 0.655781
0.106516 0.216460
 13      238861.:  2.80325 0.126935  1.05912 0.733914 0.556162 0.159296 0.360938
 14      238832.:  2.82656 0.117617  1.59471 0.607916 0.441371
0.0749944 0.976142
 15      238811.:  2.87350 0.136332  1.59046 0.653141 0.353763 0.226061  1.79285
 16      238810.:  2.87663 0.125135  1.58992 0.656808 0.352605 0.220488  1.79282
 17      238806.:  2.89342 0.141551  1.58607 0.676523 0.344212 0.181833  1.79268
 18      238804.:  2.90080 0.125137  1.56624 0.682921 0.261295 0.180598  1.74325
 19      238802.:  2.91950 0.128569  1.56836 0.680436 0.336051 0.159940  1.80400
 20      238801.:  2.92795 0.134762  1.56597 0.685121 0.331695 0.145547  1.80414
 21      238801.:  2.93741 0.125667  1.56139 0.687827 0.332700 0.138854  1.81495
 22      238800.:  2.94588 0.131757  1.55294 0.687909 0.330414 0.137834  1.82826
 23      238799.:  2.96867 0.129716  1.52943 0.688678 0.323171 0.139912  1.84615
 24      238799.:  2.98994 0.133378  1.52188 0.700038 0.337387 0.131403  1.77623
 25      238799.:  3.00312 0.135308  1.51475 0.697550 0.311750 0.145683  1.78037
 26      238799.:  3.00461 0.129920  1.51083 0.697666 0.306722 0.138745  1.81188
 27      238799.:  3.00504 0.134882  1.50539 0.696745 0.302949 0.135897  1.84405
 28      238799.:  3.00422 0.134013  1.47947 0.698115 0.303243 0.133806  1.86486
 29      238799.:  3.00819 0.134378  1.48185 0.701871 0.307097 0.134637  1.84996
 30      238799.:  3.01313 0.134279  1.49098 0.702883 0.304788 0.133682  1.86254
 31      238799.:  3.01291 0.134253  1.49060 0.701818 0.303155 0.133771  1.84613
 32      238799.:  3.01142 0.134314  1.48921 0.701782 0.303589 0.134439  1.84653
 33      238799.:  3.01174 0.134315  1.48926 0.701641 0.304120 0.134145  1.84635
 34      238799.:  3.01175 0.134304  1.48942 0.701740 0.303762 0.134185  1.84649
 35      238799.:  3.01173 0.134307  1.48937 0.701724 0.303809 0.134206  1.84647
[1] 43.10  3.78 48.41  0.00  0.00


(If you run the timing yourself and don't want to see the iteration
output, take the msV=1 out of the control list.  I keep it in there so
I can monitor the progress.)

If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
the same version of R, BLAS and Matrix package, the timing ends up
with something like

90 140 235 0 0

I do see that the multithreading is working on a calculation that is
essentially BLAS-bound such as

> mm <- Matrix(rnorm(1e6), nc = 1e3)
> system.time(crossprod(mm))
[1] 0.57 0.02 0.60 0.00 0.00

On the X2 processor it still takes about 0.6 seconds user time but
only 0.3 seconds elapsed time when the machine is otherwise idle and
both cores are available for the calculation.

Any suggestions why the dual-core processor is so much slower than the
single core processor?

By the way, I would be interested in accumulating timings of this
model fit on other systems.  If you do time it please send me
(off-list) a summary of the version of R, version of the accelerated
BLAS if you use them, processor speed and configuration (i.e.
multiprocessor, multicore, etc.) and, if you know it, memory speed.

This is an example of a complex multilevel model with crossed grouping
factors fit to a relatively large (30000 observations on 10000
students, 1400 teachers and 80 schools) data set.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Peculiar timing result

Paul Gilbert
Doug

This is probably not your reason, but I am finding my dual core Athlon
64 is much slower running 64 bit Linux and R than it was running 32 bit
Linux and R. All the programs are bigger.  (Some, like the clock applet,
are a lot bigger for no obvious reason.)  The difference is enough to
put my meager 1GB machine into swapping much more, with the result that
it is a lot slower.

Paul

Douglas Bates wrote:

>I have been timing a particular model fit using lmer on several
>different computers and came up with a peculiar result - the model fit
>is considerably slower on a dual-core Athlon 64 using Goto's
>multithreaded BLAS than on a single-core processor.
>
>Here is the timing on a single-core Athlon 64 3000+ running under
>today's R-devel with version 0.995-5 of the Matrix package.
>
>  
>
>>library(Matrix)
>>data(star, package = 'mlmRev')
>>system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star, control = list(nit=0,grad=0,msV=1)))
>>    
>>
>  0      241720.:  1.16440 0.335239  0.00000  1.78732 0.867209 0.382318  0.00000
>  1      239722.:  1.94952 5.00000e-10 0.00933767  1.65999 0.858003
>0.341520 0.00908757
>  2      239580.:  1.95924 0.0884059 0.00933767  1.65308 0.857487
>0.339296 0.00954718
>  3      239215.:  2.60877 0.0765848 0.0177699  1.45739 0.843562
>0.275100 0.0236849
>  4      239204.:  2.62582 0.106670 0.0239698  1.41976 0.841086
>0.261033 0.0267073
>  5      239176.:  2.63149 0.0787924 0.0367185  1.37952 0.838527
>0.245076 0.0301134
>  6      239141.:  2.64949 0.108534 0.0594586  1.28846 0.832543
>0.208404 0.0375456
>  7      239049.:  2.64794 0.0789214 0.121782  1.10436 0.819711
>0.126101 0.0524965
>  8      239004.:  2.66044 0.117957 0.181505 0.932202 0.798982
>0.0718116 0.0628958
>  9      238944.:  2.66310 0.0819653 0.334477 0.631735 0.740855
>0.258359 0.0806590
> 10      238893.:  2.72626 0.0975205 0.653432 0.703912 0.666067
>0.109922 0.201809
> 11      238892.:  2.74381 0.111146 0.666155 0.693544 0.662000 0.104060 0.207591
> 12      238888.:  2.75052 0.0990238 0.689199 0.694588 0.655781
>0.106516 0.216460
> 13      238861.:  2.80325 0.126935  1.05912 0.733914 0.556162 0.159296 0.360938
> 14      238832.:  2.82656 0.117617  1.59471 0.607916 0.441371
>0.0749944 0.976142
> 15      238811.:  2.87350 0.136332  1.59046 0.653141 0.353763 0.226061  1.79285
> 16      238810.:  2.87663 0.125135  1.58992 0.656808 0.352605 0.220488  1.79282
> 17      238806.:  2.89342 0.141551  1.58607 0.676523 0.344212 0.181833  1.79268
> 18      238804.:  2.90080 0.125137  1.56624 0.682921 0.261295 0.180598  1.74325
> 19      238802.:  2.91950 0.128569  1.56836 0.680436 0.336051 0.159940  1.80400
> 20      238801.:  2.92795 0.134762  1.56597 0.685121 0.331695 0.145547  1.80414
> 21      238801.:  2.93741 0.125667  1.56139 0.687827 0.332700 0.138854  1.81495
> 22      238800.:  2.94588 0.131757  1.55294 0.687909 0.330414 0.137834  1.82826
> 23      238799.:  2.96867 0.129716  1.52943 0.688678 0.323171 0.139912  1.84615
> 24      238799.:  2.98994 0.133378  1.52188 0.700038 0.337387 0.131403  1.77623
> 25      238799.:  3.00312 0.135308  1.51475 0.697550 0.311750 0.145683  1.78037
> 26      238799.:  3.00461 0.129920  1.51083 0.697666 0.306722 0.138745  1.81188
> 27      238799.:  3.00504 0.134882  1.50539 0.696745 0.302949 0.135897  1.84405
> 28      238799.:  3.00422 0.134013  1.47947 0.698115 0.303243 0.133806  1.86486
> 29      238799.:  3.00819 0.134378  1.48185 0.701871 0.307097 0.134637  1.84996
> 30      238799.:  3.01313 0.134279  1.49098 0.702883 0.304788 0.133682  1.86254
> 31      238799.:  3.01291 0.134253  1.49060 0.701818 0.303155 0.133771  1.84613
> 32      238799.:  3.01142 0.134314  1.48921 0.701782 0.303589 0.134439  1.84653
> 33      238799.:  3.01174 0.134315  1.48926 0.701641 0.304120 0.134145  1.84635
> 34      238799.:  3.01175 0.134304  1.48942 0.701740 0.303762 0.134185  1.84649
> 35      238799.:  3.01173 0.134307  1.48937 0.701724 0.303809 0.134206  1.84647
>[1] 43.10  3.78 48.41  0.00  0.00
>
>
>(If you run the timing yourself and don't want to see the iteration
>output, take the msV=1 out of the control list.  I keep it in there so
>I can monitor the progress.)
>
>If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
>the same version of R, BLAS and Matrix package, the timing ends up
>with something like
>
>90 140 235 0 0
>
>I do see that the multithreading is working on a calculation that is
>essentially BLAS-bound such as
>
>  
>
>>mm <- Matrix(rnorm(1e6), nc = 1e3)
>>system.time(crossprod(mm))
>>    
>>
>[1] 0.57 0.02 0.60 0.00 0.00
>
>On the X2 processor it still takes about 0.6 seconds user time but
>only 0.3 seconds elapsed time when the machine is otherwise idle and
>both cores are available for the calculation.
>
>Any suggestions why the dual-core processor is so much slower than the
>single core processor?
>
>By the way, I would be interested in accumulating timings of this
>model fit on other systems.  If you do time it please send me
>(off-list) a summary of the version of R, version of the accelerated
>BLAS if you use them, processor speed and configuration (i.e.
>multiprocessor, multicore, etc.) and, if you know it, memory speed.
>
>This is an example of a complex multilevel model with crossed grouping
>factors fit to a relatively large (30000 observations on 10000
>students, 1400 teachers and 80 schools) data set.
>
>______________________________________________
>[hidden email] mailing list
>https://stat.ethz.ch/mailman/listinfo/r-devel
>  
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Peculiar timing result

Liaw, Andy
In reply to this post by Douglas Bates-2
Paul,

I think what you're seeing is the performance gap between 64-bit binary and
32-bit binary on x86_64.  I believe Prof. Ripley had mentioned this several
times in the past.

I do remember back when I was playing with optimized BLAS with R on 32-bit
Linux that I've seen something similar to what Prof. Bates is seeing, that
the one that uses threaded BLAS took longer the than non-threaded one (on
dual Xeon boxes).  I never dug into it more, though.

Andy

From: Paul Gilbert

>
> Doug
>
> This is probably not your reason, but I am finding my dual
> core Athlon
> 64 is much slower running 64 bit Linux and R than it was
> running 32 bit
> Linux and R. All the programs are bigger.  (Some, like the
> clock applet,
> are a lot bigger for no obvious reason.)  The difference is enough to
> put my meager 1GB machine into swapping much more, with the
> result that
> it is a lot slower.
>
> Paul
>
> Douglas Bates wrote:
>
> >I have been timing a particular model fit using lmer on several
> >different computers and came up with a peculiar result - the
> model fit
> >is considerably slower on a dual-core Athlon 64 using Goto's
> >multithreaded BLAS than on a single-core processor.
> >
> >Here is the timing on a single-core Athlon 64 3000+ running under
> >today's R-devel with version 0.995-5 of the Matrix package.
> >
> >  
> >
> >>library(Matrix)
> >>data(star, package = 'mlmRev')
> >>system.time(fm1 <-
> >>lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch),
> star, control = list(nit=0,grad=0,msV=1)))
> >>    
> >>
> >  0      241720.:  1.16440 0.335239  0.00000  1.78732
> 0.867209 0.382318  0.00000
> >  1      239722.:  1.94952 5.00000e-10 0.00933767  1.65999 0.858003
> >0.341520 0.00908757
> >  2      239580.:  1.95924 0.0884059 0.00933767  1.65308 0.857487
> >0.339296 0.00954718
> >  3      239215.:  2.60877 0.0765848 0.0177699  1.45739 0.843562
> >0.275100 0.0236849
> >  4      239204.:  2.62582 0.106670 0.0239698  1.41976 0.841086
> >0.261033 0.0267073
> >  5      239176.:  2.63149 0.0787924 0.0367185  1.37952 0.838527
> >0.245076 0.0301134
> >  6      239141.:  2.64949 0.108534 0.0594586  1.28846 0.832543
> >0.208404 0.0375456
> >  7      239049.:  2.64794 0.0789214 0.121782  1.10436 0.819711
> >0.126101 0.0524965
> >  8      239004.:  2.66044 0.117957 0.181505 0.932202 0.798982
> >0.0718116 0.0628958
> >  9      238944.:  2.66310 0.0819653 0.334477 0.631735 0.740855
> >0.258359 0.0806590
> > 10      238893.:  2.72626 0.0975205 0.653432 0.703912 0.666067
> >0.109922 0.201809
> > 11      238892.:  2.74381 0.111146 0.666155 0.693544
> 0.662000 0.104060 0.207591
> > 12      238888.:  2.75052 0.0990238 0.689199 0.694588 0.655781
> >0.106516 0.216460
> > 13      238861.:  2.80325 0.126935  1.05912 0.733914
> 0.556162 0.159296 0.360938
> > 14      238832.:  2.82656 0.117617  1.59471 0.607916 0.441371
> >0.0749944 0.976142
> > 15      238811.:  2.87350 0.136332  1.59046 0.653141
> 0.353763 0.226061  1.79285
> > 16      238810.:  2.87663 0.125135  1.58992 0.656808
> 0.352605 0.220488  1.79282
> > 17      238806.:  2.89342 0.141551  1.58607 0.676523
> 0.344212 0.181833  1.79268
> > 18      238804.:  2.90080 0.125137  1.56624 0.682921
> 0.261295 0.180598  1.74325
> > 19      238802.:  2.91950 0.128569  1.56836 0.680436
> 0.336051 0.159940  1.80400
> > 20      238801.:  2.92795 0.134762  1.56597 0.685121
> 0.331695 0.145547  1.80414
> > 21      238801.:  2.93741 0.125667  1.56139 0.687827
> 0.332700 0.138854  1.81495
> > 22      238800.:  2.94588 0.131757  1.55294 0.687909
> 0.330414 0.137834  1.82826
> > 23      238799.:  2.96867 0.129716  1.52943 0.688678
> 0.323171 0.139912  1.84615
> > 24      238799.:  2.98994 0.133378  1.52188 0.700038
> 0.337387 0.131403  1.77623
> > 25      238799.:  3.00312 0.135308  1.51475 0.697550
> 0.311750 0.145683  1.78037
> > 26      238799.:  3.00461 0.129920  1.51083 0.697666
> 0.306722 0.138745  1.81188
> > 27      238799.:  3.00504 0.134882  1.50539 0.696745
> 0.302949 0.135897  1.84405
> > 28      238799.:  3.00422 0.134013  1.47947 0.698115
> 0.303243 0.133806  1.86486
> > 29      238799.:  3.00819 0.134378  1.48185 0.701871
> 0.307097 0.134637  1.84996
> > 30      238799.:  3.01313 0.134279  1.49098 0.702883
> 0.304788 0.133682  1.86254
> > 31      238799.:  3.01291 0.134253  1.49060 0.701818
> 0.303155 0.133771  1.84613
> > 32      238799.:  3.01142 0.134314  1.48921 0.701782
> 0.303589 0.134439  1.84653
> > 33      238799.:  3.01174 0.134315  1.48926 0.701641
> 0.304120 0.134145  1.84635
> > 34      238799.:  3.01175 0.134304  1.48942 0.701740
> 0.303762 0.134185  1.84649
> > 35      238799.:  3.01173 0.134307  1.48937 0.701724
> 0.303809 0.134206  1.84647
> >[1] 43.10  3.78 48.41  0.00  0.00
> >
> >
> >(If you run the timing yourself and don't want to see the iteration
> >output, take the msV=1 out of the control list.  I keep it
> in there so
> >I can monitor the progress.)
> >
> >If I time the same model fit on a dual-core Athlon 64 X2
> 3800+ with the
> >same version of R, BLAS and Matrix package, the timing ends up with
> >something like
> >
> >90 140 235 0 0
> >
> >I do see that the multithreading is working on a calculation that is
> >essentially BLAS-bound such as
> >
> >  
> >
> >>mm <- Matrix(rnorm(1e6), nc = 1e3)
> >>system.time(crossprod(mm))
> >>    
> >>
> >[1] 0.57 0.02 0.60 0.00 0.00
> >
> >On the X2 processor it still takes about 0.6 seconds user
> time but only
> >0.3 seconds elapsed time when the machine is otherwise idle and both
> >cores are available for the calculation.
> >
> >Any suggestions why the dual-core processor is so much
> slower than the
> >single core processor?
> >
> >By the way, I would be interested in accumulating timings of
> this model
> >fit on other systems.  If you do time it please send me
> >(off-list) a summary of the version of R, version of the accelerated
> >BLAS if you use them, processor speed and configuration (i.e.
> >multiprocessor, multicore, etc.) and, if you know it, memory speed.
> >
> >This is an example of a complex multilevel model with
> crossed grouping
> >factors fit to a relatively large (30000 observations on 10000
> >students, 1400 teachers and 80 schools) data set.
> >
> >______________________________________________
> >[hidden email] mailing list
> >https://stat.ethz.ch/mailman/listinfo/r-devel
> >  
> >
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Peculiar timing result

Brian Ripley
In reply to this post by Douglas Bates-2
On Fri, 3 Mar 2006, Douglas Bates wrote:

> I have been timing a particular model fit using lmer on several
> different computers and came up with a peculiar result - the model fit
> is considerably slower on a dual-core Athlon 64 using Goto's
> multithreaded BLAS than on a single-core processor.

Is there a Goto BLAS tuned for that chip?  I can only see one tuned for an
(unspecified) Opteron.  L1 and L2 cache sizes do sometimes matter a lot
for tuned BLAS, and (according to the AMD site I just looked up) the X2
3800+ only has a 512Kb per core L2 cache.  Opterons have a 1Mb L2 cache.

Also, the very large system time seen in the dual-core run is typical of
what I see when pthreads is not working right, and I suggest you try a
limit of one thread (see the R-admin manual).  On our dual-processor
Opteron 248 that ran in 44 secs instead of 328.

> Here is the timing on a single-core Athlon 64 3000+ running under
> today's R-devel with version 0.995-5 of the Matrix package.
>
>> library(Matrix)
>> data(star, package = 'mlmRev')
>> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,
control = list(nit=0,grad=0,msV=1)))

> [1] 43.10  3.78 48.41  0.00  0.00
>
>
> (If you run the timing yourself and don't want to see the iteration
> output, take the msV=1 out of the control list.  I keep it in there so
> I can monitor the progress.)
>
> If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
> the same version of R, BLAS and Matrix package, the timing ends up
> with something like
>
> 90 140 235 0 0
....

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Peculiar timing result

Douglas Bates-2
In reply to this post by Paul Gilbert
I don't think this calculation is memory-bound at all and I would be
surprised if changing to a 32-bit environment would change things.  I
do have a 32-bit chroot environment on these machines (needed for
things like wine and acroread) so I'll try that out but I think I will
need to use Atlas as the accelerated BLAS rather than Goto's BLAS.  I
imagine the Opteron/Athlon 64 version of Goto's BLAS assumes a 64-bit
environment.

On 3/3/06, Paul Gilbert <[hidden email]> wrote:

> Doug
>
> This is probably not your reason, but I am finding my dual core Athlon
> 64 is much slower running 64 bit Linux and R than it was running 32 bit
> Linux and R. All the programs are bigger.  (Some, like the clock applet,
> are a lot bigger for no obvious reason.)  The difference is enough to
> put my meager 1GB machine into swapping much more, with the result that
> it is a lot slower.
>
> Paul
>
> Douglas Bates wrote:
>
> >I have been timing a particular model fit using lmer on several
> >different computers and came up with a peculiar result - the model fit
> >is considerably slower on a dual-core Athlon 64 using Goto's
> >multithreaded BLAS than on a single-core processor.
> >
> >Here is the timing on a single-core Athlon 64 3000+ running under
> >today's R-devel with version 0.995-5 of the Matrix package.
> >
> >
> >
> >>library(Matrix)
> >>data(star, package = 'mlmRev')
> >>system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star, control = list(nit=0,grad=0,msV=1)))
> >>
> >>
> >  0      241720.:  1.16440 0.335239  0.00000  1.78732 0.867209 0.382318  0.00000
> >  1      239722.:  1.94952 5.00000e-10 0.00933767  1.65999 0.858003
> >0.341520 0.00908757
> >  2      239580.:  1.95924 0.0884059 0.00933767  1.65308 0.857487
> >0.339296 0.00954718
> >  3      239215.:  2.60877 0.0765848 0.0177699  1.45739 0.843562
> >0.275100 0.0236849
> >  4      239204.:  2.62582 0.106670 0.0239698  1.41976 0.841086
> >0.261033 0.0267073
> >  5      239176.:  2.63149 0.0787924 0.0367185  1.37952 0.838527
> >0.245076 0.0301134
> >  6      239141.:  2.64949 0.108534 0.0594586  1.28846 0.832543
> >0.208404 0.0375456
> >  7      239049.:  2.64794 0.0789214 0.121782  1.10436 0.819711
> >0.126101 0.0524965
> >  8      239004.:  2.66044 0.117957 0.181505 0.932202 0.798982
> >0.0718116 0.0628958
> >  9      238944.:  2.66310 0.0819653 0.334477 0.631735 0.740855
> >0.258359 0.0806590
> > 10      238893.:  2.72626 0.0975205 0.653432 0.703912 0.666067
> >0.109922 0.201809
> > 11      238892.:  2.74381 0.111146 0.666155 0.693544 0.662000 0.104060 0.207591
> > 12      238888.:  2.75052 0.0990238 0.689199 0.694588 0.655781
> >0.106516 0.216460
> > 13      238861.:  2.80325 0.126935  1.05912 0.733914 0.556162 0.159296 0.360938
> > 14      238832.:  2.82656 0.117617  1.59471 0.607916 0.441371
> >0.0749944 0.976142
> > 15      238811.:  2.87350 0.136332  1.59046 0.653141 0.353763 0.226061  1.79285
> > 16      238810.:  2.87663 0.125135  1.58992 0.656808 0.352605 0.220488  1.79282
> > 17      238806.:  2.89342 0.141551  1.58607 0.676523 0.344212 0.181833  1.79268
> > 18      238804.:  2.90080 0.125137  1.56624 0.682921 0.261295 0.180598  1.74325
> > 19      238802.:  2.91950 0.128569  1.56836 0.680436 0.336051 0.159940  1.80400
> > 20      238801.:  2.92795 0.134762  1.56597 0.685121 0.331695 0.145547  1.80414
> > 21      238801.:  2.93741 0.125667  1.56139 0.687827 0.332700 0.138854  1.81495
> > 22      238800.:  2.94588 0.131757  1.55294 0.687909 0.330414 0.137834  1.82826
> > 23      238799.:  2.96867 0.129716  1.52943 0.688678 0.323171 0.139912  1.84615
> > 24      238799.:  2.98994 0.133378  1.52188 0.700038 0.337387 0.131403  1.77623
> > 25      238799.:  3.00312 0.135308  1.51475 0.697550 0.311750 0.145683  1.78037
> > 26      238799.:  3.00461 0.129920  1.51083 0.697666 0.306722 0.138745  1.81188
> > 27      238799.:  3.00504 0.134882  1.50539 0.696745 0.302949 0.135897  1.84405
> > 28      238799.:  3.00422 0.134013  1.47947 0.698115 0.303243 0.133806  1.86486
> > 29      238799.:  3.00819 0.134378  1.48185 0.701871 0.307097 0.134637  1.84996
> > 30      238799.:  3.01313 0.134279  1.49098 0.702883 0.304788 0.133682  1.86254
> > 31      238799.:  3.01291 0.134253  1.49060 0.701818 0.303155 0.133771  1.84613
> > 32      238799.:  3.01142 0.134314  1.48921 0.701782 0.303589 0.134439  1.84653
> > 33      238799.:  3.01174 0.134315  1.48926 0.701641 0.304120 0.134145  1.84635
> > 34      238799.:  3.01175 0.134304  1.48942 0.701740 0.303762 0.134185  1.84649
> > 35      238799.:  3.01173 0.134307  1.48937 0.701724 0.303809 0.134206  1.84647
> >[1] 43.10  3.78 48.41  0.00  0.00
> >
> >
> >(If you run the timing yourself and don't want to see the iteration
> >output, take the msV=1 out of the control list.  I keep it in there so
> >I can monitor the progress.)
> >
> >If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
> >the same version of R, BLAS and Matrix package, the timing ends up
> >with something like
> >
> >90 140 235 0 0
> >
> >I do see that the multithreading is working on a calculation that is
> >essentially BLAS-bound such as
> >
> >
> >
> >>mm <- Matrix(rnorm(1e6), nc = 1e3)
> >>system.time(crossprod(mm))
> >>
> >>
> >[1] 0.57 0.02 0.60 0.00 0.00
> >
> >On the X2 processor it still takes about 0.6 seconds user time but
> >only 0.3 seconds elapsed time when the machine is otherwise idle and
> >both cores are available for the calculation.
> >
> >Any suggestions why the dual-core processor is so much slower than the
> >single core processor?
> >
> >By the way, I would be interested in accumulating timings of this
> >model fit on other systems.  If you do time it please send me
> >(off-list) a summary of the version of R, version of the accelerated
> >BLAS if you use them, processor speed and configuration (i.e.
> >multiprocessor, multicore, etc.) and, if you know it, memory speed.
> >
> >This is an example of a complex multilevel model with crossed grouping
> >factors fit to a relatively large (30000 observations on 10000
> >students, 1400 teachers and 80 schools) data set.
> >
> >______________________________________________
> >[hidden email] mailing list
> >https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Peculiar timing result

Brian Ripley
In reply to this post by Brian Ripley
Here is a summary of some results on a dual Opteron 252 running FC3

64-bit gcc 3.4.5
R's blas 34.83  3.45 38.56
ATLAS 36.70  3.28 40.14
ATLAS multithread 76.85  5.39 82.29
Goto 1 thread 36.17  3.44 39.76
Goto multithread       178.06 345.97 467.99
ACML 49.69  3.36 53.23

64-bit gcc 4.1.0
R's blas 34.98  3.49 38.55
32-bit gcc 3.4.5
R's blas 33.72  3.27 36.99
32-bit gcc 4.1.0
R's blas 34.62  3.25 37.93

The timings are not that repeatable, but the message seems clear that
this problem does not benefit from a tuned BLAS and the overhead from
multiple threads is harmful.  (The gcc 4.1.0 results took fewer
iterations, which skews the results in its favour.)

And my 2GHz Pentium M laptop under Windows gave 39.96  3.68 44.06.

Clearly the Goto BLAS has a problem here: the results are slower on a dual
252 than a dual 248 (see below).


On Fri, 3 Mar 2006, Prof Brian Ripley wrote:

> On Fri, 3 Mar 2006, Douglas Bates wrote:
>
>> I have been timing a particular model fit using lmer on several
>> different computers and came up with a peculiar result - the model fit
>> is considerably slower on a dual-core Athlon 64 using Goto's
>> multithreaded BLAS than on a single-core processor.
>
> Is there a Goto BLAS tuned for that chip?  I can only see one tuned for an
> (unspecified) Opteron.  L1 and L2 cache sizes do sometimes matter a lot
> for tuned BLAS, and (according to the AMD site I just looked up) the X2
> 3800+ only has a 512Kb per core L2 cache.  Opterons have a 1Mb L2 cache.
>
> Also, the very large system time seen in the dual-core run is typical of
> what I see when pthreads is not working right, and I suggest you try a
> limit of one thread (see the R-admin manual).  On our dual-processor
> Opteron 248 that ran in 44 secs instead of 328.
>
>> Here is the timing on a single-core Athlon 64 3000+ running under
>> today's R-devel with version 0.995-5 of the Matrix package.
>>
>>> library(Matrix)
>>> data(star, package = 'mlmRev')
>>> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,
> control = list(nit=0,grad=0,msV=1)))
>> [1] 43.10  3.78 48.41  0.00  0.00
>>
>>
>> (If you run the timing yourself and don't want to see the iteration
>> output, take the msV=1 out of the control list.  I keep it in there so
>> I can monitor the progress.)
>>
>> If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
>> the same version of R, BLAS and Matrix package, the timing ends up
>> with something like
>>
>> 90 140 235 0 0
> ....
>
>

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Peculiar timing result

Douglas Bates-2
On 3/11/06, Prof Brian Ripley <[hidden email]> wrote:

> Here is a summary of some results on a dual Opteron 252 running FC3
>
> 64-bit gcc 3.4.5
> R's blas                34.83  3.45 38.56
> ATLAS                   36.70  3.28 40.14
> ATLAS multithread       76.85  5.39 82.29
> Goto 1 thread           36.17  3.44 39.76
> Goto multithread       178.06 345.97 467.99
> ACML                    49.69  3.36 53.23
>
> 64-bit gcc 4.1.0
> R's blas                34.98  3.49 38.55
> 32-bit gcc 3.4.5
> R's blas                33.72  3.27 36.99
> 32-bit gcc 4.1.0
> R's blas                34.62  3.25 37.93
>
> The timings are not that repeatable, but the message seems clear that
> this problem does not benefit from a tuned BLAS and the overhead from
> multiple threads is harmful.  (The gcc 4.1.0 results took fewer
> iterations, which skews the results in its favour.)
>
> And my 2GHz Pentium M laptop under Windows gave 39.96  3.68 44.06.
>
> Clearly the Goto BLAS has a problem here: the results are slower on a dual
> 252 than a dual 248 (see below).

Thanks for the information on the timings.  It happens that this
message ended up in my R-help folder and I only got around to reading
that folder today.

Is it ok with you if I forward this message to Simon Urbanek?  I am
having similar difficulties in the timing with R on a dual-core Intel
MacBook.

>
>
> On Fri, 3 Mar 2006, Prof Brian Ripley wrote:
>
> > On Fri, 3 Mar 2006, Douglas Bates wrote:
> >
> >> I have been timing a particular model fit using lmer on several
> >> different computers and came up with a peculiar result - the model fit
> >> is considerably slower on a dual-core Athlon 64 using Goto's
> >> multithreaded BLAS than on a single-core processor.
> >
> > Is there a Goto BLAS tuned for that chip?  I can only see one tuned for an
> > (unspecified) Opteron.  L1 and L2 cache sizes do sometimes matter a lot
> > for tuned BLAS, and (according to the AMD site I just looked up) the X2
> > 3800+ only has a 512Kb per core L2 cache.  Opterons have a 1Mb L2 cache.
> >
> > Also, the very large system time seen in the dual-core run is typical of
> > what I see when pthreads is not working right, and I suggest you try a
> > limit of one thread (see the R-admin manual).  On our dual-processor
> > Opteron 248 that ran in 44 secs instead of 328.
> >
> >> Here is the timing on a single-core Athlon 64 3000+ running under
> >> today's R-devel with version 0.995-5 of the Matrix package.
> >>
> >>> library(Matrix)
> >>> data(star, package = 'mlmRev')
> >>> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,
> > control = list(nit=0,grad=0,msV=1)))
> >> [1] 43.10  3.78 48.41  0.00  0.00
> >>
> >>
> >> (If you run the timing yourself and don't want to see the iteration
> >> output, take the msV=1 out of the control list.  I keep it in there so
> >> I can monitor the progress.)
> >>
> >> If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
> >> the same version of R, BLAS and Matrix package, the timing ends up
> >> with something like
> >>
> >> 90 140 235 0 0
> > ....
> >
> >
>
> --
> Brian D. Ripley,                  [hidden email]
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel