From: Dan Bonachea (bonachea_at_cs_dot_berkeley_dot_edu)
Date: Thu Sep 01 2005 - 01:18:27 PDT
Hi Christian - I'm moving your question to the UPC-help_at_hermes_dot_gwu_dot_edu list
(please send any followup questions there), which is a more appropriate forum
for user questions than the UPC announcement list.
I'd recommend several changes to your code. First, move the declarations of
a/b/c from global scope to local scope (or, alternatively, declare local
variables that copy their values); a sketch of this change follows the results
below. This only matters because you're using the -pthreads backend in
Berkeley UPC, which adds an extra level of indirection to non-shared
global-scope variables and thereby incurs a small performance hit. Here are
some results from 2 threads on my quad 2.20GHz Xeon Linux machine with the
smp/pthreads backend on Berkeley UPC 2.2:
ORIGINAL CODE:
Function      Rate (MB/s)  RMS time  Min time  Max time
Assignment:      167.8426    1.0527    0.9533    1.1090
Scaling:         160.5885    1.0610    0.9963    1.1741
Summing:         198.0455    1.2694    1.2118    1.3055
SAXPYing:        189.6894    1.3231    1.2652    1.3813
WITH LOCAL VARS:
Function      Rate (MB/s)  RMS time  Min time  Max time
Assignment:      186.6148    0.9789    0.8574    1.0568
Scaling:         184.8741    0.9989    0.8655    1.0629
Summing:         205.0178    1.2583    1.1706    1.3389
SAXPYing:        202.2681    1.2571    1.1865    1.3456
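Here is a minimal sketch of that first change for the copy kernel (the
copy_kernel wrapper and the value of N are my own placeholders; b gets the
same treatment for the other kernels):
#include <upc.h>
#define N 2000000            /* problem size placeholder, as in the STREAM source */
shared double *a, *c;        /* pointers-to-shared at global scope, as before */
void copy_kernel(void)       /* hypothetical helper wrapping one kernel */
{
    /* Private local copies of the global pointers-to-shared: with the
       -pthreads backend this avoids the extra level of indirection paid
       by non-shared global-scope variables. */
    shared double *la = a;
    shared double *lc = c;
    int j;
    upc_forall (j = 0; j < N; j++; j)
        lc[j] = la[j];
}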
Secondly, you can remove much of the overhead associated with the upc_forall
loop by enabling the UPC-level optimizations in the Berkeley UPC 2.2 release
(pass -opt to upcc). Notably, this enables a smarter code-generation strategy
for analyzable forall loops like yours: each thread executes only the loop
iterations that have affinity to it, rather than the naive translation in
which every thread executes the loop header for every iteration and merely
skips the bodies it doesn't own (see the sketch after the results below). With
-opt enabled, performance jumps by over 2x, and the improvement scales with
the number of threads:
Function      Rate (MB/s)  RMS time  Min time  Max time
Assignment:      384.6043    0.5449    0.4160    0.8324
Scaling:         350.1492    0.5496    0.4569    0.6488
Summing:         411.5805    0.8906    0.5831    1.1028
SAXPYing:        345.0864    0.8725    0.6955    1.3570
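To make the difference concrete, here is roughly what the two translations of
your copy loop amount to for an integer affinity expression (a conceptual
sketch, not the literal code the translator emits):
/* Naive translation: every thread runs the full loop and tests affinity
   on each iteration, executing only the bodies it owns. */
for (j = 0; j < N; j++)
    if (j % THREADS == MYTHREAD)
        c[j] = a[j];
/* Optimized translation (-opt): each thread strides directly over its
   own iterations, so no per-iteration affinity test is needed. */
for (j = MYTHREAD; j < N; j += THREADS)
    c[j] = a[j];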
Third, because your benchmark only performs local computation, you can and
should apply a privatization optimization to the compute loop, thereby
avoiding the overhead of pointer-to-shared arithmetic in the inner loops. E.g.:
THIS:
/* start timer */
upc_forall (j=0; j<N; j++; j)
    c[j] = a[j];
/* end timer */
BECOMES:
/* Cast each pointer-to-shared, indexed at this thread's first element,
   to a plain C pointer into the local portion of the array. */
double * const l_a = (double *)&a[MYTHREAD];
double * const l_b = (double *)&b[MYTHREAD];
double * const l_c = (double *)&c[MYTHREAD];
int const localsz = (N + OFFSET) / THREADS;  /* elements with affinity to this thread */
/* start timer */
for (j = 0; j < localsz; j++)
    l_c[j] = l_a[j];
/* end timer */
This privatization optimization is very important for performance, and UPC
compiler optimizers will eventually apply it automatically (at least in simple
cases like this benchmark, where the data locality can be statically
analyzed). However, in the meantime, good UPC programmers know it often needs
to be done by hand. With this optimization applied, all pointer-to-shared
manipulations are removed from the critical path, there's essentially no
difference between the UPC inner loop and an equivalent inner loop in pure C,
and performance approaches the hardware peak:
Function      Rate (MB/s)  RMS time  Min time  Max time
Assignment:     1934.3749    0.1489    0.0827    0.2014
Scaling:        1603.0944    0.1928    0.0998    0.3800
Summing:        1998.5663    0.2246    0.1201    0.2900
SAXPYing:       1736.3372    0.2279    0.1382    0.2925
An updated tarball is attached with all the changes described above.
Incidentally, the allocation of your arrays is not quite right: it
essentially allocates a shared array of 1-byte blocks, which is not what you
want for doubles (upc_all_alloc takes the number of blocks first and the bytes
per block second). It happens to work in most cases on Berkeley UPC, but to
ensure correct behavior you should really change this:
a = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1); // CT
to this:
a = (shared double *) upc_all_alloc((N + OFFSET), sizeof(double)); // CT
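Applied uniformly, a minimal sketch (assuming b is declared and used like a
and c in the rest of your code):
/* One block of sizeof(double) bytes per element, distributed cyclically,
   which matches the default [1] blocking of "shared double *". */
a = (shared double *) upc_all_alloc(N + OFFSET, sizeof(double));
b = (shared double *) upc_all_alloc(N + OFFSET, sizeof(double));
c = (shared double *) upc_all_alloc(N + OFFSET, sizeof(double));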
Finally, if you're using Berkeley UPC 2.2, you can dump all that
gettimeofday() timer code and instead use bupc_tick_t (see our User Guide),
which gives you direct, easy access to the hardware cycle counters.
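A minimal sketch of what that could look like around the privatized copy loop
(assuming the bupc_ticks_now()/bupc_ticks_to_us() calls described in the User
Guide, and reusing l_a/l_c/localsz from the snippet above):
/* bupc_tick_t and friends are Berkeley UPC extensions; see the User Guide
   for the header that declares them. */
bupc_tick_t start, end;
start = bupc_ticks_now();                 /* raw hardware tick counter */
for (j = 0; j < localsz; j++)
    l_c[j] = l_a[j];
end = bupc_ticks_now();
printf("Copy: %g us\n", (double)bupc_ticks_to_us(end - start));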
Hope this helps...
Dan
At 01:14 PM 8/31/2005, Christian Terboven wrote:
>Hi all.
>
>First, I hope this is the right place to put my question. If not, please
>give me a pointer.
>
>I am pretty new to UPC and am currently comparing UPC with OpenMP and
>PThreads on some of our users' codes. Targeting a shared-memory
>parallelization, I am seeing really bad performance with UPC.
>For example, let's look at this code snippet from my port of the
>STREAM benchmark:
>
>shared double *a;
>shared double *c;
>[...]
>a = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1);
>c = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1);
>[...]
> upc_forall (j=0; j<N; j++; j)
> c[j] = a[j];
>
>On a 4x-Opteron system running Linux with Berkeley-UPC-2.0.1 I get about
>130 MB/s running with one thread, but about 2000 MB/s when compiled
>with GNU C++ or Intel's C++ (OpenMP). When using more threads, it gets
>worse.
>
>What's wrong? What can I do to improve the performance of my UPC codes?
>
>
>Best regards,
>Christian Terboven