From: Dan Bonachea (bonachea_at_cs_dot_berkeley_dot_edu)
Date: Thu Sep 01 2005 - 01:18:27 PDT
Hi Christian - I'm moving your question to the UPC-help_at_hermes_dot_gwu_dot_edu list
(please send any followup questions there), which is a more appropriate forum
for user questions than the UPC announcement list.
I'd recommend several changes to your code. First, move the declarations of
a/b/c from global scope to local scope (or, alternatively, declare local
variables that copy their values); a sketch of this change follows the results
below. This only matters because you're using the -pthreads backend in
Berkeley UPC, which adds an extra level of indirection to non-shared
global-scope variables and thereby incurs a small performance hit. Here are
some results from 2 threads on my quad 2.20GHz Xeon Linux machine with the
smp/pthreads backend on Berkeley UPC 2.2:
ORIGINAL CODE:
Function      Rate (MB/s)  RMS time  Min time  Max time
Assignment:      167.8426    1.0527    0.9533    1.1090
Scaling:         160.5885    1.0610    0.9963    1.1741
Summing:         198.0455    1.2694    1.2118    1.3055
SAXPYing:        189.6894    1.3231    1.2652    1.3813
WITH LOCAL VARS:
Function      Rate (MB/s)  RMS time  Min time  Max time
Assignment:      186.6148    0.9789    0.8574    1.0568
Scaling:         184.8741    0.9989    0.8655    1.0629
Summing:         205.0178    1.2583    1.1706    1.3389
SAXPYing:        202.2681    1.2571    1.1865    1.3456
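Here is a minimal sketch of that first change for the copy kernel (the
copy_kernel wrapper and the value of N are my own placeholders; b gets the
same treatment for the other kernels):
#include <upc.h>
#define N 2000000            /* problem size placeholder, as in the STREAM source */
shared double *a, *c;        /* pointers-to-shared at global scope, as before */
void copy_kernel(void)       /* hypothetical helper wrapping one kernel */
{
    /* Private local copies of the global pointers-to-shared: with the
       -pthreads backend this avoids the extra level of indirection paid
       by non-shared global-scope variables. */
    shared double *la = a;
    shared double *lc = c;
    int j;
    upc_forall (j = 0; j < N; j++; j)
        lc[j] = la[j];
}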
Secondly, you can remove much of the overhead associated with the upc_forall
loop by enabling the UPC-level optimizations in the Berkeley UPC 2.2 release
(pass -opt to upcc). Notably, this enables a smarter code-generation strategy
for analyzable forall loops like yours: each thread executes only the loop
iterations that have affinity to it, rather than the naive translation in
which every thread executes the loop header for every iteration and merely
skips the bodies it doesn't own (see the sketch after the results below). With
-opt enabled, performance jumps by over 2x, and the improvement scales with
the number of threads:
Function      Rate (MB/s)  RMS time  Min time  Max time
Assignment:      384.6043    0.5449    0.4160    0.8324
Scaling:         350.1492    0.5496    0.4569    0.6488
Summing:         411.5805    0.8906    0.5831    1.1028
SAXPYing:        345.0864    0.8725    0.6955    1.3570
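To make the difference concrete, here is roughly what the two translations of
your copy loop amount to for an integer affinity expression (a conceptual
sketch, not the literal code the translator emits):
/* Naive translation: every thread runs the full loop and tests affinity
   on each iteration, executing only the bodies it owns. */
for (j = 0; j < N; j++)
    if (j % THREADS == MYTHREAD)
        c[j] = a[j];
/* Optimized translation (-opt): each thread strides directly over its
   own iterations, so no per-iteration affinity test is needed. */
for (j = MYTHREAD; j < N; j += THREADS)
    c[j] = a[j];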
Third, because your benchmark only performs local computation, you can and
should apply a privatization optimization to the compute loop, thereby
avoiding the overhead of pointer-to-shared arithmetic in the inner loops. E.g.:
THIS:
/* start timer */
upc_forall (j=0; j<N; j++; j)
    c[j] = a[j];
/* end timer */
BECOMES:
/* Cast each pointer-to-shared, indexed at this thread's first element,
   to a plain C pointer into the local portion of the array. */
double * const l_a = (double *)&a[MYTHREAD];
double * const l_b = (double *)&b[MYTHREAD];
double * const l_c = (double *)&c[MYTHREAD];
int const localsz = (N + OFFSET) / THREADS;  /* elements with affinity to this thread */
/* start timer */
for (j = 0; j < localsz; j++)
    l_c[j] = l_a[j];
/* end timer */
This privatization optimization is very important for performance, and UPC
compiler optimizers will eventually apply it automatically (at least in simple
cases like this benchmark, where the data locality can be statically
analyzed). However, in the meantime, good UPC programmers know it often needs
to be done by hand. With this optimization applied, all pointer-to-shared
manipulations are removed from the critical path, there's essentially no
difference between the UPC inner loop and an equivalent inner loop in pure C,
and performance approaches the hardware peak:
Function      Rate (MB/s)  RMS time  Min time  Max time
Assignment:     1934.3749    0.1489    0.0827    0.2014
Scaling:        1603.0944    0.1928    0.0998    0.3800
Summing:        1998.5663    0.2246    0.1201    0.2900
SAXPYing:       1736.3372    0.2279    0.1382    0.2925
An updated tarball is attached with all the changes described above.
Incidentally, the allocation of your arrays is not quite right: it
essentially allocates a shared array of 1-byte blocks, which is not what you
want for doubles (upc_all_alloc takes the number of blocks first and the bytes
per block second). It happens to work in most cases on Berkeley UPC, but to
ensure correct behavior you should really change this:
a = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1); // CT
to this:
a = (shared double *) upc_all_alloc((N + OFFSET), sizeof(double)); // CT
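Applied uniformly, a minimal sketch (assuming b is declared and used like a
and c in the rest of your code):
/* One block of sizeof(double) bytes per element, distributed cyclically,
   which matches the default [1] blocking of "shared double *". */
a = (shared double *) upc_all_alloc(N + OFFSET, sizeof(double));
b = (shared double *) upc_all_alloc(N + OFFSET, sizeof(double));
c = (shared double *) upc_all_alloc(N + OFFSET, sizeof(double));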
Finally, if you're using Berkeley UPC 2.2, you can dump all that
gettimeofday() timer code and instead use bupc_tick_t (see our User Guide),
which gives you direct, easy access to the hardware cycle counters.
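A minimal sketch of what that could look like around the privatized copy loop
(assuming the bupc_ticks_now()/bupc_ticks_to_us() calls described in the User
Guide, and reusing l_a/l_c/localsz from the snippet above):
/* bupc_tick_t and friends are Berkeley UPC extensions; see the User Guide
   for the header that declares them. */
bupc_tick_t start, end;
start = bupc_ticks_now();                 /* raw hardware tick counter */
for (j = 0; j < localsz; j++)
    l_c[j] = l_a[j];
end = bupc_ticks_now();
printf("Copy: %g us\n", (double)bupc_ticks_to_us(end - start));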
Hope this helps...
Dan
At 01:14 PM 8/31/2005, Christian Terboven wrote:
>Hi all.
>
>First, I hope this is the right place to put my question. If not, please
>give me a pointer.
>
>I am pretty new to UPC and am currently comparing UPC with OpenMP and
>PThreads on some of our users' codes. Targeting a shared-memory
>parallelization, I am seeing really bad performance with UPC.
>For example, let's look at this code snippet from my port of the
>STREAM benchmark:
>
>shared double *a;
>shared double *c;
>[...]
>a = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1);
>c = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1);
>[...]
> upc_forall (j=0; j<N; j++; j)
> c[j] = a[j];
>
>On a 4x-Opteron system running Linux with Berkeley-UPC-2.0.1 I get about
>130 MB/s running with one thread, but about 2000 MB/s when compiled
>with GNU C++ or Intel's C++ (OpenMP). When using more threads, it gets
>worse.
>
>What's wrong? What can I do to improve the performance of my UPC codes?
>
>
>Best regards,
>Christian Terboven