From: Dan Bonachea (bonachea_at_cs_dot_berkeley_dot_edu)
Date: Thu Sep 01 2005 - 01:18:27 PDT
Hi Christian - I'm moving your question to the UPC-help_at_hermes_dot_gwu_dot_edu list (please send any followup questions there), which is a more appropriate forum for user questions than the UPC announcement list.

I'd recommend several changes to your code.

First, move the declarations of a/b/c from global scope to local scope (or alternatively declare local variables that copy their values; a short sketch of that change appears near the end of this message). This only matters because you're using the -pthreads backend in Berkeley UPC, which adds an extra level of indirection to non-shared global-scope variables and thereby takes a small performance hit. Here are some results from 2 threads on my quad 2.20GHz Xeon Linux machine with the smp/pthreads backend on Berkeley UPC 2.2:

ORIGINAL CODE:
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:     167.8426      1.0527     0.9533     1.1090
Scaling   :     160.5885      1.0610     0.9963     1.1741
Summing   :     198.0455      1.2694     1.2118     1.3055
SAXPYing  :     189.6894      1.3231     1.2652     1.3813

WITH LOCAL VARS:
Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:     186.6148      0.9789     0.8574     1.0568
Scaling   :     184.8741      0.9989     0.8655     1.0629
Summing   :     205.0178      1.2583     1.1706     1.3389
SAXPYing  :     202.2681      1.2571     1.1865     1.3456

Second, you can remove much of the overhead associated with the UPC forall loop by enabling the UPC-level optimization in the Berkeley UPC 2.2 release (pass -opt to upcc). Notably, this enables a smarter code-generation strategy for analyzable forall loops like yours, allowing each thread to execute only the loop headers for iterations that have affinity to that thread, as opposed to the naive translation where all threads execute the loop headers for all iterations. If you enable -opt, performance jumps by over 2x (and the improvement scales with the number of threads):

Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:     384.6043      0.5449     0.4160     0.8324
Scaling   :     350.1492      0.5496     0.4569     0.6488
Summing   :     411.5805      0.8906     0.5831     1.1028
SAXPYing  :     345.0864      0.8725     0.6955     1.3570

Third, because your benchmark performs only local computation, you can and should apply a privatization optimization to the compute loop, thereby avoiding the overhead of pointer-to-shared arithmetic in the inner loops. E.g.:

THIS:

  /* start timer */
  upc_forall (j=0; j<N; j++; j)
    c[j] = a[j];
  /* end timer */

BECOMES:

  double * const l_a = (double *)&a[MYTHREAD];
  double * const l_b = (double *)&b[MYTHREAD];
  double * const l_c = (double *)&c[MYTHREAD];
  int const localsz = (N + OFFSET) / THREADS;

  /* start timer */
  for (j=0; j < localsz; j++)
    l_c[j] = l_a[j];
  /* end timer */

This privatization optimization is very important for performance, and UPC compiler optimizers will eventually apply it automatically (at least in simple cases like this benchmark where the data locality can be statically analyzed). In the meantime, however, good UPC programmers know it often needs to be done by hand. With this optimization applied, all shared-pointer manipulations are removed from the critical path, there's essentially no difference between the UPC inner loop and an equivalent inner loop in pure C, and the performance approaches the hardware peak:

Function      Rate (MB/s)   RMS time   Min time   Max time
Assignment:    1934.3749      0.1489     0.0827     0.2014
Scaling   :    1603.0944      0.1928     0.0998     0.3800
Summing   :    1998.5663      0.2246     0.1201     0.2900
SAXPYing  :    1736.3372      0.2279     0.1382     0.2925

An updated tarball is attached with all the changes described above.
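Since the tarball bundles all the changes together, here is roughly what the first change (copying the file-scope pointers into local variables) looks like in isolation. This is only a sketch: the local names sa/sc and the N/OFFSET values are illustrative, and it uses the corrected upc_all_alloc arguments discussed below:

  #include <upc.h>

  #define N      2000000   /* placeholder problem size */
  #define OFFSET 0

  /* file-scope pointers, as in the original benchmark */
  shared double *a;
  shared double *c;

  int main(void)
  {
      int j;
      a = (shared double *) upc_all_alloc(N + OFFSET, sizeof(double));
      c = (shared double *) upc_all_alloc(N + OFFSET, sizeof(double));

      /* Copy the file-scope pointers into local variables once, so the
         timed loop doesn't pay the extra -pthreads indirection on every
         access to a and c. */
      shared double * const sa = a;
      shared double * const sc = c;

      /* start timer */
      upc_forall (j = 0; j < N; j++; j)
          sc[j] = sa[j];
      /* end timer */

      return 0;
  }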
Incidentally, the allocation of your arrays is not quite right - it essentially allocates a shared array of 1-byte blocks, which is not what you want for doubles. It happens to work in most cases on Berkeley UPC, but to ensure correct behavior you should really change this:

  a = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1); // CT

to this:

  a = (shared double *) upc_all_alloc((N + OFFSET), sizeof(double)); // CT

Finally, if you're using Berkeley UPC 2.2, you can dump all that gettimeofday() timer code and instead use bupc_tick_t (see our User Guide), which gives you direct, easy access to the hardware cycle counters.

Hope this helps...
Dan

At 01:14 PM 8/31/2005, Christian Terboven wrote:
>Hi all.
>
>First, I hope this is the right place to put my question. If not, please
>give me a pointer.
>
>I am pretty new to UPC and currently comparing the application of UPC to
>some of our users' codes with OpenMP and PThreads. Targeting a shared
>memory parallelization, I experience really bad performance with UPC.
>For example, let's look at this code snippet from the STREAM benchmark I
>did:
>
>shared double *a;
>shared double *c;
>[...]
>a = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1);
>c = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1);
>[...]
>  upc_forall (j=0; j<N; j++; j)
>    c[j] = a[j];
>
>On a 4x-Opteron system running Linux with Berkeley-UPC-2.0.1 I get about
>130 MB/s running with one thread, but about 2000 MB/s when compiled
>with GNU C++ or Intel's C++ (OpenMP). When using more threads, it gets
>worse.
>
>What's wrong? What can I do to improve the performance of my UPC codes?
>
>
>Best regards,
>Christian Terboven