Re: Bad performance with UPC

From: Dan Bonachea (bonachea_at_cs_dot_berkeley_dot_edu)
Date: Thu Sep 01 2005 - 01:18:27 PDT

  • Next message: Adrian Powell: "xt3 support"
    Hi Christian - I'm moving your question to the UPC-help_at_hermes_dot_gwu_dot_edu list 
    (please send any followup questions there), which is a more appropriate forum 
    for user questions than the UPC announcement list.
    I'd recommend several changes to your code. First, move the declarations of 
    a/b/c from global scope to local scope (or alternatively declare local 
    variables that copy their values). This only matters because you're using 
    the -pthreads backend in Berkeley UPC, which adds an extra level of 
    indirection to non-shared global-scope variables and thus incurs a small 
    performance hit. Here are some results from 2 threads on my quad 2.20GHz 
    Xeon Linux machine with the smp/pthreads backend on Berkeley UPC 2.2:
    Before (a/b/c at global scope):
    Function      Rate (MB/s)   RMS time     Min time     Max time
    Assignment:   167.8426       1.0527       0.9533       1.1090
    Scaling   :   160.5885       1.0610       0.9963       1.1741
    Summing   :   198.0455       1.2694       1.2118       1.3055
    SAXPYing  :   189.6894       1.3231       1.2652       1.3813
    After moving the declarations to local scope:
    Function      Rate (MB/s)   RMS time     Min time     Max time
    Assignment:   186.6148       0.9789       0.8574       1.0568
    Scaling   :   184.8741       0.9989       0.8655       1.0629
    Summing   :   205.0178       1.2583       1.1706       1.3389
    SAXPYing  :   202.2681       1.2571       1.1865       1.3456
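    The first change can be sketched as follows (this is UPC, not plain C, and 
    an illustration rather than the exact code in the attached tarball):

    ```c
    /* Before: a/b/c are non-shared variables at global scope. Under the
     * -pthreads backend, every use of them goes through an extra level of
     * indirection (per-pthread storage). */
    shared double *a;
    shared double *b;
    shared double *c;

    int main(void) {
        /* After: either declare a/b/c here at block scope instead, or copy
         * the globals into locals once, before the timed loops, so the
         * pointer values are accessed directly. */
        shared double *l_a = a, *l_b = b, *l_c = c;
        /* ... use l_a/l_b/l_c inside the benchmark loops ... */
        return 0;
    }
    ```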
    Second, you can remove much of the overhead of the upc_forall loop by 
    enabling the UPC-level optimizer in the Berkeley UPC 2.2 release, by 
    passing -opt to upcc. Notably, this enables a smarter code-generation 
    strategy for analyzable forall loops like yours, in which each thread 
    executes only the loop headers for iterations that have affinity to that 
    thread, as opposed to the naive translation where every thread executes 
    the loop headers for all iterations. With -opt enabled, performance jumps 
    by over 2x (and the improvement scales with the number of threads):
    Function      Rate (MB/s)   RMS time     Min time     Max time
    Assignment:   384.6043       0.5449       0.4160       0.8324
    Scaling   :   350.1492       0.5496       0.4569       0.6488
    Summing   :   411.5805       0.8906       0.5831       1.1028
    SAXPYing  :   345.0864       0.8725       0.6955       1.3570
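    The codegen difference can be illustrated in plain C (not UPC): for 
    upc_forall(j=0; j<N; j++; j), the integer affinity expression j means 
    iteration j is executed by thread (j % THREADS). The THREADS and N values 
    below are made up for the demonstration:

    ```c
    #include <stdio.h>
    #include <assert.h>

    enum { THREADS = 4, N = 10 };  /* illustrative values only */

    /* Naive translation: every thread walks all N loop headers and tests
     * affinity on each one; only the owning thread runs the body. */
    static int naive_translation(int mythread, int *iters, int *niters) {
        int headers = 0;
        *niters = 0;
        for (int j = 0; j < N; j++) {
            headers++;
            if (j % THREADS == mythread)
                iters[(*niters)++] = j;
        }
        return headers;  /* always N headers per thread */
    }

    /* Optimized translation: start at MYTHREAD and stride by THREADS, so a
     * thread executes only the headers for its own iterations. */
    static int strided_translation(int mythread, int *iters, int *niters) {
        int headers = 0;
        *niters = 0;
        for (int j = mythread; j < N; j += THREADS) {
            headers++;
            iters[(*niters)++] = j;
        }
        return headers;  /* roughly N/THREADS headers per thread */
    }

    int main(void) {
        for (int t = 0; t < THREADS; t++) {
            int a[N], b[N], na, nb;
            int h1 = naive_translation(t, a, &na);
            int h2 = strided_translation(t, b, &nb);
            assert(na == nb);              /* same amount of real work */
            for (int k = 0; k < na; k++)
                assert(a[k] == b[k]);      /* identical iteration sets */
            printf("thread %d: naive headers=%d, strided headers=%d\n",
                   t, h1, h2);
        }
        return 0;
    }
    ```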
    Finally, because your benchmark performs only local computation, you can 
    and should apply a privatization optimization to the compute loop, thereby 
    avoiding the overhead of pointer-to-shared arithmetic in the inner loops. 
    For example:
             /* Original version: shared-pointer arithmetic on every access */
             /* start timer */
             upc_forall (j=0; j<N; j++; j)
                 c[j] = a[j];
             /* end timer */

             /* Privatized version: cast away the shared qualifier to get
              * plain local pointers to this thread's slice of each array
              * (element MYTHREAD is this thread's first element under the
              * default cyclic layout). */
             double * const l_a = (double *)&a[MYTHREAD];
             double * const l_b = (double *)&b[MYTHREAD];
             double * const l_c = (double *)&c[MYTHREAD];
             int const localsz = (N + OFFSET) / THREADS; /* elements per thread */
             /* start timer */
             for (j=0; j < localsz; j++)
                 l_c[j] = l_a[j];
             /* end timer */
    This privatization optimization is very important for performance, and UPC 
    compiler optimizers will eventually apply it automatically (at least in 
    simple cases like this benchmark, where the data locality can be 
    statically analyzed). In the meantime, however, good UPC programmers know 
    it often needs to be done by hand. With this optimization applied, all 
    shared-pointer manipulations are removed from the critical path, there's 
    essentially no difference between the UPC inner loop and an equivalent 
    inner loop in pure C, and performance approaches the hardware peak:
    Function      Rate (MB/s)   RMS time     Min time     Max time
    Assignment:  1934.3749       0.1489       0.0827       0.2014
    Scaling   :  1603.0944       0.1928       0.0998       0.3800
    Summing   :  1998.5663       0.2246       0.1201       0.2900
    SAXPYing  :  1736.3372       0.2279       0.1382       0.2925
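    To see why the privatized loop is equivalent, note that under the default 
    cyclic layout (block size of one element), global element i lives on 
    thread i % THREADS at local offset i / THREADS, so local index j on 
    thread t maps back to global index j*THREADS + t. This plain-C check (not 
    UPC, with illustrative THREADS/N values) confirms both loops visit the 
    same elements:

    ```c
    #include <stdio.h>
    #include <assert.h>

    enum { THREADS = 4, N = 16 };  /* illustrative values only */

    int main(void) {
        int by_forall[N] = {0}, by_private[N] = {0};

        /* upc_forall(j=0; j<N; j++; j): thread j % THREADS handles j. */
        for (int t = 0; t < THREADS; t++)
            for (int j = 0; j < N; j++)
                if (j % THREADS == t)
                    by_forall[j]++;

        /* Privatized loop: each thread walks its N/THREADS local elements;
         * local index j on thread t is global index j*THREADS + t. */
        int localsz = N / THREADS;
        for (int t = 0; t < THREADS; t++)
            for (int j = 0; j < localsz; j++)
                by_private[j * THREADS + t]++;

        for (int i = 0; i < N; i++)
            assert(by_forall[i] == 1 && by_private[i] == 1);
        printf("both loops touch each of the %d elements exactly once\n", N);
        return 0;
    }
    ```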
    An updated tarball is attached with all the changes described above.
    Incidentally, the allocation of your arrays is not quite right: it 
    essentially allocates a shared array of 1-byte blocks, which is not what 
    you want for doubles. (upc_all_alloc takes the number of blocks as its 
    first argument and the bytes per block as its second.) It happens to work 
    in most cases on Berkeley UPC, but to ensure correct behavior you should 
    really change this:
         a = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1);
    to this:
         a = (shared double *) upc_all_alloc((N + OFFSET), sizeof(double));
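    The difference matters because of how block-cyclic distribution works: 
    for block size B, byte (or element) i of the array is owned by thread 
    (i / B) % THREADS. With 1-byte blocks, the 8 bytes of a single double 
    straddle several threads. A plain-C illustration (not UPC, with an 
    assumed THREADS value and sizeof(double) == 8):

    ```c
    #include <stdio.h>
    #include <assert.h>

    enum { THREADS = 4 };  /* illustrative value only */

    /* Owner of byte i under block-cyclic layout with blocks of B bytes. */
    static int owner(long i, long B) {
        return (int)((i / B) % THREADS);
    }

    int main(void) {
        /* Bytes 0..7 make up the array's first double. */
        /* Wrong call (1-byte blocks): those bytes straddle all 4 threads. */
        assert(owner(0, 1) == 0);
        assert(owner(1, 1) == 1);
        assert(owner(7, 1) == 3);
        /* Corrected call (sizeof(double)-byte blocks): all on one thread. */
        for (long b = 0; b < 8; b++)
            assert(owner(b, 8) == 0);
        printf("blocksize 8: double 0 wholly on thread %d\n", owner(0, 8));
        return 0;
    }
    ```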
    Finally, if you're using Berkeley UPC 2.2, you can drop all that 
    gettimeofday() timer code and instead use bupc_tick_t (see our User 
    Guide), which gives you direct, easy access to the hardware cycle 
    counters.
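    A minimal sketch of that timing style (this is a Berkeley UPC extension; 
    the header and conversion-function names here are recalled from the User 
    Guide and should be verified against your release):

    ```c
    #include <stdio.h>
    #include <upc.h>
    #include <bupc_extensions.h>  /* assumed header for the bupc_* extensions */

    bupc_tick_t t0 = bupc_ticks_now();   /* raw hardware cycle counter */
    /* ... timed loop ... */
    bupc_tick_t t1 = bupc_ticks_now();
    printf("elapsed: %llu us\n",
           (unsigned long long)bupc_ticks_to_us(t1 - t0));
    ```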
    Hope this helps...
    At 01:14 PM 8/31/2005, Christian Terboven wrote:
    >Hi all.
    >First, I hope this is the right place to put my question. If not, please
    >give me a pointer.
    >I am pretty new to UPC and currently comparing the application of UPC to
    >some of our user's codes with OpenMP and PThreads. Targeting a shared
    >memory parallelization, I experience really bad performance with UPC.
    >For example, let's look at this code snippet from the STREAM benchmark:
    >shared double *a;
    >shared double *c;
    >a = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1);
    >c = (shared double *) upc_all_alloc((N + OFFSET) * sizeof(double), 1);
    >         upc_forall (j=0; j<N; j++; j)
    >             c[j] = a[j];
    >On a 4x-Opteron system running Linux with Berkeley-UPC-2.0.1 I get about
    >130 MB/s running with one thread, but about 2000 MB/s running compiled
    >with GNU C++ or Intel's C++ (OpenMP). When using more threads, it gets worse.
    >What's wrong? What can I do to improve the performance of my UPC codes?
    >Best regards,
    >Christian Terboven
