Re: UPC Benchmarks

From: Steven D. Vormwald (sdvormwa_at_mtu_dot_edu)
Date: Thu Feb 12 2009 - 13:48:10 PST

    Gary Funck wrote:
    > Steven,
    > 
    > A student at MTU, Zhang Zhang, presented some UPC benchmark results
    > back in 2004/2005:
    > http://www.upc.mtu.edu/papers/ZhangIPDPS05.pdf
    > http://upc.gwu.edu/~upc/upcworkshop04/MTU-upcworkshop04.pdf
    > 
    > We looked at his paper and those benchmarks, and noted some
    > methodological errors.  Notably, a buggy version of the NPB
    > benchmark suite developed by GWU was used, which skewed results
    > and led to false indications of failures when run on various
    > platforms.  This led to apparent "no shows" by various compilers.
    > 
    > A couple of years ago, we collected UPC benchmarks from various
    > sources, and re-worked them so that they (1) execute enough
    > iterations to be meaningful on modern hardware, (2) do not print
    > extraneous output during the timed portion of the benchmark,
    > (3) run in a dedicated OS environment (run level 1 on Linux) to
    > avoid timing noise from normal OS activity, and (4) are run enough
    > times to obtain a representative timing sample.  We found that
    > all of these steps were necessary to obtain
    > reasonable timing results.  During that process, we did not attempt
    > to verify that each benchmark measured exactly what it was trying
    > to measure in an effective fashion.  Further, we didn't try to
    > verify that complex benchmarks (like NPB) produced correct results.
    > 
    > I commend Zhang Zhang for advancing knowledge in the area of UPC
    > performance, but given its methodological errors it is unfortunate
    > that his paper is the seminal work in this area.
    > I'd like to see his experiments re-done with the errors corrected,
    > and run against current compilers and runtime systems.
    > 
    > A procedural recommendation: while developing and selecting
    > benchmarks and collecting initial results, I'd encourage
    > that the results be reviewed by each vendor involved, to ensure
    > that the compiler was invoked with appropriate parameters, to
    > give the vendor the opportunity to fix small errors/bugs, and to
    > verify that the benchmarks in fact measure the feature as intended.
    > 
    > - Gary
    
    Gary,
    
    Thank you for your prompt reply.  You raise some valid points about 
    benchmarks in general and the history of benchmarks in UPC.  As I see 
    it, there are two primary reasons for having language benchmarks.  The 
    first is to measure the performance of various implementations of the 
    language, and the points you brought up address this wonderfully.  The 
    second is to give implementation developers and researchers examples of 
    important program behaviors that they can use to model the effect that 
    their "[not-so-]great new optimization" will have on "real applications".
    
    For the work we are doing, the behavior of applications (in particular, 
    the remote memory access patterns) is more important than the 
    efficiency and "correctness" of the implementation of various language 
    features.  We are running the benchmarks on an instrumented version of 
    MuPC that records a trace of all remote memory accesses (and doesn't 
    optimize them away...); the trace is then analyzed offline.  As a 
    result, micro-benchmarks that focus on the performance of a few 
    language features aren't as useful as benchmarks whose remote access 
    behavior is closer to what one would expect to find in a real 
    application.
    
    I'd be happy to get recommendations for algorithms to implement (and 
    also the "proper" way to implement them in UPC) that people think would 
    provide good coverage of actual application behaviors.  At the moment, I 
    have a couple of different simple matrix-multiply programs (a naive 
    element-by-element computation in a upc_forall loop) that differ in the 
    distribution of the shared arrays (checkerboard, cyclic, block-cyclic), 
    and an (equally naive) implementation of the Jacobi method that halts 
    after a given number of iterations instead of when the result converges, 
    though it still does the convergence check.
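
    For concreteness, here is a rough sketch of what the row-oriented
    variant looks like (the size N, the initial values, and the layout
    qualifier are placeholders rather than the exact code I'm running,
    and it assumes a static-THREADS compilation):

    /* The other variants only change the layout qualifier on the
     * shared arrays, e.g. [1] for cyclic or [BLK] for block-cyclic. */
    #include <upc_relaxed.h>

    #define N 256

    shared [N] double A[N][N];   /* [N] = one row per block, so rows  */
    shared [N] double B[N][N];   /* are dealt cyclically to threads   */
    shared [N] double C[N][N];

    int main(void)
    {
        int i, j, k;

        /* each thread fills in the rows it has affinity to */
        upc_forall (i = 0; i < N; i++; &A[i][0]) {
            for (j = 0; j < N; j++) {
                A[i][j] = i + j;
                B[i][j] = i - j;
                C[i][j] = 0.0;
            }
        }
        upc_barrier;

        /* naive element-by-element multiply; iteration i runs on the
         * thread owning row i of C, so the column reads of B are the
         * remote accesses that show up in the trace */
        upc_forall (i = 0; i < N; i++; &C[i][0]) {
            for (j = 0; j < N; j++) {
                double sum = 0.0;
                for (k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
        }
        upc_barrier;

        return 0;
    }

    The cyclic and block-cyclic versions only swap out the layout
    qualifier; the checkerboard version takes a bit more index
    arithmetic, since the layout qualifier only blocks the linearized
    array in one dimension.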
    
    Steven Vormwald
    
