Re: UPC message coalescing optimization

From: Dan Bonachea (bonachea_at_cs_dot_berkeley_dot_edu)
Date: Wed Jun 08 2005 - 00:34:06 PDT

  • Next message: Marc L. Smith: "Makefile bug"
    At 04:07 PM 6/7/2005, Sinan Al-Saffar wrote:
    >Hi Dan,
    >Looking at Wei-Yu's report on the Berkeley source-to-source compiler for UPC 
    >one concludes that the coalescing optimization is included in that compiler.
    >However, in my testing of CG and other benchmarks, loops with 
    >fine-grained accesses like the ones on page 19 of that report do not get 
    >optimized into memgets.
    >See the CG non-optimized results here:
    >there is a graph on that page. Other researchers I have talked to have had 
    >similar results.
    >So my question is: does the Berkeley UPC compiler perform coalescing and 
    >prefetching optimizations? And if not, why did the optimizations mentioned in 
    >Wei-Yu's report not make it into the final release? How difficult do you think 
    >it would be to add them?
    >I think their addition would be very important, since there is little 
    >advantage to writing UPC apps in a non-shared-memory style, which is what one 
    >has to do to get good performance now. Automatic coalescing and prefetching 
    >can reduce the need for a programmer to write memgets himself.
    >Thanks in advance, and I hope you're enjoying your summer!
    >PS. I also emailed Wei Yu to see if he has some input on this.
    Hi Sinan - I believe Wei is out of town this week.
    The answer to your question is that Berkeley UPC does not perform any 
    UPC-level optimizations in the publicly available release *yet*. We internally 
    have been developing UPC-level static optimizations for quite some time now 
    (which are the basis of Wei's papers), and they should begin to appear in the 
    public releases starting with the next major release of Berkeley UPC. However, 
    as you might guess, parallel compiler optimization is a very complex 
    problem, so this is only the beginning of the story - it is one of our major 
    areas of ongoing research, and the Berkeley UPC optimizer will be updated as 
    the research progresses.
    However, it's important to note that if you want competitive performance on 
    distributed-memory machines (where communication can be very expensive), 
    application programmers still need to *think* about data locality in the 
    critical paths of the application - UPC frees you from the tedium of message 
    passing and enables an incremental approach to application tuning (so 
    programmers only need to focus on the critical loops), and compiler 
    optimizations can perform fine-grained transformations to automatically 
    aggregate and schedule communication. However, the optimizer is not magic - no 
    amount of static compiler optimization can turn a fundamentally 
    poorly-written UPC program into a high-performance distributed-memory program 
    (for an appropriate definition of "poorly-written"). UPC programmers still 
    need to understand the parallel layout of their critical application data 
    structures and have some idea where and when communication should be occurring 
    along critical paths - applications whose critical loops perform complicated 
    fine-grained communication in a shared memory style are likely to suffer a 
    performance hit on distributed memory hardware, unless the optimizer can 
    figure out what the code is doing and re-schedule the communication.
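    To make the distinction concrete, here is a minimal UPC sketch (the array 
    name, blocking factor, and sizes are illustrative, not taken from Wei-Yu's 
    report): the first loop can issue one remote get per element, while the 
    hand-coalesced version fetches the same block with a single upc_memget() - 
    exactly the kind of transformation the optimizer aims to apply automatically.

    ```c
    /* Hypothetical UPC sketch: fine-grained remote reads vs. a hand-coalesced
       bulk transfer. Names and sizes are illustrative only. */
    #include <upc.h>

    #define BLOCK 256
    /* Block-cyclic layout: elements [t*BLOCK, (t+1)*BLOCK) have affinity
       to thread t. */
    shared [BLOCK] double a[BLOCK * THREADS];

    void read_fine_grained(double *local) {
        /* Each iteration may become a separate remote get on
           distributed-memory hardware. */
        int t = (MYTHREAD + 1) % THREADS;   /* read a neighbor's block */
        for (int i = 0; i < BLOCK; i++)
            local[i] = a[t * BLOCK + i];
    }

    void read_coalesced(double *local) {
        /* One bulk transfer replaces BLOCK fine-grained gets; this is valid
           because the whole block has affinity to a single thread. */
        int t = (MYTHREAD + 1) % THREADS;
        upc_memget(local, &a[t * BLOCK], BLOCK * sizeof(double));
    }
    ```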
    Hope this helps..
