From: Dan Bonachea (bonachea_at_cs_dot_berkeley_dot_edu)
Date: Wed Jun 08 2005 - 00:34:06 PDT
At 04:07 PM 6/7/2005, Sinan Al-Saffar wrote:
>Hi Dan,
>
>Looking at Wei-Yu's report on the Berkeley source-to-source compiler for
>UPC, one concludes that the coalescing optimization is included in that
>compiler:
>http://upc.lbl.gov/publications/wychen-master-report.pdf
>Whereas from my testing of CG and other benchmarks, loops with
>fine-grained accesses like the ones on page 19 of that report do not get
>optimized into memgets. See the non-optimized CG results here (there is
>a graph on that page):
>http://hermes.circ.gwu.edu/cgi-bin/wa?A2=ind0504&L=upc&F=&S=&P=786
>Other researchers I have talked to have had similar results.
>
>So my question is: does the Berkeley UPC compiler perform coalescing and
>prefetching optimizations? And if not, why did the optimizations
>described in Wei-Yu's report not make it into the final release? How
>difficult do you think it would be to add them?
>I think their addition would be very important, since there is little
>advantage to writing UPC apps in a non-shared-memory style, which is
>what one currently has to do to get good performance. Automatic
>coalescing and prefetching can reduce the need for a programmer to do
>memgets himself.
>
>Thanks in advance, and I hope you're enjoying your summer!
>
>Sinan
>
>PS. I also emailed Wei Yu to see if he has some input on this.

Hi Sinan -

I believe Wei is out of town this week.

The answer to your question is that Berkeley UPC does not perform any
UPC-level optimizations in the publicly available release *yet*. We have
been developing UPC-level static optimizations internally for quite some
time now (they are the basis of Wei's papers), and they should begin to
appear in the public releases starting with the next major release of
Berkeley UPC. As you might guess, however, parallel compiler optimization
is a very complex problem, so this is only the beginning of the story -
it is one of our major areas of ongoing research, and the Berkeley UPC
optimizer will be updated as the research progresses.

However, it's important to note that if you want competitive performance
on distributed-memory machines (where communication can be very
expensive), application programmers still need to *think* about data
locality in the critical paths of the application. UPC frees you from the
tedium of message passing and enables an incremental approach to
application tuning (so programmers only need to focus on the critical
loops), and compiler optimizations can perform fine-grained
transformations to automatically aggregate and schedule communication.
The optimizer is not magic, however: no amount of static compiler
optimization can turn a fundamentally poorly written UPC program into a
high-performance distributed-memory program (for an appropriate
definition of "poorly written"). UPC programmers still need to understand
the parallel layout of their critical application data structures and
have some idea of where and when communication should occur along the
critical paths. Applications whose critical loops perform complicated
fine-grained communication in a shared-memory style are likely to suffer
a performance hit on distributed-memory hardware, unless the optimizer
can figure out what the code is doing and reschedule the communication
(the P.S. below gives small sketches of these patterns).

Hope this helps..
Dan
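
P.S. To make the coalescing discussion concrete, here is a minimal
sketch of the two versions of a remote-read loop. The array name, block
size, and function names are made up for illustration - this is not code
from Wei-Yu's report or from CG:

#include <upc_relaxed.h>   /* UPC with relaxed shared references */

#define BLOCK 256   /* elements with affinity to each thread (made-up size) */

/* Blocked shared array: thread t has affinity to elements
   [t*BLOCK .. t*BLOCK + BLOCK - 1]. */
shared [BLOCK] double a[BLOCK * THREADS];

double local_copy[BLOCK];

/* Fine-grained version: each iteration dereferences the shared array,
   which on distributed-memory hardware becomes a separate small network
   get - exactly the pattern the coalescing optimization targets. */
void read_fine_grained(int remote_thread)
{
    for (int i = 0; i < BLOCK; i++)
        local_copy[i] = a[remote_thread * BLOCK + i];
}

/* Hand-coalesced version: a single upc_memget replaces BLOCK individual
   remote reads with one bulk transfer. */
void read_coalesced(int remote_thread)
{
    upc_memget(local_copy, &a[remote_thread * BLOCK],
               BLOCK * sizeof(double));
}

On shared-memory hardware the two versions perform comparably; on
distributed-memory hardware the fine-grained loop pays a network
round-trip per element, which is why the optimization (or the
hand-written memget) matters so much.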
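
The locality point can be sketched the same way: when a critical loop is
written so that each thread iterates only over data it has affinity to,
no communication occurs at all (again, names and sizes are illustrative
only):

#include <upc_relaxed.h>

#define BLOCK 256   /* made-up block size, as above */

shared [BLOCK] double x[BLOCK * THREADS];
shared [BLOCK] double y[BLOCK * THREADS];

/* Locality-aware critical loop: the affinity expression &x[i] makes each
   thread execute only the iterations whose data it owns, so every access
   in the body is local and generates no network traffic. */
void scale(double alpha)
{
    upc_forall (int i = 0; i < BLOCK * THREADS; i++; &x[i])
        y[i] = alpha * x[i];
}

upc_forall's affinity expression is the simplest way to keep a critical
loop communication-free when the data layout permits it - which is why
understanding the layout of your critical data structures matters
regardless of what the optimizer can do.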