From: Dan Bonachea (bonachea_at_cs_dot_berkeley_dot_edu)
Date: Wed Jun 08 2005 - 00:34:06 PDT
At 04:07 PM 6/7/2005, Sinan Al-Saffar wrote:
>Hi Dan,
>
>Looking at Wei-Yu's report on the Berkeley source-to-source compiler for
>UPC, one concludes that the coalescing optimization is included in that
>compiler:
>http://upc.lbl.gov/publications/wychen-master-report.pdf
>Whereas from my testing of CG and other benchmarks, loops with
>fine-grained accesses like the ones on page 19 of that report do not get
>optimized into memgets. See the non-optimized CG results here (there is
>a graph on that page):
>http://hermes.circ.gwu.edu/cgi-bin/wa?A2=ind0504&L=upc&F=&S=&P=786
>Other researchers I have talked to have had similar results.
>
>So my question is: does the Berkeley UPC compiler perform coalescing and
>prefetching optimizations? And if not, why did the optimizations
>described in Wei-Yu's report not make it into the final release? How
>difficult do you think it would be to add them?
>I think their addition would be very important, since there is little
>advantage to writing UPC apps in a non-shared-memory style, which is
>what one currently has to do to get good performance. Automatic
>coalescing and prefetching can reduce the need for a programmer to do
>memgets himself.
>
>Thanks in advance, and I hope you're enjoying your summer!
>
>Sinan
>
>PS. I also emailed Wei Yu to see if he has some input on this.

Hi Sinan -

I believe Wei is out of town this week.

The answer to your question is that Berkeley UPC does not perform any
UPC-level optimizations in the publicly available release *yet*. We have
been developing UPC-level static optimizations internally for quite some
time now (they are the basis of Wei's papers), and they should begin to
appear in the public releases starting with the next major release of
Berkeley UPC. As you might guess, however, parallel compiler optimization
is a very complex problem, so this is only the beginning of the story -
it is one of our major areas of ongoing research, and the Berkeley UPC
optimizer will be updated as the research progresses.

However, it's important to note that if you want competitive performance
on distributed-memory machines (where communication can be very
expensive), application programmers still need to *think* about data
locality in the critical paths of the application. UPC frees you from the
tedium of message passing and enables an incremental approach to
application tuning (so programmers only need to focus on the critical
loops), and compiler optimizations can perform fine-grained
transformations to automatically aggregate and schedule communication.
The optimizer is not magic, however: no amount of static compiler
optimization can turn a fundamentally poorly written UPC program into a
high-performance distributed-memory program (for an appropriate
definition of "poorly written"). UPC programmers still need to understand
the parallel layout of their critical application data structures and
have some idea of where and when communication should occur along the
critical paths. Applications whose critical loops perform complicated
fine-grained communication in a shared-memory style are likely to suffer
a performance hit on distributed-memory hardware, unless the optimizer
can figure out what the code is doing and reschedule the communication
(the P.S. below gives small sketches of these patterns).

Hope this helps..
Dan
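
P.S. To make the coalescing discussion concrete, here is a minimal
sketch of the two versions of a remote-read loop. The array name, block
size, and function names are made up for illustration - this is not code
from Wei-Yu's report or from CG:

#include <upc_relaxed.h>   /* UPC with relaxed shared references */

#define BLOCK 256   /* elements with affinity to each thread (made-up size) */

/* Blocked shared array: thread t has affinity to elements
   [t*BLOCK .. t*BLOCK + BLOCK - 1]. */
shared [BLOCK] double a[BLOCK * THREADS];

double local_copy[BLOCK];

/* Fine-grained version: each iteration dereferences the shared array,
   which on distributed-memory hardware becomes a separate small network
   get - exactly the pattern the coalescing optimization targets. */
void read_fine_grained(int remote_thread)
{
    for (int i = 0; i < BLOCK; i++)
        local_copy[i] = a[remote_thread * BLOCK + i];
}

/* Hand-coalesced version: a single upc_memget replaces BLOCK individual
   remote reads with one bulk transfer. */
void read_coalesced(int remote_thread)
{
    upc_memget(local_copy, &a[remote_thread * BLOCK],
               BLOCK * sizeof(double));
}

On shared-memory hardware the two versions perform comparably; on
distributed-memory hardware the fine-grained loop pays a network
round-trip per element, which is why the optimization (or the
hand-written memget) matters so much.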
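
The locality point can be sketched the same way: when a critical loop is
written so that each thread iterates only over data it has affinity to,
no communication occurs at all (again, names and sizes are illustrative
only):

#include <upc_relaxed.h>

#define BLOCK 256   /* made-up block size, as above */

shared [BLOCK] double x[BLOCK * THREADS];
shared [BLOCK] double y[BLOCK * THREADS];

/* Locality-aware critical loop: the affinity expression &x[i] makes each
   thread execute only the iterations whose data it owns, so every access
   in the body is local and generates no network traffic. */
void scale(double alpha)
{
    upc_forall (int i = 0; i < BLOCK * THREADS; i++; &x[i])
        y[i] = alpha * x[i];
}

upc_forall's affinity expression is the simplest way to keep a critical
loop communication-free when the data layout permits it - which is why
understanding the layout of your critical data structures matters
regardless of what the optimizer can do.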