Re: ping pong in UPC

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 28 2009 - 18:11:42 PDT

    Jose,
    
    Sorry we have not responded sooner.  Your mail arrived while our entire 
    team was involved in an important two-day meeting.
    
    First let me say that what you are trying to compare is a little 
    tricky.  The performance results you see on the GASNet webpages are a 
    comparison of the speed of MPI vs GASNet for *implementing* UPC-like 
    communication patterns over various networks.  That is not quite the 
    same as comparing UPC vs MPI for implementing a given application's 
    communications.
    
    Second let me say that UDP is expected to give better latency 
    performance than MPI when both are running on an Ethernet network, but 
    this assumes the network is "mostly reliable" as is the case with most 
    switched Ethernet networks used in clusters.  However, if run over a 
    wide-area network or with very inexpensive equipment, it is possible 
    that reliability at the TCP level (used indirectly by MPI) may be more 
    efficient than the UDP implementation that GASNet employs.
    
    PLEASE keep in mind that both the MPI and UDP implementations of GASNet 
    exist only for their portability and neither is going to be blindingly 
    fast.  Comparing either of them to some other benchmark may satisfy one's 
    curiosity, but I don't see any deep value in such a comparison.
    
    Benchmarks in general:
    
    In the Berkeley UPC distribution tarball there are upc-tests and 
    upc-examples directories that contain UPC code gathered from many 
    sources.  Among them are several benchmarks, some of which might even be 
    correct ;-).  Have a look at that collection of code, but be aware that 
    we provide it as-is and, since we wrote very little of it, we may not 
    be able to help much.  (We might not even know what some of them do.)
    
    Measuring "latency":
    
    How you define latency will depend on what really matters to your 
    application.  If you want to look (as we have on the GASNet site) at 
    the time required to implement a UPC-level "strict Put" operation, then 
    you are comparing upc_memput() against an MPI Ping-ACK 
    (N-bytes sent, and then wait for a zero-byte reply).  The "ACK" in the 
    MPI test is to allow the sender to know the value has reached remote 
    memory before it can perform the next operation (a part of the 'strict' 
    UPC memory model).  In the GASNet case, the completion of a blocking Put 
    operation uses lower-level acknowledgments when available from a given 
    network API, which is one of the reasons it outperforms MPI Ping-ACK on 
    many high-speed networks.  In the case of UDP, however, no lower-level 
    notification is provided and the comms pattern is pretty much the same 
    as for MPI (a UDP-level ACK sent by the GASNet implementation).
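    
    To make that concrete, here is a minimal sketch of the MPI Ping-ACK 
    side (the UPC side is just a timed loop around upc_memput()).  The 
    ITERS and NBYTES values are arbitrary placeholders, not the settings 
    used on our performance pages; compile with mpicc and run with exactly 
    two ranks:
    
      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>
    
      #define ITERS  1000      /* placeholder iteration count */
      #define NBYTES 8         /* placeholder payload size */
    
      int main(int argc, char **argv) {
          int rank;
          char *payload = calloc(NBYTES, 1);
    
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Barrier(MPI_COMM_WORLD);
    
          if (rank == 0) {
              double t0 = MPI_Wtime();
              for (int i = 0; i < ITERS; i++) {
                  /* "ping": send N bytes of payload */
                  MPI_Send(payload, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                  /* "ACK": a zero-byte reply confirms remote arrival */
                  MPI_Recv(NULL, 0, MPI_BYTE, 1, 1, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
              }
              double t1 = MPI_Wtime();
              printf("Ping-ACK: %.2f us per operation\n",
                     (t1 - t0) * 1e6 / ITERS);
          } else if (rank == 1) {
              for (int i = 0; i < ITERS; i++) {
                  MPI_Recv(payload, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);
                  MPI_Send(NULL, 0, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
              }
          }
    
          MPI_Finalize();
          free(payload);
          return 0;
      }
    
    Each iteration is one ping plus one ACK, so the reported per-operation 
    time is the full round trip, which is what matches the "strict Put" 
    cost discussed above.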
    
    If what you want is a Ping-Pong in which node0 sends a message that 
    requires a reply from node1, then you are trying to measure something 
    quite different from what the GASNet performance webpage shows.  In MPI 
    the idea of waiting for a message arrival is quite natural.  In UPC, 
    however, there is no natural way to wait for "arrival" since there is no 
    "message" concept.  In the Berkeley UPC implementation we address this 
    lack with a "semaphore" extension that you may wish to investigate.  
    Without the semaphore or a similar abstraction for point-to-point 
    ordering, a true Ping-Pong is hard to write portably in UPC (and the 
    portable implementation may be quite inefficient).
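    
    For illustration only, here is roughly what such a portable (and 
    likely inefficient) Ping-Pong could look like, with strict shared 
    flags standing in for the semaphore extension.  The sizes and 
    iteration counts are placeholders, and at least two threads are 
    required:
    
      #include <upc.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/time.h>
    
      #define ITERS  1000      /* placeholder iteration count */
      #define NBYTES 8         /* placeholder payload size */
    
      shared [NBYTES] char buf[THREADS][NBYTES]; /* one buffer per thread */
      strict shared int flag[THREADS];   /* one arrival flag per thread */
    
      int main(void) {
          char local[NBYTES];
          struct timeval t0, t1;
          memset(local, 0, NBYTES);
          if (THREADS < 2) return 1;     /* needs a partner thread */
          upc_barrier;
    
          if (MYTHREAD == 0) {
              gettimeofday(&t0, NULL);
              for (int i = 1; i <= ITERS; i++) {
                  upc_memput(&buf[1][0], local, NBYTES); /* "ping" payload */
                  flag[1] = i;    /* strict write: fences the preceding put */
                  while (flag[0] < i)
                      ;           /* strict reads: wait for the "pong" */
              }
              gettimeofday(&t1, NULL);
              printf("round trip: %.2f us\n",
                     ((t1.tv_sec - t0.tv_sec) * 1e6 +
                      (t1.tv_usec - t0.tv_usec)) / (double)ITERS);
          } else if (MYTHREAD == 1) {
              for (int i = 1; i <= ITERS; i++) {
                  while (flag[1] < i)
                      ;           /* wait for the "ping" */
                  upc_memput(&buf[0][0], local, NBYTES); /* "pong" payload */
                  flag[0] = i;
              }
          }
          upc_barrier;
          return 0;
      }
    
    The spin loops are where the inefficiency (and the portability risk) 
    hides: a given runtime may need explicit polling to make progress 
    there, which is exactly the gap the semaphore extension fills.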
    
    Measuring "bandwidth":
    
    In the case of bandwidth the idea is pretty much the same in MPI and 
    UPC: move data in one direction as fast as possible with a given 
    transfer size.  Again, however, mapping this into MPI and UPC code is 
    different.  In MPI one will use non-blocking sends and receives to get 
    the best possible bandwidth (by overlapping the per-operation overhead 
    with the communication of the previous operations).  In UPC one wants to 
    do the same thing.  In an ideal world the fact that comms in UPC are done 
    at a language level should allow a smart compiler to automatically 
    transform things into non-blocking transfers when this does not change 
    the program semantics.  However, few compilers can do this perfectly 
    (ours can to a limited extent) and even if they could the typical 
    benchmark is transferring to the same destination repeatedly, 
    possibly preventing such a non-blocking transformation by the compiler.  
    So, how does one express EXPLICITLY non-blocking comms in UPC?  Again we 
    have an extension in the Berkeley UPC compiler (proposed as an extension 
    to the UPC language spec) for this purpose.
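    
    As a sketch of what the explicitly non-blocking loop can look like 
    with those extensions (I am assuming the bupc_memput_async() / 
    bupc_waitsync() names and the <bupc_extensions.h> header here; see 
    the second document below for the authoritative interface), keeping 
    DEPTH puts in flight at once:
    
      #include <upc.h>
      #include <bupc_extensions.h> /* Berkeley async-transfer extensions */
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/time.h>
    
      #define DEPTH  16            /* puts kept in flight (placeholder) */
      #define ITERS  1024          /* placeholder iteration count */
      #define NBYTES (64*1024)     /* placeholder transfer size */
    
      shared [NBYTES] char buf[THREADS][NBYTES];
    
      int main(void) {
          char *src = calloc(NBYTES, 1);
          bupc_handle_t h[DEPTH];
          if (THREADS < 2) return 1;
          upc_barrier;
    
          if (MYTHREAD == 0) {
              struct timeval t0, t1;
              for (int d = 0; d < DEPTH; d++)
                  h[d] = BUPC_COMPLETE_HANDLE; /* start with empty slots */
              gettimeofday(&t0, NULL);
              for (int i = 0; i < ITERS; i++) {
                  int slot = i % DEPTH;
                  bupc_waitsync(h[slot]);      /* reclaim the oldest slot */
                  h[slot] = bupc_memput_async(&buf[1][0], src, NBYTES);
              }
              for (int d = 0; d < DEPTH; d++)  /* drain the pipeline */
                  bupc_waitsync(h[d]);
              gettimeofday(&t1, NULL);
              double secs = (t1.tv_sec - t0.tv_sec) +
                            1e-6 * (t1.tv_usec - t0.tv_usec);
              printf("bandwidth: %.1f MB/s\n",
                     (double)ITERS * NBYTES / 1e6 / secs);
          }
          upc_barrier;
          free(src);
          return 0;
      }
    
    Note that this deliberately reuses the same destination buffer; with 
    explicit handles that is fine, but it is exactly the pattern that can 
    defeat the automatic compiler transformation mentioned above.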
    
    Docs:
    For info on the semaphore/signaling-put extensions, see 
    http://upc.lbl.gov/publications/PGAS06-p2p.pdf
    For info on the non-blocking memcpy extensions, see 
    http://upc.lbl.gov/publications/upc_memcpy.pdf
    
    
    I have probably left you with more questions than answers, but hopefully 
    the new questions lead you in the right direction.  If you could 
    describe for us what you think you want to measure, we might be able to 
    provide more useful answers.  However, I will caution you again that the 
    UDP implementation of GASNet exists for portability (not performance) 
    and comparing it to MPI benchmarks is probably of very little value.
    
    -Paul
    
    
    Jose Vicente Espi wrote:
    > Hello,
    >
    > I'm testing the performance of UPC communications on a UDP network, 
    > comparing it with an MPI ping-pong bandwidth/latency test.  But I 
    > didn't get the results that I expected, based on the tests published 
    > at http://gasnet.cs.berkeley.edu/performance/.
    >
    > I'm probably doing something wrong; the functions used for 
    > transferring data are upc_memput and upc_memget.  Only with message 
    > sizes smaller than 512 bytes do I get slightly better performance 
    > than MPI.  For larger message sizes the performance becomes worse.
    >
    > Have you got an example of code for measuring the performance of UPC 
    > vs MPI in a ping-pong bandwidth test?
    >
    > Thanks in advance.
    >
    > Jose Vicente.
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
