Re: upc_all_reduce behaviour

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sat Apr 17 2010 - 14:18:27 PDT

    Unfortunately the person most familiar with the collectives 
    implementation in Berkeley UPC is a student who graduated recently.
    Yili Zheng (also on this list) and I are the "next best" choices, so I'd 
    suggest you direct your questions to this list.
    If your hope is to ask in what order the threads leave the collectives, I am 
    going to have to disappoint you again.  The algorithms used for the 
    collectives will differ according to the sizes and the sync flags, and 
    in some cases on the network being used (some networks have hardware 
    support we can leverage).  Additionally, if one sets the appropriate 
    environment variable we can perform online auto-tuning in which the 
    first collective call with a given set of arguments will time several 
    possible algorithms and pick the fastest one for use in all subsequent 
    calls that have the same arguments (sizes, sync flags and affinity of 
    In the particular case of the Reduce, I am pretty sure we are still 
    using the very naive (and non-scalable) algorithm I alluded to in my 
    previous email: Gather to the thread with affinity to the destination 
    and let it do all the work.  In the case of a blocksize > 1, the 
    individual threads will perform the reduction over their block before 
    the Gather.
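
    As a rough illustration of that gather-based Reduce, here is a minimal 
    sketch in plain C rather than UPC. It is not the Berkeley UPC source: 
    the "threads" are simulated with array slots, and the names 
    sim_reduce_add, owner_of and MAX_THREADS are hypothetical. It assumes 
    a sum reduction over a block-cyclic layout in which element i has 
    affinity to thread (i / blk) % nthreads.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 64  /* hypothetical cap, only for this sketch */

/* Thread with affinity to element i under a block-cyclic layout. */
static size_t owner_of(size_t i, size_t blk, size_t nthreads)
{
    return (i / blk) % nthreads;
}

/* Simulated gather-based sum reduction (illustration only). */
long sim_reduce_add(const long *src, size_t nelems,
                    size_t blk, size_t nthreads)
{
    assert(blk > 0 && nthreads > 0 && nthreads <= MAX_THREADS);

    /* Step 1: every thread reduces the elements it has affinity to. */
    long partial[MAX_THREADS] = {0};
    for (size_t i = 0; i < nelems; i++)
        partial[owner_of(i, blk, nthreads)] += src[i];

    /* Step 2: Gather the per-thread partial results to the
     * destination thread (modelled here as a plain copy). */
    long gathered[MAX_THREADS];
    for (size_t t = 0; t < nthreads; t++)
        gathered[t] = partial[t];

    /* Step 3: the destination thread does all the remaining work. */
    long result = 0;
    for (size_t t = 0; t < nthreads; t++)
        result += gathered[t];
    return result;
}
```

    Step 3 is the non-scalable part: the destination thread combines one 
    value per thread serially, so its work grows linearly with the thread 
    count while every other thread is already finished.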
    I think PrefixReduce is similar, with per-block reductions followed by a 
    gather to thread 0.  Thread 0 performs a prefix-reduction over the 
    block-reduced results and then Scatters those values.  The final step is 
    to combine the scattered values with the individual elements within each 
    block to arrive at the full results.
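
    The PrefixReduce steps can be sketched the same way. Again this is 
    plain C with hypothetical names (sim_prefix_reduce_add, MAX_BLOCKS), 
    not the Berkeley UPC code. It assumes a sum and assumes that the 
    value scattered to each block is the combined total of all preceding 
    blocks (an exclusive prefix over the block results), which is one 
    way to make the final combine step work.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 64  /* hypothetical cap, only for this sketch */

/* Simulated block-wise prefix sum (illustration only).
 * On return, dst[i] holds src[0] + ... + src[i]. */
void sim_prefix_reduce_add(long *dst, const long *src,
                           size_t nelems, size_t blk)
{
    assert(blk > 0);
    size_t nblocks = (nelems + blk - 1) / blk;
    assert(nblocks <= MAX_BLOCKS);

    /* Step 1: per-block reductions, done by each block's owner. */
    long block_sum[MAX_BLOCKS] = {0};
    for (size_t i = 0; i < nelems; i++)
        block_sum[i / blk] += src[i];

    /* Step 2: after a Gather, thread 0 runs a prefix reduction over
     * the block results. offset[b] is the total of blocks 0..b-1. */
    long offset[MAX_BLOCKS] = {0};
    long running = 0;
    for (size_t b = 0; b < nblocks; b++) {
        offset[b] = running;
        running += block_sum[b];
    }

    /* Step 3: after a Scatter of offset[], each owner combines its
     * offset with the individual elements inside its block. */
    for (size_t b = 0; b < nblocks; b++) {
        size_t end = (b + 1) * blk < nelems ? (b + 1) * blk : nelems;
        long acc = offset[b];
        for (size_t i = b * blk; i < end; i++) {
            acc += src[i];
            dst[i] = acc;
        }
    }
}
```

    For example, with src = {1, 2, 3, 4, 5} and blk = 2 the block 
    results are 3, 7 and 5, the scattered offsets are 0, 3 and 10, and 
    dst ends up as {1, 3, 6, 10, 15}.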
    Of course the Gather and Scatter in the Reduce and PrefixReduce are 
    subject to the multiple possible implementations depending on their 
    It is also worth mentioning that all of this is subject to change with 
    each software release - for instance tree-based reductions are a very 
    likely candidate for our Nov 2010 release.
    Nikita Andreev wrote:
    >> collectives as if passed UPC_IN_ALLSYNC|UPC_OUT_ALLSYNC.  This is 
    >> true of ANY collective - the sync flags BOUND when a thread may exit 
    >> but do not DEFINE it.
    > That's a perfect remark. I didn't think about it in the first place.
    >> If, for instance, the prefix reduction is computed by a Gather of all 
    >> the values to Thread 0, which then does all the arithmetic and sends 
    >> out the results, then one would expect Thread 0 to leave LAST rather 
    >> than first.
    > Paul, could you please point me to anyone who is particularly familiar 
    > with the current Berkeley implementation of UPC collectives?
    > Nikita
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
