Re: upc_all_reduce behaviour


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sat Apr 17 2010 - 14:18:27 PDT

Next message: Debabrata Midya: "Re: UPC on Windows"

Previous message: Nikita Andreev: "Re: upc_all_reduce behaviour"
In reply to: Nikita Andreev: "Re: upc_all_reduce behaviour"

Nikita,

Unfortunately, the person most familiar with the collectives 
implementation in Berkeley UPC is a student who graduated recently.
Yili Zheng (also on this list) and I are the "next best" choices, so I'd 
suggest you direct your questions to this list.

If your hope is to ask in what order the threads leave the collectives, 
I am going to have to disappoint you again.  The algorithms used for the 
collectives differ according to the sizes and the sync flags, and 
in some cases according to the network being used (some networks have 
hardware support we can leverage).  Additionally, if one sets the appropriate 
environment variable we can perform online auto-tuning in which the 
first collective call with a given set of arguments will time several 
possible algorithms and pick the fastest one for use in all subsequent 
calls that have the same arguments (sizes, sync flags and affinity of 
destination).
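
To make the auto-tuning idea concrete, here is a minimal sketch in plain C. 
It is NOT the Berkeley UPC code: every name in it (sim_autotuned_sum, 
sim_tuning_runs, the two candidate routines) is hypothetical, the 
"collective" is just a local sum, and the cache key is reduced to 
(size, flags).  It does not model the environment variable at all.  It only 
shows the shape: time each candidate on the first call with a given key, 
remember the winner, and reuse it afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical candidate "algorithms": two ways to sum an array. */
typedef long (*sum_fn)(const long *, size_t);

static long sum_forward(const long *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

static long sum_pairwise(const long *v, size_t n)
{
    if (n == 0) return 0;
    if (n == 1) return v[0];
    size_t half = n / 2;
    return sum_pairwise(v, half) + sum_pairwise(v + half, n - half);
}

static sum_fn const candidates[] = { sum_forward, sum_pairwise };
#define NCAND (sizeof candidates / sizeof candidates[0])
#define CACHE_SLOTS 16

/* One cache entry per distinct argument set (here: size and flags only). */
struct tuned { int used; size_t nelems; int flags; sum_fn best; };
static struct tuned cache[CACHE_SLOTS];

/* Counts how many times the timing pass actually ran. */
int sim_tuning_runs = 0;

long sim_autotuned_sum(const long *v, size_t n, int flags)
{
    struct tuned *slot = NULL;

    /* Reuse the earlier choice if this argument set was seen before. */
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].used && cache[i].nelems == n && cache[i].flags == flags)
            return cache[i].best(v, n);
        if (!cache[i].used && slot == NULL)
            slot = &cache[i];
    }

    /* First call with these arguments: time every candidate once. */
    assert(NCAND > 0);
    sum_fn best = candidates[0];
    clock_t best_time = 0;
    volatile long sink = 0;   /* keeps the timed calls from being elided */
    for (size_t c = 0; c < NCAND; c++) {
        clock_t start = clock();
        sink = candidates[c](v, n);
        clock_t elapsed = clock() - start;
        if (c == 0 || elapsed < best_time) {
            best = candidates[c];
            best_time = elapsed;
        }
    }
    (void)sink;
    sim_tuning_runs++;

    /* Remember the winner; if the cache is full we simply re-tune next time. */
    if (slot != NULL) {
        slot->used = 1;
        slot->nelems = n;
        slot->flags = flags;
        slot->best = best;
    }
    return best(v, n);
}
```

Which candidate wins depends on the timing of that first call, so two runs 
of the same program may settle on different algorithms.  That is one more 
reason the exit order of threads cannot be predicted from the outside.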

In the particular case of the Reduce, I am pretty sure we are still 
using the very naive (and non-scalable) algorithm I alluded to in my 
previous email: Gather to the thread with affinity to the destination 
and let it do all the work.  In the case of a blocksize > 1, the 
individual threads will perform the reduction over their block before 
the Gather.
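
As an illustration only, the two steps can be simulated serially in plain 
C.  This is a sketch, not the Berkeley UPC implementation: the function 
name sim_reduce_sum and the SIM_MAX_THREADS limit are hypothetical, the 
"threads" are just array slots, and the operation is fixed to a sum.  It 
assumes the usual UPC block-cyclic layout, where element i has affinity to 
thread (i / blocksize) % THREADS, and lets each thread reduce everything 
it owns before the gather.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on simulated threads for this sketch. */
#define SIM_MAX_THREADS 64

/* Serial simulation of "local reduce, then gather to one thread". */
long sim_reduce_sum(const long *src, size_t nelems,
                    size_t blk_size, size_t nthreads)
{
    long partial[SIM_MAX_THREADS] = {0};

    assert(blk_size > 0 && nthreads > 0 && nthreads <= SIM_MAX_THREADS);

    /* Step 1: every thread reduces the elements it has affinity to
       (block-cyclic layout: owner = (i / blk_size) % nthreads). */
    for (size_t i = 0; i < nelems; i++) {
        size_t owner = (i / blk_size) % nthreads;
        partial[owner] += src[i];
    }

    /* Step 2: the partial results are gathered to the thread with
       affinity to the destination, which combines them on its own. */
    long result = 0;
    for (size_t t = 0; t < nthreads; t++)
        result += partial[t];
    return result;
}
```

The second loop is the non-scalable part: one thread touches one value per 
participating thread, so its work grows linearly with the thread count.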

I think PrefixReduce is similar, with per-block reductions followed by a 
gather to thread 0.  Thread 0 performs a prefix-reduction over the 
block-reduced results and then Scatters those values.  The final step is 
to combine the scattered values with the individual elements within each 
block to arrive at the full results.
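
The same steps can be sketched serially in plain C.  Again this is only an 
illustration under assumptions, not the actual implementation: the name 
sim_prefix_reduce_sum and the SIM_MAX_BLOCKS limit are hypothetical, the 
operation is fixed to a sum, and I assume the values thread 0 sends back 
are exclusive prefixes of the block sums (the sum of all earlier blocks), 
which is one natural choice but not something I have checked in the code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on blocks for this sketch. */
#define SIM_MAX_BLOCKS 256

/* Serial simulation of: per-block reduce, gather, prefix over the block
   results at "thread 0", scatter, then combine within each block. */
void sim_prefix_reduce_sum(long *dst, const long *src,
                           size_t nelems, size_t blk_size)
{
    long block_sum[SIM_MAX_BLOCKS] = {0};
    long offset[SIM_MAX_BLOCKS] = {0};

    assert(blk_size > 0);
    size_t nblocks = (nelems + blk_size - 1) / blk_size;
    assert(nblocks <= SIM_MAX_BLOCKS);

    /* Step 1: each owner reduces its own block. */
    for (size_t i = 0; i < nelems; i++)
        block_sum[i / blk_size] += src[i];

    /* Step 2: block sums are gathered to thread 0, which computes an
       exclusive prefix over them (assumption: exclusive, not inclusive). */
    long running = 0;
    for (size_t b = 0; b < nblocks; b++) {
        offset[b] = running;
        running += block_sum[b];
    }

    /* Step 3: offsets are scattered back; each owner combines its offset
       with the individual elements of its block. */
    for (size_t b = 0; b < nblocks; b++) {
        size_t begin = b * blk_size;
        size_t end = begin + blk_size < nelems ? begin + blk_size : nelems;
        long acc = offset[b];
        for (size_t i = begin; i < end; i++) {
            acc += src[i];
            dst[i] = acc;
        }
    }
}
```

With src = {1,2,3,4,5,6,7} and a block size of 3, the block sums are 
6, 15 and 7, the scattered offsets are 0, 6 and 21, and dst ends up as 
{1,3,6,10,15,21,28}.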

Of course, the Gather and Scatter in the Reduce and PrefixReduce are 
themselves subject to multiple possible implementations, depending on 
their arguments.

It is also worth mentioning that all of this is subject to change with 
each software release - for instance tree-based reductions are a very 
likely candidate for our Nov 2010 release.

-Paul

Nikita Andreev wrote:
>> collectives as if passed UPC_IN_ALLSYNC|UPC_OUT_ALLSYNC.  This is 
>> true of ANY collective - the sync flags BOUND when a thread may exit 
>> but do not DEFINE it.
>
> That's a perfect remark. I didn't think about it in the first place.
>
>> If, for instance, the prefix reduction is computed by a Gather of all 
>> the values to Thread 0, which then does all the arithmetic and sends 
>> out the results, then one would expect Thread 0 to leave LAST rather 
>> than first.
>
> Paul, could you please point me to anyone who is particularly familiar 
> with the current Berkeley implementation of UPC collectives?
>
> Nikita
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
