From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sat Apr 17 2010 - 14:18:27 PDT
Nikita,

Unfortunately the person most familiar with the collectives
implementation in Berkeley UPC is a student who graduated recently.
Yili Zheng (also on this list) and I are the "next best" choices, so
I'd suggest you direct your questions to this list.

If your hope is to ask what order the threads leave the collectives, I
am going to have to disappoint you again. The algorithms used for the
collectives differ according to the sizes and the sync flags, and in
some cases according to the network being used (some networks have
hardware support we can leverage). Additionally, if one sets the
appropriate environment variable we can perform online auto-tuning, in
which the first collective call with a given set of arguments will time
several possible algorithms and pick the fastest one for use in all
subsequent calls that have the same arguments (sizes, sync flags and
affinity of destination).

In the particular case of the Reduce, I am pretty sure we are still
using the very naive (and non-scalable) algorithm I alluded to in my
previous email: Gather to the thread with affinity to the destination
and let it do all the work. In the case of a blocksize > 1, the
individual threads perform the reduction over their own block before
the Gather.

I think PrefixReduce is similar, with per-block reductions followed by
a Gather to thread 0. Thread 0 performs a prefix-reduction over the
block-reduced results and then Scatters those values. The final step is
to combine the scattered values with the individual elements within
each block to arrive at the full results.

Of course the Gather and Scatter in the Reduce and PrefixReduce are
themselves subject to the multiple possible implementations, depending
on their arguments.

It is also worth mentioning that all of this is subject to change with
each software release - for instance, tree-based reductions are a very
likely candidate for our Nov 2010 release.

-Paul

Nikita Andreev wrote:
>> collectives as if passed UPC_IN_ALLSYNC|UPC_OUT_ALLSYNC. This is
>> true of ANY collective - the sync flags BOUNDS when a thread may exit
>> but does not DEFINE it.
>
> That's perfect remark. Didn't think about it in the first place.
>
>> If, for instance, the prefix reduction is computed by a Gather of all
>> the values to Thread 0, which then does all the arithmetic and sends
>> out the results, then one would expect Thread 0 to leave LAST rather
>> than first.
>
> Paul, could you please point me to anyone who is particularly familiar
> with current Berkeley implementation of UPC collectives?
>
> Nikita
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
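
For readers of the archive, the PrefixReduce scheme described in this message (per-block reduction, Gather to thread 0, prefix-reduction on thread 0, Scatter, then a per-block combine) can be modelled serially in plain C. This is only a sketch and not the Berkeley UPC code: the function name sim_prefix_reduce_sum, the MAX_THREADS cap, the choice of addition as the operator, the one-block-per-thread layout, and the use of exclusive per-thread offsets in the Scatter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 64   /* hypothetical cap, for this sketch only */

/*
 * Serial model of a blocked inclusive prefix-sum.
 * Assumption: "thread" t owns elements [t*block, (t+1)*block) of src/dst.
 * The loops over t stand in for work that each UPC thread would do itself.
 */
void sim_prefix_reduce_sum(const long *src, long *dst,
                           size_t threads, size_t block)
{
    long partial[MAX_THREADS];  /* models the Gather buffer on thread 0  */
    long offset[MAX_THREADS];   /* models the values Scattered by thread 0 */
    size_t t, i;

    assert(threads > 0 && threads <= MAX_THREADS && block > 0);

    /* Step 1: each thread reduces its own block; results are
     * gathered to thread 0 (modelled by writing into partial[]). */
    for (t = 0; t < threads; t++) {
        long sum = 0;
        for (i = 0; i < block; i++)
            sum += src[t * block + i];
        partial[t] = sum;
    }

    /* Step 2: thread 0 runs a prefix-reduction over the block results.
     * Assumption: offset[t] is the sum of all blocks before thread t. */
    offset[0] = 0;
    for (t = 1; t < threads; t++)
        offset[t] = offset[t - 1] + partial[t - 1];

    /* Step 3: after the Scatter, each thread combines its offset
     * with the individual elements of its own block. */
    for (t = 0; t < threads; t++) {
        long running = offset[t];
        for (i = 0; i < block; i++) {
            running += src[t * block + i];
            dst[t * block + i] = running;
        }
    }
}
```

Called with src = {1, 2, 3, 4, 5, 6}, threads = 3 and block = 2, the model fills dst with 1, 3, 6, 10, 15, 21. Note that all of Step 2 is serial work on thread 0, which is consistent with the earlier remark that thread 0 would be expected to leave such a collective last rather than first.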