From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Apr 16 2010 - 10:29:14 PDT
Nikita,

The sync flags tell one how soon a thread is PERMITTED to leave a collective
operation. Any implementation is free to be more conservative. As an extreme
example, it would be perfectly legal to implement collectives that ignore the
sync flags and behave as if every call passed UPC_IN_ALLSYNC|UPC_OUT_ALLSYNC.
This is true of ANY collective - the sync flags BOUND when a thread may exit
but do not DEFINE it.

If, for instance, the prefix reduction is computed by a Gather of all the
values to Thread 0, which then does all the arithmetic and sends out the
results, then one would expect Thread 0 to leave LAST rather than first.

-Paul

Nikita Andreev wrote:
> Paul,
>
> I'm asking this question since I'm developing a performance optimization
> tool and want to know where delays may pop up in UPC collective operations.
> In fact, everything is quite clear for almost all operations, including
> reduce. In particular, I'm not completely sure about upc_all_prefix_reduce.
> I need to know the order in which threads leave prefix_reduce. In the case
> of ALLSYNC and NOSYNC everything is obvious. I'd like to know what happens
> in the case of MYSYNC. Correct me if I'm wrong.
>
> Since every thread n depends on the result from thread n-1, with MYSYNC
> synchronization the threads will exit serially: thread 0 first, then
> thread 1, 2, etc. Am I right?
>
> Regards,
> Nikita
>
> ----- Original Message -----
> From: "Paul H. Hargrove" <PHHargrove_at_lbl_dot_gov>
> To: "Nikita Andreev" <[email protected]>
> Cc: <upc-users_at_lbl_dot_gov>
> Sent: Monday, April 12, 2010 2:05 AM
> Subject: Re: upc_all_reduce behaviour
>
>> Nikita,
>>
>> I assume you are asking about the example on page 20 of the Collective
>> spec. I just looked at it and agree that it is slightly broken with
>> respect to result and B. One should change this example to make "result"
>> shared and pass its address instead of "B":
>>
>>   #define BLK_SIZE 3
>>   #define NELEMS 10
>>   shared [BLK_SIZE] long A[NELEMS*THREADS];
>>   shared long result;
>>   // Initialize A. The result below is defined only on thread 0.
>>   upc_barrier;
>>   upc_all_reduceL( &result, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
>>                    NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
>>   upc_barrier;
>>
>> And the comment "defined only on thread 0" was meant to convey that only
>> B[0] is defined, but my change has just eliminated B from the example.
>>
>> For your other questions:
>>
>> 1. Distributions always begin on thread 0 when ALLOCATED. However, one
>> can pass the collective reduce operations a pointer to any element of the
>> array as the starting point of the reduction. This is what figure 7 is
>> trying to convey.
>> 2. The number of communications involved is not defined by the
>> specification. There are many different algorithms one could use
>> internally that may vary in the number and size of communications. So
>> there is no single answer to this question.
>> 3. Only the element at *dst is set to the reduction over all elements -
>> one scalar output. However, the "prefix_reduce" operation produces as its
>> output an entire array (of the same length as src) of partial results. I
>> don't have the book handy for comparison, but the figure you have
>> reproduced appears to me to be showing neither reduce nor prefix-reduce.
>>
>> -Paul
>>
>> Nikita Andreev wrote:
>>> Hi Paul,
>>> Sorry for spamming the list. But I've got another question.
>>> I'm reading the UPC Collective Operations Specification 1.0 at the
>>> moment, and the upc_all_reduce section with its example confuses me a
>>> bit. Questions that immediately come to my mind:
>>> 1. What is the point of the 'result' variable if it's not used anywhere?
>>> 2. Why is B a pointer? It has no memory allocated to it, so it will
>>> certainly end up in a segmentation fault.
>>> I assume these are just typos. More interesting things are:
>>> 1. Why does the distribution of array 'D' in figure 7 start from thread
>>> T1? I always thought that all distributions start from thread 0.
>>> 2. If nelems/blk_size/THREADS > 1.0 (meaning that one or more threads
>>> receive more than one block of the array), then how many one-sided
>>> communications will reduce incorporate? One root<-thread communication
>>> with each thread (so all blocks will be packed into one get) or one get
>>> for each thread's block?
>>> 3. Does upc_all_reduce always end up with one value on one thread (the
>>> thread which dst has affinity to), or may it result in one value on each
>>> thread? I believe it is one value on one thread. But I took a look into
>>> the book "UPC: Distributed Shared Memory Programming" and found the
>>> example (find it attached) where it works as in the second case. But I
>>> suppose they just confused everything in this example.
>>> Could you clarify this, Paul?
>>> Thank you for your time,
>>> Nikita
>>
>> --
>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>> Future Technologies Group                 Tel: +1-510-495-2352
>> HPC Research Department                   Fax: +1-510-486-6900
>> Lawrence Berkeley National Laboratory
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
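
For readers following along, here is a minimal, self-contained sketch of the
prefix reduce being discussed, assuming the standard UPC 1.2 collectives
interface (upc_all_prefix_reduceL from <upc_collective.h>). The array names
A and P, the sizes, and the use of UPC_ADD are illustrative choices, not code
from the thread. It shows the UPC_IN_MYSYNC | UPC_OUT_MYSYNC case Nikita asks
about: as Paul explains above, the flags only bound when a thread may exit,
so the sketch does not rely on any particular exit order.

  #include <upc.h>
  #include <upc_collective.h>

  #define BLK_SIZE 3
  #define NELEMS   10

  /* src and dst of a prefix reduce use the same blocking factor. */
  shared [BLK_SIZE] long A[NELEMS*THREADS];
  shared [BLK_SIZE] long P[NELEMS*THREADS];

  int main(void) {
      int i;

      /* Each thread initializes only the elements it has affinity to,
       * so UPC_IN_MYSYNC is enough on entry: the collective may touch a
       * thread's data only after that thread has entered the call. */
      upc_forall (i = 0; i < NELEMS*THREADS; i++; &A[i])
          A[i] = 1;

      /* The output is an entire array of partial results, one per input
       * element: P[i] = A[0] + A[1] + ... + A[i].  The MYSYNC flags bound
       * when a thread MAY leave; the actual exit order depends on the
       * algorithm inside the implementation. */
      upc_all_prefix_reduceL(P, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
                             NULL, UPC_IN_MYSYNC | UPC_OUT_MYSYNC);

      /* With UPC_OUT_MYSYNC a thread may only rely on its own part of P
       * when it returns; synchronize before reading the whole array. */
      upc_barrier;

      if (MYTHREAD == 0 && P[NELEMS*THREADS - 1] != NELEMS*THREADS)
          return 1;  /* with all-ones input the last prefix is the total */
      return 0;
  }

The trailing upc_barrier supplies the all-thread synchronization the MYSYNC
flags omit; an implementation that conservatively treats MYSYNC as ALLSYNC,
which Paul notes is legal, simply makes that barrier redundant.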