Re: upc_all_reduce behaviour

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Apr 16 2010 - 10:29:14 PDT

  • Next message: Nikita Andreev: "Re: upc_all_reduce behaviour"
    Nikita,
    The sync flags tell one how soon a thread is PERMITTED to leave a 
    collective operation.  Any implementation is free to be more 
    conservative.  As an extreme example, it would be perfectly legal to 
    ignore the sync flags and implement all collectives as if passed 
    UPC_IN_ALLSYNC|UPC_OUT_ALLSYNC.  This is true of ANY collective: the 
    sync flags BOUND when a thread may exit but do not DEFINE it.
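    To make the "bound, not definition" point concrete, here is a minimal
    sketch in plain C (not UPC).  The enum, its numeric ordering, and the
    name sync_is_conforming are invented for illustration and are not part
    of the UPC collectives API.  The only assumption is that the modes can
    be ranked NOSYNC < MYSYNC < ALLSYNC by strictness.

```c
#include <assert.h>

/* Hypothetical model, NOT the real UPC flag values: the three sync
 * modes ranked from weakest to strictest. */
enum sync_mode { SYNC_NONE = 0, SYNC_MY = 1, SYNC_ALL = 2 };

/* Returns 1 if an implementation that actually applies `applied`
 * honors a caller's request for `requested`.  It may be stricter
 * than what was asked for, but never weaker. */
int sync_is_conforming(enum sync_mode requested, enum sync_mode applied)
{
    assert((int)requested <= SYNC_ALL && (int)applied <= SYNC_ALL);
    return (int)applied >= (int)requested;
}
```

    In this model, applying SYNC_ALL is acceptable for every request, which
    is the "extreme example" above, while applying SYNC_MY to a request for
    SYNC_ALL is not.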
    If, for instance, the prefix reduction is computed by a Gather of all 
    the values to Thread 0, which then does all the arithmetic and sends out 
    the results, then one would expect Thread 0 to leave LAST rather than first.
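    The gather-based scheme can be sketched as a sequential model in plain
    C.  This is only an illustration of one possible algorithm, not how any
    particular UPC runtime implements prefix reduction.  The function name
    gather_prefix_sum, the exit_order array, and the restriction to one
    value per thread with addition as the operator are all assumptions made
    for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential model of a gather-based prefix sum, one value per thread.
 * "Thread 0" gathers src[0..nthreads-1], computes every partial sum,
 * sends dst[i] to thread i, and only then is finished itself.
 * exit_order[k] records which thread leaves k-th in this model. */
void gather_prefix_sum(const long *src, long *dst, int *exit_order,
                       size_t nthreads)
{
    assert(nthreads > 0);

    /* Thread 0 does all the arithmetic after the gather. */
    long running = 0;
    for (size_t i = 0; i < nthreads; i++) {
        running += src[i];
        dst[i] = running;   /* partial result destined for thread i */
    }

    /* Threads 1..n-1 may leave as soon as their result arrives;
     * thread 0 is still busy sending, so it leaves last. */
    size_t k = 0;
    for (size_t i = 1; i < nthreads; i++)
        exit_order[k++] = (int)i;
    exit_order[k] = 0;
}
```

    With four threads holding 1, 2, 3, 4 the partial results are 1, 3, 6,
    10, and the modeled exit order is 1, 2, 3, 0.  A chain-style algorithm
    in which thread n waits on thread n-1 would give a different order,
    which is why the exit order cannot be inferred from the sync flags.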
    Nikita Andreev wrote:
    > Paul,
    > I'm asking this question because I'm developing a performance 
    > optimization tool and want to know where delays may pop up in UPC 
    > collective operations. Almost all of the operations are quite clear 
    > to me, including reduce. The one I'm not completely sure about is 
    > upc_all_prefix_reduce: I need to know the order in which threads 
    > leave it. With ALLSYNC and NOSYNC everything is obvious. I'd like to 
    > know what happens with MYSYNC. Correct me if I'm wrong.
    > Since every thread n depends on the result from thread n-1, with 
    > MYSYNC synchronization the threads will exit serially: thread 0 
    > first, then thread 1, 2, etc. Am I right?
    > Regards,
    > Nikita
    > ----- Original Message ----- From: "Paul H. Hargrove" 
    > <PHHargrove_at_lbl_dot_gov>
    > To: "Nikita Andreev" <>
    > Cc: <upc-users_at_lbl_dot_gov>
    > Sent: Monday, April 12, 2010 2:05 AM
    > Subject: Re: upc_all_reduce behaviour
    >> Nikita,
    >> I assume you are asking regarding the example on page 20 of the 
    >> Collective spec.
    >> I just looked at it and agree that it is slightly broken with respect 
    >> to result and B.
    >> One should change this example to make "result" shared and pass its 
    >> address instead of "B":
    >> #include <upc.h>
    >> #include <upc_collective.h>
    >> #define BLK_SIZE 3
    >> #define NELEMS 10
    >> shared [BLK_SIZE] long A[NELEMS*THREADS];
    >> shared long result;
    >> int main(void) {
    >>     // Initialize A. The result below is defined only on thread 0.
    >>     upc_barrier;
    >>     upc_all_reduceL( &result, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
    >>                      NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
    >>     upc_barrier;
    >>     return 0;
    >> }
    >> And the comment "defined only on thread 0" was meant to convey that 
    >> only B[0] is defined, but my change has just eliminated B from the 
    >> example.
    >> For your other questions:
    >> 1. Distributions always begin on thread 0 when ALLOCATED.  However, 
    >> one can pass the collective reduce operations a pointer to any 
    >> element of the array as the starting point of the reduction.  This is 
    >> what figure 7 is trying to convey.
    >> 2. The number of communications involved is not defined by the 
    >> specification.  There are many different algorithms one could use 
    >> internally that may vary in the number and size of communications, 
    >> so there is no single answer to this question.
    >> 3. Only the element at *dst is set to the reduction over all elements 
    >> - one scalar output.  However the "prefix_reduce" operation produces 
    >> as its output an entire array (of same length as src) of partial 
    >> results.  I don't have the book handy for comparison, but the figure 
    >> you have reproduced appears to me to be showing neither reduce nor 
    >> prefix-reduce.
    >> -Paul
    >> Nikita Andreev wrote:
    >>> Hi Paul,
    >>> Sorry for spamming the list, but I've got another question. I'm 
    >>> reading UPC Collective Operations Specifications 1.0 at the 
    >>> moment, and the upc_all_reduce section with its example confuses 
    >>> me a bit.
    >>> Questions that immediately come to mind:
    >>> 1. What is the point of the 'result' variable if it's not used 
    >>> anywhere?
    >>> 2. Why is B a pointer? It has no memory allocated to it, so it 
    >>> will certainly end up in a segmentation fault.
    >>> I assume these are just typos. The more interesting questions are:
    >>> 1. Why does the distribution of array 'D' in figure 7 start from 
    >>> thread T1? I always thought that all distributions start from 
    >>> thread 0.
    >>> 2. If nelems/blk_size/THREADS > 1.0 (meaning that one or more 
    >>> threads receive more than one block of the array), how many 
    >>> one-sided communications will the reduce involve? One root<-thread 
    >>> communication with each thread (so all blocks are packed into 
    >>> one get), or one get for each block of each thread?
    >>> 3. Does upc_all_reduce always end up with one value on one 
    >>> thread (the thread that dst has affinity to), or may it result in 
    >>> one value on each thread? I believe it is one value on one thread. 
    >>> But I took a look at the book "UPC: Distributed Shared Memory 
    >>> Programming" and found an example (attached) where it works as in 
    >>> the second case. I suppose they just confused everything in this 
    >>> example.
    >>> Could you clarify this, Paul?
    >>> Thank you for your time,
    >>> Nikita
    >>> ------------------------------------------------------------------------ 
    >> -- 
    >> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >> Future Technologies Group                 Tel: +1-510-495-2352
    >> HPC Research Department                   Fax: +1-510-486-6900
    >> Lawrence Berkeley National Laboratory
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
