Re: upc_all_reduce behaviour

From: Nikita Andreev (lestat_at_kemsu.ru)
Date: Fri Apr 16 2010 - 05:42:10 PDT

  • Next message: Paul H. Hargrove: "Re: upc_all_reduce behaviour"
    Paul,
    
    I'm asking this question because I'm developing a performance optimization 
    tool and want to know where delays may pop up in UPC collective operations. 
    Almost all of the operations are clear enough, reduce included. What I'm not 
    completely sure about is upc_all_prefix_reduce: I need to know the order in 
    which threads leave prefix_reduce. For ALLSYNC and NOSYNC the behaviour is 
    obvious; I'd like to know what happens in the case of MYSYNC. Correct me if 
    I'm wrong.
    
    Since every thread n depends on the result from thread n-1, under MYSYNC 
    synchronization the threads will exit serially: thread 0 first, then thread 
    1, thread 2, and so on. Am I right?
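    
    For concreteness, a minimal sketch of the kind of call I have in mind (the 
    array names, the block size and the literal values are illustrative 
    assumptions, not taken from the spec example):
    
    #include <stddef.h>
    #include <upc.h>
    #include <upc_collective.h>
    
    #define NELEMS 4
    
    shared [NELEMS] long src[NELEMS*THREADS];
    shared [NELEMS] long dst[NELEMS*THREADS];
    
    void prefix_sums(void)
    {
        /* Each thread writes only the elements it has affinity to, so no
           barrier is needed before a MYSYNC collective. */
        for (int i = 0; i < NELEMS; i++)
            src[MYTHREAD*NELEMS + i] = 1;
    
        /* With UPC_IN_MYSYNC a thread only promises that its own source data
           is ready when it enters; with UPC_OUT_MYSYNC it only waits until its
           own part of dst has been written before returning.  Since dst[i]
           depends on src[0..i], thread n's last prefix value cannot be
           complete before the contributions of threads 0..n-1 have arrived. */
        upc_all_prefix_reduceL(dst, src, UPC_ADD, NELEMS*THREADS, NELEMS,
                               NULL, UPC_IN_MYSYNC | UPC_OUT_MYSYNC);
    }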
    
    Regards,
    Nikita
    
    ----- Original Message ----- 
    From: "Paul H. Hargrove" <PHHargrove_at_lbl_dot_gov>
    To: "Nikita Andreev" <[email protected]>
    Cc: <upc-users_at_lbl_dot_gov>
    Sent: Monday, April 12, 2010 2:05 AM
    Subject: Re: upc_all_reduce behaviour
    
    
    > Nikita,
    >
    > I assume you are asking regarding the example on page 20 of the Collective 
    > spec.
    > I just looked at it and agree that it is slightly broken with respect to 
    > result and B.
    > One should change this example to make "result" shared and pass its 
    > address instead of "B":
    >
    > #define BLK_SIZE 3
    > #define NELEMS 10
    > shared [BLK_SIZE] long A[NELEMS*THREADS];
    > shared long result;
    > // Initialize A. The result below is defined only on thread 0.
    > upc_barrier;
    > upc_all_reduceL( &result, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
    >                   NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
    > upc_barrier;
    >
    > And the comment "defined only on thread 0" was meant to convey that only 
    > B[0] is defined, but my change has just eliminated B from the example.
    >
    > For your other questions:
    >
    > 1. Distributions always begin on thread 0 when ALLOCATED.  However, one 
    > can pass the collective reduce operations a pointer to any element of the 
    > array as the starting point of the reduction.  This is what figure 7 is 
    > trying to convey.
    > 2. The number of comms involved is not defined by the specification. 
    > There are many different algorithms one could use internally that may vary 
    > in the number and size of communications.  So there is no single answer to 
    > this question.
    > 3. Only the element at *dst is set to the reduction over all elements - 
    > one scalar output.  However the "prefix_reduce" operation produces as its 
    > output an entire array (of same length as src) of partial results.  I 
    > don't have the book handy for comparison, but the figure you have 
    > reproduced appears to me to be showing neither reduce nor prefix-reduce.
    >
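    
    For concreteness, a minimal sketch of what points 1 and 3 above describe 
    (the array names, sizes and ALLSYNC flags are illustrative assumptions, and 
    it assumes at least two threads):
    
    #include <stddef.h>
    #include <upc.h>
    #include <upc_collective.h>
    
    #define BLK 2
    #define N   (BLK*THREADS)
    
    shared [BLK] long A[N];
    shared long total;            /* reduce: a single scalar result        */
    shared [BLK] long prefix[N];  /* prefix_reduce: one result per element */
    
    void demo(void)
    {
        /* Each thread fills its own block of A. */
        for (int i = 0; i < BLK; i++)
            A[MYTHREAD*BLK + i] = 1;
    
        /* Point 1: the reduction need not start at A[0].  Passing &A[BLK]
           starts it at the first element with affinity to thread 1 and covers
           the remaining N-BLK elements (cf. figure 7). */
        upc_all_reduceL(&total, &A[BLK], UPC_ADD, N - BLK, BLK,
                        NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
    
        /* Point 3: the call above leaves one value in 'total' (on the thread
           'total' has affinity to), whereas prefix_reduce writes a partial
           result into every element of 'prefix'. */
        upc_all_prefix_reduceL(prefix, A, UPC_ADD, N, BLK,
                               NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
    }
    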
    > -Paul
    >
    > Nikita Andreev wrote:
    >> Hi Paul,
    >>  Sorry for spamming the list, but I've got another question. I'm reading 
    >> the UPC Collective Operations Specification 1.0 at the moment, and the 
    >> upc_all_reduce section with its example confuses me a bit.
    >>  The questions that immediately come to mind:
    >> 1. What is the point of 'result' variable if it's not used anywhere?
    >> 2. Why is B a pointer? It has no memory allocated to it, so it will 
    >> certainly end up in a segmentation fault.
    >>  I assume these are just typos. The more interesting questions are:
    >> 1. Why does the distribution of array 'D' in figure 7 start from thread 
    >> T1? I always thought that all distributions start from thread 0.
    >> 2. If nelems/blk_size/THREADS > 1.0 (meaning that one or more threads 
    >> receive more than one block of the array), how many one-sided 
    >> communications will reduce involve? One root<-thread communication with 
    >> each thread (so all of a thread's blocks are packed into one get), or 
    >> one get for each thread's block?
    >> 3. Does upc_all_reduce always end up with one value on one thread (the 
    >> thread that dst has affinity to), or may it result in one value on each 
    >> thread? I believe it is one value on one thread. But I took a look at 
    >> the book "UPC: Distributed Shared Memory Programming" and found an 
    >> example (attached) where it works as in the second case. I suppose they 
    >> just confused everything in that example.
    >>  Could you clarify this, Paul?
    >>  Thank you for your time,
    >> Nikita
    >>
    >> ------------------------------------------------------------------------
    >>
    >
    >
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory
    > 
    
