Re: upc_all_reduce behaviour

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sun Apr 11 2010 - 12:05:20 PDT

  • Next message: Reinhold Bader: "upc_threadof() inconsistency?"
    I assume you are asking regarding the example on page 20 of the 
    Collective spec.
    I just looked at it and agree that it is slightly broken with respect to 
    "result" and "B": "result" is never used, and "B" is a pointer that is 
    never given any storage.
    One should change this example to make "result" shared and pass its 
    address instead of "B":
    #define BLK_SIZE 3
    #define NELEMS 10
    shared [BLK_SIZE] long A[NELEMS*THREADS];
    shared long result;
    // Initialize A. The result below is defined only on thread 0.
    upc_all_reduceL( &result, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
                       NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
    The comment "defined only on thread 0" was meant to convey that only 
    B[0] is defined, but my change eliminates B from the example entirely.
    For your other questions:
    1. Distributions always begin on thread 0 when ALLOCATED.  However, one 
    can pass the collective reduce operations a pointer to any element of 
    the array as the starting point of the reduction.  This is what figure 7 
    is trying to convey.
    2. The number of communications involved is not defined by the 
    specification.  There are many different algorithms an implementation 
    could use internally, and they may vary in the number and size of 
    communications.  So there is no single answer to this question.
    3. Only the element at *dst is set to the reduction over all elements: 
    one scalar output.  However, the "prefix_reduce" operation produces as 
    its output an entire array (of the same length as src) of partial results.  
    I don't have the book handy for comparison, but the figure you have 
    reproduced appears to me to show neither reduce nor prefix-reduce.
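    To make points 1 and 3 concrete, here is a minimal sketch in plain C 
    (not UPC, so it runs on a single process).  It only models the 
    semantics described above.  The names owner_thread, model_reduce_add 
    and model_prefix_reduce_add are hypothetical helpers invented for this 
    illustration, the "threads" parameter stands in for THREADS, and the 
    prefix result is assumed to be inclusive (element i covers src[0] 
    through src[i]).

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C model, not UPC: which thread has affinity to element `index`
   of an array laid out like `shared [blk_size] T a[...]`. The layout is
   block-cyclic and starts on thread 0. */
size_t owner_thread(size_t index, size_t blk_size, size_t threads)
{
    assert(blk_size > 0 && threads > 0);
    return (index / blk_size) % threads;
}

/* Model of a reduce with UPC_ADD: exactly one scalar is written to *dst. */
void model_reduce_add(long *dst, const long *src, size_t nelems)
{
    long sum = 0;
    for (size_t i = 0; i < nelems; i++)
        sum += src[i];
    *dst = sum;
}

/* Model of a prefix reduce with UPC_ADD: dst gets nelems partial sums,
   assuming the inclusive form dst[i] = src[0] + ... + src[i]. */
void model_prefix_reduce_add(long *dst, const long *src, size_t nelems)
{
    long running = 0;
    for (size_t i = 0; i < nelems; i++) {
        running += src[i];
        dst[i] = running;
    }
}
```

    For instance, with a block size of 3 and 4 threads, 
    owner_thread(4, 3, 4) is 1, so a reduction whose src points at 
    element 4 of such an array starts on thread 1 even though the array 
    itself was laid out starting on thread 0.  For the input {1, 2, 3, 4, 5}, 
    the reduce model yields the single value 15, while the prefix model 
    yields the array {1, 3, 6, 10, 15}.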
    Nikita Andreev wrote:
    > Hi Paul,
    > Sorry for spamming the list, but I've got another question. I'm 
    > reading the UPC Collective Operations Specifications 1.0 at the moment, 
    > and the upc_all_reduce section with its example confuses me a bit.
    > Questions that immediately come to my mind:
    > 1. What is the point of the 'result' variable if it's not used anywhere?
    > 2. Why is B a pointer? It has no memory allocated to it, so it will 
    > certainly end up in a segmentation fault.
    > I assume these are just typos. More interesting things are:
    > 1. Why, in figure 7, does the distribution of array 'D' start from 
    > thread T1? I always thought that all distributions start from thread 0.
    > 2. If nelems/blk_size/THREADS > 1.0 (meaning that one or more threads 
    > receive more than one block of the array), then how many one-sided 
    > communications will the reduce involve? One root<-thread 
    > communication with each thread (so all blocks will be packed into one 
    > get) or one get for each thread's block?
    > 3. Does upc_all_reduce always end up with one value on one thread 
    > (the thread that dst has affinity to), or may it result in one value on 
    > each thread? I believe it is one value on one thread. But I took a 
    > look at the book "UPC: Distributed Shared Memory Programming" and found 
    > an example (find it attached) where it works as in the second case. 
    > But I suppose they just confused everything in this example.
    > Could you clarify this, Paul?
    > Thank you for your time,
    > Nikita
    > ------------------------------------------------------------------------
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
