From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sun Apr 11 2010 - 12:05:20 PDT
Nikita,

I assume you are asking about the example on page 20 of the Collective spec. I just looked at it and agree that it is slightly broken with respect to result and B. One should change the example to make "result" shared and pass its address instead of "B":

   #define BLK_SIZE 3
   #define NELEMS 10
   shared [BLK_SIZE] long A[NELEMS*THREADS];
   shared long result;
   // Initialize A. The result below is defined only on thread 0.
   upc_barrier;
   upc_all_reduceL( &result, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
                    NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
   upc_barrier;

And the comment "defined only on thread 0" was meant to convey that only B[0] is defined, but my change has just eliminated B from the example.

For your other questions:

1. Distributions always begin on thread 0 when ALLOCATED. However, one can pass the collective reduce operations a pointer to any element of the array as the starting point of the reduction. This is what figure 7 is trying to convey. (A small sketch of this appears at the end of this message.)

2. The number of communications involved is not defined by the specification. There are many different algorithms one could use internally, and they may vary in the number and size of communications, so there is no single answer to this question.

3. Only the element at *dst is set to the reduction over all elements - one scalar output. The "prefix_reduce" operation, however, produces as its output an entire array (of the same length as src) of partial results. (A second sketch at the end of this message contrasts the two.) I don't have the book handy for comparison, but the figure you have reproduced appears to me to show neither reduce nor prefix-reduce.

-Paul

Nikita Andreev wrote:
> Hi Paul,
>
> Sorry for spamming the list, but I've got another question. I'm reading
> the UPC Collective Operations Specification 1.0 at the moment, and the
> upc_all_reduce section with its example confuses me a bit.
>
> Questions that immediately come to mind:
> 1. What is the point of the 'result' variable if it is not used anywhere?
> 2. Why is B a pointer? It has no memory allocated to it, so it will
> certainly end up in a segmentation fault.
>
> I assume these are just typos. More interesting questions are:
> 1. Why does the distribution of array 'D' in figure 7 start from thread
> T1? I always thought that all distributions start from thread 0.
> 2. If nelems/blk_size/THREADS > 1.0 (meaning that one or more threads
> receive more than one block of the array), then how many one-sided
> communications will the reduce involve? One root<-thread communication
> with each thread (so all of a thread's blocks are packed into one get),
> or one get for each block?
> 3. Does upc_all_reduce always end up with one value on one thread (the
> thread dst has affinity to), or may it result in one value on each
> thread? I believe it is one value on one thread, but I took a look at
> the book "UPC: Distributed Shared Memory Programming" and found an
> example (attached) where it works as in the second case. I suppose they
> just confused everything in that example.
>
> Could you clarify this, Paul?
>
> Thank you for your time,
> Nikita
>
> ------------------------------------------------------------------------

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
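
To make point 1 concrete, here is a minimal sketch (mine, not taken from the spec or figure 7; the choice of &A[3] as the starting element is arbitrary, just to move the start of the reduction off thread 0):

   #include <upc.h>
   #include <upc_collective.h>

   #define BLK_SIZE 3
   #define NELEMS 10

   shared [BLK_SIZE] long A[NELEMS*THREADS];
   shared long result;

   /* ... initialize A ... */

   /* A is allocated starting on thread 0, as always.  But the reduction
      may begin at any element: with BLK_SIZE 3, &A[3] is the first element
      of the block with affinity to thread 1 (when THREADS > 1), so the
      source of this reduction "starts" on thread 1, which is the situation
      figure 7 is getting at.  Only the last NELEMS*THREADS-3 elements of A
      are reduced here. */
   upc_barrier;
   upc_all_reduceL( &result, &A[3], UPC_ADD,
                    NELEMS*THREADS - 3, BLK_SIZE, NULL,
                    UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
   upc_barrier;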
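
And a minimal sketch for point 3, again mine rather than the spec's example; the array P and the scalar total are just illustrative names:

   #include <upc.h>
   #include <upc_collective.h>

   #define BLK_SIZE 3
   #define NELEMS 10

   shared [BLK_SIZE] long A[NELEMS*THREADS];
   shared long total;                          /* reduce: one scalar output, at *dst only     */
   shared [BLK_SIZE] long P[NELEMS*THREADS];   /* prefix_reduce: one partial result per element */

   /* ... initialize A ... */

   upc_barrier;
   /* total = A[0] + A[1] + ... + A[NELEMS*THREADS-1]; only *dst is written. */
   upc_all_reduceL( &total, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
                    NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
   upc_barrier;
   /* P[i] = A[0] + ... + A[i] for every i: an entire array of partial sums. */
   upc_all_prefix_reduceL( P, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
                           NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
   upc_barrier;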