From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Apr 16 2010 - 10:29:14 PDT
Nikita,

The sync flags tell one how soon a thread is PERMITTED to leave a collective
operation. Any implementation is free to be more conservative. As an extreme
example, it would be perfectly legal to implement collectives that ignore the
sync flags and behave as if every call passed UPC_IN_ALLSYNC|UPC_OUT_ALLSYNC.
This is true of ANY collective - the sync flags BOUND when a thread may exit
but do not DEFINE it.

If, for instance, the prefix reduction is computed by a Gather of all the
values to Thread 0, which then does all the arithmetic and sends out the
results, then one would expect Thread 0 to leave LAST rather than first.

-Paul

Nikita Andreev wrote:
> Paul,
>
> I'm asking this question since I'm developing a performance optimization
> tool and want to know where delays may pop up in UPC collective operations.
> In fact, everything is quite clear for almost all operations, including
> reduce. In particular, I'm not completely sure about upc_all_prefix_reduce.
> I need to know the order in which threads leave prefix_reduce. In the case
> of ALLSYNC and NOSYNC everything is obvious. I'd like to know what happens
> in the case of MYSYNC. Correct me if I'm wrong.
>
> Since every thread n depends on the result from thread n-1, with MYSYNC
> synchronization the threads will exit serially: thread 0 first, then
> thread 1, 2, etc. Am I right?
>
> Regards,
> Nikita
>
> ----- Original Message -----
> From: "Paul H. Hargrove" <PHHargrove_at_lbl_dot_gov>
> To: "Nikita Andreev" <[email protected]>
> Cc: <upc-users_at_lbl_dot_gov>
> Sent: Monday, April 12, 2010 2:05 AM
> Subject: Re: upc_all_reduce behaviour
>
>> Nikita,
>>
>> I assume you are asking about the example on page 20 of the Collective
>> spec. I just looked at it and agree that it is slightly broken with
>> respect to result and B. One should change this example to make "result"
>> shared and pass its address instead of "B":
>>
>>   #define BLK_SIZE 3
>>   #define NELEMS 10
>>   shared [BLK_SIZE] long A[NELEMS*THREADS];
>>   shared long result;
>>   // Initialize A. The result below is defined only on thread 0.
>>   upc_barrier;
>>   upc_all_reduceL( &result, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
>>                    NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
>>   upc_barrier;
>>
>> And the comment "defined only on thread 0" was meant to convey that only
>> B[0] is defined, but my change has just eliminated B from the example.
>>
>> For your other questions:
>>
>> 1. Distributions always begin on thread 0 when ALLOCATED. However, one
>> can pass the collective reduce operations a pointer to any element of the
>> array as the starting point of the reduction. This is what figure 7 is
>> trying to convey.
>> 2. The number of communications involved is not defined by the
>> specification. There are many different algorithms one could use
>> internally that may vary in the number and size of communications. So
>> there is no single answer to this question.
>> 3. Only the element at *dst is set to the reduction over all elements -
>> one scalar output. However, the "prefix_reduce" operation produces as its
>> output an entire array (of the same length as src) of partial results. I
>> don't have the book handy for comparison, but the figure you have
>> reproduced appears to me to be showing neither reduce nor prefix-reduce.
>>
>> -Paul
>>
>> Nikita Andreev wrote:
>>> Hi Paul,
>>> Sorry for spamming the list. But I've got another question.
>>> I'm reading the UPC Collective Operations Specification 1.0 at the
>>> moment, and the upc_all_reduce section with its example confuses me a
>>> bit. Questions that immediately come to my mind:
>>> 1. What is the point of the 'result' variable if it's not used anywhere?
>>> 2. Why is B a pointer? It has no memory allocated to it, so it will
>>> certainly end up in a segmentation fault.
>>> I assume these are just typos. More interesting things are:
>>> 1. Why does the distribution of array 'D' in figure 7 start from thread
>>> T1? I always thought that all distributions start from thread 0.
>>> 2. If nelems/blk_size/THREADS > 1.0 (meaning that one or more threads
>>> receive more than one block of the array), then how many one-sided
>>> communications will reduce incorporate? One root<-thread communication
>>> with each thread (so all blocks will be packed into one get) or one get
>>> for each thread's block?
>>> 3. Does upc_all_reduce always end up with one value on one thread (the
>>> thread which dst has affinity to), or may it result in one value on each
>>> thread? I believe it is one value on one thread. But I took a look into
>>> the book "UPC: Distributed Shared Memory Programming" and found the
>>> example (find it attached) where it works as in the second case. But I
>>> suppose they just confused everything in this example.
>>> Could you clarify this, Paul?
>>> Thank you for your time,
>>> Nikita
>>
>> --
>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>> Future Technologies Group                 Tel: +1-510-495-2352
>> HPC Research Department                   Fax: +1-510-486-6900
>> Lawrence Berkeley National Laboratory
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
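
For readers following along, here is a minimal, self-contained sketch of the
prefix reduce being discussed, assuming the standard UPC 1.2 collectives
interface (upc_all_prefix_reduceL from <upc_collective.h>). The array names
A and P, the sizes, and the use of UPC_ADD are illustrative choices, not code
from the thread. It shows the UPC_IN_MYSYNC | UPC_OUT_MYSYNC case Nikita asks
about: as Paul explains above, the flags only bound when a thread may exit,
so the sketch does not rely on any particular exit order.

  #include <upc.h>
  #include <upc_collective.h>

  #define BLK_SIZE 3
  #define NELEMS   10

  /* src and dst of a prefix reduce use the same blocking factor. */
  shared [BLK_SIZE] long A[NELEMS*THREADS];
  shared [BLK_SIZE] long P[NELEMS*THREADS];

  int main(void) {
      int i;

      /* Each thread initializes only the elements it has affinity to,
       * so UPC_IN_MYSYNC is enough on entry: the collective may touch a
       * thread's data only after that thread has entered the call. */
      upc_forall (i = 0; i < NELEMS*THREADS; i++; &A[i])
          A[i] = 1;

      /* The output is an entire array of partial results, one per input
       * element: P[i] = A[0] + A[1] + ... + A[i].  The MYSYNC flags bound
       * when a thread MAY leave; the actual exit order depends on the
       * algorithm inside the implementation. */
      upc_all_prefix_reduceL(P, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
                             NULL, UPC_IN_MYSYNC | UPC_OUT_MYSYNC);

      /* With UPC_OUT_MYSYNC a thread may only rely on its own part of P
       * when it returns; synchronize before reading the whole array. */
      upc_barrier;

      if (MYTHREAD == 0 && P[NELEMS*THREADS - 1] != NELEMS*THREADS)
          return 1;  /* with all-ones input the last prefix is the total */
      return 0;
  }

The trailing upc_barrier supplies the all-thread synchronization the MYSYNC
flags omit; an implementation that conservatively treats MYSYNC as ALLSYNC,
which Paul notes is legal, simply makes that barrier redundant.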