From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sun Apr 11 2010 - 12:05:20 PDT
Nikita,
I assume you are asking regarding the example on page 20 of the
Collective spec.
I just looked at it and agree that it is slightly broken with respect to
result and B.
One should change this example to make "result" shared and pass its
address instead of "B":
#include <upc.h>
#include <upc_collective.h>

#define BLK_SIZE 3
#define NELEMS 10

shared [BLK_SIZE] long A[NELEMS*THREADS];
shared long result;

// Initialize A. The result below is defined only on thread 0.
upc_barrier;
upc_all_reduceL( &result, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
                 NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
upc_barrier;
And the comment "defined only on thread 0" was meant to convey that only
B[0] is defined, but my change has just eliminated B from the example.
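With that change, the single scalar "result" (which has affinity to
thread 0) is the only output. Just as an illustration of mine, not part
of the spec's example (and assuming <stdio.h> is also included), one
could read it back after the second barrier like this:

if (MYTHREAD == 0)
    printf("sum = %ld\n", result);  /* result has affinity to thread 0 */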
For your other questions:
1. Distributions always begin on thread 0 when ALLOCATED. However, one
can pass the collective reduce operations a pointer to any element of
the array as the starting point of the reduction. This is what figure 7
is trying to convey.
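For instance, building on the corrected example above (this is just a
sketch of mine, not something taken from the spec), one could reduce
only the tail of A starting at A[BLK_SIZE]; with THREADS > 1 that
element's block has affinity to thread 1, so the reduction's starting
point is not on thread 0 even though A's distribution begins there:

upc_barrier;
upc_all_reduceL( &result, &A[BLK_SIZE], UPC_ADD,
                 NELEMS*THREADS - BLK_SIZE, BLK_SIZE,
                 NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
upc_barrier;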
2. The number of comms involved is not defined by the specification.
There are many different algorithms one could use internally that may
vary in the number and size of communications. So there is no single
answer to this question.
3. Only the element at *dst is set to the reduction over all elements -
one scalar output. However, the "prefix_reduce" operation produces as
its output an entire array (of the same length as src) of partial results.
I don't have the book handy for comparison, but the figure you have
reproduced appears to me to be showing neither reduce nor prefix-reduce.
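To make the reduce/prefix-reduce distinction in point 3 concrete, here
is a rough sketch reusing the declarations from the example above ("R"
is an extra array I am introducing just for illustration; it is not in
the spec's example):

shared [BLK_SIZE] long R[NELEMS*THREADS];  /* one partial result per source element */

upc_barrier;
/* Afterwards R[i] holds A[0] + A[1] + ... + A[i] for every i, whereas
 * upc_all_reduceL above produced only the single scalar at *dst. */
upc_all_prefix_reduceL( R, A, UPC_ADD, NELEMS*THREADS, BLK_SIZE,
                        NULL, UPC_IN_NOSYNC | UPC_OUT_NOSYNC );
upc_barrier;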
-Paul
Nikita Andreev wrote:
> Hi Paul,
>
> Sorry for spamming the list, but I've got another question. I'm
> reading the UPC Collective Operations Specifications 1.0 at the moment,
> and the upc_all_reduce section with its example confuses me a bit.
>
> Questions that immediately come to mind:
> 1. What is the point of the 'result' variable if it's not used anywhere?
> 2. Why is B a pointer? It has no memory allocated to it, so it will
> certainly end up in a segmentation fault.
>
> I assume those are just typos. The more interesting questions are:
> 1. Why does the distribution of array 'D' in figure 7 start from
> thread T1? I always thought that all distributions start from thread 0.
> 2. If nelems/blk_size/THREADS > 1.0 (meaning that one or more threads
> receive more than one block of the array), then how many one-sided
> communications will the reduce incorporate? One root<-thread
> communication with each thread (so all of its blocks are packed into
> one get), or one get for each of the thread's blocks?
> 3. Does upc_all_reduce always end up with one value on one thread (the
> thread that dst has affinity to), or may it result in one value on
> each thread? I believe it is one value on one thread, but I took a
> look into the book "UPC: Distributed Shared Memory Programming" and
> found an example (attached) where it works as in the second case. I
> suppose they just confused everything in that example.
>
> Could you clarify this, Paul?
>
> Thank you for your time,
> Nikita
>
--
Paul H. Hargrove PHHargrove_at_lbl_dot_gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory