From: María J. Martín (maria.martin.santamaria_at_udc.es)
Date: Thu Dec 03 2009 - 05:44:33 PST
"[ ]" specifies an indefinite block size: all the array elements have affinity to the same thread. In the case of upc_alloc, the allocated space always has the indefinite block size. If the shared array is declared with indefinite block size, pointer-to-shared arithmetic behaves identically to normal C pointer arithmetic.

As for the performance optimization: a generic pointer-to-shared contains three fields: thread, block address and phase. Pointer arithmetic on a pointer-to-shared updates all three fields, which makes the operation slower than private pointer arithmetic. The Berkeley UPC Compiler implements an optimization called “phaseless” pointers for the common special case of cyclic and indefinite pointers. Cyclic pointers have a block size of one, so their phase is always zero. Indefinite pointers have a block size of zero, and their phase is also defined to be zero, since all elements belong to the same UPC thread. Cyclic and indefinite pointers are thus “phaseless”, and the compiler exploits this knowledge to schedule more efficient operations for them (see http://www.gwu.edu/~upc/publications/performance.pdf for more details).

Regards,

María

On 03/12/2009, at 11:37, Oliver Perks wrote:

> Thank you for your reply.
> This works much better. I had actually "fixed" the problem by using:
>
> shared [UPC_MAX_BLOCK_SIZE] int *shared * a;
>
> Your solution provides much better performance, so thank you, but I
> am still confused as to what this then uses as the block size?
>
> Regards
> Oliver
>
> María J. Martín wrote:
>> The a pointer is incorrectly declared.
>>
>> Try:
>>
>> shared[] int *shared * a;
>>
>> a = (shared[] int *shared *)upc_all_alloc(10,sizeof(shared int*));
>>
>> Regards,
>>
>> María
>>
>> On 02/12/2009, at 11:25, Oliver Perks wrote:
>>
>>> I have been trying to get a simple example working whereby a 2D
>>> array is striped across multiple processors, with each column
>>> placed on a different processor in a round-robin fashion.
>>> I assumed that this would be achieved by the code provided by Ben,
>>> but the results suggest otherwise. Can anyone shed some light on
>>> what I would have considered a rather simple problem?
>>>
>>> shared int *shared * a;
>>>
>>> a = (shared int *shared *)upc_all_alloc(10,sizeof(shared int*));
>>> upc_forall(int i = 0; i < 10; i++; i)
>>> {
>>>     a[i] = upc_alloc(10*sizeof(shared int));
>>>     for(int j = 0; j < 10; j++)
>>>     {
>>>         a[i][j] = i * j;
>>>         printf("Owner of %d - %d is %d\n", i, j, upc_threadof(&a[i][j]));
>>>     }
>>> }
>>> return 0;
>>>
>>> When run on 2 threads:
>>> I would expect this to put even columns on thread 0, and odd
>>> columns on thread 1, with each column entirely contained
>>> within that thread.
>>>
>>> 0 1 0 1 0 1 .....
>>> 0 1 0 1 0 1 .....
>>> 0 1 0 1 0 1 .....
>>> 0 1 0 1 0 1 .....
>>> 0 1 0 1 0 1 .....
>>> . . . . . . .
>>>
>>> But what I actually get is that it is striping each column over the
>>> processors.
>>>
>>> 0 1 0 1 0 1 .....
>>> 1 0 1 0 1 0 .....
>>> 0 1 0 1 0 1 .....
>>> 1 0 1 0 1 0 .....
>>> 0 1 0 1 0 1 .....
>>> . . . . . . .
>>>
>>> Any ideas?
>>> Regards
>>> Oliver