From: María J. Martín (maria.martin.santamaria_at_udc.es)
Date: Thu Dec 03 2009 - 05:44:33 PST

"[ ]" specifies an indefinite block size: all the array elements have affinity to the same thread. In the case of upc_alloc, the space allocated always has the indefinite block size. If the shared array is declared with indefinite block size, pointer-to-shared arithmetic on it gives the same result as arithmetic on normal C pointers.

As regards the performance optimization: a generic pointer-to-shared contains three fields: thread, block address and phase. When performing pointer arithmetic on a pointer-to-shared, all three fields have to be updated, which makes the operation slower than private pointer arithmetic. The Berkeley UPC Compiler implements an optimization called "phaseless" pointers for the common special cases of cyclic and indefinite pointers. Cyclic pointers have a block size of one, and their phase is thus always zero; indefinite pointers have a block size of zero, and their phase is also defined to be zero since all elements belong to the same UPC thread. Cyclic and indefinite pointers are thus "phaseless", and the compiler exploits this knowledge to schedule more efficient operations for them (see http://www.gwu.edu/~upc/publications/performance.pdf for more details).

Regards,

María

On 03/12/2009, at 11:37, Oliver Perks wrote:

> Thank you for your reply.
> This works much better. I had actually "fixed" the problem by using:
>
> shared [UPC_MAX_BLOCK_SIZE] int *shared * a;
>
> Your solution provides much better performance, so thank you, but I
> am still confused as to what this then uses as the block size?
>
> Regards
> Oliver
>
> María J. Martín wrote:
>> The pointer a is declared incorrectly.
>>
>> Try:
>>
>> shared[] int *shared * a;
>>
>> a = (shared[] int *shared *)upc_all_alloc(10,sizeof(shared int*));
>>
>> Regards,
>>
>> María
>>
>> On 02/12/2009, at 11:25, Oliver Perks wrote:
>>
>>> I have been trying to get a simple example working whereby a 2D
>>> array is striped across multiple processors, with each column
>>> placed on a different processor in a round-robin fashion.
>>> I assumed that this would be achieved by the code provided by Ben,
>>> but the results suggest otherwise. Can anyone shed some light on
>>> what I would have considered a rather simple problem?
>>>
>>> shared int *shared * a;
>>>
>>> a = (shared int *shared *)upc_all_alloc(10,sizeof(shared int*));
>>> upc_forall(int i = 0; i < 10; i++; i)
>>> {
>>>     a[i] = upc_alloc(10*sizeof(shared int));
>>>     for(int j = 0; j < 10; j++)
>>>     {
>>>         a[i][j] = i * j;
>>>         printf("Owner of %d - %d is %d\n", i, j, upc_threadof(&a[i][j]));
>>>     }
>>> }
>>> return 0;
>>>
>>> When run on 2 threads, I would expect this to put even columns on
>>> thread 0 and odd columns on thread 1, with each column entirely
>>> contained within its thread:
>>>
>>> 0 1 0 1 0 1 .....
>>> 0 1 0 1 0 1 .....
>>> 0 1 0 1 0 1 .....
>>> 0 1 0 1 0 1 .....
>>> 0 1 0 1 0 1 .....
>>> . . . . . . .
>>>
>>> But what I actually get is that it is striping each column over the
>>> processors:
>>>
>>> 0 1 0 1 0 1 .....
>>> 1 0 1 0 1 0 .....
>>> 0 1 0 1 0 1 .....
>>> 1 0 1 0 1 0 .....
>>> 0 1 0 1 0 1 .....
>>> . . . . . . .
>>>
>>> Any ideas?
>>> Regards
>>> Oliver