From: Andreev Nikita (nik_at_kemsu.ru)
Date: Thu Jun 05 2008 - 20:50:31 PDT
Hi, UPC users.

Recently I set up GCC UPC 4.0.3.5 + Berkeley UPC runtime 2.6.0. On SMP, UPC performs quite well, but I have run into some issues on a commodity cluster. It consists of 3 Dell Optiplex 755 workstations, each with a dual-core Intel E6550 processor and 2GB of RAM, connected through a 1Gb Ethernet interconnect. This setup gives me 33.5 GFlops in High Performance Linpack, which is about 60% of theoretical peak; since roughly 60% is the usual high-water mark for 1Gb Ethernet, that result is quite good.

I ran a matrix-vector multiplication written in UPC and in pure MPI, and here are the results (I ran 1 thread on each dual-core processor to avoid shared-memory effects; the matrix size was 5000x5000):

---Locality-aware UPC (UDP conduit)
1 thread:
Thread 0/1. Calculation time = 7.189.
2 threads:
Thread 1/2. Calculation time = 769.64940.
Thread 0/2. Calculation time = 768.100.
3 threads:
Thread 2/3. Calculation time = 643.65514.
Thread 0/3. Calculation time = 639.65502.
Thread 1/3. Calculation time = 646.76.

---Locality-aware UPC (MPICH 2 conduit)
1 thread:
Thread 0/1. Calculation time = 7.268.
2 threads:
Thread 0/2. Calculation time = 822.125.
Thread 1/2. Calculation time = 822.124.
3 threads:
Thread 0/3. Calculation time = 627.65186.
Thread 2/3. Calculation time = 434.65123.
Thread 1/3. Calculation time = 431.155.

---Pure MPI
1 thread:
Thread 0/1. Calculation time = 0.71.
2 threads:
Thread 0/2. Calculation time = 0.61.
Thread 1/2. Calculation time = 0.60.
3 threads:
Thread 2/3. Calculation time = 0.55.
Thread 1/3. Calculation time = 0.34.
Thread 0/3. Calculation time = 0.64679.

So we observe a performance degradation of roughly 1000x with UPC, in spite of the locality awareness (the matrix is distributed not round-robin, but by rows). I tried to discuss this issue with Intrepid (the GCC UPC developers), but it most likely depends on the runtime, which is outside their responsibility.

I suppose this happens due to shared-pointer overhead. I know from the article "Performance Monitoring and Evaluation of a UPC Implementation on a NUMA Architecture" that a local shared read of a single data element causes the processor to perform about 20 private reads and 6 private writes. But the overhead we see here is incredibly high. Can somebody confirm or deny my observations?

Regards,
Nikita Andreev.

P.S.:
1. The code for MPI and UPC is included in the attachment.
2. UPC compile flags: /opt/upc-runtime/bin/upcc -network=udp|mpi --shared-heap=2000 -T=1|2|3 main_upc.c
3. UPC run flags: UPC_NODES=server1,server2,server3 /opt/upc-runtime/bin/upcrun a.out
4. MPI compile flags: /usr/bin/mpicc main_mpi.c
5. MPI run flags: MPICH_CH3CHANNEL=ssm /usr/bin/mpiexec -machinefile ~/mpd.hosts -path ~/work/mpi -n 1|2|3 a.out
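
P.P.S.: To illustrate the kind of locality-aware layout I mean, here is a simplified sketch (not the attached code; the names N, A, x, y are made up for the example). The idea is to block the shared matrix so that every row lives entirely on one thread, keep a private copy of the vector, and cast each locally-owned row from a shared pointer to an ordinary C pointer, so the inner loop does not pay per-element shared-address arithmetic:

/* Simplified sketch of a locality-aware matrix-vector multiply in UPC.
 * A is blocked one full row per block, so each row has affinity to a
 * single thread; the inner loop then runs over a plain C pointer. */
#include <upc.h>
#include <stdio.h>

#define N 5000

shared [N] double A[N][N];   /* one full row per block */
shared double x[N];
shared double y[N];

int main(void)
{
    static double xloc[N];   /* private copy of the vector, filled once */
    int i, j;

    /* A and x are left zero-initialized here; real code would fill them. */
    for (j = 0; j < N; j++)
        xloc[j] = x[j];

    upc_barrier;

    /* Each thread executes only the iterations whose row is local to it. */
    upc_forall (i = 0; i < N; i++; &A[i][0]) {
        double *row = (double *)&A[i][0];   /* legal: this row has local affinity */
        double sum = 0.0;
        for (j = 0; j < N; j++)
            sum += row[j] * xloc[j];
        y[i] = sum;
    }

    upc_barrier;
    if (MYTHREAD == 0)
        printf("done (%d threads)\n", THREADS);
    return 0;
}

Even with this layout, any remaining per-element accesses through shared pointers (for example to x or y) still go through the runtime's shared-address translation, which is where I suspect the overhead comes from.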