From: Andreev Nikita (nik_at_kemsu.ru)
Date: Thu Jun 05 2008 - 20:50:31 PDT
Hi, UPC users.

Recently I set up GCC UPC 4.0.3.5 + Berkeley UPC runtime 2.6.0. On SMP, UPC performs quite well, but I have run into some issues on a commodity cluster. It consists of 3 Dell Optiplex 755 workstations, each with a dual-core Intel E6550 processor and 2GB of RAM, connected through a 1Gb Ethernet interconnect. This setup gives me 33.5 GFlops in High Performance Linpack, which is about 60% of theoretical peak; since roughly 60% is the usual high-water mark for 1Gb Ethernet, that result is quite good.

I ran a matrix-vector multiplication written in UPC and in pure MPI, and here are the results (I ran 1 thread on each dual-core processor to avoid shared-memory effects; the matrix size was 5000x5000):

---Locality-aware UPC (UDP conduit)
1 thread:
Thread 0/1. Calculation time = 7.189.
2 threads:
Thread 1/2. Calculation time = 769.64940.
Thread 0/2. Calculation time = 768.100.
3 threads:
Thread 2/3. Calculation time = 643.65514.
Thread 0/3. Calculation time = 639.65502.
Thread 1/3. Calculation time = 646.76.

---Locality-aware UPC (MPICH 2 conduit)
1 thread:
Thread 0/1. Calculation time = 7.268.
2 threads:
Thread 0/2. Calculation time = 822.125.
Thread 1/2. Calculation time = 822.124.
3 threads:
Thread 0/3. Calculation time = 627.65186.
Thread 2/3. Calculation time = 434.65123.
Thread 1/3. Calculation time = 431.155.

---Pure MPI
1 thread:
Thread 0/1. Calculation time = 0.71.
2 threads:
Thread 0/2. Calculation time = 0.61.
Thread 1/2. Calculation time = 0.60.
3 threads:
Thread 2/3. Calculation time = 0.55.
Thread 1/3. Calculation time = 0.34.
Thread 0/3. Calculation time = 0.64679.

So we observe a performance degradation of roughly 1000x with UPC, in spite of the locality awareness (the matrix is distributed not round-robin, but by rows). I tried to discuss this issue with Intrepid (the GCC UPC developers), but it most likely depends on the runtime, which is outside their responsibility.

I suppose this happens due to shared-pointer overhead. I know from the article "Performance Monitoring and Evaluation of a UPC Implementation on a NUMA Architecture" that a local shared read of a single data element causes the processor to perform about 20 private reads and 6 private writes. But the overhead we see here is incredibly high. Can somebody confirm or deny my observations?

Regards,
Nikita Andreev.

P.S.:
1. The code for MPI and UPC is included in the attachment.
2. UPC compile flags: /opt/upc-runtime/bin/upcc -network=udp|mpi --shared-heap=2000 -T=1|2|3 main_upc.c
3. UPC run flags: UPC_NODES=server1,server2,server3 /opt/upc-runtime/bin/upcrun a.out
4. MPI compile flags: /usr/bin/mpicc main_mpi.c
5. MPI run flags: MPICH_CH3CHANNEL=ssm /usr/bin/mpiexec -machinefile ~/mpd.hosts -path ~/work/mpi -n 1|2|3 a.out
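
P.P.S.: To illustrate the kind of locality-aware layout I mean, here is a simplified sketch (not the attached code; the names N, A, x, y are made up for the example). The idea is to block the shared matrix so that every row lives entirely on one thread, keep a private copy of the vector, and cast each locally-owned row from a shared pointer to an ordinary C pointer, so the inner loop does not pay per-element shared-address arithmetic:

/* Simplified sketch of a locality-aware matrix-vector multiply in UPC.
 * A is blocked one full row per block, so each row has affinity to a
 * single thread; the inner loop then runs over a plain C pointer. */
#include <upc.h>
#include <stdio.h>

#define N 5000

shared [N] double A[N][N];   /* one full row per block */
shared double x[N];
shared double y[N];

int main(void)
{
    static double xloc[N];   /* private copy of the vector, filled once */
    int i, j;

    /* A and x are left zero-initialized here; real code would fill them. */
    for (j = 0; j < N; j++)
        xloc[j] = x[j];

    upc_barrier;

    /* Each thread executes only the iterations whose row is local to it. */
    upc_forall (i = 0; i < N; i++; &A[i][0]) {
        double *row = (double *)&A[i][0];   /* legal: this row has local affinity */
        double sum = 0.0;
        for (j = 0; j < N; j++)
            sum += row[j] * xloc[j];
        y[i] = sum;
    }

    upc_barrier;
    if (MYTHREAD == 0)
        printf("done (%d threads)\n", THREADS);
    return 0;
}

Even with this layout, any remaining per-element accesses through shared pointers (for example to x or y) still go through the runtime's shared-address translation, which is where I suspect the overhead comes from.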