PGAS performance issues

From: Andreev Nikita (nik_at_kemsu.ru)
Date: Thu Jun 05 2008 - 20:50:31 PDT

  • Next message: George Caragea: "UPC cross-compiler for unknown platform"
    Hi, UPC users.
    
    Recently I set up GCC UPC 4.0.3.5 + Berkeley UPC runtime 2.6.0. On SMP
    UPC performs quite well, but I ran into some issues on a commodity
    cluster. It is composed of 3 Dell Optiplex 755 workstations, each with
    a dual-core Intel E6550 processor and 2GB of RAM, connected through a
    1Gb Ethernet interconnect. This setup gives me 33.5 GFlops in High
    Performance Linpack, which is about 60% of theoretical peak. Given
    that roughly 60% efficiency is the practical ceiling for 1Gb Ethernet,
    that is quite good.
    
    I ran a matrix-vector multiplication written in UPC and in pure MPI,
    and here are the results (I ran 1 thread on each dual-core processor
    to avoid shared-memory effects; the matrix size was 5000x5000):
    
    ---Locality-aware UPC (UDP conduit)
    1 thread:
    Thread 0/1. Calculation time = 7.189.
    
    2 threads:
    Thread 1/2. Calculation time = 769.64940.
    Thread 0/2. Calculation time = 768.100.
    
    3 threads:
    Thread 2/3. Calculation time = 643.65514.
    Thread 0/3. Calculation time = 639.65502.
    Thread 1/3. Calculation time = 646.76.
    
    ---Locality-aware UPC (MPICH 2 conduit)
    1 thread:
    Thread 0/1. Calculation time = 7.268.
    
    2 threads:
    Thread 0/2. Calculation time = 822.125.
    Thread 1/2. Calculation time = 822.124.
    
    3 threads:
    Thread 0/3. Calculation time = 627.65186.
    Thread 2/3. Calculation time = 434.65123.
    Thread 1/3. Calculation time = 431.155.
    
    ---Pure MPI
    1 thread:
    Thread 0/1. Calculation time = 0.71.
    
    2 threads:
    Thread 0/2. Calculation time = 0.61.
    Thread 1/2. Calculation time = 0.60.
    
    3 threads:
    Thread 2/3. Calculation time = 0.55.
    Thread 1/3. Calculation time = 0.34.
    Thread 0/3. Calculation time = 0.64679.
    
    So we observe roughly a 1000x performance degradation with UPC, in
    spite of locality awareness (the matrix is distributed by rows rather
    than round-robin).
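
    In case it helps to show what I mean by "locality aware", here is a
    minimal sketch of the layout and kernel I have in mind. This is my own
    reconstruction, not the exact code from the attachment (which may use
    contiguous row blocks instead of one row per block):

      #include <upc_relaxed.h>

      #define N 5000

      /* One whole row per block, so row i has affinity to thread
       * i % THREADS; x and y use the default cyclic layout. */
      shared [N] double A[N][N];
      shared     double x[N], y[N];

      void matvec(void)
      {
          int i, j;

          /* Each thread iterates only over the rows it owns. */
          upc_forall (i = 0; i < N; i++; &A[i][0]) {
              double sum = 0.0;
              for (j = 0; j < N; j++)
                  sum += A[i][j] * x[j];  /* each access goes through a shared pointer */
              y[i] = sum;
          }
          upc_barrier;
      }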
    
    I tried to discuss this issue with Intrepid (the GCC UPC developers),
    but it most likely comes down to the runtime, which is outside their
    responsibility.
    
    I suppose it happens due to shared-pointer overhead. According to the
    article "Performance Monitoring and Evaluation of a UPC Implementation
    on a NUMA Architecture", a local shared read of a single data element
    makes the processor perform about 20 private reads and 6 private
    writes. But the overhead we see here is incredibly high even by that
    measure.
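
    For what it is worth, here is what I would try next (just a sketch
    with the same assumed layout and made-up function name as above, not
    something I have measured yet): cast each locally owned row to an
    ordinary private pointer and copy x into a private buffer once per
    thread, so the inner loop does no shared-pointer arithmetic at all:

      /* Same shared declarations (A, x, y, N) as in the sketch above. */
      void matvec_local(void)
      {
          static double xloc[N];        /* private, per-thread copy of x */
          int i, j;

          for (j = 0; j < N; j++)       /* gather x once, not once per row element */
              xloc[j] = x[j];

          upc_forall (i = 0; i < N; i++; &A[i][0]) {
              /* Inside this iteration &A[i][0] has affinity to MYTHREAD,
               * so casting it to a private pointer is legal in UPC. */
              double *row = (double *)&A[i][0];
              double sum = 0.0;
              for (j = 0; j < N; j++)
                  sum += row[j] * xloc[j];  /* plain private loads */
              y[i] = sum;
          }
          upc_barrier;
      }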
    
    Can somebody confirm or deny my observations?
    
    Regards,
    Nikita Andreev.
    
    P.S.:
    1. The MPI and UPC code is included in the attachment.
    2. UPC compile flags: /opt/upc-runtime/bin/upcc -network=udp|mpi
    --shared-heap=2000 -T=1|2|3 main_upc.c
    3. UPC run flags: UPC_NODES=server1,server2,server3
    /opt/upc-runtime/bin/upcrun a.out
    4. MPI compile flags: /usr/bin/mpicc main_mpi.c
    5. MPI run flags: MPICH_CH3CHANNEL=ssm /usr/bin/mpiexec -machinefile
    ~/mpd.hosts -path ~/work/mpi -n 1|2|3 a.out
    
    
    


