Re: PGAS performance issues

From: Dan Bonachea (bonachea_at_cs_dot_berkeley_dot_edu)
Date: Tue May 05 2009 - 18:04:48 PDT

    Hi Andreev -
    
    Don't know if you ever got an answer to this, but in looking at the code the 
    problem is fairly obvious - your UPC code is "locality aware" in allocation, 
    but not in reference. Your inner loop performs repeated fine-grained remote 
    access to b[], which will lead to inefficient and redundant communication on 
    distributed systems. This still falls in the realm of what we could call a 
    "naive" UPC program that is not correctly optimized for a distributed-memory 
    environment. Ideally a very smart compiler would transform this loop to 
    aggregate and coalesce the communication, but that's a rather sophisticated 
    transformation that BUPC will not perform by default. UPC programmers who want 
    performance on distributed systems would typically use upc_memget to fetch a 
    local copy of b[] outside the loop, then operate on that.
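    
    For concreteness, here is a minimal sketch of that pattern for a row-wise 
    matrix-vector multiply. The names, layout, and sizes are illustrative 
    guesses on my part, not taken from your attachment:
    
    #include <upc.h>
    
    #define N 5000
    
    /* Illustrative layout: one row of a[] per block, so row i has affinity 
       to thread i % THREADS, and the cyclic c[] puts c[i] on that same 
       thread. b[] is declared with block size N so the whole vector has 
       affinity to thread 0, which upc_memget requires of its source. */
    shared [N] double a[N][N];
    shared [N] double b[N];
    shared double c[N];
    
    void matvec(void)
    {
        static double b_local[N];   /* private copy of the vector */
        int i, j;
    
        /* One ~40 KB bulk transfer per thread instead of N fine-grained 
           remote reads per row in the inner loop below. */
        upc_memget(b_local, b, N * sizeof(double));
    
        /* Each thread computes only the rows it has affinity to. */
        upc_forall (i = 0; i < N; i++; &a[i][0]) {
            double sum = 0.0;
            for (j = 0; j < N; j++)
                sum += a[i][j] * b_local[j];   /* all-local accesses */
            c[i] = sum;
        }
    }
    
    The single upc_memget replaces the N remote reads per row (N*N total) 
    with one bulk get per thread, which is the difference that matters on 
    Ethernet.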
    
    Even with that change, one would still expect this code to perform well below 
    peak without some standard (serial) cache tiling optimizations for local 
    accesses.
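    
    For example, one can tile the j loop so that a cache-sized chunk of the 
    local b copy is reused across all rows while it is hot; the tile size 
    below is an assumed starting point to tune, not a measured value:
    
    #define JTILE 1024   /* 8 KB of doubles per tile -- tune per cache */
    
    /* Tiled serial matvec over purely local (private) data. */
    void matvec_tiled(int nrows, int ncols,
                      const double *a,   /* nrows x ncols, row-major */
                      const double *b, double *c)
    {
        int i, j, jj;
        for (i = 0; i < nrows; i++)
            c[i] = 0.0;
        for (jj = 0; jj < ncols; jj += JTILE) {
            int jend = (jj + JTILE < ncols) ? jj + JTILE : ncols;
            for (i = 0; i < nrows; i++) {
                double sum = 0.0;
                for (j = jj; j < jend; j++)
                    sum += a[i * ncols + j] * b[j];
                c[i] += sum;   /* accumulate this tile's partial dot product */
            }
        }
    }
    
    Whether this pays off depends on the cache hierarchy: at 5000 doubles, 
    b is about 40 KB, just over a typical 32 KB L1.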
    
    Looking at the MPI code, you've replicated the b[] array there (whereas it's 
    distributed in UPC), which is a data layout difference that makes this an 
    apples-to-oranges comparison. The MPI code's timed region also uses a (badly) 
    hand-rolled gather collective to collect the c[] results onto processor 0, 
    which the UPC code does not do. Incidentally, I believe the MPI code also has 
    a buffer overrun bug on c[] and fails to compute the correct result, and the 
    code that handles sub-second time results is incorrect...
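    
    For what it's worth, the standard tools for both of those are MPI_Gatherv 
    and MPI_Wtime. A minimal sketch, assuming a row-block distribution of c[] 
    (the function and variable names are mine, not from your attachment):
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    
    /* Gather each rank's block of c[] onto rank 0 and time it portably. 
       MPI_Wtime returns seconds as a double, so sub-second intervals need 
       no special handling. Assumes N elements block-distributed over 
       'size' ranks, with the remainder spread across the low ranks. */
    void gather_c(int N, const double *c_local, double *c_full)
    {
        int rank, size, r;
        int *counts, *displs;
        double t0, t1;
    
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
    
        counts = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
        for (r = 0; r < size; r++) {
            counts[r] = N / size + (r < N % size);  /* elements on rank r */
            displs[r] = r ? displs[r - 1] + counts[r - 1] : 0;
        }
    
        t0 = MPI_Wtime();
        /* c_full must hold all N doubles on the root -- an undersized 
           buffer here is exactly the kind of overrun described above. */
        MPI_Gatherv((void *)c_local, counts[rank], MPI_DOUBLE,
                    c_full, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        t1 = MPI_Wtime();
    
        if (rank == 0)
            printf("Gather time = %f s\n", t1 - t0);
        free(counts);
        free(displs);
    }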
    
    Hope this helps..
    -D
    
    At 08:50 PM 6/5/2008, Andreev Nikita wrote:
    >Hi, UPC users.
    >
    >Recently I set up GCC UPC 4.0.3.5 + Berkeley UPC runtime 2.6.0. On SMP
    >UPC performs quite well, but I ran into some issues on a commodity
    >cluster. It is composed of 3 Dell Optiplex 755 workstations, each with
    >a dual-core Intel E6550 processor and 2 GB of RAM, connected through a
    >1 Gb Ethernet interconnect. Altogether this gives me 33.5 GFlops in
    >High-Performance Linpack, which is 60% of theoretical peak. Since 60%
    >is roughly the performance high-water mark for 1 Gb Ethernet, that is
    >quite good.
    >
    >I ran matrix-vector multiplication written in UPC and in pure MPI, and
    >here are the results (I ran 1 thread on each dual-core processor to
    >avoid shared-memory effects; the matrix size was 5000x5000):
    >
    >---Locality aware UPC (UDP conduit)
    >1 thread:
    >Thread 0/1. Calculation time = 7.189.
    >
    >2 threads:
    >Thread 1/2. Calculation time = 769.64940.
    >Thread 0/2. Calculation time = 768.100.
    >
    >3 threads:
    >Thread 2/3. Calculation time = 643.65514.
    >Thread 0/3. Calculation time = 639.65502.
    >Thread 1/3. Calculation time = 646.76.
    >
    >---Locality aware UPC (MPICH 2 conduit)
    >1 thread:
    >Thread 0/1. Calculation time = 7.268.
    >
    >2 threads:
    >Thread 0/2. Calculation time = 822.125.
    >Thread 1/2. Calculation time = 822.124.
    >
    >3 threads:
    >Thread 0/3. Calculation time = 627.65186.
    >Thread 2/3. Calculation time = 434.65123.
    >Thread 1/3. Calculation time = 431.155.
    >
    >---Pure MPI
    >1 thread:
    >Thread 0/1. Calculation time = 0.71.
    >
    >2 threads:
    >Thread 0/2. Calculation time = 0.61.
    >Thread 1/2. Calculation time = 0.60.
    >
    >3 threads:
    >Thread 2/3. Calculation time = 0.55.
    >Thread 1/3. Calculation time = 0.34.
    >Thread 0/3. Calculation time = 0.64679.
    >
    >So we observe a roughly 1000x performance degradation with UPC in
    >spite of the locality awareness (the matrix is distributed by rows,
    >not round-robin).
    >
    >I tried to discuss this issue with Intrepid (the GCC UPC developers),
    >but it most likely depends on the runtime, which is not their
    >responsibility.
    >
    >I suppose this happens due to shared-pointer overhead. I know from the
    >article "Performance Monitoring and Evaluation of a UPC Implementation
    >on a NUMA Architecture" that a local shared read of a single data
    >element causes the processor to perform 20 private reads and 6 private
    >writes. But the overhead we see here is incredibly high.
    >
    >Can somebody confirm or deny my observations?
    >
    >Regards,
    >Nikita Andreev.
    >
    >P.S.:
    >1. Code for the MPI and UPC versions is included in the attachment.
    >2. UPC compile flags: /opt/upc-runtime/bin/upcc -network=udp|mpi
    >--shared-heap=2000 -T=1|2|3 main_upc.c
    >3. UPC run flags: UPC_NODES=server1,server2,server3
    >/opt/upc-runtime/bin/upcrun a.out
    >4. MPI compile flags: /usr/bin/mpicc main_mpi.c
    >5. MPI run flags: MPICH_CH3CHANNEL=ssm /usr/bin/mpiexec -machinefile
    >~/mpd.hosts -path ~/work/mpi -n 1|2|3 a.out
    
