possible upc_memget() and upc_memput() bugs

From: Wes Bethel (ewbethel_at_lbl_dot_gov)
Date: Tue Aug 12 2008 - 14:15:17 PDT

  • Next message: George Caragea: "deadlock possible in mpi-conduit w/ blocking sends"
    Hi,
    
    After working on a couple of UPC-based applications (that are 
    stencil-based image processing algorithms), I've run into what seems to 
    be a bug with upc_memget() that shows up on all test platforms I have 
    access to, and a bug with upc_memput() that shows up on the Cray XT4.
    
    I'm attaching a small program that reproduces the problem, along with 
    some sample output.
    
    My test platforms: UPC 2.6.0, SuSE Linux 10.0, dual-core Opteron 
    workstation; franklin.nersc.gov/Cray XT4; and a nondescript x86/P4 
    laptop running SuSE 9.3.
    
    The enclosed program has lots of comments and should be self-explanatory.
    
    The basic issue is that if I do an element-by-element copy from local to 
      shared global memory, or vice-versa, all is well. If I use 
    upc_memget() to copy from shared global to local memory, results are 
    always corrupt on all platforms. If I use upc_memput() to copy from 
    local to shared global memory, results seem OK on my Linux platforms but 
    corrupt on the Cray.
    
    In the longer run, I'm hoping to implement families of visualization 
    algorithms in UPC and will be making heavy use of upc_memput() and 
    upc_memget() to move data around.
    
    tx,
    wes
    
    -- 
    Wes Bethel -- voice (510) 486-7353 -- fax (510) 486-5812 -- vis.lbl.gov
    
    
    /* 
     * 8/12/2008 W. Bethel, LBNL
     *
     * This simple program uses a combination of element-by-element copies,
     * upc_memget and upc_memput to move data around between local and global
     * arrays. It reveals what seems to be a bug in upc_memget().
     *
     * The basic idea is as follows: 
     * - Each PE will upc_alloc a small array of ints (the count is
     * defined by COUNT_PER_PE).
     * - All PEs write their rank into all entries of their int array.
     * - All PEs copy from their local arrays to the global array using first
     * an element-by-element copy (followed by fprintf of the global array to
     * verify results), then upc_memput() (followed by fprintf of the global
     * array).
     * - Next, PE zero will create a local array that is the same size as
     * the global array, and copy from global to its big local array using
     * an element-by-element copy (followed by fprintf of the big local array),
     * then using upc_memget() (followed by fprintf of the global array).
     *
     * At this time, I am seeing what appear to be corrupt results when using
     * upc_memget() to move data from the global to big local array.
     */
    
    #include <upc.h>
    #include <stdio.h>
    #include <string.h>   /* for memset() */
    
    #define COUNT_PER_PE 4
    
    
    int main(void)
    {
        int *local;
        int myRank = MYTHREAD;
        int i;
        size_t nBytes = sizeof(int)*THREADS*COUNT_PER_PE;
        shared [1] int *global=NULL;
        int *bigLocal;
        
        printf(" Thread %d of %d: hello UPC world. \n", MYTHREAD, THREADS);
    
        /* first, each PE allocates a small array of ints */
        local = (int *)upc_alloc(sizeof(int)*COUNT_PER_PE);
    
        for (i=0;i<COUNT_PER_PE;i++)
    	local[i] = myRank;
    
        fprintf(stderr," Thread %d local = %d \n", myRank, *local);
        upc_barrier;
    
    /* next, we allocate a shared array of ints that is big
       enough to hold all of the smaller arrays.  Note: nBytes already
       includes the THREADS factor, so this call over-allocates by a
       factor of THREADS; only the first THREADS*COUNT_PER_PE ints are
       used below. */
    global = (shared int *)upc_all_alloc(THREADS, nBytes);
        upc_barrier;
        
        if (global == NULL)
    	fprintf(stderr, " Warning: global is NULL on thread %d \n", myRank);
    
        /* all PEs copy from their local to the global array using upc_memput */
        upc_memput((shared void *) (global+myRank*COUNT_PER_PE),  (void *)local, sizeof(int)*COUNT_PER_PE); 
        upc_barrier;
    
        /* print results */
        if (myRank == 0)
        {
    	int i;
    	fprintf(stderr," Results of using upc_memput() to copy from local to global: \n");
    	for (i=0;i<THREADS*COUNT_PER_PE;i++)
    	    fprintf(stderr," [%d] = %d \n", i, global[i]);
        }
    
        /* now, use an element-by-element copy rather than upc_memput() to 
         * copy stuff from local to global */
        for (i=0;i<COUNT_PER_PE;i++)
	global[myRank*COUNT_PER_PE+i] = local[i];
        upc_barrier;
        
        /* print results */
        if (myRank == 0)
        {
    	int i;
    	fprintf(stderr," Results of element-by-element copy from local to global: \n");
    	for (i=0;i<THREADS*COUNT_PER_PE;i++)
    	    fprintf(stderr," [%d] = %d \n", i, global[i]);
        }
    
        /* now, exercise copying from global to a local array by doing
           element-by-element  copy, then print results. */
    
        if (myRank == 0)
        {
    	int i;
    
    	bigLocal = (int *)upc_alloc(THREADS*nBytes);
    
    	memset(bigLocal, 0xFF, THREADS*nBytes);
    
    	/* element by element copy */
    	for (i=0;i<THREADS*COUNT_PER_PE;i++)
    	    bigLocal[i] = global[i];
    
    	/* print results */
    	fprintf(stderr," Results for bigLocal using element-by-element copy from shared to local: \n");
    	for (i=0;i<THREADS*COUNT_PER_PE;i++)
    	    fprintf(stderr," [%d] = %d \n", i, bigLocal[i]);
    
    	memset(bigLocal, 0xFF, THREADS*nBytes);
    
    	/* now, use upc_memget() to copy from global to a local copy */
    	upc_memget((void *)bigLocal, (const shared void *)global, nBytes);
    
    	/* print results */
	fprintf(stderr," Results for bigLocal using upc_memget() to copy from shared to local: \n");
    	for (i=0;i<THREADS*COUNT_PER_PE;i++)
    	    fprintf(stderr," [%d] = %d \n", i, bigLocal[i]);
        }
        upc_barrier;
    
    #if 0
    /* this code is a more conservative implementation of the upc_memget
       test, with a barrier between each step. It produces the same
       (failing) results as above. */
        if (myRank == 0)
        {
    	int i;
    
    	bigLocal = (int *)upc_alloc(THREADS*nBytes);
    
    	memset(bigLocal, 0xFF, THREADS*nBytes);
    
    	/* element by element copy */
    	for (i=0;i<THREADS*COUNT_PER_PE;i++)
    	    bigLocal[i] = global[i];
    
    	/* print results */
    	fprintf(stderr," Results for bigLocal using element-by-element copy from shared to local: \n");
    	for (i=0;i<THREADS*COUNT_PER_PE;i++)
    	    fprintf(stderr," [%d] = %d \n", i, bigLocal[i]);
        }
    
        upc_barrier;
    
        if (myRank == 0)
    	memset(bigLocal, 0xFF, THREADS*nBytes);
        upc_barrier;
    
        if (myRank == 0)
    	/* now, use upc_memget() to copy from global to a local copy */
    	upc_memget((void *)bigLocal, (const shared void *)global, nBytes);
    
        upc_barrier;
    
        if (myRank == 0)
        {
    	/* print results */
	fprintf(stderr," Results for bigLocal using upc_memget() to copy from shared to local: \n");
    	for (i=0;i<THREADS*COUNT_PER_PE;i++)
    	    fprintf(stderr," [%d] = %d \n", i, bigLocal[i]);
        }
        upc_barrier;
    #endif
        return 0;
    }
    
    // comment: this is output from upcHelloMem-v2 run at a concurrency
    // level of four PEs on an Opteron workstation running SuSE Linux 10.0, UPC 2.6.0
    UPCR: UPC threads 0..3 of 4 on porky (process 0 of 1, pid=15369)
     Thread 0 local = 0 
     Thread 1 local = 1 
     Thread 2 local = 2 
     Thread 3 local = 3 
     Results of using upc_memput() to copy from local to global: 
    // comment -- these results are good. 
     [0] = 0 
     [1] = 0 
     [2] = 0 
     [3] = 0 
     [4] = 1 
     [5] = 1 
     [6] = 1 
     [7] = 1 
     [8] = 2 
     [9] = 2 
     [10] = 2 
     [11] = 2 
     [12] = 3 
     [13] = 3 
     [14] = 3 
     [15] = 3 
     Results of element-by-element copy from local to global: 
    // comment -- these results are also good
     [0] = 0 
     [1] = 0 
     [2] = 0 
     [3] = 0 
     [4] = 1 
     [5] = 1 
     [6] = 1 
     [7] = 1 
     [8] = 2 
     [9] = 2 
     [10] = 2 
     [11] = 2 
     [12] = 3 
     [13] = 3 
     [14] = 3 
     [15] = 3 
     Results for bigLocal using element-by-element copy from shared to local: 
    // comment -- these results are also good
     [0] = 0 
     [1] = 0 
     [2] = 0 
     [3] = 0 
     [4] = 1 
     [5] = 1 
     [6] = 1 
     [7] = 1 
     [8] = 2 
     [9] = 2 
     [10] = 2 
     [11] = 2 
     [12] = 3 
     [13] = 3 
     [14] = 3 
     [15] = 3 
     Results for bigLocal using upc_memget() to copy from shared to local: 
    // comment -- these results are bad
     [0] = 0 
     [1] = 1 
     [2] = 2 
     [3] = 3 
     [4] = 2 
     [5] = 2 
     [6] = 3 
     [7] = 0 
     [8] = 0 
     [9] = 0 
     [10] = 0 
     [11] = 0 
     [12] = 0 
     [13] = 0 
     [14] = 0 
     [15] = 0 
     Thread 0 of 4: hello UPC world. 
     Thread 1 of 4: hello UPC world. 
     Thread 2 of 4: hello UPC world. 
     Thread 3 of 4: hello UPC world. 
    