Re: Hanging during upc_alloc()

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sun May 10 2009 - 10:13:46 PDT

    Ben,
    
       The specific reason you see the "hang" in the upc_alloc() call is that the 
    Berkeley UPC runtime library needs to communicate with a central allocation 
    manager on thread 0 (not for *every* upc_alloc(), but occasionally to move a 
    "high water mark").  However, your "while(1)" is preventing thread 0 from 
    responding to the request in the distributed case.  In the shared memory case 
    on your multicore laptop, thread 1 "knows" it is in the same address space as 
    thread 0 and therefore runs the allocation management code itself, without
    any need for communication.
       There is no clear language in the UPC specification about progress 
    guarantees, and your code demonstrates one of the classes of code that would 
    benefit if there were a clear specification.  Our solution/work-around for 
    progress is the "upc_poll()" function, which ensures progress of the 
    communications library.  Your toy code should probably work with "while(1) 
    upc_poll();" or similar.
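
A minimal sketch of that polling spin, assuming Berkeley UPC makes upc_poll() available through <upc.h>. The helper name spin_with_poll, the local flag, and the max_spins bound are hypothetical and only there so the loop can end; the non-UPC branch is a stand-in so the sketch also compiles with a plain C compiler.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __UPC__
#include <upc.h>   /* assumption: Berkeley UPC exposes upc_poll() through this header */
#else
/* Plain-C stand-in so the sketch compiles without a UPC compiler. */
static void upc_poll(void) { }
#endif

/*
 * Hypothetical helper: spin on a thread-local flag, calling upc_poll() on
 * every pass so the runtime can service requests from other threads
 * (for example, the allocation-manager traffic behind upc_alloc()).
 * Returns 1 if the flag was seen set, 0 if max_spins passes ran out first.
 */
static int spin_with_poll(const volatile int *flag, long max_spins)
{
    long i;

    assert(flag != NULL);
    for (i = 0; i < max_spins; i++) {
        if (*flag)
            return 1;
        upc_poll();   /* give the communication layer a chance to make progress */
    }
    return *flag != 0;
}
```

In the toy program, thread 0 would run a loop of this shape in place of the bare "while(1);".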
    
       I think adding upc_poll() to your event-driven loop should allow it to 
    make progress.  So, you probably want something like:
    
       while (1) {
          upc_poll();
          if (have_work) do_the_work();
       }
    
    Note that I suggest calling upc_poll() on every pass through the loop, not 
    only when idle, so that incoming work can't "starve" things like upc_alloc() 
    calls.
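
A slightly fuller sketch of the same loop, under stated assumptions: have_work is modeled here as a hypothetical thread-local count of queued jobs, do_the_work() is a placeholder, and the loop is bounded by max_iterations only so the example terminates. It also assumes Berkeley UPC makes upc_poll() available through <upc.h>; the non-UPC branch is a stand-in for plain C compilers.

```c
#include <assert.h>

#ifdef __UPC__
#include <upc.h>   /* assumption: Berkeley UPC exposes upc_poll() through this header */
#else
/* Plain-C stand-in so the sketch compiles without a UPC compiler. */
static void upc_poll(void) { }
#endif

/* Hypothetical thread-local state, for illustration only. */
static int have_work = 3;   /* number of jobs currently queued for this thread */
static int jobs_done = 0;   /* number of jobs completed so far */

/* Placeholder for the real task: consumes one queued job. */
static void do_the_work(void)
{
    have_work--;
    jobs_done++;
}

/*
 * Event loop: poll unconditionally on every pass, then do at most one job.
 * Polling first means a busy thread still services runtime requests
 * between jobs. Returns the total number of jobs completed.
 */
static int run_event_loop(int max_iterations)
{
    int i;

    assert(max_iterations >= 0);
    for (i = 0; i < max_iterations; i++) {
        upc_poll();
        if (have_work)
            do_the_work();
    }
    return jobs_done;
}
```

A real scheduler would replace the iteration bound with its own shutdown condition.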
    
    Let us know if you need further assistance.
    
    -Paul
    
    Benjamin Byington wrote:
    > Hello,
    > 
    > So my question comes in two parts.  First, what is wrong with the toy code below? 
    > (Besides the obvious infinite loop...).  When executing this code with two processors 
    > on two separate nodes, somehow the tight loop thread 0 is performing is preventing 
    > thread 1 from doing the memory allocation.  The first print statement is reached,
    > but never the second.  If I either remove the loop, or simply switch things around 
    > so that thread 1 is in the loop and thread 0 is trying to do the allocation, things 
    > proceed as would be expected and the memory allocation is completed.  
    > 
    > #include <upc.h>
    > #include <stdio.h>
    > 
    > int main( int argc, char** argv )
    > {
    >     if(MYTHREAD == 0)
    >     {
    >         int len;
    >         while(1);
    >     }
    >     else if(MYTHREAD == 1)
    >     {
    >         fprintf(stderr, "Beginning memory allocation\n");
    >         shared void * t = upc_alloc(1000000);
    >         fprintf(stderr, "Finished memory allocation\n");
    >     }
    > 
    >     upc_barrier;
    > 
    >     return 0;
    > }
    > 
    > The second part of my question is: How should one approach doing event driven 
    > programming in upc?  The above situation arose when I was trying to write a program 
    > that used dynamic scheduling to control when various tasks get performed.  Thread 0 
    > sits in a tight loop monitoring a set of flags for each of the worker processors, 
    > and gives them new directions any time it detects one is available.  The worker 
    > nodes also sit in a tight loop any time they are idle, monitoring another flag to 
    > see if there is any more work available.  I took care to ensure that all these 
    > rapidly accessed flags were local to the processor sitting on them so as to avoid a 
    > million tiny unnecessary messages, but as my first example demonstrates that doesn't 
    > seem to be enough.  All the processors go through some setup code allocating various 
    > shared data structures without a problem, but almost as soon as things enter the meat 
    > of the program things hang.  Processor 0 hands off the first job to some worker 
    > node, and since at this stage there are no other concurrent tasks until the first one
    > finishes, processor zero just ends up repeatedly checking all the flags waiting for 
    > the job to be finished.  The worker node however never completes the task.  It always 
    > manages to perform a malloc(), a upc_memget(), and a upc_free() without a problem, but 
    > the first time it hits a upc_alloc() the program just freezes.  (The freezing problem 
    > goes away if I tell processor zero to just exit the loop and wait at a barrier, but 
    > that of course is useless since now it can't detect or do anything once the first task 
    > is done).  Is there a better way than my flags to take event driven action?  Is there 
    > a reason processor 0 being in a tight loop affects the execution of other processors?  
    > 
    > I just realized, this code works on my multicore laptop just fine, and while I presumed 
    > the problem had to do with distributed memory versus shared memory, I figured I should 
    > provide what details I can about the hardware this program is failing on in case there 
    > is a key there...
    > 
    > Thanks in advance!
    > Ben
    [...config info removed...]
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
