From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sun May 10 2009 - 10:13:46 PDT
Ben,

The specific reason you see the "hang" in the upc_alloc() call is that the Berkeley UPC runtime library needs to communicate with a central allocation manager on thread 0 (not for *every* upc_alloc(), but occasionally, to move a "high water mark"). Your "while(1)" is preventing thread 0 from responding to that request in the distributed case. In the shared-memory case on your multicore laptop, thread 1 "knows" it is in the same address space as thread 0 and therefore runs the allocation-management code itself, w/o the need for communication.

There is no clear language in the UPC specification about progress guarantees, and your code demonstrates one of the classes of code that would benefit if there were. Our solution/work-around for progress is the "upc_poll()" function, which ensures progress of the communications library. Your toy code should work with "while(1) upc_poll();" or similar, and adding upc_poll() to your event loop should likewise allow the event-driven version to make progress. So, you probably want something like:

    while (1) {
        upc_poll();
        if (have_work) do_the_work();
    }

Note that I am suggesting making the upc_poll() call every time through the loop, so that incoming work can't "starve" things like upc_alloc() calls on other threads.

Let us know if you need further assistance.

-Paul

Benjamin Byington wrote:
> Hello,
>
> My question comes in two parts. First, what is wrong with the toy code below
> (besides the obvious infinite loop)? When executing this code with two processors
> on two separate nodes, somehow the tight loop thread 0 is performing prevents
> thread 1 from completing its memory allocation. The first print statement is reached,
> but never the second. If I either remove the loop, or simply switch things around
> so that thread 1 is in the loop and thread 0 is doing the allocation, things
> proceed as expected and the memory allocation completes.
>
> #include <upc.h>
> #include <stdio.h>
>
> int main( int argc, char** argv )
> {
>     if(MYTHREAD == 0)
>     {
>         int len;
>         while(1);
>     }
>     else if(MYTHREAD == 1)
>     {
>         fprintf(stderr, "Beginning memory allocation\n");
>         shared void * t = upc_alloc(1000000);
>         fprintf(stderr, "Finished memory allocation\n");
>     }
>
>     upc_barrier;
>
>     return 0;
> }
>
> The second part of my question is: how should one approach event-driven
> programming in UPC? The situation above arose when I was writing a program
> that used dynamic scheduling to control when various tasks get performed. Thread 0
> sits in a tight loop monitoring a set of flags, one per worker processor,
> and gives a worker new directions any time it detects one is available. The worker
> nodes also sit in a tight loop whenever they are idle, monitoring another flag to
> see if there is any more work available. I took care to ensure that all these
> rapidly accessed flags were local to the processor spinning on them, so as to avoid a
> million tiny unnecessary messages, but as my first example demonstrates, that doesn't
> seem to be enough. All the processors get through some setup code allocating various
> shared data structures without a problem, but almost as soon as execution enters the meat
> of the program, things hang. Processor 0 hands off the first job to some worker
> node, and since at this stage there are no other concurrent tasks until the first one
> is finished, processor 0 just ends up repeatedly checking all the flags, waiting for
> the job to finish. The worker node, however, never completes the task. It always
> manages to perform a malloc(), a upc_memget(), and a upc_free() without a problem, but
> the first time it hits a upc_alloc() the program just freezes. (The freezing problem
> goes away if I tell processor 0 to just exit the loop and wait at a barrier, but
> that of course is useless, since then it can't detect or do anything once the first
> task is done.)
> Is there a better way than my flags to take event-driven action? Is there
> a reason processor 0 being in a tight loop affects the execution of other processors?
>
> I just realized this code works on my multicore laptop just fine, and while I presumed
> the problem had to do with distributed memory versus shared memory, I figured I should
> provide what details I can about the hardware this program is failing on, in case there
> is a key there...
>
> Thanks in advance!
> Ben

[...config info removed...]

-- 
Paul H. Hargrove                PHHargrove_at_lbl_dot_gov
Future Technologies Group       HPC Research Department
Tel: +1-510-495-2352            Lawrence Berkeley National Laboratory
Fax: +1-510-486-6900
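P.S. For completeness, here is the toy program with the upc_poll() workaround
folded in. This is only a sketch of what I have in mind (I have not compiled
this exact code); it assumes the Berkeley UPC compiler (upcc) and runtime, and
the upc_free() call at the end is my addition:

    #include <upc.h>
    #include <stdio.h>

    int main( int argc, char** argv )
    {
        if(MYTHREAD == 0)
        {
            /* Spin, but call upc_poll() each iteration so the runtime can
             * service allocation requests arriving from other threads. */
            while(1) upc_poll();
        }
        else if(MYTHREAD == 1)
        {
            fprintf(stderr, "Beginning memory allocation\n");
            shared void * t = upc_alloc(1000000);
            fprintf(stderr, "Finished memory allocation\n");
            upc_free(t);
        }

        upc_barrier;  /* thread 0 never reaches this in the toy version */

        return 0;
    }

With the upc_poll() in place, thread 1's upc_alloc() should complete even in
the distributed case, because thread 0 now gives the runtime a chance to act
on the allocation-manager request while it spins.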