From: Andreev Nikita (nik_at_kemsu.ru)
Date: Thu Dec 24 2009 - 02:13:54 PST
Paul,

I've just reduced heap size to 8GB (it's enough for me) and it resolved the issue. I can try your patch if you need that.

Nikita

> Nikita,
> I have implemented checking of the requested shared heap against the
> address bits in the shared pointer representation. You can find the
> small runtime changes for this checking at
> https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=382
> I would recommend that you apply this patch and recompile the UPC
> runtime AND your application and then retry your failed runs. If you
> see the following error message, then I was correct in my guess about
> the 32GB heap:
> UPC Runtime error: out-of-range size for UPC_SHARED_HEAP_SIZE:
> [something]
> If you DON'T see this error, then I was wrong about the cause of your
> crash and will need to think more about the problem (and/or look for a
> large memory machine where I can try to reproduce the problem for myself).
> -Paul
> Paul H. Hargrove wrote:
>> Nikita,
>>
>> Our default representation for a shared pointer packs all the
>> required info into 64 bits. By default only 34 bits are used for
>> "address" bits. This means that the maximum shared heap size per UPC
>> thread is 16G. So, I would guess that the 2 and 4 thread cases are
>> simply using too much memory (I see --shared-heap=32G in the job
>> script) and the addressing is getting truncated internally leading to
>> the crash.
>>
>> Assuming you do actually have enough memory on your compute nodes,
>> there are at least two things you could do to address this 16GB shared
>> heap limit. Unfortunately they both require rebuilding the Berkeley
>> UPC runtime:
>>
>> Option 1) Pass --enable-sptr-struct to configure. This will use a
>> struct to represent a shared pointer and effectively removes any
>> limitations on the shared heap imposed by the shared pointer
>> representation. Unfortunately, this tends to perform less well than
>> the 64-bit "packed" representation.
>>
>> Option 2) You could keep the 64-bit packed representation but adjust
>> how many bits are allocated to the address field. To do this pass
>> --with-sptr-packed-bits=P,T,A to configure, for suitable integer
>> values of P (phase bits), T (thread bits) and A (address bits). For
>> an explanation of these values, see the section "TRADING-OFF MAXIMUM
>> 'THREADS', BLOCKSIZE, AND HEAP SIZE" in the INSTALL.TXT file in the
>> Berkeley UPC source (or online at
>> http://upc.lbl.gov/download/dist/INSTALL.TXT )
>>
>> I would try "Option 1" first. If that does not work, then I am wrong
>> about the cause of your crashes and "Option 2" will not be of any use
>> regardless of what P,T,A values you pass.
>>
>> We should probably be sanity checking the --shared-heap argument
>> against the shared pointer representation to produce a useful
>> message instead of crashing. I'll enter a bug report for that issue.
>>
>> -Paul
>> P.S. Due to the holiday season, you may not hear much from me or the
>> rest of the UPC group in Berkeley for the next week or more. Happy
>> Holidays to you.
>>
>> Andreev Nikita wrote:
>>> Hi,
>>>
>>> Another issue with suncc. I'm trying to run UPC NPB benchmark on Sun
>>> x86 machines cluster. I create jobs for every NPB kernel for
>>> 1, 2, 4, 8, 16 and 32 threads. 1, 8, 16 and 32 compute fine, but 2 and 4
>>> crash every time (for every kernel).
>>>
>>> I'm using Sun Ceres Studio IDE 9.0 Linux_i386 2009/03/06 compiler, udp
>>> conduit (ibv also crashes).
>>>
>>> make.def file which is used for compiling NPB is in attachment.
>>>
>>> For instance I ran mg kernel for 2 threads and saved the output with
>>> debug info. It's also in the attachment. There you can find also the
>>> job file which was used to run the job in question.
>>>
>>> Regards,
>>> Nikita
>>>
>>
>>
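
The arithmetic behind the 16G limit discussed in this thread can be sketched in C. This is only an illustration, assuming that A address bits allow a per-thread shared heap of at most 2^A bytes, which matches the 34-bit / 16G figures quoted above. The function names (max_heap_bytes, heap_fits, addr_bits_needed) are hypothetical and are not part of the Berkeley UPC runtime; the actual check in the patch may well differ.

```c
#include <assert.h>
#include <stdint.h>

/* Largest per-thread shared heap, in bytes, reachable with addr_bits
 * address bits (assumption: byte addressing, so the limit is 2^addr_bits). */
uint64_t max_heap_bytes(unsigned addr_bits) {
    assert(addr_bits <= 64);
    return (addr_bits == 64) ? UINT64_MAX : (UINT64_C(1) << addr_bits);
}

/* Returns 1 if a requested heap size fits in the address field, else 0. */
int heap_fits(uint64_t requested_bytes, unsigned addr_bits) {
    return requested_bytes <= max_heap_bytes(addr_bits);
}

/* Smallest number of address bits that can cover the requested heap size.
 * Hypothetical helper for picking the A value in a P,T,A split. */
unsigned addr_bits_needed(uint64_t requested_bytes) {
    unsigned bits = 0;
    while (bits < 64 && max_heap_bytes(bits) < requested_bytes) {
        bits++;
    }
    return bits;
}
```

Under these assumptions, heap_fits(8ULL << 30, 34) is true while heap_fits(32ULL << 30, 34) is false, which is consistent with the 8GB run succeeding and the 32G request failing. addr_bits_needed(32ULL << 30) gives 35, so a packed representation would need at least one more address bit, taken from the phase or thread bits, to cover a 32G heap.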