From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Dec 24 2009 - 02:04:29 PST
Nikita,
I have implemented a check of the requested shared heap size against
the address bits in the shared pointer representation. You can find the
small runtime changes for this check at
https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=382
I would recommend that you apply this patch and recompile the UPC
runtime AND your application and then retry your failed runs. If you
see the following error message, then I was correct in my guess about
the 32GB heap:
    UPC Runtime error: out-of-range size for UPC_SHARED_HEAP_SIZE: [something]
If you DON'T see this error, then I was wrong about the cause of your
crash and will need to think more about the problem (and/or look for a
large memory machine where I can try to reproduce the problem for myself).
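
For reference, the idea behind the check can be sketched in a few lines
of C. This is only an illustration of the arithmetic, not the code in
the patch: the function names max_heap_bytes and heap_size_fits are
hypothetical, and the sketch assumes the address field is a plain byte
offset, so that A address bits cover at most 2^A bytes per thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, NOT the Berkeley UPC runtime's actual code.
 * Assumes the address field holds a byte offset, so addr_bits bits
 * can address at most 2^addr_bits bytes of shared heap per thread. */
static uint64_t max_heap_bytes(unsigned addr_bits)
{
    /* Shifting a 64-bit value by 64 or more is undefined in C. */
    assert(addr_bits < 64);
    return (uint64_t)1 << addr_bits;
}

/* Returns nonzero if the requested per-thread heap size (in bytes)
 * is representable with the given number of address bits. */
static int heap_size_fits(uint64_t requested_bytes, unsigned addr_bits)
{
    return requested_bytes <= max_heap_bytes(addr_bits);
}
```

With the default 34 address bits, max_heap_bytes(34) is 2^34 bytes
(16G), so heap_size_fits(32ULL << 30, 34) returns 0. That is the
--shared-heap=32G case from your job script. A 16G request
(16ULL << 30) passes the same test.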
-Paul
Paul H. Hargrove wrote:
> Nikita,
>
> Our default representation for a shared pointer packs all the
> required info into 64 bits. By default only 34 bits are used for
> "address" bits. This means that the maximum shared heap size per UPC
> thread is 16G. So, I would guess that the 2 and 4 thread cases are
> simply using too much memory (I see --shared-heap=32G in the job
> script) and the addressing is getting truncated internally leading to
> the crash.
>
> Assuming you do actually have enough memory on your compute nodes,
> there are at least two things you could do to address this 16GB shared
> heap limit. Unfortunately they both require rebuilding the Berkeley
> UPC runtime:
>
> Option 1) Pass --enable-sptr-struct to configure. This will use a
> struct to represent a shared pointer and effectively removes any
> limitations on the shared heap imposed by the shared pointer
> representation. Unfortunately, this tends to perform less well than
> the 64-bit "packed" representation.
>
> Option 2) You could keep the 64-bit packed representation but adjust
> how many bits are allocated to the address field. To do this pass
> --with-sptr-packed-bits=P,T,A to configure, for suitable integer
> values of P (phase bits), T (thread bits) and A (address bits). For
> an explanation of these values, see the section "TRADING-OFF MAXIMUM
> 'THREADS', BLOCKSIZE, AND HEAP SIZE" in the INSTALL.TXT file in the
> Berkeley UPC source (or online at
> http://upc.lbl.gov/download/dist/INSTALL.TXT )
>
> I would try "Option 1" first. If that does not work, then I am wrong
> about the cause of your crashes and "Option 2" will not be of any use
> regardless of what P,T,A values you pass.
>
> We should probably be sanity checking the --shared-heap argument
> against the shared pointer representation to produce a useful
> message instead of crashing. I'll enter a bug report for that issue.
>
> -Paul
> P.S. Due to the holiday season, you may not hear much from me or the
> rest of the UPC group in Berkeley for the next week or more. Happy
> Holidays to you.
>
> Andreev Nikita wrote:
>> Hi,
>>
>> Another issue with suncc. I'm trying to run the UPC NPB benchmark on a
>> cluster of Sun x86 machines. I create jobs for every NPB kernel for
>> 1, 2, 4, 8, 16 and 32 threads. 1, 8, 16 and 32 compute fine, but 2 and 4
>> crash every time (for every kernel).
>>
>> I'm using Sun Ceres Studio IDE 9.0 Linux_i386 2009/03/06 compiler, udp
>> conduit (ibv also crashes).
>>
>> make.def file which is used for compiling NPB is in attachment.
>>
>> For instance I ran mg kernel for 2 threads and saved the output with
>> debug info. It's also in the attachment. There you can find also the
>> job file which was used to run the job in question.
>>
>> Regards,
>> Nikita
>>
>
>
--
Paul H. Hargrove PHHargrove_at_lbl_dot_gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory