From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Dec 24 2009 - 02:04:29 PST
Nikita,
  I have implemented checking of the requested shared heap against the 
address bits in the shared pointer representation.  You can find the 
small runtime changes for this checking at
        https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=382
  I would recommend that you apply this patch and recompile the UPC 
runtime AND your application and then retry your failed runs.  If you 
see the following error message, then I was correct in my guess about 
the 32GB heap:
       UPC Runtime error: out-of-range size for UPC_SHARED_HEAP_SIZE: [something]
  If you DON'T see this error, then I was wrong about the cause of your 
crash and will need to think more about the problem (and/or look for a 
large memory machine where I can try to reproduce the problem for myself).
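
  For illustration, the kind of range check described above can be 
sketched as follows.  This is not the patch itself: it is a minimal 
Python sketch that assumes the 34 address bits mentioned in the quoted 
message below, and check_heap_request, its signature, and its error 
text are hypothetical names invented for the example.

```python
GIB = 1 << 30  # bytes per GiB


def max_heap_bytes(address_bits):
    """Largest per-thread heap whose offsets fit in `address_bits` bits."""
    return 1 << address_bits


def check_heap_request(requested_bytes, address_bits=34):
    """Return the request unchanged, or raise if it cannot be addressed.

    Hypothetical helper: the name, signature, and error text are made up
    for illustration and are not taken from the runtime patch.
    """
    limit = max_heap_bytes(address_bits)
    if requested_bytes <= 0 or requested_bytes > limit:
        raise ValueError(
            f"requested heap of {requested_bytes} bytes is outside "
            f"(0, {limit}] for {address_bits} address bits"
        )
    return requested_bytes


if __name__ == "__main__":
    print(check_heap_request(16 * GIB))  # exactly at the limit, accepted
    try:
        check_heap_request(32 * GIB)     # the size from the job script
    except ValueError as err:
        print("rejected:", err)
```

  With 34 address bits a 16GB request passes and a 32GB request is 
rejected, which is the situation the new error message is meant to flag.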
-Paul
Paul H. Hargrove wrote:
> Nikita,
>
>  Our default representation for a shared pointer packs all the 
> required info into 64 bits, and by default only 34 of those bits hold 
> the "address".  This caps the shared heap at 2^34 bytes = 16GB per UPC 
> thread.  So, I would guess that the 2 and 4 thread cases are simply 
> requesting too much memory (I see --shared-heap=32G in the job 
> script), and the addresses are being truncated internally, leading to 
> the crash.
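
The arithmetic behind that guess can be sketched in a few lines of 
Python.  This is only a model of what "truncated" means for a 34-bit 
field: it assumes the excess high bits are simply dropped, and 
stored_offset is a hypothetical name, not a runtime function.

```python
ADDRESS_BITS = 34                        # default width stated above
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1   # keeps only the low 34 bits
GIB = 1 << 30                            # bytes per GiB


def stored_offset(offset):
    """Offset that survives in a 34-bit field (assumed: high bits dropped)."""
    return offset & ADDRESS_MASK


if __name__ == "__main__":
    for gib in (15, 16, 20):
        print(gib, "GiB ->", stored_offset(gib * GIB) // GIB, "GiB")
```

Under this assumption, any offset at or beyond 16GB aliases a lower 
offset (20GB lands on 4GB), so part of a 32GB heap would silently 
overlap the first 16GB.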
>
>  Assuming you do actually have enough memory on your compute nodes, 
> there are at least two things you could do to address this 16GB shared 
> heap limit.  Unfortunately they both require rebuilding the Berkeley 
> UPC runtime:
>
> Option 1)  Pass --enable-sptr-struct to configure.  This will use a 
> struct to represent a shared pointer and effectively removes any 
> limitations on the shared heap imposed by the shared pointer 
> representation.  Unfortunately, this tends to perform less well than 
> the 64-bit "packed" representation.
>
> Option 2) You could keep the 64-bit packed representation but adjust 
> how many bits are allocated to the address field.  To do this pass 
> --with-sptr-packed-bits=P,T,A to configure, for suitable integer 
> values of P (phase bits), T (thread bits) and A (address bits).  For 
> an explanation of these values, see the section "TRADING-OFF MAXIMUM 
> 'THREADS', BLOCKSIZE, AND HEAP SIZE" in the INSTALL.TXT file in the 
> Berkeley UPC source (or online at 
> http://upc.lbl.gov/download/dist/INSTALL.TXT )
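
To make the trade-off concrete, here is a small Python sketch of the 
upper bounds implied by a given P,T,A split.  It assumes the three 
widths must add up to 64 and that each field bounds its quantity at 
2^bits; the specific triples below are illustrative only (34 address 
bits is the default mentioned above, the P and T values are assumed), 
and INSTALL.TXT remains the authority on which values are valid.

```python
GIB = 1 << 30  # bytes per GiB


def packed_limits(phase_bits, thread_bits, addr_bits):
    """Upper bounds implied by the field widths of a 64-bit packed pointer.

    Hypothetical helper for illustration; assumes P + T + A == 64.
    """
    if phase_bits + thread_bits + addr_bits != 64:
        raise ValueError("P + T + A must equal 64 in this sketch")
    return {
        "max_blocksize": 1 << phase_bits,   # bounded by the phase field
        "max_threads": 1 << thread_bits,    # bounded by the thread field
        "max_heap_bytes": 1 << addr_bits,   # bounded by the address field
    }


if __name__ == "__main__":
    # Illustrative splits; only the 34 address bits come from the text.
    for split in ((20, 10, 34), (20, 8, 36)):
        limits = packed_limits(*split)
        print(split, limits["max_threads"], "threads,",
              limits["max_heap_bytes"] // GIB, "GiB heap per thread")
```

Moving two bits from T to A in this sketch raises the per-thread heap 
bound from 16GB to 64GB while cutting the thread bound by a factor of 
four, which is the kind of exchange Option 2 asks you to choose.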
>
> I would try "Option 1" first.  If that does not work, then I am wrong 
> about the cause of your crashes and "Option 2" will not be of any use 
> regardless of what P,T,A values you pass.
>
> We should probably be sanity checking the --shared-heap argument 
> against the shared pointer representation to produce a useful 
> message instead of crashing.  I'll enter a bug report for that issue.
>
> -Paul
> P.S.  Due to the holiday season, you may not hear much from me or the 
> rest of the UPC group in Berkeley for the next week or more.  Happy 
> Holidays to you.
>
> Andreev Nikita wrote:
>> Hi,
>>
>> Another issue with suncc. I'm trying to run the UPC NPB benchmarks on
>> a cluster of Sun x86 machines. I create jobs for every NPB kernel with
>> 1, 2, 4, 8, 16 and 32 threads. The 1, 8, 16 and 32 thread runs complete
>> fine, but the 2 and 4 thread runs crash every time (for every kernel).
>>
>> I'm using Sun Ceres Studio IDE 9.0 Linux_i386 2009/03/06 compiler, udp
>> conduit (ibv also crashes).
>>
>> The make.def file used for compiling NPB is attached.
>>
>> For instance, I ran the mg kernel with 2 threads and saved the output
>> with debug info. It is also attached, along with the job file that was
>> used to run the job in question.
>>
>> Regards,
>> Nikita
>>   
>
>
-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory