Re: suncc, NPB crashes

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Dec 24 2009 - 02:04:29 PST

  • Next message: Andreev Nikita: "Re[2]: suncc, NPB crashes"
    Nikita,
    
      I have implemented checking of the requested shared heap against the 
    address bits in the shared pointer representation.  You can find the 
    small runtime changes for this checking at
            https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=382
    
      I would recommend that you apply this patch and recompile the UPC 
    runtime AND your application and then retry your failed runs.  If you 
    see the following error message, then I was correct in my guess about 
    the 32GB heap:
           UPC Runtime error: out-of-range size for UPC_SHARED_HEAP_SIZE: [something]
    
      If you DON'T see this error, then I was wrong about the cause of your 
    crash and will need to think more about the problem (and/or look for a 
    large memory machine where I can try to reproduce the problem for myself).
    
    -Paul
    
    Paul H. Hargrove wrote:
    > Nikita,
    >
    >  Our default representation for a shared pointer packs all the 
    > required info into 64 bits.  By default only 34 bits are used for 
    > "address" bits.  This means that the maximum shared heap size per UPC 
    > thread is 16GB (2^34 bytes).  So, I would guess that the 2 and 4 
    > thread cases are simply using too much memory (I see --shared-heap=32G 
    > in the job script) and the addresses are getting truncated internally, 
    > leading to the crash.
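
The arithmetic behind the 16GB figure can be sketched as below. This is an illustration only, not the Berkeley UPC runtime code: the function names are hypothetical, binary units are assumed for the size suffix, and treating a request of exactly 2^34 bytes as acceptable is an assumption based on the "maximum ... is 16G" wording above.

```python
# Illustrative sketch only; names are hypothetical, not Berkeley UPC APIs.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}

def max_heap_bytes(addr_bits):
    """Largest per-thread shared heap addressable with addr_bits bits."""
    return 1 << addr_bits

def parse_size(text):
    """Parse a size such as '32G' or '512MB' (binary units assumed)."""
    text = text.strip().upper()
    if text.endswith("B"):
        text = text[:-1]
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

def heap_fits(requested, addr_bits=34):
    """True if the requested heap fits in the address field (assumed inclusive)."""
    return parse_size(requested) <= max_heap_bytes(addr_bits)

if __name__ == "__main__":
    for request in ("16G", "32G"):
        print(request, "fits in 34 address bits:", heap_fits(request))
```

With the default 34 address bits, a 32G request needs a 35th bit, which is consistent with the truncation guess above.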
    >
    >  Assuming you do actually have enough memory on your compute nodes, 
    > there are at least two things you could do to address this 16GB shared 
    > heap limit.  Unfortunately they both require rebuilding the Berkeley 
    > UPC runtime:
    >
    > Option 1)  Pass --enable-sptr-struct to configure.  This will use a 
    > struct to represent a shared pointer and effectively removes any 
    > limitations on the shared heap imposed by the shared pointer 
    > representation.  Unfortunately, this tends to perform less well than 
    > the 64-bit "packed" representation.
    >
    > Option 2) You could keep the 64-bit packed representation but adjust 
    > how many bits are allocated to the address field.  To do this pass 
    > --with-sptr-packed-bits=P,T,A to configure, for suitable integer 
    > values of P (phase bits), T (thread bits) and A (address bits).  For 
    > an explanation of these values, see the section "TRADING-OFF MAXIMUM 
    > 'THREADS', BLOCKSIZE, AND HEAP SIZE" in the INSTALL.TXT file in the 
    > Berkeley UPC source (or online at 
    > http://upc.lbl.gov/download/dist/INSTALL.TXT )
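
The trade-off in Option 2 can be sketched as follows. Only the 64-bit total and the default of 34 address bits come from this thread; the specific P and T values used here, the requirement that the three fields sum to exactly 64, and the reading of each limit as 2^bits are assumptions for illustration. INSTALL.TXT remains the authority on the real constraints.

```python
# Illustrative sketch only; describe_layout is a hypothetical helper.
def describe_layout(phase_bits, thread_bits, addr_bits, total_bits=64):
    """Return the limits implied by a P,T,A split of a packed shared pointer.

    Assumes the three fields must sum to total_bits and that each limit
    is 2**bits; the real runtime may reserve values or differ in detail.
    """
    if phase_bits + thread_bits + addr_bits != total_bits:
        raise ValueError("P + T + A must equal %d" % total_bits)
    return {
        "max_blocksize": 1 << phase_bits,
        "max_threads": 1 << thread_bits,
        "max_heap_bytes": 1 << addr_bits,
    }

if __name__ == "__main__":
    # A = 34 is from the message above; P = 20 and T = 10 are assumed values.
    for p, t, a in ((20, 10, 34), (19, 10, 35)):
        limits = describe_layout(p, t, a)
        print("P,T,A=%d,%d,%d ->" % (p, t, a),
              "heap %d GiB," % (limits["max_heap_bytes"] >> 30),
              "threads %d," % limits["max_threads"],
              "blocksize %d" % limits["max_blocksize"])
```

Under these assumptions, moving one bit from the phase field to the address field doubles the per-thread heap limit to 32GB at the cost of halving the maximum block size.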
    >
    > I would try "Option 1" first.  If that does not work, then I am wrong 
    > about the cause of your crashes and "Option 2" will not be of any use 
    > regardless of what P,T,A values you pass.
    >
    > We should probably be sanity checking the --shared-heap argument 
    > against the shared pointer representation to produce a useful 
    > message instead of crashing.  I'll enter a bug report for that issue.
    >
    > -Paul
    > P.S.  Due to the holiday season, you may not hear much from me or the 
    > rest of the UPC group in Berkeley for the next week or more.  Happy 
    > Holidays to you.
    >
    > Andreev Nikita wrote:
    >> Hi,
    >>
    >> Another issue with suncc. I'm trying to run the UPC NPB benchmarks on
    >> a cluster of Sun x86 machines. I create jobs for every NPB kernel for
    >> 1, 2, 4, 8, 16 and 32 threads. 1, 8, 16 and 32 compute fine, but 2 and 4
    >> crash every time (for every kernel).
    >>
    >> I'm using Sun Ceres Studio IDE 9.0 Linux_i386 2009/03/06 compiler, udp
    >> conduit (ibv also crashes).
    >>
    >> The make.def file used for compiling NPB is attached.
    >>
    >> For instance, I ran the mg kernel with 2 threads and saved the output
    >> with debug info. It is also attached, along with the job file that was
    >> used to run the job in question.
    >>
    >> Regards,
    >> Nikita
    >>   
    >
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
