Re[2]: suncc, NPB crashes

From: Andreev Nikita (nik_at_kemsu.ru)
Date: Thu Dec 24 2009 - 02:13:54 PST

  • Next message: Oystein Thorsen: "shared pointer as a return type"
    Paul,
    
    I've just reduced the heap size to 8GB (that's enough for me) and it
    resolved the issue. I can try your patch if you need that.
    
    Nikita
    
    > Nikita,
    
    >   I have implemented checking of the requested shared heap against the
    > address bits in the shared pointer representation.  You can find the 
    > small runtime changes for this checking at
    >         https://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=382
    
    >   I would recommend that you apply this patch and recompile the UPC 
    > runtime AND your application and then retry your failed runs.  If you 
    > see the following error message, then I was correct in my guess about 
    > the 32GB heap:
    >        UPC Runtime error: out-of-range size for UPC_SHARED_HEAP_SIZE: [something]
    
    >   If you DON'T see this error, then I was wrong about the cause of your
    > crash and will need to think more about the problem (and/or look for a
    > large memory machine where I can try to reproduce the problem for myself).
    
    > -Paul
    
    > Paul H. Hargrove wrote:
    >> Nikita,
    >>
    >>  Our default representation for a shared pointer packs all the 
    >> required info into 64 bits.  By default only 34 bits are used for 
    >> "address" bits.  This means that the maximum shared heap size per UPC 
    >> thread is 16GB (2^34 bytes).  So, I would guess that the 2 and 4 
    >> thread cases are simply using too much memory (I see 
    >> --shared-heap=32G in the job script) and the addressing is getting 
    >> truncated internally, leading to the crash.
    >>
    >>  Assuming you do actually have enough memory on your compute nodes, 
    >> there are at least two things you could do to address this 16GB shared 
    >> heap limit.  Unfortunately they both require rebuilding the Berkeley 
    >> UPC runtime:
    >>
    >> Option 1)  Pass --enable-sptr-struct to configure.  This will use a 
    >> struct to represent a shared pointer and effectively removes any 
    >> limitations on the shared heap imposed by the shared pointer 
    >> representation.  Unfortunately, this tends to perform less well than 
    >> the 64-bit "packed" representation.
    >>
    >> Option 2) You could keep the 64-bit packed representation but adjust 
    >> how many bits are allocated to the address field.  To do this pass 
    >> --with-sptr-packed-bits=P,T,A to configure, for suitable integer 
    >> values of P (phase bits), T (thread bits) and A (address bits).  For 
    >> an explanation of these values, see the section "TRADING-OFF MAXIMUM 
    >> 'THREADS', BLOCKSIZE, AND HEAP SIZE" in the INSTALL.TXT file in the 
    >> Berkeley UPC source (or online at 
    >> http://upc.lbl.gov/download/dist/INSTALL.TXT )
    >>
    >> I would try "Option 1" first.  If that does not work, then I am wrong 
    >> about the cause of your crashes and "Option 2" will not be of any use 
    >> regardless of what P,T,A values you pass.
    >>
    >> We should probably be sanity checking the --shared-heap argument 
    >> against the shared pointer representation to produce a useful 
    >> message instead of crashing.  I'll enter a bug report for that issue.
    >>
    >> -Paul
    >> P.S.  Due to the holiday season, you may not hear much from me or the 
    >> rest of the UPC group in Berkeley for the next week or more.  Happy 
    >> Holidays to you.
    >>
    >> Andreev Nikita wrote:
    >>> Hi,
    >>>
    >>> Another issue with suncc. I'm trying to run the UPC NPB benchmarks
    >>> on a cluster of Sun x86 machines. I create jobs for every NPB kernel
    >>> for 1, 2, 4, 8, 16 and 32 threads. 1, 8, 16 and 32 compute fine, but
    >>> 2 and 4 crash every time (for every kernel).
    >>>
    >>> I'm using the Sun Ceres Studio IDE 9.0 Linux_i386 2009/03/06 compiler
    >>> and the udp conduit (ibv also crashes).
    >>>
    >>> The make.def file used for compiling NPB is attached.
    >>>
    >>> For instance, I ran the mg kernel with 2 threads and saved the output
    >>> with debug info. It is also attached, along with the job file that
    >>> was used to run the job in question.
    >>>
    >>> Regards,
    >>> Nikita
    >>>   
    >>
    >>
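
The arithmetic behind the 16GB limit discussed in this thread can be sketched in a few lines of C. This is only an illustration, not the Berkeley UPC runtime code or the patch attached to the bug report: the function names are hypothetical, and the rule that the phase, thread and address bits must add up to exactly 64 is an assumption drawn from the "packs all the required info into 64 bits" description (INSTALL.TXT is the authority on the real constraints). The 20,10,34 split mentioned below is likewise just an example that happens to sum to 64 with 34 address bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a Berkeley UPC function): the largest
 * per-thread shared heap, in bytes, that addr_bits address bits can cover. */
static uint64_t max_heap_bytes(unsigned addr_bits)
{
    /* Shifting a 64-bit value by 64 or more is undefined in C. */
    assert(addr_bits < 64);
    return (uint64_t)1 << addr_bits;
}

/* Hypothetical check: does the requested heap fit in the address field? */
static bool heap_fits(uint64_t requested_bytes, unsigned addr_bits)
{
    return requested_bytes <= max_heap_bytes(addr_bits);
}

/* Hypothetical check, ASSUMING the three fields of the packed
 * representation must fill the 64-bit word exactly. */
static bool layout_is_valid(unsigned phase_bits, unsigned thread_bits,
                            unsigned addr_bits)
{
    return phase_bits + thread_bits + addr_bits == 64;
}
```

With 34 address bits, heap_fits(8ULL << 30, 34) and heap_fits(16ULL << 30, 34) are true, while heap_fits(32ULL << 30, 34) is false: a 32G heap needs at least 35 address bits. Under the sum-to-64 assumption, a layout such as 20,10,34 would have to give up one phase or thread bit to make room for that extra address bit.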
    
