GASNet elan-conduit documentation Dan Bonachea $Revision: 1.6 $ User Information: ----------------- Recognized environment variables: --------------------------------- * All the standard GASNet environment variables (see top-level README) * GASNET_NETWORKDEPTH - depth of libelan put queues to allocate Use larger values to prevent stalls in put_nb(i) initiation operations when queueing up many non-blocking puts in a short period of time. Note that larger values increase memory consumption on the elan NIC, and specifying a value which is too large can lead to elan memory exhaustion. * GASNET_NBI_THROTTLE - max number of outstanding nbi operations to issue before stalling to reap some completions. Defaults to GASNET_NETWORKDEPTH. * GASNET_BARRIER=ELANFAST (default) - use elan hardware support for barrier. Additionally, assume that all threads agree on whether a given barrier is named or anonymous, to allow higher performance barrier * GASNET_BARRIER=ELANSLOW - use libelan hardware collectives for barrier, but comply strictly to barrier semantics. Optional compile-time settings: ------------------------------ * All the compile-time settings from extended-ref (see the extended-ref README) * GASNETC_PREALLOC_AMLONG_BOUNCEBUF(1/0) - enable optimization to preallocate bounce buffers for AMLongs that need it (disabling this may cause failures on elan out-of-memory conditions) * GASNETC_USE_SIGNALING_EXIT(1/0) - enable the use of RMS-based signals for non-collective terminations via gasnet_exit. Turning this off may result in job hangs and orphaned nodes during non-collective calls to gasnet_exit(). * GASNETE_USE_ELAN_PUTGET(1/0) - setting to zero makes all put/gets use AM rather than elanlib put/gets Known problems: --------------- * Dual-rail doesn't work on older versions of elan3/libelan due to bugs in libelan - If you run into trouble on dual-rail hardware, try disabling dual-rail operation. * On some older versions of libelan/elan3, libelan queues can cause AM messages between GASNet nodes sharing an elan NIC to deadlock - if you see deadlocks, try running only a single GASNet process per elan NIC (and use pthreads to get concurrency on SMP nodes). * See the Berkeley UPC Bugzilla server for details on known bugs. Future work: ------------ * Fix the above bugs * Implement GASNet split-phase barriers entirely on the elan NIC thread processor =============================================================================== Design Overview: ---------------- elan-conduit targets the Quadrics libelan software, a low-level interface which is implemented on top of the hardware-level elan3/4 library, and is intended to provide portability to new versions of the elan hardware. The core API is implemented using the elan queue abstraction in elanlib (for short, long and sufficiently small medium AM's), and the elanlib TPORTS API (a generalized message send/recv mechanism with optional tag-matching which is used to implement MPI) is used for larger medium messages. All AM handlers currently run synchronously within explicit or implicit polling operations. The extended API is implemented as an extension of the ref-extended API implementation (similar eop/iop handling). All of the put/get operations use elanlib put/get operations (which translate to purely one-sided, zero-copy RDMA operations) in the common case. Some uncommon cases require the use of bounce-buffers on the initiator node, and a few very exceptional cases are handled by falling back to AM-based put/gets. When stats/tracing are enabled, the conduit keeps detailed information about which elanlib mechanism was used for each put/get operation. Barriers are implemented using the elanlib hardware barrier (for anonymous barriers), or a hardware broadcast/hardware barrier combination for named barriers. The elanlib collective operations are all blocking, so barrier_notify actually blocks during these calls (but the hardware barrier is so fast that it's probably worthwhile). Future work includes implementing hardware split-phase collectives on the elan NIC thread processor. Detailed design of core and extended implementation: ---------------------------------------------------- Basic design of the core implementation: ======================================= (the canonical version is in gasnet_core_reqrep.c) All Shorts/All Longs/Mediums <= GASNETC_ELAN_MAX_QUEUEMSG(320): sent using an elan queue of length LIBELAN_TPORT_NSLOTS Longs use a blocking elan_put before queuing to ensure ordering use a bounce-buffer if > GASNETC_ELAN_SMALLPUTSZ and not elan-mapped AMPoll checks for incoming queue entries All mediums are argument-padded to ensure payload alignment on recvr Mediums > GASNETC_ELAN_MAX_QUEUEMSG(320): sent using a tport message in a pre-allocated buffer Keep tport Tx buffers in a FIFO of length LIBELAN_TPORT_NSLOTS - poll for Tx completion starting at oldest Tx buffer whenever we need one may spin-poll during a send if all Tx buffers occupied Keep a FIFO of posted tport Rx bufs, which are guaranteed to arrive in order AMPoll checks the head for completion every tport buffer has a dedicated descriptor (gasnetc_bufdesc_t) holds ELAN_EVENT for pending Tx/Rx pointer to the buffer (gasnetc_buf_t) and possibly a system Rx buffer AMPoll handles up to GASNETC_MAX_RECVMSGS_PER_POLL messages from either the queue or tport (giving precedence to the queue) All loopback AM messages are run synchronously inside the request/reply Basic design of the extended implementation: =========================================== (the canonical version is in gasnet_extended.c) gasnet_handle_t - can be either an (gasnete_op_t *) or (ELAN_EVENT *), as flagged by LSB eops - marked with the operation type - always AM-based op or put/get w/ bouncebuf iops - completion counters for AM-based ops linked list of put/get eops that need bounce-bufs pgctrl objects for holding put/get ELAN_EVENT's for nbi if !GASNETE_USE_ELAN_PUTGET, all put/gets are done using AM (extended-ref) GASNETE_USE_LONG_GETS works just like in extended-ref otherwise... get_nb: if elan-addressable dest use a simple elan_get else if < GASNETE_MAX_COPYBUFFER_SZ and mem available use a elan bouncebuf dest with an eop else use AM ref-ext med or long (if larger than 1 long, use get_nbi) put_nb: if bulk and elan-addressable src or < GASNETC_ELAN_SMALLPUTSZ (128) use a simple elan_put else if < GASNETE_MAX_COPYBUFFER_SZ and mem available copy to elan bouncebuf with an eop else use AM ref-ext med or long (if larger than 1 long, use put_nbi) get_nbi: if elan-addressable dest use a simple elan_get and add ELAN_EVENT to pgctrl (spin-poll if more than GASNETE_DEFAULT_NBI_THROTTLE(256) outstanding) else if < GASNETE_MAX_COPYBUFFER_SZ and mem available use a elan bouncebuf dest with an eop and add to nbi linked-list else use AM ref-ext put_nbi: if bulk and elan-addressable src or < GASNETC_ELAN_SMALLPUTSZ (128) use a simple elan_put and add ELAN_EVENT to pgctrl (spin-poll if more than GASNETE_DEFAULT_NBI_THROTTLE(256) outstanding) else if < GASNETE_MAX_COPYBUFFER_SZ and mem available copy to a elan bouncebuf src with an eop and add to nbi linked-list else use AM ref-ext barrier: if GASNET_BARRIER != ELANFAST && GASNET_BARRIER != ELANSLOW use AM (extended ref) else register a poll callback function at startup to ensure polling during hardware barrier if GASNET_BARRIER==ELANFAST and barrier anonymous mismatchers report to all nodes hardware elan barrier else hardware broadcast id from zero if node detects mismatch, report to all nodes hardware elan barrier