README file for vapi-conduit
================================
Paul H. Hargrove <PHHargrove@lbl.gov>
$Revision: 1.42 $

@ TOC: @
@ Section: Job Spawning @
@ Section: Multi-rail Support @
@ Section: Build-time Configuration @
@ Section: Runtime Configuration @
@ Section: Core API @
@ Section: Extended API @
@ Section: Graceful exits @
@ Section: TO DO @
@ Section: References @


@ Section: Job Spawning @
  
  If MPI support was NOT enabled when GASNet was configured, then
  only SSH based spawning will be supported and the following paragraph
  may be ignored.

  If MPI support was enabled when GASNet was configured, then there
  are two options for spawning a GASNet vapi-conduit application:
  MPI or SSH.  The default can be set at configure time with
      --with-vapi-spawner=ssh
  or  --with-vapi-spawner=mpi
  where mpi is the default.  At runtime the environment variable
  GASNET_VAPI_SPAWNER (set to "mpi" or "ssh") can override the
  value set at configuration time.

  If using UPC or Titanium, the language-specific commands should be used
  to launch applications.  Otherwise, applications can be launched using
  the gasnetrun_vapi utility:
  + usage summary:
    gasnetrun_vapi -n <n> [options] [--] prog [program args]
    options:
      -n <n>                number of processes to run (required)
      -N <N>                number of nodes to run on (not supported by all MPIs)
      -E <VAR1[,VAR2...]>   list of environment vars to propagate
      -v                    be verbose about what is happening
      -t                    test only, don't execute anything (implies -v)
      -k                    keep any temporary files created (implies -v)
      -spawner=(ssh|mpi)    force use of MPI or SSH for spawning (if available)
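
  As a hedged illustration, the usage summary above can be assembled
  into an argument list programmatically.  The Python helper below is
  hypothetical (build_gasnetrun_argv is not part of GASNet); it uses
  only the options listed above and builds the command without running
  it.

```python
def build_gasnetrun_argv(prog, nprocs, prog_args=(), nodes=None,
                         env_vars=(), spawner=None, verbose=False):
    """Assemble a gasnetrun_vapi command line from the documented options."""
    if nprocs < 1:
        raise ValueError("-n is required and must be positive")
    argv = ["gasnetrun_vapi", "-n", str(nprocs)]
    if nodes is not None:
        argv += ["-N", str(nodes)]          # not supported by all MPIs
    if env_vars:
        argv += ["-E", ",".join(env_vars)]  # comma-separated variable names
    if verbose:
        argv.append("-v")
    if spawner is not None:
        if spawner not in ("ssh", "mpi"):
            raise ValueError("spawner must be 'ssh' or 'mpi'")
        argv.append("-spawner=" + spawner)
    # The optional "--" separates the options from the program and its args.
    return argv + ["--", prog] + list(prog_args)
```

  For example, build_gasnetrun_argv("./a.out", 4, spawner="ssh") gives
  the argument list for: gasnetrun_vapi -n 4 -spawner=ssh -- ./a.out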

  If spawning using MPI, then the following apply:
  + In order to bootstrap vapi-conduit, a working MPI must be installed
    and configured on your system.  See mpi-conduit/README for
    information on configuring GASNet for a particular MPI.  Note that
    you must compile mpi-conduit as well (even if you never plan to use
    it).
  + MPI is only used in gasnet_init(), gasnet_attach() and gasnet_exit()
    and not for any GASNet calls between attach and exit.  Therefore it is
    acceptable to use a TCP/IP based MPI such as MPICH or LAM/MPI.
  + The environment variable MPIRUN_CMD can be used to configure how to
    invoke mpirun.  See mpi-conduit/README (or README-mpi) for more
    information.

  If spawning using SSH, the following apply:
  + The -E option is not necessary, as the full environment is always
    propagated to the application processes.
  + A list of hosts is specified using either the GASNET_SSH_NODEFILE
    or GASNET_SSH_SERVERS environment variables.
    If set, the variable GASNET_SSH_NODEFILE specifies a file with one
    hostname per line.  Blank lines and comment lines (using '#') are
    ignored.
    If set, the variable GASNET_SSH_SERVERS itself contains a list of
    hostnames, delimited by commas or whitespace.
    If both are set, GASNET_SSH_NODEFILE takes precedence.
  + The environment variable GASNET_SSH_CMD can be set to specify a
    specific remote shell (perhaps rsh).  The default is "ssh", and
    a search of $PATH resolves the full path.
  + The environment variable GASNET_SSH_OPTIONS can be set to
    specify options that will precede the hostname in the commands
    used to spawn jobs.  One example, for OpenSSH, would be
      GASNET_SSH_OPTIONS="-o 'StrictHostKeyChecking no'"
  + For the following, the term "compute node" means one of the hosts
    given by GASNET_SSH_NODEFILE or GASNET_SSH_SERVERS, which will run
    an application process.  The term "master node" means the node from
    which the job was spawned.  The master node may or may not be one
    of the compute nodes.
  + The ssh (or rsh) at your site must be configured to allow logins
    from the master node to compute nodes, and among the compute nodes.
    These must be achieved without interaction (such as entering a
    password or accepting new host keys).
  + Any firewall or port filtering must allow the ssh/rsh connections
    described above, plus TCP connections on untrusted ports (those
    with numbers over 1024) from a compute node to the master node
    and among compute nodes.
  + Resolution for all given hostnames must be possible from both the
    master node and the compute nodes.
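
  The host-list rules above can be sketched as follows.  This is an
  illustrative Python model, not the spawner's actual code: the function
  names are hypothetical, and treating a comment line as one whose first
  non-blank character is '#' is an assumption.

```python
import os


def parse_nodefile_text(text):
    """One hostname per line; blank lines and '#' comment lines are ignored."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):   # assumed comment syntax
            hosts.append(line)
    return hosts


def parse_servers(value):
    """Hostnames delimited by commas or whitespace."""
    return value.replace(",", " ").split()


def resolve_hosts(env=None):
    """GASNET_SSH_NODEFILE takes precedence over GASNET_SSH_SERVERS."""
    env = os.environ if env is None else env
    nodefile = env.get("GASNET_SSH_NODEFILE")
    if nodefile:
        with open(nodefile) as f:
            return parse_nodefile_text(f.read())
    servers = env.get("GASNET_SSH_SERVERS")
    if servers:
        return parse_servers(servers)
    return []
```

  Passing a plain dictionary as env makes the precedence rule easy to
  check without touching the real environment.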

@ Section: Multi-rail Support @

  Multi-rail support is ON in GASNet vapi-conduit by default.

  By default, GASNet vapi-conduit will open up to two InfiniBand Host
  Channel Adapters (HCAs) per node, and will stripe communications over
  one active port on each adapter.  See the sections "Build-time
  Configuration" and "Runtime Configuration" for information on
  how to open more or fewer HCAs/ports, or to control which HCAs/ports
  are used.

  To first order, the use of multiple ports or multiple adapters will
  yield increases in both bandwidth (good) and software overhead (bad).
  How the resulting trade-off works for a given application may be
  hard to predict.  If one is concerned with obtaining the maximum
  possible performance for a given application, then experiment with
  the GASNET_NUM_QPS and/or GASNET_VAPI_PORTS environment variables
  (documented in "Runtime Configuration") to determine how a given
  application runs best.

@ Section: Build-time Configuration @

  By default vapi-conduit ensures network attentiveness (timely
  processing of incoming AMs) by spawning an extra thread that
  remains blocked until the arrival of an Active Message.  One
  can disable this thread by configuring GASNet with the flag
  '--disable-vapi-rcv-thread'.  It is recommended that one NOT
  use this option, but instead disable the thread at runtime
  (see Runtime Configuration section).  If the extra thread will
  never be needed, disabling it at build time will yield a small
  reduction in latencies by allowing some locking operations to
  compile away.

  In firmware versions prior to 3.0, the VAPI_poll_cq() call is
  not thread safe under some conditions.  To run correctly with
  pre-3.0 firmware, one must configure GASNet with the flag
  '--enable-vapi-poll-lock'.  One may also pass this flag if
  there is reason to believe that VAPI_poll_cq() and/or
  EVAPI_peek_cq() are not thread-safe.
  If vapi-conduit detects pre-3.0 firmware at startup, a fatal
  error will be issued.

  By default, vapi-conduit will open at most two Host Channel Adapters
  (HCAs) on a node.  To utilize more than two HCAs in a host, specify
  '--with-vapi-max-hcas=N' at configure time.  However, if you
  have only a single HCA per host, then you may be able to get a small
  performance improvement by disabling multi-rail support with
  '--disable-vapi-multirail' at configure time.
  

@ Section: Runtime Configuration @

  There are a number of parameters in vapi-conduit which can be tuned
  at runtime via environment variables.

  Connection settings:
  Under normal conditions, Host Channel Adapters and Ports will be
  located automatically.  However, in the event you have multiple
  adapters or multiple active ports on a single adapter, you may wish
  to set these environment variables to identify the correct HCAs and
  Ports.  These parameters may legally take different values on each node.

  See "Build-time Configuration", above, for information on enabling
  use of multiple HCAs in GASNet vapi-conduit.

  + GASNET_HCA_ID
  + GASNET_PORT_NUM
    ** UNSUPPORTED **
    These environment variables, used in older releases, are no longer
    supported.  Setting them to anything but the empty string will
    result in a run-time warning.

  + GASNET_NUM_QPS
    This variable gives the number of VAPI Queue Pairs (QPs) over which to 
    stripe traffic between each pair of peers.  This can yield an increase
    in throughput and bandwidth when multiple physical ports are used
    on one or more adapters.
    If the number of QPs exceeds the number of available physical ports 
    then multiple QPs will be mapped round-robin to the ports.  Be aware
    that mapping multiple QPs per port may yield either a performance
    improvement or a degradation, depending on traffic pattern.
    The default is 0, which means one QP per HCA/port used.

  + GASNET_VAPI_PORTS
    By default, GASNet will open and use one active IB port on each HCA
    used, which will be all HCAs (when GASNET_NUM_QPS is zero), or the
    first GASNET_NUM_QPS HCAs found (when GASNET_NUM_QPS is non-zero).
    Setting GASNET_VAPI_PORTS will specify a filter for which ports will
    be used.  This can be used for instance to cause multiple physical
    ports to be used per HCA, or to specify specific ports and/or HCAs
    to be considered (up to GASNET_NUM_QPS if it is non-zero).
    This variable is a string of one or more HCA/port specifications,
    separated by '+' characters.  Each such specification gives an HCA
    identifier and an optional comma-separated list of port numbers.
    The list of port numbers, if provided, is separated from the HCA id
    by a ':'.  If a list of ports is given, only those ports may be used.
    Otherwise the first active port on the given HCA may be used.  The
    following example allows the first active port on HCA InfiniHost0,
    and only port 2 on InfiniHost1:
      GASNET_VAPI_PORTS="InfiniHost0+InfiniHost1:2"
    Note that this list is a *filter*, which means:
    + Duplicate entries do not cause multiple opens of a port or HCA
    + Entries describing nonexistent HCAs are silently ignored
    + Entries describing inactive ports are silently ignored
    + Order is not significant.  In particular if GASNET_NUM_QPS is
      less than the number of entries in GASNET_VAPI_PORTS, ports
      are opened in the order detected, regardless of their order
      in GASNET_VAPI_PORTS
    Note that the 'vstat' utility in most IB distributions will list
    the available HCAs and the status of their ports.
    The default is no filter.
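
  The GASNET_VAPI_PORTS syntax and the round-robin QP mapping described
  above can be modeled with a short Python sketch.  It is illustrative
  only and not taken from vapi-conduit: the function names are
  hypothetical, and the handling of an HCA that is listed both with and
  without a port list is an assumption.

```python
def parse_vapi_ports(spec):
    """Parse 'HCA[:port,...]+HCA...' into a set of (hca, port) pairs.

    A port of None stands for "the first active port on that HCA".
    A set is used because duplicates and order are not significant.
    """
    entries = set()
    for item in spec.split("+"):
        item = item.strip()
        if not item:
            continue
        hca, sep, port_list = item.partition(":")
        if sep:
            for port in port_list.split(","):
                entries.add((hca, int(port)))
        else:
            entries.add((hca, None))
    return entries


def port_allowed(entries, hca, port, is_first_active):
    """Decide whether one detected, active port passes the filter."""
    if not entries:
        return is_first_active          # no filter: one active port per HCA
    if (hca, port) in entries:
        return True
    return is_first_active and (hca, None) in entries


def map_qps_to_ports(num_qps, ports):
    """Assign QPs to the usable ports round-robin; 0 means one QP per port."""
    if not ports:
        raise ValueError("at least one usable port is required")
    if num_qps == 0:
        num_qps = len(ports)
    return [ports[i % len(ports)] for i in range(num_qps)]
```

  With two usable ports and GASNET_NUM_QPS=5, this model places three
  QPs on the first port and two on the second.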

  Software configuration settings:
  There are some optional behaviors in vapi-conduit that can be turned
  ON or OFF.  These parameters may legally take different values on each
  node, but doing so may not be useful.

  + GASNET_RCV_THREAD
    This gives a boolean: "0" to disable or "1" to enable the use
    of an extra thread that blocks waiting for vapi to wake it when an
    Active Message request arrives.  This allows vapi-conduit to be
    more attentive to incoming AMs even while the application may not
    be making any calls to GASNet.  The down side is that each time
    this thread wakes it must contend for CPU resources.  Thus for an
    application that is calling GASNet sufficiently often, use of this
    thread may significantly INCREASE running time.  However, on an SMP
    where an otherwise idle processor is available the use of this
    thread can REDUCE running time by relieving the application thread
    of the burden of servicing incoming AM requests and replies.
    Note that if '--disable-vapi-rcv-thread' was specified at build time
    then the extra thread is unavailable and this environment variable
    is ignored.
    Currently the default is disabled (0), but this is subject to change
    in a future version.

  Protocol configuration:
  The following environment variables control the selection of protocols
  for performing certain transfers.

  These parameters must be equal across all nodes, and the behavior
  otherwise is undefined.

  + GASNET_PHYSMEM_MAX
    If non-zero this parameter tells vapi-conduit the maximum amount of
    physical memory to pin.  The suffixes "M" and "G" are interpreted as
    Megabytes and Gigabytes respectively.  The current default is zero,
    which means to probe the limits imposed by the O/S and HCA.
    Setting to a non-zero value will limit how large the GASNet segment
    can be, and how much memory is available for firehose (see below),
    but may speed startup by bounding the probe.

  + GASNET_PHYSMEM_NOPROBE
    This gives a boolean: "0" to disable or "1" to enable the use of
    GASNET_PHYSMEM_MAX without probing.
    Enabling this setting may greatly speed startup, but can lead to
    unexpected runtime failures if GASNET_PHYSMEM_MAX exceeds the limits
    imposed by the O/S and HCA.
    If GASNET_PHYSMEM_MAX is zero (or unset) this variable is ignored.
    The default is OFF.

  + GASNET_INLINESEND_LIMIT
    VAPI includes an "inline send" operation that transfers the data to
    the HCA at the same time it transfers the request.  This normally
    provides a measurable performance improvement, but is only available
    up to a firmware-dependent maximum size.  Therefore, the default of
    72 is normally correct.
   
    In firmware versions after 1.17 and prior to 3.0, there is a
    performance anomaly with the call EVAPI_post_inline_sr().
    By default, vapi-conduit will use this call to perform small
    PUTs.  If firmware in the range [1.18, 3.0) is detected at
    runtime, a warning will be printed but the call will still
    be used.  In this case (or if you believe that the same anomaly
    is affecting some other version) you should try setting this
    variable to 0.

  + GASNET_PACKEDLONG_LIMIT
    To perform an AMLong or AMLongAsync with non-empty payload,
    vapi-conduit must transfer both the payload and the header.  For
    sufficiently small payloads, it is more efficient (in terms of both CPU
    overhead and network latency) to pack the header and payload together
    and copy the payload into place on the target before running the
    handler.  Thus, for payloads up to and including this size this packing
    is used.
    The default value is the maximum that fits into a 4KB buffer together
    with the maximum sized header (currently 4012).
    A value of zero ensures the payload and header always travel separately.
    
  + GASNET_NONBULKPUT_BOUNCE_LIMIT
    To perform a non-bulk PUT with nbytes > GASNET_INLINESEND_LIMIT or to
    transfer the payload of an AMLong (but not AMLongAsync) with nbytes >
    MAX(GASNET_INLINESEND_LIMIT, GASNET_PACKEDLONG_LIMIT), vapi-conduit must
    either copy the data into bounce buffers, or block until remote
    completion is signalled by the HCA.  Such transfers up to and including
    size GASNET_NONBULKPUT_BOUNCE_LIMIT are performed using bounce buffers
    while larger transfers are performed using blocking PUTs.
    The default value is 64KB.
    A value of zero disables use of bounce buffers.
    
  + GASNET_PUTINMOVE_LIMIT (only for GASNET_SEGMENT_{LARGE,EVERYTHING})
    When the firehose algorithm (see below) is in use for managing the
    pinning of remote memory, a PUT that misses in the firehose cache
    may be accelerated by piggybacking data on the AMMedium that is
    used to obtain a remote pinning.  The value of GASNET_PUTINMOVE_LIMIT
    is the maximum number of bytes to send in this way.  The value is
    bounded by the maximum value set at compile time, and it is an
    error to request a larger value.
    Note that in a GASNET_SEGMENT_FAST configuration, the remote segment
    is pinned statically and this optimization is never applicable.
    The default value is 3KB (the current maximum value).
    A value of zero disables this optimization.
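
  The size thresholds given above for a non-bulk PUT can be summarized
  in a small Python sketch.  It models only the three-way choice implied
  by GASNET_INLINESEND_LIMIT and GASNET_NONBULKPUT_BOUNCE_LIMIT; the
  function name is hypothetical, and the firehose and AMLong paths are
  not modeled.

```python
INLINESEND_LIMIT = 72                  # documented default, in bytes
NONBULKPUT_BOUNCE_LIMIT = 64 * 1024    # documented default (64KB)


def nonbulk_put_protocol(nbytes, inline_limit=INLINESEND_LIMIT,
                         bounce_limit=NONBULKPUT_BOUNCE_LIMIT):
    """Return which mechanism this simplified model picks for a PUT."""
    if nbytes <= inline_limit:
        return "inline"      # data travels with the request (inline send)
    if nbytes <= bounce_limit:
        return "bounce"      # copy through a pre-pinned bounce buffer
    return "blocking"        # wait for remote completion before returning
```

  Setting bounce_limit to zero in this model sends every PUT above the
  inline limit down the blocking path, matching the statement that a
  value of zero disables bounce buffers.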

  Resource usage parameters:
  The following environment variables control how much memory is
  preallocated at startup time to serve various functions.  Because these
  resource pools do not grow dynamically, it is important that these
  parameters be sufficiently large, or performance degradations may
  result.  The default settings should be sufficient for most conditions.
  You may need to lower some values if you have insufficient memory.

  These parameters must be equal across all nodes, and the behavior
  otherwise is undefined.

  + GASNET_NETWORKDEPTH_PP
    This gives the maximum number of ops (RDMA + AMs) which can be
    in-flight simultaneously from a node to each of its peers.  Here
    "in-flight" means queued to the send work queue and not yet reaped
    from the send completion queue.  This value is the depth of each
    send work queue.  This limit is on the number of vapi-level ops
    in-flight, and the number of GASNet-level operations may be less
    (for example, when the remote range of a PUT or GET covers more
    than one pinned region, due to GASNET_PIN_MAXSZ, or because an AM
    Long uses separate ops for the payload and header).
    The default value is 64.
    Reducing this parameter may limit small message throughput.  If you
    believe your small message throughput is too low, you may try
    increasing this value.

  + GASNET_NETWORKDEPTH_TOTAL
    This gives the maximum number of ops (RDMA + AMs) which can be
    in-flight simultaneously from each node (with "in-flight" defined as
    in GASNET_NETWORKDEPTH_PP.)  The depth of the send completion queue
    is min(GASNET_NETWORKDEPTH_TOTAL, GASNET_NETWORKDEPTH_PP*(N-1)).
    If set to zero, the value is set to the maximum usable value computed
    from GASNET_NETWORKDEPTH_PP and the HCA's reported capabilities.
    The default value is 0 (compute automatically).
    Reducing this parameter may limit small message throughput.  If you
    believe your small message throughput is too low, you may try
    increasing this value (or setting it to zero), at a cost in
    additional memory consumption.

  + GASNET_AM_CREDITS_PP
    This gives the maximum number of AM Requests which can be in-flight
    simultaneously from a node to each of its peers.  Here "in-flight"
    means the Request is queued to the send work queue, but the matching
    Reply has not yet been processed for AM flow control (described in
    another section of this README).  This is the number of buffers which
    must be preposted to each receive work queue for AM Requests.
    This value should be less than or equal to GASNET_NETWORKDEPTH_PP,
    and will be trimmed to GASNET_NETWORKDEPTH_PP if larger.
    The default value is 32 (128KB*(N-1) allocated for Request buffers).
    Reducing this parameter may limit Active Message throughput.  If you
    believe your Active Message throughput is too low, you may try
    increasing this value.

  + GASNET_AM_CREDITS_TOTAL
    This gives the integer number of AM Requests which can be in-flight
    simultaneously from each node, with "in-flight" defined as in
    GASNET_AM_CREDITS_PP.  This is the number of receive buffers which
    will be allocated for posting to an endpoint for the AM Reply which
    follows each AM Request.
    If set to zero, the value is set to the maximum usable value computed
    from GASNET_AM_CREDITS_PP and the HCA's reported capabilities.
    The default value is MAX(1024, nodes).
    This value should be less than or equal to GASNET_NETWORKDEPTH_TOTAL,
    and will be trimmed to GASNET_NETWORKDEPTH_TOTAL if larger.
    Reducing this parameter may limit Active Message throughput.  If you
    believe your Active Message throughput is too low, you may try
    increasing this value (or setting it to zero), at a cost in additional
    pinned memory.

  + GASNET_AM_CREDITS_SLACK
    This gives the maximum number of flow-control credits that can be
    delayed at the responder.  If a Request handler does not produce a
    Reply, a credit may be "banked" to be piggy-backed on the next
    Request or Reply headed to the requesting node.  The value of
    GASNET_AM_CREDITS_SLACK gives the maximum number of credits that can
    be banked before a hidden Reply is generated to convey credits back
    to the requester.
    The default value is 1.
    GASNET_AM_CREDITS_SLACK will be silently reduced if needed to
    ensure deadlock will not occur.
    Reducing this parameter to zero or setting it too high may
    increase the latency of Active Message traffic.

  + GASNET_BBUF_COUNT
    This gives the number of pre-pinned buffers allocated on each node.
    These buffers are needed for assembly of AM headers and the payload
    of mediums, and for some PUTs (see GASNET_NONBULKPUT_BOUNCE_LIMIT).
    The actual number of buffers allocated is the lesser of the values of
    GASNET_BBUF_COUNT and GASNET_NETWORKDEPTH_TOTAL, since the total
    network depth bounds the number of in-flight operations that might
    need these buffers.
    If set to zero, the value is set to GASNET_NETWORKDEPTH_TOTAL.
    The default value is 1024 (up to 4MB for buffers).
    Reducing this parameter limits the number of in-flight operations
    which consume bounce buffers.  This includes AMs too large for an
    inline send and PUTs subject to the GASNET_NONBULKPUT_BOUNCE_LIMIT.
    If you believe that throughput of these operations is too small, you
    may try increasing this value (or setting it to zero), at a cost in
    additional pinned memory.
   
  + GASNET_PIN_MAXSZ
    This gives the maximum size of a single VAPI pinned region, and must
    be a power of two.  If the GASNet segment is larger than this size in a
    GASNET_SEGMENT_FAST configuration, then it will be pinned using multiple
    regions.  This works around an observed bug in Mellanox HCAs and/or
    drivers in which large regions lead to premature thrashing of the HCA's
    on-board TPT (Translation and Protection Table) cache while pinning
    the same memory with multiple smaller regions does not.
    If you see throughput of large messages dropping off sharply with an
    increase in message size, you may try reducing this value.
    The suffixes "K", "M" and "G" are interpreted as Kilobytes, Megabytes
    and Gigabytes respectively.  The current default is 256K.
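
  The sizing rules for the resource parameters above reduce to a few
  lines of arithmetic, sketched here in Python.  The function names are
  hypothetical.  The 4KB buffer size is an inference from the figures
  quoted above (32 credits giving 128KB per peer, 1024 buffers giving
  4MB), and the automatic computation used when a TOTAL parameter is
  zero depends on HCA capabilities and is not modeled.

```python
BUFFER_SIZE = 4 * 1024   # assumed 4KB per buffer (inferred, see above)


def send_cq_depth(depth_total, depth_pp, nodes):
    """min(GASNET_NETWORKDEPTH_TOTAL, GASNET_NETWORKDEPTH_PP*(N-1))."""
    if depth_total == 0:
        raise ValueError("0 means 'derive from the HCA'; not modeled here")
    return min(depth_total, depth_pp * (nodes - 1))


def am_credits_pp(credits_pp, depth_pp):
    """GASNET_AM_CREDITS_PP is trimmed to GASNET_NETWORKDEPTH_PP."""
    return min(credits_pp, depth_pp)


def bounce_buffer_count(bbuf_count, depth_total):
    """Lesser of GASNET_BBUF_COUNT and the total depth; 0 means the depth."""
    return depth_total if bbuf_count == 0 else min(bbuf_count, depth_total)


def pinned_region_count(segment_bytes, pin_maxsz):
    """Regions needed to pin a segment in pieces of at most pin_maxsz."""
    if pin_maxsz <= 0 or pin_maxsz & (pin_maxsz - 1):
        raise ValueError("GASNET_PIN_MAXSZ must be a power of two")
    return -(-segment_bytes // pin_maxsz)   # ceiling division
```

  For instance, with the default per-peer depth of 64 on 4 nodes and a
  total depth of 1024, the model gives a send completion queue depth of
  min(1024, 64*3) = 192.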
   
  Firehose configuration:
  These parameters must be equal across all nodes, and the behavior
  otherwise is undefined.

  The following environment variables control the resources used by the
  "firehose" [ref 1] dynamic registration library.  By default firehose
  will use as much pinned memory as the HCA and O/S will permit.

  Resource use is divided into two pools.  The main pool is for managing
  the pinning of the GASNet segment on remote nodes, while the "victim"
  pool is used to manage pinnings for local use.  By default in a
  GASNET_SEGMENT_LARGE or GASNET_SEGMENT_EVERYTHING configuration, 75%
  of the pinnable memory will go in the main pool and 25% into the victim
  pool.  In a GASNET_SEGMENT_FAST configuration, firehose is not needed
  for management of the statically pinned GASNet segment, and by default
  only a small fraction of the available memory is placed in the main
  pool and the majority is placed in the victim pool.

  + GASNET_USE_FIREHOSE
    This environment variable is only available in a DEBUG build of
    GASNet (one configured with --enable-debug).
    This gives a boolean: "0" to disable or "1" to enable the use
    of the firehose dynamic pinning library in a GASNET_SEGMENT_FAST
    configuration.  In a GASNET_SEGMENT_FAST configuration, the GASNet
    segment is registered (pinned) with the HCA at initialization time,
    because pinning is required for RDMA.  However, GASNet allows for
    local addresses (source of a PUT or destination of a GET) to lie
    outside of the GASNet segment.  So, to perform RDMA GETs and PUTs,
    vapi-conduit must either copy out-of-segment transfers through
    preregistered bounce buffers, or dynamically register memory.  By
    default firehose is used to manage registration of out-of-segment
    memory.  (default is ON).
    Setting this environment variable to "0" (or "no") will disable use
    of firehose, forcing the use of bounce buffers for out-of-segment
    transfers.  This will result in a significantly lower peak bandwidth
    for large PUTs and GETs, with little or no effect on small message
    latency.  It is available only for debugging purposes.
    In a GASNET_SEGMENT_LARGE or GASNET_SEGMENT_EVERYTHING configuration,
    the GASNet segment is not preregistered and use of firehose is
    required.  Thus it is an error to disable firehose in such a
    configuration.

  + GASNET_FIREHOSE_M
    This gives the amount of memory to place in the main pool.  The
    suffixes "K", "M" and "G" are interpreted as Kilobytes, Megabytes
    and Gigabytes respectively, with "M" assumed if no suffix is given.
    When GASNET_FIREHOSE_MAXVICTIM_M is set, the default is the maximum
    pinnable memory minus GASNET_FIREHOSE_MAXVICTIM_M.  Otherwise the
    default is 75% of the maximum pinnable memory (in a GASNET_SEGMENT_LARGE
    or GASNET_SEGMENT_EVERYTHING configuration), or the size of the
    prepinned bounce buffer pool (in a GASNET_SEGMENT_FAST configuration).

  + GASNET_FIREHOSE_MAXVICTIM_M
    This gives the amount of memory to place in the victim (local) pool.
    The suffixes "K", "M" and "G" are interpreted as Kilobytes, Megabytes
    and Gigabytes respectively, with "M" assumed if no suffix is given.
    The default is the maximum pinnable memory minus GASNET_FIREHOSE_M.

  + GASNET_FIREHOSE_MAXREGION_SIZE
    This gives the maximum size of a single dynamically pinned region,
    should be a multiple of the pagesize, and preferably a power of two.
    The suffixes "K", "M" and "G" are interpreted as Kilobytes, Megabytes
    and Gigabytes respectively, with "M" assumed if no suffix is given.
    The current default is 128k.  Larger values have been known to trigger
    a performance anomaly in some HCAs.

  + GASNET_FIREHOSE_R
    This gives the number of pinned regions to allocate for the management
    of the main pool.  Values will be truncated if larger than the
    default of (GASNET_FIREHOSE_M / GASNET_FIREHOSE_MAXREGION_SIZE).

  + GASNET_FIREHOSE_MAXVICTIM_R
    This gives the number of pinned regions to allocate for the management
    of the victim (local) pool.  Values will be truncated if larger than
    the default of (GASNET_FIREHOSE_MAXVICTIM_M /
    GASNET_FIREHOSE_MAXREGION_SIZE).

  + GASNET_FIREHOSE_VERBOSE
    This gives a boolean: "0" to disable or "1" to enable the output of
    internal information of use to the developers.  You may be asked
    to run with this environment variable set if you report a bug that
    appears related to the firehose algorithm. 
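
  The defaults described above for the firehose pools can be worked
  through in a brief Python sketch.  It covers only the
  GASNET_SEGMENT_LARGE and GASNET_SEGMENT_EVERYTHING case; the function
  names are hypothetical, and binary units (1K = 1024 bytes) and
  case-insensitive suffixes are assumptions.

```python
_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}   # binary units assumed


def parse_size(text, default_unit="M"):
    """Parse '128K', '2G' or a bare number ('M' assumed if no suffix)."""
    text = text.strip().upper()
    if text[-1:] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text) * _UNITS[default_unit]


def firehose_pools(max_pinnable, firehose_m=None, maxvictim_m=None):
    """Return (main_pool, victim_pool) in bytes, applying the defaults."""
    if firehose_m is None:
        if maxvictim_m is not None:
            firehose_m = max_pinnable - maxvictim_m
        else:
            firehose_m = (max_pinnable * 3) // 4     # 75% to the main pool
    if maxvictim_m is None:
        maxvictim_m = max_pinnable - firehose_m
    return firehose_m, maxvictim_m


def region_count(pool_bytes, maxregion_size, requested=None):
    """Default is pool / region size; larger requests are truncated."""
    default = pool_bytes // maxregion_size
    return default if requested is None else min(requested, default)
```

  With 1GB of pinnable memory and nothing set, this model yields a 768MB
  main pool and a 256MB victim pool; at the default region size of 128k
  the main pool would then be managed with 6144 regions.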

@ Section: Core API @

+ Flow-control for AMs.

  The AMs in vapi-conduit are just implemented as send/recv traffic.
  Therefore a send without a corresponding recv buffer preposted at the
  peer will be stalled by the RNR (receiver-not-ready) flow control
  in IB.  However there are two reasons why we want to avoid this
  situation.  The first is that if such a send is blocked by flow
  control, then the ordering semantics of IB tell us that all the
  gets and puts that we've initiated after the AM was sent are also
  stalled.  Rather than let that happen, we should manually delay
  those which are dependent on the AM.  The second reason is that
  under some conditions the RNR flow control is very poor.  The problem
  is that once the intended receiver sends a RNR NAK to indicate no
  available recv buffers, IB has the SENDER's hardware/firmware poll
  for the receiver to become ready again!  That leaves us with a choice
  between configuring a small polling interval and consuming a lot of
  bandwidth for this polling, or a large interval which leads to 
  performance which is degraded more than necessary when IB flow control
  is asserted.

  For these reasons we implement some flow control at the AM level.
  The basic idea is that every REQUEST consumes one credit on the
  sending endpoint and every REPLY grants one credit on the receiving
  endpoint.  Thus if M is the initial number of credits on each endpoint
  and every REQUEST has exactly one matching REPLY, then M becomes a
  limit on the number of un-acknowledged REQUESTS in flight on an
  endpoint.  If we want to avoid RNR conditions, then we should start
  with M credits and M preposted recv buffers on each endpoint.  This
  allows for only the receipt of M REQUESTS.  In addition, a recv buffer
  will be posted on demand for a REPLY just before sending each REQUEST.

  It is a simple matter to count the credits when a REPLY is received
  and to poll for credits when needed to send a REQUEST.  It is also
  simple to ensure the exactly-one-reply.  We already ensure that
  at-most-one reply is sent by the request handler.  Additionally we
  must check upon handler return for the case that the request handler
  sent no reply, and send one implicitly.  We just use a special
  "system category" handler, gasnetc_SYS_ack, which doesn't even run
  a handler.

  To avoid using up 1/2 our bandwidth in the event of a REQUEST-REQUEST
  ping-pong, we perform some coalescing to avoid sending too many
  SYS_ack REPLIES.  We keep up to GASNET_AM_CREDITS_SLACK "banked" on
  the responding node, sending the SYS_ack REPLY only if the number
  banked exceeds this limit.  Credits which are banked get piggybacked
  on the next REQUEST or REPLY headed back to the original requester.
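
  The credit accounting and banking described so far can be modeled in
  a few lines of Python.  This is a toy model under stated assumptions,
  not the conduit's code: the class and method names are hypothetical,
  and it assumes that any REPLY (explicit or SYS_ack) carries every
  credit owed at that moment.

```python
class Requester:
    """Tracks the AM credits one node holds for a single peer."""

    def __init__(self, credits):
        self.credits = credits

    def try_send_request(self):
        """Consume one credit; False means the sender must poll first."""
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def receive(self, carried_credits):
        """A REQUEST or REPLY from the peer may carry credits back."""
        self.credits += carried_credits


class Responder:
    """Banks credits for REQUESTs whose handlers sent no REPLY."""

    def __init__(self, slack):
        self.slack = slack      # models GASNET_AM_CREDITS_SLACK
        self.banked = 0

    def after_handler(self, handler_replied):
        """Return the credits sent back now (0 if they were banked)."""
        self.banked += 1        # the REQUEST just handled owes one credit
        if handler_replied or self.banked > self.slack:
            # An explicit REPLY, or a hidden SYS_ack, carries what is owed.
            carried, self.banked = self.banked, 0
            return carried
        return 0

    def piggyback(self):
        """Credits to attach to the next REQUEST headed to the requester."""
        carried, self.banked = self.banked, 0
        return carried
```

  With a slack of 1, the first reply-less REQUEST is banked and the
  second triggers a SYS_ack that returns both credits at once.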

  To avoid a window of time between when we send a REPLY (credit) and
  when we post the recv buffer, we must post the replacement recv
  buffer BEFORE running an AM REQUEST handler.  To do this we keep a
  pool of unposted recv buffers (also used for the on-demand posting
  of buffers needed for REPLIES).  So, when we recv an AM REQUEST, we
  grab a free recv buffer from the pool and post it to the endpoint,
  and only then run the handler.  We send an implicit reply if a REQUEST
  handler didn't send any REPLY.  Finally we take the recv buffer
  containing the just-processed AM and we return it to the unposted
  pool.

  There is a corner case we must deal with when there are no spares
  left in the unposted pool.  In this case we will copy the received
  REQUEST into a temporary (non-pinned) buffer before processing it.
  This allows us to repost the recv buffer immediately.  Since the
  temporary buffer is not pinned, it cannot be used for receives.
  Therefore, we free the temporary buffer when the handler is done,
  rather than placing it in the unposted pool.
  
  If we reap multiple AMs in a single Poll, then we reuse the
  previous buffer as the "spare" for the next one, in place of
  grabbing one from the unposted pool each time.  Thus, we touch the
  unposted pool at most twice per Poll, once for the first AM we
  receive and once at the end to put the recv buffer of the final AM
  back in the unposted pool.  For the dedicated receive thread we can
  do even better, never touching the unposted pool at all, by always
  keeping a single thread-local "spare", initially acquired at startup.
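
  The ordering rules for receive buffers (post a replacement before the
  handler runs, and fall back to a temporary copy when no spare exists)
  can be illustrated with a toy Python model.  Buffers are plain labels
  here, the names are hypothetical, and neither the implicit reply nor
  the reuse of buffers across several AMs in one Poll is modeled.

```python
class RecvBuffers:
    """Toy model of the unposted-buffer pool; buffers are just labels."""

    def __init__(self, spares):
        self.unposted = list(spares)   # pinned buffers not currently posted
        self.posted = []               # buffers posted to the endpoint
        self.events = []               # order of operations, for inspection

    def on_request(self, recv_buf, handler):
        if self.unposted:
            # Normal case: post a replacement BEFORE running the handler.
            self.posted.append(self.unposted.pop())
            self.events.append("post")
            handler(recv_buf)
            self.events.append("handler")
            self.unposted.append(recv_buf)   # processed buffer rejoins pool
        else:
            # Corner case: copy to a non-pinned temporary, repost at once.
            temp = "copy of " + recv_buf
            self.posted.append(recv_buf)
            self.events.append("repost")
            handler(temp)
            self.events.append("handler")
            # temp is dropped here: it is not pinned, so it never joins
            # the unposted pool.
```

  In both branches a receive buffer is posted before the handler runs,
  which is the property that closes the window described above.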

@ Section: Extended API @

Notes for myself for extended API:

+ The send completion facility consists of two pointers to counters,
  associated with each sbuf.  If these pointers are non-NULL then the
  counter is decremented atomically when the send is complete.
  
  One counter is for awaiting reuse of local memory and is
  used only for sbufs which are doing zero copy.  This counter
  provides the mechanism for Longs and non-bulk puts to block before
  they return, and should be allocated as an automatic variable.

  The second counter is for request completion and should be non-NULL
  for every sbuf for which request completion would be checked (all
  gets & puts, but not the Longs).  For nb and nbi the counter is
  waited on at sync-time.  Therefore the explicit handle is a struct
  containing the counter.
  
+ Similar to the reference implementation's cut-off between Mediums
  (which typically do a source-side copy) and Longs (which may not),
  we have a cut-off size, below which the RDMA-put operation will do
  source-side copies _iff_ local completion is desired (Long, put_nb,
  and put_nbi).

+ The gets are done w/ RDMA-reads, and use the sbuf bounce buffers
  if the local memory is not in the segment (or otherwise registered).
  The value gets also pass through the bounce buffers.  Clearly there
  is no bulk/non-bulk distinction in terms of local memory reuse, just
  the alignment and optimal size distinctions.  So, only the outstanding
  request counter on the sbuf is needed for syncs of all types of gets.

+ Table of when synchronization is needed
	              Local Remote
	Operation     Sync  Sync
	--------------------------
	LongAsync       X     X
	Long            I     X

	put_nb          I     S
	put_nbi         I     S
	put_nb_bulk     X     S
	put_nbi_bulk    X     S
	put_nb_val      X     S
	put_nbi_val     X     S
	put             X     I
	put_bulk        X     I
	put_val         X     I

	get_nb          X     S
	get_nbi         X     S
	get_nb_bulk     X     S
	get_nbi_bulk    X     S
	get_nb_val      X     S
	get_nbi_val (DOES NOT EXIST)
	get             X     I
	get_bulk        X     I
	get_val         X     I

   X = Not needed at all (or not even applicable with _val forms)
   I = Needed before (I)nitiating function returns
   S = Needed before (S)ynchronizing function returns
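
  For reference, the same table expressed as a C lookup.  The type and
  function names are made up for this sketch; only the X/I/S entries
  are taken from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* X = not needed, I = before initiating call returns,
 * S = before synchronizing call returns. */
typedef enum { SYNC_X, SYNC_I, SYNC_S } sync_t;
typedef struct { const char *op; sync_t local, remote; } sync_row_t;

static const sync_row_t sync_table[] = {
    { "LongAsync",    SYNC_X, SYNC_X },
    { "Long",         SYNC_I, SYNC_X },
    { "put_nb",       SYNC_I, SYNC_S },
    { "put_nbi",      SYNC_I, SYNC_S },
    { "put_nb_bulk",  SYNC_X, SYNC_S },
    { "put_nbi_bulk", SYNC_X, SYNC_S },
    { "put_nb_val",   SYNC_X, SYNC_S },
    { "put_nbi_val",  SYNC_X, SYNC_S },
    { "put",          SYNC_X, SYNC_I },
    { "put_bulk",     SYNC_X, SYNC_I },
    { "put_val",      SYNC_X, SYNC_I },
    { "get_nb",       SYNC_X, SYNC_S },
    { "get_nbi",      SYNC_X, SYNC_S },
    { "get_nb_bulk",  SYNC_X, SYNC_S },
    { "get_nbi_bulk", SYNC_X, SYNC_S },
    { "get_nb_val",   SYNC_X, SYNC_S },
    { "get",          SYNC_X, SYNC_I },
    { "get_bulk",     SYNC_X, SYNC_I },
    { "get_val",      SYNC_X, SYNC_I },
};

/* Returns the row for 'op', or NULL if there is no such operation. */
static const sync_row_t *sync_lookup(const char *op) {
    assert(op != NULL);
    for (size_t i = 0; i < sizeof sync_table / sizeof sync_table[0]; i++)
        if (strcmp(sync_table[i].op, op) == 0) return &sync_table[i];
    return NULL;
}
```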

+ Some minor tweaks are used to avoid allocation of counters in
  some cases.
  - For all the functions which require waiting on a counter in the
    initiating function, the counter can be allocated on the stack (as
    an automatic variable).
  - For the implicit-handle forms the request counter is in the
    thread-specific data, possibly in an access-region.
  - For the explicit handle forms the request counter must be allocated
    from some pool, requiring some memory management work.  This is
    done with a modification to the code from the reference
    implementation, and uses thread-local data to avoid locks.
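
  One way to get a lock-free pool for explicit handles is a per-thread
  free list.  The sketch below is NOT the modified reference
  implementation code mentioned above; it is a simplified illustration
  with hypothetical names (eop_t, eop_alloc, eop_release), using C11
  _Thread_local.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical explicit handle: a struct containing the request counter. */
typedef struct eop {
    atomic_int  req_oust;  /* outstanding requests on this handle */
    struct eop *next;      /* free-list link */
} eop_t;

/* Each thread owns its own free list, so no lock is needed. */
static _Thread_local eop_t *eop_free = NULL;

static eop_t *eop_alloc(void) {
    eop_t *e = eop_free;
    if (e) {
        eop_free = e->next;          /* reuse a recycled handle */
    } else {
        e = malloc(sizeof *e);       /* grow the pool on demand */
        assert(e != NULL);
        atomic_init(&e->req_oust, 0);
    }
    e->next = NULL;
    return e;
}

/* Return a fully synced handle to the calling thread's free list. */
static void eop_release(eop_t *e) {
    assert(atomic_load(&e->req_oust) == 0);
    e->next = eop_free;
    eop_free = e;
}
```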

+ The memsets can be more efficiently implemented as a _local_ memset
  followed by a PUT, for small enough sizes.  The cutoff is
  presently the size of one bounce buffer, but has not been tuned.

  This was disabled when GASNET_PIN_MAXSZ was introduced.  Therefore,
  all memsets are currently done by Active Messages.
  

@ Section: Graceful exits @

As of June 24, 2003, vapi-conduit passes all 9 (I added two recently)
of the cases in testexit.  By "Pass" I mean that the entire gasnet job
(tested up to 8-way across my 4 dual-processor machines) terminates
with no orphans, and with tracing properly finalized (if tracing is
enabled).  On August 11, 2003 the graceful exit code was revised to
send O(N) network traffic in the worst case, as opposed to the O(N^2)
required in all cases in the first implementation.

Additionally, the exit code is properly propagated through the
bootstrap, to yield a correct exit code for the parallel job as a
whole.  If using MPI for bootstrapping, the actual exit code will
depend on supported in a given MPI implementation (some ignore the
exit code of the individual processes).

This code is heavily commented, but for the curious, here is a
description of the code.

There are three paths by which an exit request can begin.  The first
is through gasnetc_exit(), which may be called by the user, by the
conduit in certain error cases, and by the default signal handler for
"termination signals".  The second is via a remote exit request,
passed between nodes to ensure full-job termination from
non-collective exits.  The third is via an atexit() handler,
registered by gasnetc_init(), used to catch returns from main() and
user calls to exit().

There are slight variations among the code in these three cases, but
most of the work is common, and is performed by three functions:
gasnetc_exit_head(), gasnetc_exit_body() and gasnetc_exit_tail().  The
first of these, _head, is used to determine the "first" exit and store
its exit code for later use.  This is important because even a
collective exit will involve receiving remote exit requests.  Only if
a remote exit request is received before any local calls to
gasnetc_exit(), should the request handler initiate the exit.  Note
that even in the case of a collective exit it is possible for the
first remote request to arrive before the local gasnetc_exit() call.
However, that is made very unlikely by the timing and is nearly
harmless since the only difference is the raising of SIGQUIT in
response to a remote exit request, which is not done for
locally-initiated ones.
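
The "first exit wins" logic of _head can be sketched with a single
compare-and-swap.  The names below are hypothetical and the real
function may differ; the sketch also assumes exit codes are
non-negative so that -1 can mark "no exit yet".

```c
#include <assert.h>
#include <stdatomic.h>

#define EXIT_NONE (-1)  /* sentinel: no exit has been recorded yet */

static atomic_int first_exit_code = EXIT_NONE;

/* Returns 1 only for the first caller, whose exit code is recorded.
 * Later callers (local or remote) get 0 and their code is ignored. */
static int exit_head(int code) {
    int expected = EXIT_NONE;
    assert(code >= 0);
    return atomic_compare_exchange_strong(&first_exit_code, &expected, code);
}

/* The code that the rest of the shutdown path should report. */
static int exit_saved_code(void) {
    return atomic_load(&first_exit_code);
}
```

A remote exit request handler would initiate the exit only when
exit_head() returns 1.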

The second common function, _body(), is used to perform the "meat" of
the shutdown.  It begins by ignoring SIGQUIT to avoid re-entrance, and
then blocks all but the first caller in a polling loop to prevent
multiple threads from executing the shutdown code.  Because strange
things can happen if we are trying to shut down from a signal context,
a signal handler is installed for all the "abort signals".  This
signal handler just calls _exit() with the exit code stored by
_head().  Because we may have problems shutting down if certain locks
were held when a signal arrived, we also install the signal handler
for SIGALRM, and use the alarm() function to bound the time spent
blocked in the shutdown code.  While there is the risk that this alarm
might go off "too soon" if the shutdown has lots of work to do, we can
be certain that the correct exit code is still generated.
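
A minimal POSIX sketch of this guard follows.  The function names and
the particular list of "abort signals" are assumptions made for the
example; the conduit's own list is not given here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

static volatile sig_atomic_t saved_exit_code = 0;

/* Bail out at once with the stored code; _exit() is async-signal-safe. */
static void exit_sighandler(int sig) {
    (void)sig;
    _exit(saved_exit_code);
}

/* Assumed set of "abort signals", plus SIGALRM for the timeout. */
static const int guard_signals[] = {
    SIGABRT, SIGILL, SIGSEGV, SIGBUS, SIGFPE, SIGALRM
};

/* Install the handler and bound the time spent in shutdown.
 * Returns 0 on success, -1 if a handler could not be installed. */
static int install_exit_guards(int exitcode, unsigned timeout_secs) {
    struct sigaction sa;
    assert(timeout_secs > 0);
    saved_exit_code = exitcode;
    sa.sa_handler = exit_sighandler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    for (size_t i = 0; i < sizeof guard_signals / sizeof guard_signals[0]; i++)
        if (sigaction(guard_signals[i], &sa, NULL) != 0) return -1;
    alarm(timeout_secs);  /* SIGALRM fires if shutdown gets wedged */
    return 0;
}
```

Calling alarm(0) once the protected code finishes cancels the timer.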

Once the signal handlers are established, _body closes down the
tracing and stats gathering and flushes stdout and stderr.  Then _body
calls gasnetc_get_exit_role() to "elect" a master node for the exit.
This is done with an alarm() timer in force.  The use of an "election"
with a timeout ensures that we will exit, even if node 0 is wedged.
The election of a master proceeds by sending a system-category AM
request to node 0, and spinning to wait for a corresponding reply,
which will indicate whether the local node is the "master" or a "slave"
in the coordination of the graceful exit.  The logic on node 0
ensures that the first "candidate" is always made the master, not
waiting for multiple AMs to arrive.  Additionally the slave nodes
may, under circumstances described below, know before entering
gasnetc_get_exit_role() that they are slaves, and will not bother
to send an AMRequest to node 0.  In either case gasnetc_get_exit_role()
indicates to _body which role the local node is to assume.
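
The election can be sketched as below.  All names are hypothetical,
and the AM request/reply round-trip to node 0 is replaced by a direct
function call so that the example stays self-contained.

```c
#include <assert.h>
#include <stdatomic.h>

typedef enum { ROLE_UNKNOWN, ROLE_MASTER, ROLE_SLAVE } exit_role_t;

/* State on node 0: has a master already been chosen? */
static atomic_int master_chosen = 0;

/* Node 0's handler for a role request.  The first candidate becomes
 * the master at once; node 0 never waits for further requests. */
static exit_role_t role_request_handler(int candidate_node) {
    int expected = 0;
    assert(candidate_node >= 0);
    return atomic_compare_exchange_strong(&master_chosen, &expected, 1)
               ? ROLE_MASTER : ROLE_SLAVE;
}

/* Candidate side.  A node that already knows it is a slave (because a
 * remote exit request arrived first) skips the request to node 0. */
static exit_role_t get_exit_role(exit_role_t known, int my_node) {
    if (known != ROLE_UNKNOWN) return known;
    return role_request_handler(my_node);  /* stands in for the AM round-trip */
}
```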

From _body, the single master node will enter gasnetc_exit_master() and
will begin sending a remote exit request (system-category AM, so this
will all work between _init and _attach) to each peer.  Then the master
waits (with timeout, of course) for a reply from each peer.  This request
conveys the desired exit code to each node.  It also will wake them out
of a spin-loop, barrier, or other case where they were not yet aware of
the need to exit.  In the handler for the exit request, a node will send
a reply back to the master, so it knows all the nodes are reachable.  It
will set its role to "slave" and, if no exit is in-progress, it will start
the exit procedure, as described later.  From _body, the slave nodes all
call gasnetc_exit_slave(), which simply spins until the remote exit request
has arrived from the master.

Regardless of whether the sending of exit requests and replies completed
within the timeout, _body proceeds to shut down the transport and release
the conduit's resources.  This is, again, protected by an alarm() in case
we get wedged.  Once the transport resources are released, _body flushes
stdout and stderr one last time and closes stdin, stdout and stderr.
Finally, _body shuts down its bootstrap support.  If the coordination
was completed within the timeout, then the gasnetc_bootstrapFini()
routine is called indicating that we'll not be making any more calls
to the bootstrap code and expect to exit shortly.  However, if the
coordination did time out we call gasnetc_bootstrapAbort(exitcode).  This
call is meant to request that the bootstrap terminate our job "with
prejudice" since we failed to coordinate a graceful shutdown on our
own.  We do this to try to avoid orphans, but risk lots of unsightly
error messages and possible loss of our exit code. Assuming we did not
call _bootstrapAbort (which does not return) we finish _body by
canceling our alarm timer and return to our caller.

The final common routine is gasnetc_exit_tail().  This function just
does the last bit of work to terminate the job.  It is not included in
_body because we let the atexit() case terminate "normally" after
_body returns.  However, in the case of exits initiated via
gasnet_exit() or remote exit request we call _tail to complete the
exit.  In _tail we set an atomic variable to wake any threads which
were stuck polling in _body due to being other than the first thread
to enter.  Those threads should eventually wake and also call _tail to
terminate.  Next, we call gasneti_killmyprocess() to do any platform-
specific magic required to get the entire multithreaded application to
exit.  Finally we call _exit() with the saved exit code.

Given the routines gasnetc_exit_{head,body,tail}() the code for the
three types of exit are pretty trivial.  In particular, gasnetc_exit()
just calls _head, _body and _tail with no additional logic.  In the
request handler for the exit request AM, we look at the return from
_head to determine if this exit request is the first we've seen
(inclusive of local calls to gasnet_exit() and our atexit handler).  If
it IS the first exit request, then we raise a SIGQUIT, as required by
the GASNet spec, to allow the user's handler to perform its cleanup.
However, to get the most robust exit code we don't want to run the
_body code from a signal handler context if we can avoid it.
Therefore we inspect the signal handler and skip the raise() call if
the handler is the gasnet default handler, SIG_DFL or SIG_IGN.  After
the raise() returns, or is skipped altogether, we are certain that
the user's handler, if any, has executed and has NOT called
gasnet_exit().  If a user handler had called gasnet_exit(), then
raise() would not have returned.  So, if we reach the code after the
possible raise(), we proceed to call gasnetc_exit_body() and _tail to
complete the (hopefully) graceful exit of the gasnet job.
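
The handler inspection can be done with sigaction() and a NULL new
action, which only queries the current disposition.  In this sketch
conduit_default_handler() is a stand-in for the gasnet default handler
and example_user_handler() is a stand-in for a user's handler; both
names are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Stand-in for the gasnet default SIGQUIT handler (hypothetical). */
static void conduit_default_handler(int sig) { (void)sig; }

/* Stand-in for a handler registered by the user (hypothetical). */
static void example_user_handler(int sig) { (void)sig; }

/* Returns nonzero if a user handler is installed for SIGQUIT, i.e.
 * if raise(SIGQUIT) would do anything the user cares about. */
static int should_raise_sigquit(void) {
    struct sigaction cur;
    int rc = sigaction(SIGQUIT, NULL, &cur);  /* query only, no change */
    assert(rc == 0);
    (void)rc;
    if (cur.sa_flags & SA_SIGINFO) return 1;  /* sa_sigaction-style handler */
    return cur.sa_handler != SIG_DFL &&
           cur.sa_handler != SIG_IGN &&
           cur.sa_handler != conduit_default_handler;
}
```

The request handler would then call raise(SIGQUIT) only when
should_raise_sigquit() is true, and continue to _body afterwards.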

It is important to note that if we get a remote exit request that
initiates an exit, then we will never return from the handler.
However, the design of the AM code in VAPI conduit ensures that this
will actually work without deadlock.  For one, we never run handlers
from signal context or with locks held.  Thus we can expect a
"clean" set of locks.  Furthermore, we don't expect to do anything
useful with the network once the request handler calls _body anyway.

The atexit handler just calls _head and _body before returning to
allow the exit to complete.  In this case we have a little problem
with the lack of access to the return code.  Therefore we just pass 0
to _head, which _body then sends in the remote exit requests.
Experience has shown that, at least with LAM/MPI for bootstrap, when
all but one task exits with zero, the single non-zero exit code
becomes the exit code for the parallel job.  Therefore, using zero
here gives the specified exit code from the parallel job for both
collective and non-collective returns from main.

In the best case one node is way ahead of the others and can win the
master election and send remote exit requests before the others attempt
the election.  In this case the coordinated shutdown needs 1 round-trip
for the election, followed by (N-1) round-trips for the remote exit
request/reply, for a total of 2*N AMs sent.

In the worst case all nodes attempt the election at roughly the same time
and a full N round-trips take place for the election, followed by (N-1)
round-trips for the remote exit request/reply, for a total of 4*N-2 AMs
sent.

The average case is somewhere between these two.
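
The counts above follow from two AMs per round-trip.  A small helper
(hypothetical name) makes the arithmetic explicit, with the number of
election round-trips as the only thing that varies between the cases.

```c
#include <assert.h>

/* Total AMs sent for an N-node coordinated exit: some number of
 * election round-trips (1 in the best case, N in the worst), then
 * N-1 exit request/reply round-trips, at two AMs per round-trip. */
static long exit_am_count(long n, long election_round_trips) {
    assert(n >= 1);
    assert(election_round_trips >= 1 && election_round_trips <= n);
    return 2 * (election_round_trips + (n - 1));
}
```

For N = 8 this gives 16 AMs (2*N) in the best case and 30 AMs (4*N-2)
in the worst case.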

@ Section: TO DO @

+ Value gets move as follows:
  - From network to bounce buffer
  - From bounce buffer to _gasnet_valget_op_t or automatic variable
  - From memory to return from function
  The bounce buffer is a little harder to eliminate than for the puts,
  unless we can allocate the _gasnet_valget_op_t in pinned memory.  One
  way to do this might be for the _gasnet_valget_op_t to be placed IN
  the bounce buffer, but that creates a problem with long-lived bounce
  buffers with non-blocking gets.  This _will_ work, however, for blocking
  value gets.

+ Can use "notifn" to make the progress thread only wake if there are
  multiple outstanding recvs.  This could, for instance, wake only if
  we reach the RNR state (or maybe are one event shy of it).
 
+ There is DDR memory on the board, which is at least 2.5 times faster
  than host memory from the point-of-view of the network.  The downside
  is that it is both slower and non-cacheable from the host.  However, it
  might be sensible to use it for some purposes.  In particular bounce
  buffers don't need to be cacheable.  So, the bounce buffers could move
  to the HCA.

  I speculate that the result could be an increase in overhead, due to
  the host-copy going to slower memory.  However, I expect that the total
  end-to-end latency may be reduced slightly and less cache pollution will
  occur.  Additionally the "background" copy by the HCA will not cause
  traffic on either the PCI or memory bus - reducing resource contention
  during the time one may wish to overlap computation.

    PUT Case I: 
      CPU copies host_mem->host_mem
      HCA copies host_mem->HCA_mem
      HCA sends from HCA_mem to network
    PUT Case II:
      CPU copies host_mem->HCA_mem
      HCA copies HCA_mem->HCA_mem (Might be omitted?)
      HCA sends from HCA_mem to network
 
    GET Case I:
      HCA recvs to HCA_mem
      HCA copies HCA_mem->host_mem
      CPU copies host_mem->host_mem
    GET Case II:
      HCA recvs to HCA_mem
      HCA copies HCA_mem->HCA_mem (Might be omitted?)
      CPU copies HCA_mem->host_mem

  Of course I can't get a program to allocate from HCA memory!

@ Section: References @

[1]	Bell, Bonachea. "A New DMA Registration Strategy for Pinning-Based
	High Performance Networks", Workshop on Communication Architecture
	for Clusters (CAC'03), 2003.
	Also at  http://upc.lbl.gov/publications/
