README file for ibv-conduit
===========================
Paul H. Hargrove <[email protected]>
@ TOC: @
@ Section: Overview @
@ Section: Where this conduit runs @
@ Section: Terminology @
@ Section: Build-time Configuration @
@ Section: Job Spawning @
@ Section: Runtime Configuration @
@ Section: Multi-rail Support @
@ Section: On-Demand Paging (ODP) Support @
@ Section: HCA Configuration @
@ Section: Advice to Client Authors @
@ Section: Known Problems @
@ Section: Core API @
@ Section: Extended API @
@ Section: GEX_FLAG_IMMEDIATE Support @
@ Section: Graceful exits @
@ Section: References @
@ Section: Overview @
ibv-conduit implements GASNet over InfiniBand networks using the Open
Fabrics Verbs API (www.openfabrics.org).
The name "ibv" comes from a time when this API was known as
"InfiniBand Verbs" (its library is still named libibverbs).
@ Section: Where this conduit runs @
ibv-conduit runs over networks using the Open Fabrics Verbs API, such as
InfiniBand. Platforms known to support this API include Linux, Solaris,
AIX, Windows and FreeBSD. However, at this time only Linux and Solaris
have been confirmed to run GASNet's ibv-conduit.
While Open Fabrics Verbs also covers iWARP and RoCE, ibv-conduit does not
currently support these networks. Users interested in contributing the
necessary support (primarily use of "rdmacm" for connection setup) are
encouraged to contact [email protected].
While ibv-conduit may work on Intel Omni-Path (OPA) HCAs, we instead
recommend use of ofi-conduit on such systems.
Nearly all InfiniBand cards (known as Host Channel Adapters, or HCAs)
have support for Open Fabrics Verbs, available from www.openfabrics.org
or from the HCA or OS vendor. We have tested numerous InfiniBand HCAs
from Mellanox, InfiniPath HCAs from Pathscale/QLogic, and TrueScale HCAs
from Intel.
In addition to the Open Fabrics Verbs, vendor-specific APIs also exist
and some have GASNet conduits:
+ ofi-conduit for the PSM (version 1) API on Intel TrueScale HCAs
+ ofi-conduit for the PSM2 API on Intel Omni-Path HCAs
The performance of ibv-conduit relative to others on the same hardware
will depend on your applications' communications patterns, the number of
nodes you run on, and other parameters. Therefore, we do not make any
specific recommendation as to which to use. We recommend that you
benchmark your own workload if you are concerned with the very best
possible performance.
While ibv-conduit runs on Solaris, the libibverbs on Solaris is not
compatible with GASNet's implementation of PSHM over POSIX shared memory.
This is not normally of concern, because on Solaris GASNet defaults to
an implementation of PSHM over SystemV shared memory. However, if one
configures GASNet to use PSHM-over-POSIX on Solaris, then be advised
that ibv-conduit will lack PSHM support and thus perform all communication
within a compute node via the InfiniBand HCA instead of shared memory.
@ Section: Terminology @
This document will use "NIC" (short for Network Interface Card) to refer to
the physical object installed in a host and "connector" to refer to the
external connections from the NIC to a network.
The term "HCA" (short for Host Channel Adapter) will be used to refer to a
device as enumerated by `ibv_get_device_list()` or the command-line utility
`ibv_devices`. The HCA is the device driver's logical representation of
the NIC, but there is not always a one-to-one correspondence, as described
next.
When a NIC has multiple connectors, the driver may present these either as
a single HCA with multiple "ports" or as multiple single-port HCAs.
Additionally, some systems will present more than one HCA per connector.
This is typically done on systems where the NIC is connected to multiple
I/O buses. On a compute node of the Summit system at OLCF, there are two
external network cable connectors on a single NIC which is connected
internally to two I/O buses. The driver presents four HCAs, one for each
combination of external connector and internal I/O bus. So, for a single
NIC with two connectors there are at least three ways the system may
present the same resources: with 1, 2 or 4 HCAs.
See the end of the description of the `GASNET_IBV_PORTS` environment
variable for information on multiple ways to get a listing of HCAs and
ports which ibv-conduit detects as available.
The term "rail" is used in place of "HCA" in some contexts, but has the
same meaning. In particular, one must enable multi-rail support in
ibv-conduit if one is to use multiple HCAs (as defined above) in a given
process. However, one can use multiple ports of a single multi-port HCA
without enabling multi-rail.
@ Section: Build-time Configuration @
Ibv-conduit can ensure good network attentiveness (timely
processing of incoming AMs) by spawning an extra thread that
remains blocked until the arrival of an Active Message. One
can disable this thread by configuring GASNet with the flag
'--disable-ibv-rcv-thread'. It is recommended that one NOT
use this option, but instead disable the thread at runtime
(see Runtime Configuration section). If the extra thread will
never be needed, disabling it at build time will yield a small
reduction in latencies by allowing some locking operations to
compile away.
By default, each ibv-conduit process in a GASNet job will open at most
one Host Channel Adapter (HCA). To allow a process to utilize more
than one HCA, specify '--with-ibv-max-hcas=N' at configure time (where
'N' is the number of HCAs to support per process).
Alternatively, specifying '--enable-ibv-multirail' is equivalent to
'--with-ibv-max-hcas=2' unless an explicit '--with-ibv-max-hcas=N'
option provides a different value. Passing '--disable-ibv-multirail'
overrides any explicit '--with-ibv-max-hcas=N' options.
Note that multirail support includes provisions for correctness which
can be relevant if using multiple HCAs per *host*, even if using only
a single HCA in each process. So, '--with-ibv-max-hcas=1' is distinct
from '--disable-ibv-multirail'.
Enabling multirail support (using '--with-ibv-max-hcas=1' if appropriate)
is strongly recommended if one might ever use multiple HCAs per host.
See 'GASNET_USE_FENCED_PUTS' in the Runtime Configuration section and
"Bug 3447" in the Known Problems section for more information regarding
the correctness issues.
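The interaction of these configure options can be summarized with a small
Python sketch. It only models the rules stated above and is not code from
GASNet's configure script. The function name `resolve_multirail` and its
arguments are hypothetical, and the sketch assumes that a build with none of
these options has multirail support disabled.

```python
def resolve_multirail(max_hcas=None, multirail=None):
    """Model the documented option precedence (illustrative, not GASNet code).

    max_hcas:  N from '--with-ibv-max-hcas=N', or None if not given.
    multirail: True for '--enable-ibv-multirail', False for
               '--disable-ibv-multirail', None if neither was given.
    Returns (multirail_enabled, hcas_per_process).
    """
    if multirail is False:
        # '--disable-ibv-multirail' overrides any explicit max-hcas value.
        return (False, 1)
    if max_hcas is not None:
        # An explicit N enables multirail support, even when N is 1.
        return (True, max_hcas)
    if multirail is True:
        # '--enable-ibv-multirail' alone is equivalent to N=2.
        return (True, 2)
    # Assumed default: at most one HCA per process, no multirail support.
    return (False, 1)
```

Note that `resolve_multirail(max_hcas=1)` and
`resolve_multirail(multirail=False)` both allow one HCA per process but differ
in whether multirail support is enabled, which is the distinction drawn above.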
The use of specific HCA ports is controlled at run time by the environment
variable GASNET_IBV_PORTS, described below. The default value of this
variable can be set at configure time using '--with-ibv-ports=...'.
Closely connected to selection of HCA ports is the setting of the
environment variable GASNET_USE_FENCED_PUTS, also described below. Its
default value can be set using '--with-ibv-fenced-puts=val' where 'val'
is either '0' or '1'. Alternatively, '--with-ibv-fenced-puts' (with no
argument) and '--without-ibv-fenced-puts' can be used to select defaults
of 1 and 0, respectively.
When using dynamic connections (see GASNET_CONNECT_DYNAMIC env var,
below) there is an extra thread spawned to block for the arrival of
connection requests. If needed, this can be disabled at configure
time using '--disable-ibv-conn-thread'.
By default, ibv-conduit uses 64KB buffers for AM Mediums. Message headers
occupy part of each buffer, so the values of
gasnetc_AM_LUB{Request,Reply}Medium() are tens of bytes smaller than the
buffer size. This default can be overridden by passing
'--with-ibv-max-medium=N' for 'N' equal to any power-of-two from 1024 to
262144, inclusive.
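As a quick check of which values satisfy this constraint, the following
Python sketch tests whether a candidate 'N' is a power of two in the
documented range. The helper name `valid_max_medium` is hypothetical; the
actual validation performed by configure may differ.

```python
def valid_max_medium(n):
    """True if n is a power of two between 1024 and 262144, inclusive."""
    return 1024 <= n <= 262144 and (n & (n - 1)) == 0

# The nine values accepted under the rule above.
ACCEPTED = [1024 << i for i in range(9)]
```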
The default spawner to be used by the gasnetrun_ibv utility can be
selected by configuring '--with-ibv-spawner=VALUE', where VALUE is one
of 'mpi', 'pmi' or 'ssh'. If this option is not used, mpi is the
default when available, and ssh otherwise.
Here are some things to consider when selecting a default spawner:
+ The choice of spawner only affects the protocol used for parallel job
setup and teardown; in particular it is NOT used to implement any part
of the steady-state GASNet communication operations. As such, the
selected protocol needs to be stable and co-exist with GASNet
communication, but its performance efficiency is usually not a
practical consideration.
+ mpi-spawner is the default when MPI is available precisely because it
is so frequently present on systems where GASNet is to be installed.
Additionally, very little (if any) configuration is required and the
behavior is highly reliable.
+ pmi-spawner uses the same "Process Management Interface" which forms
the basis for many mpirun implementations. When support is available,
this spawner can be as easy to use and as reliable as mpi-spawner, but
without the overheads of initializing an MPI runtime.
+ ssh-spawner depends only on the availability of a remote shell command
such as ssh. For this reason ssh-spawner support is always compiled.
However, it can be difficult (or impossible) to use on a cluster which
was not set up to allow ssh to (and among) its compute nodes.
For more information on configuration and use of these spawners, see
README-{ssh,mpi,pmi}-spawner (installed)
or
other/{ssh,mpi,pmi}-spawner/README (source).
By default, ibv-conduit serializes calls to `ibv_poll_cq()` in a manner which
reduces time spent blocked on the mutex internal to its implementation. One
can configure using `--disable-ibv-serialize-poll-cq` to disable this
behavior. For more information, see the `GASNET_RCV_THREAD_POLL_MODE`
environment variable documentation, below.
@ Section: Job Spawning @
If using UPC++, Chapel, etc. the language-specific commands should be used
to launch applications. Otherwise, applications can be launched using
the gasnetrun_ibv utility:
+ usage summary:
gasnetrun_ibv -n <n> [options] [--] prog [program args]
options:
-n <n> number of processes to run (required)
-N <N> number of nodes to run on (not supported by all MPIs)
-E <VAR1[,VAR2...]> list of environment vars to propagate
-v be verbose about what is happening
-t test only, don't execute anything (implies -v)
-k keep any temporary files created (implies -v)
-spawner=(ssh|mpi|pmi) force use of a specific spawner (if available)
There are as many as three possible methods (ssh, mpi and pmi) by which one
can launch an ibv-conduit application. Ssh-based spawning is always
available, and mpi- and pmi-based spawning are available if the respective
support was located at configure time. The default is established at
configure time (see section "Build-time Configuration").
To select a non-default spawner one may either use the "-spawner=" command-
line argument or set the environment variable GASNET_IBV_SPAWNER to "ssh",
"mpi" or "pmi". If both are used, then the command line argument takes
precedence.
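The precedence just described can be sketched in Python as follows. This is
an illustration only, not the logic of the gasnetrun_ibv script itself. The
function name `choose_spawner` is hypothetical, and the sketch assumes values
are matched exactly as written (lower case).

```python
VALID_SPAWNERS = ("ssh", "mpi", "pmi")

def choose_spawner(cli_value, environ, configured_default):
    """Pick a spawner: '-spawner=' first, then GASNET_IBV_SPAWNER, then default.

    cli_value is the argument of '-spawner=' or None if it was not given.
    environ is a dict-like view of the environment.
    """
    for candidate in (cli_value, environ.get("GASNET_IBV_SPAWNER")):
        if candidate:
            if candidate not in VALID_SPAWNERS:
                raise ValueError("unknown spawner: %r" % candidate)
            return candidate
    return configured_default
```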
It has been noted that some InfiniBand driver implementations may not
allow for multiple open()s of the adapter. In this case, spawning via
MPI is not possible because the MPI and GASNet implementations cannot
share the adapter. If your GASNet jobs fail to spawn via MPI, but
spawn correctly with ssh or pmi, then this may be the reason. If you
need mpi-based spawning, our recommendation is to attempt to set the
MPIRUN_CMD such that your MPI will not use InfiniBand (see mpi-spawner's
README). If that is not possible, you may need to select a different
MPI implementation.
@ Section: Runtime Configuration @
There are a number of parameters in ibv-conduit which can be tuned
at runtime via environment variables.
General settings:
----------------
Ibv-conduit supports all of the standard GASNet environment variables
and the optional GASNET_EXITTIMEOUT and GASNET_THREAD_STACK families
of environment variables.
See GASNet's top-level README for documentation.
+ GASNET_BARRIER
In addition to the barrier algorithms in the top-level README, there
is an implementation specific to IBV:
IBDISSEM - like RDMADISSEM, but implemented using lower-level
operations for lower latency.
Currently IBDISSEM is the default on IBV.
Spawner settings:
----------------
+ GASNET_IBV_SPAWNER
To override the default spawner for ibv-conduit jobs, one may set this
environment variable as described in the section "Job Spawning", above.
There are additional settings which control behaviors of the various
spawners, as described in the respective READMEs (listed in section
"Build-time Configuration", above).
Connection settings:
-------------------
Under normal conditions, Host Channel Adapters and Ports will be
located and configured automatically. However, in the event you have
multiple adapters or multiple active ports on a single adapter, you may
wish to set environment variables to identify the correct HCAs and Ports.
Or, you may wish to use non-default values for configuring connections.
These parameters are permitted to take different values on each process.
However, please see bug 4314 if process-specific values are needed.
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4314
See "Build-time Configuration", above, for information on enabling
use of multiple HCAs in GASNet ibv-conduit.
+ GASNET_HCA_ID
+ GASNET_PORT_NUM
** UNSUPPORTED **
These environment variables, used in older releases, are no longer
supported. Setting them to anything but the empty string will
result in a run-time warning.
+ GASNET_NUM_QPS
This variable gives the number of IB Queue Pairs (QPs) over which to
stripe traffic between each pair of peers. This can yield an increase
in throughput and bandwidth when multiple physical ports are used
on one or more adapters.
If the number of QPs exceeds the number of available physical ports
then multiple QPs will be mapped round-robin to the ports. Be aware
that mapping multiple QPs per port may yield either a performance
improvement or a degradation, depending on traffic pattern.
The default is 0, which means one QP per HCA/port used.
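The round-robin mapping described above can be pictured with a short Python
sketch. It is a model of the documented behavior, not the conduit's actual
code. The function name and the port labels are hypothetical.

```python
def map_qps_to_ports(num_qps, ports):
    """Return the port used by each QP, assigned round-robin.

    num_qps == 0 models the documented default of one QP per port in use.
    """
    if not ports:
        raise ValueError("at least one port is required")
    if num_qps == 0:
        num_qps = len(ports)
    # QP i uses port (i mod number-of-ports); extra QPs wrap around.
    return [ports[i % len(ports)] for i in range(num_qps)]

# Hypothetical labels for two active ports on two HCAs.
example = map_qps_to_ports(5, ["mlx5_0:1", "mlx5_1:1"])
```

With five QPs over two ports, the first port carries three QPs and the second
carries two. Whether that helps or hurts depends on the traffic pattern, as
noted above.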
+ GASNET_IBV_PORTS
By default, GASNet will open and use one active IB port on each HCA
used, which will be all HCAs (when GASNET_NUM_QPS is zero), or the
first GASNET_NUM_QPS HCAs found (when GASNET_NUM_QPS is non-zero).
Setting GASNET_IBV_PORTS will specify a filter for which ports will
be used. This can be used for instance to cause multiple physical
ports to be used per HCA, or to specify specific ports and/or HCAs
to be considered (up to GASNET_NUM_QPS if it is non-zero).
This variable is a string of one or more HCA/port specifications,
separated by '+' characters. Each such specification gives an HCA
identifier and an optional comma-separated list of port numbers.
The list of port numbers, if provided, is separated from the HCA id
by a ':'. If a list of ports is given, only those ports may be used.
Otherwise the first active port on the given HCA may be used. The
following example allows the first active port on HCA mlx5_0, and
only port 2 on mlx5_1:
GASNET_IBV_PORTS="mlx5_0+mlx5_1:2".
Note that this list is a *filter*, which means:
+ Duplicate entries do not cause multiple opens of a port or HCA
+ Entries describing non-existent HCAs are silently ignored
+ Entries describing inactive ports are silently ignored
+ Order is not significant. In particular if GASNET_NUM_QPS is
less than the number of entries in GASNET_IBV_PORTS, ports
are opened in the order detected, regardless of their order
in GASNET_IBV_PORTS
See 'GASNET_IBV_LIST_PORTS' and 'GASNET_IBV_LIST_PORTS_NODES' for
how to enumerate available HCAs and the status of their ports.
In most IBV distributions the 'ibv_devinfo' utility is also
available to list the HCAs and the status of their ports.
The default can be set at configure time using '--with-ibv-ports=...',
and is empty (no filter) in the absence of that configure option.
See also 'GASNET_IBV_PORTS_*', immediately below.
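To make the syntax concrete, here is a minimal Python sketch of a parser for
a GASNET_IBV_PORTS-style string. It follows only the grammar described above
('+' between entries, ':' before an optional comma-separated port list) and
is not GASNet's own parser. The function name `parse_ibv_ports` is
hypothetical.

```python
def parse_ibv_ports(value):
    """Return a set of (hca, port) pairs from a GASNET_IBV_PORTS-style string.

    port is None when no port list was given, meaning "first active port".
    Using a set mirrors the filter semantics: duplicates and order are
    not significant.
    """
    entries = set()
    for item in value.split("+"):
        item = item.strip()
        if not item:
            continue
        hca, sep, ports = item.partition(":")
        if sep:
            for port in ports.split(","):
                entries.add((hca, int(port)))
        else:
            entries.add((hca, None))
    return entries
```

For the example above, `parse_ibv_ports("mlx5_0+mlx5_1:2")` yields one entry
for mlx5_0 with no specific port and one entry for port 2 of mlx5_1.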
+ GASNET_IBV_PORTS_*
The environment variable 'GASNET_IBV_PORTS', described immediately above,
provides only a single setting and unless one uses some external means to
give per-process settings this cannot provide per-process control. This
can make it difficult to get the best performance from multi-rail systems
with multiple processes per node and architectural locality properties that
affect PCI/adapter access efficiency.
However, if 'hwloc' is detected at configure time, then it is possible to
give ibv-conduit values for 'GASNET_IBV_PORTS' which will vary per-process
based on cpu-binding and machine topology information as follows.
1. The variable 'GASNET_IBV_PORTS_TYPE' names an object type using hwloc's
terminology, with the default being "Socket" (aka "Package").
If the value is "None" (case-insensitive) then the logic below is
disabled and the value of 'GASNET_IBV_PORTS' is used by all processes.
2. GASNet queries the set of objects of the given type which intersect the
process's cpuset, to construct a variable name 'GASNET_IBV_PORTS_[suff]'
where '[suff]' is an underscore-delimited ordered list of logical
object ids. For example, with the default object type, a process bound
only to cores in the first socket would have a variable name of
'GASNET_IBV_PORTS_0'. Meanwhile, if the cpuset spans sockets 0 and 1
(such as for an unbound process on a 2-socket system) then the variable
'GASNET_IBV_PORTS_0_1' is used.
3. If the environment variable determined in step 2 is set, then it is
used. Otherwise the un-suffixed 'GASNET_IBV_PORTS' is used.
As a concrete example, on OLCF's Summit there are four HCAs in software,
which represent connections from two I/O buses (one per socket) to two
distinct InfiniBand rails. Use of the further I/O bus introduces a latency
penalty, but achieving peak aggregate bandwidth requires the job to split
traffic over both I/O buses and both rails.
For most applications, we have observed the best latency and aggregate per-
node bandwidth is achieved using a single HCA connected to the local
socket's I/O bus. For an unbound process (or one bound to cores in both
sockets) the performance suffers relative to the bound case, but the best
average-case is achieved using two HCAs chosen to span both buses and both
network rails. This yields the following recommendation as a good
default for most applications running on this system:
GASNET_IBV_PORTS='mlx5_0+mlx5_3' # Spanning both sockets (e.g. unbound)
GASNET_IBV_PORTS_0='mlx5_0' # Bound to socket0
GASNET_IBV_PORTS_1='mlx5_3' # Bound to socket1
For an application which needs to maximize bandwidth of communication
to/from processes in a single socket at a time, one must allow each process
to make use of both I/O buses and network rails (at the cost of increased
latency and potentially reduced aggregate per-node bandwidth). This can be
accomplished using two HCAs per process as follows:
GASNET_IBV_PORTS='mlx5_0+mlx5_3'
GASNET_IBV_PORTS_1='mlx5_1+mlx5_2'
This example illustrates the use of un-suffixed 'GASNET_IBV_PORTS' as a
default when lacking a more specific setting. In particular, unbound
processes and those bound to socket 0 will both use 'mlx5_0+mlx5_3'
while processes bound to socket 1 will use 'mlx5_1+mlx5_2'.
These specific recommendations are appropriate to the specific composition
of a node of OLCF's Summit, and should not be considered as generic advice
for use of all multi-HCA systems.
Of course, even on the same system, your mileage may vary.
By default 'GASNET_IBV_PORTS_TYPE' is "Socket" and all other variables in
this family are unset.
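The three-step lookup described above can be sketched in Python. This is a
simplified model, not GASNet code: the real implementation obtains the
logical object ids from hwloc, whereas this sketch takes them as an argument.
The function name `ports_setting` is hypothetical.

```python
def ports_setting(environ, object_ids):
    """Return the GASNET_IBV_PORTS value a process would use.

    environ:    dict-like view of the environment.
    object_ids: logical ids of the objects (sockets by default) that
                intersect the process's cpuset, e.g. [0] or [0, 1].
    """
    fallback = environ.get("GASNET_IBV_PORTS", "")
    # Step 1: a type of "None" (case-insensitive) disables the lookup.
    if environ.get("GASNET_IBV_PORTS_TYPE", "Socket").lower() == "none":
        return fallback
    # Step 2: build the suffixed name from the ordered, '_'-joined ids.
    suffix = "_".join(str(i) for i in sorted(set(object_ids)))
    name = "GASNET_IBV_PORTS_" + suffix
    # Step 3: use the suffixed variable if set, else the un-suffixed one.
    if suffix and name in environ:
        return environ[name]
    return fallback

# The first Summit recommendation from the text above.
summit_env = {
    "GASNET_IBV_PORTS": "mlx5_0+mlx5_3",
    "GASNET_IBV_PORTS_0": "mlx5_0",
    "GASNET_IBV_PORTS_1": "mlx5_3",
}
```

With `summit_env`, a process bound to socket 0 resolves to 'mlx5_0', one
bound to socket 1 resolves to 'mlx5_3', and an unbound process looks for
'GASNET_IBV_PORTS_0_1', finds it unset, and falls back to 'mlx5_0+mlx5_3'.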
+ GASNET_IBV_PORTS_VERBOSE
This integer setting controls the detail of any warnings printed when there
are non-fatal issues related to the selection of HCAs and ports to be used.
A value of 0 suppresses the warnings entirely.
A value of 1 (the default) or higher will warn if one or more HCAs are
excluded from consideration other than due to failure to match
GASNET_IBV_PORTS (or the numerically suffixed variants). In particular, a
warning is issued if more HCAs are detected than supported by
'--with-ibv-max-hcas=N' (where N defaults to 2 when configured using
`--enable-ibv-multirail`, and 1 otherwise).
Values of 2 or higher are reserved to request printing additional
details in future releases.
This setting defaults to 1.
+ GASNET_IBV_LIST_PORTS
The value is a boolean: "0" to disable or "1" to enable the reporting of
all detectable HCAs and the status of their ports.
See 'GASNET_IBV_LIST_PORTS_NODES' for how to limit which nodes report.
The default is "0" (no report).
+ GASNET_IBV_LIST_PORTS_NODES
If GASNET_IBV_LIST_PORTS is enabled, then this setting may be used to
limit which nodes report HCAs/ports. The value is a list which may
contain one or more integers or ranges separated by commas, such as
"0,2-4,6". If unset, empty, or equal to "*" then all nodes will report
(if enabled by GASNET_IBV_LIST_PORTS).
The default is no limit on which nodes report.
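The list syntax can be illustrated with a small Python sketch that expands a
value such as "0,2-4,6". It models only the format described above and is
not GASNet's parser. The function name `parse_node_list` is hypothetical.

```python
def parse_node_list(value):
    """Expand a comma-separated list of integers and ranges.

    Returns a sorted list of node numbers, or None to mean "all nodes"
    (value unset, empty, or equal to "*").
    """
    if value is None or value.strip() in ("", "*"):
        return None
    nodes = set()
    for part in value.split(","):
        low, dash, high = part.strip().partition("-")
        if dash:
            nodes.update(range(int(low), int(high) + 1))
        else:
            nodes.add(int(low))
    return sorted(nodes)
```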
+ GASNET_IBV_PKEY
If set, this specifies the 15-bit InfiniBand Partition Key to use.
Valid values are in the range 2 to 0x7fff.
For compatibility, the membership bit (0x8000) is ignored.
The default is to use the Partition Key installed at table index 0.
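The handling of the membership bit can be sketched in Python as follows.
This is an illustration of the rules stated above, taking the key as an
integer. The function name `normalize_pkey` is hypothetical, and rejecting
values wider than 16 bits is an assumption of this sketch.

```python
def normalize_pkey(value):
    """Strip the membership bit and check the documented range."""
    if value < 0 or value > 0xFFFF:
        raise ValueError("partition keys are 16-bit values")
    pkey = value & 0x7FFF  # ignore the membership bit (0x8000)
    if not 2 <= pkey <= 0x7FFF:
        raise ValueError("partition key out of range: 0x%x" % pkey)
    return pkey
```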
+ GASNET_QP_TIMEOUT
This sets the timeout value used to configure InfiniBand QueuePairs.
The IB specification uses (4.096us * 2^qp_timeout) as the length of
time an HCA waits to receive an ACK from its peer before attempting
retransmission.
The default is currently 18 (roughly 1 second).
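The formula above is easy to evaluate for other candidate settings. The
following Python sketch computes the nominal interval. The function name is
hypothetical, and actual HCA behavior may deviate from the nominal value.

```python
def qp_timeout_seconds(qp_timeout):
    """Nominal ACK timeout in seconds: 4.096us * 2^qp_timeout."""
    return 4.096e-6 * (2 ** qp_timeout)

# Nominal timeouts for a few settings around the default of 18.
table = {n: qp_timeout_seconds(n) for n in (14, 16, 18, 20)}
```

Each increment of the setting doubles the interval, so 18 gives about 1.07
seconds and 20 gives about 4.3 seconds.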
+ GASNET_QP_RETRY_COUNT
This sets the maximum number of retransmissions due to ACK timeout
before the HCA signals a fatal error.
The default is currently 7 (the max supported by early Mellanox HCAs)
+ GASNET_QP_RD_ATOM
This sets the number of per-connection resources allocated by the HCA
for responding to RDMA Reads (and atomics, which GASNet does not use
currently). Lower values use slightly less memory but may reduce the
throughput of Get-intensive communications patterns.
The default value is '0', which means to use the maximum supported
value reported by the HCA.
Other valid settings are typically in the range from 1 to 4.
+ GASNET_MAX_MTU
This sets the maximum MTU to be used, and has the following valid
values: -1, 0, 256, 512, 1024, 2048 or 4096.
If the value is 0 GASNet will automatically select the MTU size.
If the value is -1 GASNet will use the HCA port's active value.
Otherwise the lesser of this setting or the port's active value
will be used.
The default is 0: automatic MTU selection.
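The selection rules can be sketched in Python. This models only what is
stated above. The policy used for automatic selection (value 0) is not
described here, so the sketch returns None in that case rather than guess.
The function name `effective_mtu` is hypothetical.

```python
VALID_MAX_MTU = (-1, 0, 256, 512, 1024, 2048, 4096)

def effective_mtu(setting, port_active_mtu):
    """Return the MTU implied by GASNET_MAX_MTU, or None for automatic."""
    if setting not in VALID_MAX_MTU:
        raise ValueError("invalid GASNET_MAX_MTU: %r" % setting)
    if setting == 0:
        return None  # automatic selection; policy not modeled here
    if setting == -1:
        return port_active_mtu
    # Otherwise the lesser of the setting and the port's active value.
    return min(setting, port_active_mtu)
```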
+ GASNET_CONNECT_DYNAMIC
This boolean setting determines if connections can be established
on demand. The default value is TRUE.
When GASNET_CONNECT_DYNAMIC is enabled, a node will connect on
demand to any peer not previously connected at startup. However,
if a node is fully connected to all peers at startup, then dynamic
connections are automatically disabled on that node. Therefore,
unless GASNET_CONNECT_STATIC or GASNET_CONNECTFILE_IN is set to a
non-default value this variable has no effect.
+ GASNET_CONNECT_STATIC
This setting determines if connections are established at startup.
When GASNET_CONNECT_STATIC is enabled, a node will connect at
startup to all peers indicated by the GASNET_CONNECTFILE_IN
setting (see below), or to ALL peers if that variable is unset.
The value is a boolean with a default of TRUE.
+ GASNET_CONNECTFILE_IN
This setting provides a filename used to limit the connections
established at startup, and is ignored if GASNET_CONNECT_STATIC is
FALSE.
Any '%' character in the value is replaced with the node number to
allow (but not require) separate per-node files.
The format of a connect file is a series of lines of the form:
node: peer1 peer2 ...
without leading whitespace. For example, to request that node 7
connect to nodes 0, 4 and 6:
7: 0 4 6
Line lengths are not limited, but the same node number may appear
to the left of the colon on multiple lines to limit line lengths.
So, the following is equivalent to the previous example:
7:0 4
7:6
Ranges are supported. So, the following connects node 6 with
nodes 9, 10, 11 and 12:
6:9-12
Order is not significant (except in ranges), so neither lines nor
peer numbers need to be sorted.
Connections are bidirectional so the following:
1:0
0:1
describes only 1 connection between nodes 0 and 1 and only one of
these two lines is required to establish it (though there is no
error in specifying both). This is true regardless of whether
using a single file or per-node files.
An optional line
size: N
indicates the number of nodes in the job, and is validated against
the size of the current job if present.
An optional line
base: N
specifies a numeric base for interpretation of all node numbers on
lines that follow. The default is 10 (decimal), and legal values
range from 2 (binary) to 36 (uses digits '0'-'9' and 'a'-'z'). If
present, the 'base' line only affects node numbers read from later
lines, and therefore should appear at the start of the file.
Values on the 'size' and 'base' lines are always read as decimal.
The default is unset/empty (no limit on which nodes are connected
at startup).
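For readers who want to inspect or generate connect files with their own
tools, here is a minimal Python sketch of a reader for the format described
above. It is not GASNet's parser and may accept or reject input that GASNet
treats differently. The function name `parse_connect_file` is hypothetical.

```python
def parse_connect_file(text):
    """Parse connect-file text into (size, connections).

    size is the value of the optional 'size' line (or None).
    connections is a set of (low, high) node pairs; because connections
    are bidirectional, "1:0" and "0:1" produce the same pair.
    """
    base = 10
    size = None
    connections = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        key, sep, rest = line.partition(":")
        if not sep:
            raise ValueError("missing ':' in line: %r" % line)
        key = key.strip()
        # Check keywords first: "size" and "base" are also valid
        # base-36 digit strings, so they must not be read as node numbers.
        if key == "size":
            size = int(rest)  # always decimal
        elif key == "base":
            base = int(rest)  # always decimal
            if not 2 <= base <= 36:
                raise ValueError("base out of range: %d" % base)
        else:
            node = int(key, base)
            for token in rest.split():
                low, dash, high = token.partition("-")
                first = int(low, base)
                last = int(high, base) if dash else first
                for peer in range(first, last + 1):
                    connections.add((min(node, peer), max(node, peer)))
    return size, connections
```

Because the 'base' line is applied as soon as it is read, it affects only
the node numbers on later lines, matching the description above.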
+ GASNET_CONNECTFILE_OUT
This setting specifies a filename in which to generate connection
information suitable for later use as GASNET_CONNECTFILE_IN.
Any '%' character in the value is replaced with the node number to
allow separate per-node files. Use of per-node files is strongly
recommended, and on some file systems (notably NFS) is REQUIRED
for correct operation. If desired, the separate files may be
concatenated together after the run completes to produce a single
file suitable for use as GASNET_CONNECTFILE_IN. Alternatively, the
following perl one-liner will concatenate the files while removing
all but the first instance of the 'base' and 'size' lines:
perl -ne 'print unless (/(base|size)/ && $X{$_}++);' -- [FILES]
where [FILES] denotes the list of per-node connection files and the
combined file is generated on stdout.
The connection information produced in the output file(s) lists
only those connections actually used in the current run.
Therefore a common use case is to set GASNET_CONNECTFILE_OUT on a
fully-connected run, and then use the generated file(s) to limit
static connections in subsequent runs.
The default is to use base-36 for node numbers, which results in
more compact files but is difficult for a human to read. See
GASNET_CONNECTFILE_BASE, below, for how to change this.
The default is unset/empty (no output files are generated).
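Where perl is unavailable, the same merge can be approximated in Python. The
sketch below is not an exact translation of the one-liner above: it drops a
repeated line only when the text before the colon is exactly 'base' or
'size', whereas the perl pattern matches those words anywhere in a line. The
function name `merge_connect_files` is hypothetical.

```python
def merge_connect_files(chunks):
    """Concatenate per-node connect-file texts into one.

    chunks is an iterable of strings, one per per-node file. Only the
    first instance of each distinct 'base' or 'size' line is kept.
    """
    seen = set()
    merged = []
    for text in chunks:
        for line in text.splitlines():
            key = line.partition(":")[0].strip()
            if key in ("base", "size"):
                if line in seen:
                    continue
                seen.add(line)
            merged.append(line)
    return "".join(line + "\n" for line in merged)
```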
+ GASNET_CONNECTFILE_BASE
This setting controls the numeric base used for node numbers in
GASNET_CONNECTFILE_OUT files.
Valid values range from 2 (binary) to 36 (uses digits '0'-'9' and
'a'-'z'). The value of the setting is always parsed as base-10.
The default value is 36.
+ GASNET_CONNECT_SNDS
+ GASNET_CONNECT_RCVS
These two settings control the number of small buffers allocated
to send and to receive dynamic connection requests, and are
ignored if GASNET_CONNECT_DYNAMIC is FALSE, or on any node that is
already fully connected at startup.
Because the buffers are small and allocation is page granular
there is seldom any benefit to reducing the default values.
However, there are conditions under which increasing one or both
may help reduce the latency of dynamic connections:
+ Dynamic connection setup is blocking at the initiator, but if
using pthreads it is possible that one node may have dynamic
connection requests in-progress to multiple nodes. So, if
an application is highly-threaded it may be beneficial to
increase GASNET_CONNECT_SNDS for greater concurrency of sends.
+ If a given node receives many simultaneous connection requests,
any requests in excess of the allocated buffers will be dropped.
The connection will be delayed until the requester retransmits.
So, the average connection setup time in the presence of "bursty"
requests may be reduced by increasing GASNET_CONNECT_RCVS.
The default value of GASNET_CONNECT_SNDS is 4.
The default value of GASNET_CONNECT_RCVS is
MAX(6, 4 + 2*ceil(log_2(N_remote)))
where "ceil()" denotes rounding up to an integer, "log_2()" is
the base-2 logarithm and "N_remote" is the number of GASNet nodes
minus "self" and any nodes reachable through shared memory (PSHM).
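To see how this default scales with job size, the formula can be evaluated
with a short Python sketch. The function name is hypothetical, and raising
an error for N_remote below 1 (where the logarithm is undefined) is a choice
made for this sketch, not documented GASNet behavior.

```python
def default_connect_rcvs(n_remote):
    """Evaluate MAX(6, 4 + 2*ceil(log_2(N_remote))) for N_remote >= 1."""
    if n_remote < 1:
        raise ValueError("formula requires at least one remote node")
    # For integers n >= 1, ceil(log2(n)) equals (n - 1).bit_length(),
    # which avoids floating-point rounding near powers of two.
    ceil_log2 = (n_remote - 1).bit_length()
    return max(6, 4 + 2 * ceil_log2)
```

The default grows by 2 each time N_remote crosses a power of two, so even
very large jobs start with only a few dozen receive buffers.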
+ GASNET_CONNECT_RETRANS_MIN
+ GASNET_CONNECT_RETRANS_MAX
These two settings control the minimum and maximum intervals
between retransmission of messages used in establishing dynamic
connections, and are ignored if GASNET_CONNECT_DYNAMIC is FALSE,
or on any node that is already fully connected at startup.
Values are in units of microseconds (10^-6 sec).
The value of GASNET_CONNECT_RETRANS_MIN is the interval between
sending an initial request and the first retransmission. Each
retransmission doubles the interval before the next, up to the
maximum value given by GASNET_CONNECT_RETRANS_MAX, after which the
connection setup fails.
Adjustment of these settings may help resolve timeouts on networks
with high rates of UD packet loss. However, this is not
recommended without consulting with the author and the defaults
are therefore not documented here.
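The doubling schedule can be pictured with a Python sketch. It rests on one
reading of the description above: an interval is used only while it does not
exceed the maximum, and setup fails once the next doubled interval would do
so. The conduit may instead clamp the final interval, so treat this as an
assumption. The function name and the sample values are hypothetical and are
not the undocumented defaults.

```python
def retransmit_intervals(retrans_min_us, retrans_max_us):
    """List the successive wait intervals, in microseconds."""
    if retrans_min_us <= 0 or retrans_max_us < retrans_min_us:
        raise ValueError("need 0 < min <= max")
    intervals = []
    interval = retrans_min_us
    while interval <= retrans_max_us:
        intervals.append(interval)
        interval *= 2  # each retransmission doubles the next interval
    return intervals

# Hypothetical values, chosen only to show the shape of the schedule.
schedule = retransmit_intervals(1000, 8000)
total_wait_us = sum(schedule)
```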
Software configuration settings:
-------------------------------
There are some optional behaviors in ibv-conduit that can be turned
ON or OFF.
These parameters are permitted to take different values on each process,
though doing so may not be useful.
However, please see bug 4314 if process-specific values are needed.
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4314
+ GASNET_RCV_THREAD
This gives a boolean: "0" to disable, or "1" to enable, the use of an
extra thread per-HCA port to block waiting for an Active Message Request
or Reply to arrive. This allows ibv-conduit to remain attentive
to incoming AM traffic even while the application is not making any
calls to GASNet. The down side is that when this thread wakes it
must contend for CPU resources and for locks. Therefore, for an
application that is calling GASNet sufficiently often, use of this
thread may significantly INCREASE running time. However, on an SMP
where an otherwise idle processor is available the use of this
thread can REDUCE running time by relieving the application thread
of the burden of servicing incoming AM Requests and Replies.
Note that if '--disable-ibv-rcv-thread' was specified at build time
then the extra thread is unavailable and this environment variable
is ignored.
Currently the default is disabled (0), but this is subject to change.
NOTE: In releases prior to GASNet 1.18.2 the AM receive thread was
unavailable for ibv-conduit, but that is no longer the case.
+ GASNET_RCV_THREAD_RATE
If GASNET_RCV_THREAD is enabled, then this setting can be used to
impose a limit on how frequently the AM receive thread may wake.
This may be used to limit interference between the AM receive thread
and the main application thread(s), while providing some network
attentiveness when the application is not making GASNet calls.
A non-zero value gives the maximum rate in wake-ups per second.
The default value is 0, which means no limit is imposed.
NOTE: A future release may implement GASNET_RCV_THREAD_LOAD to
impose a limit on the *fraction* of time the thread spends awake.
+ GASNET_RCV_THREAD_POLL_MODE
If GASNET_RCV_THREAD is enabled, then this setting determines if/how the
AM receive thread participates in the serialization of calls to
`ibv_poll_cq()`. This serialization can improve the throughput of
clients with multiple threads (the AM receive thread included) by reducing
the time spent blocked while polling.
The following values (case-insensitive) are recognized:
* 'serialized' - The AM receive thread will poll the completion queue
for AM arrivals while participating in the same serialization protocol
as client threads. (default)
* 'unserialized' - The AM receive thread will poll the completion queue
for AM arrivals without regard to the serialization protocol observed
by client threads.
* 'exclusive' - The AM receive thread will utilize the serialization
protocol observed by client threads to become the only poller of the
completion queue for AM arrivals.
The default is "serialized".
The value "exclusive" is the only option which directly affects the
behavior of application threads. The others have only an indirect effect
via the degree of contention they may experience due to the receive thread.
If the AM receive thread is disabled (via `--disable-ibv-rcv-thread` at
configure time or via the `GASNET_RCV_THREAD` environment variable), then
this setting is ignored. In particular, setting `GASNET_RCV_THREAD=0`
and `GASNET_RCV_THREAD_POLL_MODE=exclusive` does not prevent application
threads from polling the completion queue for AM arrivals.
If serialization of calls to `ibv_poll_cq()` was disabled at configure
time via `--disable-ibv-serialize-poll-cq`, then this setting is ignored
and the behavior is equivalent to "unserialized".
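The rules above for when this setting applies can be summarized in a small
sketch. This is an illustrative model only, not ibv-conduit code: the function
name and its arguments are hypothetical, and raising an error for an
unrecognized value is an assumption (this README does not specify that case).

```python
VALID_MODES = ("serialized", "unserialized", "exclusive")

def effective_poll_mode(env_value, rcv_thread_enabled, serialize_poll_cq=True):
    """Return the poll mode in effect, or None when the setting is ignored.

    env_value          -- GASNET_RCV_THREAD_POLL_MODE, or None if unset
    rcv_thread_enabled -- False if the AM receive thread is disabled
    serialize_poll_cq  -- False models --disable-ibv-serialize-poll-cq
    """
    if not rcv_thread_enabled:
        return None                 # no receive thread: setting is ignored
    if not serialize_poll_cq:
        return "unserialized"       # serialization compiled out
    mode = (env_value or "serialized").lower()   # values are case-insensitive
    if mode not in VALID_MODES:
        # Assumption: handling of unknown values is not documented here.
        raise ValueError(f"unrecognized poll mode: {env_value!r}")
    return mode
```

For example, `effective_poll_mode("exclusive", False)` yields None, matching
the statement that GASNET_RCV_THREAD=0 makes the poll mode irrelevant.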
Pinnable memory probe configuration:
-----------------------------------
In normal operation of ibv-conduit it is necessary to know how much memory
may be registered (aka pinned) with the InfiniBand HCA(s). This is limited
by multiple factors and thus cannot be determined by a simple query.
Therefore, the default behavior is to attempt to mmap and register as much
memory as possible at startup, and then release all the memory. When there
are multiple GASNet processes on a shared memory node, one representative
process will perform this probe. There are at least two well-known reasons
why one may desire to limit or eliminate this probe. The first is the time
spent performing the probe. The second is the possibility that the O/S or
a batch execution environment may terminate a process that exceeds some
limit on the virtual memory size of a process and/or may terminate the
process with the largest size when memory is exhausted. Use of the
following parameters allows one to bound, or to eliminate, this probe.
+ GASNET_PHYSMEM_MAX
If set, this parameter is used to determine the maximum amount of memory
ibv-conduit may pin. This limits how large the GASNet segment can be, and
how much memory is available for firehose (see below). The value gives
an upper bound on pinnable memory per host, which is divided equally among
processes on each host.
The value may specify either a relative or absolute size. If the value
parses as a floating-point value less than 1.0 (including fractions such
as "5/8"), then this is taken as a fraction of the (estimated) physical
memory. Otherwise the value is taken as an absolute memory size, with "M",
"G" and "T" suffixes accepted to indicate units of Megabytes, Gigabytes,
and Terabytes, respectively.
This parameter *may* validly differ among processes. However, if the value
differs among processes on the same host, the implementation will select
just one value per host (the algorithm for the selection is unspecified).
The default may be set at configure time using --with-ibv-physmem-max=VALUE,
and otherwise is "2/3" (pin up to 2/3 of the estimated physical memory).
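As an illustration of the value syntax described above, the following sketch
converts a GASNET_PHYSMEM_MAX-style string into a per-host and per-process
bound. It is a model under stated assumptions, not the conduit's parser: the
function names are hypothetical, the suffixes are assumed to be binary units
(1M = 2^20 bytes), and a value without a suffix is assumed to be in bytes.

```python
from fractions import Fraction

# Assumption: "M", "G" and "T" denote binary units.
UNITS = {"M": 2**20, "G": 2**30, "T": 2**40}

def physmem_max_bytes(value, physmem_bytes):
    """Per-host pinning bound, in bytes, for a GASNET_PHYSMEM_MAX-style value."""
    text = value.strip()
    try:
        number = Fraction(text)          # accepts "0.5", "5/8", "4096"
    except ValueError:
        number = None                    # e.g. "4G": has a unit suffix
    if number is not None and number < 1:
        # Relative size: a fraction of the (estimated) physical memory.
        return int(number * physmem_bytes)
    suffix = text[-1:].upper()
    if suffix in UNITS:
        return int(Fraction(text[:-1]) * UNITS[suffix])
    return int(Fraction(text))           # assumption: no suffix means bytes

def per_process_bytes(value, physmem_bytes, procs_on_host):
    """The per-host bound is divided equally among processes on the host."""
    return physmem_max_bytes(value, physmem_bytes) // procs_on_host
```

With 64 GiB of physical memory, "5/8" gives a 40 GiB per-host bound, while
"4G" shared by four processes gives 1 GiB per process.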
The following two parameters must be equal across all processes, and the
behavior otherwise is undefined.
+ GASNET_PHYSMEM_PROBE
This gives a boolean: "1" to enable or "0" to disable the validation
(potentially slow) of the GASNET_PHYSMEM_MAX value.
By default the environment variable GASNET_PHYSMEM_MAX is trusted if set
at runtime or when using a default given at configure time. If neither
source has provided a value, the "2/3" default is taken as a maximum
amount of memory that might possibly be pinned, but the final limit is
determined by probing the limits imposed by the O/S and HCA. This probe
can take a significant period of time on large memory nodes. Therefore,
enabling this probe may greatly slow startup, but can prevent unexpected
runtime failures if the user-provided values exceed those imposed by the
O/S and HCA. Therefore, it is recommended to enable this probe if one
experiences any runtime failure consistent with an out-of-memory condition.
The O/S limits may be hard limits from the kernel (Linux often allows at
most 80% of physical memory to be pinned) or from resource limits (see
'ulimit' in a Bourne shell or 'limit' in a C-shell).
The default is OFF (probe disabled) if configure or environment sets a
value of GASNET_PHYSMEM_MAX. However, this can be changed to a default of
ON or OFF by configuring using, respectively, --enable-ibv-physmem-probe
or --disable-ibv-physmem-probe.
+ GASNET_PHYSMEM_WARN
This gives a boolean: "1" to enable or "0" to disable the warning printed
if/when the GASNET_PHYSMEM_MAX value is probed.
Protocol configuration:
----------------------
The following environment variables control the selection of protocols
for performing certain transfers.
These parameters must be equal across all nodes, and the behavior
otherwise is undefined.
+ GASNET_INLINESEND_LIMIT
IBV includes an "inline send" operation that transfers the data to
the HCA at the same time it transfers the request. This normally
provides a measurable performance improvement, but is only available
up to a hardware- and firmware-dependent maximum size.
A value of 0 disables use of inline sends.
A value of -1 causes use of the maximum value reported by the HCA.
The default of 72 is normally correct.
+ GASNET_PACKEDLONG_LIMIT
To perform an AMLong with non-empty payload,
ibv-conduit must transfer both the payload and the header. For
sufficiently small payloads, it is more efficient (in terms of both CPU
overhead and network latency) to pack the header and payload together
and copy the payload into place on the target before running the
handler. Thus, for payload up to and including this size this packing
is used.
The default value is the maximum that, together with the maximum sized
header, fits into a 4KiB transfer (currently 4012).
A value of zero ensures the payload and header always travel separately.
+ GASNET_NONBULKPUT_BOUNCE_LIMIT
This parameter sets the limit on the use of bounce buffers to achieve
local completion of "non-bulk" PUT and AMLong payload transfers. When
passing GEX_EVENT_NOW to perform a PUT or AMLong, the implementation must
block until local completion. For PUTs with nbytes larger than
GASNET_INLINESEND_LIMIT, and for AMLongs with nbytes larger than both
GASNET_INLINESEND_LIMIT and GASNET_PACKEDLONG_LIMIT, ibv-conduit must
either copy the data into bounce buffers, or block until remote completion
is signaled by the HCA. Such transfers up to and including size
GASNET_NONBULKPUT_BOUNCE_LIMIT are performed using bounce buffers while
larger transfers stall return from injection until the RMA is acknowledged.
The default value is 64KB.
A value of zero disables use of bounce buffers.
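The size thresholds described above can be sketched as a decision function
for a PUT or AMLong issued with GEX_EVENT_NOW. This is only a model of the
text: the function name and return labels are hypothetical, the defaults are
those documented in this section, and "64KB" is assumed to mean 65536 bytes.

```python
INLINESEND_LIMIT = 72                 # documented default
PACKEDLONG_LIMIT = 4012               # documented default
NONBULKPUT_BOUNCE_LIMIT = 64 * 1024   # documented default (assumed 65536)

def local_completion_strategy(nbytes, is_am_long,
                              inline_limit=INLINESEND_LIMIT,
                              packed_limit=PACKEDLONG_LIMIT,
                              bounce_limit=NONBULKPUT_BOUNCE_LIMIT):
    """Classify how local completion is reached for a payload of nbytes.

    Returns "n/a" when the payload is small enough that the bounce-buffer
    question does not arise, "bounce" when bounce buffers are used, and
    "block" when injection stalls until the RMA is acknowledged.
    """
    # PUTs compare against the inline limit only; AMLongs must exceed both
    # the inline limit and the packed-long limit.
    threshold = max(inline_limit, packed_limit) if is_am_long else inline_limit
    if nbytes <= threshold:
        return "n/a"
    if nbytes <= bounce_limit:        # a limit of zero never matches here
        return "bounce"
    return "block"
```

Note that passing `bounce_limit=0` sends every above-threshold transfer down
the "block" path, consistent with zero disabling bounce buffers.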
+ GASNET_PUTINMOVE_LIMIT (only for GASNET_SEGMENT_{LARGE,EVERYTHING})
When the firehose algorithm (see below) is in use for managing the
pinning of remote memory, a PUT that misses in the firehose cache
may be accelerated by piggybacking data on the AMMedium that is
used to obtain a remote pinning. The value of GASNET_PUTINMOVE_LIMIT
is the maximum number of bytes to send in this way. The value is
bounded by the maximum value set at compile time, and it is an
error to request a larger value.
Note that in a GASNET_SEGMENT_FAST configuration, the remote segment
is pinned statically and this optimization is never applicable.
The default value is 3KB (the current maximum value).
A value of zero disables this optimization.
+ GASNET_PUT_STRIPE_SZ and GASNET_GET_STRIPE_SZ (experimental)
When multiple HCA ports are used, the performance of a sufficiently large
isolated RMA operation (one not overlapped with other communication) can
be increased by subdividing it into multiple pieces striped over more than
one path. These parameters specify the threshold above which striping is
applied to RMA Put and Get operations, respectively.
Suffixes "K", "M" and "G" can be used to specify units of Kilobytes,
Megabytes and Gigabytes, respectively. The default units are Kilobytes.
Use of a too-small value may limit the performance of large RMA operations
by subdividing them into stripes too small to saturate the network.
Use of a too-large value may limit the performance of RMA operations
which are large enough to benefit from striping but below the value.
Values exceeding the HCA's maximum transfer size will be silently reduced.
These parameters are ignored if only a single HCA/port is in use.
A value of zero uses the HCA's maximum transfer size as the stripe size,
effectively disabling this optimization.
The current default is zero (disabled).
+ GASNET_AM_GATHER_MIN
This parameter sets the minimum payload size at which a multi-segment
gather may be used to concatenate an AM header and payload. Below this
minimum, payload data is copied.
The default value is 1500.
A value of -1 disables multi-segment gather, using payload data copy
at all payload sizes.
+ GASNET_USE_SRQ
This controls whether IBV Shared Receive Queue (SRQ) support is used,
but is ignored if GASNet was configured with --disable-ibv-srq.
This setting defaults to -1, which means that SRQ will be used only
if doing so would reduce memory usage (as determined from the value
of the GASNET_RBUF_COUNT setting, described below).
If set to a non-negative value, this setting gives the minimum GASNet
node count at which SRQ will be used, regardless of whether or not
the memory usage would increase or decrease. A value of zero will
disable SRQ. Examples:
- GASNET_USE_SRQ unset or explicitly set to -1:
SRQ is used ONLY if GASNET_RBUF_COUNT is less than the number
of receive buffers required for the non-SRQ case.
- GASNET_USE_SRQ <= job size [includes GASNET_USE_SRQ == 1]
SRQ is used and GASNET_RBUF_COUNT is enforced as a maximum
- GASNET_USE_SRQ > job size
SRQ is NOT used and GASNET_RBUF_COUNT is ignored
Note that the interpretation of the values 0 and 1 allows one to use
this setting as a simple boolean if desired.
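The interpretation of GASNET_USE_SRQ described above can be modeled as
follows. This is a sketch, not conduit code: the function and parameter names
are hypothetical, `non_srq_rbufs` stands for the number of receive buffers
the non-SRQ case would require, and a non-zero GASNET_RBUF_COUNT is assumed.

```python
def srq_in_use(use_srq, job_size, rbuf_count, non_srq_rbufs):
    """Decide whether SRQ is used, following the rules in this section.

    use_srq       -- value of GASNET_USE_SRQ (-1 if unset)
    job_size      -- number of GASNet nodes in the job
    rbuf_count    -- value of GASNET_RBUF_COUNT (assumed non-zero)
    non_srq_rbufs -- receive buffers required without SRQ (hypothetical input)
    """
    if use_srq == -1:
        # Default: use SRQ only if it would reduce memory usage.
        return rbuf_count < non_srq_rbufs
    if use_srq < 0:
        # Assumption: other negative values are not documented here.
        raise ValueError("only -1 and non-negative values are modeled")
    if use_srq == 0:
        return False                  # zero disables SRQ
    return job_size >= use_srq        # minimum node count for SRQ
```

Since any job has at least one node, a value of 1 always enables SRQ, which
is why 0 and 1 behave as a simple boolean.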
+ GASNET_USE_XRC
This controls whether IBV eXtended Reliable Connection (XRC) support
is used. However, it is ignored if GASNet was configured with
--disable-ibv-xrc, if XRC support was not found at configure time, or
if SRQ support is not used (regardless of why).
This setting defaults to 1 if SRQ support was enabled at configure
time. As a result XRC will be used anytime SRQ is used.
+ GASNET_USE_ODP
This boolean setting controls whether IBV On-Demand Paging (ODP) support
should be used. If true, then ibv-conduit uses ODP on any compute nodes
where the IBV library reports support is available. When ODP is NOT
available on one or more compute nodes, a warning is issued. See
GASNET_ODP_VERBOSE for information on controlling this warning.
This setting is ignored if GASNet was configured with --disable-ibv-odp,
or if ODP support was not found at configure time.
This setting defaults to 1 if ODP support was enabled at configure time.
+ GASNET_ODP_VERBOSE
This integer setting controls the detail of the warning printed when ODP
support is enabled but is not present on one or more HCAs in the job.
A value of 0 suppresses the warning entirely.
A value of 1 will report a count of processes lacking support.
Values of 2 or higher give increasing levels of detail concerning the
missing support.
This setting defaults to 1 if ODP support was enabled at configure time.
+ GASNET_USE_FENCED_PUTS
This boolean setting controls the use of atomic operations to provide for
correct remote completion detection in the presence of multiple HCAs.
See "Bug 3447" in the Known Problems section for information on when one
may wish to enable this setting.
If enabled when multirail support was not enabled at configure time, a
warning will be issued.
The default can be set at configure time via '--with-ibv-fenced-puts=...'
and is 0 (disabled) in the absence of that configure option.
Resource usage parameters:
-------------------------
The following environment variables control how much memory is
preallocated at startup time to serve various functions. Because these
resource pools do not grow dynamically, it is important that these
parameters be sufficiently large, or performance degradation may
result. The default settings should be sufficient for most conditions.
You may need to lower some values if you have insufficient memory.
+ GASNET_RBUF_SPARES
This gives the number of AM receive buffers used to hold header and
payload of executing AM Request, and thus bounds the number of threads
which may concurrently execute AM handlers "in place". Any threads
beyond this limit must copy the header and payload before executing
the handler.
The default value reflects a heuristic estimate of the number of
threads which might concurrently poll for AM arrivals.
Reducing this parameter may reduce Active Message throughput.
The following parameters must be equal across all nodes, and the behavior
otherwise is undefined.
+ GASNET_NETWORKDEPTH_PP
This gives the maximum number of ops (RDMA + AMs) which can be
in-flight simultaneously from a node to each of its peers. Here
"in-flight" means queued to the send work queue and not yet reaped
from the send completion queue. This value is the depth of each
send work queue. This limit is on the number of ibv-level ops
in-flight, and the number of GASNet-level operations may be less
(for example, when the length of a PUT or GET is larger than the
HCA's maximum message length, or because an AM Long uses separate
ops for the payload and header).
The default value is 24.
Reducing this parameter may limit small message throughput. If you
believe your small message throughput is too low, you may try
increasing this value.
+ GASNET_NETWORKDEPTH_TOTAL
This gives the maximum number of ops (RDMA + AMs) which can be
in-flight simultaneously from each node (with "in-flight" defined as
in GASNET_NETWORKDEPTH_PP). The depth of the send completion queue
is min(GASNET_NETWORKDEPTH_TOTAL, GASNET_NETWORKDEPTH_PP*(N-1)).
If set to zero, the value is set to the maximum usable value computed
from GASNET_NETWORKDEPTH_PP and the HCA's reported capabilities.
The default value is 255.
Reducing this parameter may limit small message throughput. If you
believe your small message throughput is too low, you may try
increasing this value (or setting it to zero), at a cost in
additional memory consumption.
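As a worked example of the completion queue depth formula above, the sketch
below evaluates min(GASNET_NETWORKDEPTH_TOTAL, GASNET_NETWORKDEPTH_PP*(N-1))
for the documented defaults. The function name is hypothetical, and the
special value of zero (which also depends on HCA capabilities) is
deliberately not modeled.

```python
def send_cq_depth(nodes, networkdepth_total=255, networkdepth_pp=24):
    """Depth of the send completion queue for a job of `nodes` nodes.

    networkdepth_total must be a resolved, non-zero value; the zero
    ("maximum usable") case depends on the HCA and is not modeled here.
    """
    if networkdepth_total <= 0:
        raise ValueError("zero/negative total is outside this sketch")
    return min(networkdepth_total, networkdepth_pp * (nodes - 1))
```

With the defaults, a 4-node job is bounded by the per-peer term
(24 * 3 = 72), while a 64-node job is bounded by the total of 255. This
suggests that raising GASNET_NETWORKDEPTH_PP alone has limited effect in
larger jobs unless GASNET_NETWORKDEPTH_TOTAL is raised as well.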
+ GASNET_AM_CREDITS_PP
This gives the maximum number of AM Requests which can be in-flight
simultaneously from a node to each of its peers. Here "in-flight"
means the Request is queued to the send work queue, but the matching
Reply has not yet been processed for AM flow control (described in
another section of this README). This is the number of buffers which
must be preposted to each receive work queue for AM Requests.
The default value is 12 (12*MaxMedium*(N-1) allocated for Request buffers).
Reducing this parameter may limit Active Message throughput. If you
believe your Active Message throughput is too low, you may try
increasing this value.
+ GASNET_AM_CREDITS_TOTAL
This gives the integer number of AM Requests which can be in-flight
simultaneously from each node, with "in-flight" defined as in
GASNET_AM_CREDITS_PP.
If set to zero, the value is set to the maximum usable value computed
from GASNET_AM_CREDITS_PP and the HCA's reported capabilities.
The default value is MIN(256, (nodes-1)*GASNET_AM_CREDITS_PP).
Reducing this parameter may limit Active Message throughput. If you
believe your Active Message throughput is too low, you may try
increasing this value (or setting it to zero), at a cost in additional
pinned memory.
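The two formulas given for the AM credit parameters can be evaluated as in
the sketch below. The function names are hypothetical, and the MaxMedium
value is left as a caller-supplied argument because it is not stated in this
section (the 4096 used in the example that follows is a placeholder only).

```python
def am_credits_total_default(nodes, credits_pp=12):
    """Default GASNET_AM_CREDITS_TOTAL: MIN(256, (nodes-1)*GASNET_AM_CREDITS_PP)."""
    return min(256, (nodes - 1) * credits_pp)

def request_buffer_bytes(nodes, max_medium, credits_pp=12):
    """Memory for preposted Request buffers: credits_pp * MaxMedium * (N-1)."""
    return credits_pp * max_medium * (nodes - 1)
```

For a 4-node job the default total is 36 credits; with a placeholder
MaxMedium of 4096 bytes, `request_buffer_bytes(4, 4096)` is 147456 bytes.
For 64 nodes the default total is capped at 256.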
+ GASNET_AM_CREDITS_SLACK
This gives the maximum number of flow-control credits that can be
delayed at the responder. If a Request handler does not produce a
Reply, a credit may be "banked" to be piggy-backed on the next
Request or Reply headed to the requesting node. The value of
GASNET_AM_CREDITS_SLACK gives the maximum number of credits that can
be banked before a hidden Reply is generated to convey credits back
to the requester.
The default value is 1.
GASNET_AM_CREDITS_SLACK will be silently reduced if needed to
ensure deadlock will not occur, and is ignored when SRQ is used.
Reducing this parameter to zero or setting it too high may
increase the latency of Active Message traffic.
+ GASNET_RBUF_COUNT
If SRQ support is unavailable or disabled, this parameter is ignored.
See GASNET_USE_SRQ documentation for details of when SRQ is enabled.
When SRQ is enabled this gives the max number of AM receive buffers
allocated on each node. These buffers are needed for reception of
AM headers and the payload of mediums, but are not used for RDMA.
The actual number of buffers allocated is the lesser of the value of
GASNET_RBUF_COUNT or a value computed from the GASNET_AM_* and
GASNET_NETWORKDEPTH_* parameters described above.
If set to zero, the value is limited only by the HCA's capabilities.
The default value is 1024 (up to 1024*MaxMedium for buffers).
Reducing this parameter may limit Active Message throughput. If you
believe your Active Message throughput is too low, you may try
increasing this value (or setting it to zero), at a cost in additional
pinned memory.
+ GASNET_BBUF_COUNT
This gives the max number of pre-pinned buffers allocated on each node.
These buffers are needed for assembly of AM headers and the payload
of mediums, and for some PUTs (see GASNET_NONBULKPUT_BOUNCE_LIMIT).
The actual number of buffers allocated is the lesser of the values of
GASNET_BBUF_COUNT and GASNET_NETWORKDEPTH_TOTAL, since the total
network depth bounds the number of in-flight operations that might
need these buffers.
If set to zero, the value is set to GASNET_NETWORKDEPTH_TOTAL.
The default value is 1024 (up to 1024*MaxMedium for buffers).
Reducing this parameter limits the number of in-flight operations
which consume bounce buffers. This includes AMs too large for an
inline send and PUTs subject to the GASNET_NONBULKPUT_BOUNCE_LIMIT.
If you believe that throughput of these operations is too small, you
may try increasing this value (or setting it to zero), at a cost in
additional pinned memory.
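The rule for the number of bounce buffers actually allocated can be written
as a one-line sketch. The function name is hypothetical, and
`networkdepth_total` is assumed to be the already-resolved (non-zero) value
of GASNET_NETWORKDEPTH_TOTAL.

```python
def bounce_buffer_count(bbuf_count, networkdepth_total):
    """Buffers allocated: the lesser of the two settings; zero defers to the depth."""
    if bbuf_count == 0:
        return networkdepth_total
    return min(bbuf_count, networkdepth_total)
```

With the defaults (1024 and 255) this gives 255 buffers, so lowering
GASNET_BBUF_COUNT only has an effect once it drops below the total network
depth.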
+ GASNET_PINNED_REGIONS_MAX
This provides a limit on the number of pinned regions to be created.
Similar to GASNET_PHYSMEM_MAX, the value gives an upper bound on pinned
regions per host, which is divided equally among processes on each host.
This may constrain dynamic registration via firehose (below).
The value may specify either a relative or absolute size. If the value
parses as a floating-point value less than 1.0 (including fractions such as
"5/8"), then this is taken as a fraction of the maximum supported region
count reported by the HCA(s). Otherwise the value is taken as an absolute
region count.
The default is to use a fraction of the HCA pinning resources equal to
the fraction of physical memory given by GASNET_PHYSMEM_MAX, subject to
a system-dependent maximum value.
Firehose configuration:
----------------------
These parameters must be equal across all nodes, and the behavior
otherwise is undefined.
The following environment variables control the per-process resources used by
the "firehose" [ref 1] dynamic registration library. By default, firehose
will use as much pinned memory as the HCA and O/S will permit, bounded by
GASNET_PHYSMEM_MAX.
Resource use is divided into two pools. The main pool is for managing
the pinning of the GASNet segment on remote nodes, while the "victim"
pool is used to manage pinnings for local use. By default in a
GASNET_SEGMENT_LARGE or GASNET_SEGMENT_EVERYTHING configuration, 75%
of the pinnable memory will go in the main pool and 25% into the victim
pool. In a GASNET_SEGMENT_FAST configuration, firehose is not needed
for management of the statically pinned GASNet segment, and by default
only a small fraction of the available memory is placed in the main
pool for internal uses and the majority is placed in the victim pool.
+ GASNET_USE_FIREHOSE
This environment variable is only available in a DEBUG build of
GASNet (one configured with --enable-debug).
This gives a boolean: "0" to disable or "1" to enable the use
of the firehose dynamic pinning library in a GASNET_SEGMENT_FAST
configuration. In a GASNET_SEGMENT_FAST configuration, the GASNet
segment is registered (pinned) with the HCA at initialization time,
because pinning is required for RDMA. However, GASNet allows for
local addresses (source of a PUT or destination of a GET) to lie
outside of the GASNet segment. So, to perform RDMA GETs and PUTs,
ibv-conduit must either copy out-of-segment transfers through
preregistered bounce buffers, or dynamically register memory. By
default firehose is used to manage registration of out-of-segment
memory. (default is ON).
Setting this environment variable to "0" (or "no") will disable use
of firehose, forcing the use of bounce buffers for out-of-segment
transfers. This will result in a significantly lower peak bandwidth
for large PUTs and GETs, with little or no effect on small message
latency. It is available only for debugging purposes.
In a GASNET_SEGMENT_LARGE or GASNET_SEGMENT_EVERYTHING configuration,
the GASNet segment is not preregistered and use of firehose is
required. Thus it is an error to disable firehose in such a
configuration.
+ GASNET_FIREHOSE_M and GASNET_FIREHOSE_MAXVICTIM_M
GASNET_FIREHOSE_M gives the amount of memory to place in the main pool,
while GASNET_FIREHOSE_MAXVICTIM_M gives the amount of memory to place in
the victim (local) pool. The suffixes "K", "M" and "G" are interpreted as
Kilobytes, Megabytes and Gigabytes respectively, with "M" assumed if no
suffix is given.
When neither variable is set, the defaults are respectively 75% and 25% of
the total pool. In a GASNET_SEGMENT_LARGE or GASNET_SEGMENT_EVERYTHING
configuration, this pool's size is the maximum pinnable memory (but see
below), while in a GASNET_SEGMENT_FAST configuration it is the same size as
the prepinned bounce buffer pool. Note that, as used here, "maximum
pinnable memory" may be less than determined from GASNET_PHYSMEM_MAX, and
in particular may be constrained by the product of the number of pinnable
regions and their maximum size. See GASNET_FIREHOSE_MAXREGION_SIZE,
GASNET_FIREHOSE_R and GASNET_FIREHOSE_MAXVICTIM_R for more information.
If only one of these variables is set, then the other defaults such that
their sum equals the total pool size. Therefore, to enlarge or reduce the
total pool, one must set both. Since enlarging the total pool risks
exhausting resources, potentially leading to crashes at runtime, doing so
will result in a warning.
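The defaulting rules for the two pool sizes can be sketched as below. This
models only the text above: the function name is hypothetical, sizes are in
arbitrary but consistent units, and rejecting a single setting larger than
the total pool is an assumption (that case is not described here).

```python
def firehose_pools(total, main=None, victim=None):
    """Return (main, victim, enlarged) pool sizes.

    total  -- total pool size before any user settings
    main   -- GASNET_FIREHOSE_M, or None if unset
    victim -- GASNET_FIREHOSE_MAXVICTIM_M, or None if unset
    enlarged is True when both are set and their sum exceeds `total`,
    the case in which a warning is documented.
    """
    if main is None and victim is None:
        main = total * 3 // 4             # 75% main, 25% victim
        victim = total - main
    elif main is None or victim is None:
        given = victim if main is None else main
        if given > total:
            # Assumption: behavior for this case is not documented.
            raise ValueError("single setting exceeds the total pool")
        if main is None:
            main = total - victim         # sum stays equal to total
        else:
            victim = total - main
    return main, victim, (main + victim) > total
```

For a total of 1024 units, setting only the main pool to 512 yields a victim
pool of 512; the total changes only when both values are given.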
+ GASNET_FIREHOSE_R and GASNET_FIREHOSE_MAXVICTIM_R
GASNET_FIREHOSE_R gives the maximum number of pinned regions to allocate
for the management of the main pool, while GASNET_FIREHOSE_MAXVICTIM_R
gives the maximum number of pinned regions to allocate for the management
of the victim (local) pool.
When neither variable is set, the default is to split the available pool of
pinnable regions (see GASNET_PINNED_REGIONS_MAX) in proportion to the
values of GASNET_FIREHOSE_M and GASNET_FIREHOSE_MAXVICTIM_M.
If only one of these variables is set, then the other defaults such that
their sum equals the total pool size. Therefore, to enlarge or reduce the
total pool, one must set both. Since enlarging the total pool risks
exhausting resources, potentially leading to crashes at runtime, doing so
will result in a warning.
The value of GASNET_FIREHOSE_R will be silently truncated if larger than
(GASNET_FIREHOSE_M / GASNET_FIREHOSE_MAXREGION_SIZE), since additional
regions would not be used. Similarly, GASNET_FIREHOSE_MAXVICTIM_R will be
silently reduced if it would address more than GASNET_FIREHOSE_MAXVICTIM_M.
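As a worked example of the truncation rule, the sketch below limits a region
count to what the pool can use. The function name is hypothetical, integer
(floor) division is an assumption, and all sizes are in bytes.

```python
def usable_regions(requested_regions, pool_bytes, maxregion_bytes):
    """Limit a region count to pool_bytes / maxregion_bytes (floor division)."""
    return min(requested_regions, pool_bytes // maxregion_bytes)
```

With a 1 GiB main pool and a 128 KiB maximum region size, at most
2^30 / 2^17 = 8192 regions can be used, so a GASNET_FIREHOSE_R of 10000
would be reduced to 8192 while a value of 4096 would be kept.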
+ GASNET_FIREHOSE_MAXREGION_SIZE
This gives the maximum size of a single dynamically pinned region,
should be a multiple of the pagesize, and preferably a power of two.
The suffixes "K", "M" and "G" are interpreted as Kilobytes, Megabytes
and Gigabytes respectively, with "M" assumed if no suffix is given.
The maximum addressable sizes of the main and victim pools are limited by
the product of this region size and the number of firehose regions
allocated to each pool.
If the value of this parameter is set to 0, then it will be automatically
adjusted to allow the main and victim pools to be addressed within the
available number of regions, if doing so is possible subject to a
system-dependent maximum (pagesize squared, or larger).
The default value of this parameter is 128KB.
+ GASNET_FIREHOSE_TABLE_SCALE
This parameter gives a floating point factor, used to scale the size of
hash tables used in the firehose library relative to their default sizes.
Smaller values produce smaller tables, saving memory at the expense of
performance.
The default value is 1.
+ GASNET_FIREHOSE_VERBOSE
This gives a boolean: "0" to disable or "1" to enable the output of
internal information of use to the developers. You may be asked
to run with this environment variable set if you report a bug that
appears related to the firehose algorithm.
External library settings:
--------------------------
These parameters must be equal across all nodes, and the behavior
otherwise is undefined.
+ MLX4_SINGLE_THREADED and MLX5_SINGLE_THREADED
In a SEQ or PARSYNC build, ibv-conduit will set these environment
variables to '1' under appropriate conditions (which include the default
configure and environment settings) to instruct libibverbs to elide
locking. These variables influence the libibverbs implementations for,
respectively, the "mlx4" and "mlx5" drivers.
Here "appropriate conditions" means that ibv-conduit is not starting any
asynchronous threads. In particular, these variables are *not* set by the
conduit if use of GASNET_RCV_THREAD has requested the asynchronous AM
receive thread, or if the combination of GASNET_CONN_* settings may require
starting a thread to receive dynamic connection requests. For more
information on these variables, see their respective documentation, above.
In all cases, a user's setting of either variable is preserved, allowing an
explicit setting of '0' to prevent the conduit from setting it to '1'.
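The conditions above can be illustrated with a small model that operates on a
dictionary standing in for the process environment. It is a sketch of the
documented behavior, not conduit code; the function and argument names are
hypothetical.

```python
def maybe_set_single_threaded(env, threads_mode, rcv_thread, conn_thread):
    """Set MLX{4,5}_SINGLE_THREADED=1 in `env` when the conditions hold.

    env          -- dict standing in for the process environment
    threads_mode -- GASNet threading mode, e.g. "SEQ", "PARSYNC" or "PAR"
    rcv_thread   -- True if the asynchronous AM receive thread is requested
    conn_thread  -- True if a dynamic-connection thread may be started
    """
    if threads_mode not in ("SEQ", "PARSYNC"):
        return                            # only SEQ and PARSYNC builds qualify
    if rcv_thread or conn_thread:
        return                            # an asynchronous thread may exist
    for name in ("MLX4_SINGLE_THREADED", "MLX5_SINGLE_THREADED"):
        env.setdefault(name, "1")         # a user's setting is preserved
```

Because `setdefault` never overwrites an existing key, a user who exports
MLX5_SINGLE_THREADED=0 keeps that value while MLX4_SINGLE_THREADED is still
set to 1.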
@ Section: Multi-rail Support @
Multi-rail support is OFF in GASNet ibv-conduit by default.
By default, ibv-conduit will use only the first active port on the
first active InfiniBand Host Channel Adapter (HCA). However, if more
than one HCA port is enabled for use, ibv-conduit will stripe
communications over them. See the sections "Build-time Configuration"
and "Runtime Configuration" for information on how to enable use of
more HCAs/ports, or to control which HCAs/ports are used.
To first order, the use of multiple ports or multiple adapters will
yield increases in both bandwidth (good) and software overhead (bad).
How the resulting trade off works for a given application may be
hard to predict. If one is concerned with obtaining the maximum
possible performance for a given application, then experiment with
the GASNET_NUM_QPS and/or GASNET_IBV_PORTS environment variables
(documented in "Runtime Configuration") to determine how a given
application runs best.
IMPORTANT NOTE:
The multi-rail support in ibv-conduit makes the assumption that the
number of HCAs used in every process will be identical. One must use
GASNET_IBV_PORTS to ensure that this property is true. Otherwise, the
behavior is undefined (startup crash being the most likely).
@ Section: On-Demand Paging (ODP) Support @
Recent Mellanox HCAs (ConnectX-4 and newer) support a feature known as
On-Demand Paging (ODP). Where this support is available, ibv-conduit
can use it to reduce resource pressure on physical memory. By default,
ibv-conduit will attempt to warn at runtime if this support does not
appear to be available despite having either the required software or
hardware. If you are using Mellanox ConnectX-4 HCAs or newer and
ibv-conduit warns about missing ODP support, we recommend that you
install the latest "Mellanox OFED" (aka MLNX_OFED) distribution,
available for download from Mellanox. Mellanox provides documentation
on installing the software and upgrading the HCA firmware (if needed).
At configure time, ibv-conduit does not know the hardware, firmware, or
driver versions on the compute nodes and cannot, in general, assume the
host running configure represents every node in the system. Therefore,
ODP support is enabled by default whenever the necessary library support
is detected by the configure script.
If ibv-conduit warns about missing ODP support on a system in which there
are no ODP-capable nodes, then we recommend reconfiguring GASNet using
--disable-ibv-odp to both eliminate the warnings and avoid the small
overhead of ODP support. However, in a heterogeneous system in which some
subset of compute nodes do support ODP, one should set GASNET_ODP_VERBOSE=0
in the environment to suppress warnings from nodes lacking ODP support,
while allowing ibv-conduit to continue using ODP on the remaining nodes.
@ Section: HCA Configuration @
GASNet ibv-conduit should *not* require any specialized configuration of
your HCAs to achieve normal, correct operation. However, this
section documents any configuration that *may* help improve performance.
We recommend you back up your configuration data prior to attempting any
modification, and that you confirm that any changes made produce a
measurable benefit before deciding to keep them. If trying a suggestion
here results in no measurable improvement, then we recommend that you
return the modified parameter(s) to their previous value(s).
WE DISCLAIM ALL RESPONSIBILITY IF FOLLOWING ANY SUGGESTION HERE RESULTS
IN AN UNSTABLE OR UNUSABLE SYSTEM.
Please consult the documentation provided with your HCA drivers, and/or
your vendor or system integrator for information on how to query or
change your HCA's configuration parameters.
+ The HCA configuration parameter MAX_QP_OUS_RD_ATOM controls the number
of simultaneous RDMA Reads for which a QP may act as Responder. Our
testing on one system with a default value of 8 showed that increasing
the value to 16 yielded approximately a 30% bandwidth improvement in an
RDMA-GET benchmark.
@ Section: Advice to Client Authors @
+ Negotiated-payload Active Messages (NPAM)
TL;DR: The results below, from one particular system, can be summarized in
the following rules-of-thumb for use of NPAM with ibv-conduit in the
*current* release:
+ Use of client-provided buffer is never advantageous.
+ Use of gasnet-provided buffer may be advantageous for Medium with
sufficiently large payloads, where both latency and bandwidth can exceed those
of FPAM.
+ Use of gasnet-provided buffer may be advantageous for Long payloads of
sufficient length, where bandwidth is better than FPAM (but at the
expense of worse latency).
Note that as development continues, these findings are subject to change.
Calls to gex_AM_Prepare{Request,Reply}{Medium,Long}() with client_buf !=
NULL are known as "client-provided buffer" calls. In this mode of
operation, there is a small penalty in CPU overhead relative to the
fixed-payload AM (FPAM) calls gex_AM_{Request,Reply}{Medium,Long}(), due
primarily to the split-phase calling convention. While the design of NPAM
allows for the possibility that NPAM with client-provided buffer could
enable larger Medium payloads than FPAM, ibv-conduit currently does not
provide that capability.
Measurements of both AM Mediums and Longs with client-provided buffer NPAM
on OLCF's Summit show the latency penalty relative to FPAM in a
Request/Reply "ping-pong" test is around 2% for payload sizes below a
couple hundred bytes, and 1% or lower for payloads of 512 bytes or larger.
For AM Longs the penalty eventually approaches zero for payloads of about
512KiB or larger.
Throughput of a "flood" test with client-provided buffer NPAM shows
penalties of about 8% at the smallest payload sizes, declining smoothly to
5% at the largest Medium payloads and approaching zero for Long payloads of
about 512KiB or larger.
Calls to gex_AM_Prepare{Request,Reply}{Medium,Long}() with client_buf ==
NULL are known as "gasnet-provided buffer" calls. In this mode of
operation, in which GASNet allocates a buffer where client code
assembles/generates the payload at AM injection time, there is a measurable
advantage to NPAM for sufficiently large payloads, but a small penalty for
small payloads.
Measurements with gasnet-provided buffer NPAM on OLCF's Summit show the
latency penalty in a Request/Reply "ping-pong" test is around 2% for
payload sizes below a couple hundred bytes for both Medium and Long. For
Mediums of 512 bytes or larger, the latency is improved over FPAM (by about
9% for large payloads). For Longs, the latency penalty for large payloads
is worse than for small payloads.
Throughput of a "flood" test with Mediums shows similar behavior to
ping-pong, with a throughput penalty of up to 5% for payload sizes 512
bytes and below, but improvements in throughput above 512 bytes (by about
17% for large payloads). For Longs there is a throughput penalty of up to
5% for payload sizes below about 2KiB, but for large payloads a throughput
improvement of 40% or more can be seen.
Your mileage may vary.
Relative performance may change in future releases.
@ Section: Known Problems @
+ Slow PHYSMEM probe at start-up
As described in more detail above, the environment variables
GASNET_PHYSMEM_MAX and GASNET_PHYSMEM_PROBE can be used to control an
upper-bound on the amount of memory that ibv-conduit will attempt to
register/pin. If configure was passed --with-ibv-physmem-max=VALUE, then
the given VALUE is used as the default value of GASNET_PHYSMEM_MAX.
However, if this environment variable is not set, and no value was given at
configure time for GASNET_PHYSMEM_MAX, then the values "2/3" and "yes" are
used for these two environment variables and a message will direct the user
to this text. Running with GASNET_PHYSMEM_PROBE=1 will also direct one to
this text.
If all compute nodes probed are found to allow the same amount of memory to
be pinned (either in absolute or relative terms), the message issued will
indicate the value determined and how to use it at configure or run time.
If there is a single memory configuration in use on your system, then
configuring (or setting GASNET_PHYSMEM_MAX) as directed in that message will
eliminate the time spent on the probe at startup, as well as the message.
If the memory configuration varies among nodes on the system, it may not be
safe to use a single recommended setting. There are at least three options
available:
1. To avoid the probe at the potential cost of registering less than the
maximum possible memory for communication, one can elect to use the
minimum probed value reported by the probe output across all of the node
configurations to be used. Depending on the actual configurations, using
an absolute (whole number) or relative (fraction or decimal) value may
result in using nearer to the maximum memory on all nodes. The most
robust option would be to configure using --with-ibv-physmem-max=VALUE,
but running with GASNET_PHYSMEM_MAX=VALUE may be simpler.
Unless one has configured with --enable-ibv-physmem-probe, or is running
with GASNET_PHYSMEM_PROBE=1 in the environment, the given value will be
used without running the probe.
2. If the time spent by the probe is not a concern, one can obtain the
behavior (default in older releases) of probing for UP TO a (safe)
default of 2/3 of apparent physical memory size on every run (without
any message) by running with environment variable GASNET_PHYSMEM_WARN=0.
3. If 2/3 is too conservative, one can force a probe with an alternative
value by configuring using
--with-ibv-physmem-max=VALUE --enable-ibv-physmem-probe
or running with
GASNET_PHYSMEM_MAX=VALUE GASNET_PHYSMEM_PROBE=1
In either case, one may also want to set GASNET_PHYSMEM_WARN=0.
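The trade-off in option 1 between an absolute and a relative value can be
worked through with a small sketch. The record type and function names below
are hypothetical (they are not part of GASNet), and the inputs are assumed to
be copied by hand from the probe output for each node configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of the probe result for one node configuration. */
typedef struct {
    uint64_t phys_bytes;     /* apparent physical memory on the node */
    uint64_t pinnable_bytes; /* amount the probe was able to pin */
} probe_result_t;

/* Smallest absolute pinnable amount over all configurations. */
uint64_t min_absolute(const probe_result_t *r, size_t n)
{
    assert(r != NULL && n > 0);
    uint64_t best = r[0].pinnable_bytes;
    for (size_t i = 1; i < n; i++) {
        if (r[i].pinnable_bytes < best) best = r[i].pinnable_bytes;
    }
    return best;
}

/* Smallest pinnable fraction of physical memory over all configurations. */
double min_fraction(const probe_result_t *r, size_t n)
{
    assert(r != NULL && n > 0);
    double best = (double)r[0].pinnable_bytes / (double)r[0].phys_bytes;
    for (size_t i = 1; i < n; i++) {
        double f = (double)r[i].pinnable_bytes / (double)r[i].phys_bytes;
        if (f < best) best = f;
    }
    return best;
}
```

For example, with one configuration of 64 GiB (48 GiB pinnable) and another
of 256 GiB (192 GiB pinnable), the minimum absolute value of 48 GiB would
limit the larger nodes to a quarter of what they could pin, while the minimum
relative value of 0.75 lets both configurations pin their maximum.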
+ Crashes have been seen using QLogic's InfiniPath HCAs with ibv-conduit
with default parameters. If you see crashes with a message containing
FATAL ERROR: aborting on reap of failed send
then we recommend setting the following two environment variables
GASNET_NETWORKDEPTH_PP=8
GASNET_QP_RD_ATOM=1
In our testing this resulted in about a 2% reduction in peak bandwidth,
but eliminated all instances of "aborting on reap of failed send".
+ Lack of XRC support
XRC is an optional feature in the Open Fabrics Verbs API, for which
GASNet's configure script will probe support. However, that probe is
limited to determining if the XRC-related function calls will compile
and link, and cannot distinguish platforms on which the calls are
present but always fail. On such systems, the failure will be reported
at runtime above a certain number of processes (86 with the defaults
for all environment variables) with the following message:
*** FATAL ERROR: Unable to create an XRC domain. Please see "Lack of XRC support" under Known Problems in GASNet's README-ibv.
If you experience this error it is recommended that you reconfigure
your build of GASNet with --disable-ibv-xrc. If that is not possible
one can also set GASNET_USE_XRC=0 in your environment.
+ Bugs 495 and 955
The "firehose" implementation of dynamic memory registration in ibv-conduit
is susceptible to incorrect behavior if it caches registration information
for a page of virtual memory that is returned to the OS by munmap() and the
same virtual address is allocated to the process again later. To avoid this
in most cases, ibv-conduit defaults to GASNET_DISABLE_MUNMAP=1 (documented
in the top-level README) on 64-bit systems. However, this can be undermined
by the GASNet client in a number of ways. A (non-exhaustive) list of known
potentially problematic behaviors includes:
+ Calls to `mallopt(M_TRIM_THRESHOLD, X)` for X != -1.
+ Setting environment variable MALLOC_TRIM_THRESHOLD_ other than to -1.
+ Calls to `mallopt(M_MMAP_MAX, Y)` for Y != 0.
+ Setting environment variable MALLOC_MMAP_MAX_ other than to 0.
+ Replacing the glibc implementation of malloc/free with one that will
munmap() memory at any time other than process exit.
If your GASNet client code exhibits any of the above-listed behaviors, you
should explicitly set GASNET_DISABLE_MUNMAP=0. This not only prevents
ibv-conduit from applying its default of 1, but additionally (and more
importantly) modifies other behaviors within GASNet to avoid memory
management behaviors that could trigger the bug. However, this setting
cannot protect client code which may use malloc()ed memory as the source
or destination of GASNet communication calls if such memory is later
returned to the OS via munmap().
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=495
That bug report describes the problem in more detail, and lists the best
known recommended work-around(s).
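For reference, the two mallopt() values that the list under Bugs 495 and 955
treats as non-problematic can be written as the following sketch. It assumes
glibc, and the function names are hypothetical. Whether a client should make
these calls itself is an assumption of this example, not a recommendation
taken from the bug reports.

```c
#include <stdlib.h>
#if defined(__GLIBC__)
#include <malloc.h>
#endif

/* Returns 1 when this sketch can do anything (glibc builds), 0 otherwise. */
int heap_sketch_supported(void)
{
#if defined(__GLIBC__)
    return 1;
#else
    return 0;
#endif
}

/* Applies the two settings the list above does not flag as problematic.
 * Returns 1 if both were applied, 0 otherwise (including non-glibc builds,
 * where this sketch does nothing). */
int keep_heap_mapped(void)
{
#if defined(__GLIBC__)
    /* -1 disables trimming of the top of the heap. */
    int ok = mallopt(M_TRIM_THRESHOLD, -1);
    /* 0 disables use of mmap() for servicing large allocations. */
    ok = mallopt(M_MMAP_MAX, 0) && ok;
    return ok;
#else
    return 0;
#endif
}
```

A client that needs any other values for these parameters falls under the
advice above to set GASNET_DISABLE_MUNMAP=0 explicitly.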
+ Bug 3447
In this release the implementation of ibv-conduit on multi-HCA InfiniBand
networks may yield incorrect results for certain communication patterns.
However, the environment variable GASNET_USE_FENCED_PUTS (described above)
can correct for this problem at the expense of higher latency and reduced
bandwidth for Put-based communications due to additional communication.
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=3447
That bug report describes the conditions under which the problem may
manifest, and the recommended work-arounds.
+ Bug 3693
In some circumstances, one may see the following message at startup:
*** FATAL ERROR: Unexpected error Bad address (errno=14) when registering the segment
If your system is configured to allow large SystemV shared memory segments,
then this can be resolved by switching to SystemV instead of POSIX for
GASNet's shared memory support. However, it is also possible that one just
needs to increase the system's limit on POSIX shared memory allocations as
described under "System Settings for POSIX Shared Memory" in GASNet's
top-level README.
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=3693
That bug report describes more about the problem, and provides more detail
regarding the recommended work-around(s).
+ Bug 3816
When using Mellanox On-Demand Paging (ODP), it is possible that abnormal exits
(SIGKILL or _exit(), in particular) may lead to leaking memory at the system
level, requiring a reboot to reclaim the lost memory. In extreme cases, the
cumulative effect of many leaks could eventually lead to the Linux
Out-Of-Memory killer being invoked, or even a kernel panic.
If you use ODP and see evidence of available memory declining over time, and
have reason to suspect a GASNet application is experiencing an abnormal exit,
then it may be advisable to disable ODP (see On-Demand Paging (ODP) Support
section).
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=3816
That bug report describes more about the problem, and provides more detail
regarding the recommended work-around(s).
+ Bug 3838
When using Mellanox On-Demand Paging (ODP), it has been observed that certain
patterns of many small RMA operations (as seen in GASNet's testvis) may lead
to communications "locking up" such that RMA operations issued do not complete
and eventually new RMA operations cannot be initiated.
If you use ODP and experience hangs from an application with many fine-grained
RMA operations, then you should retry with ODP disabled (see On-Demand Paging
(ODP) Support section).
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=3838
That bug report describes more about the problem, and provides more detail
regarding the recommended work-around(s).
+ Bug 3997
When using ibv-conduit on Linux/AARCH64 (aka ARM64 or ARMv8) systems, it has
been observed that some traffic patterns may lead to crashes of the IB HCA
or of the compute node.
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=3997
That bug report describes more about the problem, and provides more detail
regarding the recommended work-around(s).
+ Bugs 4008 and 4009
On some systems, RMA Puts and AM Longs with source buffers on read-only pages
(such as those that might be generated for const-qualified static variables)
are not handled correctly.
The most up-to-date information on this issue is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4008 (ODP)
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4009 (Solaris)
Those bug reports provide more details, and will provide the most current
information regarding any recommended work-around(s).
+ Bug 4314
While this file documents several environment variables as permitting
different values on different processes, it is not always possible to
achieve such a scenario.
The most up-to-date information on this issue is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4314
+ See the GASNet Bugzilla server for details on other known bugs:
https://gasnet-bugs.lbl.gov/
@ Section: Core API @
+ Flow-control for AMs.
The AMs in ibv-conduit are just implemented as send/recv traffic.
Therefore a send without a corresponding recv buffer preposted at the
peer will be stalled by the RNR (receiver-not-ready) flow control
in IB. However there are two reasons why we want to avoid this
situation. The first is that if such a send is blocked by flow
control, then the ordering semantics of IB tell us that all the
gets and puts that we've initiated after the AM was sent are also
stalled. Rather than let that happen, we should manually delay
those which are dependent on the AM. The second reason is that
under some conditions the RNR flow control is very poor. The problem
is that once the intended receiver sends a RNR NAK to indicate no
available recv buffers, IB has the SENDER's hardware/firmware poll
for the receiver to become ready again! That leaves us with a choice
between configuring a small polling interval and consuming a lot of
bandwidth for this polling, or a large interval which leads to
performance which is degraded more than necessary when IB flow control
is asserted.
For these reasons we implement some flow control at the AM level.
The basic idea is that every REQUEST consumes one credit on the
sending endpoint and every REPLY grants one credit on the receiving
endpoint. Thus if M is the initial number of credits on each endpoint
and every REQUEST has exactly one matching REPLY, then M becomes a
limit on the number of un-acknowledged REQUESTS in flight on an
endpoint. If we want to avoid RNR conditions, then we should start
with M credits and M preposted recv buffers on each endpoint. This
allows for only the receipt of M REQUESTS. In addition, a recv buffer
will be posted on demand for a REPLY just before sending each REQUEST.
It is a simple matter to count the credits when a REPLY is received
and to poll for credits when needed to send a REQUEST. It is also
simple to ensure the exactly-one-reply. We already ensure that
at-most-one reply is sent by the request handler. Additionally we
must check upon handler return for the case that the request handler
sent no reply, and send one implicitly. We just use a special
"system category" handler, gasnetc_SYS_ack, which doesn't even run
a handler.
To avoid using up 1/2 our bandwidth in the event of a REQUEST-REQUEST
ping-pong, we perform some coalescing to avoid sending too many
SYS_ack REPLIES. We keep up to GASNET_AM_CREDITS_SLACK "banked" on
the responding node, sending the SYS_ack REPLY only if the number
banked exceeds this limit. Credits which are banked get piggybacked
on the next REQUEST or REPLY headed back to the original requester.
To avoid a window of time between when we send a REPLY (credit) and
when we post the recv buffer, we must post the replacement recv
buffer BEFORE running an AM REQUEST handler. To do this we keep a
pool of unposted recv buffers (also used for the on-demand posting
of buffers needed for REPLIES). So, when we recv an AM REQUEST, we
grab a free recv buffer from the pool and post it to the endpoint,
and only then run the handler. We send an implicit reply if a REQUEST
handler didn't send any REPLY. Finally we take the recv buffer
containing the just-processed AM and we return it to the unposted
pool.
There is a corner case we must deal with when there are no spares
left in the unposted pool. In this case we will copy the received
REQUEST into a temporary (non-pinned) buffer before processing it.
This allows us to repost the recv buffer immediately. Since the
temporary buffer is not pinned, it cannot be used for receives.
Therefore, we free the temporary buffer when the handler is done,
rather than placing it in the unposted pool.
If we reap multiple AMs in a single Poll, then we reuse the
previous buffer as the "spare" for the next one, in place of
grabbing one from the unposted pool each time. Thus, we touch the
unposted pool at most twice per Poll, once for the first AM we
receive and once at the end to put the recv buffer of the final AM
back in the unposted pool. For the dedicated receive thread we can
do even better, never touching the unposted pool at all, by always
keeping a single thread-local "spare", initially acquired at startup.
Note that when SRQ is used, no flow control is used.
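The credit accounting and SYS_ack coalescing described above can be sketched
as follows. This is a single-threaded illustration under our own assumptions,
not the ibv-conduit code: every name is hypothetical, "slack" stands in for
GASNET_AM_CREDITS_SLACK, and a real implementation would need atomics or
locking.

```c
#include <assert.h>

/* Hypothetical per-peer flow-control state. */
typedef struct {
    int credits; /* REQUESTs this side may still send to the peer */
    int banked;  /* credits owed to the peer but not yet returned */
    int slack;   /* stand-in for GASNET_AM_CREDITS_SLACK */
} am_flow_t;

void am_flow_init(am_flow_t *f, int m, int slack)
{
    assert(m > 0 && slack >= 0);
    f->credits = m;
    f->banked = 0;
    f->slack = slack;
}

/* Remove all banked credits so they can ride on an outgoing message. */
static int take_banked(am_flow_t *f)
{
    int n = f->banked;
    f->banked = 0;
    return n;
}

/* Try to send a REQUEST. Returns -1 if no credit is available (the caller
 * must poll); otherwise consumes one credit and returns the number of
 * banked credits piggybacked on the REQUEST. */
int am_send_request(am_flow_t *f)
{
    if (f->credits == 0) return -1;
    f->credits--;
    return take_banked(f);
}

/* An incoming REQUEST means one more credit is owed to the peer. */
void am_recv_request(am_flow_t *f) { f->banked++; }

/* An explicit REPLY sent by a handler carries every banked credit. */
int am_send_reply(am_flow_t *f) { return take_banked(f); }

/* Called when a REQUEST handler returned without replying. Returns the
 * number of credits to carry in a SYS_ack, or 0 if none is sent yet. */
int am_handler_done_no_reply(am_flow_t *f)
{
    return (f->banked > f->slack) ? take_banked(f) : 0;
}

/* Credits arriving on any message from the peer. */
void am_recv_credits(am_flow_t *f, int n) { f->credits += n; }
```

With M=2 and a slack of 1, the first un-replied REQUEST is banked silently,
and the second one triggers a single SYS_ack that returns both credits.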
@ Section: Extended API @
[This section is still *mostly* accurate, but has not been kept up-to-date
with respect to EX updates. Most notably: "LongAsync" is gone, "_bulk" has
been supplanted by lc_opt options, and all the function names have changed.]
Notes for myself for extended API:
+ The send completion facility consists of two pointers to counters,
associated with each sbuf. If these pointers are non-NULL then the
counter is decremented atomically when the send is complete.
One counter is for awaiting reuse of local memory and is
only used for sbufs which are doing zero copy. This counter
provides the mechanism for Longs and non-bulk puts to block before
they return, and should be allocated as an automatic variable.
The second counter is for request completion and should be non-NULL
for every sbuf for which request completion would be checked (all
gets & puts, but not the Longs). For nb and nbi the counter is
waited on at sync-time. Therefore the explicit event is a struct
containing the counter.
+ Similar to the reference implementation's cut-off between Mediums
(which typically do a source-side copy) and Longs (which may not),
we have a cut-off size, below which the RDMA-put operation will do
source-side copies _iff_ local completion is desired (Long, put_nb,
and put_nbi).
+ The gets are done w/ RDMA-reads, and use the sbuf bounce buffers
if the local memory is not in the segment (or otherwise registered).
The value gets also pass through the bounce buffers. Clearly there
is no bulk/non-bulk distinction in terms of local memory reuse, just
the alignment and optimal size distinctions. So, only the outstanding
request counter on the sbuf is needed for syncs of all types of gets.
+ Table of when synchronization is needed
                Local   Remote
Operation       Sync    Sync
------------------------------
LongAsync        X       X
Long             I       X
put_nb           I       S
put_nbi          I       S
put_nb_bulk      X       S
put_nbi_bulk     X       S
put_nb_val       X       S
put_nbi_val      X       S
put              X       I
put_bulk         X       I
put_val          X       I
get_nb           X       S
get_nbi          X       S
get_nb_bulk      X       S
get_nbi_bulk     X       S
get              X       I
get_bulk         X       I
get_val          X       I
X = Not needed at all (or not even applicable with _val forms)
I = Needed before (I)nitiating function returns
S = Needed before (S)ynchronizing function returns
+ Some minor tweaks are used to avoid allocation of counters in
some cases.
- For all the functions which require waiting on a counter in the
initiating function, the counter can be allocated on the stack (as
an automatic variable).
- For the implicit-handle forms the request counter is in the
thread-specific data, possibly in an access-region.
- For the explicit event forms the request counter must be allocated
from some pool, requiring some memory management work. This is
done with a modification to the code from the reference
implementation, and uses thread-local data to avoid locks.
+ The memsets can be more efficiently implemented as a _local_ memset
followed by a PUT, for small enough sizes. This is not currently
implemented.
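The two-counter send completion facility from the first note above can be
sketched with C11 atomics. The struct and function names are hypothetical,
and the increment at attach time is our assumption (the note only states
that the counters are decremented when the send completes).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical view of the two counter pointers carried by an sbuf. */
typedef struct {
    atomic_int *mem_oust; /* local memory reuse; NULL unless zero-copy */
    atomic_int *req_oust; /* request completion; NULL for Longs */
} sbuf_counters_t;

/* Associate an sbuf with the counters it should signal (either may be
 * NULL) and count it as outstanding on each non-NULL counter. */
void sbuf_attach(sbuf_counters_t *s, atomic_int *mem, atomic_int *req)
{
    assert(s != NULL);
    s->mem_oust = mem;
    s->req_oust = req;
    if (mem) atomic_fetch_add(mem, 1);
    if (req) atomic_fetch_add(req, 1);
}

/* Run when the send for this sbuf completes: atomically decrement each
 * non-NULL counter. */
void sbuf_send_complete(sbuf_counters_t *s)
{
    assert(s != NULL);
    if (s->mem_oust) atomic_fetch_sub(s->mem_oust, 1);
    if (s->req_oust) atomic_fetch_sub(s->req_oust, 1);
}

/* A waiter polls until the counter reaches zero. */
int counter_is_done(atomic_int *c) { return atomic_load(c) == 0; }
```

In this sketch an initiating function that needs local completion would keep
the first counter as an automatic variable and poll counter_is_done() before
returning, while the second counter would live in the explicit event or in
thread-specific data and be polled at sync time.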
@ Section: GEX_FLAG_IMMEDIATE Support @
The following summarizes the current state of support for GEX_FLAG_IMMEDIATE.
Last reviewed against 2022.3.0
            |   SEGMENT   |  SEGMENT   |  SEGMENT   |
OPERATION   |    FAST     |   LARGE    | EVERYTHING |
------------+-------------+------------+------------+
FPAM Short  |      Y      |     Y      |     Y      |
FPAM Medium |      Y      |     Y      |     Y      |
FPAM Long   |   Note 1    |   Note 1   |   Note 1   |
            |             |            |            |
NPAM Medium |      Y      |     Y      |     Y      |
NPAM Long   |   Note 2    |     N      |     N      |
            |             |            |            |
RMA Put     |      N      |     N      |     N      |
RMA Get     |      N      |     N      |     N      |
------------+-------------+------------+------------+
Y = GEX_FLAG_IMMEDIATE is fully implemented
N = GEX_FLAG_IMMEDIATE is ignored
FPAM = Fixed-payload AM
NPAM = Negotiated-payload AM
For the notes which follow, define a "packed Long" as an AM Long having a
payload size of GASNET_PACKEDLONG_LIMIT or less.
Note 1: FPAM Long
+ For a packed Long the IMMEDIATE flag is fully implemented.
For the LARGE and EVERYTHING segment modes, this notably includes all Reply
calls because in those cases the payload is always packed.
+ Otherwise, the IMMEDIATE flag is ignored.
Note 2: NPAM Long / FAST
+ For a packed Long the IMMEDIATE flag is fully implemented.
This notably includes all calls with a GASNet-allocated buffer because in
this case the payload is always packed.
+ Otherwise, the IMMEDIATE flag is *partially* implemented. The logic in
Prepare is sensitive to AM flow-control credits and buffer allocation for
the AM header, but ignores the possibility of a stall in injection of the
payload transfer at Commit time.
@ Section: Graceful exits @
As of June 24, 2003, ibv-conduit passes all 9 (I added two recently)
of the cases in testexit. By "Pass" I mean that the entire gasnet job
(tested up to 8-way across my 4 dual-processor machines) terminates
with no orphans, and with tracing properly finalized (if tracing is
enabled). On August 11, 2003 the graceful exit code was revised to
send O(N) network traffic in the worst case, as opposed to the O(N^2)
required in all cases in the first implementation.
Additionally, the exit code is properly propagated through the
bootstrap, to yield a correct exit code for the parallel job as a
whole. If using MPI for bootstrapping, the actual exit code will
depend on what the given MPI implementation supports (some ignore the
exit code of the individual processes).
This code is heavily commented, but for the curious, here is a
description of the code.
There are three paths by which an exit request can begin. The first
is through gasnetc_exit(), which may be called by the user, by the
conduit in certain error cases, and by the default signal handler for
"termination signals". The second is via a remote exit request,
passed between nodes to ensure full-job termination from
non-collective exits. The third is via an atexit/on_exit handler,
registered by gasnetc_init(), used to catch returns from main() and
user calls to exit().
There are slight variations among the code in these three cases, but
most of the work is common, and is performed by three functions:
gasnetc_exit_head(), gasnetc_exit_body() and gasnetc_exit_tail(). The
first of these, _head, is used to determine the "first" exit and store
its exit code for later use. This is important because even a
collective exit will involve receiving remote exit requests. Only if
a remote exit request is received before any local calls to
gasnetc_exit(), should the request handler initiate the exit. Note
that even in the case of a collective exit it is possible for the
first remote request to arrive before the local gasnetc_exit() call.
However, that is made very unlikely by the timing and is nearly
harmless since the only difference is the raising of SIGQUIT in
response to a remote exit request, which is not done for
locally-initiated ones.
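The "first exit wins" behavior of _head can be sketched with a single atomic
compare-and-swap. The names below are hypothetical, and this is not the
actual gasnetc_exit_head() code. The sketch also assumes that anyone reading
the saved code does so only after the winning call has returned.

```c
#include <stdatomic.h>

static atomic_int exit_started; /* 0 until the first exit request is seen */
static int first_exitcode;      /* written only by the winning caller */

/* Returns 1 if this call is the first exit request (local or remote),
 * 0 otherwise. Only the first caller's exit code is stored. */
int exit_head_sketch(int exitcode)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(&exit_started, &expected, 1)) {
        first_exitcode = exitcode;
        return 1;
    }
    return 0;
}

/* Exit code recorded by the first exit request. */
int saved_exitcode_sketch(void) { return first_exitcode; }
```

A remote exit request handler would use the return value to decide whether it
should initiate the exit or whether a local exit is already in progress.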
The second common function, _body(), is used to perform the "meat" of
the shutdown. It begins by ignoring SIGQUIT to avoid re-entrance, and
then blocks all but the first caller in a polling loop to prevent
multiple threads from executing the shutdown code. While the template
uses a mutex to block additional threads, ibv-conduit uses atomics
to avoid problems with signal context. Once additional threads are
blocked from making progress through _body(), the AM progress thread (if
any) is terminated to prevent it from interfering.
Because strange things can happen if we are trying to shutdown from a
signal context, a signal handler is installed for all the "abort
signals". This signal handler just calls _exit() with the exit code
stored by _head(). Because we may have problems shutting down if
certain locks were held when a signal arrived, we also install the
signal handler for SIGALRM, and use the alarm() function to bound the
time spent blocked in the shutdown code. While there is the risk that
this alarm might go off "too soon" if the shutdown has lots of work to
do, we can be certain that the correct exit code is still generated.
An additional step to address signal context is the definition of
GASNETC_FATALSIGNAL_CALLBACK in ibv-conduit/gasnet_core_fwd.h, which
gives ibv-conduit the chance to set gasnetc_exit_in_signal=1 just before
the conduit-independent signal-handling code can reach the exit path.
If this variable is non-zero then certain operations known to be
especially risky in signal-handler context are skipped.
After signal handlers are established, _body calls gasnetc_disable_AMs()
to zero the table of client-registered AM handlers (though preserving
the internal ones). This helps avoid interference.
Then _body calls gasnetc_exit_reduce() to try to perform a collective
reduce-to-all over the exit codes. If this completes within a given
timeout then we know the exit is collective (and "graceful" is set
non-zero) and skip over the leader/member logic described in the next
two paragraphs.
If the reduction does not complete within the timeout, then _body next
calls gasnetc_get_exit_role() to "elect" a leader node for the exit.
This is done with an alarm() timer in force. The use of an "election"
with a timeout ensures that we will exit, even if node 0 is wedged.
The election of a leader proceeds by sending a system-category AM
request to node 0, and spinning to wait for a corresponding reply,
which will indicate if the local node is the "leader" or a "member"
in the coordination of the graceful exit. The logic on node 0
ensures that the first "candidate" is always made the leader, not
waiting for multiple AMs to arrive. Additionally the member nodes
may, under circumstances described below, know before entering
gasnetc_get_exit_role() that they are members, and will not bother
to send an AMRequest to node 0. In either case gasnetc_get_exit_role()
indicates to _body which role the local node is to assume.
From _body, the single leader node will enter gasnetc_exit_leader() and
will begin sending a remote exit request (system-category AM, so this
will all work between _init and _attach) to each peer. Then the leader
waits (with timeout, of course) for a reply from each peer. This request
conveys the desired exit code to each node. It also will wake them out
of a spin-loop, barrier, or other case where they were not yet aware of
the need to exit. In the handler for the exit request, a node will send
a reply back to the leader, so it knows all the nodes are reachable. It
will set its role to "member" and, if no exit is in-progress, it will start
the exit procedure, as described later. From _body, the member nodes all
call gasnetc_exit_member(), which simply spins until the remote exit request
has arrived from the leader.
Regardless of whether exit coordination (the reduction, or exit requests
and replies) completed within their timeouts, _body proceeds to flush
stdout and stderr one last time and closes stdin, stdout and stderr.
Finally, _body shuts down its bootstrap support. If either coordination
was completed within the timeout, then the gasnetc_bootstrapFini()
routine is called indicating that we'll not be making any more calls
to the bootstrap code and expect to exit shortly. However, if both
coordinations did fail we call gasnetc_bootstrapAbort(exitcode). This
call is meant to request that the bootstrap terminate our job "with
prejudice" since we failed to coordinate a graceful shutdown on our
own. We do this to try to avoid orphans, but risk lots of unsightly
error messages and possible loss of our exit code. Assuming we did not
call _bootstrapAbort (which does not return) we finish _body by
canceling our alarm timer and return to our caller.
The final common routine is gasnetc_exit_tail(). This function just
does the last bit of work to terminate the job. It is not included in
_body because we let the atexit/on_exit() case terminate "normally"
after _body returns. However, in the case of exits initiated via
gasnet_exit() or remote exit request we call _tail to complete the
exit. In _tail we set an atomic variable to wake any threads which
were stuck polling in _body due to being other than the first thread
to enter. Those threads should eventually wake and also call _tail to
terminate. Next, we call gasneti_killmyprocess() to do any platform-
specific magic required to get the entire multithreaded application to
exit. Finally we call _exit() with the saved exit code.
Given the routines gasnetc_exit_{head,body,tail}() the code for the
three types of exit are pretty trivial. In particular, gasnetc_exit()
just calls _head, _body and _tail with no additional logic. In the
request handler for the exit request AM, we look at the return from
_head to determine if this exit request is the first we've seen
(inclusive of local calls to gasnet_exit() and our atexit/on_exit handler). If
it IS the first exit request, then we raise a SIGQUIT, as required by
the GASNet spec, to allow the user's handler to perform its cleanup.
However, to get the most robust exit code we don't want to run the
_body code from a signal handler context if we can avoid it.
Therefore we inspect the signal handler and skip the raise() call if
the handler is the gasnet default handler, SIG_DFL or SIG_IGN. After
the raise() returns, or is skipped altogether, we are certain that
the user's handler, if any, has executed and has NOT called
gasnet_exit(). If a user handler had called gasnet_exit(), then
raise() would not have returned. So, if we reach the code after the
possible raise(), we proceed to call gasnetc_exit_body() and _tail to
complete the (hopefully) graceful exit of the gasnet job.
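The decision of whether to raise SIGQUIT can be sketched with the POSIX
sigaction() query form. The function name and the default_handler parameter
(a stand-in for the gasnet default handler) are hypothetical. Skipping the
raise when the query fails is a choice made for this sketch only.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>

typedef void (*sig_handler_t)(int);

/* Returns 1 if SIGQUIT should be raised so a user handler can run, or 0
 * if the raise() can be skipped. Pass NULL for default_handler if there
 * is no library default handler to compare against. */
int should_raise_sigquit(sig_handler_t default_handler)
{
    struct sigaction cur;
    /* A NULL second argument queries the current disposition only. */
    if (sigaction(SIGQUIT, NULL, &cur) != 0) return 0;
    /* A three-argument (SA_SIGINFO) handler is always a user handler. */
    if (cur.sa_flags & SA_SIGINFO) return 1;
    if (cur.sa_handler == SIG_DFL || cur.sa_handler == SIG_IGN) return 0;
    if (default_handler != NULL && cur.sa_handler == default_handler) return 0;
    return 1;
}

/* Example user handler, present only to exercise the sketch. */
void example_user_handler(int sig) { (void)sig; }
```

Only when this returns 1 would the request handler call raise(SIGQUIT) before
proceeding to _body and _tail.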
It is important to note that if we get a remote exit request that
initiates an exit, then we will never return from the handler.
However, the design of the AM code in IBV conduit ensures that this
will actually work without deadlock. For one, we never run handlers
from signal context or with locks held. Thus we can expect a
"clean" set of locks. Furthermore, we don't expect to do anything
useful with the network once the request handler calls _body anyway.
The atexit handler just calls _head and _body before returning to
allow the exit to complete. In this case we have a little problem
with the lack of access to the return code. Therefore we just pass 0
to _head, which _body then sends in the remote exit requests.
Experience has shown that, at least with LAM/MPI for bootstrap, when
all but one task exits with zero, the single non-zero exit code
becomes the exit code for the parallel job. Therefore, using zero
here gives the specified exit code from the parallel job for both
collective and non-collective returns from main.
If support is detected at configure time for on_exit(), then it is
used rather than atexit(), and the problem of the missing return code
vanishes.
In the normal case of a collective exit, the reduce-to-all-with-timeout
is performed in 3 steps. The first is an intra-supernode reduction.
The second is a reduce-to-all over supernodes using the same
communication pattern as the dissemination barrier, requiring
ceil(log_2(SN)) rounds in which each supernode sends and receives one
AM (where "SN" is number of supernodes). The third step is a
supernode-scoped broadcast. For non-PSHM builds, only a dissemination
based reduce-to-all is performed (steps 1 and 3 are eliminated and
"supernode" is replaced by "node" in the description of step 2).
For the non-collective exits, there is both a "best case" and a
"worst case" to consider:
Best case: one node is way ahead of the others and can win the
leader election and send remote exit requests before the others attempt
the election. In this case the coordinated shutdown needs 1 round-trip
for the election, followed by (N-1) round-trips for the remote exit
request/reply, for a total of 2*N AMs sent (not counting those from
the failed reduction).
Worst case: all nodes attempt the election at roughly the same time
and a full N round-trips take place for the election, followed by (N-1)
round trips for the remote exit request/reply, for a total of 4*N-2 AMs
sent (plus those from the failed reduction).
The average case for non-collective exits is somewhere between those two.
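The message counts given above for both the collective and non-collective
cases can be checked with a little arithmetic. The function names are
hypothetical. The send-peer formula assumes the textbook dissemination
pattern, in which each participant sends to the one 2^k positions ahead in
round k. This text only states that the same pattern as the dissemination
barrier is used.

```c
#include <assert.h>

/* Rounds of the reduce-to-all over sn supernodes: ceil(log_2(sn)),
 * computed without floating point. */
unsigned dissemination_rounds(unsigned sn)
{
    assert(sn >= 1);
    unsigned rounds = 0;
    for (unsigned long long dist = 1; dist < sn; dist *= 2) rounds++;
    return rounds;
}

/* Peer that supernode `me` sends to in 0-based round `k` (assumed
 * textbook dissemination pattern). */
unsigned dissemination_send_peer(unsigned me, unsigned k, unsigned sn)
{
    assert(me < sn && k < dissemination_rounds(sn));
    return (unsigned)((me + (1ULL << k)) % sn);
}

/* Best case: 1 election round-trip plus (N-1) exit round-trips, with
 * 2 AMs per round-trip, giving 2*N. */
unsigned long exit_ams_best(unsigned long n)
{
    assert(n >= 1);
    return 2 * (1 + (n - 1));
}

/* Worst case: N election round-trips plus (N-1) exit round-trips, with
 * 2 AMs per round-trip, giving 4*N-2. */
unsigned long exit_ams_worst(unsigned long n)
{
    assert(n >= 1);
    return 2 * (n + (n - 1));
}
```

For an 8-node job this gives 3 dissemination rounds in the collective case,
and between 16 and 30 AMs for the coordination of a non-collective exit.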
@ Section: References @
[1] Bell, Bonachea. "A New DMA Registration Strategy for Pinning-Based
High Performance Networks", Workshop on Communication Architecture
for Clusters (CAC'03), 2003.
Also at https://gasnet.lbl.gov/