Index of /download/dist/gasnet/ucx-conduit
GASNet ucx-conduit documentation
Mellanox Technologies LTD Copyright (c) 2019-2021.
Portions copyright 2021-2022, The Regents of the University of California.
Boris I. Karasev <[email protected]>
Artem Y. Polyakov <[email protected]>
Paul H. Hargrove <[email protected]>
@ TOC: @
@ Section: Overview @
@ Section: Where this conduit runs @
@ Section: Build-time Configuration @
@ Section: Job Spawning @
@ Section: Runtime Configuration @
@ Section: Multi-rail Support @
@ Section: On-Demand Paging (ODP) Support @
@ Section: HCA Configuration @
@ Section: Known Problems @
@ Section: Design Overview @
@ Section: Graceful exits @
@ Section: References @
@ Section: Overview @
Ucx-conduit implements GASNet over the Unified Communication X (UCX) framework
(see https://www.openucx.org/ for general information on UCX).
This conduit is feature-complete and supports Active Messages and
hardware-offload of RMA and Atomic operations. Performance tuning is
is planned as future work.
Since this conduit is considered "experimental" pending performance-tuning
work, it is disabled by default. It can be enabled explicitly at configure
time using either `--enable-ucx` or `--enable-ucx=probe`. Either option will
enable ucx-conduit if `configure` locates the prerequisites. The difference
is that the first option makes failure to locate the prerequisites fatal in
`configure`.
@ Section: Where this conduit runs @
The conduit is based on Unified Communication X (UCX) communication library (see
https://www.openucx.org/ for general information on UCX). UCX is an open-source project
developed in collaboration between industry, laboratories, and academia to create
an open-source production-grade communication framework for data-centric and
high-performance applications. The UCX library can be downloaded from repositories
(e.g., Fedora/RedHat yum repositories). The UCX library is also part of Mellanox OFED
and the NVIDIA Mellanox HPC-X Software Toolkit.
The conduit is tested and known to work with
- Mellanox InfiniBand and RoCE devices starting from ConnectX-5
- UCX library version 1.6 and above
- Linux platform
For Mellanox adapters, UCX conduit supports all transports: RC, UD and DC. For large-scale
applications, it is recommended to use DC transport for scalability reasons.
@ Section: Build-time Configuration @
In order to enable the conduit, the option '--enable-ucx' must be passed to configure.
By default, ucx-conduit will not be built.
See `configure --help` for additional ucx-related configure options.
See the extended-ref README for the GASNet general-purpose configuration parameters.
--with-ucx-max-medium=[value]
By default gex_AM_LUB{Request,Reply}Medium() return 3992, which is a
4KiB buffer minus overheads such as handler arguments. This configure
option sets the default value for the GASNET_UCX_MAX_MEDIUM environment
variable, which determines the LUB size at runtime.
The value cannot be less than 512.
The default value is 3992.
@ Section: Job Spawning @
If using UPC++, Chapel, etc. the language-specific commands should be used
to launch applications. Otherwise, applications can be launched using
the gasnetrun_ucx utility:
+ usage summary:
gasnetrun_ucx -n <n> [options] [--] prog [program args]
options:
-n <n> number of processes to run (required)
-N <N> number of nodes to run on (not supported by all MPIs)
-E <VAR1[,VAR2...]> list of environment vars to propagate
-v be verbose about what is happening
-t test only, don't execute anything (implies -v)
-k keep any temporary files created (implies -v)
-spawner=(ssh|mpi|pmi) force use of a specific spawner (if available)
There are as many as three possible methods (ssh, mpi and pmi) by which one
can launch an ucx-conduit application. Ssh-based spawning is always
available, and mpi- and pmi-based spawning are available if the respective
support was located at configure time. The default is established at
configure time (see section "Build-time Configuration").
The choice of spawner only affects the protocol used for parallel job setup and
teardown; in particular it is NOT used to implement any part of the
steady-state GASNet communication operations. As such, the selected protocol
needs to be stable and co-exist with GASNet communication, but its performance
efficiency is usually not a practical consideration.
To select a non-default spawner one may either use the "-spawner=" command-
line argument or set the environment variable GASNET_UCX_SPAWNER to "ssh",
"mpi" or "pmi". If both are used, then the command line argument takes
precedence.
@ Section: Runtime Configuration @
Ucx-conduit supports all of the standard GASNet environment variables
and the optional GASNET_EXITTIMEOUT family of environment variables.
See GASNet's top-level README for documentation.
Additional environment variables:
+ GASNET_UCX_AM_ORDERED_TLS
This Boolean setting determines if the logic for transfer of AM Long
payloads may assume UCX is using an in-order transport. This results in
faster transfers, but could yield data corruption if the payload and header
are reordered.
Setting to 1 forces this optimization, while 0 disables it.
The default is to enable this setting if the conduit can determine that
GASNet's PSHM (which guarantees ordering) is used for all intra-node
communication, as opposed to UCX shared-memory transport which does not
guarantee ordering (see bug 4155).
+ GASNET_UCX_MAX_MEDIUM
This setting determines the size of buffers used for AM Mediums, and
thus the return value of the gex_AM_LUB{Request,Reply}Medium() queries.
The value cannot be less than 512.
Unless a different value was set using --with-ucx-max-medium=[value]
at configure time, the default value is 3992.
In order to control UCX parameters (i.e. device or transport selection), environment
variables are used.
Most commonly used variables are described on the UCX Wiki page
https://github.com/openucx/ucx/wiki/UCX-environment-parameters
For the full list of tuning knobs supported by a particular UCX version,
see the output of `ucx_info -f`.
For software stacks allowing concurrent use of the UCX library in its
different components or layers, ucx-conduit supports a personalized prefix
that allows specifying unique parameters for its use that will not affect
or conflict with global parameters (i.e. "UCX_TLS", "UCX_NET_DEVICES") and/or
other personalized parameters.
+ For UCX versions prior to 1.9, the prefix is "UCX_GASNET"
+ For UCX versions 1.9 and newer, the prefix is "GASNET_UCX"
So, using "[PREFIX]" to denote the version-appropriate value, it is
recommended that one set variables "[PREFIX]_TLS" and "[PREFIX]_NET_DEVICES"
instead of "UCS_TLS" and "UCX_NET_DEVICES". However, the following text will
use the short "UCX_" prefix for brevity and to match UCX documentation.
The "UCX_TLS" environment variable controls transport selection and is
one of the most commonly used UCX parameters. Available options are:
$ ucx_info -f | grep UCX_TLS -B 23 | head -n 20
#
# Comma-separated list of transports to use. The order is not meaningful.
# - all : use all the available transports.
# - sm/shm : all shared memory transports (mm, cma, knem).
# - mm : shared memory transports - only memory mappers.
# - ugni : ugni_smsg and ugni_rdma (uses ugni_udt for bootstrap).
# - ib : all infiniband transports (rc/rc_mlx5, ud/ud_mlx5, dc_mlx5).
# - rc_v : rc verbs (uses ud for bootstrap).
# - rc_x : rc with accelerated verbs (uses ud_mlx5 for bootstrap).
# - rc : rc_v and rc_x (preferably if available).
# - ud_v : ud verbs.
# - ud_x : ud with accelerated verbs.
# - ud : ud_v and ud_x (preferably if available).
# - dc/dc_x : dc with accelerated verbs.
# - tcp : sockets over TCP/IP.
# - cuda : CUDA (NVIDIA GPU) memory support.
# - rocm : ROCm (AMD GPU) memory support.
# Using a \ prefix before a transport name treats it as an explicit transport name
# and disables aliasing.
#
In order to ensure ucx-conduit is being used with supported Mellanox
devices, the following transport set is recommended to exclude unsupported
networks: "UCX_TLS=ib,sm,self". To leverage UCX support for GPU device
memory for GASNet-EX Memory Kinds, one may append ",cuda" or ",rocm".
However, please see "Known Problems" for issues with UCX support for GPU
device memory.
Ucx-conduit supports all of the standard and the optional GASNET_EXITTIMEOUT GASNet
environment variables. See GASNet's top-level README for documentation.
@ Section: Multi-rail Support @
UCX supports multi-rail configurations, and ucx-conduit has been successfully
tested on such systems. Please consult the online UCX documentation at
https://www.openucx.org/ for information on configuring the use of multiple
network rails.
@ Section: On-Demand Paging (ODP) Support @
The conduit supports ODP through UCX, see "UCX_IB_REG_METHODS" UCX environment variable.
@ Section: Known Problems @
+ This conduit does not support Mellanox ConnectX-4 adapters, or older.
This is unlikely to change.
+ Support for segment-everything mode is currently disabled due to unbounded
memory use in AMLong (especially as used in the "amref" implementation of RMA).
Restored support for `--enable-segment-everything` is an area for future work.
+ Bug 4165
The implementation of AM Mediums is known to have anomalous performance
for payloads above about 1KiB on multiple systems.
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4165
+ Bug 4172
In hybrid MPI applications there are conditions which can trigger a crash at
exit time due to a call to MPI_Barrier() by ucx-conduit after MPI_Finalize()
by the application.
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4172
+ Bug 4301
As descried in the "Runtime Configuration" section, UCX versions 1.9 and
newer use prefix of "GASNET_UCX" for private (per client of UCX)
configuration. However, this same prefix is used by ucx-conduit for its
own environment settings. The result is that, by default, the UCX library
may issue warnings which include text like the following:
"UCX WARN unused env variable: GASNET_UCX_SPAWNER"
and
"(set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)"
The work-around is to set UCX_WARN_UNUSED_ENV_VARS=n as the message
recommends.
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4301
+ Bug 4337
Some Active Message traffic patterns may induce ucx-conduit to perform
excessive buffering within the UCX library. This can lead to unexpectedly
large memory consumption, which in turn may lead to multiple fatal error
modes.
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4337
+ Bug 4384
Use of GASNet-EX "CUDA_UVA" memory kinds with ucx-conduit may crash due
to incorrect default algorithm selection inside UCX for RMA transfers
involving device memory.
For a certain range of UCX versions, there are work-arounds based on
setting environment variables to modify the algorithm selection to be
correct for device memory at the expense of slowing of host memory RMA.
For other versions, there is no known work-around.
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4384
+ Bug 4393
The use by ucx-conduit of UCX's support for CUDA devices currently lacks
the manipulation of CUDA contexts required for correct multi-device
operation.
Use of the recommended "UCX_TLS=ib,sm,self" setting is an effective
work-around. Others are described in the bug report.
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4393
* Bug 4474
When using hugepages together with Mellanox InfiniBand and RoCE devices,
it may be necessary to set `RDMAV_HUGEPAGES_SAFE=1` in the environment.
Failure to do so may result in a self-explanatory error message:
UCX ERROR ibv_reg_mr(address=0x7f49c5a02800, length=2048, access=0xf) failed:
Invalid argument. Application is using HUGE pages. Please set environment
variable RDMAV_HUGEPAGES_SAFE=1
The most up-to-date information on this bug is maintained at:
https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4474
+ See the GASNet Bugzilla server for details on other known bugs:
https://gasnet-bugs.lbl.gov/
@ Section: Design Overview @
The UCX conduit implements GASNet core and extended API for UCX-compatible
network adapters. Where:
(a) Core API includes conduit resource initialization and cleanup functionality
along with Active Message communication;
(b) Extended API includes one-sided and atomic operations support.
References:
+ UCX API documentation:
https://openucx.github.io/ucx/api/latest/html/index.html (most recent)
https://openucx.github.io/ucx/api/v1.6/html/index.html (oldest supported)
Resource allocation and initialization:
During GASNet initialization, every process creates a UCX worker, participates
in the exchange of worker addresses (with Allgather communication pattern), and
creates UCP endpoints representing connections to all other processes in the
GASNet job. In addition, a set of service buffers for Active Messages is
allocated and all required UCP memory registrations and memory key exchanges
are performed.
Active Message:
UCX conduit implementation is using pools of pre-allocated AM header buffers
that are used for all types of AMs. Sending of Requests and Replies each use
a distinct pool, while reception of AMs uses a third pool.
For AM Medium messages, the payload is always copied to the buffer
holding the header, to reduce the latency in absence of local completion
options ("lc_opt") support. This behavior will be reconsidered in the future.
For Long AMs payload transfer is implemented using RDMA Put operation.
RDMA/AMO:
Implementation of RDMA Put and Atomic operations is a thin layer on top of UCP
primitives. In addition, "ucp_ep_flush_nb" is used to implement remote completion
tracking.
UCX directly supports a fixed set of atomic operations, unsupported operations
are handled by generic GASNet implementation. The status of atomic operation
support is described in the table below:
I32/U32/I64/U64
--- ---------------
SET: Y
GET: Y
SWAP: Y
(F)CAS: Y
(F)ADD: Y
(F)SUB: Y
(F)INC: Y
(F)DEC: Y
(F)MULT: N
(F)MIN: N
(F)MAX: N
(F)AND: Y
(F)OR: Y
(F)XOR: Y
Y = offload is supported
N = offload is not supported
@ Section: Graceful exits @
Graceful exits are supported by the conduit. The design is based
on ibv-conduit description (see ibv-conduit README).
@ Section: Future work @
+ Implement non-trivial support for local completion options ("lc_opt")
+ Implement native support for Negotiated-Payload Active Message (NPAM)
+ Additional round of performance analysis and tuning.
+ Investigate older UCX versions and hardware compatibility and update README.
+ Support Graceful exits if rank 0 has failed.
+ Restore `--enable-segment-everything` mode:
- Fix robustness problems with AMLong in `--enable-segment-everything` mode.
- Apply ODP to optimize `--enable-segment-everything` mode.