GASNet udp-conduit documentation Dan Bonachea $Revision: 1.8 $ User Information: ----------------- Recognized environment variables: --------------------------------- * All the standard GASNet environment variables (see top-level README) * GASNET_NETWORKDEPTH - depth of network buffers to allocate Use larger values to increase bandwidth on longer-latency networks. Note that internal buffer space grows linearly with depth, and specifying a value which is too large can lead to increased packet loss and retransmission costs on OS's with small UDP buffer limits or hardware that drops packets in response to congestion (includes most Ethernet hardware). * GASNET_SPAWNFN - job spawn mechanism to use (see below) * GASNET_REQUESTTIMEOUT_MAX - max timeout in microseconds before a request is considered undeliverable (a fatal error). Defaults to 30000000 (30 seconds). Can be set to 0 for infinite timeout, which can be useful when freezing processes for long periods of time during debugging sessions. * GASNET_REQUESTTIMEOUT_INITIAL, GASNET_REQUESTTIMEOUT_BACKOFF, GASNET_EXPECTED_BANDWIDTH Advanced options for tweaking the retransmission algorithm (initial timeout in us, timeout backoff multiplier and expected min one-way bandwidth in KB/s). Most users should probably leave these alone. * GASNET_ROUTE_OUTPUT If non-zero, this option request AMUDP perform explicit forwarding of stdout/stderr streams from the workers to the console using TCP socket connections. If zero, no explicit forwarding is performed and this duty is left up to the underlying system and job spawner. The default is system-dependent. Optional compile-time settings: ------------------------------ * All the compile-time settings from extended-ref (see the extended-ref README) * There's a number of parameters in other/amudp/amudp_internal.h that can be tweaked to adjust the behavior of the reliability/retransmission algorithm (which may affect performance on a network with high UDP packet loss rates), but in most cases the default values should provide reasonable performance. Known problems: --------------- * See the Berkeley UPC Bugzilla server for details on known bugs. Future work: ------------ Choosing the udp-conduit spawn mechanism (GASNET_SPAWNFN): --------------------------------------------------------- udp-conduit is very a portable conduit, requiring only a UNIX-like system with a reasonable C/C++ compilier and a standard sockets-based TCP/IP stack (which includes UDP support). However, the one aspect that remains somewhat site-specific is the means for spawning the GASNet job across the UDP-connected worker nodes. Consequently, the primary user-visible configuration decision to be make when installing/using udp-conduit is the spawning mechanism to use for starting udp-conduit jobs. udp-conduit includes built-in support for several spawning mechanisms (including a very portable ssh-based spawner), and an extensibility option which allows you to plug in your own job spawning command, if desired. All udp-conduit jobs should be started from the console (or from wrapper scripts such as tcrun, upcrun) using a command such as: $ ./a.out [program args...] where the first argument is the number of GASNet nodes to spawn, and any subsequent arguments will be passed to the GASNet client as argc/argv upon return from gasnet_init(). The GASNET_SPAWNFN environment variable is used to tell udp-conduit which mechanism to use for spawning the worker node processes for the job, and may be one of the following values (some of which have additional related environment variables): * GASNET_SPAWNFN='L' (localhost spawn) Uses a standard UNIX fork/exec to spawn all the worker processes on the local machine, and UDP traffic between the nodes is sent over the localhost loopback interface (which usually bypasses network hardware, but not the kernel). Useful for debugging and testing, but probably not of interest for production jobs. * GASNET_SPAWNFN='S' (ssh/rsh-based spawn, the default spawner) Uses any command-line based ssh or rsh client to connect to worker nodes, which should be running an ssh/rsh daemon (ie sshd). Requires that users setup password-less authentication to the worker nodes (eg. using RSA public-key-based authentication and/or ssh-agent). Specifically, users need the ability to run a command such as: "ssh machinename echo hi" and have that command execute on the remote node without the need for typing any passwords. Finally, any network firewalls present must be configured such that the worker nodes have the ability to make TCP connections to the machine that executes the initial spawn command (used for bootstrapping) and such that the worker nodes can send UDP packets to each other (used to implement GASNet communication). See the ssh-agent tutorial here: http://upc.lbl.gov/docs/user/sshagent.shtml or the documentation for your ssh client/daemon for more information on setting up secure, password-less authentication for ssh. rsh-based spawning is also supported, although not generally recommended due to the inherent security flaws in rsh-based authentication (although it may still be appropriate on physically secure private networks). the ssh-based spawner also recognizes the following environment variables (the SSH_SERVERS value provides the names of the worker nodes running sshd and is required): option default description ---------------------------------------------------------------- SSH_CMD "ssh" ssh command to use. Can also be set to "rsh", or the name of any other remote shell spawner program/script with a similar interface. SSH_SERVERS none - must be provided space-delimited list of server names to pass to SSH_CMD, one per node, in order of priority (trailing extra server names ignored) may specify DNS names or IP addresses SSH_OPTIONS "" additional options to give SSH_CMD client SSH_REMOTE_PATH current working dir. the directory to use on remote machine must contain a copy of the udp-conduit a.out executable to be started (these variables may also optionally be prefixed with "AMUDP_" or "GASNET_") So for example, one could use the ssh spawner to start a job for an a.out executable linked against libgasnet-udp-*.a as follows (assuming a csh-like shell): $ ssh node0 echo ssh is working ssh is working $ setenv GASNET_SPAWNFN 'S' $ setenv GASNET_SSH_SERVERS 'node0 node1 node2 node3 node4 node5' $ ./a.out 3 arg1 arg2 arg3 Hello from node0 Hello from node1 Hello from node2 $ * GASNET_SPAWNFN='C' (custom spawner) The custom spawner allows the user and/or site installer to provide a custom command to be used for starting the worker processes across the worker nodes. This provides spawning extensiblity - the custom command can invoke any arbitrary site-specific spawning command (for example to call to an OS-provided spawner, a batch system, or a custom-written wrapper script that performs whatever actions are necessary to start the job). The only required environment variable setting is CSPAWN_CMD, which provides to command to be invoked for performing the spawn, upon which the following substitutions are performed: %N => number of worker nodes requested %C => the command that should be run once by each worker node participating in the job %D => the current working directory %% => % %M => optional list of servers taken from CSPAWN_SERVERS (the first nproc are passed) The custom command specified by CSPAWN_CMD is invoked exactly once at startup, and is responsible for starting all the %N remote worker processes and having them execute the command passed as %C, in a directory containing the a.out udp-conduit executable. The worker processes then use information passed within %C to connect to the master process on the spawning console and bootstrap the job. Note that any network firewalls present must be configured such that the worker nodes have the ability to make TCP connections to the machine that executes the initial spawn command (used for bootstrapping) and such that the worker nodes can send UDP packets to each other (used to implement GASNet communication). The custom spawner recognizes the following environment variables: option default description -------------------------------------------------- CSPAWN_CMD none the custom command to be called for spawning, after replacement of the patterns listed above the command must result in %N invocations of the %C command, once on each worker node CSPAWN_SERVERS none space-delimited list of servers to use - only required if %M is used in CSPAWN_CMD CSPAWN_ROUTE_OUTPUT set this variable to request built-in stdout/stderr routing from worker processes to the console, if your CSPAWN_CMD doesn't automatically provide that capability (these variables may also optionally be prefixed with "AMUDP_" or "GASNET_") So for example, one could use the custom spawner in conjunction with the Ganglia gexec command (http://www.theether.org/gexec/) to start a job for an a.out executable linked against libgasnet-udp-*.a as follows (assuming a csh-like shell): $ setenv GASNET_SPAWNFN 'C' $ setenv GASNET_CSPAWN_CMD 'gexec -n %N %C' $ ./a.out 3 arg1 arg2 arg3 Hello from node0 Hello from node1 Hello from node2 $ * GASNET_SPAWNFN='R' (Berkeley rexec spawner) Deprecated spawn mechanism for Berkeley rexec facility (http://www.theether.org/rexec/) * GASNET_SPAWNFN='G' (Berkeley GLUNIX spawner) Deprecated spawn mechanism for Berkeley GLUNIX (http://now.cs.berkeley.edu/) ============================================================================== Design Overview: ---------------- The core API implementation is a very thin wrapper around the AMUDP implementation by Dan Bonachea. See documentation in the other/amudp directory or the AMUDP webpage (http://www.cs.berkeley.edu/~bonachea/amudp) for details. The udp-conduit directly uses the extended-ref implementation for the extended API - see the extended-ref directory for details. After some discussions at SC03, I decided it was worthwhile to provide a GASNet conduit that runs as-natively-as-possible on gigabit ethernet. Given the hardware characteristics of ethernet (most significantly, store-and-forward routing), we don't ever expect this to be a very low-latency target, however Ethernet is ubiquitous and the bandwidths can be respectable.. and we can certainly hope to do better than GASNet-over-AMMPI-over-MPICH-over-TCP-over-Ethernet, which was previously our only choice on Ethernet. Implementing udp-conduit was basically just a matter of adapting my mpi-conduit-over-AMMPI glue to accomodate my AM-over-UDP implementation (AMUDP), which Titanium tests show can significantly outperform AMMPI-over-MPICH-over-TCP (primarily because AM is a natural fit for much smarter reliability protocols than TCP and the connectionless model of UDP is more scalable). The GASNet-level design of udp-conduit is very similar to mpi-conduit. It's a bare core API implementation using the reference extended API implementation. All handlers and network access are serialized on client threads. There's no concurrent handler execution or conduit threads. GASNet segment is mmapped but not specially registered in any way. Both aligned and unaligned segments are supported. Unlike AMMPI (which is limited to MPI-1.1's functionality), AMUDP provides a fully functional job management system (which runs on the spawning console), which means that udp-conduit provides very robust job exit behavior - it passes all testexit tests and should never leave zombie processes. Current limitations: * AMUDP is implemented in C++ (fairly straightforward C++ with no templates, should work on any C++ compiler with support for simple exception-handling - works on every one I've tried). The GASNet configure script checks for C++ support if and only if udp-conduit is enabled - platforms lacking a C++ compiler should configure with --disable-udp. * job spawning is a little messy because we have to deal with site-specific wrinkles, especially for acquiring node names. However, AMUDP offers a number of job spawning options (the most portable being ssh-based spawn), and it's successfully been deployed on a number of systems. There is also spawner support for Glunix, Rexec, Localhost/fork (for debugging), and a "custom command" option for extensibility. * AMUDP supports a maximum of 16384 nodes. However, the current buffering/reliability algorithm doesn't scale well beyond a few hundred nodes, as the per-node memory usage is currently 4 * NumNodes * NetworkDepth * MaxMedium). Also, the master process currently defaults to opening at least 4 TCP connections to each compute process (possibly 5 depending on the spawner), which at large scale may exceed file descriptor limitations for the master on some OS's. This can be reduced on some systems by setting AMUDP_ROUTE_OUTPUT=0, but this might result in stdout/stderr being lost, depending on the spawner. Users interested in large-scale applications are highly recommended to use the appropriate native GASNet conduit, which are generally better tuned for large scale.