Berkeley UPC User's Guide version 2022.10.0
This version of Berkeley UPC includes:
The three UPC specifications referenced above are also available for convenience as a combined document:
UPC Language and Library Specifications, Version 1.3
UPC Consortium, Lawrence Berkeley National Lab Tech Report LBNL-6623E, Nov 2013.
    #include <upc_relaxed.h>
    #include <stdio.h>

    int main() {
        printf("Hello from thread %i/%i\n", MYTHREAD, THREADS);
        upc_barrier;
        return 0;
    }

This program prints a message once from each thread (in some arbitrary interleaving), executes a barrier (optional), and exits.
For more involved examples of UPC code, see the UPC Language Tutorials from the UPC Language Community website (archived) and the 'upc-examples' directory of the Berkeley UPC runtime distribution. The Official UPC Specifications are a useful reference and contain a description of the standard libraries.
    upcc -o light particle.upc wave.c -lgrottymath

Note that 'wave.c' can contain either UPC code or regular C code, and the 'grottymath' library that is linked into the application can be a regular C library: Berkeley UPC is fully interoperable with regular C source, object, and library files (note: if you compile with the -pthreads flag, any C libraries you use must be thread-safe). Berkeley UPC 2.0 also adds support for linking C++/FORTRAN/MPI objects into a UPC executable: see Mixing C/C++/MPI/FORTRAN with UPC.
upcc recognizes most commonly used C compiler flags (-D, -I, etc.). It also uses a number of its own flags for the choice of network API your program will run over, for compiling your UPC code for a static number of threads, and other UPC-specific options. See the upcc man page for details.
Name | Description |
ibv | OpenFabrics (aka OpenIB) InfiniBand Verbs for InfiniBand networks |
aries | GNI API for Cray XC systems running CLE. |
ofi | OpenFabrics Interfaces API (aka libfabric) for multiple networks. This is the recommended network API for HPE Cray EX Slingshot and Intel Omni-Path networks only. |
udp | UDP: works on any system with a standard TCP/IP stack, but is typically slower than using one of the native network types. Generally the fastest option for systems with only Ethernet hardware (notably faster than MPI-over-TCP). |
mpi | MPI: works on any system with MPI installed, but is typically slower than using one of the other network types. |
smp | "Symmetric multiprocessor (SMP)" mode: uses no network. Currently runs with only a single process unless your runtime has been configured with --enable-pshm (currently default only on Linux). Otherwise, you must pass -pthreads to upcc to run smp-conduit with multiple UPC threads. |
ucx | Unified Communication X for Mellanox InfiniBand networks. NOTE: Experimental in this release, not auto-detected. |
Note that you can only compile for a given network type if your Berkeley UPC runtime was configured to support it at build/installation time. To see which APIs are supported in your installation, and to see which is used by default, use 'upcc --version'.
An executable compiled for a fixed number of UPC threads will fail at startup if you try to run it with a different number of threads. However, fixing the number of threads allows optimization on certain operations (such as shared pointer arithmetic), especially when the number of threads is a power of 2.
Name | Value | Description | Standard |
__UPC__ | 1 | Defined by any UPC implementation | UPC Language, first specified in v1.1.1 |
__UPC_COLLECTIVE__ | 1 | Defined by UPC implementations providing the UPC Collective Utilities library <upc_collective.h> | UPC Required Library, first specified in v1.2 |
__UPC_TICK__ | 1 | Defined by UPC implementations providing the UPC High-Performance Wall-Clock Timers library <upc_tick.h> | UPC Required Library, first specified in v1.3 |
__UPC_CASTABLE__ | 1 | Defined by UPC implementations providing the UPC Castability Functions library <upc_castable.h> | UPC Optional Library, first specified in v1.3 |
__UPC_IO__ | 1 | Defined by UPC implementations providing the UPC Parallel I/O library <upc_io.h> | UPC Optional Library, first specified in v1.2 |
__UPC_ATOMIC__ | 1 | Defined by UPC implementations providing the UPC Atomic Memory Operations library <upc_atomic.h> | UPC Optional Library, first specified in v1.3 |
__UPC_NB__ | 1 | Defined by UPC implementations providing the UPC Non-Blocking Transfer Operations library <upc_nb.h> | UPC Optional Library, first specified in v1.3 |
__UPC_VERSION__ | Monotonically increasing positive integer constant | UPC specification supported: value is YYYYMM date of that version's ratification (currently '201311L' for UPC 1.3) | UPC Language, first specified in v1.1.1 |
UPC_MAX_BLOCK_SIZE | A positive integer constant | Indicates the maximum value allowed in a layout qualifier for shared data. The actual value varies across configurations | UPC Language, first specified in v1.0 |
__UPC_DYNAMIC_THREADS__ | 1 if dynamic threads: else undefined | Set to 1 unless the '-T' flag was passed to upcc | UPC Language, first specified in v1.1.1 |
__UPC_STATIC_THREADS__ | 1 if static threads: else undefined | Set to 1 if the '-T' flag was passed to upcc | UPC Language, first specified in v1.1.1 |
THREADS | A compile-time integer constant representing the static thread count: else undefined | Set to the static thread count if the '-T' flag was passed to upcc. (Under dynamic threads, THREADS is a keyword that expands to the thread count determined at program launch.) | UPC Language |
__UPC_PUPC__ | 1 | Defined by UPC implementations supporting the GASP interface | GASP 1.5 specification |
__BERKELEY_UPC__ | Monotonically increasing positive integer constant | The major version number of the Berkeley UPC release. Example: '1' for release '1.0.3'. | Berkeley UPC only |
__BERKELEY_UPC_MINOR__ | An integer constant | The minor version number of the Berkeley UPC release. Example: '0' for release '1.0.3'. | Berkeley UPC only |
__BERKELEY_UPC_PATCHLEVEL__ | An integer constant | The patch version number of the Berkeley UPC release. Example: '3' for release '1.0.3'. | Berkeley UPC only |
__BERKELEY_UPC_<NETWORK>_CONDUIT__ | 1, or undefined | Identifies the network API used. Example: if 'upcc -network=mpi' is used, '__BERKELEY_UPC_MPI_CONDUIT__' will be defined to '1'. | Berkeley UPC Runtime[1] |
__BERKELEY_UPC_PSHM__ | 1, or undefined | Defined to 1 if and only if PSHM support is enabled | Berkeley UPC Runtime[1] |
__BERKELEY_UPC_PTHREADS__ | 1, or undefined | Defined to 1 if and only if the '-pthreads' flag is used | Berkeley UPC Runtime[1] |
__BERKELEY_UPC_RUNTIME__ | 1, or undefined | Defined to 1 if and only if the Berkeley UPC runtime is used | Berkeley UPC Runtime[1] |
__BERKELEY_UPC_RUNTIME_DEBUG__ | 1, or undefined | Defined to 1 if and only if a debugging runtime is used (i.e. '-g' passed to upcc). | Berkeley UPC Runtime[1] |
__BERKELEY_UPC_RUNTIME_RELEASE__ | An integer constant, or undefined | The major version number of the Berkeley UPC Runtime library. Example: '2' for release '2.12.0'. | Berkeley UPC Runtime[1], release 2.12 and newer |
__BERKELEY_UPC_RUNTIME_RELEASE_MINOR__ | An integer constant, or undefined | The minor version number of the Berkeley UPC Runtime library. Example: '12' for release '2.12.0'. | Berkeley UPC Runtime[1], release 2.12 and newer |
__BERKELEY_UPC_RUNTIME_RELEASE_PATCHLEVEL__ | An integer constant, or undefined | The patch version number of the Berkeley UPC Runtime library. Example: '0' for release '2.12.0'. | Berkeley UPC Runtime[1], release 2.12 and newer |
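The macros in the table above are normally tested with the preprocessor. The following is a minimal sketch of one way to do that. The helper names 'threads_mode', 'runtime_at_least' and 'report_upc_features' are invented for this illustration and are not part of Berkeley UPC. Every test is guarded, so the file should also build with an ordinary C compiler, where none of the macros are defined.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: names the threads mode chosen at compile time. */
const char *threads_mode(void) {
#if defined(__UPC_STATIC_THREADS__)
    return "static";    /* '-T' was passed to upcc */
#elif defined(__UPC_DYNAMIC_THREADS__)
    return "dynamic";
#else
    return "none";      /* not compiled as UPC */
#endif
}

/* Hypothetical helper: 1 if the Berkeley UPC runtime is at least
 * release major.minor, 0 otherwise (including non-Berkeley compilers). */
int runtime_at_least(int major, int minor) {
    assert(major >= 0 && minor >= 0);
#if defined(__BERKELEY_UPC_RUNTIME_RELEASE__) && \
    defined(__BERKELEY_UPC_RUNTIME_RELEASE_MINOR__)
    if (__BERKELEY_UPC_RUNTIME_RELEASE__ != major)
        return __BERKELEY_UPC_RUNTIME_RELEASE__ > major;
    return __BERKELEY_UPC_RUNTIME_RELEASE_MINOR__ >= minor;
#else
    return 0;
#endif
}

/* Prints one line per detected feature. */
void report_upc_features(FILE *out) {
    fprintf(out, "threads mode: %s\n", threads_mode());
#ifdef __UPC_VERSION__
    fprintf(out, "UPC spec version: %ld\n", (long)__UPC_VERSION__);
#endif
#ifdef __UPC_NB__
    fprintf(out, "upc_nb.h is available\n");
#endif
#ifdef __UPC_ATOMIC__
    fprintf(out, "upc_atomic.h is available\n");
#endif
#ifdef __BERKELEY_UPC_PTHREADS__
    fprintf(out, "built with -pthreads\n");
#endif
}
```

Calling 'report_upc_features(stdout)' from one thread at startup is a cheap way to record how an executable was built.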
A remote translator can be contacted via either the HTTP or SSH protocols. To use HTTP, the 'upcc.cgi' CGI script (located in the 'contrib' directory of the runtime distribution) must be installed and configured with a web server on the remote host. Simply set the 'translator' parameter in your user configuration file (or the global 'upcc.conf') to the URL for the CGI script. To use SSH, you must be able to log in to the remote host using SSH, and the 'translator' parameter must be set to 'remote_host:/path/to/translator'. You will want to use key-based authentication and 'ssh-agent' to avoid entering your password each time you compile. See our SSH Agent Tutorial.
When using an HTTP-based remote translator, upcc also includes support for use of an HTTP proxy. Set the 'http_proxy' parameter in your user configuration file (or the global 'upcc.conf') to the proxy URL. The upcc front end does not currently support HTTPS or SOCKS proxies, nor HTTP proxies that require authentication (HTTP error 407).
If you wish to create a reusable set of compiled code, you must currently keep the files in *.o format. So, instead of the traditional C format, where you'd create 'libmyupc.a', and then link with something like

    upcc myprogram.o -L/libpath -lmyupc

you must instead do something like

    upcc myprogram.o /libpath/libmyupc/*.o

Note that beginning with Berkeley UPC 2.12.0 it is possible to link together static threads and dynamic threads objects, with the result being a static threads executable. In many cases this allows use of a dynamic threads object in the role of a library, which can be linked to an executable with any dynamic or static thread setting.
Berkeley UPC executables should be run the same way as any other parallel program on your system that uses the same underlying network API. So, for instance, a program compiled with '--network=mpi' is run on many systems via 'mpirun -np <number of processes> a.out'. Other systems may use other invocations, such as 'prun' or 'poe', especially when APIs other than MPI are used. Consult your system's documentation for details.
    upcrun -n 4 parboil

This example runs the UPC executable 'parboil' with 4 UPC threads. The default layout of those threads on the physical hardware is system-dependent, but there are upcrun options to further control job layout.
An additional benefit of using upcrun is that it provides consistent support for propagating environment variables to all threads of your UPC program. If you use upcrun, any environment variable beginning with either 'UPC_' or 'GASNET_' is guaranteed to be propagated to all threads. (Support for propagating all environment variables is planned). If you do not use upcrun, environment propagation will only work to the extent that the parallel job launcher you use provides it normally.
You can see how upcrun thinks your job should be run without actually running it by passing the upcrun '-t' flag. Also, 'upcrun -i <executable>' will provide information about a Berkeley UPC executable, such as the network API that it was built against, and the number of fixed threads (if any) that it was compiled for.
See 'upcrun --help' or the upcrun man page for more information.
The default amount of shared memory to reserve per UPC thread on a system is chosen at configure time (see INSTALL.TXT for details), but you can override that value for a particular application at either compile time or at job startup. Generally this is only needed if you observe that your application is running out of either shared or regular C memory.
To embed a different default amount of shared memory into your application, simply pass 'upcc -shared-heap=144MB' for instance (to get 144 megabytes per UPC thread). You can also use 'GB' for gigabyte amounts (if neither 'MB' nor 'GB' is used, megabytes are assumed). To override the embedded default amount of shared memory at application startup, set the UPC_SHARED_HEAP_SIZE environment variable to whatever value you want ('2GB', etc.), or pass '-shared-heap' to upcrun.
While it is tempting to simply grab an extremely large shared memory segment, be aware that this is not always a good idea, or even possible. Since the shared address space range cannot be used for regular malloc allocations, creating too large a shared space can cause the amount of regular heap memory available to your application to become small (causing malloc to eventually return NULL when you request more memory). Also, the shared memory space is reserved via an mmap() call, and while this does not generally cause any physical memory pages to be allocated, certain operating systems (for instance, Linux) will not allow applications to reserve more memory than the OS can guarantee is available, and so allocating a shared region larger than the physical memory (plus swap space) may fail.
The default amount of shared memory per UPC thread can be changed system-wide by modifying the 'shared_heap' parameter in the installation's upcc.conf file. You can override the system-wide default for your own applications by setting shared_heap in your user configuration file.
The '-pthreads' flag must be passed consistently at all stages of compilation and linking. Also, when pthreads are used, upcc needs to delay much of the compilation of your code until link time, so if you split code generation into separate compilation and linking steps (i.e., 'upcc -c foo.upc', followed by 'upcc foo.o bar.o'), you need to pass any macro and/or include path directives (ex: '-DFOO=bar -I/usr/local/include') to upcc for both the compilation and link commands.
Any C libraries that your code links against must be thread-safe in order to be used with -pthreads. If one or more of your libraries is not thread-safe, you must compile without pthreads, and run separate processes on the same node to exploit an SMP system. In the non-pthreads case support for shared memory communication among UPC processes on an SMP node is available on many systems via the "PSHM" feature (See "INTRA-NODE SHARED MEMORY SUPPORT" in INSTALL.TXT).
When you link an application with '-pthreads', a subdirectory named <executable_name>_pthread-link will be created in the current directory. This directory exists in order to speed up further linking commands of the same program. If you link the same application again with the same object file names, and none of the global static unshared variables in your program have changed name or size, recompilation of all the files in your application can be avoided, which can make a significant difference in build time for programs with many source files. You may delete the temporary directory at any time without any side effects (other than possibly longer link times). One can prevent this optimization with the -nolink-cache flag to upcc.
Unless otherwise specified, pthreaded UPC applications use a default number of pthreads per process (run 'upcc --version' to see the default for your system). This number is set in the upcc.conf configuration file, and can be changed there (or in your user configuration file). It can also be overridden in several ways. Compiling with 'upcc -pthreads=<NUMBER>' changes the default number of pthreads per UPC process for an executable to NUMBER. If the 'UPC_PTHREADS_PER_PROC' environment variable is set to a nonzero integer when you run a UPC program, it will override any default value. Finally, upcrun is smart about pthreads in several ways. First, if you run a pthreaded parallel job with 'upcrun -n <NUMBER> ...', the number of processes actually launched will be divided by the number of pthreads, so that exactly NUMBER UPC threads are used. Second, if the PSHM feature is disabled and you use -network=smp (generating an executable that will run only a single process), 'upcrun -n NUMBER' will automatically set the number of pthreads to NUMBER.
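The division described above can be written out as a small calculation. This is only a sketch of the arithmetic, not upcrun's actual code: the function name 'processes_for' is invented here, and the text does not say what happens when the thread count is not a multiple of the pthreads-per-process setting, so the sketch simply reports that case as 0.

```c
#include <assert.h>

/* Illustrative only: the number of processes for which
 * processes * pthreads_per_proc == upc_threads.
 * Returns 0 when the division is not exact (a case the guide does not describe). */
int processes_for(int upc_threads, int pthreads_per_proc) {
    assert(upc_threads > 0 && pthreads_per_proc > 0);
    if (upc_threads % pthreads_per_proc != 0)
        return 0;
    return upc_threads / pthreads_per_proc;
}
```

For instance, 8 UPC threads with 4 pthreads per process corresponds to 2 processes.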
You can use a regular C debugger and get usable debugging support. Berkeley UPC provides several mechanisms for attaching a regular C debugger to one or more of your UPC application's threads at various points during execution. While this does not provide a fully normal debugging environment (the debugger will show the C code emitted by our translator, rather than your UPC code), it can still allow you to see program stack traces and other important information. This can be very useful if you wish to submit a helpful bug report to us. See Attaching a regular C debugger to Berkeley UPC programs for details.
Berkeley UPC also supports automatically generating backtraces if a fatal error occurs in your program. This will allow you to see a stack trace of the function calls that your program was in at the time it crashed. To use auto-backtracing, run with upcrun -backtrace or set GASNET_BACKTRACE=1 in your environment. The level of backtracing support available depends on the back-end C compiler and operating system, and so not all systems are equally functional, and some systems will not provide backtraces. See gasnet/README for more information on backtracing.
Examining tracing information is one of the best ways to go about optimizing your UPC program. It provides a way for you to see which lines of your code are generating the most network traffic (and the size of the network messages used). From this you may be able to determine how to either avoid some of this traffic, or change your code to use fewer, larger messages (for instance, by replacing sets of individual reads/writes with bulk memory movement calls like 'upc_memget()', etc.), which is typically more efficient. Examining barrier wait times can also let you know if your computations are imbalanced across threads, and/or if you could profit by using split-phase barriers, moving computation in between 'upc_notify' and 'upc_wait'.
Note that running with tracing may slow down your application considerably: the exact amount depends on your filesystem, and the ratio of communication/computation in your program. If you are only interested in a subset of trace information, consider setting 'GASNET_TRACEMASK' as described below.
ID | Feature |
G | Network 'gets'. These include both bulk gets (from upc_memget, etc.), and network get operations caused by reading shared memory via shared variables/pointers. The 'G' mask does not include 'local' gets (i.e. reads from shared memory which has affinity to the reading UPC thread), as these do not result in network traffic. Use 'H' to trace local gets. |
P | Network 'puts'. These include both bulk puts (from upc_memput, etc.), and put operations caused by writing to shared memory via variables/pointers. The 'P' mask does not include 'local' puts (i.e. writes to shared memory which has affinity to the writing UPC thread), as these do not result in network traffic. Use 'H' to trace local puts. |
B | Barriers, including both blocking (upc_barrier) and non-blocking (upc_notify followed by upc_wait: a pair of these count as a single barrier). |
N | Line number information from UPC source files. The "N" and "H" flags must always be among those set for upc_trace to work! |
H | Miscellaneous UPC information. The "N" and "H" flags must always be among those set for upc_trace to work! Passing this flag causes the following things to be traced: |
To trace only a subset of these features, set the 'GASNET_TRACEMASK' environment variable to a string containing the IDs of the features you wish to trace. Note that the "N" and "H" flags must always be among those set for 'upc_trace' to work (if you intend to examine the trace file manually, they do not need to be set).
So, for instance, if you are trying to perform an analysis that does not require get/put information, you are highly advised to set 'GASNET_TRACEMASK' to "BHN" and 'GASNET_TRACELOCAL' to "no" (or "0"). This will turn off tracing for all get and put operations. Since gets/puts are typically the majority of items in a full trace file, this will probably result in much faster program execution, a much smaller trace file, and faster analysis by 'upc_trace'.
    extern void bupc_trace_setmask (const char *newmask);
    extern const char * bupc_trace_getmask (void);
    extern int bupc_trace_gettracelocal (void);
    extern void bupc_trace_settracelocal (int val);
    void bupc_trace_printf ((const char *msg, ...));

'bupc_trace_getmask' and 'bupc_trace_setmask' allow programmatic retrieval and modification of the trace masks in effect for the calling thread. The initial value is determined by the 'GASNET_TRACEMASK' environment variable, and the input and output to the mask manipulation functions have the same format as 'GASNET_TRACEMASK' values. Note that whenever any tracing is enabled (i.e. unless you are temporarily turning off tracing by passing an empty string), the "N" and "H" flags must always be among those set for 'upc_trace' to work.
Ex:

    bupc_trace_setmask("PGHN");   // trace puts and gets (plus the required H and N)
    bupc_trace_settracelocal(1);  // include local puts and gets
    // do something...
    bupc_trace_setmask("");       // stop tracing
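The rule that "N" and "H" must accompany any non-empty mask is easy to get wrong when masks are built at run time. A minimal sketch of a check that could run before calling 'bupc_trace_setmask' follows. The function name 'mask_ok_for_upc_trace' is invented for this illustration, and the sketch assumes mask IDs are written in upper case, as in the table above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical check: returns 1 if `mask` is usable with upc_trace, meaning
 * it is empty (tracing switched off) or contains both 'N' and 'H'. */
int mask_ok_for_upc_trace(const char *mask) {
    assert(mask != NULL);
    if (mask[0] == '\0')
        return 1;
    return strchr(mask, 'N') != NULL && strchr(mask, 'H') != NULL;
}
```

A mask such as "PG" fails the check, while "BHN" and the empty string pass it. Masks intended only for manual inspection of the trace file do not need to satisfy it.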
The 'bupc_trace_printf' utility outputs a message into the trace file, if it exists. Note that two sets of parentheses are required when invoking this operation, in order to allow it to compile away completely for non-tracing builds.
Ex:

    double A[4] = ...;
    int i = ...;
    bupc_trace_printf(("the value of A[%i] is: %f", i, A[i]));
To generate statistics, simply set the 'GASNET_STATSFILE' environment variable to a file name, into which statistics will be written at the end of your program's run. (Note: by default, only debug executables support statistics generation, as it incurs a performance penalty: if you wish to have non-debug UPC executables generate statistics, you must rebuild your UPC runtime system, passing '--with-multiconf=+opt_trace' to configure, then build your application with 'upcc -trace'.) You may generate both stats and tracing info for the same program run if you wish.
Just as with tracing, you may set a mask to control what types of events are included in the statistics, by setting the 'GASNET_STATSMASK' environment variable, and/or by calling the following functions:
    extern void bupc_stats_setmask (const char *newmask);
    extern const char * bupc_stats_getmask (void);

The same mask IDs are used by the tracing and statistics masks, i.e., calling 'bupc_stats_setmask("BP")' would cause execution to gather statistics only for barriers and puts. See the table in the tracing documentation for the list of IDs.
    upcc -pg foo.c
    upcrun -n 2 a.out
    gprof a.out gmon.out.0/gmon.out gmon.out.1/gmon.out | less

Note that 'gprof' provides timings and statistics for processor usage: it does not include time during which the process has been put to sleep waiting for I/O (including network reads/writes). However, since Berkeley UPC uses spin-locks in many cases to wait for network events, rather than blocking system calls, you may see that certain 'gasnet*' functions consume large amounts of CPU time. This generally means that your program is spending most of that time waiting for network communication to complete (some fraction is the software overhead inherent in sending/receiving the network traffic).

If your program spends a lot of time waiting for network operations to complete, you may be suffering from an imbalanced load across threads (so that some take longer to "catch up" to a barrier, for instance). Restructuring your application may avoid these waiting periods. Or you may be able to use some of this "spare" time for computation (or other network traffic) by switching to use non-blocking barriers (i.e., 'upc_notify/upc_wait'), and/or our non-contiguous memcpy extensions to UPC. Replace blocking network constructs (such as 'upc_barrier', 'upc_memcpy', and read/writes to shared variables) with non-blocking equivalents, and insert unrelated computation (and/or network traffic) in between the initialization and completion calls. Of course, you must be able to find unrelated computation/communication for this to work, and the degree to which this is possible will depend on your application.
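As a rough illustration of the split-phase barrier mentioned above, the sketch below places purely local work between 'upc_notify' and 'upc_wait'. The function name 'sum_during_barrier' is invented here. The '#else' branch defines serial stand-ins, which are an assumption of this sketch rather than anything Berkeley UPC provides, so that the file also builds with a plain C compiler.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __UPC__
#include <upc.h>
#else
/* Serial stand-ins (assumption) so the sketch builds without a UPC compiler. */
#define upc_notify ((void)0)
#define upc_wait   ((void)0)
#endif

/* Sums a private array while the split-phase barrier is in flight.
 * Must be called by every UPC thread, like any barrier. */
double sum_during_barrier(const double *local, size_t n) {
    double acc = 0.0;
    size_t i;
    assert(local != NULL || n == 0);
    upc_notify;                 /* signal arrival; does not block */
    for (i = 0; i < n; i++)
        acc += local[i];        /* local work that needs no other thread */
    upc_wait;                   /* block until every thread has notified */
    return acc;
}
```

Whether this helps depends on having enough independent local work to fill the time other threads need to reach the barrier.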
NOTICE: A large portion of this Berkeley-specific extension is now officially deprecated in favor of the standardized version adopted into the official UPC specification. See below for details.
As of 2.0, Berkeley UPC fully implements a set of non-blocking extensions to the 'upc_memcpy()' function for contiguous data. These extensions allow you to explicitly overlap memcpy-like functions with computation (and/or with other memcpy calls). The full interface is described in sections 2 through 4 of our Proposal for Extending the UPC Memory Copy Library Functions. See that document for details on the functions and their usage.
NOTICE: The following interfaces have been adopted in the UPC Optional Library Specifications, Version 1.3, with semantics compatible with a subset of those given in the document referenced above, and are available in Berkeley UPC beginning with the 2.18 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. The corresponding portion of the Berkeley-specific non-blocking interfaces (specifically, those operating on contiguous data) is now officially deprecated, and the 'bupc_' prefixed equivalents of those functions will be removed in a future version.
    #define __UPC_NB__ 1    // predefined feature macro
    #include <upc_nb.h>     // defines the following:

    // Explicit-handle non-blocking operations and synchronization:
    typedef ... upc_handle_t;
    #define UPC_COMPLETE_HANDLE ...
    upc_handle_t upc_memcpy_nb(shared void * restrict dst, shared const void * restrict src, size_t n);
    upc_handle_t upc_memget_nb(void * restrict dst, shared const void * restrict src, size_t n);
    upc_handle_t upc_memput_nb(shared void * restrict dst, const void * restrict src, size_t n);
    upc_handle_t upc_memset_nb(shared void *dst, int c, size_t n);
    int upc_sync_attempt(upc_handle_t handle);
    void upc_sync(upc_handle_t handle);

    // Implicit-handle non-blocking operations and synchronization:
    void upc_memcpy_nbi(shared void * restrict dst, shared const void * restrict src, size_t n);
    void upc_memget_nbi(void * restrict dst, shared const void * restrict src, size_t n);
    void upc_memput_nbi(shared void * restrict dst, const void * restrict src, size_t n);
    void upc_memset_nbi(shared void *dst, int c, size_t n);
    int upc_synci_attempt(void);
    void upc_synci(void);
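A minimal sketch of the explicit-handle pattern, using only 'upc_memget_nb' and 'upc_sync' from the listing above: start the transfer, do unrelated work, then synchronize before touching the destination buffer. The function name 'fetch_and_sum' is invented for this illustration. The '#else' branch supplies serial stand-ins (an assumption of this sketch) so the code also builds where '__UPC_NB__' is not defined. As with 'upc_memget', the source bytes are assumed to have affinity to a single thread.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __UPC_NB__
#include <upc.h>
#include <upc_nb.h>
#else
/* Serial stand-ins (assumption): the copy completes immediately. */
#include <string.h>
#define shared
typedef int upc_handle_t;
static upc_handle_t upc_memget_nb(void *dst, const void *src, size_t n) {
    memcpy(dst, src, n);
    return 0;
}
static void upc_sync(upc_handle_t handle) { (void)handle; }
#endif

/* Fetches n doubles from `remote` into `scratch` while summing `local`,
 * then adds the fetched values. Returns the combined sum. */
double fetch_and_sum(shared const void *remote, double *scratch, size_t n,
                     const double *local, size_t m) {
    double acc = 0.0;
    size_t i;
    upc_handle_t h;
    assert(scratch != NULL && (local != NULL || m == 0));
    h = upc_memget_nb(scratch, remote, n * sizeof(double)); /* start the get */
    for (i = 0; i < m; i++)
        acc += local[i];      /* independent work overlaps the transfer */
    upc_sync(h);              /* scratch may be read only after this returns */
    for (i = 0; i < n; i++)
        acc += scratch[i];
    return acc;
}
```

The key constraint is that 'scratch' must not be read or written between the '_nb' call and 'upc_sync'.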
NOTE: The types upc_handle_t and bupc_handle_t are interchangeable. One may freely mix the standard library calls in upc_nb.h with the Berkeley-specific interfaces for contiguous non-blocking memcpy.
The full interface is described in sections 5 and 6 of our Proposal for Extending the UPC Memory Copy Library Functions. See that document for details on the functions and their usage.
NOTE: The types upc_handle_t and bupc_handle_t are interchangeable. One may freely mix the standard library calls in upc_nb.h with the Berkeley-specific interfaces for non-contiguous memcpy.
The full interface is described in our Proposal for Extending the UPC Libraries with Explicit Point-to-Point Synchronization Support. See that document for details on the functions and their usage.
    int bupc_dump_shared(shared const void *ptr, char *buf, int maxlen);

Any pointer-to-shared may be passed to this function. The 'maxlen' parameter gives the length of the buffer pointed to by 'buf', and this length must be at least BUPC_DUMP_MIN_LENGTH, or else -1 is returned and errno is set to EINVAL. On success, the function returns 0. The buffer will contain either "<NULL>" if the pointer-to-shared == NULL, or a string of the form

    "<address=0x1234 (addrfield=0x1234), thread=4, phase=1>"

The 'address' field provides the virtual address for the pointer, while the 'addrfield' shows the actual contents of the pointer-to-shared address bits (as returned by upc_addrfield). On some configurations these values may be the same (if the full address of the pointer can be fit into the address bits), while on others they may be quite different (if the address bits store an offset from a base initial address that may differ from thread to thread).
Both bupc_dump_shared() and BUPC_DUMP_MIN_LENGTH are visible when any of the standard UPC headers (upc.h, upc_relaxed.h, or upc_strict.h) are #included.
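If the dumped text is collected in logs, its fields can be recovered with 'sscanf'. The sketch below is illustrative only: the struct and function names are invented here, and it assumes the output always matches the single example format shown above, with hexadecimal values that fit in an 'unsigned long'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for the fields printed by bupc_dump_shared(). */
struct dumped_ptr {
    unsigned long address;
    unsigned long addrfield;
    int thread;
    int phase;
};

/* Returns 1 and fills *out for a non-NULL dump, 0 for "<NULL>",
 * and -1 if the text does not match the documented form. */
int parse_dumped_ptr(const char *text, struct dumped_ptr *out) {
    assert(text != NULL && out != NULL);
    if (strcmp(text, "<NULL>") == 0)
        return 0;
    if (sscanf(text, "<address=%lx (addrfield=%lx), thread=%d, phase=%d>",
               &out->address, &out->addrfield,
               &out->thread, &out->phase) == 4)
        return 1;
    return -1;
}
```

Parsing the example string from this section yields thread 4 and phase 1.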
The 'bupc_ptradd()' function provides support for performing pointer-to-shared arithmetic with variable blocksize, which need not be a compile-time constant.
    shared void * bupc_ptradd(shared void *p, size_t blockelems, size_t elemsz, ptrdiff_t elemincr);

- 'p': the base pointer
- 'blockelems': the block size (number of elements in a block)
- 'elemsz': the element size (usually sizeof(*p))
- 'elemincr': the positive or negative offset from the base pointer
The following call:

    bupc_ptradd(p, blockelems, sizeof(T), elemincr);

returns a value q as if it had been computed:

    shared [blockelems] T *q = p;
    q += elemincr;

However, the blockelems argument is not required to be a compile-time constant. Blockelems must be non-negative, but may be zero to indicate an indefinite blocking factor. Here's an example of indexing into a dynamically-allocated array whose block size is not known until run time.
    int blockelems = ...;  // choose some arbitrary block size
    // allocate an array of doubles with that blocksize
    shared void *myarr = upc_all_alloc(..., blockelems*sizeof(double));
    // access element 14
    double d = *(shared double *)bupc_ptradd(myarr, blockelems, sizeof(double), 14);
It's worth noting that in some cases bupc_ptradd() may be less efficient than regular pointer-to-shared addition, because the compile-time constant blocksize of the pointer referent type generally makes the latter more amenable to compiler optimization of the addition operation and surrounding code. This is especially true in the case of indefinitely-blocked or cyclically-blocked pointers-to-shared. However, the potential cost may be worth the added convenience in non-performance-critical code.
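To make the effect of the block size concrete, the following plain C sketch models where an element lands under UPC's block-cyclic layout. It is a model of the layout rule only, not the implementation of 'bupc_ptradd()'. The struct and function names are invented here, the array is assumed to start at thread 0 with phase 0, and only non-negative indices are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record describing where one element lives. */
struct elem_location {
    size_t thread;       /* UPC thread with affinity to the element */
    size_t phase;        /* position of the element inside its block */
    size_t local_index;  /* element index within that thread's portion */
};

/* Locates element `index` of an array distributed over `threads` threads in
 * blocks of `blockelems` elements. blockelems == 0 models an indefinite
 * block size, where every element stays on the starting thread. */
struct elem_location locate_elem(size_t index, size_t blockelems, size_t threads) {
    struct elem_location loc;
    size_t block;
    assert(threads > 0);
    if (blockelems == 0) {
        loc.thread = 0;
        loc.phase = 0;
        loc.local_index = index;
        return loc;
    }
    block = index / blockelems;           /* which block, counted globally */
    loc.thread = block % threads;         /* blocks are dealt out round-robin */
    loc.phase = index % blockelems;
    loc.local_index = (block / threads) * blockelems + loc.phase;
    return loc;
}
```

With a block size of 4 on 4 threads, element 14 sits in the fourth block, so it has affinity to thread 3 at phase 2.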
You will normally not need to call this function, as the runtime will automagically perform checks for incoming network requests whenever your UPC code causes network activity to be performed, and this usually occurs fairly frequently in a UPC application. However, if you are writing your own 'spin lock' style synchronization, you may need to use this function to avoid deadlock. Here is an example:
    shared strict int flag[THREADS];
    ...
    if (MYTHREAD % 2) {
        while (flag[MYTHREAD] == 0)
            bupc_poll();
    } else {
        ... some calculation ...
        flag[MYTHREAD + 1] = 1;
    }

Here the 'even' UPC threads are performing some calculation, then informing the 'odd' threads that the result is ready by setting a per-thread flag. If the 'bupc_poll()' were omitted, the 'odd' threads might (on certain platforms/networks) consume all of the CPU forever in the 'while' test, never checking for the incoming network message that would set flag[MYTHREAD].
If a program contains computationally intensive sections in which no remote accesses are performed for a long time, it is also possible that performance may be improved by intermittently calling bupc_poll, particularly if other threads are likely to be performing communication (e.g. remote accesses, lock synchronization, shared memory allocation, etc.) during this time.
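A sketch of the intermittent-polling idea for a long compute loop follows. The function name 'sum_squares_polling' and the choice of polling interval are invented for this illustration, and the sketch assumes that 'bupc_poll()' is declared by the standard 'upc.h' header under Berkeley UPC. A no-op stand-in is defined elsewhere so the file also builds with a plain C compiler.

```c
#include <assert.h>

#ifdef __BERKELEY_UPC__
#include <upc.h>   /* assumed to declare bupc_poll() */
#else
/* Stand-in (assumption) so the sketch builds without Berkeley UPC. */
#define bupc_poll() ((void)0)
#endif

/* Sums i*i for i in [0, n), giving the runtime a chance to service
 * incoming network requests once every poll_interval iterations. */
double sum_squares_polling(long n, long poll_interval) {
    double acc = 0.0;
    long i;
    assert(n >= 0 && poll_interval > 0);
    for (i = 0; i < n; i++) {
        acc += (double)i * (double)i;
        if ((i + 1) % poll_interval == 0)
            bupc_poll();
    }
    return acc;
}
```

A suitable interval is application-specific: polling too often adds overhead to the loop, while polling too rarely delays other threads' requests.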
If 'expr' has a static type which is identical to 'type', does nothing. Otherwise, prints a non-fatal warning containing the line number and a description of the two differing types.
NOTICE: This Berkeley-specific extension is now officially deprecated in favor of the standardized version adopted into the official UPC specification. See the end of this section for details.
    typedef ... bupc_tick_t; /* 64-bit integral type */
    #define BUPC_TICK_MAX ...
    #define BUPC_TICK_MIN ...
    bupc_tick_t bupc_ticks_now (void);
    uint64_t bupc_ticks_to_us (bupc_tick_t ticks);
    uint64_t bupc_ticks_to_ns (bupc_tick_t ticks);
    double bupc_ticks_granularityus (void);
    double bupc_ticks_overheadus (void);

The 'bupc_tick_t' type and associated functions provide portable support for querying high-precision system timers for obtaining wall-clock timings of sections of code. Most CPU hardware offers access to high-performance timers with a handful of instructions, providing timer precision and overhead that can be several orders of magnitude better than can be obtained through the use of the gettimeofday() system call.
The 'bupc_tick_t' type represents an integral quantity of abstract timer ticks, whose ratio to real time is system-dependent and thread-dependent. bupc_ticks_now() returns the current value of the tick timer for the calling thread, using the fastest mechanism available. bupc_ticks_to_us() and bupc_ticks_to_ns() convert a difference in bupc_tick_t values obtained by the calling thread into microseconds or nanoseconds, respectively. The bupc_ticks_to_{us,ns}() conversion calls can be significantly more expensive than the bupc_ticks_now() tick query, so for timing short intervals it's recommended to keep timing results in units of ticks until final output. BUPC_TICK_MAX and BUPC_TICK_MIN provide tick values which are respectively larger and smaller than any possible tick value. bupc_ticks_granularityus() and bupc_ticks_overheadus() respectively report the estimated microsecond granularity (minimum time between distinct ticks) and microsecond overhead (time it takes to read a single tick value, not including conversion) for the timer facility.
Example:

    bupc_tick_t start = bupc_ticks_now();
    compute_foo(); /* do something that needs to be timed */
    bupc_tick_t end = bupc_ticks_now();
    printf("Time was: %d microseconds\n", (int)bupc_ticks_to_us(end-start));
    printf("Timer granularity: <= %.3f us, overhead: ~ %.3f us\n",
           bupc_tick_granularityus(), bupc_tick_overheadus());
    printf("Estimated error: +- %.3f %%\n",
           100.0*(bupc_tick_granularityus()+bupc_tick_overheadus()) / bupc_ticks_to_us(end-start));

It's important to keep in mind that raw bupc_tick_t values are thread-specific quantities with a thread-specific interpretation (e.g. they might represent a hardware cycle count on a particular CPU, starting at some arbitrary time in the past). More specifically, raw ticks do NOT provide a globally-synchronized timer (i.e. the simultaneous absolute tick values may differ across threads), and furthermore the tick-to-wallclock conversion ratio might also differ across threads (e.g. on a cluster with heterogeneous CPU clock rates, the raw tick values may advance at different rates for different threads). Therefore, as a rule of thumb, raw bupc_tick_t values and bupc_tick_t intervals obtained by different threads should never be directly compared or arithmetically combined without first converting the relevant tick intervals to wall time intervals.
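The estimated-error expression in the example above can be factored into a small helper. The following is a minimal sketch in plain C (which is also valid UPC). The function name timing_error_percent and its parameters are hypothetical and are not part of Berkeley UPC; the sketch assumes the caller has already converted the tick interval to microseconds and obtained the granularity and overhead from the query functions described above.

```c
#include <stdint.h>

/* Hypothetical helper, not part of Berkeley UPC: estimates the relative
 * error (in percent) of a measured interval.
 *   elapsed_us      interval already converted from ticks to microseconds
 *   granularity_us  timer granularity in microseconds
 *   overhead_us     cost of one tick query in microseconds
 * Returns -1.0 when the interval rounds to zero microseconds, because the
 * ratio is undefined in that case. */
double timing_error_percent(uint64_t elapsed_us,
                            double granularity_us,
                            double overhead_us)
{
    if (elapsed_us == 0) {
        return -1.0;
    }
    return 100.0 * (granularity_us + overhead_us) / (double)elapsed_us;
}
```

A caller might pass the result of bupc_ticks_to_us(end-start) as the first argument. A negative return value indicates that the interval was too short to express in whole microseconds, in which case it may be better to time several iterations and keep the result in ticks until final output.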
NOTICE: The following interfaces have been adopted in the UPC Required Library Specifications, Version 1.3, with semantics similar to those described above, and are available in Berkeley UPC beginning with the 2.16 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. The Berkeley-specific 'bupc_' prefixed variants are now officially deprecated in favor of the standardized variants, and will be removed in a future version.
    #define __UPC_TICK__ 1 // predefined feature macro
    #include <upc_tick.h> // defines the following:
    typedef ... upc_tick_t;
    #define UPC_TICK_MAX ...
    #define UPC_TICK_MIN ...
    upc_tick_t upc_ticks_now (void);
    uint64_t upc_ticks_to_ns (upc_tick_t ticks);
    unsigned int bupc_thread_distance(int threadX, int threadY);
    #define BUPC_THREADS_SAME ...
    #define BUPC_THREADS_VERYNEAR ...
    #define BUPC_THREADS_NEAR ...
    #define BUPC_THREADS_FAR ...
    #define BUPC_THREADS_VERYFAR ...

bupc_thread_distance takes two thread identifiers (whose values must be in 0..THREADS-1, otherwise behavior is undefined), and returns an unsigned integral value which represents an approximation of the abstract 'distance' between the hardware entity which hosts the first thread and the hardware entity which hosts the memory with affinity to the second thread. In this context 'distance' is intended to provide an approximate and relative measure of expected best-case access time between the two entities in question. Several abstract 'levels' of distance are provided as pre-defined constants for user convenience, which represent monotonically non-decreasing 'distance':

    BUPC_THREADS_SAME
    BUPC_THREADS_VERYNEAR
    BUPC_THREADS_NEAR
    BUPC_THREADS_FAR
    BUPC_THREADS_VERYFAR
These constants have implementation-defined integral values which are monotonically increasing in the order given above. Implementations may add further intermediate levels with values between BUPC_THREADS_VERYNEAR and BUPC_THREADS_VERYFAR (with no corresponding define) to represent deeper hierarchies, so users should test against the constants using <= or >= instead of ==.
The intent of the interface is for users not to rely on the physical significance of any particular level, and simply to compare distances to discover which threads are relatively closer than others. Implementations are encouraged to document the physical significance of the various levels whenever possible (see below). However, any code that assumes exactly N levels of hierarchy or a fixed significance for a particular level will probably not be performance portable to different implementations or machines.
The relation is symmetric, i.e.:
bupc_thread_distance(X,Y) == bupc_thread_distance(Y,X)
but the relation is not transitive, i.e.:
bupc_thread_distance(X,Y) == A && bupc_thread_distance(Y,Z) == A
does NOT imply bupc_thread_distance(X,Z) == A
Furthermore, the value of bupc_thread_distance(X,Y) is guaranteed to be unchanged over the span of a single program execution, and the same value is returned regardless of the thread invoking the query.
Currently the significance of the BUPC_THREADS_* constants is as follows:
Value | Meaning(s) |
BUPC_THREADS_SAME | Only returned when threadX == threadY. |
BUPC_THREADS_VERYNEAR | threadX and threadY will communicate through shared memory. May include pthreads in the same process when compiled with -pthreads, and processes in the same compute node when PSHM support is available. |
BUPC_THREADS_NEAR | threadX and threadY are in the same compute node, but will communicate using the network API. This may occur because either PSHM support is not available or the -pshm-width flag to upcrun has placed these threads in disjoint shared memory domains. |
BUPC_THREADS_FAR | This value is not currently used. |
BUPC_THREADS_VERYFAR | threadX and threadY are on different compute nodes. |
NOTICE: This Berkeley-specific extension is now officially deprecated in favor of the standardized version adopted into the official UPC specification. See the end of this section for details.
    int bupc_castable(shared void *ptr);
    int bupc_thread_castable(unsigned int threadnum);
    void * bupc_cast(shared void *ptr);
This family of functions implements a UPC language extension proposed by Brian Wibecan of HP. Their purpose is to allow a UPC programmer to take advantage of UPC implementations in which some or all of the shared data with affinity to a given UPC thread can be directly addressed by other UPC threads using a pointer-to-local.
We use the term 'castable' to denote that the UPC implementation is able to represent a given pointer-to-shared using a pointer-to-local on a given thread. Any pointer-to-shared with affinity to a thread is guaranteed (by the language spec) to be castable by that same thread. However, in general shared storage with affinity to one thread is not castable by other threads. Depending on the UPC implementation, it is possible that for a given pair of threads either all, none, or only some of the shared address space with affinity to the first may be castable by the second.
bupc_castable() takes a pointer-to-shared as argument and returns non-zero if and only if the argument is castable by the calling thread. It is guaranteed that a call to bupc_castable() with an argument having affinity to the calling thread will always return non-zero.
bupc_thread_castable() takes a UPC thread number as argument and returns non-zero if and only if every pointer-to-shared with affinity to the argument thread is castable by the calling thread. It is guaranteed that bupc_thread_castable(MYTHREAD) is always non-zero.
bupc_cast() takes a pointer-to-shared as argument and returns a pointer-to-local. The returned pointer may be used to reference the same object as the argument only if the argument pointer is castable by the calling thread, as may be determined by bupc_castable() or bupc_thread_castable(). Otherwise the returned pointer is NULL.
NOTICE: The following interfaces have been adopted in the UPC Optional Library Specifications, Version 1.3, with semantics similar to those described above, and are available in Berkeley UPC beginning with the 2.16.2 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. The Berkeley-specific 'bupc_' prefixed variants are now officially deprecated in favor of the standardized variants, and will be removed in a future version.
    void *upc_cast(const shared void *ptr);
    upc_thread_info_t upc_thread_info(size_t threadnum);
In addition to the functions described in the previous section, Berkeley UPC 2.14.2 and later implement an 'inverse cast' function:
    shared void * bupc_inverse_cast(void *ptr);

This function takes a pointer-to-local argument and returns a pointer-to-shared (with zero phase) referencing the same location if and only if the argument references a UPC shared object. If the argument is NULL or references a location not in the UPC shared space, then a null pointer-to-shared is returned.
The UPC Optional Library Specifications, Version 1.3, adds a standardized UPC Atomic Memory Operations library in <upc_atomic.h>. This standardized atomics interface is fully implemented in this version of Berkeley UPC. The former Berkeley-specific atomics extensions (the 'bupc_atomic*' function family) were subsumed by the 1.3 standardized atomics interface, and (after a lengthy deprecation period) that proprietary function family has now been removed. Berkeley UPC provides several new extensions to the atomics interface specified in version 1.3, described below.
Berkeley UPC defines several extended hint values for the upc_atomichint_t hints argument to upc_all_atomicdomain_alloc() to control the behavior of the created atomic domain:
Atomic domain hint values are guaranteed to be macros, thus the recommended portable means to use the hints described above is to protect their use with #ifdef, for example:
    #ifdef UPC_ATOMIC_HINT_FAVOR_FAR
    upc_atomichint_t hint = UPC_ATOMIC_HINT_FAVOR_FAR;
    #else
    upc_atomichint_t hint = 0;
    #endif
    // create an atomic domain for atomic get, set and increment operations on 64-bit unsigned integers,
    // using network offload hardware where possible
    upc_atomicdomain_t *my_ad = upc_all_atomicdomain_alloc(UPC_UINT64, UPC_SET|UPC_GET|UPC_INC, hint);
    // perform an atomic fetch-increment on A[i]
    uint64_t result;
    upc_atomic_relaxed(my_ad, &result, UPC_INC, &A[i], 0, 0);
NOTICE: This Berkeley-specific extension is now officially deprecated in favor of the standardized version adopted into the official UPC specification. See the end of this section for details.
    void bupc_all_free(shared void *ptr);
    void bupc_all_lock_free(upc_lock_t *lockptr);
These two functions implement collective alternatives to the standard functions upc_free() and upc_lock_free(), as a convenience to the programmer. Both functions must be called collectively by all threads with the same argument. The object referenced by the argument is guaranteed to remain valid until all threads have entered the collective deallocation call, but the function does not otherwise guarantee any synchronization or strict reference. In all other respects the semantics of these functions and constraints on their usage are identical to their non-collective variants.
NOTICE: The following interfaces have been adopted in the UPC Language Specifications, Version 1.3, with the same semantics as those described above, and are available in Berkeley UPC beginning with the 2.16 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. The Berkeley-specific 'bupc_' prefixed variants are now officially deprecated in favor of the standardized variants, and will be removed in a future version.
    void upc_all_free(shared void *ptr);
    void upc_all_lock_free(upc_lock_t *lockptr);
Berkeley UPC provides 'bupc_system()' as a drop-in replacement for C99 'system()'. This implementation is designed to avoid adverse interactions between fork() and RDMA-capable networking libraries.
    The value of errno is zero at program startup, but is never set to zero by any library function. The value of errno may be set to nonzero by a library function call whether or not there is an error, provided the use of errno is not documented in the description of the function in this International Standard.

These semantics are actually somewhat weaker than one might hope - specifically, they allow library calls which succeed to change errno to a non-zero value. In practice many C/POSIX library implementations actually do this.
The problem in the context of Berkeley UPC and its source-to-source translation is that there is one copy of errno per UPC thread which is shared by both the generated code representing translated UPC code, and all the runtime libraries running underneath it (including UPCR, GASNet, vendor network libs, etc.). Furthermore, many actions in UPC which do not qualify as library calls at UPC level (e.g. dereferencing a pointer-to-shared) result in library calls within the generated code. Consequently, the value of errno set by a failed library call invoked at the UPC source level may be subsequently overwritten by any of these implicit library calls.
While one could imagine the Berkeley UPC compiler and runtime taking action to preserve the value of errno across all the implicit library calls, doing so would adversely affect performance and we do not currently take this approach. This means that a UPC user who wants to inspect the value of errno after a failed library call they make must do so immediately - not just before the next UPC-level library call, but also before taking any action that might possibly invoke implicit library calls in the generated source code.
Basically, the only 100% safe way for a UPC program to read errno when using Berkeley UPC is to copy it into a local variable immediately after the failed library call returns. This is the "recommended practice" for using errno with Berkeley UPC.
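The recommended practice above can be sketched as follows. This is a minimal illustration in plain C (which is also valid UPC), assuming a POSIX system where fopen() sets errno on failure. The helper name open_for_read and its parameters are hypothetical and are not part of Berkeley UPC.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: tries to open `path` for reading and returns the
 * stream, or NULL on failure. On failure, *saved_errno receives the errno
 * value captured immediately after fopen() returned; on success it is 0. */
FILE *open_for_read(const char *path, int *saved_errno)
{
    FILE *fp;

    *saved_errno = 0;
    fp = fopen(path, "r");
    if (fp == NULL) {
        /* Copy errno into a local before doing anything else. In a UPC
         * program, even a pointer-to-shared dereference at this point
         * could result in implicit library calls that overwrite it. */
        *saved_errno = errno;
    }
    return fp;
}
```

The caller can then inspect or report the saved value at any later point (for example with strerror()), regardless of what shared accesses or other library calls happen in between.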
    #define NDEBUG
    #include <assert.h>

will not work as expected if the NDEBUG definition modifies the behavior of assert.h (which, in this example, it does: this NDEBUG/assert.h case is the most common case where users run into this issue with our compiler).
There is a simple workaround: if you need to define a macro that affects the behavior of #included files, define it on the command line to upcc:
    upcc -DNDEBUG myprogram.upc
Berkeley UPC guarantees that 'getenv' allows retrieval of certain environment variable values that were present when the job was launched. At present this function is only guaranteed to retrieve these values for all threads if the environment variable's name begins with 'UPC_' or 'GASNET_'. On some platforms all environment variables seen by the job launcher may be propagated, but it is not portable to rely on this.
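One way to rely only on the guaranteed behavior is to give application settings a name that begins with 'UPC_' and to read them with a fallback value. The following is a minimal sketch in plain C (which is also valid UPC). The helper name env_long_or is hypothetical, as is any variable name passed to it (for example 'UPC_MYAPP_ITERS'); neither is defined by Berkeley UPC.

```c
#include <stdlib.h>

/* Hypothetical helper: returns the integer value of the environment
 * variable `name`, or `fallback` if the variable is unset, empty, or
 * not a complete decimal integer. */
long env_long_or(const char *name, long fallback)
{
    const char *text = getenv(name);
    char *end;
    long value;

    if (text == NULL || *text == '\0') {
        return fallback;
    }
    value = strtol(text, &end, 10);
    if (*end != '\0') {
        return fallback; /* trailing non-numeric characters */
    }
    return value;
}
```

A program might call env_long_or("UPC_MYAPP_ITERS", 100) on every thread; because the name carries the 'UPC_' prefix, each thread is expected to see the value that was present when the job was launched.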
The 'setenv' and 'unsetenv' functions are not guaranteed to work in a Berkeley UPC runtime environment, and should be avoided.
If you still believe you are encountering this issue, there are several recommended workarounds:
    UPC Runtime error: pthread_create: Invalid argument

Users encountering this error are advised to work around it by either using the BUPC translator (which does not exhibit the problem) or reworking their program to use less statically-allocated private data.
Thank you for using Berkeley UPC!