Berkeley UPC User's Guide version 2022.10.0


This guide tells you how to use Berkeley UPC, which is a portable implementation of Unified Parallel C (UPC) that runs on many different parallel systems available today.


UPC standards and APIs supported by this version of Berkeley UPC

This version of Berkeley UPC includes:

The three UPC specifications referenced above are also available for convenience as a combined document:
    UPC Language and Library Specifications, Version 1.3
    UPC Consortium, Lawrence Berkeley National Lab Tech Report LBNL-6623E, Nov 2013.


A Sample UPC Program

Here is a simple hello-world program written in UPC:
    #include <upc_relaxed.h>
    #include <stdio.h>

    int main() {
      printf("Hello from thread %i/%i\n", MYTHREAD, THREADS);
      upc_barrier;
      return 0;
    }
This program prints a message once from each thread (in some arbitrary interleaving), executes a barrier (optional), and exits.

For more involved examples of UPC code, see the UPC Language Tutorials from the UPC Language Community website (archived) and the 'upc-examples' directory of the Berkeley UPC runtime distribution. The Official UPC Specifications are a useful reference, and contain a description of the standard libraries.


Compiling UPC programs with 'upcc'

The upcc front end is used to compile UPC programs. It is designed with an interface that is very similar to the standard GNU gcc compiler, for ease of use. For instance, you could compile a physics simulation called 'light' from two source files via
    upcc -o light particle.upc wave.c -lgrottymath
Note that 'wave.c' can contain either UPC code or regular C code, and the 'grottymath' library that is linked into the application can be a regular C library: Berkeley UPC is fully interoperable with regular C source, object, and library files (note: if you compile with the -pthreads flag, any C libraries you use must be thread-safe). Berkeley UPC 2.0 also adds support for linking C++/FORTRAN/MPI objects into a UPC executable: see Mixing C/C++/MPI/FORTRAN with UPC.

upcc recognizes most commonly used C compiler flags (-D, -I, etc.). It also uses a number of its own flags for the choice of network API your program will run over, for compiling your UPC code for a static number of threads, and other UPC-specific options. See the upcc man page for details.

Choosing a network API for your UPC executable

Berkeley UPC executables are always compiled to run over a particular network API. To choose which network API is used, pass the '--network' flag with one of the following values:

Name    Description
ibv     OpenFabrics (aka OpenIB) InfiniBand Verbs for InfiniBand networks.
aries   GNI API for Cray XC systems running CLE.
ofi     OpenFabrics Interfaces API (aka libfabric) for multiple networks.
        This is the recommended network API for HPE Cray EX Slingshot and Intel Omni-Path networks only.
udp     UDP: works on any system with a standard TCP/IP stack, but is typically slower than using one of the native network types. Generally the fastest option for systems with only Ethernet hardware (notably faster than MPI-over-TCP).
mpi     MPI: works on any system with MPI installed, but is typically slower than using one of the other network types.
smp     "Symmetric multiprocessor (SMP)" mode: uses no network. Currently runs with only a single process unless your runtime has been configured with --enable-pshm (currently default only on Linux). Otherwise, you must pass -pthreads to upcc to run smp-conduit with multiple UPC threads.
ucx     Unified Communication X for Mellanox InfiniBand networks.
        NOTE: Experimental in this release, not auto-detected.

Note that you can only compile for a given network type if your Berkeley UPC runtime was configured to support it at build/installation time. To see which APIs are supported in your installation, and to see which is used by default, use 'upcc --version'.

Compiling for a fixed number of UPC threads

The '-T <number>' option to upcc causes your executable to be built for a fixed number of UPC threads. Alternatively, you can set 'UPCC_FIXED_THREADS=<number>' in your environment (the '-T' flag overrides the environment setting if both are present).

An executable compiled for a fixed number of UPC threads will fail at startup if you try to run it with a different number of threads. However, fixing the number of threads allows optimization on certain operations (such as shared pointer arithmetic), especially when the number of threads is a power of 2.
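
To illustrate why a fixed, power-of-two thread count can make such arithmetic cheaper, here is a minimal sketch in plain C. It is not the code Berkeley UPC generates; the function names 'owner_generic' and 'owner_pow2' are hypothetical. It computes which thread owns element i of an array dealt out cyclically (one element per thread in turn). With an arbitrary thread count this needs an integer modulo, while a power-of-two count gives the same answer from a bit mask.

```c
#include <stddef.h>

/* Hypothetical helper: owner of element i when elements are dealt out
   cyclically (block size 1) across `threads` threads. Needs a modulo. */
size_t owner_generic(size_t i, size_t threads) {
    return i % threads;
}

/* Same result when `threads` is a power of two: the modulo reduces to
   a bit mask. The caller must guarantee that threads is a power of two. */
size_t owner_pow2(size_t i, size_t threads) {
    return i & (threads - 1);
}
```

For example, with 4 threads both functions place element 13 on thread 1. When the thread count is a compile-time constant, a C compiler can often make this substitution on its own.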

Overriding global upcc.conf settings in a user configuration file

upcc gets global settings for your installation from a upcc.conf file which is created during the configuration stage of a runtime installation. After installation the file is located in the $prefix/<config>/etc directory of your installation. You can create a user configuration file $HOME/.upccrc to override any of these settings. See the upcc man page for a list of available settings.

UPC Standard and Berkeley-specific preprocessor macros

Programs compiled with Berkeley UPC will see all of the preprocessor macros provided by your backend C compiler, plus the following:

__UPC__
  Value: 1
  Description: Defined by any UPC implementation.
  Standard: UPC Language, first specified in v1.1.1

__UPC_COLLECTIVE__
  Value: 1
  Description: Defined by UPC implementations providing the UPC Collective Utilities library <upc_collective.h>.
  Standard: UPC Required Library, first specified in v1.2

__UPC_TICK__
  Value: 1
  Description: Defined by UPC implementations providing the UPC High-Performance Wall-Clock Timers library <upc_tick.h>.
  Standard: UPC Required Library, first specified in v1.3

__UPC_CASTABLE__
  Value: 1
  Description: Defined by UPC implementations providing the UPC Castability Functions library <upc_castable.h>.
  Standard: UPC Optional Library, first specified in v1.3

__UPC_IO__
  Value: 1
  Description: Defined by UPC implementations providing the UPC Parallel I/O library <upc_io.h>.
  Standard: UPC Optional Library, first specified in v1.2

__UPC_ATOMIC__
  Value: 1
  Description: Defined by UPC implementations providing the UPC Atomic Memory Operations library <upc_atomic.h>.
  Standard: UPC Optional Library, first specified in v1.3

__UPC_NB__
  Value: 1
  Description: Defined by UPC implementations providing the UPC Non-Blocking Transfer Operations library <upc_nb.h>.
  Standard: UPC Optional Library, first specified in v1.3

__UPC_VERSION__
  Value: Monotonically increasing positive integer constant
  Description: UPC specification supported: value is YYYYMM date of that version's ratification (currently '201311L' for UPC 1.3).
  Standard: UPC Language, first specified in v1.1.1

UPC_MAX_BLOCK_SIZE
  Value: A positive integer constant
  Description: Indicates the maximum value allowed in a layout qualifier for shared data. The actual value varies across configurations.
  Standard: UPC Language, first specified in v1.0

__UPC_DYNAMIC_THREADS__
  Value: 1 if dynamic threads: else undefined
  Description: Set to 1 unless the '-T' flag was passed to upcc.
  Standard: UPC Language, first specified in v1.1.1

__UPC_STATIC_THREADS__
  Value: 1 if static threads: else undefined
  Description: Set to 1 if the '-T' flag was passed to upcc.
  Standard: UPC Language, first specified in v1.1.1

THREADS
  Value: A compile-time integer constant representing the static thread count: else undefined
  Description: Set to the static thread count if the '-T' flag was passed to upcc. (Under dynamic threads, THREADS is a keyword that expands to the thread count determined at program launch.)
  Standard: UPC Language

__UPC_PUPC__
  Value: 1
  Description: Defined by UPC implementations supporting the GASP interface.
  Standard: GASP 1.5 specification

__BERKELEY_UPC__
  Value: Monotonically increasing positive integer constant
  Description: The major version number of the Berkeley UPC release. Example: '1' for release '1.0.3'.
  Standard: Berkeley UPC only

__BERKELEY_UPC_MINOR__
  Value: An integer constant
  Description: The minor version number of the Berkeley UPC release. Example: '0' for release '1.0.3'.
  Standard: Berkeley UPC only

__BERKELEY_UPC_PATCHLEVEL__
  Value: An integer constant
  Description: The patch version number of the Berkeley UPC release. Example: '3' for release '1.0.3'.
  Standard: Berkeley UPC only

__BERKELEY_UPC_<NETWORK>_CONDUIT__
  Value: 1, or undefined
  Description: Identifies the network API used. Example: if 'upcc -network=mpi' is used, '__BERKELEY_UPC_MPI_CONDUIT__' will be defined to '1'.
  Standard: Berkeley UPC Runtime[1]

__BERKELEY_UPC_PSHM__
  Value: 1, or undefined
  Description: Defined to 1 if and only if PSHM support is enabled.
  Standard: Berkeley UPC Runtime[1]

__BERKELEY_UPC_PTHREADS__
  Value: 1, or undefined
  Description: Defined to 1 if and only if the '-pthreads' flag is used.
  Standard: Berkeley UPC Runtime[1]

__BERKELEY_UPC_RUNTIME__
  Value: 1, or undefined
  Description: Defined to 1 if and only if the Berkeley UPC runtime is used.
  Standard: Berkeley UPC Runtime[1]

__BERKELEY_UPC_RUNTIME_DEBUG__
  Value: 1, or undefined
  Description: Defined to 1 if and only if a debugging runtime is used (i.e. '-g' passed to upcc).
  Standard: Berkeley UPC Runtime[1]

__BERKELEY_UPC_RUNTIME_RELEASE__
  Value: An integer constant, or undefined
  Description: The major version number of the Berkeley UPC Runtime library. Example: '2' for release '2.12.0'.
  Standard: Berkeley UPC Runtime[1], release 2.12 and newer

__BERKELEY_UPC_RUNTIME_RELEASE_MINOR__
  Value: An integer constant, or undefined
  Description: The minor version number of the Berkeley UPC Runtime library. Example: '12' for release '2.12.0'.
  Standard: Berkeley UPC Runtime[1], release 2.12 and newer

__BERKELEY_UPC_RUNTIME_RELEASE_PATCHLEVEL__
  Value: An integer constant, or undefined
  Description: The patch version number of the Berkeley UPC Runtime library. Example: '0' for release '2.12.0'.
  Standard: Berkeley UPC Runtime[1], release 2.12 and newer

Note 1: Defined by the Berkeley upcc driver independent of the underlying translator/compiler (Berkeley UPC-to-C translator, Clang-upc2c translator, CUPC compiler, or GUPC compiler) targeting the Berkeley UPC Runtime.
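
As a hedged illustration of how the macros listed above might be used, the following sketch selects behavior at compile time. The function names 'upc_thread_mode' and 'have_bupc_2_12' are hypothetical. Every macro is tested with 'defined', so the file also builds with a plain C compiler, where none of them exist.

```c
/* Returns a short description of the thread mode this file was built in.
   A plain C compiler defines neither macro, so the last branch is taken. */
const char *upc_thread_mode(void) {
#if defined(__UPC_STATIC_THREADS__)
    return "static";
#elif defined(__UPC_DYNAMIC_THREADS__)
    return "dynamic";
#else
    return "not UPC";
#endif
}

/* Returns 1 if the Berkeley UPC release macros report 2.12 or newer,
   and 0 otherwise (including when the macros are not defined). */
int have_bupc_2_12(void) {
#if defined(__BERKELEY_UPC__) && defined(__BERKELEY_UPC_MINOR__)
    return (__BERKELEY_UPC__ > 2) ||
           (__BERKELEY_UPC__ == 2 && __BERKELEY_UPC_MINOR__ >= 12);
#else
    return 0;
#endif
}
```

Guarding on 'defined(...)' rather than on the macro's value keeps the code portable to translators that do not set the Berkeley-specific macros.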

Using a remote UPC-to-C translator

The upcc front end has the ability to use a UPC-to-C translator located on a remote machine. This is provided both as a convenience (the translator takes much longer to build than the runtime, and we provide a public HTTP translator that allows users to get started with Berkeley UPC more quickly), and to support the many systems on which our translator does not build, due to C++ portability issues.

A remote translator can be contacted via either the HTTP or SSH protocols. To use HTTP, the 'upcc.cgi' CGI script (located in the 'contrib' directory of the runtime distribution) must be installed and configured with a web server on the remote host. Simply set the 'translator' parameter in your user configuration file (or the global 'upcc.conf') to the URL for the CGI script. To use SSH, you must be able to log in to the remote host using SSH, and the 'translator' parameter must be set to 'remote_host:/path/to/translator'. You will want to use key-based authentication and 'ssh-agent' to avoid entering your password each time you compile. See our SSH Agent Tutorial.

When using an HTTP-based remote translator, upcc also includes support for use of an HTTP proxy. Set the 'http_proxy' parameter in your user configuration file (or the global 'upcc.conf') to the proxy URL. The upcc front end does not currently support HTTPS or SOCKS proxies, nor HTTP proxies that require authentication (HTTP error 407).


Creating libraries of UPC code

At present you cannot create traditional C-style libraries with UPC code in them using Berkeley UPC (i.e. you cannot successfully use 'ar' to create 'libmyupc.a').

If you wish to create a reusable set of compiled code, you must currently keep the files in *.o format. So, instead of the traditional C format, where you'd create 'libmyupc.a', and then link with something like

    upcc myprogram.o -L/libpath -lmyupc
you must instead do something like
    upcc myprogram.o /libpath/libmyupc/*.o 
Note that beginning with Berkeley UPC 2.12.0 it is possible to link together static threads and dynamic threads objects, with the result being a static threads executable. In many cases this allows use of a dynamic threads object in the role of a library, which can be linked to an executable with any dynamic or static thread setting.


Running UPC programs

If you compile a UPC program with '--network=smp', you can run the executable normally (i.e., the same way you'd run 'ls' or 'grep'). Otherwise, you are generating an executable that uses a parallel network API, and this typically means your executable will need some special treatment to be launched correctly.

Berkeley UPC executables should be run the same way as any other parallel program on your system that uses the same underlying network API. So, for instance, a program compiled with '--network=mpi' is run on many systems via 'mpirun -np <number of processes> a.out'. Other systems may use other invocations, such as 'prun' or 'poe', especially when APIs other than MPI are used. Consult your system's documentation for details.

Using 'upcrun'

The 'upcrun' script that is installed as part of the Berkeley UPC runtime is our attempt to provide a standard interface for running UPC programs. If your installation has configured 'upcrun.conf' correctly (in many cases the defaults will work), you can run UPC programs portably via commands like
    upcrun -n 4 parboil
This example runs the UPC executable 'parboil' with 4 UPC threads. The default layout of those threads on the physical hardware is system-dependent, but there are upcrun options to further control job layout.

An additional benefit of using upcrun is that it provides consistent support for propagating environment variables to all threads of your UPC program. If you use upcrun, any environment variable beginning with either 'UPC_' or 'GASNET_' is guaranteed to be propagated to all threads. (Support for propagating all environment variables is planned.) If you do not use upcrun, environment propagation will only work to the extent that the parallel job launcher you use provides it normally.
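
The prefix rule can be expressed as a small check. This is a minimal sketch with a hypothetical function name, not code taken from upcrun. It answers only whether a variable name falls under the guarantee described above; other variables may or may not be propagated, depending on the job launcher.

```c
#include <string.h>

/* Returns 1 if `name` starts with "UPC_" or "GASNET_", the two prefixes
   the guide says upcrun always propagates; otherwise returns 0.
   Hypothetical helper: this is not part of upcrun. */
int upcrun_guarantees_propagation(const char *name) {
    return strncmp(name, "UPC_", 4) == 0 ||
           strncmp(name, "GASNET_", 7) == 0;
}
```

A practical consequence: if your program reads its own settings from the environment, giving them a 'UPC_' prefix is one way to have them reach every thread when launched with upcrun.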

You can see how upcrun thinks your job should be run without actually running it by passing the upcrun '-t' flag. Also, 'upcrun -i <executable>' will provide information about a Berkeley UPC executable, such as the network API that it was built against, and the number of fixed threads (if any) that it was compiled for.

See 'upcrun --help' or the upcrun man page for more information.

Setting the amount of shared memory available to your applications

At startup, each Berkeley UPC thread reserves a fixed portion of its address space (via the 'mmap()' system call) for shared memory. This address range cannot be used for regular unshared (i.e., malloc) memory allocations, and it also serves as a maximum value on the amount of shared memory (per-thread) that the program can use: a UPC program will die with a fatal error if any thread tries to allocate more shared memory than it reserved at startup.

The default amount of shared memory to reserve per UPC thread on a system is chosen at configure time (see INSTALL.TXT for details), but you can override that value for a particular application at either compile time or at job startup. Generally this is only needed if you observe that your application is running out of either shared or regular C memory.

To embed a different default amount of shared memory into your application, simply pass 'upcc -shared-heap=144MB' for instance (to get 144 megabytes per UPC thread). You can also use 'GB' for gigabyte amounts (if neither 'MB' nor 'GB' is used, megabytes are assumed). To override the embedded default amount of shared memory at application startup, set the UPC_SHARED_HEAP_SIZE environment variable to whatever value you want ('2GB', etc.), or pass '-shared-heap' to upcrun.
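
The suffix convention can be sketched as follows. The function 'heap_size_mb' is a hypothetical helper, not the parser used by upcc or upcrun, and it assumes that one gigabyte means 1024 megabytes, which the text above does not state.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts a size such as "144MB", "2GB", or "64"
   to megabytes, treating a bare number as megabytes. Returns -1 for
   anything else. Assumes 1 GB = 1024 MB. */
long heap_size_mb(const char *text) {
    char *end;
    long value = strtol(text, &end, 10);
    if (end == text || value <= 0) return -1;   /* no digits, or not positive */
    if (*end == '\0' || strcmp(end, "MB") == 0) return value;
    if (strcmp(end, "GB") == 0) return value * 1024;
    return -1;                                  /* unrecognized suffix */
}
```

Under these assumptions, "144MB" and "144" both yield 144, and "2GB" yields 2048.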

While it is tempting to simply grab an extremely large shared memory segment, be aware that this is not always a good idea, or even possible. Since the shared address space range cannot be used for regular malloc allocations, creating too large a shared space can cause the amount of regular heap memory available to your application to become small (causing malloc to eventually return NULL when you request more memory). Also, the shared memory space is reserved via an mmap() call, and while this does not generally cause any physical memory pages to be allocated, certain operating systems (for instance, Linux) will not allow more memory to be reserved by applications than the OS can guarantee is available, and so allocating a shared region larger than the physical memory (plus swap space) may fail.

The default amount of shared memory per UPC thread can be changed system-wide by modifying the 'shared_heap' parameter in the installation's upcc.conf file. You can override the system-wide default for your own applications by setting shared_heap in your user configuration file.


Using pthreaded Berkeley UPC programs

Berkeley UPC programs may handle inter-process communication within a compute node in one of two ways. If your runtime has been configured with support for PSHM (Process SHared Memory), then all communication within a compute node will be performed directly through shared memory. This is the default configuration on nearly all systems (though some system setup is sometimes required). For cases in which PSHM support is not enabled (for example, when the runtime was configured with '--disable-pshm'), the runtime will call the network APIs for all inter-process communication, even within a compute node. While many network APIs perform some kind of optimization for 'local' traffic (avoiding actually putting messages on the network), they are typically slower than simply using shared memory between UPC threads.

To provide shared-memory performance within an SMP (or cluster of SMPs), Berkeley UPC supports creating executables that implement UPC threads as pthreads within a process, thus allowing optimized communications between multiple UPC threads running in the same process. To utilize pthreads, pass the '-pthreads=N' option to upcc, where N is the number of processors per node on your system (or configure your 'upcc.conf' file, as described below). This will use one or more multithreaded processes on each node, with shared memory used among UPC threads in the same process. This may often be the fastest way to run Berkeley UPC programs on SMP systems when PSHM is not available.

The '-pthreads' flag must be passed consistently at all stages of compilation and linking. Also, when pthreads are used, upcc needs to delay much of the compilation of your code until link time, so if you split code generation into separate compilation and linking steps (i.e., 'upcc -c foo.upc', followed by 'upcc foo.o bar.o'), you need to pass any macro and/or include path directives (ex: '-DFOO=bar -I/usr/local/include') to upcc for both the compilation and link commands.

Any C libraries that your code links against must be thread-safe in order to be used with -pthreads. If one or more of your libraries is not thread-safe, you must compile without pthreads, and run separate processes on the same node to exploit an SMP system. In the non-pthreads case, support for shared memory communication among UPC processes on an SMP node is available on many systems via the "PSHM" feature (see "INTRA-NODE SHARED MEMORY SUPPORT" in INSTALL.TXT).

When you link an application with '-pthreads', a subdirectory named <executable_name>_pthread-link will be created in the current directory. This directory exists in order to speed up further linking commands of the same program. If you link the same application again with the same object file names, and none of the global static unshared variables in your program have changed name or size, recompilation of all the files in your application can be avoided, which can make a significant difference in build time for programs with many source files. You may delete the temporary directory at any time without any side effects (other than possibly longer link times). One can prevent this optimization with the -nolink-cache flag to upcc.

Unless otherwise specified, pthreaded UPC applications use a default number of pthreads per process (run 'upcc --version' to see the default for your system). This number is set in the upcc.conf configuration file, and can be changed there (or in your user configuration file). It can also be overridden in several ways. Compiling with 'upcc -pthreads=<NUMBER>' changes the default number of pthreads per UPC process for an executable to NUMBER. If the 'UPC_PTHREADS_PER_PROC' environment variable is set to a nonzero integer when you run a UPC program, it will override any default value. Finally, upcrun is smart about pthreads in several ways. First, if you run a pthreaded parallel job with 'upcrun -n <NUMBER> ...', the number of processes actually launched will be divided by the number of pthreads, so that exactly NUMBER UPC threads are used. Second, if the PSHM feature is disabled and you use -network=smp (generating an executable that will run only a single process), upcrun -n NUMBER will automatically set the number of pthreads to NUMBER.
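
The relationship between the requested UPC thread count and the number of processes can be sketched as simple arithmetic. The function below is a hypothetical helper, not upcrun's implementation. It covers only the case where the thread count divides evenly by the pthreads per process, and reports -1 otherwise, since the text above does not say how upcrun handles an uneven split.

```c
/* Hypothetical helper: number of processes needed so that exactly
   `upc_threads` UPC threads run with `pthreads_per_proc` pthreads in
   each process. Returns -1 if either input is not positive or the
   division is not exact. */
int processes_to_launch(int upc_threads, int pthreads_per_proc) {
    if (upc_threads <= 0 || pthreads_per_proc <= 0) return -1;
    if (upc_threads % pthreads_per_proc != 0) return -1;
    return upc_threads / pthreads_per_proc;
}
```

For instance, 8 UPC threads with 4 pthreads per process correspond to 2 processes.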


Debugging Berkeley UPC programs

You can use a regular C debugger and get usable debugging support. Berkeley UPC provides several mechanisms for attaching a regular C debugger to one or more of your UPC application's threads at various points during execution. While this does not provide a fully normal debugging environment (the debugger will show the C code emitted by our translator, rather than your UPC code), it can still allow you to see program stack traces and other important information. This can be very useful if you wish to submit a helpful bug report to us. See Attaching a regular C debugger to Berkeley UPC programs for details.

Berkeley UPC also supports automatically generating backtraces if a fatal error occurs in your program. This will allow you to see a stack trace of the function calls that your program was in at the time it crashed. To use auto-backtracing, run with upcrun -backtrace or set GASNET_BACKTRACE=1 in your environment. The level of backtracing support available depends on the back-end C compiler and operating system, and so not all systems are equally functional, and some systems will not provide backtraces. See gasnet/README for more information on backtracing.


Analyzing UPC Programs with 'upc_trace'

As of version 2.0, Berkeley UPC includes 'upc_trace', a tool for analyzing the communication behavior of UPC programs. When run on the output of a trace-enabled Berkeley UPC program, 'upc_trace' provides information on which lines of code in your UPC program generated network traffic, including information such as: how many messages the line caused, what type (local and/or remote gets/puts), and the maximum/minimum/average/combined sizes of the messages.

Examining tracing information is one of the best ways to go about optimizing your UPC program. It provides a way for you to see which lines of your code are generating the most network traffic (and the size of the network messages used). From this you may be able to determine how to either avoid some of this traffic, or change your code to use fewer, larger messages (for instance, by replacing sets of individual reads/writes with bulk memory movement calls like 'upc_memget()', etc.), which is typically more efficient. Examining barrier wait times can also let you know if your computations are imbalanced across threads, and/or if you could profit by using split-phase barriers, moving computation in between 'upc_notify' and 'upc_wait'.

How to use 'upc_trace'

  1. Tracing must be enabled in order to work. By default, tracing is enabled for debug compilations (i.e. if 'upcc -g' is used), but not otherwise (as it incurs some overhead). If you wish to also trace non-debug executables, you must rebuild your UPC runtime system and pass '--with-multiconf=+opt_trace' to configure, then build your application with 'upcc -trace'.

  2. You must run your application with 'upcrun -trace ...' or 'upcrun -tracefile TRACE_FILE_NAME ...'. Either of these flags causes your UPC executable to dump out tracing information while it executes. The '-trace' flag causes one file per UPC thread to be generated, with the name 'upc_trace-a.out..-N', where 'a.out' is the name of your executable, and 'N' is the UPC thread's number. The '-tracefile NAME' option lets you specify your own name for the tracing file(s): if the name contains a '%' character, one trace file per thread is generated, with the '%' replaced with the UPC thread's number. Otherwise, all threads will write to the same file.
    Note that running with tracing may slow down your application considerably: the exact amount depends on your filesystem, and the ratio of communication/computation in your program. If you are only interested in a subset of trace information, consider setting 'GASNET_TRACEMASK' as described below.

  3. After your application has completed, you may run 'upc_trace' on one or more of the trace files generated by your program run:
    1. Running 'upc_trace' on a trace file generated by a single UPC thread shows the information only for that thread. If you pass multiple files from the same application run, the information for the various threads is coalesced, so passing in all the tracefiles generated by a run allows you to see information for the entire application.
    2. There are a number of flags to 'upc_trace' which control what kinds of information is reported, and how it is sorted. See 'upc_trace --help' or the upc_trace man page for details.
    3. Note that upc_trace may take a while to run, especially on large tracefiles. Consider setting GASNET_TRACEMASK and/or GASNET_TRACELOCAL (described below) to streamline the trace file's contents to include only those events you're interested in analyzing.
    4. If you compile with 'upcc -opt', it is possible that the UPC-to-C translator has coalesced some of the network operations in your program, in order to get better network performance. This means that 'upc_trace' may not report communication for certain lines of your program, and other lines may seem to be getting/putting more data than they should.
    The trace files are in a human-readable text format, and are amenable to generic text-file processing tools such as grep, perl, etc. This means you can also elect to perform your own analyses on the trace file if the upc_trace tool doesn't provide the information you are looking for.
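
The '%' substitution rule for '-tracefile' names can be sketched as below. The function 'trace_file_name' is a hypothetical helper, not the runtime's code. It replaces only the first '%' in the name, which is an assumption, since the description above does not cover names containing several '%' characters.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes into `out` the file name a given thread
   would use. The first '%' in `pattern` is replaced by `thread`; a
   pattern without '%' is copied unchanged (one file shared by all
   threads). Returns 0 on success, -1 if `out` is too small. */
int trace_file_name(const char *pattern, int thread,
                    char *out, size_t out_size) {
    const char *percent = strchr(pattern, '%');
    int written;
    if (percent == NULL) {
        written = snprintf(out, out_size, "%s", pattern);
    } else {
        written = snprintf(out, out_size, "%.*s%d%s",
                           (int)(percent - pattern), pattern,
                           thread, percent + 1);
    }
    return (written >= 0 && (size_t)written < out_size) ? 0 : -1;
}
```

With the name 'run-%.trace', thread 3 would write to 'run-3.trace', while the name 'all.trace' is used unchanged by every thread.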

Controlling what gets logged in the trace file by setting GASNET_TRACEMASK

By default, Berkeley UPC will trace all of the following program events:

ID Feature
G Network 'gets'. These include both bulk gets (from upc_memget, etc.), and network get operations caused by reading shared memory via shared variables/pointers. The 'G' mask does not include 'local' gets (i.e. reads from shared memory which has affinity to the reading UPC thread), as these do not result in network traffic. Use 'H' to trace local gets.
P Network 'puts'. These include both bulk puts (from upc_memput, etc.), and put operations caused by writing to shared memory via variables/pointers. The 'P' mask does not include 'local' puts (i.e. writes to shared memory which has affinity to the writing UPC thread), as these do not result in network traffic. Use 'H' to trace local puts.
B Barriers, including both blocking (upc_barrier) and non-blocking (upc_notify followed by upc_wait: a pair of these count as a single barrier).
N Line number information from UPC source files. The "N" and "H" flags must always be among those set for upc_trace to work!
H Miscellaneous UPC information. The "N" and "H" flags must always be among those set for upc_trace to work! Passing this flag causes the following things to be traced:
  • UPC lock functions ('upc_lock', 'upc_lock_attempt', and 'upc_unlock').
  • UPC collective operations (besides barriers, which are controlled by 'B').
  • 'Local' puts/gets, i.e. gets and puts to shared memory which has affinity to the issuing UPC thread (which thus do not result in network traffic). Tracing local gets/puts can significantly expand the size of the trace file (and the time it takes to run 'upc_trace'), so if you are not interested in viewing them, consider omitting them from the trace file. You can do this by setting the 'GASNET_TRACELOCAL' environment variable to "no" (or "0"). You may also selectively turn on/off local tracing during program execution by calling the 'bupc_trace_settracelocal()' function (described below). Local get/put tracing only includes accesses performed through pointers-to-shared or the bulk 'upc_memget', etc., functions: it does not include accesses to shared memory made via 'localized' pointers, i.e., pointers-to-shared that have been cast to a pointer-to-local ("regular C pointers").
  • UPC memory allocation operations, i.e., 'upc_alloc', 'upc_all_alloc', 'upc_global_alloc', and 'upc_free' function calls. (Note: allocation operations are not currently reported by 'upc_trace': if you wish to examine where/when your program run has called allocation functions, you must examine the trace file by hand.)
  • 'Strict' UPC operations. (Note: 'strict' operations are not currently reported by 'upc_trace': if you wish to examine where/when your program run has executed 'strict' operations, you must examine the trace file by hand.)

To trace only a subset of these features, set the 'GASNET_TRACEMASK' environment variable to a string containing the IDs of the features you wish to trace. Note that the "N" and "H" flags must always be among those set for 'upc_trace' to work (if you are intending to manually examine the trace file, they do not need to be set).

So, for instance, if you are trying to perform an analysis that does not require get/put information, you are highly advised to set 'GASNET_TRACEMASK' to "BHN" and 'GASNET_TRACELOCAL' to "no" (or "0"). This will turn off tracing for all get and put operations. Since gets/puts are typically the majority of items in a full trace file, this will probably result in much faster program execution, a much smaller trace file, and faster analysis by 'upc_trace'.
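
Because a mask that lacks "N" or "H" produces a trace that 'upc_trace' cannot process, it can be worth checking a mask string before a long run. The following is a minimal sketch with a hypothetical function name. It checks only for the two required IDs and does not validate the other characters.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if `mask` contains both 'N' and 'H',
   which the guide says upc_trace requires; otherwise returns 0. */
int mask_usable_by_upc_trace(const char *mask) {
    return strchr(mask, 'N') != NULL && strchr(mask, 'H') != NULL;
}
```

The mask "BHN" passes this check, while "PG" does not.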

Controlling tracing during runtime

For even more control over tracing, you may call the following functions in your program to set the trace mask dynamically, read its current value, and/or insert your own custom messages into the trace file:
    extern void         bupc_trace_setmask (const char *newmask);
    extern const char * bupc_trace_getmask (void);
    extern int          bupc_trace_gettracelocal (void);
    extern void         bupc_trace_settracelocal (int val);
    void                bupc_trace_printf ((const char *msg, ...));
'bupc_trace_getmask' and 'bupc_trace_setmask' allow programmatic retrieval and modification of the trace mask in effect for the calling thread. The initial value is determined by the 'GASNET_TRACEMASK' environment variable, and the input and output to the mask manipulation functions have the same format as 'GASNET_TRACEMASK' values. Note that whenever any tracing is enabled (i.e. unless you are temporarily turning off tracing by passing an empty string), the "N" and "H" flags must always be among those set for 'upc_trace' to work.
'bupc_trace_{get,set}tracelocal' allow the calling thread to programmatically enable/disable tracing of local put/get operations, which correspond to pointer-to-shared accesses that actually have local affinity (and therefore invoke no network communication).

Different UPC threads may set different masks and tracelocal settings, but note that in pthreaded UPC jobs all pthreads in a process share these values. These functions have no effect if trace and stats communication profiling are disabled at upcr configure time, or are not enabled for the current run.
   Ex: 
     bupc_trace_setmask("PGHN");   // trace puts, gets, line numbers, and misc. info
     bupc_trace_settracelocal(1);  // include local puts and gets 
     // do something...
     bupc_trace_setmask("");      // stop tracing

The 'bupc_trace_printf' utility outputs a message into the trace file, if it exists. Note that two sets of parentheses are required when invoking this operation, in order to allow it to compile away completely for non-tracing builds.

  Ex:   double A[4] = ...; 
        int i = ...;
        bupc_trace_printf(("the value of A[%i] is: %f", i, A[i]));
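
The double-parenthesis requirement follows a common C idiom that predates variadic macros. The sketch below shows the idiom with the hypothetical names 'MY_TRACING' and 'MY_TRACE_PRINTF'. It is not Berkeley UPC's actual definition of 'bupc_trace_printf'. The inner parentheses turn the whole argument list into a single macro argument, which a non-tracing build can then discard without evaluating it.

```c
#include <stdio.h>

/* MY_TRACING and MY_TRACE_PRINTF are hypothetical names for this sketch. */
#ifdef MY_TRACING
#define MY_TRACE_PRINTF(args) (printf args)
#else
#define MY_TRACE_PRINTF(args) ((void)0)   /* argument list is discarded */
#endif

/* Counts evaluations, so a caller can tell whether the macro's
   arguments were evaluated. */
int evaluation_count = 0;
int next_value(void) { return ++evaluation_count; }

int demo(void) {
    /* The inner parentheses make the whole argument list one macro
       argument, so it can be dropped as a unit when tracing is off. */
    MY_TRACE_PRINTF(("value is %d\n", next_value()));
    return evaluation_count;   /* 0 if MY_TRACING is undefined, else 1 */
}
```

Since the arguments are never evaluated in a non-tracing build, expressions passed to such a macro should not have side effects the program relies on.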


Gathering application statistics

Berkeley UPC also provides the ability to generate a 'stats' report, which contains a statistical summary of program activity. While this report does not give as much information as provided by tracing, it does contain such information as the total number of get/put operations, barriers, etc. (although these cannot be traced back to specific lines of code, as 'upc_trace' provides). But the stats report is generally much smaller than the average trace file, so it may be useful if you are finding that tracing is adding too much overhead to your program runs.

To generate statistics, simply set the 'GASNET_STATSFILE' environment variable to a file name, into which statistics will be written at the end of your program's run. (Note: by default, only debug executables support statistics generation, as it incurs a performance penalty: if you wish to have non-debug UPC executables generate statistics, you must rebuild your UPC runtime system, passing '--with-multiconf=+opt_trace' to configure, then build your application with 'upcc -trace'.) You may generate both stats and tracing info for the same program run if you wish.

Just as with tracing, you may set a mask to control what types of events are included in the statistics, by setting the 'GASNET_STATSMASK' environment variable, and/or by calling the following functions:

    extern void         bupc_stats_setmask (const char *newmask);
    extern const char * bupc_stats_getmask (void);
The same mask IDs are used by the tracing and statistics masks, i.e., calling 'bupc_stats_setmask("BP")' would cause execution to gather statistics only for barriers and puts. See the table in the tracing documentation for the list of IDs.


Profiling UPC Programs with 'upcc -pg' and 'gprof'

The standard GNU 'gprof' profiling tool can be used with Berkeley UPC programs, if your backend C compiler supports gprof (this is autodetected at configure time). Simply compile your UPC program with 'upcc -pg'. When you run the program, one or more 'gmon.out' files are generated (if your UPC program consists of multiple processes, one file per process is created, each in its own 'gmon.out.process_number' subdirectory). You can then use 'gprof' on one or more of these files (if multiple files are passed, the statistics are combined):
    upcc -pg foo.c
    upcrun -n 2 a.out
    gprof a.out gmon.out.0/gmon.out gmon.out.1/gmon.out | less
Note that 'gprof' provides timings and statistics for processor usage: it does not include time during which the process has been put to sleep waiting for I/O (including network reads/writes). However, since Berkeley UPC uses spin-locks in many cases to wait for network events, rather than blocking system calls, you may see that certain 'gasnet*' functions consume large amounts of CPU time. This generally means that your program is spending most of that time waiting for network communication to complete (some fraction is the software overhead inherent in sending/receiving the network traffic).

If your program spends a lot of time waiting for network operations to complete, you may be suffering from an imbalanced load across threads (so that some take longer to "catch up" to a barrier, for instance). Restructuring your application may avoid these waiting periods. Alternatively, you may be able to use some of this "spare" time for computation (or other network traffic) by switching to use non-blocking barriers (i.e., 'upc_notify/upc_wait'), and/or our non-contiguous memcpy extensions to UPC. Replace blocking network constructs (such as 'upc_barrier', 'upc_memcpy', and reads/writes to shared variables) with non-blocking equivalents, and insert unrelated computation (and/or network traffic) in between the initialization and completion calls. Of course, you must be able to find unrelated computation/communication for this to work, and the degree to which this is possible will depend on your application.


Berkeley-specific extensions to the UPC Language

Non-blocking memcpy functions (partially deprecated)

NOTICE: A large portion of this Berkeley-specific extension is now officially deprecated in favor of the standardized version adopted into the official UPC specification. See below for details.

As of 2.0, Berkeley UPC fully implements a set of non-blocking extensions to the 'upc_memcpy()' function for contiguous data. These extensions allow you to explicitly overlap memcpy-like functions with computation (and/or with other memcpy calls).

The full interface is described in sections 2 through 4 of our Proposal for Extending the UPC Memory Copy Library Functions. See that document for details on the functions and their usage.

NOTICE: The following interfaces have been adopted in the UPC Optional Library Specifications, Version 1.3, with semantics compatible with a subset of those given in the document referenced above, and are available in Berkeley UPC beginning with the 2.18 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. The corresponding portion of the Berkeley-specific non-blocking interfaces (specifically, those operating on contiguous data) is now officially deprecated, and the 'bupc_' prefixed equivalents of those functions will be removed in a future version.

    #define     __UPC_NB__  1  // predefined feature macro
    #include    <upc_nb.h>     // defines the following:

    // Explicit-handle non-blocking operations and synchronization:
    typedef      ... upc_handle_t;
    #define      UPC_COMPLETE_HANDLE ...
    upc_handle_t upc_memcpy_nb(shared void * restrict dst,
                               shared const void * restrict src,
                               size_t n);
    upc_handle_t upc_memget_nb(void * restrict dst,
                               shared const void * restrict src,
                               size_t n);
    upc_handle_t upc_memput_nb(shared void * restrict dst,
                               const void * restrict src,
                               size_t n);
    upc_handle_t upc_memset_nb(shared void *dst, int c, size_t n);
    int          upc_sync_attempt(upc_handle_t handle);
    void         upc_sync(upc_handle_t handle);

    // Implicit-handle non-blocking operations and synchronization:
    void upc_memcpy_nbi(shared void * restrict dst,
                        shared const void * restrict src,
                        size_t n);
    void upc_memget_nbi(void * restrict dst,
                        shared const void * restrict src,
                        size_t n);
    void upc_memput_nbi(shared void * restrict dst,
                        const void * restrict src,
                        size_t n);
    void upc_memset_nbi(shared void *dst, int c, size_t n);
    int  upc_synci_attempt(void);
    void upc_synci(void);

NOTE: The types upc_handle_t and bupc_handle_t are interchangeable. One may freely mix the standard library calls in upc_nb.h with the Berkeley-specific interfaces for contiguous non-blocking memcpy.

Non-contiguous memcpy functions

As of 2.0, Berkeley UPC fully implements a set of extensions to the 'upc_memcpy()' function for use with non-contiguous data. These extensions provide versions that allow you to specify non-contiguous memory regions to get/put, and include both blocking and non-blocking versions.

The full interface is described in sections 5 and 6 of our Proposal for Extending the UPC Memory Copy Library Functions. See that document for details on the functions and their usage.

NOTE: The types upc_handle_t and bupc_handle_t are interchangeable. One may freely mix the standard library calls in upc_nb.h with the Berkeley-specific interfaces for non-contiguous memcpy.

Point-to-point synchronization functions

As of 2.2, Berkeley UPC implements a set of point-to-point synchronization functions, partly based on the POSIX semaphore interfaces. These extensions allow you to explicitly synchronize between pairs of UPC threads, and to associate synchronization with data transfer.

The full interface is described in our Proposal for Extending the UPC Libraries with Explicit Point-to-Point Synchronization Support. See that document for details on the functions and their usage.

Value-based collectives convenience interface (bupc_collectivev.h)

This library wrapper provides a value-based convenience interface to the UPC collectives library that is part of UPC 1.2. There is a small amount of optimization for Berkeley UPC, but the wrapper is generic and can be used with any fully UPC-1.2 compliant implementation of the UPC collectives library. All operations are implemented as thin wrappers around that library. In most cases, operands to this library are simple values, and nothing is required to be single-valued except for the data type in use and the root thread identifier (in the case of rooted collectives). The purpose of this wrapper is to provide convenience for scalar-based collective operations, especially in cases where there are not multiple values available to be communicated in aggregate (in which case the full array-based UPC collectives interface is likely to use fewer messages and achieve better performance) or for use in setup code (where performance is secondary to simplicity). See the collectivev documentation for full interface details.

The 'bupc_all_reduce_all' function family

This is an extension to the UPC Collectives Specification. The 'bupc_all_reduce_all' functions behave identically to the 'upc_all_reduce' functions, except that the 'dest' argument has the semantics of the 'dest' argument to 'upc_all_broadcast', i.e. the result of the reduction is broadcast to all threads, instead of just one.

The 'bupc_dump_shared' function

Pointers-to-shared in UPC are logically composed of three fields: a 'local address' of the object referenced by the pointer-to-shared, the identifier of the UPC thread with affinity to the referent object (the thread where that local address is valid), and the 'phase' of the pointer-to-shared (see the UPC Language Specification for an explanation of pointer-to-shared phase). Our version of UPC provides a 'bupc_dump_shared' function that will write a description of these fields into a character buffer that the user provides:
    int bupc_dump_shared(shared const void *ptr, char *buf, int maxlen);
Any pointer-to-shared may be passed to this function. The 'maxlen' parameter gives the length of the buffer pointed to by 'buf', and this length must be at least BUPC_DUMP_MIN_LENGTH, or else -1 is returned and errno is set to EINVAL. On success, the function returns 0. The buffer will then contain either "<NULL>" if the pointer-to-shared == NULL, or a string of the form
    "<address=0x1234 (addrfield=0x1234), thread=4, phase=1>" 
The 'address' field provides the virtual address for the pointer, while the 'addrfield' shows the actual contents of the pointer-to-shared address bits (as returned by upc_addrfield). On some configurations these values may be the same (if the full address of the pointer can be fit into the address bits), while on others they may be quite different (if the address bits store an offset from a base initial address that may differ from thread to thread).

Both bupc_dump_shared() and BUPC_DUMP_MIN_LENGTH are visible when any of the standard UPC headers (upc.h, upc_relaxed.h, or upc_strict.h) are #included.

The 'bupc_ptradd' function

Blocked pointers-to-shared in UPC are currently restricted to being declared with a compile-time constant block size. This can present problems in situations where the desired block size of a given array is input-dependent or otherwise unknown at compile time, and one wishes to conveniently access the array elements in layout order according to a specific block size.

The 'bupc_ptradd()' function provides support for performing pointer-to-shared arithmetic with variable blocksize, which need not be a compile-time constant.

  shared void * bupc_ptradd(shared void *p, size_t blockelems, size_t elemsz, ptrdiff_t elemincr);
    - 'p': the base pointer
    - 'blockelems': the block size (number of elements in a block)
    - 'elemsz': the element size (usually sizeof(*p))
    - 'elemincr': the positive or negative offset from the base pointer

The following call:

    bupc_ptradd(p, blockelems, sizeof(T), elemincr);
Returns a value q as if it had been computed:
    shared [blockelems] T *q = p;
    q += elemincr;
however, the blockelems argument is not required to be a compile-time constant. Blockelems must be non-negative, but may be zero to indicate an indefinite blocking factor. Here's an example of indexing into a dynamically-allocated array whose block size is not known until run time.
  int blockelems = ...; // choose some arbitrary block size

  // allocate an array of doubles with that blocksize
  shared void *myarr = upc_all_alloc(..., blockelems*sizeof(double)); 

  // access element 14
  double d = *(shared double *)bupc_ptradd(myarr, blockelems, sizeof(double), 14);

It's worth noting that in some cases bupc_ptradd() may be less efficient than regular pointer-to-shared addition, because the compile-time constant blocksize of the pointer referent type generally makes the latter more amenable to compiler optimization of the addition operation and surrounding code. This is especially true in the case of indefinitely-blocked or cyclically-blocked pointers-to-shared. However, the potential cost may be worth the added convenience in non-performance-critical code.
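
To make the "layout order according to a specific block size" concrete, here is a plain-C model of the standard UPC block-cyclic layout rule for a run-time block size. It is a sketch, not the Berkeley UPC implementation of 'bupc_ptradd()': the type 'elem_location' and the function 'locate_element' are hypothetical names, the model assumes the base pointer refers to element 0 of the array (thread 0, phase 0), and it handles only non-negative element indices.

```c
#include <stddef.h>

/* Hypothetical type (not a Berkeley UPC API): where one array element lives. */
typedef struct {
    size_t thread;      /* thread with affinity to the element */
    size_t phase;       /* position of the element within its block */
    size_t local_index; /* element offset within that thread's portion */
} elem_location;

/* Models the layout of element `index` in an array distributed
   block-cyclically over `threads` threads with `blockelems` elements per
   block. blockelems == 0 models an indefinite blocking factor, where every
   element has affinity to the same thread and the phase is always 0. */
elem_location locate_element(size_t index, size_t blockelems, size_t threads)
{
    elem_location loc;
    if (blockelems == 0) {
        loc.thread = 0;
        loc.phase = 0;
        loc.local_index = index;
        return loc;
    }
    size_t block = index / blockelems;          /* global block number */
    loc.thread = block % threads;               /* blocks are dealt round-robin */
    loc.phase = index % blockelems;             /* offset inside the block */
    loc.local_index = (block / threads) * blockelems + loc.phase;
    return loc;
}
```

For instance, with a block size of 4 on 3 threads, element 14 falls in global block 3, which wraps around to thread 0 at phase 2, as the seventh element (local index 6) stored on that thread.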

The 'bupc_poll' function

The 'bupc_poll()' function explicitly causes the UPC runtime to attempt to make progress on any network requests that may be pending.

You will normally not need to call this function, as the runtime will automagically perform checks for incoming network requests whenever your UPC code causes network activity to be performed, and this usually occurs fairly frequently in a UPC application. However, if you are writing your own 'spin lock' style synchronization, you may need to use this function to avoid deadlock. Here is an example:

    shared strict int flag[THREADS];

    ...

    if (MYTHREAD % 2) {
        while (flag[MYTHREAD] == 0)
            bupc_poll();
    } else {
        ... some calculation ...
        flag[MYTHREAD + 1] = 1;
    }
Here the 'even' UPC threads are performing some calculation, then informing the 'odd' threads that the result is ready by setting a per-thread flag. If the 'bupc_poll()' were omitted, the 'odd' threads might (on certain platforms/networks) consume all of the CPU forever in the 'while' test, never checking for the incoming network message that would set flag[MYTHREAD].

If a program contains computationally intensive sections in which no remote accesses are performed for a long time, it is also possible that performance may be improved by intermittently calling bupc_poll, particularly if other threads are likely to be performing communication (e.g. remote accesses, lock synchronization, shared memory allocation, etc.) during this time.

The 'bupc_assert_type' built-in (Berkeley UPC translator only)

The 'bupc_assert_type(expr, type)' built-in operation allows testing for compile-time type equality, and is primarily used by our UPC compiler test suite.
  1. 'expr' = any arbitrary (legal) UPC expression
  2. 'type' = any legal C/UPC type

If 'expr' has a static type which is identical to 'type', the operation does nothing. Otherwise, it prints a non-fatal warning containing the line number and a description of the two differing types.

High-precision wall-clock timer support (deprecated)

NOTICE: This Berkeley-specific extension is now officially deprecated in favor of the standardized version adopted into the official UPC specification. See the end of this section for details.

    typedef     ... bupc_tick_t; /* 64-bit integral type */
    #define     BUPC_TICK_MAX ...
    #define     BUPC_TICK_MIN ...
    bupc_tick_t bupc_ticks_now (void);
    uint64_t    bupc_ticks_to_us (bupc_tick_t ticks);
    uint64_t    bupc_ticks_to_ns (bupc_tick_t ticks);
    double      bupc_ticks_granularityus (void); 
    double      bupc_ticks_overheadus (void);
The 'bupc_tick_t' type and associated functions provide portable support for querying high-precision system timers for obtaining wall-clock timings of sections of code. Most CPU hardware offers access to high-performance timers with a handful of instructions, providing timer precision and overhead that can be several orders of magnitude better than can be obtained through the use of the gettimeofday() system call.

The 'bupc_tick_t' type represents an integral quantity of abstract timer ticks, whose ratio to real time is system-dependent and thread-dependent. bupc_ticks_now() returns the current value of the tick timer for the calling thread, using the fastest mechanism available. bupc_ticks_to_us() and bupc_ticks_to_ns() convert a difference in bupc_tick_t values obtained by the calling thread into microseconds or nanoseconds, respectively. The bupc_ticks_to_{us,ns}() conversion calls can be significantly more expensive than the bupc_ticks_now() tick query, so for timing short intervals it's recommended to keep timing results in units of ticks until final output. BUPC_TICK_MAX and BUPC_TICK_MIN provide tick values which are respectively larger and smaller than any possible tick value. bupc_ticks_granularityus() and bupc_ticks_overheadus() respectively report the estimated microsecond granularity (minimum time between distinct ticks) and microsecond overhead (time it takes to read a single tick value, not including conversion) for the timer facility.

Example:

  bupc_tick_t start = bupc_ticks_now();
    compute_foo(); /* do something that needs to be timed */
  bupc_tick_t end = bupc_ticks_now();

  printf("Time was: %d microseconds\n", (int)bupc_ticks_to_us(end-start));

  printf("Timer granularity: <= %.3f us, overhead: ~ %.3f us\n",
       bupc_tick_granularityus(), bupc_tick_overheadus());
  printf("Estimated error: +- %.3f %%\n",
      100.0*(bupc_tick_granularityus()+bupc_tick_overheadus()) /
            bupc_ticks_to_us(end-start));
It's important to keep in mind that raw bupc_tick_t values are thread-specific quantities with a thread-specific interpretation (e.g. they might represent a hardware cycle count on a particular CPU, starting at some arbitrary time in the past). More specifically, raw ticks do NOT provide a globally-synchronized timer (i.e. the simultaneous absolute tick values may differ across threads), and furthermore the tick-to-wallclock conversion ratio might also differ across threads (e.g. on a cluster with heterogeneous CPU clock rates, the raw tick values may advance at different rates for different threads). Therefore as a rule of thumb, raw bupc_tick_t values and bupc_tick_t intervals obtained by different threads should never be directly compared or arithmetically combined, without first converting the relevant tick intervals to wall time intervals.
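
The error estimate in the example above divides by the measured interval, which may convert to zero microseconds when the timed region is very short. The following plain-C sketch applies the same formula with a guard for that case. The function name 'timing_error_percent' and the -1.0 sentinel are assumptions made for illustration; they are not part of Berkeley UPC.

```c
/* Hypothetical helper (not a Berkeley UPC API).
   Estimates the relative timing error, in percent, from the timer
   granularity, the timer read overhead, and the measured interval, all in
   microseconds. Returns -1.0 when the interval is too short to estimate. */
double timing_error_percent(double granularity_us, double overhead_us,
                            double elapsed_us)
{
    if (elapsed_us <= 0.0)
        return -1.0;   /* interval rounded down to zero: no meaningful ratio */
    return 100.0 * (granularity_us + overhead_us) / elapsed_us;
}
```

In a UPC program the three arguments would come from the granularity and overhead queries and from the converted tick difference described above.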

NOTICE: The following interfaces have been adopted in the UPC Required Library Specifications, Version 1.3, with semantics similar to those described above, and are available in Berkeley UPC beginning with the 2.16 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. The Berkeley-specific 'bupc_' prefixed variants are now officially deprecated in favor of the standardized variant, and will be removed in a future version.

    #define     __UPC_TICK__  1  // predefined feature macro
    #include    <upc_tick.h>     // defines the following:
    typedef     ... upc_tick_t;
    #define     UPC_TICK_MAX ...
    #define     UPC_TICK_MIN ...
    upc_tick_t  upc_ticks_now (void);
    uint64_t    upc_ticks_to_ns (upc_tick_t ticks);

Runtime thread layout query for hierarchical systems

    unsigned int bupc_thread_distance(int threadX, int threadY); 
    #define     BUPC_THREADS_SAME     ...
    #define     BUPC_THREADS_VERYNEAR ...
    #define     BUPC_THREADS_NEAR     ...
    #define     BUPC_THREADS_FAR      ...
    #define     BUPC_THREADS_VERYFAR  ...
bupc_thread_distance takes two thread identifiers (whose values must be in 0..THREADS-1, otherwise behavior is undefined), and returns an unsigned integral value which represents an approximation of the abstract 'distance' between the hardware entity which hosts the first thread, and the hardware entity which hosts the memory with affinity to the second thread. In this context 'distance' is intended to provide an approximate and relative measure of expected best-case access time between the two entities in question. Several abstract 'levels' of distance are provided as pre-defined constants for user convenience, which represent monotonically non-decreasing 'distance':

These constants have implementation-defined integral values which are monotonically increasing in the order given above. Implementations may add further intermediate levels with values between BUPC_THREADS_VERYNEAR and BUPC_THREADS_VERYFAR (with no corresponding define) to represent deeper hierarchies, so users should test against the constants using <= or >= instead of ==.

The intent of the interface is for users to not rely on the physical significance of any particular level and simply test the differences to discover which threads are relatively closer than others. Implementations are encouraged to document the physical significance of the various levels whenever possible (see below), however any code based on assuming exactly N levels of hierarchy or a fixed significance for a particular level will probably not be performance portable to different implementations or machines.

The relation is symmetric, i.e.: bupc_thread_distance(X,Y) == bupc_thread_distance(Y,X)
but the relation is not transitive, i.e.: bupc_thread_distance(X,Y) == A && bupc_thread_distance(Y,Z) == A does NOT imply bupc_thread_distance(X,Z) == A

Furthermore, the value of bupc_thread_distance(X,Y) is guaranteed to be unchanged over the span of a single program execution, and the same value is returned regardless of the thread invoking the query.

Currently the significance of the BUPC_THREADS_* constants is as follows:

  Value: Meaning(s)
    - BUPC_THREADS_SAME: Only returned when threadX == threadY.
    - BUPC_THREADS_VERYNEAR: threadX and threadY will communicate through shared memory. May include pthreads in the same process when compiled with -pthreads, and processes in the same compute node when PSHM support is available.
    - BUPC_THREADS_NEAR: threadX and threadY are in the same compute node, but will communicate using the network API. This may occur because either PSHM support is not available or the -pshm-width flag to upcrun has placed these threads in disjoint shared memory domains.
    - BUPC_THREADS_FAR: This value is not currently used.
    - BUPC_THREADS_VERYFAR: threadX and threadY are on different compute nodes.
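
The advice to compare distances with <= or >= rather than == can be sketched in plain C as follows. The DEMO_* macros are stand-ins with arbitrarily chosen values, used only so the sketch compiles without a UPC compiler; the real BUPC_THREADS_* values are implementation-defined. The function names are hypothetical, and their interpretation relies on the current significance of the levels listed above, which may differ on other implementations.

```c
/* Stand-in values for illustration only. The real BUPC_THREADS_* constants
   have implementation-defined values; only their ordering is guaranteed. */
#define DEMO_THREADS_SAME      0u
#define DEMO_THREADS_VERYNEAR 10u
#define DEMO_THREADS_NEAR     20u
#define DEMO_THREADS_FAR      30u
#define DEMO_THREADS_VERYFAR  40u

/* Hypothetical helper: true if the two threads are at least as close as
   VERYNEAR, which currently means they communicate through shared memory. */
int shares_memory(unsigned int distance)
{
    return distance <= DEMO_THREADS_VERYNEAR;
}

/* Hypothetical helper: true if the two threads are at least as close as
   NEAR, which currently means they are in the same compute node. An
   unnamed intermediate level below NEAR is still classified correctly,
   which an == test against the named constants would miss. */
int on_same_node(unsigned int distance)
{
    return distance <= DEMO_THREADS_NEAR;
}
```

In a UPC program the argument would be the result of bupc_thread_distance(MYTHREAD, t), and the DEMO_* macros would be replaced by the corresponding BUPC_THREADS_* constants.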

Castability of pointers-to-shared

Converting a pointer to "remote" shared data into a pointer-to-local (deprecated)

NOTICE: This Berkeley-specific extension is now officially deprecated in favor of the standardized version adopted into the official UPC specification. See the end of this section for details.

    int bupc_castable(shared void *ptr);
    int bupc_thread_castable(unsigned int threadnum);
    void * bupc_cast(shared void *ptr);

This family of functions implements a UPC language extension proposed by Brian Wibecan of HP. Their purpose is to allow a UPC programmer to take advantage of UPC implementations in which some or all of the shared data with affinity to a given UPC thread can be directly addressed by other UPC threads using a pointer-to-local.

We use the term 'castable' to denote that the UPC implementation is able to represent a given pointer-to-shared using a pointer-to-local on a given thread. Any pointer-to-shared with affinity to a thread is guaranteed (by the language spec) to be castable by that same thread. However, in general shared storage with affinity to one thread is not castable by other threads. Depending on the UPC implementation, it is possible that for a given pair of threads either all, none, or only some of the shared address space with affinity to the first may be castable by the second.

bupc_castable() takes a pointer-to-shared as argument and returns non-zero if and only if the argument is castable by the calling thread. It is guaranteed that a call to bupc_castable() with an argument having affinity to the calling thread will always return non-zero.

bupc_thread_castable() takes a UPC thread number as argument and returns non-zero if and only if every pointer-to-shared with affinity to the argument thread is castable by the calling thread. It is guaranteed that bupc_thread_castable(MYTHREAD) is always non-zero.

bupc_cast() takes a pointer-to-shared as argument and returns a pointer-to-local. The returned pointer may be used to reference the same object as the argument only if the argument pointer is castable by the calling thread, as may be determined by bupc_castable() or bupc_thread_castable(). Otherwise the returned pointer is NULL.

NOTICE: The following interfaces have been adopted in the UPC Optional Library Specifications, Version 1.3, with semantics similar to those described above, and are available in Berkeley UPC beginning with the 2.16.2 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. The Berkeley-specific 'bupc_' prefixed variants are now officially deprecated in favor of the standardized variant, and will be removed in a future version.

    void *upc_cast(const shared void *ptr);
    upc_thread_info_t upc_thread_info(size_t threadnum);

Converting a pointer-to-local referencing shared data into a pointer-to-shared

In addition to the functions described in the previous section, Berkeley UPC 2.14.2 and later implement an 'inverse cast' function:

    shared void * bupc_inverse_cast(void *ptr);
This function takes a pointer-to-local argument and returns a pointer-to-shared (with zero phase) referencing the same location if and only if the argument references a UPC shared object. If the argument is NULL or references a location not in the UPC shared space, then a null pointer-to-shared is returned.

Atomic Memory Operations

The UPC Optional Library Specifications, Version 1.3 adds a standardized UPC Atomic Memory Operations library in <upc_atomic.h>. This standardized atomics interface is fully implemented in this version of Berkeley UPC. The former Berkeley-specific atomics extensions (the 'bupc_atomic*' function family) were subsumed by the 1.3 standardized atomics interface, and (after a lengthy deprecation period) that proprietary function family has now been removed. Berkeley UPC provides several new extensions to the 1.3 specified atomics interface, described below.

Atomic Domain creation hints

Berkeley UPC defines several extended hint values for the upc_atomichint_t hints argument to upc_all_atomicdomain_alloc() to control the behavior of the created atomic domain:

Currently the NEAR/FAR distinction corresponds to whether the threads in question communicate through cache-coherent shared memory (NEAR), or over a system-level network (FAR). These hints are mutually exclusive, and the default behavior is system-specific.

Atomic domain hint values are guaranteed to be macros, thus the recommended portable means to use the hints described above is to protect their use with #ifdef, for example:

  #ifdef UPC_ATOMIC_HINT_FAVOR_FAR
    upc_atomichint_t hint = UPC_ATOMIC_HINT_FAVOR_FAR;
  #else
    upc_atomichint_t hint = 0;
  #endif

  // create an atomic domain for atomic get, set and increment operations on 64-bit unsigned integers,
  // using network offload hardware where possible
  upc_atomicdomain_t *my_ad = upc_all_atomicdomain_alloc(UPC_UINT64, UPC_SET|UPC_GET|UPC_INC, hint);

  // perform an atomic fetch-increment on A[i]
  uint64_t result;
  upc_atomic_relaxed(my_ad, &result, UPC_INC, &A[i], 0, 0);

Collective deallocation functions (deprecated)

NOTICE: This Berkeley-specific extension is now officially deprecated in favor of the standardized version adopted into the official UPC specification. See the end of this section for details.

    void bupc_all_free(shared void *ptr);
    void bupc_all_lock_free(upc_lock_t *lockptr);

These two functions implement collective alternatives to the standard functions upc_free() and upc_lock_free(), as a convenience to the programmer. Both functions must be called collectively by all threads with the same argument. The object referenced by the argument is guaranteed to remain valid until all threads have entered the collective deallocation call, but the function does not otherwise guarantee any synchronization or strict reference. In all other respects the semantics of these functions and constraints on their usage are identical to their non-collective variants.

NOTICE: The following interfaces have been adopted in the UPC Language Specifications, Version 1.3, with the same semantics as those described above, and are available in Berkeley UPC beginning with the 2.16 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. The Berkeley-specific 'bupc_' prefixed variants are now officially deprecated in favor of the standardized variant, and will be removed in a future version.

    void upc_all_free(shared void *ptr);
    void upc_all_lock_free(upc_lock_t *lockptr);

The 'bupc_system' function

Berkeley UPC provides 'bupc_system()' as a drop-in replacement for C99 'system()'. This implementation is designed to avoid adverse interactions between fork() and RDMA-capable networking libraries.


Known bugs and limitations

This release of Berkeley UPC has a number of known limitations and bugs:

Implicit library calls may modify errno

The C99 standard gives the following semantics for errno:
The value of errno is zero at program startup, but is never set to zero by any library function. The value of errno may be set to nonzero by a library function call whether or not there is an error, provided the use of errno is not documented in the description of the function in this International Standard.
These semantics are actually somewhat weaker than one might hope - specifically, they allow library calls which succeed to change errno to a non-zero value. In practice many C/POSIX library implementations actually do this.

The problem in the context of Berkeley UPC and its source-to-source translation is that there is one copy of errno per UPC thread which is shared by both the generated code representing translated UPC code, and all the runtime libraries running underneath it (including UPCR, GASNet, vendor network libs, etc.). Furthermore, many actions in UPC which do not qualify as library calls at UPC level (e.g. dereferencing a pointer-to-shared) result in library calls within the generated code. Consequently, the value of errno set by a failed library call invoked at the UPC source level may be subsequently overwritten by any of these implicit library calls.

While one could imagine the Berkeley UPC compiler and runtime taking action to preserve the value of errno across all the implicit library calls, doing so would adversely affect performance and we do not currently take this approach. This means that a UPC user who wants to inspect the value of errno after a failed library call they make must do so immediately - not just before the next UPC-level library call, but also before taking any action that might possibly invoke implicit library calls in the generated source code.

Basically, the only 100% safe way for a UPC program to read errno when using Berkeley UPC is to copy it into a local variable immediately after the failed library call returns. This is the "recommended practice" for using errno with Berkeley UPC.
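
The recommended practice can be sketched in plain C as follows. The helper name 'open_or_save_errno' is hypothetical, and the sketch assumes a POSIX-style 'fopen' that sets errno on failure. The point is only the ordering: errno is copied into a local variable in the statement immediately following the library call, before anything else runs.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper (not a Berkeley UPC API).
   Tries to open `path` for reading. On failure, returns NULL and stores a
   snapshot of errno in *saved_errno; on success, stores 0 there. */
FILE *open_or_save_errno(const char *path, int *saved_errno)
{
    errno = 0;
    FILE *fp = fopen(path, "r");
    int err = errno;   /* copy immediately, before any other operation */

    /* From here on, only the local copy is consulted. In a UPC program,
       later shared accesses could overwrite errno itself. */
    *saved_errno = (fp == NULL) ? err : 0;
    return fp;
}
```

The caller can then report the failure later, for example with strerror(saved), without depending on the live value of errno.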

Preprocessor macros defined in UPC files must not affect .h files

Berkeley UPC translates your UPC programs into C code, then runs a regular C compiler on your system to generate object code. To avoid handling vendor-specific inline assembly code that appears in some header files on many of the various systems we run on, we currently have our UPC-to-C translator 'put back' all non-UPC header files (i.e., .h files which don't contain any UPC constructs), which are then handled by the regular C compiler (we do not support placing inline assembly in your UPC code). A side effect of this process is that the preprocessor is run twice on your program. Since any #defined macros you place in your UPC code are expanded (and their definitions forgotten) the first time the preprocessor is run, these macros will not be present the second time .h files are included. Thus, UPC code such as
    #define NDEBUG
    #include <assert.h>
will not work as expected if the NDEBUG definition modifies the behavior of assert.h (which, in this example, it does: this NDEBUG/assert.h case is the most common case where users run into this issue with our compiler).

There is a simple workaround: if you need to define a macro that affects the behavior of #included files, define it on the command line to upcc:

    upcc -DNDEBUG myprogram.upc

Behavior of the 'getenv/setenv' functions

It is not well-defined in the UPC specification whether the standard 'getenv' function should return the same values on all threads, and/or if these values should include those present in the environment of the process that launches the UPC application.

Berkeley UPC guarantees that 'getenv' allows retrieval of certain environment variable values that were present when the job was launched. At present this function is only guaranteed to retrieve these values for all threads if the environment variable's name begins with 'UPC_' or 'GASNET_'. On some platforms all environment variables seen by the job launcher may be propagated, but it is not portable to rely on this.

The 'setenv' and 'unsetenv' functions are not guaranteed to work in a Berkeley UPC runtime environment, and should be avoided.
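
One way to stay within the portable subset described above is to route all environment lookups through a small wrapper. The sketch below is an assumption-laden illustration, not a Berkeley UPC API: the helper names is_propagated_name and get_job_setting are hypothetical. It only consults 'getenv' for names beginning with 'UPC_' or 'GASNET_' and falls back to a caller-supplied default otherwise.

```c
#include <stdlib.h>
#include <string.h>

/* Returns nonzero if `name` begins with one of the two prefixes for which
 * the guide says getenv is guaranteed to work on all threads. */
int is_propagated_name(const char *name)
{
    return strncmp(name, "UPC_", 4) == 0 ||
           strncmp(name, "GASNET_", 7) == 0;
}

/* Hypothetical helper: look up a job-wide setting. Names without a
 * guaranteed prefix are never queried, so every thread sees the same
 * answer regardless of the platform's propagation behavior. */
const char *get_job_setting(const char *name, const char *fallback)
{
    const char *value;

    if (!is_propagated_name(name))
        return fallback;

    value = getenv(name);
    return value != NULL ? value : fallback;
}
```

An application might then read its own options through a name such as 'UPC_MYAPP_VERBOSE' (a made-up variable used here purely for illustration), set in the environment before the job is launched rather than with 'setenv' at run time.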

Correctness when using GCC 4.x (x<3) as the C compiler

There is a known correctness problem in the optimizer in gcc 4.0.x through 4.2.x that may affect correctness of shared-local accesses in UPC (i.e., shared accesses that result in node-local accesses at runtime). In a nutshell, it's possible that in rare cases these compilers may misoptimize a shared-local access such that it deterministically reads or writes an incorrect value. For this reason, configure will not allow you to use one of these compilers without an explicit option: '--enable-allow-gcc4'. If you do configure with '--enable-allow-gcc4', then you may encounter the optimizer bug. If you suspect you may be encountering this issue, the following actions are recommended for diagnosis:
  1. Try compiling your code in debug mode (i.e., with 'upcc -g'). If the problem persists, then this issue is *not* the culprit.
  2. Try compiling your code using the flag 'upcc -Wc,-fno-strict-aliasing'. If the problem persists, then this issue is *not* the culprit.
  3. Run your code several times. If the problem is intermittent, then this issue is probably not the culprit (the optimizer bug is deterministic).

If you still believe you are encountering this issue, there are several recommended workarounds:

  1. Configure BUPC to use a different backend C compiler. If you have a non-gcc vendor C compiler available, this may actually be a better choice for performance anyhow. Failing that, using gcc >= 4.3 (or gcc 3.x) should also resolve the issue, as the bug is only believed to be present in gcc 4.0.x through 4.2.x.
  2. Build the affected modules using the flag 'upcc -Wc,-fno-strict-aliasing'. This makes the gcc 4.x optimizer more conservative, and also inhibits the illegal optimization.
  3. Reconfigure BUPC using 'configure --enable-conservative-local-copy'. This globally activates a more conservative implementation of shared-local accesses that also prevents the illegal optimization.
The performance impact of the workarounds above is expected to be application-dependent.

GUPC+UPCR with -pthreads

GUPC+UPCR has a known problem in -pthreads compilation mode, whereby programs with a significant amount of statically-allocated private data may fail at program initiation time with an error message like:
    
    UPC Runtime error: pthread_create: Invalid argument
Users encountering this error are advised to work around it by either using the BUPC translator (which does not exhibit the problem), or reworking their program to use less statically-allocated private data.
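
As a rough illustration of the second workaround, a large statically-allocated private array can be replaced by a pointer that is filled in from the heap at startup. The sketch below is plain C under stated assumptions: the names GRID_CELLS, grid, grid_init and grid_free are hypothetical, and whether this change is sufficient to avoid the error will depend on how much static private data remains in the program.

```c
#include <stdlib.h>

/* Hypothetical problem size, chosen only for illustration. */
#define GRID_CELLS ((size_t)1024 * 1024)

/* Before: a large statically-allocated private array, e.g.
 *     static double grid[GRID_CELLS];
 * After: only a pointer lives in static storage; the data itself is
 * allocated on the heap when the program starts. */
static double *grid = NULL;

/* Allocate the array; calloc zero-fills it, matching the initial
 * contents a static array would have had. Returns 0 on success. */
int grid_init(void)
{
    grid = calloc(GRID_CELLS, sizeof *grid);
    return grid != NULL ? 0 : -1;
}

/* Release the array and reset the pointer. */
void grid_free(void)
{
    free(grid);
    grid = NULL;
}
```

Each thread would call grid_init once near the start of main and grid_free before exiting; the rest of the code can keep indexing grid as before.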

Other known limitations/bugs


Platform-specific issues

Running into maximum size limits on pinning-based networks

On systems that pin RDMA-addressable memory (such as InfiniBand), the amount of shared memory that a default Berkeley UPC build can provide to a UPC program will be no larger than the maximum region that the OS and network drivers allow to be pinned at once. While this is typically a large fraction of physical memory, it may prove insufficient for your application. In this case, a "large segment" mode is available, which imposes a slight performance overhead in some situations, but which provides the maximum possible UPC shared memory space. To use large segment mode, the Berkeley UPC runtime needs to be reconfigured with '--enable-segment-large', and rebuilt. When using PSHM over POSIX shared memory (the default under Linux), it has also been observed that GASNet's InfiniBand support is unable to register a UPC shared heap as large as when using PSHM over SystemV shared memory. If you are using ibv-conduit on Linux and see crashes at startup with large UPC shared heap sizes, then we recommend reconfiguring your runtime with '--disable-pshm-posix --enable-pshm-sysv' before trying '--enable-segment-large'.


Feedback

Please contact us with your bug reports, comments, and suggestions.

Thank you for using Berkeley UPC!