Berkeley UPC - Unified Parallel C

(A joint project of LBNL and UC Berkeley)
Berkeley UPC User's Guide version 2.18.0


This guide tells you how to use Berkeley UPC, which is a portable implementation of Unified Parallel C (UPC) that runs on many different parallel systems available today.



UPC standards and APIs supported by this version of Berkeley UPC

This version of Berkeley UPC includes


A Sample UPC Program

Here is a simple hello-world program written in UPC:
    #include <upc_relaxed.h>
    #include <stdio.h>

    int main() {
      printf("Hello from thread %i/%i\n", MYTHREAD, THREADS);
      upc_barrier;
      return 0;
    }
This program prints a message once from each thread (in some arbitrary interleaving), executes a barrier (optional), and exits.

For more involved examples of UPC code, see the UPC Language Tutorials from the UPC Language Community website and the 'upc-examples' directory of the Berkeley UPC runtime distribution. The official UPC language specification is a useful reference, and contains a description of the standard libraries.


Compiling UPC programs with 'upcc'

The upcc front end is used to compile UPC programs. It is designed with an interface that is very similar to the standard GNU gcc compiler, for ease of use. For instance, you could compile a physics simulation called 'light' from two source files via
    upcc -o light particle.upc wave.c -lgrottymath
Note that 'wave.c' can contain either UPC code or regular C code, and the 'grottymath' library that is linked into the application can be a regular C library: Berkeley UPC is fully interoperable with regular C source, object, and library files (note: if you compile with the -pthreads flag, any C libraries you use must be thread-safe). Berkeley UPC 2.0 also adds support for linking C++/FORTRAN/MPI objects into a UPC executable: see Mixing C/C++/MPI/FORTRAN with UPC.

upcc recognizes most commonly used C compiler flags (-D, -I, etc.). It also uses a number of its own flags for the choice of network API your program will run over, for compiling your UPC code for a static number of threads, and other UPC-specific options. See the upcc man page for details.
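For instance, a build that passes preprocessor flags and splits compilation from linking might look like the following (the file names and macro here are purely illustrative):
    upcc -DUSE_FIXED_GRID -I$HOME/include -c solver.upc
    upcc -DUSE_FIXED_GRID -I$HOME/include -o solver solver.o driver.c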

Choosing a network API for your UPC executable

Berkeley UPC executables are always compiled to run over a particular network API. To choose which network API is used, pass the '--network' flag with one of the following values:

Name Description
ibv OpenFabrics (aka OpenIB) InfiniBand Verbs for InfiniBand networks
mxm MXM API for recent Mellanox InfiniBand HCAs.
shmem SHMEM API for SGI Altix systems and the Cray X1. Other systems providing a SHMEM API may also work, but have not been tested.
portals4 Portals 4.x API.
aries GNI API for Cray XC systems running CLE.
gemini GNI API for Cray XE and XK systems running CLE.
pami Parallel Active Message Interface for several IBM platforms, including BlueGene/Q.
dcmf Deep Computing Messaging Framework for IBM BlueGene/P systems.
udp UDP: works on any system with a standard TCP/IP stack, but is typically slower than using one of the native network types. Generally the fastest option for systems with only Ethernet hardware (notably faster than MPI-over-TCP).
mpi MPI: works on any system with MPI installed, but is typically slower than using one of the other network types.
smp "Symmetric multiprocessor (SMP)" mode: uses no network. Currently runs with only a single process unless your runtime has been configured with --enable-pshm (currently default only on Linux). Otherwise, you must pass -pthreads to upcc to run smp-conduit with multiple UPC threads.

Note that you can only compile for a given network type if your Berkeley UPC runtime was configured to support it at build/installation time. To see which APIs are supported in your installation, and to see which is used by default, use 'upcc --version'.
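For example, to check the configured network APIs and then build for one of them (the application name is illustrative, and the 'ibv' choice assumes your installation was configured with InfiniBand Verbs support):
    upcc --version                  # lists the supported and default network APIs
    upcc --network=ibv -o myapp myapp.upc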

Compiling for a fixed number of UPC threads

The '-T <number>' option to upcc causes your executable to be built for a fixed number of UPC threads. Alternatively, you can set 'UPCC_FIXED_THREADS=<number>' in your environment (the '-T' flag overrides the environment setting if both are present).

An executable compiled for a fixed number of UPC threads will fail at startup if you try to run it with a different number of threads. However, fixing the number of threads allows optimization on certain operations (such as shared pointer arithmetic), especially when the number of threads is a power of 2.
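For example, either of the following builds an executable fixed at 16 UPC threads (the file names are illustrative):
    upcc -T 16 -o stencil stencil.upc
    UPCC_FIXED_THREADS=16 upcc -o stencil stencil.upc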

Overriding global upcc.conf settings in a user configuration file

upcc gets global settings for your installation from a upcc.conf file which is created during the configuration stage of a runtime installation. After installation the file is located in the $prefix/<config>/etc directory of your installation. You can create a user configuration file $HOME/.upccrc to override any of these settings. See the upcc man page for a list of available settings.

UPC Standard and Berkeley-specific preprocessor macros

Programs compiled with Berkeley UPC will see all of the preprocessor macros provided by your backend C compiler, plus the following:

Name Value Description Standard
__UPC__ 1 Defined by any UPC implementation UPC language
__UPC_COLLECTIVE__ 1 Defined by UPC implementations supporting the UPC Collective Utilities UPC language
__UPC_IO__ 1 Defined by UPC implementations supporting the optional UPC Parallel I/O Extensions UPC language
__UPC_TICK__ 1 Defined by UPC implementations supporting UPC High-Performance Wall-Clock Timers UPC 1.3 draft specification
__UPC_CASTABLE__ 1 Defined by UPC implementations supporting the optional UPC Castability Functions UPC 1.3 draft specification
__UPC_NB__ 1 Defined by UPC implementations supporting the optional UPC Non-Blocking Transfer Operations UPC 1.3 draft specification
__UPC_VERSION__ Monotonically increasing positive integer constant UPC specification supported: value is YYYYMM date of that version's ratification (ex: '200310L') UPC language
__UPC_STATIC_THREADS__ 1 if static threads: else undefined Set to 1 if the '-T' flag was passed to upcc UPC language
__UPC_DYNAMIC_THREADS__ 1 if dynamic threads: else undefined Set to 1 unless the '-T' flag was passed to upcc UPC language
__UPC_PUPC__ 1 Defined by UPC implementations supporting the GASP interface GASP 1.5 specification
__BERKELEY_UPC__ Monotonically increasing positive integer constant The major version number of the Berkeley UPC release. Example: '1' for release '1.0.3'. Berkeley UPC only
__BERKELEY_UPC_MINOR__ An integer constant The minor version number of the Berkeley UPC release. Example: '0' for release '1.0.3'. Berkeley UPC only
__BERKELEY_UPC_PATCHLEVEL__ An integer constant The patch version number of the Berkeley UPC release. Example: '3' for release '1.0.3'. Berkeley UPC only
__BERKELEY_UPC_<NETWORK>_CONDUIT__ 1, or undefined Identifies the network API used. Example: if 'upcc -network=mpi' is used, '__BERKELEY_UPC_MPI_CONDUIT__' will be defined, with the value of 1 Berkeley UPC only
__BERKELEY_UPC_PSHM__ 1, or undefined Defined to 1 if and only if PSHM support is enabled Berkeley UPC only
__BERKELEY_UPC_PTHREADS__ 1, or undefined Defined to 1 if and only if the '-pthreads' flag is used Berkeley UPC only
__BERKELEY_UPC_RUNTIME__ 1, or undefined Defined to 1 if and only if the Berkeley UPC runtime is used, regardless of whether the Berkeley UPC translator or GUPC is used Berkeley UPC and GUPC+UPCR
__BERKELEY_UPC_RUNTIME_DEBUG__ 1, or undefined Defined to 1 if and only if a debugging runtime used (i.e. '-g' passed to upcc). Berkeley UPC and GUPC+UPCR
__BERKELEY_UPC_RUNTIME_RELEASE__ An integer constant, or undefined The major version number of the Berkeley UPC Runtime library. Example: '2' for release '2.12.0'. Berkeley UPC and GUPC+UPCR. Undefined prior to release 2.12.0
__BERKELEY_UPC_RUNTIME_RELEASE_MINOR__ An integer constant, or undefined The minor version number of the Berkeley UPC Runtime library. Example: '12' for release '2.12.0'. Berkeley UPC and GUPC+UPCR. Undefined prior to release 2.12.0
__BERKELEY_UPC_RUNTIME_RELEASE_PATCHLEVEL__ An integer constant, or undefined The patch version number of the Berkeley UPC Runtime library. Example: '0' for release '2.12.0'. Berkeley UPC and GUPC+UPCR. Undefined prior to release 2.12.0
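As a sketch of how these macros might be used, the following (hypothetical) fragment reports the Berkeley UPC release and adapts to the threading model chosen at compile time:
    #include <upc_relaxed.h>
    #include <stdio.h>

    int main() {
    #ifdef __BERKELEY_UPC__
      printf("Berkeley UPC release %d.%d.%d\n", __BERKELEY_UPC__,
             __BERKELEY_UPC_MINOR__, __BERKELEY_UPC_PATCHLEVEL__);
    #endif
    #ifdef __UPC_STATIC_THREADS__
      printf("compiled with '-T' for a fixed number of threads\n");
    #else
      printf("compiled for a dynamic number of threads\n");
    #endif
    #if __UPC_VERSION__ >= 200310L
      /* language features from that specification revision may be assumed here */
    #endif
      return 0;
    }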

Using a remote UPC-to-C translator

The upcc front end has the ability to use a UPC-to-C translator located on a remote machine. This is provided both as a convenience (the translator takes much longer to build than the runtime, and we provide a public HTTP translator that allows users to get started with Berkeley UPC more quickly), and to support the many systems on which our translator does not build, due to C++ portability issues.

A remote translator can be used either over HTTP or SSH. To use HTTP, the 'upcc.cgi' CGI script (located in the 'contrib' directory of the runtime distribution) must be installed and configured with a web server on the remote host. Simply set the 'translator' parameter in your user configuration file (or the global 'upcc.conf') to the URL for the CGI script. To use SSH, you must be able to log in to the remote host using SSH, and the 'translator' parameter must be set to 'remote_host:/path/to/translator'. You will want to use key-based authentication and 'ssh-agent' to avoid entering your password each time you compile. See our SSH Agent Tutorial.

When using an HTTP-based remote translator, upcc also includes support for use of an HTTP proxy. Set the 'http_proxy' parameter in your user configuration file (or the global 'upcc.conf') to the proxy URL. The upcc front end does not currently support HTTPS or SOCKS proxies, nor HTTP proxies that require authentication (HTTP error 407).


Creating libraries of UPC code

At present you cannot create traditional C-style libraries with UPC code in them using Berkeley UPC (i.e. you cannot successfully use 'ar' to create 'libmyupc.a').

If you wish to create a reusable set of compiled code, you must currently keep the files in *.o format. So, instead of the traditional C format, where you'd create 'libmyupc.a', and then link with something like

    upcc myprogram.o -L/libpath -lmyupc
You must instead do something like
    upcc myprogram.o /libpath/libmyupc/*.o 
Note that beginning with Berkeley UPC 2.12.0 it is possible to link together static threads and dynamic threads objects, with the result being a static threads executable. In many cases this allows use of a dynamic threads object in the role of a library, which can be linked to an executable with any dynamic or static thread setting.


Running UPC programs

If you compile a UPC program with '--network=smp', you can run the executable normally (the same way you'd run 'ls' or 'grep'). Otherwise, you are generating an executable that uses a parallel network API, and this typically means your executable will need some special treatment to be launched correctly.

Berkeley UPC executables should be run the same way as any other parallel program on your system that uses the same underlying network API. So, for instance, a program compiled with '--network=mpi' is run on many systems via 'mpirun -np <number of processes> a.out'. Other systems may use other invocations, such as 'prun' or 'poe', especially when APIs other than MPI are used. Consult your system's documentation for details.

Using 'upcrun'

The 'upcrun' script that is installed as part of the Berkeley UPC runtime is our attempt to provide a standard interface for running UPC programs. If your installation has configured 'upcrun.conf' correctly (in many cases the defaults will work), you can run UPC programs portably via commands like
    upcrun -n 4 parboil
This example runs the UPC executable 'parboil' on 4 nodes.

An additional benefit of using upcrun is that it provides consistent support for propagating environment variables to all threads of your UPC program. If you use upcrun, any environment variable beginning with either 'UPC_' or 'GASNET_' is guaranteed to be propagated to all threads. (Support for propagating all environment variables is planned). If you do not use upcrun, environment propagation will only work to the extent that the parallel job launcher you use provides it normally.

You can see how upcrun thinks your job should be run without actually running it by passing the '-t' flag to it. Also, 'upcrun -i <executable>' will provide information about a Berkeley UPC executable, such as the network API that it was built against, and the number of fixed threads (if any) that it was compiled for.

See 'upcrun --help' or the upcrun man page for more information.

Setting the amount of shared memory available to your applications

At startup, each Berkeley UPC thread reserves a fixed portion of its address space (via the 'mmap()' system call) for shared memory. This address range can not be used for regular unshared (i.e., malloc) memory allocations, and it also serves as a maximum value on the amount of shared memory (per-thread) that the program can use: a UPC program will die with a fatal error if any thread tries to allocate more shared memory than it reserved at startup.

The default amount of shared memory to reserve per UPC thread on a system is chosen at configure time (see the INSTALL.TXT document in the runtime distribution for details), but you can override that value for a particular application either at compile time, or at startup. Generally this is only needed if you observe that your application is running out of either shared or regular C memory.

To embed a different default amount of shared memory into your application, simply pass '-shared-heap=144MB' for instance (to get 144 megabytes per UPC thread). You can also use 'GB' for gigabyte amounts (if neither 'MB' nor 'GB' is used, megabytes are assumed). To override the embedded default amount of shared memory at application startup, set the UPC_SHARED_HEAP_SIZE environment variable to whatever value you want ('2GB', etc.), or pass '-shared-heap' to upcrun.
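For example (the application name is illustrative, and the upcrun '-shared-heap' flag is assumed to take the same size argument form as the upcc flag):
    upcc -shared-heap=512MB -o fluid fluid.upc      # embed a 512 MB per-thread default
    env UPC_SHARED_HEAP_SIZE=2GB upcrun -n 8 fluid  # override the default at startup
    upcrun -shared-heap=2GB -n 8 fluid              # equivalent, via the upcrun flag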

While it is tempting to simply grab an extremely large shared memory segment, be aware that this is not always a good idea, or even possible. Since the shared address space range cannot be used for regular malloc allocations, creating too large a shared space can cause the amount of regular heap memory available to your application to become small (causing malloc to eventually return NULL when you request more memory). Also, the shared memory space is reserved via an mmap() call, and while this does not generally cause any physical memory pages to be allocated, certain operating systems (for instance, Linux) will not allow more memory to be reserved by applications than the OS can guarantee is available, and so allocating a shared region larger than the physical memory (plus swap space) may fail.

The default amount of shared memory per UPC thread can be changed system-wide by modifying the 'shared_heap' parameter in the installation's upcc.conf file. You can override the system-wide default for your own applications by setting shared_heap in your user configuration file.

The upcc.conf file also provides a 'heap_offset' parameter (and upcc provides a '-heap-offset' flag) that affects where the address region for shared memory is located in your program. However, at present it is not useful on any of our supported systems, and so we do not recommend its use.


Using pthreaded Berkeley UPC programs

At present Berkeley UPC programs may handle inter-process communication within a compute node in one of two ways. If your runtime has been configured with --enable-pshm to enable support for Process SHared Memory, then all communication within a compute node will be performed directly through shared memory. This is the default configuration on Linux clusters and on Cray systems in the XE, XK and XC series, and is a configure-time option on most other platforms (though some system setup is often required). For all cases in which PSHM support is not enabled, the runtime will call the network API for all inter-process communication, even within a compute node. While many network APIs perform some kind of optimization for 'local' traffic (avoiding actually putting messages on the network), they are typically slower than simply using shared memory between UPC threads.

To provide shared-memory performance within an SMP (or cluster of SMPs), Berkeley UPC supports creating executables that use pthreads to optimize communication between multiple UPC threads running in the same process. To utilize pthreads, pass the '-pthreads=N' option to upcc, where N is the number of processors per node on your system (or configure your 'upcc.conf' file, as described below). This will use one or more multithreaded processes on each node, with shared memory used among UPC threads in the same process. This may often be the fastest way to run Berkeley UPC programs on SMP systems when PSHM is not available.

The '-pthreads' flag must be passed consistently at all stages of compilation and linking. Also, when pthreads are used, upcc needs to delay much of the compilation of your code until link time, so if you split code generation into separate compilation and linking steps (i.e., 'upcc -c foo.upc', followed by 'upcc foo.o bar.o'), you need to pass any macro and/or include path directives (ex: '-DFOO=bar -I/usr/local/include') to upcc in both the compilation and link commands.

Any C libraries that your code links against must be thread-safe in order to be used with -pthreads. If one or more of your libraries is not thread-safe, you must compile without pthreads, and run separate processes on the same machine to exploit an SMP system. Currently, such processes will not use any shared memory optimizations, and will communicate with each other via the network API. Support for shared memory between non-pthreaded Berkeley UPC processes will be provided in the near future.

When you link an application with '-pthreads', a subdirectory named <executable_name>_pthread-link will be created in the current directory. This directory exists in order to speed up further linking commands of the same program. If you link the same application again with the same object file names, and none of the global static unshared variables in your program have changed name or size, recompilation of all the files in your application can be avoided, which can make a significant difference in build time for programs with many source files. You may delete the temporary directory at any time without any side effects (other than possibly longer link times). One can prevent this optimization with the -nolink-cache flag to upcc.

Unless otherwise specified, pthreaded UPC applications use a default number of pthreads per process (run 'upcc --version' to see the default for your system). This number is set in the upcc.conf configuration file, and can be changed there (or in your user configuration file). It can also be overridden in several ways. Compiling with 'upcc -pthreads=<NUMBER>' changes the default number of pthreads per UPC process for an executable to NUMBER. If the 'UPC_PTHREADS_PER_PROC' environment variable is set to a nonzero integer when you run a UPC program, it will override any default value. Finally, upcrun is smart about pthreads in several ways. First, if you run a pthreaded parallel job with 'upcrun -n <NUMBER> ...', the number of processes actually launched will be divided by the number of pthreads, so that exactly NUMBER UPC threads are used. Second, if you use -network=smp (which generates an executable that will run only a single process), upcrun -n NUMBER will automatically set the number of pthreads to NUMBER.
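For example (the application name is illustrative):
    upcc -pthreads=4 -o mdsim mdsim.upc             # default of 4 UPC threads (pthreads) per process
    upcrun -n 16 mdsim                              # launches 4 processes x 4 pthreads = 16 UPC threads
    env UPC_PTHREADS_PER_PROC=8 upcrun -n 16 mdsim  # override the embedded default at run time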


Debugging Berkeley UPC programs

Berkeley UPC programs can now be debugged (with support for UPC-specific constructs) by the "TotalView" debugger produced by Rogue Wave Software. TotalView version 7.0.1 or greater is required, and support is currently only provided on x86 architectures, using MPI for the network. See our tutorial on using Berkeley UPC with TotalView for details.

If you do not have TotalView, you can also use a regular C debugger and get partial debugging support. Berkeley UPC provides several mechanisms for attaching a regular C debugger to one or more of your UPC application's threads at various points during execution. While this does not provide a fully normal debugging environment (the debugger will show the C code emitted by our translator, rather than your UPC code), it can still allow you to see program stack traces and other important information. This can be very useful if you wish to submit a helpful bug report to us. See Attaching a regular C debugger to Berkeley UPC programs for details.

Berkeley UPC also supports automatically generating backtraces if a fatal error occurs in your program. This will allow you to see a stack trace of the function calls that your program was in at the time it crashed. To use auto-backtracing, run with upcrun -backtrace or set GASNET_BACKTRACE=1 in your environment. The level of backtracing support available depends on the back-end C compiler and operating system, and so not all systems are equally functional, and some systems will not provide backtraces. See gasnet/README for more information on backtracing.
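For example, either of the following enables backtraces for a run (the application name is illustrative):
    env GASNET_BACKTRACE=1 upcrun -n 4 mdsim
    upcrun -backtrace -n 4 mdsim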


Analyzing UPC Programs with 'upc_trace'

As of version 2.0, Berkeley UPC includes 'upc_trace', a tool for analyzing the communication behavior of UPC programs. When run on the output of a trace-enabled Berkeley UPC program, 'upc_trace' provides information on which lines of code in your UPC program generated network traffic: how many messages the line caused, what type (local and/or remote gets/puts), and what the maximum/minimum/average/combined sizes of the messages were.

Examining tracing information is one of the best ways to go about optimizing your UPC programs. It provides a way for you to see which lines of your code are generating the most network traffic (and the size of the network messages used). From this you may be able to determine how to either avoid some of this traffic, or change your code to use fewer, larger messages (for instance, by replacing sets of individual reads/writes with bulk memory movement calls like 'upc_memget()', etc.), which is typically more efficient. Examining barrier wait times can also let you know if your computations are imbalanced across threads, and/or if you could profit by using split-phase barriers, moving computation in between 'upc_notify' and 'upc_wait'.

How to use 'upc_trace'

  1. Tracing must be enabled in order to work. By default, tracing is enabled for debug compilations (i.e. if 'upcc -g' is used), but not otherwise (as it incurs some overhead). If you wish to also trace non-debug executables, you must rebuild your UPC runtime system and pass '--with-multiconf=+opt_trace' to configure, then build your application with 'upcc -trace'.

  2. You must run your application with 'upcrun -trace ...' or 'upcrun -tracefile TRACE_FILE_NAME ...'. Either of these flags causes your UPC executable to dump out tracing information while it executes. The '-trace' flag causes one file per UPC thread to be generated, with the name 'upc_trace-a.out..-N', where 'a.out' is the name of your executable, and 'N' is the UPC thread's number. The '-tracefile NAME' option lets you specify your own name for the tracing file(s): if the name contains a '%' character, one trace file per thread is generated, with the '%' replaced with the UPC thread's number. Otherwise, all threads will write to the same file.
    Note that running with tracing may slow down your application considerably: the exact amount depends on your filesystem, and the ratio of communication/computation in your program. If you are only interested in a subset of trace information, consider setting 'GASNET_TRACEMASK' as described below.

  3. After your application has completed, you may run 'upc_trace' on one or more of the trace files generated by your program run:
    1. Running 'upc_trace' on a trace file generated by a single UPC thread shows the information only for that thread. If you pass multiple files from the same application run, the information for the various threads is coalesced, so passing in all the tracefiles generated by a run allows you to see information for the entire application.
    2. There are a number of flags to 'upc_trace' which control what kinds of information is reported, and how it is sorted. See 'upc_trace --help' or the upc_trace man page for details.
    3. Note that upc_trace may take a while to run, especially on large tracefiles. Consider setting GASNET_TRACEMASK and/or GASNET_TRACELOCAL (described below) to streamline the trace file's contents to include only those events you're interested in analyzing.
    4. If you compile with 'upcc -opt', it is possible that the UPC-to-C translator has coalesced some of the network operations in your program, in order to get better network performance. This means that 'upc_trace' may not report communication for certain lines of your program, and other lines may seem to be getting/putting more data than they should.
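Putting the steps together, a typical tracing session might look like the following (the application name is illustrative; the debug build means tracing is already enabled):
    upcc -g -o mdsim mdsim.upc                    # step 1: a trace-capable (debug) build
    upcrun -tracefile 'mdsim.trace.%' -n 4 mdsim  # step 2: one trace file per UPC thread
    upc_trace mdsim.trace.*                       # step 3: coalesce and analyze all threads' files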

Controlling what gets logged in the trace file by setting GASNET_TRACEMASK

By default, Berkeley UPC will trace all of the following program events:

ID Feature
G Network 'gets'. These include both bulk gets (from upc_memget, etc.), and network get operations caused by reading shared memory via shared variables/pointers. The 'g' mask does not include 'local' gets (i.e. reads from shared memory which has affinity to the reading UPC thread), as these do not result in network traffic. Use 'H' to trace local gets.
P Network 'puts'. These include both bulk puts (from upc_memput, etc.), and put operations caused by writing to shared memory via variables/pointers. The 'P' mask does not include 'local' puts (i.e. writes to shared memory which has affinity to the writing UPC thread), as these do not result in network traffic. Use 'H' to trace local puts.
B Barriers, including both blocking (upc_barrier) and non-blocking (upc_notify followed by upc_wait: a pair of these count as a single barrier).
N Line number information from UPC source files. The "N" and "H" flags must always be among those set for upc_trace to work!
H Miscellaneous UPC information. The "N" and "H" flags must always be among those set for upc_trace to work! Passing this flag causes the following things to be traced:
  • UPC lock functions ('upc_lock', 'upc_lock_attempt', and 'upc_unlock').
  • UPC collective operations (besides barriers, which are controlled by 'B').
  • 'Local' puts/gets, i.e. gets and puts to shared memory which has affinity to the issuing UPC thread (which thus do not result in network traffic). Tracing local gets/puts can significantly expand the size of the trace file (and the time it takes to run 'upc_trace'), so if you are not interested in viewing them, consider omitting them from the trace file. You can do this by setting the 'GASNET_TRACELOCAL' environment variable to "no" (or "0"). You may also selectively turn on/off local tracing during program execution by calling the 'bupc_trace_settracelocal()' function (described below). Local get/put tracing only includes accesses performed through shared pointers or the bulk 'upc_memget', etc., functions: it does not include accesses to shared memory made via 'localized' pointers, i.e., shared pointers that have been cast to regular C pointers.
  • UPC memory allocation operations, i.e., 'upc_alloc', 'upc_all_alloc', 'upc_global_alloc', and 'upc_free' function calls. (Note: allocation operations are not currently reported by 'upc_trace': if you wish to examine where/when your program run has called allocation functions, you must examine the trace file by hand.)
  • 'Strict' UPC operations. (Note: 'strict' operations are not currently reported by 'upc_trace': if you wish to examine where/when your program run has executed 'strict' operations, you must examine the trace file by hand.)

To trace only a subset of these features, set the 'GASNET_TRACEMASK' environment variable to a string containing the ID's of the features you wish to trace. Note that the "N" and "H" flags must always be among those set for 'upc_trace' to work (if you are intending to manually examine the trace file, they do not need to be set).

So, for instance, if you are trying to perform an analysis that does not require get/put information, you are highly advised to set 'GASNET_TRACEMASK' to "BHN" and 'GASNET_TRACELOCAL' to "no" (or "0"). This will turn off tracing for all get and put operations. Since gets/puts are typically the majority of items in a full trace file, this will probably result in much faster program execution, a much smaller trace file, and faster analysis by 'upc_trace'.
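For example, a barrier-only analysis of a 4-thread run might use (the application name is illustrative):
    env GASNET_TRACEMASK=BHN GASNET_TRACELOCAL=no upcrun -tracefile 'mdsim.trace.%' -n 4 mdsim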

Controlling tracing during runtime

For even more control over tracing, you may call the following functions in your program to set the trace mask dynamically, read its current value, and/or insert your own custom messages into the trace file:
    extern void         bupc_trace_setmask (const char *newmask);
    extern const char * bupc_trace_getmask (void);
    extern int          bupc_trace_gettracelocal (void);
    extern void         bupc_trace_settracelocal (int val);
    void                bupc_trace_printf ((const char *msg, ...));
'bupc_trace_getmask' and 'bupc_trace_setmask' allow programmatic retrieval and modification of the trace masks in effect for the calling thread. The initial values are determined by the 'GASNET_TRACEMASK' environment variable, and the input and output to the mask manipulation functions have the same format as 'GASNET_TRACEMASK' values. Note that whenever any tracing is enabled (i.e. unless you are temporarily turning off tracing by passing an empty string), the "N" and "H" flags must always be among those set for 'upc_trace' to work.
'bupc_trace_{get,set}tracelocal' allow the calling thread to programmatically enable/disable tracing of local put/get operations, which correspond to pointer-to-shared accesses that actually have local affinity (and therefore invoke no network communication).

Different UPC threads may set different masks and tracelocal settings, but note that in pthreaded UPC jobs all pthreads in a process share these values. These functions have no effect if trace and stats communication profiling are disabled at upcr configure time, or are not enabled for the current run.
   Ex: 
     bupc_trace_setmask("PGHN");   // trace everything
     bupc_trace_settracelocal(1);  // include local puts and gets 
     // do something...
     bupc_trace_setmask("");      // stop tracing

The 'bupc_trace_printf' utility outputs a message into the trace file, if it exists. Note that two sets of parentheses are required when invoking this operation, in order to allow it to compile away completely for non-tracing builds.

  Ex:   double A[4] = ...; 
        int i = ...;
        bupc_trace_printf(("the value of A[%i] is: %f", i, A[i]));


Gathering application statistics

Berkeley UPC also provides the ability to generate a 'stats' report, which contains a statistical summary of program activity. While this report does not give as much information as provided by tracing, it does contain such information as the total number of get/put operations, barriers, etc. (although these cannot be traced back to specific lines of code, as 'upc_trace' provides). But the stats report is generally much smaller than the average trace file, so it may be useful if you are finding that tracing is adding too much overhead to your program runs.

To generate statistics, simply set the 'GASNET_STATSFILE' environment variable to a file name, into which statistics will be written at the end of your program's run. (Note: by default, only debug executables support statistics generation, as it incurs a performance penalty: if you wish to have non-debug UPC executables generate statistics, you must rebuild your UPC runtime system, passing '--with-multiconf=+opt_trace' to configure, then build your application with 'upcc -trace'.) You may generate both stats and tracing info for the same program run if you wish.

Just as with tracing, you may set a mask to control what types of events are included in the statistics, by setting the 'GASNET_STATSMASK' environment variable, and/or by calling the following functions:

    extern void         bupc_stats_setmask (const char *newmask);
    extern const char * bupc_stats_getmask (void);
The same mask IDs are used by the tracing and statistics masks, i.e., calling 'bupc_stats_setmask("BP")' would cause execution to gather statistics only for barriers and puts. See the table in the tracing documentation for the list of IDs.
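For example, to gather only barrier and put statistics from a 4-thread run (the file and application names are illustrative):
    env GASNET_STATSFILE=mdsim.stats GASNET_STATSMASK=BP upcrun -n 4 mdsim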


Profiling UPC Programs with 'upcc -pg' and 'gprof'

The standard GNU 'gprof' profiling tool can be used with Berkeley UPC programs, if your backend C compiler supports gprof (this is autodetected at configure time). Simply compile your UPC program with 'upcc -pg'. When you run the program, one or more 'gmon.out' files are generated (if your UPC program consists of multiple processes, one file per process is created, each in its own 'gmon.out.process_number' subdirectory). You can then use 'gprof' on one or more of these files (if multiple files are passed, the statistics are combined):
    upcc -pg foo.c
    upcrun -n 2 a.out
    gprof a.out gmon.out.0/gmon.out gmon.out.1/gmon.out | less
Note that 'gprof' provides timings and statistics for processor usage: it does not include time during which the process has been put to sleep waiting for I/O (including network reads/writes). However, since Berkeley UPC uses spin-locks in many cases to wait for network events, rather than blocking system calls, you may see that certain 'gasnet*' functions consume large amounts of CPU time. This generally means that your program is spending most of that time waiting for network communication to complete (some fraction is the software overhead inherent in sending/receiving the network traffic). If your program spends a lot of time waiting for network operations to complete, you may be suffering from an imbalanced load across threads (so that some take longer to "catch up" to a barrier, for instance). Restructuring your application may avoid these waiting periods. Or you may be able to use some of this "spare" time for computation (or other network traffic) by switching to use non-blocking barriers (i.e., 'upc_notify/upc_wait'), and/or our non-contiguous memcpy extensions to UPC. Replace blocking network constructs (such as 'upc_barrier', 'upc_memcpy', and read/writes to shared variables) with non-blocking equivalents, and insert unrelated computation (and/or network traffic) in between the initialization and completion calls. Of course, you must be able to find unrelated computation/communication for this to work, and the degree to which this is possible will depend on your application.


Berkeley-specific extensions to the UPC Language

Non-blocking memcpy functions

NOTICE: See the end of this section for information on new standardized interfaces which will replace a portion of these Berkeley-specific ones in a future release.

As of 2.0, Berkeley UPC fully implements a set of non-blocking extensions to the 'upc_memcpy()' function for contiguous data. These extensions allow you to explicitly overlap memcpy-like functions with computation (and/or with other memcpy calls).

The full interface is described in sections 2 through 4 of our Proposal for Extending the UPC Memory Copy Library Functions. See that document for details on the functions and their usage.

NOTICE: The following are slated for inclusion in the upcoming 1.3 revision of the UPC specification, with semantics compatible with those given in the document referenced above, and are available in Berkeley UPC beginning with the 2.18 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. Once revision 1.3 of the UPC spec is final, the corresponding portion of the Berkeley-specific non-blocking interfaces will become deprecated.

    #define     __UPC_NB__  1  // predefined feature macro
    #include    <upc_nb.h>     // defines the following:

    // Explicit-handle non-blocking operations and synchronization:
    typedef      ... upc_handle_t;
    #define      UPC_COMPLETE_HANDLE ...
    upc_handle_t upc_memcpy_nb(shared void * restrict dst,
                               shared const void * restrict src,
                               size_t n);
    upc_handle_t upc_memget_nb(void * restrict dst,
                               shared const void * restrict src,
                               size_t n);
    upc_handle_t upc_memput_nb(shared void * restrict dst,
                               const void * restrict src,
                               size_t n);
    upc_handle_t upc_memset_nb(shared void *dst, int c, size_t n);
    int          upc_sync_attempt(upc_handle_t handle);
    void         upc_sync(upc_handle_t handle);

    // Implicit-handle non-blocking operations and synchronization:
    void upc_memcpy_nbi(shared void * restrict dst,
                        shared const void * restrict src,
                        size_t n);
    void upc_memget_nbi(void * restrict dst,
                        shared const void * restrict src,
                        size_t n);
    void upc_memput_nbi(shared void * restrict dst,
                        const void * restrict src,
                        size_t n);
    void upc_memset_nbi(shared void *dst, int c, size_t n);
    int  upc_synci_attempt(void);
    void upc_synci(void);
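As a sketch of how the explicit-handle interface can be used to overlap communication with computation (the array size, buffer names, and do_independent_work() are illustrative assumptions):
    #include <upc_relaxed.h>
    #include <upc_nb.h>
    #define N 1024

    shared [N] double remote_buf[N*THREADS];
    double local_buf[N];

    extern void do_independent_work(void);   /* any work that touches neither buffer */

    void exchange_and_compute(void) {
      /* start a non-blocking put of this thread's data into the next thread's block */
      upc_handle_t h = upc_memput_nb(&remote_buf[((MYTHREAD+1)%THREADS)*N],
                                     local_buf, N*sizeof(double));
      do_independent_work();   /* overlap the transfer with unrelated computation */
      upc_sync(h);             /* complete the put before local_buf is reused */
    }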

Non-contiguous memcpy functions

As of 2.0, Berkeley UPC fully implements a set of extensions to the 'upc_memcpy()' function for use with non-contiguous data. These extensions provide versions that allow you to specify non-contiguous memory regions to get/put, and include both blocking and non-blocking versions.

The full interface is described in sections 5 and 6 of our Proposal for Extending the UPC Memory Copy Library Functions. See that document for details on the functions and their usage.

Point-to-point synchronization functions

As of 2.2, Berkeley UPC implements a set of point-to-point synchronization functions, partly based on the POSIX semaphore interfaces. These extensions allow you to explicitly synchronize between pairs of UPC threads, and to associate synchronization with data transfer.

The full interface is described in our Proposal for Extending the UPC Libraries with Explicit Point-to-Point Synchronization Support. See that document for details on the functions and their usage.

Value-based collectives convenience interface (bupc_collectivev.h)

This library wrapper provides a value-based convenience interface to the UPC collectives library that is part of UPC 1.2. There is a small amount of optimization for Berkeley UPC, but the wrapper is generic and can be used with any fully UPC-1.2 compliant implementation of the UPC collectives library. All operations are implemented as thin wrappers around that library. In most cases, operands to this library are simple values, and nothing is required to be single-valued except for the data type in use and the root thread identifier (in the case of rooted collectives). The purpose of this wrapper is to provide convenience for scalar-based collective operations, especially in cases where there are not multiple values available to be communicated in aggregate (in which case the full array-based UPC collectives interface is likely to use fewer messages and achieve better performance) or for use in setup code (where performance is secondary to simplicity). See the collectivev documentation for full interface details.

The 'bupc_all_reduce_all' function family

This is an extension to the UPC Collectives Specification. The 'bupc_all_reduce_all' functions behave identically to the 'upc_all_reduce' functions, except that the 'dest' argument has the semantics of the 'dest' argument to 'upc_all_broadcast', i.e. the result of the reduction is broadcast to all threads, instead of just one.

The 'bupc_dump_shared' function

Shared pointers in UPC are logically composed of three fields: the address of the data that the shared pointer currently points to, the UPC thread on which that address is valid, and the 'phase' of the shared pointer (see the official UPC language specification for an explanation of shared pointer phase). Our version of UPC provides a 'bupc_dump_shared' function that will write a description of these fields into a character buffer that the user provides:
    int bupc_dump_shared(shared const void *ptr, char *buf, int maxlen);
Any pointer to a shared type may be passed to this function. The 'maxlen' parameter gives the length of the buffer pointed to by 'buf', and this length must be at least BUPC_DUMP_MIN_LENGTH, or else -1 is returned and errno is set to EINVAL. On success, the function returns 0. The buffer will contain either "<NULL>" if the pointer to shared == NULL, or a string of the form
    "<address=0x1234 (addrfield=0x1234), thread=4, phase=1>" 
The 'address' field provides the virtual address for the pointer, while the 'addrfield' contains the actual contents of the shared pointer's address bits. On some configurations these values may be the same (if the full address of the pointer can be fit into the address bits), while on others they may be quite different (if the address bits store an offset from a base initial address that may differ from thread to thread).

Both bupc_dump_shared() and BUPC_DUMP_MIN_LENGTH are visible when any of the standard UPC headers (upc.h, upc_relaxed.h, or upc_strict.h) are #included.
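A minimal sketch of its use (the array is illustrative; the buffer is sized with BUPC_DUMP_MIN_LENGTH so the call cannot fail with EINVAL):
    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int counters[THREADS];

    void show_neighbour_pointer(void) {
      char buf[BUPC_DUMP_MIN_LENGTH];
      shared int *p = &counters[(MYTHREAD+1)%THREADS];
      if (bupc_dump_shared(p, buf, BUPC_DUMP_MIN_LENGTH) == 0)
        printf("Thread %d: %s\n", MYTHREAD, buf);
    }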

The 'bupc_ptradd' function

Blocked pointers-to-shared in UPC are currently restricted to being declared with a compile-time constant block size. This can present problems in situations where the block size of a given array is input-dependent or otherwise unknown at compile time, and one wishes to conveniently access the array elements in layout order according to a specific block size.

The 'bupc_ptradd()' function provides support for performing pointer-to-shared arithmetic with general blocksize, which need not be compile-time constant.

  shared void * bupc_ptradd(shared void *p, size_t blockelems, size_t elemsz, ptrdiff_t elemincr);
    - 'p': the base pointer
    - 'blockelems': the block size (number of elements in a block)
    - 'elemsz': the element size (usually sizeof(*p))
    - 'elemincr': the positive or negative offset from the base pointer

The following call:

    bupc_ptradd(p, blockelems, sizeof(T), elemincr);
returns a value q as if it had been computed:
    shared [blockelems] T *q = p;
    q += elemincr;
however, the blockelems argument is not required to be a compile-time constant. The blockelems value must be non-negative, but may be zero to indicate an indefinite blocking factor. Here's an example of indexing into a dynamically-allocated array whose block size is not known until run time:
  int blockelems = ...; // choose some arbitrary block size

  // allocate an array of doubles with that blocksize
  shared void *myarr = upc_all_alloc(..., blockelems*sizeof(double)); 

  // access element 14
  double d = *(shared double *)bupc_ptradd(myarr, blockelems, sizeof(double), 14);

It's worth noting that in some cases bupc_ptradd() may be less efficient than regular pointer-to-shared addition, because the compile-time constant blocksize of the pointer referent type generally makes the latter more amenable to compiler optimization of the addition operation and surrounding code. This is especially true in the case of indefinitely blocked or cyclically blocked pointers-to-shared. However, the cost may be worth the added convenience in non-performance-critical code.

The 'bupc_poll' function

The 'bupc_poll()' function explicitly causes the UPC runtime to attempt to make progress on any network requests that may be pending.

You will normally not need to call this function, as the runtime will automagically perform checks for incoming network requests whenever your UPC code causes network activity to be performed, and this usually occurs fairly frequently in a UPC application. However, if you are writing your own 'spin lock' style synchronization, you may need to use this function to avoid deadlock. Here is an example:

    shared strict int flag[THREADS];

    ...

    if (MYTHREAD % 2) {
        /* odd threads: spin until the flag is set, polling for network progress */
        while (flag[MYTHREAD] == 0)
            bupc_poll();
    } else {
        ... some calculation ...
        flag[MYTHREAD + 1] = 1;   /* assumes THREADS is even, so thread MYTHREAD+1 exists */
    }
Here the 'even' UPC threads are performing some calculation, then informing the 'odd' threads that the result is ready by setting a per-thread flag. If the 'bupc_poll()' were omitted, the 'odd' threads might (on certain platforms/networks) consume all of the CPU forever in the 'while' test, never checking for the incoming network message that would set flag[MYTHREAD].

If a program contains computationally intensive sections in which no remote accesses are performed for a long time, it is also possible that performance may be improved by intermittently calling bupc_poll, particularly if other threads are likely to be performing remote accesses (or memory allocation requests) during this time.

The 'bupc_assert_type' built-in

The 'bupc_assert_type(expr, type)' built-in operation allows testing for compile-time type equality, and is primarily used by our UPC compiler test suite.
  1. 'expr' = any arbitrary (legal) UPC expression
  2. 'type' = any legal C/UPC type

If 'expr' has a static type which is identical to 'type', does nothing. Otherwise, prints a non-fatal warning containing the line number and a description of the two differing types.
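For example (a minimal sketch):
    void check_types(void) {
      double t = 0.0;
      bupc_assert_type(t, double);       /* identical types: compiles silently */
      bupc_assert_type(t + 1, double);   /* arbitrary expressions are allowed: 't + 1' is a double */
      bupc_assert_type(t, float);        /* differing types: a non-fatal warning names both */
    }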

High-precision wall-clock timer support

NOTICE: See the end of this section for information on new standardized interfaces which will replace these Berkeley-specific ones in a future release.

    typedef     ... bupc_tick_t; /* 64-bit integral type */
    #define     BUPC_TICK_MAX ...
    #define     BUPC_TICK_MIN ...
    bupc_tick_t bupc_ticks_now (void);
    uint64_t    bupc_ticks_to_us (bupc_tick_t ticks);
    uint64_t    bupc_ticks_to_ns (bupc_tick_t ticks);
    double      bupc_ticks_granularityus (void); 
    double      bupc_ticks_overheadus (void);
The 'bupc_tick_t' type and associated functions provide portable support for querying high-precision system timers for obtaining wall-clock timings of sections of code. Most CPU hardware offers access to high-performance timers with a handful of instructions, providing timer precision and overhead that can be several orders of magnitude better than can be obtained through the use of the gettimeofday() system call.

The 'bupc_tick_t' type represents an integral quantity of abstract timer ticks, whose ratio to real time is system-dependent and thread-dependent. bupc_ticks_now() returns the current value of the tick timer for the calling thread, using the fastest mechanism available. bupc_ticks_to_us() and bupc_ticks_to_ns() convert a difference in bupc_tick_t values obtained by the calling thread into microseconds or nanoseconds, respectively. The bupc_ticks_to_{us,ns}() conversion calls can be significantly more expensive than the bupc_ticks_now() tick query, so for timing short intervals it's recommended to keep timing results in units of ticks until final output. BUPC_TICK_MAX and BUPC_TICK_MIN provide tick values which are respectively larger and smaller than any possible tick value. bupc_ticks_granularityus() and bupc_ticks_overheadus() respectively report the estimated microsecond granularity (minimum time between distinct ticks) and microsecond overhead (time it takes to read a single tick value, not including conversion) for the timer facility.

Example:

  bupc_tick_t start = bupc_ticks_now();
    compute_foo(); /* do something that needs to be timed */
  bupc_tick_t end = bupc_ticks_now();

  printf("Time was: %d microseconds\n", (int)bupc_ticks_to_us(end-start));

  printf("Timer granularity: <= %.3f us, overhead: ~ %.3f us\n",
       bupc_tick_granularityus(), bupc_tick_overheadus());
  printf("Estimated error: +- %.3f %%\n",
      100.0*(bupc_tick_granularityus()+bupc_tick_overheadus()) /
            bupc_ticks_to_us(end-start));
It's important to keep in mind that raw bupc_tick_t values are thread-specific quantities with a thread-specific interpretation (e.g. they might represent a hardware cycle count on a particular CPU, starting at some arbitrary time in the past). More specifically, raw ticks do NOT provide a globally-synchronized timer (i.e. the simultaneous absolute tick values may differ across threads), and furthermore the tick-to-wallclock conversion ratio might also differ across threads (e.g. on a cluster with heterogeneous CPU clock rates, the raw tick values may advance at different rates for different threads). Therefore as a rule of thumb, raw bupc_tick_t values and bupc_tick_t intervals obtained by different threads should never be directly compared or arithmetically combined, without first converting the relevant tick intervals to wall time intervals.

NOTICE: The following are slated for inclusion in the upcoming 1.3 revision of the UPC specification, with the same semantics as described above, and are available in Berkeley UPC beginning with the 2.16 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. Once revision 1.3 of the UPC spec is final, all Berkeley-specific timer interfaces will become deprecated.

    #define     __UPC_TICK__  1  // predefined feature macro
    #include    <upc_tick.h>     // defines the following:
    typedef     ... upc_tick_t;
    #define     UPC_TICK_MAX ...
    #define     UPC_TICK_MIN ...
    upc_tick_t  upc_ticks_now (void);
    uint64_t    upc_ticks_to_ns (upc_tick_t ticks);
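Usage mirrors the bupc_tick_t example above (compute_foo() stands in for the code being timed):
    #include <upc_tick.h>
    #include <stdio.h>

    extern void compute_foo(void);

    void time_it(void) {
      upc_tick_t start = upc_ticks_now();
      compute_foo();
      upc_tick_t end = upc_ticks_now();
      printf("Time was: %llu nanoseconds\n",
             (unsigned long long)upc_ticks_to_ns(end - start));
    }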

Runtime thread layout query for hierarchical systems

    unsigned int bupc_thread_distance(int threadX, int threadY); 
    #define     BUPC_THREADS_SAME     ...
    #define     BUPC_THREADS_VERYNEAR ...
    #define     BUPC_THREADS_NEAR     ...
    #define     BUPC_THREADS_FAR      ...
    #define     BUPC_THREADS_VERYFAR  ...
bupc_thread_distance takes two thread identifiers (whose values must be in 0..THREADS-1, otherwise behavior is undefined), and returns an unsigned integral value which represents an approximation of the abstract 'distance' between the hardware entity which hosts the first thread, and the hardware entity which hosts the memory with affinity to the second thread. In this context 'distance' is intended to provide an approximate and relative measure of expected best-case access time between the two entities in question. Several abstract 'levels' of distance are provided as the pre-defined constants shown above, which represent monotonically non-decreasing 'distance'.

These constants have implementation-defined integral values which are monotonically increasing in the order given above. Implementations may add further intermediate levels with values between BUPC_THREADS_VERYNEAR and BUPC_THREADS_VERYFAR (with no corresponding define) to represent deeper hierarchies, so users should test against the constants using <= or >= instead of ==.

The intent of the interface is for users to not rely on the physical significance of any particular level and simply test the differences to discover which threads are relatively closer than others. Implementations are encouraged to document the physical significance of the various levels whenever possible (see below), however any code based on assuming exactly N levels of hierarchy or a fixed significance for a particular level will probably not be performance portable to different implementations or machines.

The relation is symmetric, i.e.:

bupc_thread_distance(X,Y) == bupc_thread_distance(Y,X)

but the relation is not transitive: bupc_thread_distance(X,Y) == A && bupc_thread_distance(Y,Z) == A does NOT imply bupc_thread_distance(X,Z) == A

Furthermore, the value of bupc_thread_distance(X,Y) is guaranteed to be unchanged over the span of a single program execution, and the same value is returned regardless of the thread invoking the query.

Currently the significance of the BUPC_THREADS_* constants is as follows:

Value Meaning(s)
BUPC_THREADS_SAME Only returned when threadX == threadY.
BUPC_THREADS_VERYNEAR threadX and threadY will communicate through shared memory.
May include pthreads in the same process when compiled with -pthreads, and processes in the same compute node when PSHM support is available.
BUPC_THREADS_NEAR threadX and threadY are in the same compute node, but will communicate using the network API.
This may occur because either PSHM support is not available or the -pshm-width flag to upcrun has placed these threads in disjoint shared memory domains.
BUPC_THREADS_FAR This value is not currently used.
BUPC_THREADS_VERYFAR threadX and threadY are on different compute nodes.
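As a sketch, a thread could count how many threads it can reach through shared memory by testing against the constants with '<=' (the declarations are assumed to be visible through the standard UPC headers):
    #include <upc_relaxed.h>
    #include <stdio.h>

    void report_neighbours(void) {
      int nearby = 0;
      for (int t = 0; t < THREADS; t++)
        if (bupc_thread_distance(MYTHREAD, t) <= BUPC_THREADS_VERYNEAR)
          nearby++;   /* includes MYTHREAD itself, since SAME <= VERYNEAR */
      printf("Thread %d: %d threads are very near\n", MYTHREAD, nearby);
    }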

Castability of pointers-to-shared

NOTICE: See the end of this section for information on new standardized interfaces which will replace these Berkeley-specific ones in a future release.

    int bupc_castable(shared void *ptr);
    int bupc_thread_castable(unsigned int threadnum);
    void * bupc_cast(shared void *ptr);

This family of functions implements a UPC language extension proposed by Brian Wibecan of HP. Their purpose is to allow a UPC programmer to take advantage of UPC implementations in which some or all of the data of a given UPC thread can be directly addressed by other UPC threads.

We use the term 'castable' to denote that the UPC implementation is able to represent a given pointer-to-shared using a pointer-to-private in a given thread. Any pointer-to-shared with affinity to a thread is also castable by that same thread. However, in general shared storage with affinity to one thread is not castable by other threads. Depending on the UPC implementation, it is possible that for a given pair of threads either all, none, or only some of the shared address space with affinity to the first may be castable by the second.

bupc_castable() takes a shared pointer as argument and returns non-zero if and only if the argument is castable by the calling thread. It is guaranteed that a call to bupc_castable() with an argument having affinity to the calling thread will always return non-zero.

bupc_thread_castable() takes a UPC thread number as argument and returns non-zero if and only if every pointer-to-shared with affinity to the argument thread is castable by the calling thread. It is guaranteed that bupc_thread_castable(MYTHREAD) is always non-zero.

bupc_cast() takes a shared pointer as argument and returns a pointer-to-private. The returned pointer may be used to reference the same object as the argument only if the argument pointer is castable by the calling thread, as may be determined by bupc_castable() or bupc_thread_castable(). Otherwise the returned pointer is NULL.

NOTICE: The following are slated for inclusion in the upcoming 1.3 revision of the UPC specification, with similar functionality to those described above, and are available in Berkeley UPC beginning with the 2.16.2 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. Once revision 1.3 of the UPC spec is final, the Berkeley-specific versions of these interfaces will become deprecated.

    void *upc_cast(const shared void *ptr);
    upc_thread_info_t upc_thread_info(size_t threadnum);

In addition to the functions above, Berkeley UPC 2.14.2 and later implement an 'inverse cast' function:

    shared void * bupc_inverse_cast(void *ptr);
This function takes a pointer-to-private argument and returns a pointer-to-shared referencing the same location if and only if the argument references a shared object. If the argument is NULL or references a location not in the shared space, then the return value is (shared void *)NULL.
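A common pattern is to fall back to the standard library calls when a cast is not possible. A minimal sketch (the array, block size, and function name are illustrative):
    #include <upc_relaxed.h>
    #include <string.h>
    #define BLK 16

    shared [BLK] double table[BLK*THREADS];

    /* Copy thread 'owner's block into a private buffer, reading directly through a
       private pointer when the block is castable by the calling thread. */
    void fetch_block(int owner, double *dst) {
      shared [BLK] double *src = &table[owner*BLK];
      double *local = (double *)bupc_cast((shared void *)src);
      if (local != NULL)
        memcpy(dst, local, BLK*sizeof(double));
      else
        upc_memget(dst, src, BLK*sizeof(double));
    }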

The 'bupc_atomic*' function family

    type bupc_atomicX_read_RS(shared void *ptr);
    void bupc_atomicX_set_RS(shared void *ptr, type val);
    type bupc_atomicX_swap_RS(shared void *ptr, type val);
    type bupc_atomicX_mswap_RS(shared void *ptr, type mask, type val);
    type bupc_atomicX_cswap_RS(shared void *ptr, type oldval, type newval);
    type bupc_atomicX_fetchadd_RS(shared void *ptr, type op);
    type bupc_atomicX_fetchand_RS(shared void *ptr, type op);
    type bupc_atomicX_fetchor_RS(shared void *ptr, type op);
    type bupc_atomicX_fetchxor_RS(shared void *ptr, type op);
    type bupc_atomicX_fetchnot_RS(shared void *ptr);
Where type and X take on the values of each pair from the following table, and RS is either `strict' or `relaxed'.
Type X
int I
unsigned int UI
long L
unsigned long UL
int64_t I64
uint64_t U64
int32_t I32
uint32_t U32

This family of functions provides atomic read, write and read-modify-write operations on the indicated data types. When these functions are used to access a memory location in a given synchronization phase, atomicity is guaranteed if and only if no other mechanisms are used to access the same memory location in the same synchronization phase. Memory accesses are relaxed or strict as indicated by the function names.

The swap functions atomically set the location given by the first argument to the value of the second argument and return the prior value. The mswap (masked swap) functions atomically update the location given by the first argument to a value obtained by replacing those bits set in mask with the corresponding bits from val, and return the prior value.

The cswap (conditional swap) functions atomically set the location given by the first argument to the value newval only if the current value is equal to oldval, but return the prior value regardless of whether the write was performed.

The fetchadd functions atomically add the second argument to the location given by the first argument and return the value prior to the addition. Similarly, the fetchand, fetchor and fetchxor functions atomically perform the appropriate bit-wise operation and return the value prior to the operation. The fetchnot functions atomically perform a bit-wise negation of the location given by the argument and return the value prior to the negation.
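
As an example of the intended usage, the following sketch (illustrative only; the accumulator name is hypothetical) has every thread contribute to a single shared 64-bit counter using the relaxed fetch-and-add, with a barrier separating the updates from the final read so that no other access mechanism touches the location in the same synchronization phase:
    #include <upc_relaxed.h>
    #include <stdio.h>
    #include <stdint.h>

    shared int64_t total;   /* hypothetical accumulator, affinity to thread 0, zero-initialized */

    int main(void) {
      bupc_atomicI64_fetchadd_relaxed((shared void *) &total, (int64_t) (MYTHREAD + 1));
      upc_barrier;          /* ends the synchronization phase containing the atomic updates */
      if (MYTHREAD == 0)
        printf("total = %lld\n", (long long) total);   /* ordinary read in a new phase */
      return 0;
    }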

In addition to the relaxed and strict atomic operations on shared data, the following are available to operate on private pointers, including pointers to shared data with local affinity. Other than the type of the first argument, these functions operate identically to the relaxed atomics above.

    type bupc_atomicX_read_private(void *ptr);
    void bupc_atomicX_set_private(void *ptr, type val);
    type bupc_atomicX_swap_private(void *ptr, type val);
    type bupc_atomicX_mswap_private(void *ptr, type mask, type val);
    type bupc_atomicX_cswap_private(void *ptr, type oldval, type newval);
    type bupc_atomicX_fetchadd_private(void *ptr, type op);
    type bupc_atomicX_fetchand_private(void *ptr, type op);
    type bupc_atomicX_fetchor_private(void *ptr, type op);
    type bupc_atomicX_fetchxor_private(void *ptr, type op);
    type bupc_atomicX_fetchnot_private(void *ptr);
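
For example, the following sketch (illustrative only; the array name is hypothetical) uses bupc_cast() to obtain a private pointer to an element with local affinity and then updates it with a private atomic:
    #include <upc_relaxed.h>
    #include <stdint.h>

    shared int64_t hits[THREADS];   /* hypothetical shared array, one element per thread */

    void bump_local(void) {
      /* &hits[MYTHREAD] has affinity to the calling thread, so it is always castable;
         no other access mechanism touches this element in the same phase */
      int64_t *p = (int64_t *) bupc_cast((shared void *) &hits[MYTHREAD]);
      bupc_atomicI64_fetchadd_private(p, (int64_t) 1);
    }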

Support for additional data types (e.g. short, char and floating point) and operations is expected to appear in future releases.

Collective deallocation functions

NOTICE: See the end of this section for information on new standardized interfaces which will replace these Berkeley-specific ones in a future release.

    void bupc_all_free(shared void *ptr);
    void bupc_all_lock_free(upc_lock_t *lockptr);

These two functions implement collective alternatives to the standard functions upc_free() and upc_lock_free(), as a convenience to the programmer. Both functions must be called collectively by all threads with the same argument. The object referenced by the argument is guaranteed to remain valid until all threads have entered the collective deallocation call, but these functions do not otherwise provide any synchronization, nor do they imply a strict reference. In all other respects the semantics of these functions and the constraints on their usage are identical to those of their non-collective variants.
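
For illustration, a minimal sketch of the intended usage pattern (the function name and the array are purely illustrative; the allocation itself uses the standard upc_all_alloc()):
    #include <upc_relaxed.h>

    void demo(void) {
      shared int *arr = (shared int *) upc_all_alloc(THREADS, sizeof(int));
      arr[MYTHREAD] = MYTHREAD;   /* each thread touches its own element */
      upc_barrier;
      bupc_all_free(arr);         /* every thread calls this collectively with the same pointer */
    }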

NOTICE: The following are slated for inclusion in the upcoming 1.3 revision of the UPC specification, with the same semantics as described above, and are available in Berkeley UPC beginning with the 2.16 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. Once revision 1.3 of the UPC spec is final, the Berkeley-specific versions of these interfaces will become deprecated.

    void upc_all_free(shared void *ptr);
    void upc_all_lock_free(upc_lock_t *lockptr);


Known bugs and limitations

This release of Berkeley UPC has a number of known limitations and bugs:

Implicit library calls may modify errno

The C99 standard gives the following semantics for errno:
The value of errno is zero at program startup, but is never set to zero by any library function. The value of errno may be set to nonzero by a library function call whether or not there is an error, provided the use of errno is not documented in the description of the function in this International Standard.
These semantics are actually somewhat weaker than one might hope - specifically, they allow library calls which succeed to change errno to a non-zero value. In practice, many C/POSIX library implementations actually do this.

The problem in the context of Berkeley UPC and its source-to-source translation is that there is one copy of errno per UPC thread, shared by both the generated code representing translated UPC code and all the runtime libraries running underneath it (including UPCR, GASNet, vendor network libs, etc.). Furthermore, many actions in UPC which do not qualify as library calls at the UPC level (e.g. dereferencing a pointer-to-shared) result in library calls within the generated code. Consequently, the value of errno set by a failed library call invoked at the UPC source level may be subsequently overwritten by any of these implicit library calls.

While one could imagine the Berkeley UPC compiler and runtime taking action to preserve the value of errno across all the implicit library calls, doing so would adversely affect performance and we do not currently take this approach. This means that a UPC user who wants to inspect the value of errno after one of their library calls fails must do so immediately - not just before the next UPC-level library call, but also before taking any action that might invoke implicit library calls in the generated source code.

Basically, the only 100% safe way for a UPC program to read errno when using Berkeley UPC is to copy it into a local variable immediately after the failed library call returns. This is the "recommended practice" for using errno with Berkeley UPC.
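
Concretely, a sketch of this recommended practice (the file-handling code and the shared array are purely illustrative):
    #include <upc_relaxed.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    shared int ok[THREADS];          /* hypothetical shared status array */

    void open_input(const char *path) {
      FILE *f = fopen(path, "r");
      int saved_errno = errno;       /* capture immediately: the next statement may clobber errno */
      ok[MYTHREAD] = (f != NULL);    /* shared store: translates into implicit runtime library calls */
      if (f == NULL)
        fprintf(stderr, "fopen(%s) failed: %s\n", path, strerror(saved_errno));
      else
        fclose(f);
    }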

Preprocessor macros defined in UPC files must not affect .h files

Berkeley UPC translates your UPC programs into C code, then runs a regular C compiler on your system to generate object code. To avoid handling vendor-specific inline assembly code that appears in some header files on many of the various systems we run on, we currently have our UPC-to-C translator 'put back' all non-UPC header files (i.e., .h files which don't contain any UPC constructs), which are then handled by the regular C compiler (we do not support placing inline assembly in your UPC code).

A side effect of this process is that the preprocessor is run twice on your program. Since any #defined macros you place in your UPC code are expanded (and their definitions forgotten) the first time the preprocessor is run, these macros will not be present the second time .h files are included. Thus, UPC code such as
    #define NDEBUG
    #include <assert.h>
will not work as expected if the NDEBUG definition modifies the behavior of assert.h (which, in this example, it does: this NDEBUG/assert.h case is the most common case where users run into this issue with our compiler).

There is a simple workaround: if you need to define a macro that affects the behavior of #included files, define it on the command line to upcc:

    upcc -DNDEBUG myprogram.upc

Behavior of the 'getenv()/setenv()' functions

It is not well-defined in the UPC specification whether the standard 'getenv()' function should return the same values on all threads, and/or if these values should include those present in the environment of the process that launches the UPC application.

Berkeley UPC guarantees that 'getenv()' allows retrieval of certain environment variable values that were present when the job was launched. At present this function is only guaranteed to retrieve these values for all threads if the environment variable's name begins with 'UPC_' or 'GASNET_'. On some platforms all environment variables seen by the job launcher may be propagated, but it is not portable to rely on this.
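
For example (the variable name below is hypothetical), a portable program should only rely on values whose names carry one of these prefixes:
    #include <upc_relaxed.h>
    #include <stdio.h>
    #include <stdlib.h>

    void report_setting(void) {
      const char *v = getenv("UPC_MY_APP_SETTING");   /* hypothetical variable set at job launch */
      if (v != NULL)
        printf("Thread %d sees UPC_MY_APP_SETTING=%s\n", MYTHREAD, v);
    }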

The 'setenv()' and 'unsetenv()' functions are not guaranteed to work in a Berkeley UPC runtime environment, and should be avoided.

Correctness when using GCC 4.x (x<3) as the C compiler

There is a known bug in the optimizer in gcc 4.0.x through 4.2.x that may affect the correctness of shared-local accesses in UPC (i.e., shared accesses that result in node-local accesses at runtime). In a nutshell, it is possible that in rare cases these compilers misoptimize a shared-local access such that it deterministically reads or writes an incorrect value. For this reason, configure will not allow you to use one of these compilers without an explicit option: '--enable-allow-gcc4'. If you do configure with '--enable-allow-gcc4', then you may encounter the optimizer bug. If you suspect you may be encountering this issue, the following actions are recommended for diagnosis:
  1. Try compiling your code in debug mode (i.e., with 'upcc -g'). If the problem persists, then this issue is *not* the culprit.
  2. Try compiling your code using the flag 'upcc -Wc,-fno-strict-aliasing'. If the problem persists, then this issue is *not* the culprit.
  3. Run your code several times. If the problem is intermittent, then this issue is probably not the culprit (the optimizer bug is deterministic).

If you still believe you are encountering this issue, there are several recommended workarounds:

  1. Configure BUPC to use a different backend C compiler. If you have a non-gcc vendor C compiler available, this may actually be a better choice for performance anyhow. Failing that, using gcc >= 4.3 (or gcc 3.x) should also resolve the issue, as the bug is only believed to be present in gcc 4.0.x through 4.2.x.
  2. Build the affected modules using the flag 'upcc -Wc,-fno-strict-aliasing'. This makes the gcc 4.x optimizer more conservative, and also inhibits the illegal optimization.
  3. Reconfigure BUPC using 'configure --enable-conservative-local-copy'. This globally activates a more conservative implementation of shared-local accesses that also prevents the illegal optimization.
The performance impact of the workarounds above is expected to be application-dependent.

GUPC+UPCR with -pthreads

GUPC+UPCR has a known problem in -pthreads compilation mode, whereby programs with a significant amount of statically-allocated private data may fail at program startup with an error message like:
    UPC Runtime error: pthread_create: Invalid argument
Users encountering this error are advised to work around it by either using the BUPC translator (which does not exhibit the problem) or by reworking their program to use less statically-allocated private data.

Other known limitations/bugs


Platform-specific issues

Running into maximum size limits on pinning-based networks

On systems that pin RDMA-addressable memory (such as InfiniBand), the amount of shared memory that a default Berkeley UPC build can provide to a UPC program will be no larger than the maximum region that the OS and network drivers allow to be pinned at once. While this is typically a large fraction of physical memory, it may prove insufficient for your application. In this case, a "large segment" mode is available, which is slightly slower in some situations, but which provides the maximum possible UPC shared memory space. To use large segment mode, reconfigure the Berkeley UPC runtime with '--enable-segment-large' and rebuild it.

When using PSHM over POSIX shared memory (the default under Linux), it has also been observed that GASNet's InfiniBand support is unable to register a UPC shared heap as large as when using PSHM over System V shared memory. If you are using ibv-conduit on Linux and see crashes at startup with large UPC shared heap sizes, we recommend reconfiguring your runtime with '--disable-pshm-posix --enable-pshm-sysv' before trying '--enable-segment-large'.


Feedback

Please contact us with your bug reports, comments, and suggestions.

Thank you for using Berkeley UPC!

