Berkeley UPC - Unified Parallel C

(A joint project of LBNL and UC Berkeley)


Berkeley UPC User's Guide version 2.1.0


This guide tells you how to use the Berkeley UPC compiler, which is a portable UPC implementation that runs on many different parallel systems available today.


UPC standards and APIs supported by this version of Berkeley UPC

This version of Berkeley UPC includes


A Sample UPC Program

Here is a simple hello-world program written in UPC:
    #include <upc_relaxed.h>
    #include <stdio.h>

    int main() {
      printf("Hello from thread %i/%i\n", MYTHREAD, THREADS);
      upc_barrier;
      return 0;
    }
This program prints a message once from each thread (in some arbitrary interleaving), executes a barrier (optional), and exits.

For more involved examples of UPC code, see the UPC Language Tutorials from the UPC Language Community website and the 'upc-examples' directory of the Berkeley UPC runtime distribution. The official UPC language specification is a useful reference, and contains a description of the standard library. Finally, the UPC Collectives Specification describes the collective operations available in UPC.


Compiling UPC programs with 'upcc'

The upcc front end is used to compile UPC programs. It is designed with an interface that is very similar to the standard GNU gcc compiler, for ease of use. For instance, you could compile a physics simulation called 'light' from two source files via
    upcc -o light particle.upc wave.c -lgrottymath
Note that 'wave.c' can contain either UPC code or regular C code, and the 'grottymath' library that is linked into the application can be a regular C library: Berkeley UPC is fully interoperable with regular C source, object, and library files (note: if you compile with the -pthreads flag, any C libraries you use must be thread-safe). Berkeley UPC 2.0 also adds support for linking C++/FORTRAN/MPI objects into a UPC executable: see Mixing C/C++/MPI/FORTRAN with UPC.

upcc recognizes most commonly used C compiler flags (-D, -I, etc.). It also uses a number of its own flags for the choice of network API your program will run over, for compiling your UPC code for a static number of threads, and other UPC-specific options. See the upcc man page for details.

Choosing a network API for your UPC executable

Berkeley UPC executables are always compiled to run over a particular network API. To choose which network API is used, pass the '--network' flag with one of the following values:

Name    Description
lapi    LAPI API for IBM SP networks
gm      GM API for Myrinet networks
elan    elan API for Quadrics networks
vapi    VAPI API for Mellanox-based Infiniband networks
sci     SISCI API for Dolphin-based SCI networks (EXPERIMENTAL: currently requires the Linux BigPhysMem kernel patch in order to get more than 1MB of shared heap space)
shmem   SHMEM API for SGI Altix systems and the Cray X1. Other systems providing a SHMEM API may also work, but have not been tested.
udp     UDP: works on any system with a standard TCP/IP stack, but is typically slower than using one of the native network types. Generally the fastest option for systems with only Ethernet hardware (notably faster than MPI-over-TCP).
mpi     MPI: works on any system with MPI installed, but is typically slower than using one of the other network types.
smp     "Symmetric multiprocessor (SMP)" mode: uses no network. Currently runs with only a single process, so you must use -pthreads to run with multiple UPC threads.

Note that you can only compile for a given network type if your Berkeley UPC runtime was configured to support it at build/installation time. To see which APIs are supported in your installation, and to see which is used by default, use 'upcc --version'.

Compiling for a fixed number of UPC threads

The '-T <number>' option to upcc causes your executable to be built for a fixed number of UPC threads. Alternatively, you can set 'UPCC_FIXED_THREADS=<number>' in your environment (the '-T' flag overrides the environment setting if both are present).

An executable compiled for a fixed number of UPC threads will fail at startup if you try to run it with a different number of threads. However, fixing the number of threads allows optimization on certain operations (such as shared pointer arithmetic), especially when the number of threads is a power of 2.

Overriding global upcc.conf settings in your $HOME/.upccrc file

upcc gets global settings for your installation from a upcc.conf file which is created during the configuration stage of a runtime installation. After installation the file is located in the $prefix/etc directory of your installation. You can create a $HOME/.upccrc file to override any of these settings. See the upcc man page for a list of available settings.

Berkeley-specific preprocessor macros

Programs compiled with Berkeley UPC will see all of the preprocessor macros provided by your backend C compiler, plus the following:

__UPC__
    Value:       1
    Description: Defined by any UPC implementation
    Standard:    UPC language
__UPC_VERSION__
    Value:       Monotonically increasing positive integer constant
    Description: UPC specification supported: value is YYYYMM date of that version's ratification (ex: '200310L')
    Standard:    UPC language
__UPC_STATIC_THREADS__
    Value:       1 if static threads: else undefined
    Description: Set to 1 if the '-T' flag was passed to upcc
    Standard:    UPC language
__UPC_DYNAMIC_THREADS__
    Value:       1 if dynamic threads: else undefined
    Description: Set to 1 unless the '-T' flag was passed to upcc
    Standard:    UPC language
__BERKELEY_UPC__
    Value:       Monotonically increasing positive integer constant
    Description: The major version number of the Berkeley UPC release. Example: '1' for release '1.0.3'.
    Standard:    Berkeley UPC only
__BERKELEY_UPC_MINOR__
    Value:       An integer constant
    Description: The minor version number of the Berkeley UPC release. Example: '0' for release '1.0.3'.
    Standard:    Berkeley UPC only
__BERKELEY_UPC_PATCHLEVEL__
    Value:       An integer constant
    Description: The patch version number of the Berkeley UPC release. Example: '3' for release '1.0.3'.
    Standard:    Berkeley UPC only
__BERKELEY_UPC_<NETWORK>_CONDUIT__
    Value:       1, or undefined
    Description: Identifies the network API used. Example: if 'upcc -network=mpi' is used, '__BERKELEY_UPC_MPI_CONDUIT__' will be defined, with the value of 1
    Standard:    Berkeley UPC only
__BERKELEY_UPC_PTHREADS__
    Value:       1, or undefined
    Description: Defined to 1 if and only if the '-pthreads' flag is used
    Standard:    Berkeley UPC only

Using a remote UPC-to-C translator

The upcc front end can use a UPC-to-C translator located on a remote machine. This is provided both as a convenience (the translator takes much longer to build than the runtime, and we provide a public HTTP translator that allows users to get started with Berkeley UPC more quickly) and to support the many systems on which our translator does not build due to C++ portability issues.

A remote translator can be used over either HTTP or SSH. To use HTTP, the 'upcc.cgi' CGI script (located in the 'contrib' directory of the runtime distribution) must be installed and configured with a web server on the remote host. Simply set the 'translator' parameter in your '$HOME/.upccrc' file (or the global 'upcc.conf') to the URL for the CGI script. To use SSH, you must be able to log in to the remote host using SSH, and the 'translator' parameter must be set to 'remote_host:/path/to/translator'. You will want to use key-based authentication and 'ssh-agent' to avoid entering your password each time you compile. See our SSH Agent Tutorial.


Running UPC programs

If you compile a UPC program with '--network=smp', you can run the executable normally (the same way you'd run 'ls' or 'grep'). Otherwise, you are generating an executable that uses a parallel network API, and this typically means your executable will need some special treatment to be launched correctly.

Berkeley UPC executables should be run the same way as any other parallel program on your system that uses the same underlying network API. So, for instance, a program compiled with '--network=mpi' is run on many systems via 'mpirun -np <number of processes> a.out'. Other systems may use other invocations, such as 'prun' or 'poe', especially when APIs other than MPI are used. Consult your system's documentation for details.

Using 'upcrun'

The 'upcrun' script that is installed as part of the Berkeley UPC runtime is our attempt to provide a standard interface for running UPC programs. If your installation has configured 'upcrun.conf' correctly (in many cases the defaults will work), you can run UPC programs portably via commands like
    upcrun -n 4 parboil
This example runs the UPC executable 'parboil' with 4 UPC threads.

An additional benefit of using upcrun is that it provides consistent support for propagating environment variables to all threads of your UPC program. If you use upcrun, any environment variable beginning with either 'UPC_' or 'GASNET_' is guaranteed to be propagated to all threads (support for propagating all environment variables is planned). If you do not use upcrun, environment propagation will only work to the extent that the parallel job launcher you use provides it normally.

You can see how upcrun thinks your job should be run without actually running it by passing the '-t' flag to it. Also, 'upcrun -i <executable>' will provide information about a Berkeley UPC executable, such as the network API that it was built against, and the number of fixed threads (if any) that it was compiled for.

See 'upcrun --help' or the upcrun man page for more information.

Setting the amount of shared memory available to your applications

At startup, each Berkeley UPC thread reserves a fixed portion of its address space (via the 'mmap()' system call) for shared memory. This address range cannot be used for regular unshared (i.e., malloc) memory allocations, and it also sets an upper bound on the amount of shared memory (per thread) that the program can use: a UPC program will die with a fatal error if any thread tries to allocate more shared memory than it reserved at startup.

The default amount of shared memory to reserve per UPC thread on a system is chosen at configure time (see the INSTALL document in the runtime distribution for details), but you can override that value for a particular application either at compile time, or at startup. Generally this is only needed if you observe that your application is running out of either shared or regular C memory.

To embed a different default amount of shared memory into your application, simply pass '-shared-heap=144MB' for instance (to get 144 megabytes per UPC thread). You can also use 'GB' for gigabyte amounts (if neither 'MB' nor 'GB' is used, megabytes are assumed). To override the embedded default amount of shared memory at application startup, set the UPC_SHARED_HEAP_SIZE environment variable to whatever value you want ('2GB', etc.).

Note: The Berkeley UPC runtime currently defaults to a limit of 2 gigabytes of shared memory per process (i.e., if you are using pthreads, this limit is shared by the pthreads within each process; otherwise the limit is per UPC thread). If your system can support more than this, you may configure the runtime to use a different maximum with 'configure --with-shared-mmap-max=16GB' (for 16 gigabytes per process, etc.). The need to explicitly configure the runtime for large shared memory support will be removed in a future release.

While it is tempting to simply grab an extremely large shared memory segment, be aware that this is not always a good idea, or even possible. Since the shared address space range cannot be used for regular malloc allocations, creating too large a shared space can cause the amount of regular heap memory available to your application to become small (causing malloc to eventually return NULL when you request more memory). Also, the shared memory space is reserved via an mmap() call, and while this does not generally cause any physical memory pages to be allocated, certain operating systems (for instance, Linux) will not allow more memory to be reserved by applications than the OS can guarantee is available, and so allocating a shared region larger than the physical memory (plus swap space) may fail.

The default amount of shared memory per UPC thread can be changed system-wide by modifying the 'shared_heap' parameter in the installation's upcc.conf file. You can override the system-wide default for your own applications by setting shared_heap in your $HOME/.upccrc file.

The upcc.conf file also provides a 'heap_offset' parameter (and upcc provides a '-heap-offset' flag) that affects where the address region for shared memory is located in your program. However, at present it is not useful on any of our supported systems, and so we do not recommend its use.


Using pthreaded Berkeley UPC programs

Berkeley UPC supports creating executables that use pthreads to optimize communications between UPC threads running on the same machine. To utilize pthreads, pass the '-pthreads' option to upcc.

The '-pthreads' flag must be passed consistently at all stages of compilation and linking. Also, when pthreads are used, upcc needs to delay much of the compilation of your code until link time, so if you split code generation into separate compilation and linking steps (i.e., 'upcc -c foo.upc', followed by 'upcc foo.o bar.o'), you need to pass any macro and/or include path directives (ex: '-DFOO=bar -I/usr/local/include') to upcc in both the compilation and link commands.

Any C libraries that your code links against must be thread-safe in order to be used with -pthreads. If one or more of your libraries is not thread-safe, you must compile without pthreads, and run separate processes on the same machine to exploit an SMP system. Currently, such processes will not use any shared memory optimizations, and will communicate with each other via the network API. While this is generally still much faster than communicating with UPC threads on other nodes, it is still not as fast as using shared memory. Support for shared memory between non-pthreaded Berkeley UPC processes will be provided in the near future.

When you link an application with '-pthreads', a subdirectory named <executable_name>_pthread-link will be created in the current directory. This directory exists in order to speed up further linking commands of the same program. If you link the same application again with the same object file names, and none of the global static unshared variables in your program have changed name or size, recompilation of all the files in your application can be avoided, which can make a significant difference in build time for programs with many source files. You may delete the temporary directory at any time without any side effects (other than possibly longer link times).

Unless otherwise specified, pthreaded UPC applications use a default number of pthreads per process (run 'upcc --version' to see the default for your system). This number is set in the upcc.conf configuration file, and can be changed there (or in your '$HOME/.upccrc' file). It can also be overridden in several ways. Compiling with 'upcc -pthreads=<NUMBER>' changes the default number of pthreads per UPC process for an executable to NUMBER. If the 'UPC_PTHREADS_PER_PROC' environment variable is set to a nonzero integer when you run a UPC program, it will override any default value. Finally, upcrun is smart about pthreads in two ways. First, if you run a pthreaded parallel job with 'upcrun -n <NUMBER> ...', the number of processes actually launched will be NUMBER divided by the number of pthreads per process, so that exactly NUMBER UPC threads are used. Second, if you use the smp network option (which generates a non-parallel executable that will run only a single process), upcrun will automatically set the number of pthreads to NUMBER.


Analyzing UPC Programs with 'upc_trace'

As of version 2.0, Berkeley UPC includes 'upc_trace', a tool for analyzing the communication behavior of UPC programs. When run on the output of a trace-enabled Berkeley UPC program, 'upc_trace' provides information on which lines of code in your UPC program generated network traffic: how many messages the line caused, what type they were (local and/or remote gets/puts), and what the maximum/minimum/average/combined sizes of the messages were.

How to use 'upc_trace':

  1. You must compile your application with a copy of upcc that was configured with '--enable-trace'. Note that '--enable-debug' implies tracing by default (i.e. pass '--disable-trace' if you do not want tracing enabled).

  2. You must run your application with 'upcrun -trace ...' or 'upcrun -tracefile TRACE_FILE_NAME ...'. Either of these flags causes your UPC executable to dump out tracing information while it executes. The '-trace' flag causes one file per UPC thread to be generated, with the name 'upc_trace-a.out..-N', where 'a.out' is the name of your executable, and 'N' is the UPC thread's number. The '-tracefile NAME' option lets you specify your own name for the tracing file(s): if the name contains a '%' character, one trace file per thread is generated, with the '%' replaced with the UPC thread's number. Otherwise, all threads will write to the same file.
    Note that running with tracing may slow down your application considerably: the exact amount depends on your filesystem, and the ratio of communication/computation in your program.

  3. After your application has completed, you may run 'upc_trace' on one or more of the trace files generated by your program run:
    1. Running 'upc_trace' on a trace file generated by a single UPC thread shows the information only for that thread. If you pass multiple files from the same application run, the information for the various threads is coalesced, so passing in all the tracefiles generated by a run allows you to see information for the entire application.
    2. There are a number of flags to 'upc_trace' which control what kinds of information are reported, and how they are sorted. See 'upc_trace --help' or the upc_trace man page for details.
    3. Note that upc_trace may take a while to run, especially on large tracefiles. We plan to optimize its performance in the future.


Debugging Berkeley UPC programs

There is currently no support in Berkeley UPC for debugging programs at the UPC source level. However, we are currently working with Etnus to provide support for Berkeley UPC within the TotalView debugger.

In the meantime, Berkeley UPC does come with several mechanisms for attaching a regular C debugger to one or more of your UPC application's threads at various points during execution. This can be very useful if you wish to submit a helpful bug report to us. See our Debugging Berkeley UPC programs page for more information.


Berkeley-specific extensions to the UPC Language

Non-blocking and non-contiguous memcpy functions

As of 2.0, Berkeley UPC fully implements the set of non-blocking and non-contiguous extensions to 'upc_memcpy()' described in Proposal for Extending the UPC Memory Copy Library Functions. See that document for details on the functions and their usage.

The 'bupc_all_reduce_all' function family

This is an extension to the UPC Collectives Specification. The 'bupc_all_reduce_all' functions behave identically to the 'upc_all_reduce' functions, except that the 'dest' argument has the semantics of the 'dest' argument to 'upc_all_broadcast', i.e., the result of the reduction is broadcast to all threads, instead of just one.

The 'bupc_dump_shared' function

Shared pointers in UPC are logically composed of three fields: the address of the data that the shared pointer currently points to, the UPC thread on which that address is valid, and the 'phase' of the shared pointer (see the official UPC language specification for an explanation of shared pointer phase). Our version of UPC provides a 'bupc_dump_shared' function that will write a description of these fields into a character buffer that the user provides:
    int bupc_dump_shared(shared const void *ptr, char *buf, int maxlen);
Any pointer to a shared type may be passed to this function. The 'maxlen' parameter gives the length of the buffer pointed to by 'buf', and this length must be at least BUPC_DUMP_MIN_LENGTH, or else -1 is returned and errno is set to EINVAL. On success, the function returns 0. The buffer will contain either "<NULL>" if the pointer to shared == NULL, or a string of the form
    "<address=0x1234 (addrfield=0x1234), thread=4, phase=1>" 
The 'address' field provides the virtual address for the pointer, while the 'addrfield' contains the actual contents of the shared pointer's address bits. On some configurations these values may be the same (if the full address of the pointer can be fit into the address bits), while on others they may be quite different (if the address bits store an offset from a base initial address that may differ from thread to thread).

Both bupc_dump_shared() and BUPC_DUMP_MIN_LENGTH are visible when any of the standard UPC headers (upc.h, upc_relaxed.h, or upc_strict.h) are #included.

The 'bupc_poll' function

The 'bupc_poll()' function explicitly causes the UPC runtime to attempt to make progress on any network requests that may be pending.

You will normally not need to call this function, as the runtime will automatically perform checks for incoming network requests whenever your UPC code causes network activity to be performed, and this usually occurs fairly frequently in a UPC application. However, if you are writing your own 'spin lock' style synchronization, you may need to use this function to avoid deadlock. Here is an example:

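    shared strict int flag[THREADS];

    ...

    if (MYTHREAD % 2) {
        /* odd thread: spin until the even thread below it sets our flag */
        while (flag[MYTHREAD] == 0)
            bupc_poll();
    } else {
        ... some calculation ...
        /* even thread: signal the odd thread above it, if there is one */
        if (MYTHREAD + 1 < THREADS)
            flag[MYTHREAD + 1] = 1;
    }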
Here the 'even' UPC threads are performing some calculation, then informing the 'odd' threads that the result is ready by setting a per-thread flag. If the 'bupc_poll()' were omitted, the 'odd' threads might (on certain platforms/networks) consume all of the CPU forever in the 'while' test, never checking for the incoming network message that would set flag[MYTHREAD].

If a program contains computationally intensive sections in which no remote accesses are performed for a long time, it is also possible that performance may be improved by intermittently calling bupc_poll, particularly if other threads are likely to be performing remote accesses (or memory allocation requests) during this time.

Behavior of the 'getenv' function

It is not well-defined in the UPC specification whether the standard 'getenv' function should return the same values on all threads, and/or if these values should include those present in the environment of the process that launches the UPC application.

Berkeley UPC guarantees that 'getenv' allows retrieval of environment variable values that were present when the job was launched. At present this function is only guaranteed to retrieve these values for all threads if the environment variable's name begins with 'UPC_' or 'GASNET_'. On some platforms all environment variables seen by the job launcher may be propagated, but it is not portable to rely on this.

The 'setenv()' and 'unsetenv' functions are not guaranteed to work in a Berkeley UPC runtime environment, and should be avoided.

The 'bupc_assert_type' built-in

The 'bupc_assert_type(expr, type)' built-in operation allows testing for compile-time type equality, and is primarily used by our UPC compiler test suite.
  1. 'expr' = any arbitrary (legal) UPC expression
  2. 'type' = any legal C/UPC type

If 'expr' has a static type which is identical to 'type', the built-in does nothing. Otherwise, it prints a non-fatal warning containing the line number and a description of the two differing types.


Known bugs and limitations

This release of Berkeley UPC has a number of known limitations and bugs:

Preprocessor macros defined in UPC files must not affect .h files

Berkeley UPC translates your UPC programs into C code, then runs a regular C compiler on your system to generate object code. To avoid handling vendor-specific inline assembly code that appears in some header files on many of the various systems we run on, we currently have our UPC-to-C translator 'put back' all non-UPC header files (i.e., .h files which don't contain any UPC constructs), which are then handled by the regular C compiler (we do not support placing inline assembly in your UPC code). A side effect of this process is that the preprocessor is run twice on your program. Since any #defined macros you place in your UPC code are expanded (and their definitions forgotten) the first time the preprocessor is run, these macros will not be present the second time .h files are included. Thus, UPC code such as
    #define NDEBUG
    #include <assert.h>
will not work as expected if the NDEBUG definition modifies the behavior of assert.h (which, in this example, it does: this NDEBUG/assert.h case is the most common case where users run into this issue with our compiler).

There is a simple workaround: if you need to define a macro that affects the behavior of #included files, define it on the command line to upcc:

    upcc -DNDEBUG myprogram.upc

Other known limitations/bugs


Feedback

Please contact us with your bug reports, comments, and suggestions.

Thank you for using Berkeley UPC!

