This document describes the interface between the UPC compiler and the UPC runtime for handling static user data (both shared and unshared) in UPC programs.
Within this document, 'static' user data means 'not dynamically allocated' (i.e., not allocated on the stack, nor with malloc(), upc_all_alloc(), or any other memory allocation function). All of a user's global and static variables in the regular C sense are static user data for the purposes of this document.
Allocating and initializing static data in UPC is much more challenging than in regular C, where the linker simply gathers up the static data defined in various object files and places it in an executable along with any initial values (all of which are, in C, known at link time at the latest). In UPC we cannot always know the size, location, or initial value of a variable at link time, and thus support from the runtime layer is needed to properly allocate and initialize static data. In the Berkeley UPC compiler, the mechanisms we use to set up static data also require us to refer to it specially during program execution.
The following example shows data definitions from two UPC files (and a shared '.uph' header file) that are part of the same program--we will use this example to illustrate the steps that need to be taken with static UPC data. [Working UPC and C files for all the code shown in this web page can be found in the 'tests/foo_bar' subdirectory of the UPC Runtime distribution].
[Source listings: foobar.uph, foo.upc, bar.upc]
As we can see, there are two types of data in a user's UPC program that we have to deal with: shared variables, which all UPC threads can see, and unshared variables, which are visible only to a single UPC thread. Note that 'pfoo' in bar.upc is NOT a shared variable: it is a local pointer, which happens to point to the shared integer type (it is of course not a 'normal' pointer, since pointing to a shared variable requires more information than an address, but this is a separate issue from whether it is itself shared or unshared). On the other hand, 'pbar' in foo.upc is a shared variable: it is a shared pointer to a shared integer. Also note that we've made the situation tricky by placing some of our pointers (pfoo and pquux) in a different file than the variable they are initialized to point to: in a regular C program, the linker handles resolving all such addresses, but in the UPC case things are not so simple...
In the Berkeley UPC compiler, .upc files are translated into .c files in which all UPC-specific constructs have been translated into C code. Below are two hand-translated .c files that should be similar to those that the UPC compiler emits. Don't try to understand them all at first glance (especially the initialization code at the bottom of each file): the remainder of this document will go over each element in turn.
[Source listings: foo.c, bar.c]
Properly converting multiple tentative definitions into a single variable requires special support from the linker (compilers cannot know when they see 'int foo;' whether the variable will be initialized in a different file).
Duplicate tentative definitions are rare in real code, and typically show up only in older C code. The 'extern' keyword is now typically used to avoid multiple definitions. However, since the UPC specification states that UPC officially follows the ANSI/ISO C specification except where explicitly noted otherwise, a UPC compiler ought to handle them. This specification contains a fair amount of logic dedicated specifically to handling tentative definitions correctly (although one of our two alternatives for handling unshared global UPC variables declared by the user does not currently support them completely, as explained later).
Note: the 'phaseless' upcr_pshared_ptr_t type is used (to save space and/or make address calculation easier) when the variable is either a scalar value that will live only on thread 0, or an array that either exists entirely on a single thread (i.e. is indefinitely blocked), or which uses the default UPC blocking of one element per block.
    /*** UPC code ***/
    shared int foo = 3;
    shared int bar;

    double do_sum()
    {
        static shared [3] double messy[16][4*THREADS] = { ... };
    }

    /*** Translated C code ***/
    upcr_pshared_ptr_t foo = UPCR_INITIALIZED_PSHARED;
    upcr_pshared_ptr_t bar;
    static upcr_shared_ptr_t do_sum_messy = UPCR_INITIALIZED_SHARED;

A couple of points are worth noting here.
The UPCR_INITIALIZED_{P}SHARED values are provided in upcr.h (as their values can differ across shared pointer representations), as are a pair of upcr_is_init{p}val() functions for testing pointers for that value. We use UPCR_INITIALIZED_PSHARED in foo.c to initialize the variables 'foo' and 'pbar': if 'foo' were also initialized in bar.upc, the linker would catch the error.
    double tmp;
    upcr_get_shared(&tmp, do_sum_messy, sizeof(double)*(16*i + j), sizeof(double));
    total += tmp;

Note that any optimizations performed by the compiler to avoid, schedule, or coalesce network traffic are performed above the level of the UPC runtime--the code here, for instance, might be altered by an enterprising compiler to use a single block copy per thread.
The allocation function must contain upcr_startup_{p}shalloc_t structs with the allocation information for each proxy pointer defined in the file:
    upcr_startup_pshalloc_t pinfos[] = {
        { &foo, sizeof(int), 1, 0 },
        { &bar, sizeof(int), 1, 0 }
    };
    upcr_startup_shalloc_t infos[] = {
        { &do_sum_messy, 3*sizeof(double), 16*4*sizeof(double), 1 }
    };

    /* Allocate shared data */
    upcr_startup_pshalloc(pinfos, sizeof(pinfos) / sizeof(upcr_startup_pshalloc_t));
    upcr_startup_shalloc(infos, sizeof(infos) / sizeof(upcr_startup_shalloc_t));

A call to upcr_startup_{p}shalloc() is then made to actually allocate the shared memory for each proxy pointer (and spread the information about it to all of the nodes/threads in the UPC job). The function takes the address of the proxy pointer, the size and number of blocks of shared memory to allocate, and a flag indicating if the number of blocks should be multiplied by THREADS. The function also performs a bzero() on the data if it was never initialized by the user (which can be determined by noting whether the proxy pointer's initial value was UPCR_INITIALIZED_{P}SHARED or not).
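The allocation step just described can be sketched in plain C. This is only a single-process mock under stated assumptions: every `mock_` name is hypothetical, the sentinel-based "was it initialized?" test stands in for whatever upcr.h really does, and ordinary malloc() replaces the runtime's distributed shared-memory allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a proxy pointer; NOT the real upcr.h type. */
typedef struct { void *addr; } mock_shared_ptr_t;

/* Assumed sentinel meaning "the user supplied an initializer". */
static char mock_initialized_tag;
#define MOCK_INITIALIZED_SHARED { &mock_initialized_tag }

typedef struct {
    mock_shared_ptr_t *proxy; /* address of the proxy pointer         */
    size_t blockbytes;        /* size of one block, in bytes          */
    size_t numblocks;         /* number of blocks                     */
    int    mult_by_threads;   /* multiply numblocks by thread count?  */
} mock_shalloc_t;

/* Allocates memory for each proxy pointer and zero-fills the ones the
 * user never initialized. Returns how many were zero-filled. */
static size_t mock_startup_shalloc(mock_shalloc_t *infos, size_t count,
                                   size_t threads)
{
    size_t zeroed = 0;
    assert(infos != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        size_t blocks = infos[i].numblocks *
                        (infos[i].mult_by_threads ? threads : 1);
        size_t bytes = infos[i].blockbytes * blocks;
        int user_init = (infos[i].proxy->addr == &mock_initialized_tag);
        void *mem = malloc(bytes);
        if (mem == NULL)
            abort();
        if (!user_init) {          /* plays the role of the bzero() */
            memset(mem, 0, bytes);
            zeroed++;
        }
        infos[i].proxy->addr = mem; /* proxy now refers to real storage */
    }
    return zeroed;
}
```

The point of the sketch is the ordering: the proxy pointer's initial value is inspected before it is overwritten with the address of the newly allocated memory.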
In the initialization function for the file, all shared data that was initialized by the user must be assigned the correct values. Scalar shared values will all have affinity to thread 0, and so only that thread should run the code that sets the values. Here, for instance, is the relevant code from foo.c:
    /* Explicit initializations of variables living only
     * on UPC thread 0 */
    if (upcr_mythread() == 0) {
        *((int*)upcr_pshared_to_local(foo)) = 3;
        *((upcr_shared_ptr_t*)upcr_pshared_to_local(pbar)) = bar;
    }

[Note that casting to local pointers is not the only way to achieve this--it was done here since calling upcr_put_pshared() would have first required storing the '3' in a temporary variable, and the author was feeling lazy. Compilers may generate any code that correctly does the job].
For arrays that are striped across UPC threads, initialization is trickier, and a helper function called upcr_startup_initarray() is provided. It takes a pointer to a local array from which the initial values for the shared array will be taken, and a set of information for each dimension of the arrays. Each thread initializes only the portion of the array which has affinity to it, to avoid unneeded network traffic. If the local array is not as large as the shared array, the remainder of the shared array is filled with 0s.
    double init_messy[1][5] = { { 1, 2, 3, 4, 5 } };
    upcr_startup_arrayinit_diminfo_t init_messy_info[] = {
        { 1, 16, 0 },
        { 5, 4, 1 }
    };
    upcr_startup_initarray(do_sum_messy, init_messy, init_messy_info, 2, sizeof(double), 3);

See the UPC Runtime Specification for more details on the parameters and behaviors of these functions.
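To illustrate the "each thread initializes only its own portion" rule, here is a minimal one-dimensional sketch. It assumes the standard UPC block-cyclic layout, in which element i of an array with block size B belongs to thread (i / B) % THREADS. The function names are hypothetical, the shared array is simulated by an ordinary local array, and the multi-dimensional bookkeeping that the real helper receives is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Thread with affinity to element idx under a block-cyclic layout. */
static size_t owner_thread(size_t idx, size_t blocksize, size_t threads)
{
    assert(blocksize > 0 && threads > 0);
    return (idx / blocksize) % threads;
}

/* Thread 'me' writes only the elements it owns: values come from 'init'
 * while they last, and zeros afterwards. Returns the number written. */
static size_t init_my_elements(double *shared_sim, size_t shared_len,
                               const double *init, size_t init_len,
                               size_t blocksize, size_t threads, size_t me)
{
    size_t written = 0;
    for (size_t i = 0; i < shared_len; i++) {
        if (owner_thread(i, blocksize, threads) != me)
            continue;                 /* another thread's element */
        shared_sim[i] = (i < init_len) ? init[i] : 0.0;
        written++;
    }
    return written;
}
```

With a block size of 3 and two threads, thread 1 would touch indices 3-5 and 9-11 only, so no thread ever writes to memory with affinity to another thread.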
There are various ways to transform global/static data into thread-local data. The Berkeley UPC compiler supports two methods: a 'global struct' approach, and a 'tld section' approach. Both strategies cause all such data across all files to be coalesced into a single region, a copy of which is made for each thread. References to thread-local variables are then transformed into offsets into the current thread's region.
Each strategy has its disadvantages: the 'global struct' approach occasionally requires all .c files in a UPC application to be recompiled, and uses more memory at runtime. The 'tld section' strategy requires compiler and linker behaviors that are not portable across different C compilers.
While this discussion is concerned specifically with the case when the UPC compiler is generating C output, the strategies (especially the 'tld section' approach) should also be relevant to UPC compilers that generate straight to object code.
    extern int defined_somewhere_else; /* same in both .upc and .c output */

There is an important exception to this rule--unshared pointers to shared data still need to be transformed into upcr_shared_ptr_t's:

    extern shared int *pint;          /* in .upc */
    extern upcr_pshared_ptr_t pint;   /* in .c output */

Although the UPC compiler need not transform an 'extern' declaration itself, it does need to note the fact that the data in question is thread-local, since such items are not referred to in the normal way, as we will see below.

    int mcfoobar = 999; /* in .upc */

    /* in .c output */
    int UPCR_TLD_DEFINE(mcfoobar, 4) = 999;

The macro takes the name and size (in bytes) of the variable. The full type of the definition must come before the macro, so
    int natural[3] = { 1, 2, 3 };
    void (*int_taker)(int) = &print_int;

cannot be transformed into

    int UPCR_TLD_DEFINE(natural, 12)[3] = { 1, 2, 3 };
    void (*UPCR_TLD_DEFINE(int_taker, 4))(int) = &print_int;

Instead the UPC compiler must declare typedefs for array and function pointer definitions:

    typedef int _type_natural[3];
    _type_natural UPCR_TLD_DEFINE(natural, 12) = { 1, 2, 3 };
    typedef void (*_type_int_taker)(int);
    _type_int_taker UPCR_TLD_DEFINE(int_taker, 4) = &print_int;

Finally, static unshared definitions must be promoted to regular (non-static) type and global scope, and when this is done, their names must be mangled to avoid any name collisions with other global variables that may exist in other files (the 'suspects' array in foo.upc is an example of such a variable). Such mangling should be done in a deterministic fashion, so that the name of the variable is not changed across compilations unnecessarily (it is OK for the name to change whenever the set of names/sizes of other global unshared data changes, but it should not change otherwise).
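One way to get deterministic mangled names is to append a hash of the defining file's pathname to the variable name. The sketch below is only an assumption about how this could be done: the text does not say which scheme the compiler uses for variables, the 32-bit FNV-1a hash is simply a convenient well-known choice, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* 32-bit FNV-1a hash of a NUL-terminated string. */
static uint32_t fnv1a32(const char *s)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Writes "<name>_<8 hex digits of hash(path)>" into out.
 * Returns the length written, or -1 if out is too small. */
static int mangle_static_name(char *out, size_t outsz,
                              const char *name, const char *path)
{
    int n;
    assert(out != NULL && name != NULL && path != NULL);
    n = snprintf(out, outsz, "%s_%08lx", name,
                 (unsigned long)fnv1a32(path));
    return (n >= 0 && (size_t)n < outsz) ? n : -1;
}
```

Because the result depends only on the variable name and the file path, recompiling an unchanged file yields the same symbol, while a static variable of the same name in a different file gets a different suffix (barring hash collisions).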
    int quux;

at file scope in foo.upc becomes

    int UPCR_TLD_DEFINE_TENTATIVE(quux, 4);

in foo.c.
The macro otherwise works identically to UPCR_TLD_DEFINE.
To link and operate correctly with regular C libraries, UPC must not treat data it sees in .c/.h files as thread-local variables: instead it must treat them as regular global variables. Variables are recognized as being external C variables if they are declared/defined in a #included .h or .c file.
Of course, for this strategy to work with a pthreaded UPC process, all linked C code must be thread-safe. UPC applications which need to use non-thread-safe C code or libraries should compile and run their UPC code as single-threaded executables.
    assert (quux == 0 || quux == 1);

in foo.upc must be converted into

    assert( *((int*)UPCR_TLD_ADDR(quux)) == 0 ||
            *((int*)UPCR_TLD_ADDR(quux)) == 1);

in foo.c.
    int *pquux = &quux;

must be handled specially (since the address of quux will be different on different pthreads). Local pointers to shared data also require special treatment:

    shared int *pfoo = &foo;

cannot be correctly assigned until the shared memory for 'foo' is allocated at startup. The UPC compiler must recognize all such special cases, and perform the appropriate assignments in each file's initialization function (information on the per-file allocation/initialization functions is provided later in this document). The above two definitions in bar.upc, for instance, cause the following special logic in bar.c's initialization function:

    (*((int**)UPCR_TLD_ADDR(pquux))) = UPCR_TLD_ADDR(quux);
    (*((upcr_shared_ptr_t*)UPCR_TLD_ADDR(pfoo))) = foo;
The compiler directives used in the explanation below are all specific to the GNU GCC compiler. They also may not work (even with GCC) if the target machine does not support the ELF object format. Other C compilers may use different compiler/linker directives to achieve the same effect, or may not support the strategy at all. For this reason UPC compilers which target C code as their output may find it easier (and more portable) to use the 'global struct' strategy. Authors of UPC compilers which directly produce object code, however, will probably find the 'tld section' approach more natural within a compiler context.
Finally, as specified here, the tld section approach does not support multiple tentative definitions of the same UPC variable in multiple files (it does support them for variables defined in external C header files). It has not yet been determined if full support for tentative definitions is achievable under the tld section approach--at a minimum it appears that a custom linker script would need to be written to make them work. In the worst case it could certainly be done by modifying the linker itself.
    int UPCR_TLD_DEFINE(jrandomvariable, 4) = 9;

becomes

    int jrandomvariable __attribute__((section(".upc_tld"))) = 9;

Any ELF-compatible linker will automatically coalesce the '.upc_tld' sections from the various object files into a single, contiguous '.upc_tld' section in the executable.
You will note that we do not mention the UPCR_TLD_DEFINE_TENTATIVE macro here. This is because we have not yet figured out a way to get it to work correctly.
In regular C tentative definitions are placed in a special 'common' section of .o files. Multiple definitions of the same variable are permitted to exist in the various object files that are linked to form an executable, so long as at most one such variable is in an initialized data section. At link time the linker examines each variable defined in the common sections of the objects to be linked: if an initialized value exists, it is used, otherwise the object is created in the 'BSS' (i.e. it is created with an initial value of 0).
The gcc documentation states that the __attribute__((section)) directive only works with initialized values, and is ignored for uninitialized variables. In actuality, at least in recent gcc versions, the directive does not get ignored, and instead causes the variable to be put in the desired section with an initial value of 0. This, alas, is not sufficient, since if the same variable appears in multiple object files (even with the same initial 0 value), the linker declares a duplicate symbol error. One can avoid linker errors by causing the UPCR_TLD_DEFINE_TENTATIVE macro to use __attribute__((weak)), but this in turn causes the 'section' attribute to be ignored, so the variable will not be made thread-local.
It may be possible to have UPCR_TLD_DEFINE_TENTATIVE use a different section name (e.g. .upc_tld_common), and then somehow write a linker script that will treat that section with the common section's semantics at link time, but it is not known if this will work (the author's several pleas for help on the gnu.gcc Usenet group have gone unanswered).
Another alternative may simply be to ban the use of multiple tentative definitions within UPC code, while supporting them for extern "C" code. Multiple tentative definitions can always be trivially avoided without any change in program semantics via the addition of an 'extern', and programmers writing new UPC code are unlikely to even notice the absence of full support for tentative definitions (C++, for instance, does not use tentative definitions--'int foo;' is equivalent to 'int foo = 0;'--but few programmers are even aware of this difference). Old C libraries may place tentative definitions in their header files, but since such code is treated as 'extern C' by the UPC compiler (and hence will not be converted into thread-local data), such definitions will still be handled correctly. If it is decided that support for multiple tentative unshared UPC variables is not needed, UPCR_TLD_DEFINE_TENTATIVE can simply be #defined to UPCR_TLD_DEFINE, and single tentative definitions will work correctly.
The UPC compiler will arrange to have the starting address and length of the .upc_tld segment written into two 'well-known' variables that are visible to the UPC runtime. This will probably need to be done in a linker script.
    UPCR_TLD_ADDR(foo)

which will return the equivalent of

    (tld_addrs[MYTHREAD] + (((uintptr_t)&foo) - upc_tld_start))

cast to a void pointer.
Since the 'tld_addrs[MYTHREAD] - upc_tld_start' portion needs to be computed only once, at startup, and can then be stored as a separate 'tld_offset[MYTHREAD]' variable, the cost of a lookup can be optimized to

    (tld_offset[MYTHREAD] + (uintptr_t)&foo)

On most architectures, this should translate into a single indexed load instruction (assuming the value of tld_offset[MYTHREAD] is cached in a register).
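The offset arithmetic can be tried out in a small single-process simulation. Everything here is a hypothetical stand-in: a struct plays the role of the coalesced '.upc_tld' section, malloc() provides each simulated thread's private copy, and the `sim_` names are not part of the runtime. The sketch also relies on pointer/integer round trips through uintptr_t behaving as plain address arithmetic, which holds on mainstream platforms but is implementation-defined in ISO C.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIM_THREADS 2

/* Stand-in for the linker-coalesced section holding initial values. */
static struct { int quux; int jrandomvariable; } sim_tld_template = { 0, 9 };

/* Per-thread (copy base - template base), computed once at startup. */
static uintptr_t sim_tld_offset[SIM_THREADS];

static void sim_tld_startup(void)
{
    for (int t = 0; t < SIM_THREADS; t++) {
        void *copy = malloc(sizeof sim_tld_template);
        if (copy == NULL)
            abort();
        memcpy(copy, &sim_tld_template, sizeof sim_tld_template);
        /* Unsigned arithmetic wraps, so a "negative" offset is fine. */
        sim_tld_offset[t] = (uintptr_t)copy - (uintptr_t)&sim_tld_template;
    }
}

/* Maps the address of a variable in the template region to the address
 * of the same variable in the given thread's private copy. */
static void *sim_tld_addr(int thread, void *template_addr)
{
    assert(thread >= 0 && thread < SIM_THREADS);
    return (void *)(sim_tld_offset[thread] + (uintptr_t)template_addr);
}
```

After `sim_tld_startup()`, a write through `sim_tld_addr(0, &sim_tld_template.jrandomvariable)` changes only thread 0's copy; thread 1's copy and the template keep the initial value 9.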
    UPCR_TLD_DEFINE(suspects_MANGLED, 8)
    UPCR_TLD_DEFINE(quux, 4)

Note that any UPCR_TLD_DEFINE_TENTATIVE definitions are transformed into regular UPCR_TLD_DEFINEs in the .tld file (we do not need to distinguish between them here). Also, any duplicate definitions are discarded. Also note the lack of semicolons in the .tld file. Finally, the grep-like script will only overwrite an existing .tld file if its contents are different (this will only happen if a variable has been added/deleted/renamed, or its size has changed).
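The "only overwrite when the contents differ" behavior matters because leaving an unchanged file alone preserves its modification time, which is what timestamp-based rebuild checks look at. A minimal sketch of that step follows; `write_if_changed` is a hypothetical helper name, and the actual script may well be implemented differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes 'contents' to 'path' only if the file is missing or differs.
 * Returns 1 if the file was written, 0 if it was left untouched,
 * and -1 on an I/O error. */
static int write_if_changed(const char *path, const char *contents)
{
    size_t len, i = 0;
    FILE *f;

    assert(path != NULL && contents != NULL);
    len = strlen(contents);

    f = fopen(path, "rb");
    if (f != NULL) {
        int same = 1, c;
        while ((c = fgetc(f)) != EOF) {
            if (i >= len || (unsigned char)contents[i] != (unsigned char)c) {
                same = 0;              /* byte mismatch or file too long */
                break;
            }
            i++;
        }
        if (same && i != len)
            same = 0;                  /* existing file is too short */
        fclose(f);
        if (same)
            return 0;
    }

    f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    if (fwrite(contents, 1, len, f) != len) {
        fclose(f);
        return -1;
    }
    return (fclose(f) == 0) ? 1 : -1;
}
```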
    int quux;

These regular global variables serve several purposes. First, they store the initial value (if any) for the definition. Second, they will cause the linker to catch any errors from the user initializing the value in multiple files. Third, the linker will handle tentative definitions of these variables correctly. These variables are otherwise unused in the final executable, and this is what makes the global struct approach consume more memory than the tld section approach (which can use the initial coalesced linker section of thread-local data as thread 0's section, only making copies for further threads).
    #define UPCR_TLD_DEFINE(name, size) char name[size];
    struct upcr_tld {
    #include "upcr_global_tld.tld"
    };
    #undef UPCR_TLD_DEFINE

    /* at some point later in upcr.h or a file it includes... */
    #define UPCR_TLD_DEFINE(name, size) name

All variables are declared as the same type--arrays of char. This is done because it is virtually impossible to assemble the full set of type information that would be needed to use the real types of the variables as they are declared in various scopes and .c files (the same type name may legally be used in different files/scopes to refer to different typedefs/structs, and correct ordering of type declarations is also difficult). Since the UPCR_TLD_ADDR macro returns a void * (and the compiler will always know what type to cast it to), this is not a problem. Alignment issues can be solved by sorting the definitions in the global tld file by size, and/or by padding the sizes of variables passed to UPCR_TLD_DEFINE{_TENTATIVE}.
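The padding option mentioned above can be illustrated with a short calculation. This sketch assumes a single fixed alignment is applied to every entry, which is just one possible policy; `pad_size` and `layout_offsets` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds size up to the next multiple of align. */
static size_t pad_size(size_t size, size_t align)
{
    assert(align > 0);
    return (size + align - 1) / align * align;
}

/* Computes the offset of each entry when entries are laid out back to
 * back using padded sizes. Returns the total padded size. */
static size_t layout_offsets(const size_t *sizes, size_t *offsets,
                             size_t n, size_t align)
{
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        offsets[i] = off;   /* always a multiple of align */
        off += pad_size(sizes[i], align);
    }
    return off;
}
```

For sizes 4, 12, and 1 with an alignment of 8, the padded sizes are 8, 16, and 8, so the entries start at offsets 0, 8, and 24. Note that this only aligns fields relative to the start of the struct; the struct's own storage must also be suitably aligned, since a struct of char arrays has no alignment requirement of its own.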
    #include <string.h>
    #include "upcr_global_tld.h"

    #undef UPCR_TLD_DEFINE
    #define UPCR_TLD_DEFINE(name, size) extern int name;
    #include "upcr_global_tld.tld"

    void upcri_startup_init_tld(struct upcr_tld *tld)
    {
    #undef UPCR_TLD_DEFINE
    #define UPCR_TLD_DEFINE(name, size) memcpy(&tld->name, &name, size);
    #include "upcr_global_tld.tld"
    }

The function uses the values of the global variables left in the .c files as the source for initial values. Any initializations for which this simple memcpy is not sufficient must be handled in the special per-file initialization functions.
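The whole global-struct mechanism can be condensed into one self-contained file for experimentation. In this sketch a list macro (`SIM_TLD_LIST`) stands in for the generated .tld file that the real scheme #includes, and all `sim_`/`SIM_` names are hypothetical; the globals `quux` and `mcfoobar` play the role of the variables left in the .c files.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the generated .tld file: one X(name, size) per variable. */
#define SIM_TLD_LIST(X) \
    X(quux, sizeof(int)) \
    X(mcfoobar, sizeof(int))

/* Regular globals that hold the initial values. */
int quux;
int mcfoobar = 999;

/* First expansion: every variable becomes a char-array struct field. */
#define SIM_FIELD(name, size) char name[size];
struct sim_tld { SIM_TLD_LIST(SIM_FIELD) };

/* Second expansion: copy each global's initial value into the struct. */
#define SIM_COPY(name, size) memcpy(tld->name, &name, size);
static void sim_init_tld(struct sim_tld *tld)
{
    assert(tld != NULL);
    SIM_TLD_LIST(SIM_COPY)
}

/* Address of a variable inside one thread's struct, as a void pointer. */
#define SIM_TLD_ADDR(tld, name) ((void *)(tld)->name)

/* Allocates and initializes the region for one simulated thread.
 * malloc() is used so that the storage is suitably aligned for int. */
static struct sim_tld *sim_new_tld(void)
{
    struct sim_tld *tld = malloc(sizeof *tld);
    if (tld == NULL)
        abort();
    sim_init_tld(tld);
    return tld;
}
```

Each call to `sim_new_tld()` yields an independent copy, so `*(int *)SIM_TLD_ADDR(t0, quux) = 1` on one copy leaves the value seen through another copy at 0, while both start with `mcfoobar` equal to 999.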
When the 'tld section' approach is used (or the UPC executable will run as a single-threaded process), the UPC compiler will also invoke the backend C compiler on its output .c files. The resulting .o files seen by the user will be regular C object files, and the UPC linker wrapper will simply send them directly to the regular C linker. When the 'global struct' approach is in use, however, UPC .o files will actually be copies of the .c files output by the UPC compiler: all compilation by the back-end C compiler will be done at link time, since this is the only time that enough information is available to know the full layout of the global thread-local data structure.
The fact that all C compilation occurs at link time in the global struct approach does not mean, however, that every intermediate .c file in a UPC application needs to be recompiled every time the application is linked. The fact that users typically link an application with the same set of files repeatedly can be exploited by the UPC linker to avoid needless recompilations. Under this scheme, UPC .o files will only be recompiled when the .o file itself has been changed (presumably because the user has modified and recompiled its parent .upc file), or when the size or layout of the global tld struct has changed (in which case all the UPC .o files in the application will need to be recompiled).
This optimization is performed via the following steps:
Note: The 'hidden' build directory and all the files in it will not be deleted after the link is complete. Thus, users will need to explicitly delete the 'upc-build' directory in their 'make clean' commands--it will never be deleted automatically for them.
The question then becomes how to arrange to call all of these functions at startup, how to name them in such a way that they do not collide in the symbol namespace, and how to determine the order in which they are called. The Berkeley UPC compiler takes the following approach to these issues:
The UPCR_INIT_ functions must perform all needed initializations of both shared and thread-local variables. The specifics of how this is done have already been described above in the relevant sections on shared and thread-local data initialization. The order in which TLD/shared initializations are performed within the functions should not be important.
    nm *.o | grep UPCRI_ALLOC
    nm *.o | grep UPCRI_INIT

These commands will provide lists of all the allocation/initialization functions in the UPC program. The UPC linker script will use these lists to generate a small .c file with two functions, one called upcri_linkergenerated_alloc() that calls all of the allocation functions, and one called upcri_linkergenerated_init() that calls all of the initialization functions. The order in which particular allocation/initialization functions are called by these functions is not specified, except that it is guaranteed that the same ordering will be used for all threads/nodes. The .c file is then compiled and the resulting .o file linked with the rest of the user's objects into the final UPC executable. [Note: if the names for functions are constructed in a deterministic manner--such as via the hash of the file's full pathname mentioned above--it may be possible to only need to compile this small .c file when the list of files being compiled changes, rather than each time the application is linked. But this would presumably not save a great deal of time, and may not be worth the complexity].
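As a rough illustration of the generation step, the sketch below turns a list of function names into the text of a C function that calls each of them. It is not the actual linker script: `gen_caller` and `append` are hypothetical helpers, the per-file functions are assumed to take no arguments and return void, and sorting the names is merely one simple way to make the output independent of the order in which the names were listed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_names(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Appends one formatted piece (with a single %s) to buf; -1 if full. */
static int append(char *buf, size_t bufsz, size_t *used,
                  const char *fmt, const char *arg)
{
    int w;
    if (*used >= bufsz)
        return -1;
    w = snprintf(buf + *used, bufsz - *used, fmt, arg);
    if (w < 0 || (size_t)w >= bufsz - *used)
        return -1;
    *used += (size_t)w;
    return 0;
}

/* Emits C source for a function 'fn' that calls every name in 'names'
 * (sorted in place first). Returns the text length, or -1 on overflow. */
static int gen_caller(char *buf, size_t bufsz, const char *fn,
                      const char **names, size_t n)
{
    size_t used = 0, i;
    assert(buf != NULL && fn != NULL);
    qsort(names, n, sizeof names[0], cmp_names);
    for (i = 0; i < n; i++)
        if (append(buf, bufsz, &used, "extern void %s(void);\n", names[i]))
            return -1;
    if (append(buf, bufsz, &used, "void %s(void)\n{\n", fn))
        return -1;
    for (i = 0; i < n; i++)
        if (append(buf, bufsz, &used, "    %s();\n", names[i]))
            return -1;
    if (append(buf, bufsz, &used, "%s", "}\n"))
        return -1;
    return (int)used;
}
```

Calling `gen_caller(buf, sizeof buf, "upcri_linkergenerated_init", names, n)` with the names found by the second `nm` command would produce the body of the initialization dispatcher; the allocation dispatcher would be generated the same way from the first list.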