You can use a regular C debugger and get usable debugging support. Berkeley UPC provides several mechanisms for attaching a regular C debugger to one or more of your UPC application's threads at various points during execution. While this does not provide a fully normal debugging environment (the debugger will show the C code emitted by our translator, rather than your UPC code), it can still allow you to see program stack traces and other important information. This can be very useful if you wish to submit a helpful bug report to us.
Note: if you have a main() function in your UPC code, it will have been renamed user_main() after being translated to C code, and so you will need to refer to user_main in the debugger (to set a breakpoint, etc.). This is a special case: your other function names and variables generally have the same name in the C output as in your UPC code.
Make sure to build your UPC application with 'upcc -g' to turn on debugging. You can then tell your UPC application to busy-wait at startup at one of three different entry points by passing a upcrun option or setting an environment variable. While it waits, you can attach a debugger to one or more of your UPC processes and resume the program from within the debugger. The following table describes the three startup points, followed by several options that control what happens on a fatal error:
upcrun setting / environment variable | Behavior | To continue |
upcrun -freeze[=<thread_id>] -or- UPC_FREEZE=<thread_id> |
Setting this option will cause UPC thread thread_id to
busy-wait at startup, right before user_main()
is called.
All the UPC threads in your application are guaranteed to exist by this time. They will all print out their hostname and pid when the specified thread begins busy-waiting, then wait in a barrier for the specified thread to stop busy-waiting. This allows you to attach a debugger to one or more of your UPC processes before continuing. |
Once you have attached a debugger to the busy-waiting thread's process (and to any of the other UPC processes you are interested in debugging), set the 'bupc_frozen' variable in the busy-waiting process to 0 within your debugger. Resume the execution of the process, and all the UPC threads in the application will leave the barrier and enter user_main(). |
upcrun -freeze-early[=<node_id>] -or- UPC_FREEZE_EARLY=<node_id> |
This is similar to the option above, but it occurs earlier in the runtime
startup process. It is primarily of interest if you are trying to
debug the runtime's initialization process.
Since the busy-wait/barrier occurs before any pthreads are launched, the node_id value passed must indicate a node number rather than a UPC thread number (if you did not compile your program with --pthreads, the node and UPC thread numbers are the same). |
As with UPC_FREEZE, you must attach a debugger to the busy-waiting process and set 'bupc_frozen' to 0 before continuing the application. |
upcrun -freeze-earlier -or- GASNET_FREEZE=[anything] |
This option causes the application to be frozen as early as possible in the GASNet startup process. In many cases, the application may consist of a single process at this point. Any processes that do exist will print out their hostname and pid. | You must attach to the initial process(es) in the application, and set the 'gasnet_frozen' variable to 0 before proceeding. |
upcrun -abort -or- UPC_ABORT=1 |
If this option is set, any fatal errors in the UPC runtime will
cause the application to die by calling abort(), which will produce
a core file (assuming your system is set up to create core files:
try 'ulimit -c BigNum' if you don't see a core file). You can then
run a debugger on the core file to see where the application was at
the time of the error. For example,
    gdb a.out core
    backtrace |
You cannot continue the program when this option is used. Note that since gasnet_exit() is not called, on some systems the other processes in the job may not exit automatically, so you may need to kill them manually. |
upcrun -freeze-on-error -or- GASNET_FREEZE_ON_ERROR=1 |
If this option is set, most fatal errors will cause the relevant threads to stop, print a message, and wait for a debugger to attach. This can be a good way to track down intermittent problems, or issues that don't manifest when the entire process is run under the debugger (using the techniques above). However, because the program does not stop until the error is encountered, it may be too late to properly diagnose some problems. | Follow the instructions in the generated message to continue the frozen process. |
upcrun -backtrace -or- GASNET_BACKTRACE=1 |
If this option is set, most fatal errors cause the relevant threads to attempt to automatically attach a debugger and write stack backtrace information to stderr. This option is only supported on some platforms, and the quality of the auto-generated backtrace can be degraded by system-specific details such as how signals are handled. However, it is often the simplest way to get hints about what went wrong during execution, and in some cases that is enough to resolve the issue. | Once the backtrace is generated, the process should exit. Note that some types of program crashes may cause the backtrace code to hang, potentially creating zombie processes that will need to be manually killed. |
You may set more than one of these variables if you choose to, and they will independently freeze the program at their respective execution locations.
Here is an example of a debugging session on a Linux system using the GCC compiler, the GDB debugger, and the MPICH MPI library. On this system we use 'ssh' to log into the compute nodes.
    $ upcc -network=mpi -save-temps foo.upc
You will now see (among other files) a foo.trans.c file, which the GDB debugger will use.
    $ upcrun -np 4 -freeze=2 a.out
    ************************************************************************
    ************* Freezing UPC application for debugging *******************
    ************************************************************************
    Thread 2 (pid 7100 on pcp-c-28) is frozen: attach a debugger to it and set the
    'bupc_frozen' variable to 0 to continue.
     - To debug additional UPC threads, attach to them before unfreezing thread 2
     - Note: if you wish to set a breakpoint at 'main', use 'user_main' instead
    Thread 0 (pid 10655 on pcp-c-31) waiting for thread 2
    Thread 1 (pid 2379 on pcp-c-30) waiting for thread 2
    Thread 3 (pid 2698 on pcp-c-29) waiting for thread 2
Note how the processes in your UPC application all pause, each reporting its hostname and pid.
    $ ssh pcp-c-28
    pcp-c-28 $ cd testprog/
    pcp-c-28 $ gdb a.out 7100
    Attaching to program: /home/pcp1/jduell/testprog/a.out, process 7100
    [...]
    0x0804b4da in _upcri_startup_freeze (pargs=0xbffff4c0, freezethread=2) at upcr_init.c:170
    170         while (bupc_frozen == 1)
    (gdb)
    (gdb) break user_main
    Breakpoint 1 at 0x804a0be: file foo.upc.trans.c, line 33.
    (gdb) set bupc_frozen=0
    (gdb) continue
    Continuing.
    Breakpoint 1, user_main () at testprog.trans.c:45
You may encounter problems setting this variable if you built your application and/or the UPC runtime without debugging symbols (note that this will likely also reduce the accuracy of the information the debugger reports). In this situation, you can instead continue the process by sending a SIGCONT signal to the pid indicated in the startup message, using a different window on the same compute node:
    $ ssh pcp-c-28 kill -CONT 7100
Now you can start to step through your code.
    Program received signal SIGSEGV, Segmentation fault.
    0x0804a045 in get_millionth_element (array=0x80cf0a0) at testprog.trans.c:39
    39        return * (array + 999999LL);
Looking at the stack of function calls that got to your error is often quite informative:
    (gdb) bt
    #0  0x0804a045 in get_millionth_element (array=0x80cf0a0) at testprog.trans.c:39
    #1  0x0804a06e in user_main () at testprog.trans.c:49
    #2  0x0804b5db in upcri_perthread_main (p_args=0xbffff4c0) at upcr_init.c:228
    #3  0x0804b98d in upcr_run (pargc=0xbffff510, pargv=0xbffff514) at upcr_init.c:524
    #4  0x0804a13a in main (argc=1, argv=0x80d2e68) at a.out_startup_tmp.c:34
    #5  0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6
Use the 'frame' command to move between contexts, the 'list' command to view the code at each point, and 'print VAR' to print out the value of a variable or expression:
    (gdb) frame 1
    #1  0x0804a06e in user_main () at testprog.trans.c:49
    49      } /* user_main */
    (gdb) l
    46
    47        get_millionth_element((_INT32 *) & smallarray);
    48        return 0;
    49      } /* user_main */
    (gdb) p smallarray[0]
    $1 = 0
    (gdb) p smallarray[999999]
    Cannot access memory at address 0x849f99c
Hmm, looks like I shouldn't have passed 'smallarray' to my get_millionth_element() function. Well, I never was much of an applications developer anyway...
If your bug is more mysterious, just cut and paste the output of your stack trace (and whatever other helpful info you may have collected) into the main form of a new bug report. After you add the bug, please go back to it and attach your source files (as a tarball if there are lots of them: please don't send extremely large tarballs with .o files and core dumps, etc.).