Berkeley UPC programs can be debugged (with support for UPC-specific constructs) by the "Totalview" debugger produced by Etnus. See our tutorial on using Berkeley UPC with Totalview for details.
If you do not have Totalview, you can also use a regular C debugger and get partial debugging support. Berkeley UPC provides several mechanisms for attaching a regular C debugger to one or more of your UPC application's threads at various points during execution. While this does not provide a fully normal debugging environment (the debugger will show the C code emitted by our translator, rather than your UPC code), it can still allow you to see program stack traces and other important information. This can be very useful if you wish to submit a helpful bug report to us.
Note: if you have a main() function in your UPC code, it will have been renamed user_main() after being translated to C code, and so you will need to refer to user_main in the debugger (to set a breakpoint, etc.). This is a special case: your other function names and variables generally have the same name in the C output as in your UPC code.
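As a rough illustration of why the breakpoint belongs on user_main, the plain C sketch below mimics a runtime-owned startup path that does its own setup and then calls the renamed user function. Only the name user_main comes from the text above; demo_runtime_run and the no-argument signature are hypothetical, chosen to keep the example small, and are not the actual Berkeley UPC runtime interface.

```c
#include <stdio.h>

/* Stand-in for what the translator emits for the user's main():
 * the body is unchanged, only the name differs. */
int user_main(void) {
    printf("hello from user code\n");
    return 0;
}

/* Hypothetical stand-in for the runtime's startup path. The real
 * runtime owns main(), performs its initialization, and only then
 * calls into the user's code. */
int demo_runtime_run(int (*entry)(void)) {
    /* runtime initialization would happen here */
    return entry();
}
```

In a layout like this, a breakpoint on main would stop in the startup code rather than in your program, which is why 'break user_main' is the useful command.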
If you configured your UPC runtime to support debugging (i.e., './configure --enable-debug' was used at configuration time), you can tell your UPC application to busy-wait at startup at one of three different entry points by setting an environment variable. You can then attach a debugger to one or more of your UPC processes, and resume the program within the debugger. The following table describes the different startup points:
| Environment variable | Behavior | To continue |
|---|---|---|
| UPC_FREEZE | Setting UPC_FREEZE will cause UPC thread <THREAD_NUMBER> to busy-wait at startup, right before user_main() is called. All the UPC threads in your application are guaranteed to exist by this time. They will all print out their hostname and pid when the specified thread begins busy-waiting, then wait in a barrier for the specified thread to stop busy-waiting. This allows you to attach a debugger to one or more of your UPC processes before continuing. | Once you have attached a debugger to the busy-waiting thread's process (and to any of the other UPC processes you are interested in debugging), set the 'bupc_frozen' variable in the busy-waiting process to 0 within your debugger. Resume the execution of the process, and all the UPC threads in the application will leave the barrier and enter user_main(). |
| UPC_FREEZE_EARLY | This is similar to UPC_FREEZE, but it occurs earlier in the runtime startup process. It is primarily of interest if you are trying to debug the runtime's initialization process. Since the busy-wait/barrier occurs before any pthreads are launched, you must set UPC_FREEZE_EARLY to a node number rather than a UPC thread number (if you did not compile your program with --pthreads, the node and UPC thread numbers are the same). | As with UPC_FREEZE, you must attach a debugger to the busy-waiting process and set 'bupc_frozen' to 0 before continuing the application. |
| GASNET_FREEZE | This option causes the application to be frozen as early as possible in the GASNet startup process. In many cases, the application may consist of a single process at this point. Any processes that do exist will print out their hostname and pid. | You must attach to the initial process(es) in the application, and set the 'gasnet_frozen' variable to 0 before proceeding. |
| UPC_ABORT | If set to '1' or 'yes', any fatal errors in the UPC runtime will cause the application to die by calling abort(), which will produce a core file (assuming your system is set up to create core files: try 'ulimit -c BigNum' if you don't see a core file). You can then run a debugger on the core file to see where the application was at the time of the error. For example, run 'gdb a.out core' and then enter 'backtrace' at the gdb prompt. | You cannot continue the program when UPC_ABORT is used. Note that since gasnet_exit() is not called, on some systems the other processes in the job may not exit automatically, so you may need to kill them manually. |
You may set more than one of these variables if you choose to, and they will independently freeze the program at their respective execution locations.
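The freeze mechanism itself is simple enough to sketch in a few lines of plain C. The version below is only an illustration under stated assumptions: demo_frozen stands in for the runtime's 'bupc_frozen' flag, demo_parse_freeze and demo_freeze_if_requested are hypothetical helper names, and the one-second sleep in the loop is a guess at a reasonable polling interval, not the runtime's actual behavior.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for 'bupc_frozen'. It is volatile so that the
 * loop re-reads it from memory and notices a write made by a debugger. */
volatile int demo_frozen = 0;

/* Parse the value of a freeze variable (for example the result of
 * getenv("UPC_FREEZE")). Returns the thread number, or -1 if the
 * variable is unset, empty, negative, or not a plain integer. */
int demo_parse_freeze(const char *value) {
    char *end;
    long n;

    if (value == NULL || *value == '\0')
        return -1;
    n = strtol(value, &end, 10);
    if (*end != '\0' || n < 0 || n > INT_MAX)
        return -1;
    return (int)n;
}

/* If this thread is the one selected, print where it is running and
 * spin until a debugger sets demo_frozen to 0. Returns 1 if the
 * thread froze, 0 if it was not the selected thread. */
int demo_freeze_if_requested(int mythread, int freezethread) {
    if (mythread != freezethread)
        return 0;
    demo_frozen = 1;
    fprintf(stderr,
            "Thread %d (pid %ld) is frozen: attach a debugger and set "
            "'demo_frozen' to 0 to continue.\n",
            mythread, (long)getpid());
    while (demo_frozen == 1)
        sleep(1); /* assumed polling interval */
    return 1;
}
```

The point of the design is that the loop condition depends only on a global variable, so any debugger that can write to process memory can release the thread without needing special support from the program.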
Here is an example of a debugging session on a Linux system using a GCC compiler, GDB debugger, and the MPICH MPI library. On this system we use 'ssh' to log into the compute nodes.
$upcc -network=mpi -save-temps foo.upc
You will now see (among other files) a foo.trans.c file, which
the GDB debugger will use.
$UPC_FREEZE=2; export UPC_FREEZE // for sh, ksh, bash, etc.
or
$setenv UPC_FREEZE 2 // for csh, tcsh, etc.
$mpirun -np 4 a.out
************************************************************************
************* Freezing UPC application for debugging *******************
************************************************************************
Thread 2 (pid 7100 on pcp-c-28) is frozen: attach a debugger to it and set
the 'bupc_frozen' variable to 0 to continue.
- To debug additional UPC threads, attach to them before unfreezing thread 2
- Note: if you wish to set a breakpoint at 'main', use 'user_main' instead
Thread 0 (pid 10655 on pcp-c-31) waiting for thread 2
Thread 1 (pid 2379 on pcp-c-30) waiting for thread 2
Thread 3 (pid 2698 on pcp-c-29) waiting for thread 2
Note that all the processes in your UPC application pause, and each one prints its thread number, pid, and hostname so that you know where to attach.
$ssh pcp-c-28
pcp-c-28 $cd testprog/
pcp-c-28 $gdb a.out 7100
Attaching to program: /home/pcp1/jduell/testprog/a.out, process 7100
[...]
0x0804b4da in _upcri_startup_freeze (pargs=0xbffff4c0, freezethread=2) at upcr_init.c:170
170 while (bupc_frozen == 1)
(gdb)
(gdb) break user_main
Breakpoint 1 at 0x804a0be: file foo.upc.trans.c, line 33.
(gdb) set bupc_frozen=0
(gdb) continue
Continuing.
Breakpoint 1, user_main () at testprog.trans.c:45
Now you can start to step through your code.
Program received signal SIGSEGV, Segmentation fault.
0x0804a045 in get_millionth_element (array=0x80cf0a0) at testprog.trans.c:39
39 return * (array + 999999LL);
Looking at the stack of function calls that got to your error is often quite informative:
(gdb) bt
#0 0x0804a045 in get_millionth_element (array=0x80cf0a0) at testprog.trans.c:39
#1 0x0804a06e in user_main () at testprog.trans.c:49
#2 0x0804b5db in upcri_perthread_main (p_args=0xbffff4c0) at upcr_init.c:228
#3 0x0804b98d in upcr_run (pargc=0xbffff510, pargv=0xbffff514) at upcr_init.c:524
#4 0x0804a13a in main (argc=1, argv=0x80d2e68) at a.out_startup_tmp.c:34
#5 0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6
Use the 'frame' command to move between contexts, the 'list' command to view the code at each point, and 'print VAR' to print out the value of a variable or expression:
(gdb) frame 1
#1 0x0804a06e in user_main () at testprog.trans.c:49
49 } /* user_main */
(gdb) l
46
47 get_millionth_element((_INT32 *) & smallarray);
48 return 0;
49 } /* user_main */
(gdb) p smallarray[0]
$1 = 0
(gdb) p smallarray[999999]
Cannot access memory at address 0x849f99c
Hmm, looks like I shouldn't have passed 'smallarray' to my
get_millionth_element() function. Well, I never was much of an
applications developer anyway...
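One way to guard against this kind of out-of-bounds read is to pass the array length along with the pointer and check the index before dereferencing. The following is a minimal sketch in plain C; get_element_checked is a hypothetical helper written for this example and is not part of the test program above or of Berkeley UPC.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy array[index] into *out if the index is within bounds.
 * Returns 0 on success and -1 if the arguments are invalid or the
 * index falls outside the array. */
int get_element_checked(const int32_t *array, size_t len,
                        size_t index, int32_t *out) {
    if (array == NULL || out == NULL || index >= len)
        return -1;
    *out = array[index];
    return 0;
}
```

With a helper like this, asking for element 999999 of a small array returns an error code instead of raising SIGSEGV, and the caller decides how to report it.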
If your bug is more mysterious, just cut and paste the output of your stack trace (and whatever other helpful info you may have collected) into the main form of a new bug report. After you add the bug, please go back to it and attach your source files (as a tarball if there are lots of them: please don't send extremely large tarballs with .o files and core dumps, etc.).