Debugging Berkeley UPC applications

You can use a regular C debugger and get usable debugging support. Berkeley UPC provides several mechanisms for attaching a regular C debugger to one or more of your UPC application's threads at various points during execution. While this does not provide a fully normal debugging environment (the debugger will show the C code emitted by our translator, rather than your UPC code), it can still allow you to see program stack traces and other important information. This can be very useful if you wish to submit a helpful bug report to us.

Note: if you have a main() function in your UPC code, it will have been renamed user_main() after being translated to C code, and so you will need to refer to user_main in the debugger (to set a breakpoint, etc.). This is a special case: your other function names and variables generally have the same name in the C output as in your UPC code.
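To illustrate why the rename matters, the following minimal C sketch mimics the general shape of the call chain: a runtime-owned entry point runs first and only later reaches the user's code as user_main(). Only the name user_main comes from the translated output. Its signature here, and the functions demo_perthread_main and demo_runtime_main, are hypothetical stand-ins for illustration, not the runtime's actual code.

```c
#include <stdio.h>

/* What the UPC program's main() becomes in the generated C code.
 * The (void) signature is an assumption made for this sketch. */
int user_main(void)
{
    printf("hello from user code\n");
    return 0;
}

/* Hypothetical stand-in for per-thread setup that runs before user code. */
int demo_perthread_main(void)
{
    /* ... a real runtime would initialize its per-thread state here ... */
    return user_main();
}

/* Hypothetical stand-in for the runtime-owned entry point. In a real build
 * this role is played by the C main() supplied by the toolchain, so a
 * breakpoint on 'main' stops here rather than in the user's code. */
int demo_runtime_main(int argc, char **argv)
{
    (void)argc;
    (void)argv;
    return demo_perthread_main();
}
```

In a layout like this, a breakpoint on user_main is the one that stops at the first line of the user's program.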

How it works

Build your UPC application with 'upcc -g' to turn on debugging. You can then tell the application to busy-wait at startup at one of three different entry points, either by passing a upcrun option or by setting an environment variable. While it waits, attach a debugger to one or more of your UPC processes and resume the program from within the debugger. The startup points, along with related options for handling fatal errors, are described below:

Each entry gives the upcrun setting or environment variable, its behavior, and how to continue.

upcrun -freeze[=<thread_id>]
-or-
UPC_FREEZE=<thread_id>

  Behavior: Setting this option will cause UPC thread thread_id to busy-wait at startup, right before user_main() is called.

  All the UPC threads in your application are guaranteed to exist by this time. They will all print out their hostname and pid when the specified thread begins busy-waiting, then wait in a barrier for the specified thread to stop busy-waiting. This allows you to attach a debugger to one or more of your UPC processes before continuing.

  To continue: Once you have attached a debugger to the busy-waiting thread's process (and to any of the other UPC processes you are interested in debugging), set the 'bupc_frozen' variable in the busy-waiting process to 0 within your debugger. Resume the execution of the process, and all the UPC threads in the application will leave the barrier and enter user_main().

upcrun -freeze-early[=<node_id>]
-or-
UPC_FREEZE_EARLY=<node_id>

  Behavior: This is similar to the option above, but it occurs earlier in the runtime startup process. It is primarily of interest if you are trying to debug the runtime's initialization process.

  Since the busy-wait/barrier occurs before any pthreads are launched, the node_id value passed must indicate a node number rather than a UPC thread number (if you did not compile your program with --pthreads, the node and UPC thread numbers are the same).

  To continue: As with UPC_FREEZE, you must attach a debugger to the busy-waiting process and set 'bupc_frozen' to 0 before continuing the application.

upcrun -freeze-earlier
-or-
GASNET_FREEZE=[anything]

  Behavior: This option causes the application to be frozen as early as possible in the GASNet startup process. In many cases, the application may consist of a single process at this point. Any processes that do exist will print out their hostname and pid.

  To continue: You must attach to the initial process(es) in the application, and set the 'gasnet_frozen' variable to 0 before proceeding.

upcrun -abort
-or-
UPC_ABORT=1

  Behavior: If this option is set, any fatal errors in the UPC runtime will cause the application to die by calling abort(), which will produce a core file (assuming your system is set up to create core files: try 'ulimit -c BigNum' if you don't see a core file). You can then run a debugger on the core file to see where the application was at the time of the error. For example,
      gdb a.out core
      backtrace

  To continue: You cannot continue the program when this option is used. Note that since gasnet_exit() is not called, on some systems the other processes in the job may not exit automatically, so you may need to kill them manually.

upcrun -freeze-on-error
-or-
GASNET_FREEZE_ON_ERROR=1

  Behavior: If this option is set, most fatal errors will cause the relevant threads to stop, print a message, and wait for a debugger to attach. This can be a good way to track down intermittent problems or issues that don't manifest when the entire process is run under the debugger (using the techniques above). However, because it does not stop until the error is encountered, it may be too late to properly diagnose some problems.

  To continue: Follow the instructions in the generated message to continue the frozen process.

upcrun -backtrace
-or-
GASNET_BACKTRACE=1

  Behavior: If this option is set, most fatal errors cause the relevant threads to attempt to automatically attach a debugger and write stack backtrace information to stderr. This option is only supported on some platforms, and the quality of the auto-generated backtrace can be negatively affected by system-specific details such as how signals are handled. However, it is often the simplest way to get hints about what went wrong in the execution, which in some cases may be sufficient to resolve the issue.

  To continue: Once the backtrace is generated, the process should exit. Note that some types of program crashes may cause the backtrace code to hang, potentially creating zombie processes that will need to be manually killed.

You may set more than one of these variables if you choose to, and they will independently freeze the program at their respective execution locations.
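To make the busy-wait mechanism more concrete, here is a minimal C sketch of a process that announces its hostname and pid and then spins on a flag until a debugger (or a SIGCONT signal) clears it. It is an illustration under stated assumptions, not the runtime's implementation: demo_frozen, demo_freeze_begin, and demo_freeze_wait are hypothetical names standing in for the role that 'bupc_frozen' plays.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical flag playing the role of 'bupc_frozen'. It is volatile so
 * the loop re-reads it after a debugger or signal handler changes it. */
volatile sig_atomic_t demo_frozen = 0;

/* SIGCONT handler: releases the loop without needing a debugger. */
static void demo_on_sigcont(int signo)
{
    (void)signo;
    demo_frozen = 0;
}

/* Mark the process as frozen and tell the user where to attach. */
void demo_freeze_begin(void)
{
    char host[256];

    if (gethostname(host, sizeof host) != 0)
        snprintf(host, sizeof host, "unknown-host");
    host[sizeof host - 1] = '\0';   /* gethostname may not terminate */

    demo_frozen = 1;
    signal(SIGCONT, demo_on_sigcont);
    fprintf(stderr, "pid %ld on %s is frozen: attach a debugger and set "
                    "'demo_frozen' to 0 to continue\n",
            (long)getpid(), host);
}

/* Poll until the flag is cleared; returns how many times it had to wait. */
long demo_freeze_wait(void)
{
    long polls = 0;

    while (demo_frozen) {
        polls++;
        sleep(1);   /* returns early if a handled signal arrives */
    }
    return polls;
}
```

With a process waiting in demo_freeze_wait(), attaching gdb and running 'set var demo_frozen=0' followed by 'continue' would release it, and so would 'kill -CONT <pid>' from a shell.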

Attaching a debugger to a UPC process

The way you attach a debugger to one of your UPC processes varies from machine to machine (and from debugger to debugger). The usual approach is to log into the node that the process is running on, and then start the debugger with flags that tell it to attach to the process you are interested in.

An example using GCC/GDB and TCP MPICH

Here is an example of a debugging session on a Linux system using a GCC compiler, GDB debugger, and the MPICH MPI library. On this system we use 'ssh' to log into the compute nodes.

  1. GDB relies on having the C source files available in order to show the code that is being executed. If you want to step through specific lines of code, pass the -save-temps flag to upcc when you compile your UPC application, which will save the '*.trans.c' files for your application. (This is not necessary if you are just trying to get a stack trace from a crashed program for a bug report):
       
            $ upcc -g -network=mpi -save-temps foo.upc
        
    You will now see (among other files) a foo.trans.c file, which the GDB debugger will use.

  2. Let's pretend that you've already noticed that it's always UPC thread 2 that crashes, and so it's the thread you are interested in debugging. To have it be the thread that busy-waits, we'll pass it to upcrun -freeze. If you're uncertain which thread is crashing, you can optionally attach debuggers to all of them before continuing the debugging session.

  3. Start the UPC application (on our system you need to first create an interactive PBS session with the 'qsub' command: see your system's documentation for details on running a parallel job):
        $ upcrun -np 4 -freeze=2 a.out
        ************************************************************************
        ************* Freezing UPC application for debugging *******************
        ************************************************************************
        Thread 2 (pid 7100 on pcp-c-28) is frozen: attach a debugger to it and set
         the 'bupc_frozen' variable to 0 to continue.
         - To debug additional UPC threads, attach to them before unfreezing thread 2
         - Note: if you wish to set a breakpoint at 'main', use 'user_main' instead
        Thread 0 (pid 10655 on pcp-c-31) waiting for thread 2
        Thread 1 (pid 2379 on pcp-c-30) waiting for thread 2
        Thread 3 (pid 2698 on pcp-c-29) waiting for thread 2
        

    Note how the processes in your UPC application all pause, with information on their location.

  4. Open a separate shell on the same host as thread 2's process, move to the directory where the executable resides, and run gdb with arguments that attach it to thread 2's PID:
        $ ssh pcp-c-28
        pcp-c-28 $ cd testprog/
        pcp-c-28 $ gdb a.out 7100
        Attaching to program: /home/pcp1/jduell/testprog/a.out, process 7100
        [...]
        0x0804b4da in _upcri_startup_freeze (pargs=0xbffff4c0, freezethread=2) at upcr_init.c:170
        170             while (bupc_frozen == 1)
        (gdb)
        

  5. Now that you have attached the debugger, you can perform any regular debugger task, such as printing a stack trace or setting a breakpoint. For instance, you could set a breakpoint at your UPC program's 'main' function. Remember that main() is renamed to user_main() in the generated C code, while most other functions keep their names:
            (gdb) break user_main
            Breakpoint 1 at 0x804a0be: file foo.upc.trans.c, line 33.
        

  6. Set the application's bupc_frozen variable to 0, and your UPC application can continue:
            (gdb) set bupc_frozen=0
            (gdb) continue
            Continuing.
            Breakpoint 1, user_main () at testprog.trans.c:45
        
    You may encounter problems setting this variable if you built your application and/or the UPC runtime without debugging symbols (which will likely also reduce the accuracy of the information reported by the debugger). In this situation, you can instead continue the process by sending a SIGCONT signal to the pid indicated in the startup message, using a different window on the same compute node:
           $ ssh pcp-c-28 kill -CONT 7100
        
    Now you can start to step through your code.

  7. If your application crashes, gdb should tell you the line of offending code:
        Program received signal SIGSEGV, Segmentation fault.
        0x0804a045 in get_millionth_element (array=0x80cf0a0) at testprog.trans.c:39
        39        return * (array + 999999LL);
        

    Looking at the stack of function calls that got to your error is often quite informative:

        (gdb) bt
        #0  0x0804a045 in get_millionth_element (array=0x80cf0a0) at testprog.trans.c:39
        #1  0x0804a06e in user_main () at testprog.trans.c:49
        #2  0x0804b5db in upcri_perthread_main (p_args=0xbffff4c0) at upcr_init.c:228
        #3  0x0804b98d in upcr_run (pargc=0xbffff510, pargv=0xbffff514) at upcr_init.c:524
        #4  0x0804a13a in main (argc=1, argv=0x80d2e68) at a.out_startup_tmp.c:34
        #5  0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6
        

    Use the 'frame' command to move between contexts, the 'list' command to view the code at each point, and 'print VAR' to print the value of a variable or expression:

        (gdb) frame 1
        #1  0x0804a06e in user_main () at testprog.trans.c:49
        49      } /* user_main */
        (gdb) l
        46        
        47        get_millionth_element((_INT32 *) & smallarray);
        48        return 0;
        49      } /* user_main */
        (gdb) p smallarray[0]
        $1 = 0
        (gdb) p smallarray[999999]
        Cannot access memory at address 0x849f99c
        
    Hmm, looks like I shouldn't have passed 'smallarray' to my get_millionth_element() function. Well, I never was much of an applications developer anyway...

    If your bug is more mysterious, just cut and paste the output of your stack trace (and whatever other helpful info you may have collected) into the main form of a new bug report. After you add the bug, please go back to it and attach your source files (as a tarball if there are lots of them: please don't send extremely large tarballs with .o files and core dumps, etc.).
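Returning to the smallarray crash from step 7: the fault comes from reading element 999999 of an array that is far smaller. The sketch below is a hypothetical reconstruction in plain C with a bounds check added, so the mistake is reported instead of faulting. The original UPC source is not shown here, so the array length DEMO_SMALL_LEN and the functions demo_get_element and demo_try_millionth are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_SMALL_LEN 10   /* assumed size; the real smallarray size is not shown */

/* Bounds-checked read: stores array[index] in *out and returns 0,
 * or returns -1 if the arguments are invalid or index is out of range. */
int demo_get_element(const int32_t *array, size_t len, size_t index, int32_t *out)
{
    if (array == NULL || out == NULL || index >= len)
        return -1;
    *out = array[index];
    return 0;
}

/* Mirrors the failing call: asks a small array for element 999999. */
int demo_try_millionth(void)
{
    static int32_t smallarray[DEMO_SMALL_LEN];   /* zero-initialized */
    int32_t value;

    if (demo_get_element(smallarray, DEMO_SMALL_LEN, 999999, &value) != 0) {
        fprintf(stderr, "index 999999 is out of range for %d elements\n",
                DEMO_SMALL_LEN);
        return -1;
    }
    return 0;
}
```

Passing the length alongside the pointer is one common way to catch this kind of error before it turns into a segmentation fault.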