
Debugging on Mercury

Overview

Debugging can be frustrating; people don't typically use debugging tools because they're fun. Look over the tools and their descriptions in the table below while considering questions like: Does the problem happen with every execution, or only when scaling up parameters? Is the program source code available? No debugging tool is perfect. If an approach or tool shown here doesn't yield results, try another one with similar capabilities, or consider a tool that analyzes your program from a different perspective. For example, a Totalview session combined with the output from the source code analyzer "ftnchek" may help pinpoint a bug. Before diving into a debugger, there are a few things that can be done quickly.

1) If a specific error message is produced with a popular application or community code, try looking for it in the FAQ for the package or do an internet search with the application name and the error.

2) On Mercury, core files cannot be created on the interactive login hosts tg-loginN.ncsa.teragrid.org. If a core file was created during a batch job, the serial debuggers may quickly point to the problem area with a command similar to one of those shown here [in these examples, the application is named a.out]:

gdb a.out core          # command line gdb
idb a.out core          # command line idb
ddd a.out core          # ddd graphical debugger
idb -gui a.out core     # idb graphical debugger

3) If the program is written in Fortran, try recompiling with the Intel Fortran compiler flags "-check bounds -traceback -g", then run the program again.

Most of the available tools provide more information if your code was compiled with the -g flag [and with -O3 or higher optimizations disabled], so it's a good idea to rebuild your code with that flag before using a debugger. Higher levels of optimization can cause a debugger to report incorrect values and locations for variables.
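When a batch job leaves a core file, the check from step 2 can also be run without an interactive session. The sketch below wraps it in a small shell function. The name core_backtrace is hypothetical (it is not a tool installed on Mercury), the default file names a.out and core follow the examples above, and it assumes a gdb version that accepts the -batch and -ex options.

```shell
# core_backtrace is a hypothetical helper, not a tool installed on Mercury.
# Usage: core_backtrace [executable] [core-file]
core_backtrace() {
    cb_exe=${1:-a.out}
    cb_core=${2:-core}
    if [ ! -f "$cb_core" ]; then
        echo "no core file at $cb_core" >&2
        return 1
    fi
    # -batch exits when the commands finish; -ex bt prints the call stack
    # recorded in the core file.
    gdb -batch -ex bt "$cb_exe" "$cb_core"
}
```

Called at the end of a batch script, for example as "core_backtrace a.out core > backtrace.txt", this would leave the stack trace in a file next to the job output. The trace is generally easier to read when the program was built with -g.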

Each entry below gives the debugger or tool, its category, a description, whether it applies to serial, OpenMP/threaded, or parallel programs, and its strengths and limitations.

gdb [classic serial debuggers]
Description: The GNU gdb debugger is available and may be used with serial programs or with core files from serial or parallel programs. For information on using gdb, see the online man page and the gdb user manual.
Applies to: serial or OpenMP/threaded
Strengths: best debugger for GNU compilers; can attach to running processes
Limitations: sometimes has difficulty with Intel-compiled code

ddd [classic serial debuggers]
Description: The ddd graphical interface for gdb is available on the login nodes. ddd can be used with C, C++, Fortran, and Perl source code. See the ddd user guide for more information and examples.
Applies to: serial or OpenMP/threaded
Strengths: intuitive graphical interface; clicking on a variable in ddd displays its value
Limitations: requires X Windows; the interface can be slow to draw at sites far from NCSA

idb [classic serial debuggers]
Description: The Intel debugger idb is installed along with the Intel compilers. For more information on the Intel debugger, see the idb man page. idb is similar to gdb in operation and recognizes most gdb commands if started with the -gdb flag. idb works well with C, C++, and Fortran codes.
Applies to: serial or OpenMP/threaded
Strengths: best debugger for Intel-compiled code; can debug Fortran 77, 90, and 95; can attach to running processes
Limitations: the default interface is dbx; use "idb -gdb" for the gdb-compatible interface

idb -gui [classic serial debuggers]
Description: idb provides a graphical interface when started with the -gui flag [example idb -gui screenshot].
Applies to: serial or OpenMP/threaded
Strengths: intuitive graphical interface; the interface has fairly low overhead for X and draws quickly
Limitations: requires X Windows

Totalview [classic serial debuggers] [parallel debuggers]
Description: The Totalview debugger [with graphical user interface] works with the supported MPI environment. It is our recommended debugger for MPI code. Information on starting Totalview on Mercury (and other NCSA platforms) can be found here. Totalview documentation can be found on the Etnus website here.
Applies to: parallel, serial, or OpenMP/threaded
Strengths: intuitive graphical interface and debugger for use with the default MPI environment; can attach to running processes
Limitations: requires X Windows; can debug up to 128 MPI ranks with our license

Electric Fence [memory allocation debuggers]
Description: The Electric Fence malloc debugger is installed on the login hosts. It can debug malloc() and pointer-related bugs in C/C++ code. Electric Fence is not straightforward to use with the default MPI environment, but it can be used with any of the mpich-tcp MPI environments available via softenv. Electric Fence slows execution and should only be used when debugging.
Applies to: serial, OpenMP/threaded, or parallel
Strengths: can run in batch mode
Limitations: can slow execution; potentially verbose output; C/C++ only, no Fortran; may require relink

MALLOC_CHECK_ [memory allocation debuggers]
Description: C/C++ programs using malloc(), calloc(), or realloc() can set the MALLOC_CHECK_ environment variable. From "man malloc": If MALLOC_CHECK_ is set to 0, any detected heap corruption is silently ignored; if set to 1, a diagnostic is printed on stderr; if set to 2, abort() is called immediately.
Applies to: serial, OpenMP/threaded, or parallel
Strengths: can run in batch mode; environment setting, recompile/relink not required
Limitations: C/C++ only, not Fortran; can slow performance, so leave MALLOC_CHECK_ unset for production runs

Marmot MPI check libraries [MPI specific debugging tools]
Description: These libraries can be linked with your program to provide runtime checks for common MPI programming problems and MPI deadlock detection. This debugging aid can scale with your MPI application to the maximum number of processes you can employ. It's a good option when bugs appear only when running at scale.
Applies to: parallel
Strengths: can run in batch mode; can find MPI deadlocks; scales with MPI
Limitations: may not find some bugs; recompile/relink required

MPICH2 & gdb [MPI specific debugging tools]
Description: The MPICH2 MPI environment is installed and supports a text mode gdb interface. This debugging setup has been tested at large scale [> 200 processes] and is the only interactive debugging option available for large scale runs.
Applies to: parallel
Strengths: scales with MPI; the text interface to gdb scales well
Limitations: not the default MPI [recompile/relink and porting to MPICH2 required]; uses TCP/IP over Ethernet

floating point exceptions [general purpose tools and techniques]
Description: The techniques for trapping floating point exceptions [which are not always bugs] vary by compiler and operating system. Since each case is a little different, see the link at left for the examples that match your situation. Fortran compilers tend to have a more straightforward approach to floating point exceptions than C compilers.
Applies to: serial, OpenMP/threaded, or parallel
Strengths: scales with MPI; useful with batch mode; minimal performance impact for most codes
Limitations: floating point exceptions are not necessarily bugs; recompile/relink required

source code analysis tools [general purpose tools and techniques]
Description: Splint [for C code] and ftnchek [Fortran] are available. While not true debuggers, source code analysis tools can be very helpful when trying to track down a program bug. They can also help you write clean, maintainable code by providing useful feedback about coding style, unused variables, non-portable practices, and so on. Don't be alarmed by the number of warnings these tools generate; they're designed to detect a great variety of potential problems.
Applies to: serial, OpenMP/threaded, or parallel
Strengths: can pinpoint problem areas of source code; can search for problems without running code; ftnchek can generate a call graph
Limitations: source code required

strace [general purpose tools and techniques]
Description: strace produces a system call trace for any program you can run. Each line in the trace contains the system call name as used by your program, followed by its arguments in parentheses and its return value. If you know what sort of system call may be failing, strace can be quite powerful. For more information, see: man strace
Applies to: serial, OpenMP/threaded, or parallel
Strengths: scales with MPI; can be used without source code; guaranteed to produce some output; can attach to running processes; the -c option can profile your code's system calls; recompile/relink not required
Limitations: extremely verbose output

ltrace [general purpose tools and techniques]
Description: ltrace produces a library call trace for any program you can run. Each line in the trace contains the library call name as used by your program, followed by its arguments in parentheses and its return value. If you know what sort of library call may be failing, ltrace can be quite powerful. For more information, see: man ltrace
Applies to: serial, OpenMP/threaded, or parallel
Strengths: scales with MPI; can be used without source code; guaranteed to produce some output; can attach to running processes; the -c option can profile your code's library calls; recompile/relink not required
Limitations: extremely verbose output; can dramatically slow program execution

ssh_pbs.pl and gdbwhere.pl [general purpose tools and techniques]
Description: ssh_pbs.pl and ~consult/debug/gdbwhere.pl can be combined to get a gdb backtrace from processes in a running job. See also: ssh_pbs.pl -help ; cat ~consult/debug/gdbwhere.pl
Applies to: serial, OpenMP/threaded, or parallel
Strengths: scales with MPI; can be used without source code; guaranteed to produce some output; can attach to running processes; recompile/relink not required, but recommended if you forgot "-g"
Limitations: possibly verbose output [redirect to a file]
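
Because MALLOC_CHECK_ is only an environment setting, it can be applied to a single run and left unset everywhere else. A minimal sketch follows. The name run_with_malloc_check is hypothetical, and the levels 0, 1, and 2 are the ones quoted from "man malloc" in the table above.

```shell
# run_with_malloc_check is a hypothetical helper for illustration.
# Usage: run_with_malloc_check LEVEL command [args...]
run_with_malloc_check() {
    mc_level=$1
    case "$mc_level" in
        0|1|2) ;;
        *)
            echo "MALLOC_CHECK_ level must be 0, 1, or 2" >&2
            return 2
            ;;
    esac
    shift
    if [ $# -eq 0 ]; then
        echo "usage: run_with_malloc_check LEVEL command [args...]" >&2
        return 2
    fi
    # The assignment applies to this one command only; the calling shell's
    # environment is left unchanged.
    MALLOC_CHECK_=$mc_level "$@"
}
```

For example, "run_with_malloc_check 2 ./a.out" in a batch script asks for abort() at the first detected heap corruption, while later commands in the same script run without the extra checking.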



References