National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
NERSC Tutorials

Debugging Tutorial

While programming might be thought of as the process of developing algorithms and implementing them in a particular programming language, any experienced programmer will tell you most of your time won't be spent writing the program; it will be spent debugging the code and verifying the results. Although there exists documentation on how to use specific debugging utilities, there is little in the way of general debugging strategies or how best to implement a particular debugging tool.

This tutorial gives some general advice about debugging a large code for scientific applications. It will focus on using Fortran 90, but much will apply to any debugging problem. It will also address specific issues of concern to NERSC users such as MPI errors and debugging utilities available on NERSC machines.

Good Programming Practices

The best debugging strategy is to use good programming practices, since they are the key to easy debugging.

It is also a good idea to align complicated equations so that syntax and typing errors are easy to catch. For example, both of these statements have the same typo (az where az1 was intended), but it is readily apparent only in the aligned one:

example =   ( ax1*by1 - ay1*bx1 )  &
          - ( ax1*bz1 - az*bx1  )  &
          + ( ay1*bz1 - az1*by1 )   


example=(ax1*by1-ay1*bx1)-(ax1*bz1-az*bx1)+(ay1*bz1-az1*by1)

These kinds of typos can be difficult to track down, because if the variable az exists, the compiler won't catch the error even when you use IMPLICIT NONE. You have to wait until wrong answers show up in your results to realize you made the typo. If the variable az is close enough in value to az1, that may take a while.

Another helpful practice that many people neglect is using the make utility. There is a good NERSC make tutorial which gives a general introduction to make. Advanced programmers aren't the only ones who should use make; it is also useful for smaller codes that have only a few source files. For instance, it is an excellent way to keep track of the various compiler options used.

Compiler Errors

Compiler errors are by far the easiest to fix. The compiler usually gives you an unambiguous description of the error and a specific line number on which the error occurred. The only disconcerting thing is that often a single syntax error will generate literally hundreds of other compiler errors. This gives the impression the code is riddled with bugs when only a single typo is the problem.

Error Messages

IBM SP

For dealing with xlf problems, see the xlf online User's Guide.

What Did You Change?

Compiler errors commonly crop up during code development, when many modifications are being made in a short period of time. The best way to keep good records of code modifications is to use revision control. Revision control packages such as RCS, SCCS, and CVS are available on all NERSC machines in the GNU module and should be used, particularly when multiple people are working on the same code. All of these are simple to learn.

Run-Time Errors

The largest share of debugging time is usually spent tracking down run-time errors. Run-time errors can be difficult to track down because the error messages are very general and the code crashes before useful data can be saved. Surprisingly, though, most run-time errors can be traced to a few common causes. These errors and their common causes are listed here.

Run-time errors often lack meaningful error explanations, which can make them difficult to understand and debug. Since AIX is basically a standard UNIX implementation, a good reference book on UNIX provides useful information on run-time errors.

One thing commonly associated with run-time errors is a core file. A core file is the image of a terminated process; in other words, it is a dump of everything in memory at the time of the crash.

Many successful programmers never look at core files, while others swear it is the easiest way to track down an error. Each person develops their own debugging style, but it is important not to automatically erase the core file because you think you aren't an advanced enough programmer to understand what's in it. When a code is compiled with the -g option, symbolic debugging is on and the core file is relatively easy to understand.

Operand Range Errors (ORE) / Segmentation Violation or Fault

Often, puzzling errors occur when a program accesses memory that it does not own or uses an index that is out of bounds for a given array. This sometimes leads to a segmentation fault or operand range error (ORE). But sometimes no problem is immediately apparent.

These errors can be difficult to find since the error often shows up in a different location from where it originated; one part of the program corrupts memory but the crash occurs later when that bad memory is accessed, often by a properly written piece of code.

The most important question is: How did the bad data get stored in memory in the first place? The most common cause is writing beyond the bounds of an array. Compilers provide switches that allow you to turn on run-time bounds checking for statically allocated arrays.

For IBM use -C -qextchk. This run-time checking will dramatically slow down execution of your code, so use it only as a debugging tool. Note that -qextchk checks for argument mismatches across subroutine calls; this will result in many warnings for MPI codes, which depend on weak type checking.

The code below has a number of problems. In line 11, the variable m is set to 1, but the subroutine called in line 12 writes data beyond the bounds of array n. Depending on a number of memory issues, this may or may not cause the program to crash. However, it will certainly corrupt the variable m, as is shown in line 13.

On line 15 memory is allocated for array r. The loop beginning on line 16 writes to out-of-bounds memory locations. Again, the code may or may not crash.

     1  !filename: ore.F
     2
     3   program ore
     4           implicit none
     5           integer:: m, i, errcode
     6           integer, dimension(10):: n
     7           real, allocatable, dimension(:) :: r
     8
     9           common n,m           !n,m are adjacent in memory
    10
    11           m = 1
    12           call sub             !sub has an error that clobbers m
    13           print *,'m:', m      !see the value of m
    14
    15           allocate(r(100))     !allocate a 100-element array
    16           do i=1,200           !write beyond allocated memory
    17                  r(i) = float(i)
    18           end do
    19

    28           deallocate(r)
    29   end program ore
    30
    31
    32   subroutine sub
    33          implicit none
    34           integer:: i
    35           integer, dimension(10):: q
    36
    37           common q
    38
    39           do i=1,50            !Clobber memory by exceeding
    40             q(i) = i           !array bounds
    41           enddo
    42   end subroutine sub

Following are some examples of compiling and running this code.

IBM SP

% xlf90 -o ore ore.F
** ore   === End of Compilation 1 ===
** sub   === End of Compilation 2 ===
1501-510  Compilation successful for file ore.F.
% ./ore
 m: 11
Segmentation fault (core dumped)
% xlf90 -C -qextchk -o ore ore.F
** ore   === End of Compilation 1 ===
** sub   === End of Compilation 2 ===
1501-510  Compilation successful for file ore.F.
% ./ore
Trace/BPT trap (core dumped)

On the IBM SP, a Segmentation Violation can also occur if your application exceeds the stack memory limit.
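One way to avoid both overruns in ore.F is to take the loop limits from the arrays themselves instead of from hard-coded constants. The following is only a minimal sketch of that idea, not a fix prescribed by this tutorial; the module name safe_fill and its two routines are hypothetical, and the common block is dropped in favor of explicit arguments.

```fortran
module safe_fill
  implicit none
contains
  ! Fill an integer array; size() keeps the loop inside the array.
  subroutine fill_ints(q)
    integer, intent(out) :: q(:)
    integer :: i
    do i = 1, size(q)
      q(i) = i
    end do
  end subroutine fill_ints

  ! Same idea for a real array.
  subroutine fill_reals(r)
    real, intent(out) :: r(:)
    integer :: i
    do i = 1, size(r)
      r(i) = real(i)
    end do
  end subroutine fill_reals
end module safe_fill

program ore_fixed
  use safe_fill
  implicit none
  integer :: m
  integer :: n(10)
  real, allocatable :: r(:)

  m = 1
  call fill_ints(n)            ! touches only n(1:10)
  print *, 'm:', m             ! m is no longer clobbered

  allocate(r(100))
  call fill_reals(r)           ! touches only r(1:100)
  print *, 'r(100):', r(100)
  deallocate(r)
end program ore_fixed
```

Because the routines receive assumed-shape arrays through a module interface, the compiler knows each array's extent, which also gives run-time bounds checking more to work with.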

Floating Point Exceptions (FPE)

Floating point exceptions (FPEs) are usually easier to debug than OREs, and they are almost always caused by a divide by zero. An FPE results when a floating point operation is attempted and one of the operands is not valid. Examples include dividing by zero and taking the square root of a negative number.

The IBM xlf compiler will not trap floating point exceptions by default. See xlf Floating Point Exceptions.

A common cause of FPEs is using uninitialized variables in a floating point operation. An option that may be helpful is the IBM xlf -qinitauto option, which initializes variables. By using -qinitauto=FF and -qflttrap=invalid:enable, you can identify uninitialized floating point variables on the IBM machines, since anything that "touches" the uninitialized variable will become a -NaNQ.

For example, this code does not initialize the variable a.

PROGRAM NOINIT
        IMPLICIT NONE
        
        real:: a,b

        b = 2.0 * a

        print *, b
END PROGRAM NOINIT      

Here are examples of compiling and running using both default compiler options and the ones mentioned above.

IBM SP

% xlf90 -o noinit noinit.f 
** noinit   === End of Compilation 1 ===
1501-510  Compilation successful for file noinit.f.
% ./noinit 
 0.0000000000E+00
% xlf90 -qflttrap=invalid:enable -qinitauto=FF -o noinit noinit.f
** noinit   === End of Compilation 1 ===
1501-510  Compilation successful for file noinit.f.
% ./noinit 
 -NaNQ

Even when the FPE is not due to an uninitialized variable, it is still usually easy to find: recompile your code with symbolic debugging on and open the resulting core file in Totalview, which will usually put you exactly at the line where the FPE occurred. Use totalview executable -c core.

You can also use brute-force methods to find FPEs, such as looking at variable values from within Totalview or including flags in your code that print variables when their values exceed certain limits. The latter suggestion is sometimes good for general debugging of run-time or logic errors. Stepping through the code with Totalview, however, gets tedious even when the crash occurs relatively early. Don't be afraid of the core file; playing around with Totalview and becoming familiar with its capabilities is the key.
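The idea of flagging variables whose values leave an expected range can be sketched as a small helper. This is an illustration only: the module range_check, the function out_of_range, and the limits used in the demonstration are all hypothetical. The NaN test relies on a NaN comparing unequal to itself, which aggressive floating point optimization may defeat.

```fortran
module range_check
  implicit none
contains
  ! Return .true. (and print a warning) when x is NaN or outside [lo, hi].
  logical function out_of_range(name, x, lo, hi)
    character(len=*), intent(in) :: name
    real, intent(in) :: x, lo, hi
    out_of_range = (x /= x) .or. (x < lo) .or. (x > hi)
    if (out_of_range) then
      print *, 'WARNING: ', name, ' = ', x, ' is outside ', lo, ' to ', hi
    end if
  end function out_of_range
end module range_check

program demo_range
  use range_check
  implicit none
  real :: temperature
  logical :: bad

  temperature = 1.0e6          ! suspicious value: triggers the warning
  bad = out_of_range('temperature', temperature, 0.0, 5000.0)

  temperature = 300.0          ! plausible value: no output
  bad = out_of_range('temperature', temperature, 0.0, 5000.0)
end program demo_range
```

Calling such a check after each major step narrows down where a bad value first appears without stepping through the code line by line.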

Other Errors

This last category covers any other general errors that prevent the code from completing normally. They might not properly be termed "run-time" errors, but they are presented in this section anyway.

Loader Errors

The most common load-time error is missing or unsatisfied externals. This is usually due to a missing library or to undimensioned arrays.

Wrong Answers

Once the code is compiled and running, the most important question to ask is whether it is giving the correct answers. Checks for this include determining whether energy is conserved or running a test case where the answers are known analytically. If the code fails to give the correct results, then the debugging process must continue. One important point is that the entire parameter space should be tested before the code is declared to be working properly. Often errors don't reveal themselves until the code is run with a given set of input parameters, even though the errors affect the results in all regimes.
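A test case with a known analytic answer can be as simple as the following sketch, which compares a numerical result against the exact value within a tolerance. The routine trapz_sin, the choice of integrand, and the tolerance are assumptions made for illustration; a real code would substitute its own solver and a known solution from its own problem domain.

```fortran
module quadrature
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Composite trapezoid rule for sin(x) on [a, b] using n intervals.
  function trapz_sin(a, b, n) result(s)
    real(dp), intent(in) :: a, b
    integer, intent(in) :: n
    real(dp) :: s, h
    integer :: i
    h = (b - a) / n
    s = 0.5_dp * (sin(a) + sin(b))
    do i = 1, n - 1
      s = s + sin(a + i*h)
    end do
    s = s * h
  end function trapz_sin
end module quadrature

program verify
  use quadrature
  implicit none
  real(dp), parameter :: pi = 3.141592653589793_dp
  real(dp), parameter :: exact = 2.0_dp      ! integral of sin(x) from 0 to pi
  real(dp), parameter :: tol = 1.0e-5_dp     ! assumed acceptable error
  real(dp) :: approx

  approx = trapz_sin(0.0_dp, pi, 1000)
  if (abs(approx - exact) > tol) then
    print *, 'FAIL: got', approx, ' expected', exact
  else
    print *, 'PASS: error =', abs(approx - exact)
  end if
end program verify
```

Keeping a few checks like this alongside the code makes it easy to rerun them after every significant change and across different input regimes.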

When it's discovered that the code is generating wrong answers, the first thing to determine is whether the discrepancy is small or large. These two categories of errors should be approached differently in terms of debugging and are discussed below.

Small Discrepancies from Expected Results

When code that was originally developed on a workstation is ported to a machine at NERSC, there might be slight discrepancies in the results.

Another source of small discrepancies between the expected and actual results is optimization. Optimization should always be turned off when the initial debugging is done. If the discrepancy disappears when optimization is turned off, then optimization is obviously the culprit.

A few other checks are useful when the code is exhibiting slight deviations from the expected results. The first is the effect of single versus double precision. Another is initialization of variables, which can be checked with compiler options such as the xlf -qinitauto option described above.

Large Discrepancies from Expected Results

When large discrepancies exist between the results and the correct answers, the problem is usually either a logic error in the algorithm or a memory allocation problem. Correcting logic errors can be facilitated by showing the call tree in Totalview. Memory allocation problems can often be corrected by compiling with the -qsave (IBM) option, which allocates variables to static storage.

MPI Errors

Special considerations must be taken when debugging MPI codes. Not only is the programming paradigm less familiar than traditional serial programming, but synchronization problems can be particularly tricky to identify. The best advice for debugging MPI is to have a good reference available, as man pages can be incomplete (e.g. the Fortran bindings are missing from the man pages here at NERSC). In addition, always run test cases before you implement a routine in a large code. You can never be sure that you understand how an MPI binding works until you've run test cases on the specific machine you will implement it on.

General Strategy

It's difficult to identify a general strategy for debugging MPI codes, since every program tends to be so different. But it's always a good idea to check the man page on every binding to make sure the syntax is correct.

The best tool for debugging MPI codes is Totalview. Totalview allows you to view variables on any processor and set break points in multiple places. An important consideration is that often it is best to set a break point after a chunk of code which contains complicated communication procedures, because otherwise the line-by-line stepping of Totalview can affect communication behavior.

Things to look for when using Totalview include the arguments being passed in an MPI call. Is every processor passing the arguments you think it's passing? Totalview is also handy for checking message tags when the code hangs. Another problem that can be easily spotted in Totalview is if one processor has stopped execution through a stop statement or some other reason. A last thing to check for is that a collective communication call such as MPI_BCAST is being made on every processor.

Common Errors

Here are some common MPI errors:

Fortran versus C Bindings

Fortran bindings require an extra argument that is often forgotten by the programmer. For example, compare the Fortran and C bindings for a barrier:

Fortran
MPI_BARRIER(comm,ierr)
C
MPI_Barrier(MPI_Comm comm)

It is easy to forget the ierr argument in the Fortran case.

Status in Receives

A few MPI bindings require a status argument, including the basic receive. For example, here is the Fortran binding for a basic receive:

Fortran
MPI_RECV(buf,count,datatype,source,tag,comm,status,ierr)

The status argument should be dimensioned as an integer array of length MPI_STATUS_SIZE:

status(MPI_STATUS_SIZE)

where MPI_STATUS_SIZE is defined in the Fortran include file mpif.h. Do not hard-code the length of this array yourself.
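A minimal sketch of a receive with a correctly dimensioned status array follows. The ring pattern, the tag value 99, and the buffer sizes are arbitrary choices for illustration; a non-blocking send is used so the sketch cannot deadlock, even when run on a single process.

```fortran
program ring_recv
  implicit none
  include 'mpif.h'
  integer :: ierr, rank, nprocs, left, right, request, nrecv
  integer :: status(MPI_STATUS_SIZE)      ! length supplied by mpif.h
  integer :: wstatus(MPI_STATUS_SIZE)     ! separate status for the wait
  real :: sendbuf(4), recvbuf(4)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  right = mod(rank + 1, nprocs)           ! neighbor we send to
  left  = mod(rank - 1 + nprocs, nprocs)  ! neighbor we receive from

  sendbuf = real(rank)
  call MPI_ISEND(sendbuf, 4, MPI_REAL, right, 99, MPI_COMM_WORLD, request, ierr)
  call MPI_RECV(recvbuf, 4, MPI_REAL, left, 99, MPI_COMM_WORLD, status, ierr)
  call MPI_WAIT(request, wstatus, ierr)

  ! The status array records who sent the message, its tag, and its size.
  call MPI_GET_COUNT(status, MPI_REAL, nrecv, ierr)
  print *, 'rank', rank, 'got', nrecv, 'values from rank', status(MPI_SOURCE)

  call MPI_FINALIZE(ierr)
end program ring_recv
```

The source and tag fields are read with the index constants MPI_SOURCE and MPI_TAG, which also come from mpif.h.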

Reduction Operations

When using a reduction operation such as MPI_REDUCE, where values from all processors are combined in some way into a variable on one processor, make sure that the send and receive buffers are different variables. For example, the Fortran binding for a reduce where a scalar from each processor is summed and the result is stored in a variable on processor zero would be:

MPI_REDUCE(sendbuf,recvbuf,count,datatype,op,root,comm,ierr)

If you wanted to use this to sum the energy calculated on each processor to find the total energy of the system, then you might be tempted to call both variables energy. Hence you would type:

MPI_REDUCE(energy,energy,1,MPI_REAL,MPI_SUM,0,MPI_COMM_WORLD,ierr)

This will fail, because the send and receive buffers must not be the same variable.
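A version that keeps the two buffers distinct might look like the sketch below. The variable name total_energy and the placeholder value assigned to energy are assumptions for illustration only.

```fortran
program sum_energy
  implicit none
  include 'mpif.h'
  integer :: ierr, rank
  real :: energy, total_energy   ! distinct send and receive variables

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

  energy = 1.5                   ! placeholder for the locally computed energy
  total_energy = 0.0

  ! Sum energy from every processor into total_energy on processor zero.
  call MPI_REDUCE(energy, total_energy, 1, MPI_REAL, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)

  ! The result is defined only on the root processor.
  if (rank == 0) print *, 'total energy:', total_energy

  call MPI_FINALIZE(ierr)
end program sum_energy
```

In MPI-2 and later, the root can instead pass MPI_IN_PLACE as the send buffer when the result should overwrite its own contribution.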

Non-Blocking I/O

An important aspect of non-blocking I/O that is in the man page but not in the manual is that you cannot complete a request with MPI_WAIT. You need to call MPIO_WAIT instead; otherwise the code will hang.

Limit on MPI Tags

MPI tag numbers greater than 32,768 (2^15) are not allowed. This problem might arise when using automatic tag generation.
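If tags are generated automatically, for example from a time step counter, one option is to fold them back into the permitted range. The sketch below is hypothetical: the function make_tag, its arguments, and the conservative limit of 32767 are assumptions, and reusing tags this way is safe only if messages from an earlier step have completed before the same tag comes around again.

```fortran
module tag_gen
  implicit none
  integer, parameter :: max_tag = 32767   ! assumed limit; adjust per machine
contains
  ! Map (step, msg_id) to a tag in 0..max_tag, where 0 <= msg_id < nmsg.
  ! Tags repeat every (max_tag + 1) / nmsg steps.
  integer function make_tag(step, msg_id, nmsg)
    integer, intent(in) :: step, msg_id, nmsg
    integer :: period
    period = (max_tag + 1) / nmsg
    make_tag = mod(step, period) * nmsg + msg_id
  end function make_tag
end module tag_gen

program demo_tags
  use tag_gen
  implicit none
  integer :: step

  ! Even a very large step number yields a tag within the limit.
  step = 1000000
  print *, 'step', step, 'tag', make_tag(step, 3, 8)
end program demo_tags
```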

Utilities

There are numerous utilities to aid in the debugging process. Those considered most useful will be described. The important thing to understand is that the more skilled you become with a particular debugging tool, the more useful it is for your debugging. Using a utility you are totally unfamiliar with is frustrating and probably won't help you find your error. But investing some time in learning the tool will save countless hours in future debugging.

Totalview

By far the most useful utility at NERSC for debugging is Totalview. Totalview provides an X Windows interface to a powerful debugger that lets you do everything from stepping through your code running on multiple processors to viewing your core files. It can be used for Fortran 90, C, C++ and HPF code.

Please see

Other Utilities

The table below lists some other debugging utilities. The most useful is Totalview, but the lint programs are good as a last resort to track down particularly difficult errors. They tend to generate more information than you would ever need, so options should be used to limit the output.

Please see the man page for each utility to get more information.

Utility      Description
lint         C language program checker
cflow        Generates C language flow graph
totalview    Debugger

Page last modified: Fri, 21 May 2004 20:53:54 GMT
Page URL: http://www.nersc.gov/nusers/help/tutorials/debug/print.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
