The TotalView debugger on Jacquard is provided and supported by Etnus. There is extensive documentation for it, including the User's Guide. There are also web-based tutorials at Etnus and LLNL.

Using TotalView on Jacquard
Compiling an example program

In order to use the debugger, code must be compiled with the -g option. This produces a larger executable that may run relatively slowly, so be sure to recompile without the -g option once you are ready for production runs. The attached example program, totex.f, can be compiled with the following command:

% mpif90 -o totex -g totex.f

Running the program without the debugger produces the following output:

% mpirun -np 4 ./totex
All these values should be the same:
Processor Number : 0 Before send = 0.761171758
x1(3) on Processor No. : 0 After recv = 0.761171758
x1(3) on Processor No. : 1 After recv = 0.E+0
x1(3) on Processor No. : 2 After recv = 0.E+0
x1(3) on Processor No. : 3 After recv = 0.761171758

Processors 1 and 2 contain unexpected values in array x1.

Starting the debugger

To run the program under TotalView on Jacquard, you must start a batch job that runs an xterm on the desired number of nodes. The attached example batch script, totex.pbs, runs an xterm on 2 nodes. It reserves 2 processors on each node, so that a 4-processor job can be run. Submit this job to PBS:

% qsub totex.pbs

When PBS runs the xterm job, an xterm window will pop up on your workstation. From this window you must load the totalview module and launch the job under TotalView using the -tv argument to the mpiexec job launcher:

% module load totalview
% mpiexec -n 4 -tv ./totex

This command runs the executable totex on two nodes with a total of four processes. It opens two windows: the root window and the process window.

Setting breakpoints

At this point, you can set breakpoints in the process window at the lines where you want the program to stop during the debug run. To set a breakpoint, left-click on the line number. For this example, left-click on line 18 to create a breakpoint there. The breakpoint is set for all of the MPI tasks.
The process window will look like this once the breakpoint is set successfully.

Advancing to breakpoints

To start the program, go to the process window and left-click on the Go button. Three windows will pop up in succession, and you must click the appropriate response on each before the program actually starts. At the first window, left-click on No. We have found that stopping the program here to set breakpoints can lead to unpredictable results. At the second window, left-click on No again. At the third window, left-click on OK. After this last click, the program will start executing on all processors and will stop at line 18. By default the process window shows the state of MPI process 0.

TotalView processes on Jacquard

TotalView process labelling conventions on Jacquard are somewhat confusing and inconsistent. Within MPI, an N-process job has processes (called ranks within the MPI program) labelled from 0 to N-1. TotalView, on the other hand, labels the processes from 1 to N. In the case of our 4-process totex run, the processes are labelled totex.1, totex.2, totex.3, and totex.4, and TotalView process totex.m corresponds to MPI process m-1.

In addition, each MPI process on Jacquard consists of 3 threads: one running the user code and two running in the system routine ioctl. Only the thread running the user code is of interest to you. For TotalView process 1 (MPI process 0), the user code is thread 1.1. For all other processes the user code thread is number 3, so if you want to look at the state of MPI process 2, you click on thread 3.3.

Debugging the example program

Once the MPI program has started executing and you have reached a breakpoint, the root window shows the status of each MPI process. Left-click on the plus (+) sign beside the process with the ID of 3 and Rank of 2 in the root window to see the status of each thread of MPI process 2.
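The labelling rules described above (TotalView process m is MPI rank m-1, with the user code in thread 1 for rank 0 and thread 3 for every other rank) can be written down as a small shell helper. This is only an illustrative sketch: the function name tv_user_thread is hypothetical and not part of TotalView or Jacquard, and it assumes the executable is named totex as in this example.

```shell
# tv_user_thread RANK
# Print the TotalView process label and the process.thread ID of the
# user-code thread for a given MPI rank, following the Jacquard
# conventions described in the text. Hypothetical helper, not a
# TotalView command.
tv_user_thread() {
  rank=$1
  proc=$((rank + 1))        # TotalView counts processes from 1, MPI from 0
  if [ "$rank" -eq 0 ]; then
    thread=1                # rank 0: user code runs in thread 1
  else
    thread=3                # other ranks: user code runs in thread 3
  fi
  echo "totex.${proc} ${proc}.${thread}"
}
```

For example, tv_user_thread 2 prints "totex.3 3.3", which is the thread to dive into when inspecting MPI process 2.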
To see what's happening in this process, right-click the user MAIN_ thread 3.3 and choose Dive in New Window from the popup menu (plain Dive will change the existing process window instead). A new process window will open for MPI process 2. You can also step to adjacent processes with the "P-" and "P+" buttons in the lower right corner of the process window.

Finding the error

Now let's try to find the problem. Allow the program to advance through all the MPI calls and stop it right afterward by setting a breakpoint at line 52 in a process window and clicking "Go" in that window. All MPI processes of the program will advance to line 52 and stop.

Examining variables

You examine variables by right-clicking on the variable name in the process window and selecting Dive from the popup menu. For example, right-click on the x1 variable name on line 46 of the process window. A new window will open containing the values of x1 on the thread represented by the process window. In the upper left corner of this data window, you see the process.thread combination 1.1, indicating that this is the user thread of process 1 (MPI process 0). By clicking on the arrow on the left of this process.thread number, you can advance to process 2 (MPI process 1) at 2.3, which shows the values of x1 on processor 1. Similarly, you can examine the values on process 3 (MPI process 2, thread 3.3) and on process 4 (MPI process 3, thread 4.3).

Solving the problem

The values contained in the x1 array give us a big clue to the solution. Since each processor has a number of non-zero elements that depends on its MPI rank, we suspect the problem lies in one of the loops that perform the MPI_SENDs and/or MPI_RECVs. If we first convince ourselves that line 46 is OK, we are led to take a look at line 34. There we see that we're sending i elements of the array x1, not all im1 elements as we had intended.
Once we make the change, recompile, and rerun the program, we get the following output:

All these values should be the same:
Processor Number : 0 Before send = 0.761171758
x1(3) on Processor No. : 0 After recv = 0.761171758
x1(3) on Processor No. : 1 After recv = 0.761171758
x1(3) on Processor No. : 3 After recv = 0.761171758
x1(3) on Processor No. : 2 After recv = 0.761171758

And the code has been fixed!
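Rather than comparing the printed values by eye, the output can be checked mechanically. The sketch below assumes that totex prints one record per line and that the value is the last field of each "Before send" and "After recv" line; the function name check_totex_output is hypothetical and not part of the example program.

```shell
# check_totex_output
# Read totex output on stdin. Exit status:
#   0  every "After recv" value equals the "Before send" value
#   1  at least one received value differs
#   2  no "Before send" line or no "After recv" lines were found
# Assumes one record per line with the value as the last field.
check_totex_output() {
  awk '
    /Before send =/ { expected = $NF }      # value held by processor 0
    /After recv =/  { got[++n] = $NF }      # value seen on each processor
    END {
      if (expected == "" || n == 0) exit 2
      for (i = 1; i <= n; i++)
        if (got[i] != expected) exit 1
      exit 0
    }
  '
}
```

It could be used as: mpirun -np 4 ./totex | check_totex_output && echo "all values match". Values are collected first and compared at the end, so the check does not depend on the order in which the processors print their lines.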
Page last modified: Thu, 09 Nov 2006 19:10:06 GMT
Page URL: http://www.nersc.gov/nusers/systems/jacquard/software/totalview/
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov