Historical: hpmcount and hpmlib on Seaborg (Seaborg was decommissioned in January 2008)
hpmcount runs an application and then reports
execution wall clock time, hardware performance counter
information, derived hardware metrics, and resource utilization statistics.
The overhead involved in using hpmcount is very small.
hpmlib is a library which provides an API to hpm from within a program.
How to use hpmcount
To use the hpmcount
utility, you do not need to modify your code.
Note:
To guarantee correct hpmcount results, your code must be compiled with
-qarch=pwr3 (or -qarch=auto).
By default hpmcount will write performance
statistics for each task to standard output.
hpmcount is part of the
HPM Toolkit. NERSC has compiled an
explanation of some of the hpmcount output.
All users will be interested in the reported value of
Floating point instructions + FMA rate, which is
the quantity usually called "FLOP" rate. The units reported are "Mflip/s",
which is equivalent to the term "Mflop/s" for most purposes.
Usage
- Serial Job
% hpmcount executable_name
- Parallel jobs (separate output for each task)
% poe hpmcount executable_name -nodes x -procs y
Aggregate numbers for parallel jobs can be obtained by using
IPM.
Caution: A common mistake:
Do not do the following:
hpmcount poe ./executable_name
hpmcount ./executable_name (if compiled with an mp* compiler)
Both of these are incorrect for parallel codes. They return the hardware
performance data on the poe controlling process, not your code.
Examples
Three examples of the use of hpmcount
are provided to illustrate various optimization
levels for a matrix-matrix multiply computation. Examples are
generally provided in Fortran, C and C++.
We've compiled an explanation of the
hardware counter output.
The IBM documentation for the
HPM Toolkit
provides detailed descriptions
of all the output measures provided by hpmcount
as well as information on the lower-level API libhpm
and a visualization tool, hpmviz, for displaying
output files produced using libhpm. An example is
provided in Fortran, C and C++ demonstrating the use of the
low-level API to instrument distinct code sections, and the
use of the hpmviz tool to visualize the analysis results.
For parallel codes, IPM is available to
aggregate hpmcount information across multiple processors. The use
of IPM is demonstrated with the following example.
Example 1: Unoptimized Matrix-Matrix Multiply
The following is a sample code which performs a matrix-matrix multiply.
It is provided in Fortran, C and C++.
Running this code under hpmcount produces the following
output.
Note that some of these results vary slightly for the
same compiler depending on machine loading.
The initial "adding counter" lines of the output indicate the
events that hpmcount is tracking.
After the code runs, hpmcount provides four sections of
output:
- Total execution wall clock time,
- Resource usage, including basic information on timing,
memory usage, page faults, file I/O, interprocessor communication
(IPC) and context switching
- Hardware counter information such as cycles, instructions,
cache misses, register stores and loads, floating point unit (FPU)
instructions, and floating point multiply-adds (FMAs)
- Rate information such as utilization, loads per cache miss,
instructions per load/store, millions of instructions per second
(MIPS), instructions per cycle, various sums of floating point operations
and rates, and a measure of computation intensity
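The rate figures in the last section follow from the raw counters by simple arithmetic. The sketch below is not hpmcount's source: the struct and function names are hypothetical, and the formulas are inferred from the Example 3 output later on this page, whose reported values they reproduce.

```c
#include <assert.h>

/* Raw event totals as printed by hpmcount (field names are hypothetical). */
struct hpm_counters {
    double cycles;        /* PM_CYC       */
    double instructions;  /* PM_INST_CMPL */
    double tlb_misses;    /* PM_TLB_MISS  */
    double stores;        /* PM_ST_CMPL   */
    double loads;         /* PM_LD_CMPL   */
    double fpu0;          /* PM_FPU0_CMPL */
    double fpu1;          /* PM_FPU1_CMPL */
    double fma;           /* PM_EXEC_FMA  */
};

/* "Floating point instructions + FMAs": FMAs are added on top of the
   FPU instruction counts (inferred from the sample output). */
double flips(const struct hpm_counters *c)
{
    return c->fpu0 + c->fpu1 + c->fma;
}

/* "Avg number of loads per TLB miss" */
double loads_per_tlb_miss(const struct hpm_counters *c)
{
    assert(c->tlb_misses > 0.0);
    return c->loads / c->tlb_misses;
}

/* "Instructions per load/store" */
double instructions_per_load_store(const struct hpm_counters *c)
{
    assert(c->loads + c->stores > 0.0);
    return c->instructions / (c->loads + c->stores);
}

/* "Instructions per cycle" */
double instructions_per_cycle(const struct hpm_counters *c)
{
    assert(c->cycles > 0.0);
    return c->instructions / c->cycles;
}

/* "Computation intensity": floating point work per load or store */
double computation_intensity(const struct hpm_counters *c)
{
    assert(c->loads + c->stores > 0.0);
    return flips(c) / (c->loads + c->stores);
}

/* "FMA percentage": share of the floating point total that comes from FMAs */
double fma_percentage(const struct hpm_counters *c)
{
    assert(flips(c) > 0.0);
    return 100.0 * 2.0 * c->fma / flips(c);
}
```

With the Example 3 counters (518258411 loads, 3127858 TLB misses), loads_per_tlb_miss gives about 165.69, in line with the 165.691 that hpmcount reports.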
The Mflip/s metric near the end of the output indicates
Millions of (arithmetic) FLoat Instructions Per Second.
This reflects the floating point
arithmetic operations performed by the code.
These values may be used to deduce the computational
efficiency of the code and possibly suggest optimization
strategies.
In general:
Mflip/s between 0- 100 (code needs optimization)
Mflip/s between 100- 400 (code may need some optimization)
Mflip/s between 400- 800 (well optimized code)
Mflip/s between 800-1500 (very well optimized code or IBM library)
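As a rough illustration of how the rate and the bands above fit together, the following sketch computes Mflip/s from the counter totals and the wall clock time, then maps the result onto the four ranges. The function names are hypothetical, the formula is inferred from the sample output on this page rather than taken from the IBM documentation, and treating each lower bound as inclusive is an assumption.

```c
#include <assert.h>

/* Mflip/s: (FPU 0 + FPU 1 + FMA counts) per second of wall clock time,
   in millions. Inferred from the hpmcount sample output. */
double mflips_rate(double fpu0, double fpu1, double fma, double wall_seconds)
{
    assert(wall_seconds > 0.0);
    return (fpu0 + fpu1 + fma) / wall_seconds / 1.0e6;
}

/* Map a rate onto the guideline ranges listed above.
   Assumption: a boundary value belongs to the higher range. */
const char *mflips_band(double mflips)
{
    if (mflips < 100.0) return "code needs optimization";
    if (mflips < 400.0) return "code may need some optimization";
    if (mflips < 800.0) return "well optimized code";
    return "very well optimized code or IBM library";
}
```

For the Example 3 run shown later (2005643101 floating point instructions plus FMAs in 2.377406 seconds), mflips_rate returns roughly 843.6, which falls in the highest range.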
The example code in Fortran is only reporting around 8 Mflip/s, a very
small percentage of the peak for a Power3 processor (1500 Mflip/s).
The example code in C and C++ is reporting about 200 Mflip/s,
which is better.
Some other useful values are:
Maximum resident set size : Amount of memory the code used
PM_TLB_MISS (TLB misses) : Use of Cache (Should be small)
Avg number of loads per TLB miss : Should be high (about 300 or more)
This code only used about 23 megabytes of memory. The Fortran
version had about 2 billion cache misses
and only about 1 cache page load per miss. This indicates
that the code was making inefficient use of the memory bandwidth.
The C and C++ versions had only about 2 million cache misses,
and over 1000 cache page loads per miss. The C and C++ versions of the code
made much more efficient use of memory bandwidth.
Example 2: Optimized Matrix-Matrix Multiply
The memory access pattern of the code shown above in Example 1
can be made more efficient if the order of the indices in the
nested loops is changed from i, k, j to j, k, i.
This results in a memory stride of one.
do j=1,index
do k=1,index
do i=1,index
The results of running this rearranged code under hpmcount are
shown here.
Notice that all of the examples shown thus far have the
same number of Floating point instructions + FMAs
of just over 2 billion operations. This reordering of the
indices in the Fortran code has now improved the performance
to the level of the C and C++ versions shown earlier.
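The same idea can be sketched in C, keeping in mind that C stores arrays row by row while Fortran stores them column by column, so the stride-one loop order differs between the two languages. In the sketch below (function names and the flat-array layout are illustrative assumptions, not the example sources), the i, k, j order is the stride-one order for C, and j, k, i is the strided one. This would be consistent with the C and C++ versions of Example 1 already performing well with the i, k, j order.

```c
#include <assert.h>
#include <stddef.h>

/* c = a * b for n-by-n matrices stored row-major in flat arrays.
   Loop order i, k, j: the innermost loop moves along a row of b and c,
   so consecutive iterations touch adjacent memory (stride one in C). */
void matmul_ikj(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t x = 0; x < (size_t)n * n; x++)
        c[x] = 0.0;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[(size_t)i * n + j] += a[(size_t)i * n + k] * b[(size_t)k * n + j];
}

/* Same arithmetic with loop order j, k, i: the innermost loop moves down
   a column of a and c, so consecutive iterations are n elements apart. */
void matmul_jki(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t x = 0; x < (size_t)n * n; x++)
        c[x] = 0.0;
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                c[(size_t)i * n + j] += a[(size_t)i * n + k] * b[(size_t)k * n + j];
}
```

Both functions produce the same matrix; only the order in which memory is visited changes, which is what the TLB miss counts reflect.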
Example 3: Using ESSL Library Function
Here is a third example which uses the IBM Engineering and Scientific
Subroutine Library (ESSL) function for the matrix-matrix multiply.
The function name is DGEMM.
Fortran
The Fortran code is the same as that shown above, except that the
triple-nested loop that does the matrix-matrix multiply is replaced
by a single library call:
call DGEMM('N','N',N,N,N,1.0d0,matrixa,N,matrixb,N,0.0d0,mres,N)
C/C++
Notice that C and C++
store matrices differently from Fortran, so the transpose of the matrices
is used to obtain the same result as the Fortran code.
The C++ version of Example 3 is quite similar to the C++ version
of Example 1, with the additional #include <essl.h>
and the replacement of the triple nested for loops with a call to
dgemm, as shown in the C example above.
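The storage-order remark above can be illustrated without ESSL. The routine below is a hypothetical, unoptimized stand-in written for this note; it is not ESSL's dgemm and does not reproduce its argument list. It multiplies n-by-n matrices that it assumes are stored column by column, as Fortran stores them, with flags selecting the transpose of either operand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not ESSL): C = op(A) * op(B) for n-by-n
   matrices in column-major storage, where element (i,j) lives at [i + j*n].
   A nonzero trans_a or trans_b selects the transpose of that operand. */
void gemm_colmajor(int trans_a, int trans_b, int n,
                   const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                double aik = trans_a ? a[k + (size_t)i * n] : a[i + (size_t)k * n];
                double bkj = trans_b ? b[j + (size_t)k * n] : b[k + (size_t)j * n];
                sum += aik * bkj;
            }
            c[i + (size_t)j * n] = sum;
        }
    }
}
```

A C array filled as x[i][j] looks like the transpose of that matrix to a column-major routine. Passing such arrays with both transpose flags set therefore multiplies the matrices as the C code sees them, and the product comes back in column-major order, so element (i,j) is found at c[i + j*n].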
The output from hpmcount is shown below:
% xlf90 -o ex3 -lessl ex3.f
** _main === End of Compilation 1 ===
1501-510 Compilation successful for file ex3.f.
% hpmcount ./ex3
adding counter 5 event 12 Cycles
adding counter 0 event 1 Instructions completed
adding counter 7 event 0 TLB misses
adding counter 2 event 9 Stores completed
adding counter 3 event 5 Loads completed
adding counter 4 event 5 FPU 0 instructions
adding counter 1 event 35 FPU 1 instructions
adding counter 6 event 9 FMAs executed
mres( 1000 1000 )= 666166500.000000000
hpmcount (V 2.3.1) summary
Total execution time (wall clock time): 2.377406 seconds
######## Resource Usage Statistics ########
Total amount of time in user mode : 2.140000 seconds
Total amount of time in system mode : 0.180000 seconds
Maximum resident set size : 31704 Kbytes
Average shared memory use in text segment : 924 Kbytes*sec
Average unshared memory use in data segment : 6367120 Kbytes*sec
Number of page faults without I/O activity : 7933
Number of page faults with I/O activity : 1
Number of times process was swapped out : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent : 0
Number of IPC messages received : 0
Number of signals delivered : 0
Number of voluntary context switches : 2
Number of involuntary context switches : 238
####### End of Resource Statistics ########
PM_CYC (Cycles) : 797311835
PM_INST_CMPL (Instructions completed) : 1634801891
PM_TLB_MISS (TLB misses) : 3127858
PM_ST_CMPL (Stores completed) : 14477496
PM_LD_CMPL (Loads completed) : 518258411
PM_FPU0_CMPL (FPU 0 instructions) : 506399346
PM_FPU1_CMPL (FPU 1 instructions) : 499241814
PM_EXEC_FMA (FMAs executed) : 1000001941
Utilization rate : 89.418 %
Avg number of loads per TLB miss : 165.691
Load and store operations : 532.736 M
Instructions per load/store : 3.069
MIPS : 687.641
Instructions per cycle : 2.050
HW Float points instructions per Cycle : 1.261
Floating point instructions + FMAs : 2005.643 M
Float point instructions + FMA rate : 843.627 Mflip/s
FMA percentage : 99.719 %
Computation intensity : 3.765
% xlc -o ex3 -lessl ex3.c
% hpmcount ./ex3
(The program output and initial hpmcount output for the C
version are
identical to Fortran example above and not shown here.)
Summary
This demonstrates a striking improvement in the efficiency when
using the highly optimized library routines in ESSL. The
Fortran code runs 100 times faster than the original version in Example 1
with inefficient memory stride, and even the C version runs 50 times
faster than the original version with an efficient memory stride.
The performance results for the C++ version are quite similar to those
for the C version of this example.
How to use hpmlib
Example 4: Sections of Unoptimized Matrix-Matrix Multiply
The following is a sample code which performs a matrix-matrix multiply.
The code has three separate instrumented sections using the libhpm
functions:
- Initialize the matrices
- Perform the matrix-matrix multiply
- Final output
The example is provided in Fortran, C and C++.
HPM data collection
is initialized with a call to f_hpminit for Fortran or
hpmInit for C/C++. The data is identified by
an integer task identifier (e.g., zero for serial codes,
MPI rank for MPI codes), and a text string.
Data collection is concluded with a
call to f_hpmterminate for Fortran or hpmTerminate for C/C++.
When HPM is terminated, the data is written to a .viz file.
Individual sections for HPM data collection are delimited by calls
to f_hpmstart and f_hpmstop
for Fortran or hpmStart and hpmStop for C/C++.
Individual sections are identified by a number and a text string.
Notice that the module
hpmtoolkit must be loaded before the code can be compiled.
! filename: ex4.f
! compile: module load hpmtoolkit
!          xlf -o ex4 ex4.f -qsuffix=cpp=f $HPMTOOLKIT
! run: ./ex4
      implicit none
      integer, PARAMETER :: index=1000
      REAL*8 matrixa(index,index),matrixb(index,index)
      REAL*8 mres(index,index)
      INTEGER i,j,k,n
#include "f_hpm.h"
! Start hpm monitoring
      call f_hpminit (0, "ex4.f")
! Initialize the Matrix arrays
      call f_hpmstart(1, "initialize matrices")
      do i=1,index
         do j=1,index
            matrixa(i,j) = real(i+j)
            matrixb(i,j) = real(j-i)
            mres(i,j) = 0.0
         end do
      end do
      call f_hpmstop(1)
! Matrix-Matrix Multiply
      call f_hpmstart(2, "matrix-matrix multiply")
      N = index
      do i=1,index
         do k=1,index
            do j=1,index
               mres(i,j) = mres(i,j) + matrixa(i,k)*matrixb(k,j)
            end do
         end do
      end do
      call f_hpmstop(2)
      call f_hpmstart(3, "final output")
      write(*,*)'mres(',n,n,')=',mres(n,n)
      call f_hpmstop(3)
! End hpm monitoring
      call f_hpmterminate(0)
      stop
      end
The C version of the example:
/* filename: ex4.c
   compile: module load hpmtoolkit
            xlc -o ex4 ex4.c $HPMTOOLKIT
   run: ./ex4
*/
#include "stdio.h"
#include "libhpm.h"
#define INDEX 1000
int main ()
{
    int index=INDEX;
    double matrixa[INDEX][INDEX], matrixb[INDEX][INDEX],
           mres[INDEX][INDEX];
    int i,j,k,n;
    /* Start HPM monitoring */
    hpmInit(0,"ex4.c");
    /* Initialize the Matrix arrays */
    hpmStart(1, "initialize matrices");
    for (i=0; i<INDEX; i++) {
        for (j=0; j<INDEX; j++) {
            matrixa[i][j] = i+j+2;
            matrixb[i][j] = j-i;
            mres[i][j] = 0;
        }
    }
    hpmStop(1);
    /* Matrix-Matrix Multiply */
    hpmStart(2, "matrix-matrix multiply");
    n = INDEX;
    for (i=0; i<INDEX; i++) {
        for (k=0; k<INDEX; k++) {
            for (j=0; j<INDEX; j++) {
                mres[i][j] = mres[i][j] + matrixa[i][k]*matrixb[k][j];
            }
        }
    }
    hpmStop(2);
    hpmStart(3, "final output");
    printf("mres(%d,%d)=%f\n", n, n, mres[n-1][n-1]);
    hpmStop(3);
    /* End hpm monitoring */
    hpmTerminate(0);
    return 0;
}
And the C++ version of the example. Notice that the name
of the include file has changed from the C version.
// filename: ex4.C
// compile: module load hpmtoolkit
//          xlC -o ex4 ex4.C $HPMTOOLKIT
// run: ./ex4
#include <iostream.h>
#include <libhpm.H>
#define INDEX 1000
int main ()
{
    int index=INDEX;
    double matrixa[INDEX][INDEX], matrixb[INDEX][INDEX],
           mres[INDEX][INDEX];
    int i,j,k,n;
    // Start HPM monitoring
    hpmInit(0,"ex4.C");
    // Initialize the Matrix arrays
    hpmStart(1, "initialize matrices");
    for (i=0; i<INDEX; i++) {
        for (j=0; j<INDEX; j++) {
            matrixa[i][j] = i+j+2;
            matrixb[i][j] = j-i;
            mres[i][j] = 0;
        }
    }
    hpmStop(1);
    // Matrix-Matrix Multiply
    hpmStart(2, "matrix-matrix multiply");
    n = INDEX;
    for (i=0; i<INDEX; i++) {
        for (k=0; k<INDEX; k++) {
            for (j=0; j<INDEX; j++) {
                mres[i][j] = mres[i][j] + matrixa[i][k]*matrixb[k][j];
            }
        }
    }
    hpmStop(2);
    hpmStart(3, "final output");
    cout.setf(ios::fixed);
    cout << "mres(" << n << ","
         << n << ")=" << mres[n-1][n-1] << endl;
    hpmStop(3);
    // End hpm monitoring
    hpmTerminate(0);
    return 0;
}
Compiling and running this code produces the following output.
% module load hpmtoolkit
% xlf -o ex4 ex4.f -qsuffix=cpp=f $HPMTOOLKIT
** _main === End of Compilation 1 ===
1501-510 Compilation successful for file ex4.f.
% ./ex4
adding counter 5 event 12 Cycles
adding counter 0 event 1 Instructions completed
adding counter 7 event 0 TLB misses
adding counter 2 event 9 Stores completed
adding counter 3 event 5 Loads completed
adding counter 4 event 5 FPU 0 instructions
adding counter 1 event 35 FPU 1 instructions
adding counter 6 event 9 FMAs executed
mres( 1000 1000 )= 666166500.000000000
libHPM output in perfhpm0000.64482
% module load hpmtoolkit
% xlc -o ex4 ex4.c $HPMTOOLKIT
% ./ex4
adding counter 5 event 12 Cycles
adding counter 0 event 1 Instructions completed
adding counter 7 event 0 TLB misses
adding counter 2 event 9 Stores completed
adding counter 3 event 5 Loads completed
adding counter 4 event 5 FPU 0 instructions
adding counter 1 event 35 FPU 1 instructions
adding counter 6 event 9 FMAs executed
mres(1000,1000)=666166500.000000
libHPM output in perfhpm0000.81814
% module load hpmtoolkit
% xlC -o ex4 ex4.C $HPMTOOLKIT
% ./ex4
adding counter 5 event 12 Cycles
adding counter 0 event 1 Instructions completed
adding counter 7 event 0 TLB misses
adding counter 2 event 9 Stores completed
adding counter 3 event 5 Loads completed
adding counter 4 event 5 FPU 0 instructions
adding counter 1 event 35 FPU 1 instructions
adding counter 6 event 9 FMAs executed
mres(1000,1000)=666166500.000000
libHPM output in perfhpm0000.70542
The .viz files that are created are:
hpm0000_ex4.f_64482.viz for Fortran,
hpm0000_ex4.c_81814.viz for C, and
hpm0000_ex4.C_70542.viz for C++. Notice that
these names include the arguments to the hpmInit function,
and are also connected, by means of the process id, to the
text output files.
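The naming pattern visible in these examples can be sketched as follows. The format strings are inferred only from the file names listed above (a four-digit task identifier, the label passed to hpmInit, and the process id); they are an assumption about libhpm's behavior, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Build the .viz file name seen above, e.g. hpm0000_ex4.f_64482.viz.
   Returns the length snprintf reports for the formatted name. */
int hpm_viz_name(char *buf, size_t len, int task, const char *label, long pid)
{
    assert(buf != NULL && label != NULL);
    return snprintf(buf, len, "hpm%04d_%s_%ld.viz", task, label, pid);
}

/* Build the matching text output file name, e.g. perfhpm0000.64482. */
int hpm_text_name(char *buf, size_t len, int task, long pid)
{
    assert(buf != NULL);
    return snprintf(buf, len, "perfhpm%04d.%ld", task, pid);
}
```

Given task 0, label "ex4.f" and process id 64482, the two helpers yield the Fortran file names shown above, which is how a .viz file can be paired with its text output file.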
The text output file for the Fortran version of this example
is shown below. Notice that there is an overall summary
of resource usage, and then separate information for each
instrumented section of the code.
% cat perfhpm0000.64482
libhpm (Version 2.3.1) summary - running on POWER3-II
Total execution time of instrumented code (wall time): 289.56 seconds
######## Resource Usage Statistics ########
Total amount of time in user mode : 289.140000 seconds
Total amount of time in system mode : 0.350000 seconds
Maximum resident set size : 23784 Kbytes
Average shared memory use in text segment : 166252 Kbytes*sec
Average unshared memory use in data segment : 685331496 Kbytes*sec
Number of page faults without I/O activity : 5966
Number of page faults with I/O activity : 22
Number of times process was swapped out : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent : 0
Number of IPC messages received : 0
Number of signals delivered : 0
Number of voluntary context switches : 10
Number of involuntary context switches : 28960
####### End of Resource Statistics ########
Instrumented section: 1 - Label: initialize matrices - process: 0
file: ex4.f, lines: 17 <--> 25
Count: 1
Wall Clock Time: 0.739748 seconds
Total time in user mode: 0.582817780163786 seconds
PM_CYC (Cycles) : 218591825
PM_INST_CMPL (Instructions completed) : 44007010
PM_TLB_MISS (TLB misses) : 3006426
PM_ST_CMPL (Stores completed) : 6002003
PM_LD_CMPL (Loads completed) : 12001002
PM_FPU0_CMPL (FPU 0 instructions) : 4000023
PM_FPU1_CMPL (FPU 1 instructions) : 999988
PM_EXEC_FMA (FMAs executed) : 0
Utilization rate : 78.786 %
Avg number of loads per TLB miss : 3.992
Load and store operations : 18.003 M
Instructions per load/store : 2.444
MIPS : 59.489
Instructions per cycle : 0.201
HW Float points instructions per Cycle : 0.023
Floating point instructions + FMAs : 5.000 M
Float point instructions + FMA rate : 6.759 Mflip/s
FMA percentage : 0.000 %
Computation intensity : 0.278
Instrumented section: 2 - Label: matrix-matrix multiply - process: 0
file: ex4.f, lines: 28 <--> 37
Count: 1
Wall Clock Time: 288.821541 seconds
Total time in user mode: 287.304251770298 seconds
PM_CYC (Cycles) : 107756496862
PM_INST_CMPL (Instructions completed) : 29007007012
PM_TLB_MISS (TLB misses) : 2001221941
PM_ST_CMPL (Stores completed) : 2002002004
PM_LD_CMPL (Loads completed) : 8001001002
PM_FPU0_CMPL (FPU 0 instructions) : 1000016230
PM_FPU1_CMPL (FPU 1 instructions) : 0
PM_EXEC_FMA (FMAs executed) : 1000016230
Utilization rate : 99.475 %
Avg number of loads per TLB miss : 3.998
Load and store operations : 10003.003 M
Instructions per load/store : 2.900
MIPS : 100.432
Instructions per cycle : 0.269
HW Float points instructions per Cycle : 0.009
Floating point instructions + FMAs : 2000.032 M
Float point instructions + FMA rate : 6.925 Mflip/s
FMA percentage : 100.000 %
Computation intensity : 0.200
Instrumented section: 3 - Label: final output - process: 0
file: ex4.f, lines: 39 <--> 41
Count: 1
Wall Clock Time: 0.001785 seconds
Total time in user mode: 0.000283860815953749 seconds
PM_CYC (Cycles) : 106320
PM_INST_CMPL (Instructions completed) : 25864
PM_TLB_MISS (TLB misses) : 116
PM_ST_CMPL (Stores completed) : 6008
PM_LD_CMPL (Loads completed) : 4985
PM_FPU0_CMPL (FPU 0 instructions) : 345
PM_FPU1_CMPL (FPU 1 instructions) : 85
PM_EXEC_FMA (FMAs executed) : 75
Utilization rate : 15.881 %
Avg number of loads per TLB miss : 42.974
Load and store operations : 0.011 M
Instructions per load/store : 2.353
MIPS : 14.490
Instructions per cycle : 0.243
HW Float points instructions per Cycle : 0.004
Floating point instructions + FMAs : 0.001 M
Float point instructions + FMA rate : 0.283 Mflip/s
FMA percentage : 29.703 %
Computation intensity : 0.046
The hpmviz tool is an X-Windows application that provides a graphical
user interface (GUI) to examine the separate sections of the code.
The tool is started with the command
After hpmviz has started, the various .viz files
can be opened with the pull-down File menu.
A sample screen shot, after loading the three .viz
files for this example, is shown by following the link below.
hpmviz screen shot
The left side of the screen contains the instrumented sections
of the code and timing information. The right side contains
the source code. Tabs at the top of each half-screen window
control the file being displayed.
Left-mouse-button clicking on the name of an instrumented
section on the left will
highlight the corresponding source code section on the right.
Right-mouse-button clicking on the name of an instrumented
section on the left will open a new window with detailed HPM
information.
hpmviz screen shot
The hpmviz tool is particularly useful for multiprocessor
computations with multiple instrumented sections.
Example 5: Parallel Library Function (PESSL)
Here is an example which uses the IBM Parallel Engineering and Scientific
Subroutine Library (PESSL) function for the matrix-matrix multiply.
The function name is PDGEMM.
The code is provided in
Running this code under IPM for parallel HPM
produces the following
output.
Although the parallel code is slower than the serial code for a problem
of this size (1000 by 1000 matrices), due in part to communication overhead,
for larger problems (such as 10,000 by 10,000 matrices) which cannot fit
on a single node, performance approaches 800 Mflip/s/processor for parallel
computations.
Performance values are comparable whether the IBM PESSL library
or the NERSC/ScaLAPACK library is used. Instructions for compiling
and linking the codes using either library set are included in the
source code files above.