National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

Historical: hpmcount and hpmlib on Seaborg
Seaborg Decommissioned January 2008

hpmcount runs an application and then reports execution wall clock time, hardware performance counter information, derived hardware metrics, and resource utilization statistics. The overhead involved in using hpmcount is very small.

hpmlib is a library which provides an API to hpm from within a program.

How to use hpmcount

To use the hpmcount utility, you do not need to modify your code.

Note: To guarantee correct hpmcount results, your code must be compiled with -qarch=pwr3 (or -qarch=auto).

By default hpmcount will write performance statistics for each task to standard output.

hpmcount is part of the HPM Toolkit. NERSC has compiled an explanation of some of the hpmcount output. All users will be interested in the reported value of Floating point instructions + FMA rate, which is the quantity usually called "FLOP" rate. The units reported are "Mflip/s", which is equivalent to the term "Mflop/s" for most purposes.

Usage

  • Serial Job
    % hpmcount executable_name
    
  • Parallel jobs (separate output for each task)
    % poe hpmcount executable_name -nodes x -procs y
    

Aggregate numbers for parallel jobs can be obtained by using IPM.

Caution: a common mistake is to run hpmcount on poe itself, or directly on an executable built with an mp* compiler. Do not do the following:

hpmcount poe ./executable_name 
hpmcount ./executable_name (if compiled with an mp* compiler)

Both of these are incorrect for parallel codes. They return the hardware performance data on the poe controlling process, not your code.

Examples

Three examples of the use of hpmcount are provided to illustrate various optimization levels for a matrix-matrix multiply computation. Examples are generally provided in Fortran, C and C++.

We've compiled an explanation of the hardware counter output. The IBM documentation for the HPM Toolkit provides detailed descriptions of all the output measures provided by hpmcount, as well as information on the lower-level API libhpm and on hpmviz, a visualization tool for displaying output files produced using libhpm. An example is provided in Fortran, C and C++ demonstrating the use of the low-level API to instrument distinct code sections, and the use of the hpmviz tool to visualize the analysis results.

For parallel codes, IPM is available to aggregate hpmcount information across multiple processors. The use of IPM is demonstrated with the following example.

Example 1: Unoptimized Matrix-Matrix Multiply

The following is a sample code which performs a matrix-matrix multiply. It is provided in Fortran, C and C++.

Running this code under hpmcount produces the following output.

Note that some of these results vary slightly for the same compiler, depending on machine load.

The initial "adding counter" lines of the output list the events that hpmcount is tracking.

After the code runs, hpmcount provides four sections of output:

  • Total execution wall clock time,
  • Resource usage, including basic information on timing, memory usage, page faults, file I/O, interprocessor communication (IPC) and context switching
  • Hardware counter information such as cycles, instructions, TLB misses, stores and loads, floating point unit (FPU) instructions, and floating point multiply-adds (FMAs)
  • Rate information such as utilization, loads per TLB miss, instructions per load/store, millions of instructions per second (MIPS), instructions per cycle, various sums of floating point operations and rates, and a measure of computation intensity

The Mflip/s metric near the end of the output indicates Millions of (arithmetic) FLoat Instructions Per Second. This reflects the floating point arithmetic operations performed by the code. These values may be used to deduce the computational efficiency of the code and possibly suggest optimization strategies. In general:

  Mflip/s between   0- 100 (code needs optimization)
  Mflip/s between 100- 400 (code may need some optimization)
  Mflip/s between 400- 800 (well optimized code)
  Mflip/s between 800-1500 (very well optimized code or IBM library)

The example code in Fortran is only reporting around 8 Mflip/s, a very small percentage of the peak for a Power3 processor (1500 Mflip/s). The example code in C and C++ is reporting about 200 Mflip/s, which is better.

Some other useful values are:

 Maximum resident set size        : Amount of memory the code used
 PM_TLB_MISS (TLB misses)         : Use of Cache   (Should be small)
 Avg number of loads per TLB miss : Should be high (about 300 or more)

This code used only about 23 megabytes of memory. The Fortran version had about 2 billion TLB misses and only about one load per TLB miss, which indicates that the code was making inefficient use of the memory bandwidth. The C and C++ versions had only about 2 million TLB misses and over 1000 loads per TLB miss, so they made much more efficient use of memory bandwidth.

Example 2: Optimized Matrix-Matrix Multiply

The memory access pattern of the code shown above in Example 1 can be made more efficient if the order of the indices in the nested loops is changed from i, k, j to j, k, i. This results in a memory stride of one.

        do j=1,index
           do k=1,index
              do i=1,index

The results of running this rearranged code under hpmcount are shown here.

Notice that all of the examples shown thus far have the same number of Floating point instructions + FMAs of just over 2 billion operations. This reordering of the indices in the Fortran code has now improved the performance to the level of the C and C++ versions shown earlier.

Example 3: Using ESSL Library Function

Here is a third example which uses the IBM Engineering and Scientific Subroutine Library (ESSL) function for the matrix-matrix multiply. The function name is DGEMM.

Fortran

The Fortran code is the same as that shown above, except that the triple-nested loop that does the matrix-matrix multiply is replaced by a single library call:

     call DGEMM('N','N',N,N,N,1.0d0,matrixa,N,matrixb,N,0.0d0,mres,N)

C/C++

Notice that C and C++ store matrices differently from Fortran, so the transpose of the matrices is used to obtain the same result as the Fortran code.

The C++ version of Example 3 is quite similar to the C++ version of Example 1, with the additional #include <essl.h> and the replacement of the triple nested for loops with a call to dgemm, as shown in the C example above.

The output from hpmcount is shown below:

% xlf90 -o ex3 -lessl ex3.f
** _main   === End of Compilation 1 ===
1501-510  Compilation successful for file ex3.f.

% hpmcount ./ex3
 adding counter 5 event 12 Cycles
 adding counter 0 event 1 Instructions completed
 adding counter 7 event 0 TLB misses
 adding counter 2 event 9 Stores completed
 adding counter 3 event 5 Loads completed
 adding counter 4 event 5 FPU 0 instructions
 adding counter 1 event 35 FPU 1 instructions
 adding counter 6 event 9 FMAs executed
 
 mres( 1000 1000 )= 666166500.000000000
 
 hpmcount (V 2.3.1) summary
 
 Total execution time (wall clock time): 2.377406 seconds
 
 ########  Resource Usage Statistics  ########  
 
 Total amount of time in user mode            : 2.140000 seconds
 Total amount of time in system mode          : 0.180000 seconds
 Maximum resident set size                    : 31704 Kbytes
 Average shared memory use in text segment    : 924 Kbytes*sec
 Average unshared memory use in data segment  : 6367120 Kbytes*sec
 Number of page faults without I/O activity   : 7933
 Number of page faults with I/O activity      : 1
 Number of times process was swapped out      : 0
 Number of times file system performed INPUT  : 0
 Number of times file system performed OUTPUT : 0
 Number of IPC messages sent                  : 0
 Number of IPC messages received              : 0
 Number of signals delivered                  : 0
 Number of voluntary context switches         : 2
 Number of involuntary context switches       : 238
 
 #######  End of Resource Statistics  ########
 
  PM_CYC (Cycles)                            :       797311835
  PM_INST_CMPL (Instructions completed)      :      1634801891
  PM_TLB_MISS (TLB misses)                   :         3127858
  PM_ST_CMPL (Stores completed)              :        14477496
  PM_LD_CMPL (Loads completed)               :       518258411
  PM_FPU0_CMPL (FPU 0 instructions)          :       506399346
  PM_FPU1_CMPL (FPU 1 instructions)          :       499241814
  PM_EXEC_FMA (FMAs executed)                :      1000001941
 
  Utilization rate                           :          89.418 %
  Avg number of loads per TLB miss           :         165.691
  Load and store operations                  :         532.736 M
  Instructions per load/store                :           3.069
  MIPS                                       :         687.641
  Instructions per cycle                     :           2.050
  HW Float points instructions per Cycle     :           1.261
  Floating point instructions + FMAs         :        2005.643 M
  Float point instructions + FMA rate        :         843.627 Mflip/s
  FMA percentage                             :          99.719 %
  Computation intensity                      :           3.765

% xlc -o ex3 -lessl ex3.c 
% hpmcount ./ex3
(The program output and initial hpmcount output for the C version are identical to the Fortran example above and are not shown here.)

Summary

This demonstrates a striking improvement in the efficiency when using the highly optimized library routines in ESSL. The Fortran code runs 100 times faster than the original version in Example 1 with inefficient memory stride, and even the C version runs 50 times faster than the original version with an efficient memory stride. The performance results for the C++ version are quite similar to those for the C version of this example.

How to use hpmlib

Example 4: Sections of Unoptimized Matrix-Matrix Multiply

The following is a sample code which performs a matrix-matrix multiply. The code has three separate instrumented sections using the libhpm functions:

  1. Initialize the matrices
  2. Perform the matrix-matrix multiply
  3. Final output

The example is provided in Fortran, C and C++.

HPM data collection is initialized with a call to f_hpminit for Fortran or hpmInit for C/C++. The data is identified by an integer task identifier (e.g., zero for serial codes, MPI rank for MPI codes), and a text string. Data collection is concluded with a call to f_hpmterminate for Fortran or hpmTerminate for C/C++. When HPM is terminated, the data is written to a .viz file.

Individual sections for HPM data collection are delimited by calls to f_hpmstart and f_hpmstop for Fortran or hpmStart and hpmStop for C/C++. Individual sections are identified by a number and a text string.

Notice that the module hpmtoolkit must be loaded before the code can be compiled.

! filename:  ex4.f
! compile:   module load hpmtoolkit
!            xlf -o ex4 ex4.f -qsuffix=cpp=f $HPMTOOLKIT
! run:       ./ex4
 
        implicit none
        integer, PARAMETER :: index=1000
        REAL*8 matrixa(index,index),matrixb(index,index)
        REAL*8 mres(index,index)
        INTEGER i,j,k,n
#include "f_hpm.h"
 
! Start hpm monitoring
        call f_hpminit (0, "ex4.f") 
 
! Initialize the Matrix arrays
        call f_hpmstart(1, "initialize matrices") 
        do i=1,index
           do j=1,index
              matrixa(i,j) = real(i+j)
              matrixb(i,j) = real(j-i)
              mres(i,j) = 0.0
           end do
        end do
        call f_hpmstop(1) 
 
! Matrix-Matrix Multiply
        call f_hpmstart(2, "matrix-matrix multiply") 
        N = index
        do i=1,index
           do k=1,index
              do j=1,index
                 mres(i,j) = mres(i,j) + matrixa(i,k)*matrixb(k,j)
              end do
           end do
        end do
        call f_hpmstop(2) 
 
        call f_hpmstart(3, "final output")
        write(*,*)'mres(',n,n,')=',mres(n,n)
        call f_hpmstop(3)
 
! End hpm monitoring
        call f_hpmterminate(0) 
 
        stop
        end

The C version of the example:


/* filename:   ex4.c
   compile:    module load hpmtoolkit
               xlc -o ex4 ex4.c $HPMTOOLKIT
   run:        ./ex4
*/
 
#include "stdio.h"
#include "libhpm.h"
#define INDEX 1000
 
int main ()
{
   int index=INDEX;
   double matrixa[INDEX][INDEX], matrixb[INDEX][INDEX], 
			mres[INDEX][INDEX];
   int i,j,k,n;
 
/*  Start HPM monitoring   */
   hpmInit(0,"ex4.c");
 
/*  Initialize the Matrix arrays   */
   hpmStart(1, "initialize matrices");
   for (i=0; i<INDEX; i++) {
      for (j=0; j<INDEX; j++) {
         matrixa[i][j] = i+j+2;
         matrixb[i][j] = j-i;
         mres[i][j]    = 0;
      }
   }
   hpmStop(1);
 
/*   Matrix-Matrix Multiply       */
 
   hpmStart(2, "matrix-matrix multiply");
   n = INDEX; 
   for (i=0; i<INDEX; i++) {
      for (k=0; k<INDEX; k++) {
         for (j=0; j<INDEX; j++) {
            mres[i][j] = mres[i][j] + matrixa[i][k]*matrixb[k][j];
         }
      }
   }
   hpmStop(2);
 
   hpmStart(3, "final output");
   printf("mres(%d,%d)=%f\n", n, n, mres[n-1][n-1]);
   hpmStop(3);
 
/* End hpm monitoring   */
   hpmTerminate(0);
 
   return 0;
}

And the C++ version of the example. Notice that the name of the include file has changed from the C version.


// filename:   ex4.C
// compile:    module load hpmtoolkit
//             xlC -o ex4 ex4.C $HPMTOOLKIT
// run:        ./ex4
 
#include <iostream.h>
#include <libhpm.H>
#define INDEX 1000
 
int main ()
{
   int index=INDEX;
   double matrixa[INDEX][INDEX], matrixb[INDEX][INDEX], 
		mres[INDEX][INDEX];
   int i,j,k,n;
 
//  Start HPM monitoring 
   hpmInit(0,"ex4.C");
 
//  Initialize the Matrix arrays 
   hpmStart(1, "initialize matrices");
   for (i=0; i<INDEX; i++) {
      for (j=0; j<INDEX; j++) {
         matrixa[i][j] = i+j+2;
         matrixb[i][j] = j-i;
         mres[i][j]    = 0;
      }
   }
   hpmStop(1);
 
//   Matrix-Matrix Multiply 
 
   hpmStart(2, "matrix-matrix multiply");
   n = INDEX; 
   for (i=0; i<INDEX; i++) {
      for (k=0; k<INDEX; k++) {
         for (j=0; j<INDEX; j++) {
            mres[i][j] = mres[i][j] + matrixa[i][k]*matrixb[k][j];
         }
      }
   }
   hpmStop(2);
 
   hpmStart(3, "final output");
   cout.setf(ios::fixed);
   cout << "mres(" << n << "," 
		<< n << ")=" << mres[n-1][n-1] << endl;
   hpmStop(3);
 
// End hpm monitoring 
   hpmTerminate(0);
 
   return 0;
}

Compiling and running this code produces the following output.

% module load hpmtoolkit 

% xlf -o ex4 ex4.f -qsuffix=cpp=f $HPMTOOLKIT 
** _main   === End of Compilation 1 ===
1501-510  Compilation successful for file ex4.f.

% ./ex4
 adding counter 5 event 12 Cycles
 adding counter 0 event 1 Instructions completed
 adding counter 7 event 0 TLB misses
 adding counter 2 event 9 Stores completed
 adding counter 3 event 5 Loads completed
 adding counter 4 event 5 FPU 0 instructions
 adding counter 1 event 35 FPU 1 instructions
 adding counter 6 event 9 FMAs executed

 mres( 1000 1000 )= 666166500.000000000
 
libHPM output in perfhpm0000.64482


% module load hpmtoolkit
% xlc -o ex4 ex4.c $HPMTOOLKIT
% ./ex4
 adding counter 5 event 12 Cycles
 adding counter 0 event 1 Instructions completed
 adding counter 7 event 0 TLB misses
 adding counter 2 event 9 Stores completed
 adding counter 3 event 5 Loads completed
 adding counter 4 event 5 FPU 0 instructions
 adding counter 1 event 35 FPU 1 instructions
 adding counter 6 event 9 FMAs executed
mres(1000,1000)=666166500.000000
 
libHPM output in perfhpm0000.81814


% module load hpmtoolkit
% xlC -o ex4 ex4.C $HPMTOOLKIT
% ./ex4
 adding counter 5 event 12 Cycles
 adding counter 0 event 1 Instructions completed
 adding counter 7 event 0 TLB misses
 adding counter 2 event 9 Stores completed
 adding counter 3 event 5 Loads completed
 adding counter 4 event 5 FPU 0 instructions
 adding counter 1 event 35 FPU 1 instructions
 adding counter 6 event 9 FMAs executed
mres(1000,1000)=666166500.000000
 
libHPM output in perfhpm0000.70542

The .viz files that are created are: hpm0000_ex4.f_64482.viz for Fortran, hpm0000_ex4.c_81814.viz for C, and hpm0000_ex4.C_70542.viz for C++. Notice that these names include the arguments to the hpmInit function, and are also connected, by means of the process id, to the text output files.

The text output file for the Fortran version of this example is shown below. Notice that there is an overall summary of resource usage, and then separate information for each instrumented section of the code.


% cat perfhpm0000.64482
 
 libhpm (Version 2.3.1) summary - running on POWER3-II
 
 Total execution time of instrumented code (wall time): 289.56 seconds
 
 ########  Resource Usage Statistics  ########  
 
 Total amount of time in user mode            : 289.140000 seconds
 Total amount of time in system mode          : 0.350000 seconds
 Maximum resident set size                    : 23784 Kbytes
 Average shared memory use in text segment    : 166252 Kbytes*sec
 Average unshared memory use in data segment  : 685331496 Kbytes*sec
 Number of page faults without I/O activity   : 5966
 Number of page faults with I/O activity      : 22
 Number of times process was swapped out      : 0
 Number of times file system performed INPUT  : 0
 Number of times file system performed OUTPUT : 0
 Number of IPC messages sent                  : 0
 Number of IPC messages received              : 0
 Number of signals delivered                  : 0
 Number of voluntary context switches         : 10
 Number of involuntary context switches       : 28960
 
 #######  End of Resource Statistics  ########
 
 Instrumented section: 1 - Label: initialize matrices - process: 0
 file: ex4.f, lines: 17 >--< 25
  Count: 1
  Wall Clock Time: 0.739748 seconds
  Total time in user mode: 0.582817780163786 seconds
 
  PM_CYC (Cycles)                            :       218591825
  PM_INST_CMPL (Instructions completed)      :        44007010
  PM_TLB_MISS (TLB misses)                   :         3006426
  PM_ST_CMPL (Stores completed)              :         6002003
  PM_LD_CMPL (Loads completed)               :        12001002
  PM_FPU0_CMPL (FPU 0 instructions)          :         4000023
  PM_FPU1_CMPL (FPU 1 instructions)          :          999988
  PM_EXEC_FMA (FMAs executed)                :               0
 
  Utilization rate                           :          78.786 %
  Avg number of loads per TLB miss           :           3.992
  Load and store operations                  :          18.003 M
  Instructions per load/store                :           2.444
  MIPS                                       :          59.489
  Instructions per cycle                     :           0.201
  HW Float points instructions per Cycle     :           0.023
  Floating point instructions + FMAs         :           5.000 M
  Float point instructions + FMA rate        :           6.759 Mflip/s
  FMA percentage                             :           0.000 %
  Computation intensity                      :           0.278
 
 
 Instrumented section: 2 - Label: matrix-matrix multiply - process: 0
 file: ex4.f, lines: 28 <--> 37
  Count: 1
  Wall Clock Time: 288.821541 seconds
  Total time in user mode: 287.304251770298 seconds
 
  PM_CYC (Cycles)                            :    107756496862
  PM_INST_CMPL (Instructions completed)      :     29007007012
  PM_TLB_MISS (TLB misses)                   :      2001221941
  PM_ST_CMPL (Stores completed)              :      2002002004
  PM_LD_CMPL (Loads completed)               :      8001001002
  PM_FPU0_CMPL (FPU 0 instructions)          :      1000016230
  PM_FPU1_CMPL (FPU 1 instructions)          :               0
  PM_EXEC_FMA (FMAs executed)                :      1000016230
 
  Utilization rate                           :          99.475 %
  Avg number of loads per TLB miss           :           3.998
  Load and store operations                  :       10003.003 M
  Instructions per load/store                :           2.900
  MIPS                                       :         100.432
  Instructions per cycle                     :           0.269
  HW Float points instructions per Cycle     :           0.009
  Floating point instructions + FMAs         :        2000.032 M
  Float point instructions + FMA rate        :           6.925 Mflip/s
  FMA percentage                             :         100.000 %
  Computation intensity                      :           0.200
 
 
 Instrumented section: 3 - Label: final output - process: 0
 file: ex4.f, lines: 39 <--> 41
  Count: 1
  Wall Clock Time: 0.001785 seconds
  Total time in user mode: 0.000283860815953749 seconds
 
  PM_CYC (Cycles)                            :          106320
  PM_INST_CMPL (Instructions completed)      :           25864
  PM_TLB_MISS (TLB misses)                   :             116
  PM_ST_CMPL (Stores completed)              :            6008
  PM_LD_CMPL (Loads completed)               :            4985
  PM_FPU0_CMPL (FPU 0 instructions)          :             345
  PM_FPU1_CMPL (FPU 1 instructions)          :              85
  PM_EXEC_FMA (FMAs executed)                :              75
 
  Utilization rate                           :          15.881 %
  Avg number of loads per TLB miss           :          42.974
  Load and store operations                  :           0.011 M
  Instructions per load/store                :           2.353
  MIPS                                       :          14.490
  Instructions per cycle                     :           0.243
  HW Float points instructions per Cycle     :           0.004
  Floating point instructions + FMAs         :           0.001 M
  Float point instructions + FMA rate        :           0.283 Mflip/s
  FMA percentage                             :          29.703 %
  Computation intensity                      :           0.046
 

The hpmviz tool is an X-Windows application that provides a graphical user interface (GUI) to examine the separate sections of the code. The tool is started with the command

% hpmviz

After hpmviz has started, the various .viz files can be opened with the pull-down File menu. A sample screen shot, after loading the three .viz files for this example, is shown by following the link below.

hpmviz screen shot

The left side of the screen contains the instrumented sections of the code and timing information. The right side contains the source code. Tabs at the top of each half-screen window control the file being displayed.

Left-mouse-button clicking on the name of an instrumented section on the left will highlight the corresponding source code section on the right. Right-mouse-button clicking on the name of an instrumented section on the left will open a new window with detailed HPM information.

hpmviz screen shot

The hpmviz tool is particularly useful for multiprocessor computations with multiple instrumented sections.

Example 5: Parallel Library Function (PESSL)

Here is an example which uses the IBM Parallel Engineering and Scientific Subroutine Library (PESSL) function for the matrix-matrix multiply. The function name is PDGEMM. The code is provided in

Running this code under IPM for parallel HPM produces the following output.

Although the parallel code is slower than the serial code for a problem of this size (1000 by 1000 matrices), due in part to communication overhead, for larger problems (such as 10,000 by 10,000 matrices) which cannot fit on a single node, performance approaches 800 Mflip/s/processor for parallel computations.

Performance values are comparable whether the IBM PESSL library or the NERSC/ScaLAPACK library is used. Instructions for compiling and linking the codes using either library set are included in the source code files above.


Page last modified: Tue, 22 Apr 2008 18:24:57 GMT
Page URL: http://www.nersc.gov/nusers/systems/SP/old_stuff/hpmcount/
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
