Hardware Counter Information
Thanks to Scientific Supercomputing Center Karlsruhe for
some of the following information.
The following is a description of the output from the hardware counters:
- PM_CYC (Cycles)
- The number of machine cycles used by the program. The CPU
time used is this number divided by the clock frequency (375 MHz).
- PM_INST_CMPL (Instructions completed)
- The number of instructions that were executed by the program.
- PM_TLB_MISS (TLB misses)
- A running program uses virtual memory addresses.
To access physical memory, the
system must map these virtual addresses to physical memory addresses
during program execution.
This mapping is done in units of 4 kB pages, and the system
keeps the mappings of recently used pages in fast memory, the
Translation Lookaside Buffer (TLB). When the program accesses
memory whose page is mapped in the TLB, the translation is very fast.
When it accesses memory whose page is not in the TLB, a
TLB miss occurs and the address translation takes many cycles.
This slows down memory access significantly.
- PM_ST_CMPL (Stores completed)
- The number of store instructions which move data from a register
to memory.
- PM_LD_CMPL (Loads completed)
- The number of load instructions which copy data into a register.
- PM_FPU0_CMPL (FPU 0 instructions)
- The POWER3 processor has two Floating Point Units (FPU) which
operate in parallel. Each FPU can start a new instruction at every
cycle. This is the number of floating point instructions
(add, multiply, subtract, divide, multiply+add) that have been
executed by the first FPU.
- PM_FPU1_CMPL (FPU 1 instructions)
- This is the number of floating point instructions
(add, multiply, subtract, divide, multiply+add) that have been
executed by the second FPU.
- PM_EXEC_FMA (FMAs executed)
- The POWER3 can execute a computation of the form x=s*a+b
with one instruction. This is known as a Floating point Multiply &
Add (FMA) and occurs commonly in many codes, particularly those
that perform matrix operations. The compiler will generate FMA instructions
as often as possible. This counter shows the number of FMAs executed by
the program.
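As a rough illustration of two of the figures quoted above (the 375 MHz clock and the 4 kB page size), the following Python sketch converts a PM_CYC reading into CPU time and counts how many pages a contiguous array covers. The function names are hypothetical and not part of hpmcount, and the page count assumes the array starts on a page boundary.

```python
# Sketch only: these helper names are hypothetical, not part of hpmcount.

CLOCK_HZ = 375e6      # POWER3 clock frequency quoted above
PAGE_BYTES = 4096     # 4 kB page size quoted above


def cpu_time_seconds(pm_cyc, clock_hz=CLOCK_HZ):
    """CPU time implied by a PM_CYC reading: cycles divided by clock frequency."""
    return pm_cyc / clock_hz


def pages_spanned(n_elements, element_bytes=8, page_bytes=PAGE_BYTES):
    """Pages covered by a contiguous array, assuming it starts on a page boundary."""
    total_bytes = n_elements * element_bytes
    return -(-total_bytes // page_bytes)  # integer ceiling division


if __name__ == "__main__":
    print(cpu_time_seconds(750_000_000))  # 750 million cycles at 375 MHz
    print(pages_spanned(1_000_000))       # one million 8-byte elements
```

Each page covered needs its own address translation, so the page count gives a feel for how many TLB entries a single sweep over the array would use.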
Various derived quantities are reported by hpmcount:
- Utilization Rate
- The ratio of CPU time (see PM_CYC above) to wall clock time.
For a task on a dedicated compute node this ratio will be
extremely close to 1.
- Average number of loads per TLB miss
- The ratio PM_LD_CMPL/PM_TLB_MISS. Each time a TLB miss occurs,
a new page mapping is brought into the buffer. Each page is 4 kB and
holds 512 data elements of size 8 bytes. So an average number of loads
per TLB miss in the 500 range indicates that each data element is being
accessed once (on average).
Higher values indicate that data is being reused while
its page is still mapped in the TLB. A small value indicates that needed
data is stored in widely separated places in memory and a redesign of data
structures may help performance significantly.
- Load and store operations
- The sum of PM_ST_CMPL and PM_LD_CMPL.
- Instructions per load/store
- A low value of this metric indicates that the code is dominated
by data movement operations (load/store) rather than computation.
A value of 2 means that half the instructions are used for moving
data. By reusing data in registers a larger value can be reached.
- MIPS
- The average number of instructions completed per second, in millions.
The POWER3 can execute several instructions in parallel.
- Instructions per cycle
- The average number of instructions completed per clock cycle.
Well tuned programs may reach more than 2 instructions per cycle.
- Float point instructions + FMAs
- The number of floating point operations performed by the code.
The number of FMAs is added to the number of floating point instructions
since one FMA instruction performs two floating point operations.
- Float point instructions + FMA rate
- This is the MFlops rate (millions of floating point operations per
second). The POWER3 has a peak performance of 1500 MFlops (2 FPUs each
executing one FMA, i.e. two floating point operations, per cycle with a
clock frequency of 375 MHz).
- FMA percentage
- The percentage of floating point instructions that are FMA calculations
of the form x=s*a+b.
- Computation intensity
- The ratio of floating point operations to the total number of loads and stores.
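
The derived quantities above can be sketched in Python from the raw counter values. This is a minimal sketch under stated assumptions, not the formulas hpmcount itself uses: the function name and the sample numbers are hypothetical, the floating point instruction count is assumed to be PM_FPU0_CMPL + PM_FPU1_CMPL, the rates are assumed to be taken over wall clock time, and the FMA percentage follows the wording of the description above.

```python
# Sketch only: derived_metrics is a hypothetical helper, not part of hpmcount.

CLOCK_HZ = 375e6  # POWER3 clock frequency quoted above

# Peak rate: 2 FPUs x 2 operations per FMA x 375 MHz
PEAK_MFLOPS = 2 * 2 * 375


def derived_metrics(c, wall_clock_s, clock_hz=CLOCK_HZ):
    """Compute the derived quantities from a dict of raw counter values."""
    cpu_time = c["PM_CYC"] / clock_hz
    loads_stores = c["PM_LD_CMPL"] + c["PM_ST_CMPL"]
    # Assumption: floating point instructions = sum of both FPU counters.
    fp_instr = c["PM_FPU0_CMPL"] + c["PM_FPU1_CMPL"]
    # Each FMA counts once more because it performs two operations.
    flops = fp_instr + c["PM_EXEC_FMA"]
    return {
        "utilization_rate": cpu_time / wall_clock_s,
        "loads_per_tlb_miss": c["PM_LD_CMPL"] / c["PM_TLB_MISS"],
        "load_store_ops": loads_stores,
        "instr_per_load_store": c["PM_INST_CMPL"] / loads_stores,
        # Assumption: rates are averaged over wall clock time.
        "mips": c["PM_INST_CMPL"] / wall_clock_s / 1e6,
        "instr_per_cycle": c["PM_INST_CMPL"] / c["PM_CYC"],
        "fp_ops": flops,
        "mflops": flops / wall_clock_s / 1e6,
        # Assumption: FMAs as a share of floating point instructions.
        "fma_percentage": 100 * c["PM_EXEC_FMA"] / fp_instr,
        "computation_intensity": flops / loads_stores,
    }


# Made-up counter values, chosen only to give round results.
sample = {
    "PM_CYC": 750_000_000,
    "PM_INST_CMPL": 1_500_000_000,
    "PM_TLB_MISS": 800_000,
    "PM_ST_CMPL": 100_000_000,
    "PM_LD_CMPL": 400_000_000,
    "PM_FPU0_CMPL": 300_000_000,
    "PM_FPU1_CMPL": 200_000_000,
    "PM_EXEC_FMA": 250_000_000,
}

if __name__ == "__main__":
    for name, value in derived_metrics(sample, wall_clock_s=2.0).items():
        print(f"{name:24s} {value}")
```

With these made-up values the code averages 500 loads per TLB miss, which by the reasoning above corresponds to touching each 8-byte element about once, and its MFlops rate can be compared directly against PEAK_MFLOPS.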