National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory

IBM SP Parallel Scaling Overview - SMP Scaling


Nighthawk II node

Seaborg's compute resources consist of 416 compute nodes, each with 16 CPUs. Each node can perform at most 24 GFLOP/s. All nodes are IBM 375 MHz Nighthawk II nodes with the following CPU and memory specifications.

Processor
  Processor class              POWER_630
  Clock frequency              374.7 MHz
  Floating Point Units         2 (*,+,FMA)
  Peak GFLOP/s                 1.5
  Real Registers               40
  Virtual Registers            64
  L1 Inst Cache Size           32 KB
  L1 Data Cache Size           64 KB
  L1 Data Cache Line Size      128 B
  L1 Cache Associativity       4 way by line
  L1 latency / bandwidth       5 nsec / 3.2 GB/sec
  L2 Cache Size                8192 KB
  L2 Cache Associativity       4
  L2 latency / bandwidth       45 nsec / 6.4 GB/sec

Memory
  Memory topology              crossbar
  Memory format                SDRAM DIMMs
  Memory banking               4 banks / DRAM
  Peak Memory BW               16 GB/sec
  Memory bus speed             187 MHz (2:1)
  Page Size                    4 KB
  TLB size                     128x2 Pages
  TLB miss penalty             25-125 cycles
  L2->L1 Prefetch Registers    10
  L2->L1 Prefetch Streams      4
  L1 -> registers              2 Word/cycle Load
  L1 <-> registers             1 Word/cycle Load/Store
  L2 -> L1                     1.3 Word/cycle
  Memory -> L1/L2              1 Word/cycle

Instruction            #Cycles 32 bit    #Cycles 64 bit
Integer Multiply       3-4               3-9
Integer Divide         21                37
FP Multiply or Add     3-4               3-4
FP Multiply-Add        3-4               3-4
FP Divide              14-21             18-25
FP Square Root         14-23             22-31
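
The peak figures in the table follow from simple arithmetic, sketched below in Python. The sketch assumes the nominal 375 MHz clock (the table lists 374.7 MHz) and counts a fused multiply-add as two floating-point operations; the constant and function names are illustrative only.

```python
# Peak floating-point rate derived from the specification table.
# Assumptions: nominal 375 MHz clock; one FMA counts as two flops.

CLOCK_HZ = 375e6        # nominal clock frequency
FPUS_PER_CPU = 2        # floating point units per CPU (from the table)
FLOPS_PER_FMA = 2       # a fused multiply-add is one multiply plus one add
CPUS_PER_NODE = 16      # CPUs per Nighthawk II node


def peak_gflops_per_cpu():
    """Peak GFLOP/s of one CPU, assuming each FPU completes one FMA per cycle."""
    return CLOCK_HZ * FPUS_PER_CPU * FLOPS_PER_FMA / 1e9


def peak_gflops_per_node():
    """Peak GFLOP/s of one 16-CPU node."""
    return peak_gflops_per_cpu() * CPUS_PER_NODE


if __name__ == "__main__":
    print(f"per CPU:  {peak_gflops_per_cpu():.1f} GFLOP/s")
    print(f"per node: {peak_gflops_per_node():.1f} GFLOP/s")
```

Under these assumptions the sketch reproduces the 1.5 GFLOP/s per CPU listed in the table and the 24 GFLOP/s per node quoted above.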

Each CPU has its own separate caches, so there is minimal resource sharing or cache conflict between CPUs on a POWER3 SMP node. This separation is an important simplification for developers of parallel codes. Since each CPU's filling and invalidation of cache affects only code running on that CPU, there is less contention at this lowest level than on machines where low-level resources are highly shared. Application programmers need not partition memory or cache access patterns along processor card or multi-chip module (MCM) boundaries, as cache memory affinity is not an issue.

Main memory, conversely, is fully shared, with no notion of local memory. All CPUs within a node access main memory through a uniform crossbar switch. While contention over this switch is a real possibility (and is treated below), the application programmer does not need to keep track of which parts of main memory are local to a given CPU.

Source: RS/6000 SP 375MHz POWER3 SMP High Node Overview

Memory Contention

Because main memory is shared on an SMP, contention may occur. The peak main memory bandwidth is 15.6 GB/s (rounded to 16 GB/sec in the specification table above), based on the crossbar memory subsystem detailed above.
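
To put the peak bandwidth in perspective, the following Python sketch relates it to the node's 16 CPUs and to the 24 GFLOP/s per-node peak quoted earlier. The even 16-way split is only a simple upper bound on what each task could see when all CPUs stream from memory at once, not a measurement, and the names are illustrative.

```python
# Back-of-the-envelope ratios from figures quoted in the text.
# The even-split assumption is an idealization, not a measured result.

PEAK_MEMORY_BW_GB_S = 15.6   # peak main memory bandwidth per node
PEAK_NODE_GFLOPS = 24.0      # peak floating-point rate per node
CPUS_PER_NODE = 16


def even_share_gb_s(tasks=CPUS_PER_NODE):
    """Bandwidth per task if the node peak were split evenly among tasks."""
    if tasks < 1:
        raise ValueError("tasks must be at least 1")
    return PEAK_MEMORY_BW_GB_S / tasks


def bytes_per_flop():
    """Peak memory bytes available per peak floating-point operation."""
    return PEAK_MEMORY_BW_GB_S / PEAK_NODE_GFLOPS


if __name__ == "__main__":
    print(f"even share with 16 tasks: {even_share_gb_s():.3f} GB/s")
    print(f"machine balance: {bytes_per_flop():.2f} bytes/flop")
```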

The full bandwidth is not available to a single task; multiple tasks are required to saturate the main memory bandwidth. A more detailed understanding of memory contention on the Nighthawk II node can be arrived at by considering how the performance of N memory-intensive tasks running concurrently on a node varies with N.

[Figure: Scaling of SMP memory contention. Source: xtream memory profiling tool (concurrency through MPI).]

Two aspects of how memory contention scales with concurrency in parallel applications are demonstrated above:

As the SMP is loaded with more processes, the main memory bandwidth available to each individual task decreases. For the DAXPY-like Triad microkernel (a(i) = b(i) + s*c(i)), the main memory bandwidth per task, expressed as a percentage, falls as the number of tasks on the node increases.
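
For reference, the Triad microkernel can be written in a few lines. The Python sketch below is only illustrative: it is not the xtream tool, an interpreted loop will not approach hardware bandwidth, and the 24 bytes per iteration assumes 8-byte reals with two loads and one store, ignoring any extra traffic from write allocation.

```python
import time
from array import array

# Assumed traffic per iteration: load b(i), load c(i), store a(i), 8 bytes each.
BYTES_PER_ITERATION = 3 * 8


def triad(a, b, c, s):
    """In-place Triad: a(i) = b(i) + s*c(i)."""
    for i in range(len(a)):
        a[i] = b[i] + s * c[i]


def measure_triad_gb_s(n=1_000_000, s=3.0):
    """Time one Triad sweep over n elements and return the apparent GB/s."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, s)
    elapsed = time.perf_counter() - start
    return n * BYTES_PER_ITERATION / elapsed / 1e9


if __name__ == "__main__":
    print(f"apparent Triad bandwidth: {measure_triad_gb_s():.3f} GB/s")
```

Contention only becomes visible when several copies of such a kernel run concurrently on the same node, one per CPU, which is what the MPI-based measurements referred to above do.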

That tasks in a cache-based SMP compete for main memory access is certainly no surprise, and in practice the situation is not as bad as it may seem. The example above is roughly a worst-case scenario for memory contention, in which N memory-bandwidth-bound processes contend for access to main memory. Applications that are not strictly memory bound, that have more varied memory access patterns, or that reuse memory more heavily should in practice show less contention.

Developers of scientific applications should realize that memory contention may also affect their applications to varying degrees. Applications or algorithms that are particularly starved for memory bandwidth may benefit from running fewer tasks per node. This could yield a faster time to solution, at the cost of a smaller maximum FLOP/s and possibly a lower percentage of peak performance. This approach should be taken cautiously, as in many cases it leads to less efficient utilization of the nodes in a batch job.
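
The trade-off can be illustrated with a toy model, sketched below in Python. Everything in it is an assumption rather than a Seaborg measurement: aggregate node bandwidth is taken to grow linearly with the number of tasks until it saturates at the 15.6 GB/s peak, the single-task bandwidth of 2.0 GB/s is a hypothetical placeholder, and the job is assumed to be purely memory bandwidth bound with a fixed amount of data per task.

```python
import math

PEAK_NODE_BW_GB_S = 15.6     # peak main memory bandwidth per node (from the text)
SINGLE_TASK_BW_GB_S = 2.0    # HYPOTHETICAL single-task bandwidth, not measured
CPUS_PER_NODE = 16


def per_task_bw_gb_s(tasks_per_node):
    """Bandwidth each task sees under the assumed saturation model."""
    if not 1 <= tasks_per_node <= CPUS_PER_NODE:
        raise ValueError("tasks_per_node must be between 1 and 16")
    aggregate = min(tasks_per_node * SINGLE_TASK_BW_GB_S, PEAK_NODE_BW_GB_S)
    return aggregate / tasks_per_node


def job_cost(total_tasks, tasks_per_node, gb_per_task):
    """Return (wall-clock seconds, node-seconds) for a memory bound job."""
    nodes = math.ceil(total_tasks / tasks_per_node)
    seconds = gb_per_task / per_task_bw_gb_s(tasks_per_node)
    return seconds, nodes * seconds


if __name__ == "__main__":
    for k in (16, 8, 4):
        seconds, node_seconds = job_cost(total_tasks=64, tasks_per_node=k,
                                         gb_per_task=10.0)
        print(f"{k:2d} tasks/node: {seconds:6.2f} s wall clock, "
              f"{node_seconds:7.2f} node-seconds")
```

In this model, reducing the task count helps only while the node is saturated; once it is not, removing further tasks buys little additional speed while the number of nodes required keeps growing. Real applications, which are rarely purely bandwidth bound, will generally see a smaller benefit.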

MPI messages sent inside the SMP through shared memory are detailed separately.



Page last modified: Mon, 24 May 2004 04:34:37 GMT
Page URL: http://www.nersc.gov/news/reports/technical/seaborg_scaling/smp.php
