IBM SP Parallel Scaling Overview - SMP Scaling

Nighthawk II node

Seaborg's compute resources consist of 416 compute nodes, each with 16 CPUs. Each node is capable of performing at most 24 GFLOP/s. All the nodes are IBM 375 MHz Nighthawk II nodes with the following overall CPU and memory specification.
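The 24 GFLOP/s node peak can be checked against the clock rate and CPU count quoted above. The sketch below assumes each CPU retires 4 floating-point operations per cycle (two fused multiply-add units); that figure is an assumption that is not stated in this report, and the function and constant names are illustrative only.

```python
# Peak floating-point rate from the figures quoted in the text.
NODES = 416            # compute nodes (from the text)
CPUS_PER_NODE = 16     # CPUs per node (from the text)
CLOCK_HZ = 375e6       # 375 MHz clock (from the text)
FLOPS_PER_CYCLE = 4    # ASSUMPTION: two fused multiply-add units per CPU


def peak_gflops(cpus, clock_hz=CLOCK_HZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in GFLOP/s for `cpus` CPUs."""
    return cpus * clock_hz * flops_per_cycle / 1e9


cpu_peak = peak_gflops(1)
node_peak = peak_gflops(CPUS_PER_NODE)
system_peak = peak_gflops(NODES * CPUS_PER_NODE)

print(f"per CPU:  {cpu_peak:.1f} GFLOP/s")
print(f"per node: {node_peak:.1f} GFLOP/s")
print(f"system:   {system_peak / 1000:.2f} TFLOP/s")
```

Under that assumption the per-node value reproduces the 24 GFLOP/s stated above. These are theoretical peaks, not sustained rates.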
Each CPU has its own separate caches, so there is minimal resource sharing or cache conflict between CPUs on a POWER3 SMP node. This separation provides an important simplification to the developer of parallel codes. Since each CPU's filling and invalidation of cache affects only code running on that CPU, there is less contention at this lowest level than on machines where low-level resources are highly shared. Application programmers need not partition memory or cache access patterns along processor card or multi-chip module (MCM) boundaries, as cache memory affinity is not an issue.
Conversely, for main memory there is no notion of local memory. All CPUs within a node access main memory over a uniform crossbar switch. While the possibility of contention over this switch is real (and will be treated below), there is no need for the application programmer to keep track of which parts of main memory are local to the CPU.
As main memory is shared on an SMP, contention may occur. The peak main memory bandwidth is 15.6 GB/s based on the crossbar memory subsystem detailed above.
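As a rough illustration of what the 15.6 GB/s peak implies, the sketch below divides it evenly among the tasks on a node and compares it with the 24 GFLOP/s node peak. Both inputs come from the text; the even-split model and the function names are assumptions made for illustration and give upper bounds, not measured values.

```python
# Upper-bound accounting for shared main memory bandwidth on one node.
PEAK_NODE_BW_GBS = 15.6    # peak main memory bandwidth per node (from the text)
PEAK_NODE_GFLOPS = 24.0    # peak floating-point rate per node (from the text)
CPUS_PER_NODE = 16         # CPUs per node (from the text)


def even_share_bandwidth(tasks, node_bw=PEAK_NODE_BW_GBS):
    """GB/s per task if the node peak were split evenly (an idealization)."""
    if not 1 <= tasks <= CPUS_PER_NODE:
        raise ValueError("tasks must be between 1 and CPUS_PER_NODE")
    return node_bw / tasks


def machine_balance(node_bw=PEAK_NODE_BW_GBS, node_gflops=PEAK_NODE_GFLOPS):
    """Bytes of main memory traffic available per peak floating-point operation."""
    return node_bw / node_gflops


print(f"even share at 16 tasks: {even_share_bandwidth(16):.3f} GB/s per task")
print(f"machine balance:        {machine_balance():.2f} bytes/flop")
```

A kernel that needs more bytes per flop from main memory than the machine balance cannot reach peak floating-point rate on a fully loaded node, whatever the task count.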
The full bandwidth is not available to a single task; multiple tasks are required to saturate the main memory bandwidth. A more detailed understanding of memory contention on the Nighthawk II node can be arrived at by considering how the performance of N memory-intensive tasks scales as N increases.
Figure: Scaling of SMP memory contention. Source: xtream memory profiling tool (concurrency through MPI).
Two aspects of how memory contention scales with concurrency in parallel applications are demonstrated above:
As the SMP is loaded with more processes, the main memory bandwidth available to each individual task decreases. This is summarized below for the daxpy-like Triad microkernel (a(i) = b(i) + s*c(i)). As the number of tasks on the node increases, the main memory bandwidth per task, shown here as a percentage, decreases.
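The Triad kernel and the percentage normalization described above can be sketched as follows. This is a minimal illustration, not the xtream tool: the function names are hypothetical, the traffic count assumes 8-byte doubles with two loads and one store per element, and pure Python is far too slow to measure hardware bandwidth, so only the arithmetic is shown.

```python
from array import array

BYTES_PER_DOUBLE = 8  # ASSUMPTION: 8-byte double-precision elements


def triad(b, c, s):
    """Triad microkernel: a(i) = b(i) + s*c(i)."""
    if len(b) != len(c):
        raise ValueError("b and c must have the same length")
    return array("d", (bi + s * ci for bi, ci in zip(b, c)))


def triad_bytes_moved(n):
    """Main memory traffic for n elements: load b, load c, store a."""
    return 3 * BYTES_PER_DOUBLE * n


def percent_of_single_task(per_task_bw):
    """Normalize per-task bandwidth to the single-task value.

    per_task_bw maps task count -> measured per-task bandwidth (any unit)
    and must contain an entry for 1 task.
    """
    base = per_task_bw[1]
    return {n: 100.0 * bw / base for n, bw in sorted(per_task_bw.items())}


a = triad(array("d", [1.0, 2.0]), array("d", [10.0, 20.0]), 0.5)
print(list(a), triad_bytes_moved(len(a)))
```

With real measurements, dividing `triad_bytes_moved(n)` by the elapsed time of each task gives the per-task bandwidth that `percent_of_single_task` then normalizes.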
That tasks in a cache-based SMP compete for main memory access is certainly no surprise, and in practice the situation is not as bad as it may seem. The example above is roughly a worst-case scenario for memory contention, where n memory-bandwidth-bound processes contend for access to main memory. Applications that are not strictly memory bound, that show more varied memory accesses, or that have greater memory reuse should in practice show less contention.
Developers of scientific applications should realize that the memory contention described above may also, to varying degrees, affect their applications. Applications or algorithms that are particularly starved for memory bandwidth may benefit from decreasing the number of tasks run per node. This could yield a faster time to solution, at the cost of a smaller maximum FLOP/s and possibly a decreased percentage of peak performance. This approach should be taken cautiously, as in many cases it leads to less efficient utilization of the nodes in a batch job.
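The cost side of that trade-off is simple accounting, sketched below for a fixed node allocation. It assumes one single-threaded task per CPU and uses the 24 GFLOP/s and 16 CPUs per node quoted earlier; the function name and the returned fields are hypothetical. Whether time to solution actually improves has to be measured for each application.

```python
CPUS_PER_NODE = 16                                  # from the text
PEAK_GFLOPS_PER_CPU = 24.0 / CPUS_PER_NODE          # 24 GFLOP/s per node (from the text)


def underpopulated_job(nodes, tasks_per_node):
    """Accounting for a job that runs fewer than 16 tasks on each node.

    ASSUMPTION: one single-threaded task per CPU, so unused CPUs sit idle.
    """
    if nodes < 1 or not 1 <= tasks_per_node <= CPUS_PER_NODE:
        raise ValueError("need nodes >= 1 and 1 <= tasks_per_node <= 16")
    tasks = nodes * tasks_per_node
    return {
        "tasks": tasks,
        "idle_cpus": nodes * CPUS_PER_NODE - tasks,
        "usable_peak_gflops": tasks * PEAK_GFLOPS_PER_CPU,
        "fraction_of_allocated_peak": tasks_per_node / CPUS_PER_NODE,
    }


for k in (16, 8, 4):
    print(k, underpopulated_job(nodes=4, tasks_per_node=k))
```

Halving the tasks per node halves the usable peak of the allocation, so the per-task speedup from reduced contention must outweigh the idle CPUs for the change to pay off.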
MPI messages sent inside the SMP through shared memory are detailed here.
Page last modified: Mon, 24 May 2004 04:34:37 GMT. Page URL: http://www.nersc.gov/news/reports/technical/seaborg_scaling/smp.php. Web contact: webmaster@nersc.gov. Computing questions: consult@nersc.gov.