number_bands_fft

 

This parameter controls the number of bands that are transformed simultaneously in a Fourier transform. So far, we have only seen a benefit for runs using multiple SMP nodes in a distributed environment (specifically the IBM SP at NERSC). The problem lies in the latency of communication within the FFT. When multiple bands are transformed at the same time, the data packets become larger, which reduces the relative cost of latency. The drawback of larger data is that it may no longer reside in cache, so the non-communication parts of the code may run slower. One has to find the optimal number of bands for the total number of plane waves and the total number of nodes (16 processors per node on the IBM SP at NERSC). Below we give some scaling numbers for a 128 GaAs FCC unit cell.
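The latency argument above can be illustrated with a toy cost model. This is only a sketch: the latency and bandwidth constants, the function names, and the message sizes are hypothetical and are not measurements from the IBM SP or values taken from the code.

```python
# Toy latency/bandwidth model of the communication inside the FFT.
# LATENCY_S and BANDWIDTH_BPS are assumed values, not measurements.
LATENCY_S = 20e-6        # assumed startup cost per message (seconds)
BANDWIDTH_BPS = 300e6    # assumed sustained bandwidth (bytes/second)


def comm_time(total_bands, bands_per_fft, bytes_per_band):
    """Estimated time to communicate the data of total_bands bands when
    bands_per_fft bands are packed into each message."""
    n_messages = -(-total_bands // bands_per_fft)  # ceiling division
    payload = total_bands * bytes_per_band         # total bytes do not change
    return n_messages * LATENCY_S + payload / BANDWIDTH_BPS


def latency_fraction(total_bands, bands_per_fft, bytes_per_band):
    """Fraction of the estimated communication time spent on latency."""
    n_messages = -(-total_bands // bands_per_fft)
    total = comm_time(total_bands, bands_per_fft, bytes_per_band)
    return n_messages * LATENCY_S / total


if __name__ == "__main__":
    for bands_per_fft in (1, 4, 16, 64):
        frac = latency_fraction(64, bands_per_fft, 1000)
        print(bands_per_fft, round(frac, 3))
```

In this model the total payload is fixed, so packing more bands per transform only reduces the number of messages and therefore the latency term. The model ignores the cache effect described above, which works in the opposite direction.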

 

 

 

For 1 band (Fig. 1), it is quite evident that the communication gives very poor scaling.

 

Fig. 1 – number_bands_fft =1

 

The non-communication time decreases roughly in proportion to the number of nodes (16 processors per node), while the communication time increases dramatically.

 

When using 4 bands, one sees a larger time for 1 node. This is due to the aforementioned cache effects. These cache effects also explain the better than perfect scaling from 1 node to 2 nodes for the non-communication time (48.4 to 19.5). The communication time stays about constant from 1 node to 2 nodes, but again increases dramatically from 2 to 4 nodes. The packets are again too small, and latency is causing problems. The flat communication time from 1 node to 2 nodes prevents perfect scaling of the total time from 1 to 2 nodes.
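The "better than perfect" scaling can be checked with a short calculation. The two timings are the non-communication times quoted above; the helper names are illustrative only.

```python
def speedup(t_base, t_new):
    """Ratio of the baseline time to the new time."""
    return t_base / t_new


def parallel_efficiency(t_base, n_base, t_new, n_new):
    """Speedup divided by the increase in node count (1.0 = perfect scaling)."""
    return speedup(t_base, t_new) / (n_new / n_base)


# Non-communication time with number_bands_fft = 4: 48.4 on 1 node, 19.5 on 2 nodes.
s = speedup(48.4, 19.5)
e = parallel_efficiency(48.4, 1, 19.5, 2)
print(f"speedup {s:.2f}, efficiency {e:.2f}")  # prints "speedup 2.48, efficiency 1.24"
```

An efficiency above 1.0 means doubling the nodes more than halved the time, which is consistent with the per-node data fitting in cache better on 2 nodes.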

 

 

Fig. 2 – number_bands_fft =4

 

For 16 bands, the communication time becomes large only at 8 nodes. Again, the time for 1 node is larger than for runs using fewer bands. The flat communication time prevents proper scaling from 2 to 4 nodes.

 

 

 

Fig. 3 – number_bands_fft =16

 

For 64 bands, we see similar behavior. The total time always decreases with more nodes, but the decrease becomes smaller and smaller.

 

 

Fig. 4 – number_bands_fft =64

 

In Fig. 5, we present the timings for the best setting of number_bands_fft at each number of nodes. For a cutoff of 25 Ryd, the optimal number_bands_fft increases by a factor of 4 as the number of nodes increases by a factor of 2.

 

 

 

Fig. 5 – 1 node (1 band), 2 nodes (4 bands), 4 nodes (16 bands), 8 nodes (64 bands)

# of plane waves 41,302 (25 Ryd cutoff)
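The factor-of-4 growth per doubling of nodes means that, for these four data points, the best number_bands_fft equals the square of the node count. The check below uses only the values listed for Fig. 5. Treating this as a rule beyond these points or beyond the 25 Ryd cutoff is an assumption, not something the timings establish.

```python
# Best number_bands_fft per node count at 25 Ryd (41,302 plane waves), from Fig. 5.
best_bands_25ryd = {1: 1, 2: 4, 4: 16, 8: 64}

# Each doubling of the node count multiplies the best setting by 4 ...
nodes = sorted(best_bands_25ryd)
growth = [best_bands_25ryd[b] // best_bands_25ryd[a]
          for a, b in zip(nodes, nodes[1:])]

# ... which for these points is the same as bands == nodes squared.
matches_square = all(bands == n ** 2 for n, bands in best_bands_25ryd.items())

print(growth, matches_square)
```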

 

 

Using a cutoff of 15 Ryd, the optimal values change. These results should give the user a guide to choosing number_bands_fft for a smaller number of plane waves. This system does not scale as well because of the smaller data set.

 

 

 

Fig. 6 – 1 node (4 bands), 2 nodes (16 bands), 4 nodes (32 bands), 8 nodes (128 bands)

# of plane waves 19,266 (15 Ryd cutoff).
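The two sets of best settings can be collected into a small lookup to pick a starting value for a new run. This is a sketch only: the function name is hypothetical, and choosing the tabulated case with the closest plane-wave count is an assumption made here for illustration. The result is a first guess to be refined by timing, and it applies only to the machine and system measured above.

```python
# Best number_bands_fft by plane-wave count and node count (16 processors per node).
BEST_BANDS = {
    41302: {1: 1, 2: 4, 4: 16, 8: 64},    # 25 Ryd cutoff, Fig. 5
    19266: {1: 4, 2: 16, 4: 32, 8: 128},  # 15 Ryd cutoff, Fig. 6
}


def suggest_number_bands_fft(n_plane_waves, n_nodes):
    """Return the tabulated best setting for the closest plane-wave count.

    Hypothetical helper: nearest-entry selection is an assumption,
    not a rule derived from the timings.
    """
    nearest = min(BEST_BANDS, key=lambda n: abs(n - n_plane_waves))
    if n_nodes not in BEST_BANDS[nearest]:
        raise ValueError("only 1, 2, 4 and 8 nodes were measured")
    return BEST_BANDS[nearest][n_nodes]


print(suggest_number_bands_fft(20000, 4))  # prints 32
```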