The logically shared, distributed memory access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallelized scalable programs.
The SHMEM data-passing library routines are similar to the message passing interface (MPI) library routines: they pass data between cooperating parallel processes. The SHMEM data-passing routines can be used in programs that perform computations in separate address spaces and that explicitly pass data to and from different processing elements (PEs) in the program.
The SHMEM parallel programming model assumes an MPI-1-like group of processes that runs in parallel from job launch to job termination. No processes can be added to or removed from this group, and all processes execute the same application. Thus, SHMEM applications are of the SPMD (Single Program Multiple Data) type. However, a SHMEM application can be part of a larger MPMD (Multiple Program Multiple Data) type MPI job. SHMEM is a one-sided message passing model in which memory is private to each process.
The SHMEM routines minimize the overhead associated with data passing requests, maximize bandwidth, and minimize data latency. Data latency is the length of time between a PE initiating a transfer of data and a PE being able to use the data.
SHMEM routines support remote data transfer through put operations that transfer data to a different PE and get operations that transfer data from a different PE. Other supported operations are work-shared broadcast and reduction, barrier synchronization, and atomic memory operations.
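As a rough illustration of the put and get semantics described above, the following single-process C sketch models the memory of each PE as one row of an array. It is not the SHMEM library: the names model_mem, model_put, model_get, MODEL_NPES, and MODEL_WORDS are hypothetical, and a real transfer crosses address spaces instead of copying within one process.

```c
#include <assert.h>
#include <string.h>

#define MODEL_NPES  4   /* hypothetical number of simulated PEs */
#define MODEL_WORDS 8   /* hypothetical buffer size per simulated PE */

/* One buffer per simulated PE; the row index is the PE number. */
long model_mem[MODEL_NPES][MODEL_WORDS];

/* Model of a put: copy nelems words from the caller's source
 * into the buffer owned by PE `pe`, starting at `offset`. */
void model_put(int pe, size_t offset, const long *source, size_t nelems)
{
    assert(pe >= 0 && pe < MODEL_NPES);
    assert(offset + nelems <= MODEL_WORDS);
    memcpy(&model_mem[pe][offset], source, nelems * sizeof(long));
}

/* Model of a get: copy nelems words from the buffer owned by PE `pe`,
 * starting at `offset`, into the caller's target. */
void model_get(long *target, int pe, size_t offset, size_t nelems)
{
    assert(pe >= 0 && pe < MODEL_NPES);
    assert(offset + nelems <= MODEL_WORDS);
    memcpy(target, &model_mem[pe][offset], nelems * sizeof(long));
}
```

In this model, calling model_put(2, 1, src, 3) followed by model_get(dst, 2, 1, 3) leaves dst equal to src, and the buffers of the other simulated PEs are untouched. Only the PE that issues the call takes part, which is the sense in which the model is one-sided.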
This section lists the SHMEM routines by topic and indicates, for each routine, whether it is available in C, C++, or Fortran.
The following routines prepare or, respectively, clean up the platform-specific resources required for correct operation of a SHMEM application:
The following routines transfer data from or to a specified processing element (PE):
The following routines synchronize processing elements to ensure desired order of PE execution:
The following routines wait for a variable on the local processing element (PE) to change:
| shmem_wait | C, C++, and Fortran |
| shmem_int4_wait | Fortran |
| shmem_int8_wait | Fortran |
| shmem_wait_until | C, C++, and Fortran |
| shmem_int4_wait_until | Fortran |
| shmem_int8_wait_until | Fortran |
| shmem_short_wait | C and C++ |
| shmem_short_wait_until | C and C++ |
| shmem_int_wait | C and C++ |
| shmem_int_wait_until | C and C++ |
| shmem_long_wait | C and C++ |
| shmem_long_wait_until | C and C++ |
| shmem_longlong_wait | C and C++ |
| shmem_longlong_wait_until | C and C++ |
The following routines perform a logical AND function reduction across a set of processing elements (PEs):
The following routines perform a maximum function reduction across a set of processing elements (PEs):
| shmem_int2_max_to_all | Fortran |
| shmem_int4_max_to_all | Fortran |
| shmem_int8_max_to_all | Fortran |
| shmem_real4_max_to_all | Fortran |
| shmem_real8_max_to_all | Fortran |
| shmem_short_max_to_all | C and C++ |
| shmem_int_max_to_all | C and C++ |
| shmem_long_max_to_all | C and C++ |
| shmem_longlong_max_to_all | C and C++ |
| shmem_float_max_to_all | C and C++ |
| shmem_double_max_to_all | C and C++ |
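The result of a maximum reduction can be pictured with a small single-process C sketch. It assumes the reduction is element-wise across the source arrays contributed by the participating PEs; model_max_to_all and its parameters are hypothetical names, not the signature of the library routines, which take additional arguments described in shmem_max(3).

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise maximum over npes source arrays of nreduce elements each.
 * sources[p] stands for the source array contributed by simulated PE p.
 * The result is written to target, which also holds nreduce elements. */
void model_max_to_all(long *target, const long *const *sources,
                      int npes, size_t nreduce)
{
    assert(npes > 0);
    for (size_t i = 0; i < nreduce; i++) {
        long best = sources[0][i];
        for (int p = 1; p < npes; p++) {
            if (sources[p][i] > best)
                best = sources[p][i];
        }
        target[i] = best;
    }
}
```

For three simulated PEs that contribute {1, 9, -4}, {5, 2, -7}, and {3, 3, -1}, the model writes {5, 9, -1} to target.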
The following routines perform a minimum function reduction across a set of processing elements (PEs):
The following routines perform a logical OR function reduction across a set of processing elements (PEs):
The following routines perform a product reduction across a set of processing elements (PEs):
The following routines perform a logical XOR function reduction across a set of processing elements (PEs):
The following routine broadcasts a block of data from one processing element (PE) to one or more target PEs:
The following routines perform an atomic conditional swap to a remote data object:
The following routines perform an atomic fetch-and-increment operation on a remote data object:
The following routines perform an atomic fetch-and-add operation on a remote data object:
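The return-value convention of the three atomic operations above can be sketched with C11 atomics on a local variable. This is only a local-memory analogue under the assumption that each operation returns the previous contents of the target; local_finc, local_fadd, and local_cswap are hypothetical names, and the SHMEM routines additionally take the number of the remote PE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Fetch-and-add: add value to *target and return the prior contents. */
int local_fadd(atomic_int *target, int value)
{
    assert(target != NULL);
    return atomic_fetch_add(target, value);
}

/* Fetch-and-increment: add 1 to *target and return the prior contents. */
int local_finc(atomic_int *target)
{
    assert(target != NULL);
    return atomic_fetch_add(target, 1);
}

/* Conditional swap: store value only if *target equals cond.
 * The prior contents are returned whether or not the store happened. */
int local_cswap(atomic_int *target, int cond, int value)
{
    int expected = cond;
    assert(target != NULL);
    /* On failure, expected is overwritten with the current contents. */
    atomic_compare_exchange_strong(target, &expected, value);
    return expected;
}
```

A caller can tell whether a conditional swap took effect by comparing the returned value with cond.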
The following routines perform non-blocking put operations:
| shmem_put_nb | C, C++, and Fortran |
| shmem_put16_nb | C, C++, and Fortran |
| shmem_put32_nb | C, C++, and Fortran |
| shmem_put64_nb | C, C++, and Fortran |
| shmem_put128_nb | C, C++, and Fortran |
| shmem_putmem_nb | C, C++, and Fortran |
| shmem_short_put_nb | C, C++, and Fortran |
| shmem_int_put_nb | C, C++, and Fortran |
| shmem_long_put_nb | C, C++, and Fortran |
| shmem_longlong_put_nb | C, C++, and Fortran |
| shmem_float_put_nb | C, C++, and Fortran |
| shmem_double_put_nb | C, C++, and Fortran |
Some SHMEM routines are collective routines. They distribute work across a set of processing elements and must be called concurrently by all PEs in the active set.
The following man pages describe the SHMEM collective routines:
shmem_and(3), shmem_barrier(3), shmem_broadcast(3), shmem_max(3), shmem_min(3), shmem_or(3), shmem_prod(3), shmem_sum(3), shmem_xor(3)
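The collective routine man pages describe an active set with three arguments: PE_start, logPE_stride, and PE_size. Assuming that convention (this page does not spell it out), the following C sketch shows how the member PEs could be enumerated. The helper names active_set_member and in_active_set are hypothetical and are not part of the library.

```c
#include <assert.h>

/* Return the PE number of the i-th member (0-based) of the active set
 * that starts at pe_start, has stride 2^log_pe_stride, and has
 * pe_size members. */
int active_set_member(int pe_start, int log_pe_stride, int pe_size, int i)
{
    assert(pe_start >= 0 && log_pe_stride >= 0 && pe_size > 0);
    assert(i >= 0 && i < pe_size);
    return pe_start + i * (1 << log_pe_stride);
}

/* Return 1 if pe belongs to the active set, 0 otherwise. */
int in_active_set(int pe_start, int log_pe_stride, int pe_size, int pe)
{
    int stride = 1 << log_pe_stride;
    int delta = pe - pe_start;

    assert(pe_start >= 0 && log_pe_stride >= 0 && pe_size > 0);
    if (delta < 0 || delta % stride != 0)
        return 0;
    return delta / stride < pe_size;
}
```

With pe_start 0, log_pe_stride 1, and pe_size 4, the active set under this assumption is PEs 0, 2, 4, and 6. Every one of those PEs would have to make the collective call, and PEs outside the set would not.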
Typically, target or source arrays that reside on remote processing elements (PEs) are identified by passing the address of the corresponding data object on the local PE. The local existence of a corresponding data object implies that a data object is symmetric.
Symmetric data objects passed to SHMEM routines can be arrays or scalars. A symmetric data object is one where the local and remote addresses have a known relationship. You can use SHMEM routines to access remote symmetric data objects by using the address of the corresponding data object on the local PE.
The following data objects are symmetric:
Fortran data objects in common blocks or with the SAVE attribute.
Non-stack C and C++ variables.
Fortran arrays allocated with shpalloc(3f).
C and C++ data allocated by shmalloc(3c).
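One way to picture the "known relationship" between local and remote addresses is a model in which every PE keeps its symmetric objects at the same offset from a per-PE base address. The single-process C sketch below assumes that layout purely for illustration; model_segment, model_remote_addr, SEG_NPES, and SEG_BYTES are hypothetical, and an actual implementation may relate the addresses differently.

```c
#include <assert.h>
#include <stddef.h>

#define SEG_NPES  2    /* hypothetical number of simulated PEs */
#define SEG_BYTES 64   /* hypothetical size of each symmetric segment */

/* One simulated symmetric segment per PE. */
unsigned char model_segment[SEG_NPES][SEG_BYTES];

/* Translate the address of an object in the segment of local_pe into the
 * address of the corresponding object in the segment of remote_pe:
 * same offset, different base. */
void *model_remote_addr(const void *local_addr, int local_pe, int remote_pe)
{
    const unsigned char *p = local_addr;
    ptrdiff_t offset;

    assert(local_pe >= 0 && local_pe < SEG_NPES);
    assert(remote_pe >= 0 && remote_pe < SEG_NPES);
    offset = p - model_segment[local_pe];
    assert(offset >= 0 && offset < SEG_BYTES);
    return model_segment[remote_pe] + offset;
}
```

Under this model, passing the local address of a symmetric object together with a PE number is enough to locate the remote copy. A stack variable has no such counterpart, which is consistent with its absence from the list above.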
A SHMEM application must call start_pes or shmem_init as the very first SHMEM routine called within the application to guarantee that lower-level resources have been set up correctly. Otherwise, the SHMEM application will not execute correctly. Similarly, a SHMEM application must call shmem_finalize as the very last SHMEM routine called within the application to guarantee correct clean-up of previously allocated network protocol resources.
SHMEM routines can be used in conjunction with Message Passing Interface (MPI) routines in the same application. Programs that use both MPI and SHMEM should call MPI_Init followed by start_pes or shmem_init. At the end of the program, shmem_finalize should be called, followed by MPI_Finalize. If the MPI job consists of a single application, each SHMEM processing element number equals the rank of that process in the MPI_COMM_WORLD communicator.
The SHMEM routines reside in libsma.a. The following command lines compile programs that include SHMEM routines:
cc c_program.c -lsma
CC cplusplus_program.C -lsma
ftn fortran_program.f -lsma
Example 1: Fortran SHMEM
The following example is a Fortran SHMEM program:
cat example.f90
PROGRAM REDUCTION
  INCLUDE 'mpp/shmem.fh'
  REAL VALUES, SUM
  COMMON /C/ VALUES
  REAL WORK
  CALL START_PES(0)
  VALUES = SHMEM_MY_PE()
  CALL SHMEM_BARRIER_ALL                ! Synchronize all PEs
  SUM = 0.0
  DO I = 0,SHMEM_N_PES()-1
    CALL SHMEM_GET(WORK, VALUES, 1, I)  ! Get next value
    SUM = SUM + WORK                    ! Sum it
  ENDDO
  PRINT*,'PE ',SHMEM_MY_PE(),' COMPUTED SUM=',SUM
  CALL SHMEM_BARRIER_ALL
  CALL SHMEM_FINALIZE
END
Enter the following command to compile the program:
ftn -o example example.f90 -lsma
Enter the following command to run the program:
yod -np 2 ./example
Example 2: C SHMEM
The following example is a C SHMEM program:
cat example.c
#include <mpp/shmem.h>
#include <stdio.h>

int main(void)
{
    long source[10] = { 1, 2, 3, 4, 5,
                        6, 7, 8, 9, 10 };
    static long target[10];   /* static storage, so target is symmetric */

    start_pes(0);
    target[0] = 0;
    shmem_barrier_all();      /* all PEs have cleared target[0] before the put */
    if (shmem_my_pe() == 0) {
        /* put 10 words into target on PE 1 */
        shmem_long_put(target, source, 10, 1);
    }
    shmem_barrier_all();      /* sync sender and receiver */
    printf("target[0] on PE %d is %ld\n", shmem_my_pe(), target[0]);
    shmem_finalize();
    return 0;
}
In the preceding C program, PE 0 sends ten long integers to the target array on PE 1.
Enter the following command to compile the program:
cc example.c -lsma
Enter the following command to execute the program:
yod -sz 4 ./a.out
cc(1), CC(1), ftn(1), f77(1), yod(1)
shmalloc(3), shmem_and(3), shmem_barrier(3), shmem_barrier_all(3), shmem_broadcast(3), shmem_cswap(3), shmem_event(3), shmem_fadd(3), shmem_fence(3), shmem_finalize(3), shmem_finc(3), shmem_g(3), shmem_get(3), shmem_iget(3), shmem_iput(3), shmem_lock(3), shmem_max(3), shmem_min(3), shmem_my_pe(3), shmem_or(3), shmem_p(3), shmem_prod(3), shmem_put(3), shmem_quiet(3), shmem_sum(3), shmem_swap(3), shmem_wait(3), shmem_xor(3), shpalloc(3), shpclmove(3), shpdeallc(3), start_pes(3)