Last changed: 08-16-2006

intro_shmem(1)

NAME

intro_shmem -- Introduces logically shared distributed memory access routines

IMPLEMENTATION

UNICOS/lc systems

DESCRIPTION

The logically shared, distributed memory access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallelized scalable programs.

The SHMEM data-passing library routines are similar to the message passing interface (MPI) library routines: they pass data between cooperating parallel processes. The SHMEM data-passing routines can be used in programs that perform computations in separate address spaces and that explicitly pass data to and from different processing elements (PEs) in the program.

The SHMEM parallel programming model assumes an MPI-1-like group of processes that runs in parallel from job launch to job termination. No processes can be added to or removed from this group, and all processes execute the same application. Thus, SHMEM applications are of the SPMD (Single Program Multiple Data) type. However, a SHMEM application can be part of a larger MPMD (Multiple Program Multiple Data) type MPI job. SHMEM is a one-sided message passing model in which memory is private to each process.

The SHMEM routines minimize the overhead associated with data passing requests, maximize bandwidth, and minimize data latency. Data latency is the length of time between a PE initiating a transfer of data and a PE being able to use the data.

SHMEM routines support remote data transfer through put operations that transfer data to a different PE and get operations that transfer data from a different PE. Other supported operations are work-shared broadcast and reduction, barrier synchronization, and atomic memory operations.
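As a rough illustration of the put operation described above, the following sketch has each PE write a small array into the same static array on its right-hand neighbor. The helper name put_to_right_neighbor, the array name inbox, and the SKETCH_USE_LIBSMA macro are assumptions made for this illustration only. When the macro is not defined, hypothetical single-PE stand-ins replace the library so that the sketch builds without libsma. The caller is assumed to have called start_pes already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef SKETCH_USE_LIBSMA
#include <mpp/shmem.h>          /* real library; link with -lsma */
#else
/* Hypothetical single-PE stand-ins, not part of SHMEM. */
static int  shmem_my_pe(void) { return 0; }
static int  shmem_n_pes(void) { return 1; }
static void shmem_barrier_all(void) { }
static void shmem_long_put(long *target, const long *source,
                           size_t len, int pe)
{
    (void)pe;                   /* only PE 0 exists in the stand-in */
    memcpy(target, source, len * sizeof(long));
}
#endif

#define NWORDS 4

/* Static storage makes this array symmetric: it exists on every PE. */
static long inbox[NWORDS];

/* Each PE sends NWORDS values to the PE on its right (wrapping around),
 * then returns a pointer to the values its left neighbor sent. */
long *put_to_right_neighbor(void)
{
    long outgoing[NWORDS];
    int me = shmem_my_pe();
    int right = (me + 1) % shmem_n_pes();
    int i;

    for (i = 0; i < NWORDS; i++)
        outgoing[i] = 100L * me + i;    /* tag data with the sender's PE */

    shmem_long_put(inbox, outgoing, NWORDS, right);
    shmem_barrier_all();                /* wait until all puts have landed */
    return inbox;
}
```

With the stand-ins, the single PE is its own neighbor, so inbox receives the values 0, 1, 2, 3.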

ROUTINES

This section lists the routines by topic and indicates whether each routine is available in C, C++, or Fortran.

Initialization and Clean up

The following routines set up and clean up, respectively, the platform-specific resources required for correct operation of a SHMEM application:

start_pes

shmem_init

shmem_finalize

Symmetric Heap Management

The following routines are symmetric heap memory management functions:

shmalloc

C and C++

shrealloc

C and C++

shfree

C and C++

shpalloc

Fortran

shpdeallc

Fortran
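A minimal sketch of symmetric heap use in C follows. It assumes that every PE calls the helper with the same count, so that the allocation is symmetric. The helper names alloc_symmetric_longs and free_symmetric_longs and the SKETCH_USE_LIBSMA macro are hypothetical; when the macro is not defined, stand-ins based on malloc and free replace the library so that the sketch builds without libsma.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef SKETCH_USE_LIBSMA
#include <mpp/shmem.h>          /* real library; link with -lsma */
#else
/* Hypothetical single-PE stand-ins, not part of SHMEM. */
static void *shmalloc(size_t size) { return malloc(size); }
static void  shfree(void *ptr)     { free(ptr); }
static int   shmem_my_pe(void)     { return 0; }
#endif

/* Allocate `count` longs from the symmetric heap and fill them with the
 * calling PE's number. Every PE is assumed to call this with the same
 * count. Returns NULL if the allocation fails. */
long *alloc_symmetric_longs(size_t count)
{
    long *buf = (long *)shmalloc(count * sizeof(long));
    size_t i;

    if (buf == NULL)
        return NULL;
    for (i = 0; i < count; i++)
        buf[i] = shmem_my_pe();
    return buf;
}

/* Release a buffer obtained from alloc_symmetric_longs. */
void free_symmetric_longs(long *buf)
{
    if (buf != NULL)
        shfree(buf);
}
```

A buffer obtained this way can be passed as the target of a put or the source of a get, because the corresponding buffer exists on every PE.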

PE Queries

The following routines return processing element (PE) information:

shmem_my_pe

shmem_n_pes
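One common use of the PE queries is to divide a range of work items among the PEs. The sketch below is plain arithmetic and does not call the library; in a SHMEM program, the n_pes and me arguments would typically come from shmem_n_pes and shmem_my_pe. The helper name pe_block_range is an assumption made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the half-open range [*first, *last) of items owned by PE `me`
 * when n_items are split as evenly as possible among n_pes PEs.
 * The first (n_items % n_pes) PEs each receive one extra item. */
void pe_block_range(size_t n_items, int n_pes, int me,
                    size_t *first, size_t *last)
{
    size_t base  = n_items / (size_t)n_pes;
    size_t extra = n_items % (size_t)n_pes;
    size_t pe    = (size_t)me;

    *first = pe * base + (pe < extra ? pe : extra);
    *last  = *first + base + (pe < extra ? 1 : 0);
}
```

For example, ten items on four PEs yield the ranges [0,3), [3,6), [6,8), and [8,10).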

Block Data Get

The following routines transfer data from a specified processing element (PE):

shmem_get

C, C++, and Fortran

shmem_get16

C, C++, and Fortran

shmem_get32

C, C++, and Fortran

shmem_get64

C, C++, and Fortran

shmem_get128

C, C++, and Fortran

shmem_short_get

C and C++

shmem_int_get

C and C++

shmem_long_get

C and C++

shmem_longlong_get

C and C++

shmem_float_get

C and C++

shmem_double_get

C and C++

Block Data Put

The following routines transfer data to a specified processing element (PE):

shmem_put

C, C++, and Fortran

shmem_put16

C, C++, and Fortran

shmem_put32

C, C++, and Fortran

shmem_put64

C, C++, and Fortran

shmem_put128

C, C++, and Fortran

shmem_short_put

C and C++

shmem_int_put

C and C++

shmem_long_put

C and C++

shmem_longlong_put

C and C++

shmem_float_put

C and C++

shmem_double_put

C and C++

Byte-granularity Block Get and Put

The following routines transfer data from or to a specified processing element (PE), respectively:

shmem_getmem

C, C++, and Fortran

shmem_putmem

C, C++, and Fortran

Barrier Synchronization

The following routines synchronize processing elements to ensure desired order of PE execution:

shmem_barrier

C, C++, and Fortran

shmem_barrier_all

C, C++, and Fortran

shmem_fence

C, C++, and Fortran

shmem_quiet

C, C++, and Fortran

Point-to-point Synchronization

The following routines wait for a variable on the local processing element (PE) to change:

shmem_wait

C, C++, and Fortran

shmem_int4_wait

Fortran

shmem_int8_wait

Fortran

shmem_wait_until

C, C++, and Fortran

shmem_int4_wait_until

Fortran

shmem_int8_wait_until

Fortran

shmem_short_wait

C and C++

shmem_short_wait_until

C and C++

shmem_int_wait

C and C++

shmem_int_wait_until

C and C++

shmem_long_wait

C and C++

shmem_long_wait_until

C and C++

shmem_longlong_wait

C and C++

shmem_longlong_wait_until

C and C++
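The wait routines are often paired with a put of a flag word. The following sketch shows one possible pattern: PE 0 puts a data word to the last PE, calls shmem_fence to order the two puts, and then puts a flag; the last PE waits on the flag before reading the data. The names send_to_last_pe, payload, and ready, and the SKETCH_USE_LIBSMA macro, are assumptions made for this illustration. When the macro is not defined, hypothetical single-PE stand-ins replace the library, and the value given to SHMEM_CMP_EQ there is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef SKETCH_USE_LIBSMA
#include <mpp/shmem.h>          /* real library; link with -lsma */
#else
/* Hypothetical single-PE stand-ins, not part of SHMEM. */
#define SHMEM_CMP_EQ 0          /* placeholder value for the stand-in only */
static int  shmem_my_pe(void) { return 0; }
static int  shmem_n_pes(void) { return 1; }
static void shmem_fence(void) { }
static void shmem_int_put(int *target, const int *source,
                          size_t len, int pe)
{
    (void)pe;
    memcpy(target, source, len * sizeof(int));
}
static void shmem_int_wait_until(int *var, int cmp, int value)
{
    /* With one PE the flag is already set, so there is nothing to wait for. */
    assert(cmp == SHMEM_CMP_EQ && *var == value);
}
#endif

static int payload;     /* symmetric data word */
static int ready;       /* symmetric flag, 0 until the data has arrived */

/* PE 0 sends `value` to the last PE and then raises that PE's flag.
 * The last PE waits for the flag and returns the received value;
 * every other PE returns -1. */
int send_to_last_pe(int value)
{
    int me   = shmem_my_pe();
    int last = shmem_n_pes() - 1;
    int one  = 1;

    if (me == 0) {
        shmem_int_put(&payload, &value, 1, last);
        shmem_fence();          /* order the data put before the flag put */
        shmem_int_put(&ready, &one, 1, last);
    }
    if (me == last) {
        shmem_int_wait_until(&ready, SHMEM_CMP_EQ, 1);
        return payload;
    }
    return -1;
}
```

With the stand-ins, PE 0 is also the last PE, so the call returns the value it was given.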

Event

The following routines clear, set, test or wait on an event, respectively:

shmem_clear_event

C, C++, and Fortran

shmem_set_event

C, C++, and Fortran

shmem_test_event

C, C++, and Fortran

shmem_wait_event

C, C++, and Fortran

Shmem_and

The following routines perform a logical AND function reduction across a set of processing elements (PEs):

shmem_int2_and_to_all

Fortran

shmem_int4_and_to_all

Fortran

shmem_int8_and_to_all

Fortran

shmem_short_and_to_all

C and C++

shmem_int_and_to_all

C and C++

shmem_long_and_to_all

C and C++

shmem_longlong_and_to_all

C and C++

Shmem_max

The following routines perform a maximum function reduction across a set of processing elements (PEs):

shmem_int2_max_to_all

Fortran

shmem_int4_max_to_all

Fortran

shmem_int8_max_to_all

Fortran

shmem_real4_max_to_all

Fortran

shmem_real8_max_to_all

Fortran

shmem_short_max_to_all

C and C++

shmem_int_max_to_all

C and C++

shmem_long_max_to_all

C and C++

shmem_longlong_max_to_all

C and C++

shmem_float_max_to_all

C and C++

shmem_double_max_to_all

C and C++

Shmem_min

The following routines perform a minimum function reduction across a set of processing elements (PEs):

shmem_int2_min_to_all

Fortran

shmem_int4_min_to_all

Fortran

shmem_int8_min_to_all

Fortran

shmem_short_min_to_all

C and C++

shmem_int_min_to_all

C and C++

shmem_long_min_to_all

C and C++

shmem_longlong_min_to_all

C and C++

shmem_float_min_to_all

C and C++

shmem_double_min_to_all

C and C++

Shmem_or

The following routines perform a logical OR function reduction across a set of processing elements (PEs):

shmem_int2_or_to_all

Fortran

shmem_int4_or_to_all

Fortran

shmem_int8_or_to_all

Fortran

shmem_short_or_to_all

C and C++

shmem_int_or_to_all

C and C++

shmem_long_or_to_all

C and C++

shmem_longlong_or_to_all

C and C++

Shmem_prod

The following routines perform a product reduction across a set of processing elements (PEs):

shmem_int2_prod_to_all

Fortran

shmem_int4_prod_to_all

Fortran

shmem_int8_prod_to_all

Fortran

shmem_short_prod_to_all

C and C++

shmem_int_prod_to_all

C and C++

shmem_long_prod_to_all

C and C++

shmem_longlong_prod_to_all

C and C++

shmem_float_prod_to_all

C and C++

shmem_double_prod_to_all

C and C++

Shmem_sum

The following routines perform a sum reduction across a set of processing elements (PEs):

shmem_int2_sum_to_all

Fortran

shmem_int4_sum_to_all

Fortran

shmem_int8_sum_to_all

Fortran

shmem_short_sum_to_all

C and C++

shmem_int_sum_to_all

C and C++

shmem_long_sum_to_all

C and C++

shmem_longlong_sum_to_all

C and C++

shmem_float_sum_to_all

C and C++

shmem_double_sum_to_all

C and C++

Shmem_xor

The following routines perform a logical XOR function reduction across a set of processing elements (PEs):

shmem_int2_xor_to_all

Fortran

shmem_int4_xor_to_all

Fortran

shmem_int8_xor_to_all

Fortran

shmem_int_xor_to_all

C and C++

shmem_short_xor_to_all

C and C++

shmem_long_xor_to_all

C and C++

shmem_longlong_xor_to_all

C and C++

Broadcast

The following routines broadcast a block of data from one processing element (PE) to one or more target PEs:

shmem_broadcast

C, C++, and Fortran

shmem_broadcast4

Fortran

shmem_broadcast8

Fortran

shmem_broadcast32

C, C++, and Fortran

shmem_broadcast64

C, C++, and Fortran

Atomic Swap

The following routines perform an atomic swap to a remote data object:

shmem_swap

C, C++, and Fortran

shmem_int4_swap

Fortran

shmem_int8_swap

Fortran

shmem_int_swap

C and C++

shmem_real4_swap

Fortran

shmem_real8_swap

Fortran

shmem_long_swap

C and C++

shmem_longlong_swap

C and C++

shmem_float_swap

C and C++

shmem_double_swap

C and C++

Atomic Conditional Swap

The following routines perform an atomic conditional swap to a remote data object:

shmem_cswap

C, C++, and Fortran

shmem_int4_cswap

Fortran

shmem_int8_cswap

Fortran

shmem_short_cswap

C and C++

shmem_int_cswap

C and C++

shmem_long_cswap

C and C++

shmem_longlong_cswap

C and C++

Atomic Fetch and Increment

The following routines perform an atomic fetch-and-increment operation on a remote data object:

shmem_finc

C, C++, and Fortran

shmem_int4_finc

Fortran

shmem_int8_finc

Fortran

shmem_int_finc

C and C++

shmem_long_finc

C and C++

shmem_longlong_finc

C and C++
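As an illustration of fetch-and-increment, the sketch below uses a counter on PE 0 to hand out unique ticket numbers to whichever PE asks. The names take_ticket and next_ticket and the SKETCH_USE_LIBSMA macro are assumptions made for this illustration. When the macro is not defined, a hypothetical single-PE stand-in replaces shmem_int_finc; the stand-in is not atomic and serves only to let the sketch build without libsma.

```c
#include <assert.h>

#ifdef SKETCH_USE_LIBSMA
#include <mpp/shmem.h>          /* real library; link with -lsma */
#else
/* Hypothetical single-PE stand-in, not part of SHMEM and not atomic. */
static int shmem_int_finc(int *target, int pe)
{
    (void)pe;
    return (*target)++;         /* return the old value, then increment */
}
#endif

/* Symmetric because it has static storage; only the copy on PE 0 is used. */
static int next_ticket;

/* Fetch the current counter value on PE 0 and increment it in one
 * atomic operation, so that no two callers receive the same ticket. */
int take_ticket(void)
{
    return shmem_int_finc(&next_ticket, 0);
}
```

Successive calls return 0, 1, 2, and so on, regardless of which PE makes the call.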

Atomic Fetch and Add

The following routines perform an atomic fetch-and-add operation on a remote data object:

shmem_fadd

C, C++, and Fortran

shmem_int4_fadd

Fortran

shmem_int8_fadd

Fortran

shmem_int_fadd

C and C++

shmem_long_fadd

C and C++

shmem_longlong_fadd

C and C++

Atomic Lock

The following routines manipulate a mutual exclusion memory lock:

shmem_clear_lock

C, C++, and Fortran

shmem_set_lock

C, C++, and Fortran

shmem_test_lock

C, C++, and Fortran

Non-blocking Put Operations

The following routines perform non-blocking put operations:

shmem_put_nb

C, C++, and Fortran

shmem_put16_nb

C, C++, and Fortran

shmem_put32_nb

C, C++, and Fortran

shmem_put64_nb

C, C++, and Fortran

shmem_put128_nb

C, C++, and Fortran

shmem_putmem_nb

C, C++, and Fortran

shmem_short_put_nb

C, C++, and Fortran

shmem_int_put_nb

C, C++, and Fortran

shmem_long_put_nb

C, C++, and Fortran

shmem_longlong_put_nb

C, C++, and Fortran

shmem_float_put_nb

C, C++, and Fortran

shmem_double_put_nb

C, C++, and Fortran

Collective Routines

Some SHMEM routines are collective routines. They distribute work across a set of processing elements and must be called concurrently by all PEs in the active set.

The following man pages describe the SHMEM collective routines:

shmem_and(3), shmem_barrier(3), shmem_broadcast(3), shmem_max(3), shmem_min(3), shmem_or(3), shmem_prod(3), shmem_sum(3), shmem_xor(3)

NOTES

Typically, target or source arrays that reside on remote processing elements (PEs) are identified by passing the address of the corresponding data object on the local PE. The local existence of a corresponding data object implies that a data object is symmetric.

Symmetric accessible data objects passed to SHMEM routines can be arrays or scalars. A symmetric data object is one where the local and remote addresses have a known relationship. You can use SHMEM routines to access remote symmetric data objects by using the address of the corresponding data object on the local PE.

The following data objects are symmetric:

Fortran data objects in common blocks or with the SAVE attribute

Non-stack C and C++ variables

Fortran arrays allocated with shpalloc

C and C++ data allocated by shmalloc

A SHMEM application must call start_pes or shmem_init as the very first SHMEM routine called within the application to guarantee that lower-level resources have been set up correctly. Otherwise, the SHMEM application will not execute correctly. Similarly, a SHMEM application must call shmem_finalize as the very last SHMEM routine called within the application to guarantee correct clean-up of previously allocated network protocol resources.

SHMEM routines can be used in conjunction with Message Passing Interface (MPI) routines in the same application. Programs that use both MPI and SHMEM should call MPI_Init followed by start_pes or shmem_init. At the end of the program, shmem_finalize should be called followed by MPI_Finalize. If the MPI job consists of a single application, each SHMEM processing element number equals the rank of the same process in the MPI_COMM_WORLD communicator.

The SHMEM routines reside in libsma.a. The following command lines compile programs that include SHMEM routines:
cc c_program.c -lsma
CC cplusplus_program.C -lsma
ftn fortran_program.f -lsma

psync Arrays

psync arrays are not supported.

EXAMPLES


Example 1: Fortran SHMEM

The following example is a Fortran SHMEM program:

cat example.f90
        PROGRAM REDUCTION
        INCLUDE 'mpp/shmem.fh'
        REAL VALUES, SUM
        COMMON /C/ VALUES
        REAL WORK
        CALL START_PES(0)
        VALUES = SHMEM_MY_PE()
        CALL SHMEM_BARRIER_ALL                  ! Synchronize all PEs
        SUM = 0.0
        DO I = 0,SHMEM_N_PES()-1
           CALL SHMEM_GET(WORK, VALUES, 1, I)   ! Get next value
           SUM = SUM + WORK                     ! Sum it
        ENDDO
        PRINT*,'PE ',SHMEM_MY_PE(),' COMPUTED       SUM=',SUM
        CALL SHMEM_BARRIER_ALL
        CALL SHMEM_FINALIZE
        END

Enter the following command to compile the program:

ftn -o example example.f90 -lsma

Enter the following command to run the program:

yod -np 2 ./example



Example 2: C SHMEM

The following example is a C SHMEM program:

cat example.c
        #include <mpp/shmem.h>
        #include <stdio.h>
        int main(void)
        {
        long source[10] = { 1, 2, 3, 4, 5,
                               6, 7, 8, 9, 10 };
        static long target[10];   /* static, so symmetric */
        start_pes(0);
        target[0] = 0;
        shmem_barrier_all();  /* target is initialized on every PE */
        if (shmem_my_pe() == 0) {
           /* put 10 words into target on PE 1 */
           shmem_long_put(target, source, 10, 1);
        }
        shmem_barrier_all();  /* sync sender and receiver */
        printf("target[0] on PE %d is %ld\n", shmem_my_pe(), target[0]);
        shmem_finalize();
        return 0;
        }


In the preceding C program, PE 0 writes ten long integers into the target array on PE 1.

Enter the following command to compile the program:

cc example.c -lsma

Enter the following command to execute the program:

yod -sz 4 ./a.out

SEE ALSO

cc(1), CC(1), ftn(1), f77(1), yod(1)

shmalloc(3), shmem_and(3), shmem_barrier(3), shmem_barrier_all(3), shmem_broadcast(3), shmem_cswap(3), shmem_event(3), shmem_fadd(3), shmem_fence(3), shmem_finalize(3), shmem_finc(3), shmem_g(3), shmem_get(3), shmem_iget(3), shmem_iput(3), shmem_lock(3), shmem_max(3), shmem_min(3), shmem_my_pe(3), shmem_or(3), shmem_p(3), shmem_prod(3), shmem_put(3), shmem_quiet(3), shmem_sum(3), shmem_swap(3), shmem_wait(3), shmem_xor(3), shpalloc(3), shpclmove(3), shpdeallc(3), start_pes(3)