Last changed: 09-11-2006

intro_mpi(1)

NAME

intro_mpi -- Introduces the Message Passing Interface (MPI)

IMPLEMENTATION

UNICOS/lc

STANDARDS

This release of MPI-2 derives from MPICH-2 and implements the MPI-2 standard, except for spawn support. It also implements the MPI 1.2 standard, as documented by the MPI Forum in the spring 1997 release of MPI: A Message Passing Interface Standard.

DESCRIPTION

The Message-Passing Interface (MPI) supports parallel programming across a network of computer systems through a technique known as message passing. The goal of the MPI Forum, simply stated, is to develop a widely used standard for writing message-passing programs. MPI establishes a practical, portable, efficient, and flexible standard for message passing that makes use of the most attractive features of a number of existing message passing systems, rather than selecting one of them and adopting it as the standard.

MPI is a specification (like C or Fortran) that has a number of implementations.

Other sources of MPI information include the man pages for MPI library functions and the following URLs:
http://www-unix.mcs.anl.gov/mpi/index.html
http://www-unix.mcs.anl.gov/mpi/mpich2
http://www.mpi-forum.org/

To invoke the PGI compiler for all applications, including MPI applications, use the cc, CC, ftn, or f77 command. If you invoke a compiler directly by using a command such as mpicc, the resulting executable will not run on a Cray XT3 system.

NOTES

There is a name conflict between stdio.h and the MPI C++ binding in relation to the names SEEK_SET, SEEK_CUR, and SEEK_END. If your application does not reference these names, you can work around this conflict by using the compiler flag -DMPICH_IGNORE_CXX_SEEK. If your application does require these names, as defined by MPI, undefine the names (#undef SEEK_SET, for example) prior to including mpi.h. Alternatively, if the application requires the stdio.h definitions, include mpi.h before stdio.h or iostream.

The following MPI-2 features are not supported: thread safety and dynamic process management.

The following process-creation commands are not supported and, if used, will generate aborts at runtime:

The MPI_LONG_DOUBLE data type is not supported.

ENVIRONMENT VARIABLES

Environment variables have predefined values. You can change some variables to achieve particular performance objectives; others are required values for standard-compliant programs.

Note: The use of some of the following environment variables assumes knowledge of Portals. For more information about Portals, see the Cray XT Series Programming Environment User's Guide.

MPI_COLL_OPT_ON

Enables collective optimizations using non-default, architecture-specific algorithms for some MPI collective operations.

Default: Not enabled

MPICH_ALLTOALL_SHORT_MSG

Adjusts the message-size cut-off point at which the store-and-forward Alltoall algorithm is used for short messages.

Default: When building with Portals, the default value is 512 bytes; otherwise, it is 256 bytes.

MPICH_ALLTOALLVW_SENDRECV

Disables the flow-controlled Alltoall algorithm. When it is disabled, the pairwise sendrecv algorithm is used, which is the default for messages larger than 32768 bytes. Setting this variable may avoid situations where the flow-controlled Alltoall algorithm causes event queue overflow; see SPR 734419.

MPICH_BCAST_ONLY_TREE

Setting this variable to 1 (or any nonzero value) or 0, respectively, disables or enables the ring algorithm in the implementation of MPI_Bcast for communicators whose size is not a power of two.

Default: When building with Portals, the default value is 1; otherwise, it is 0.

MPICH_DBMASK

Causes MPICH-2 to set a bit mask variable to control debugging modes. Currently, only the setting for bit 9 (that is, setting MPICH_DBMASK to 0x200) is relevant to users. This setting causes a core dump when MPICH-2 detects an internal error.
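As a small worked example of the bit arithmetic, bit 9 is the value 1 shifted left by 9 positions, which is 0x200. The sketch below derives that value in the shell and exports it before the job is launched. The helper variable name mask is illustrative only and is not part of MPICH-2.

```shell
# Bit 9 of the mask: 1 shifted left by 9 positions (decimal 512).
mask=$(printf '0x%x' $((1 << 9)))

# Export the setting so it is present in the environment of the launch command.
export MPICH_DBMASK="$mask"

echo "MPICH_DBMASK=$MPICH_DBMASK"
```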

MPICH_MAX_SHORT_MSG_SIZE

Sets the maximum size of a message in bytes that can be sent via the short (eager) protocol. Its default setting is 128,000 bytes. Messages that are larger than this are sent via a long message protocol that is handled differently by the receiver. If you increase MPICH_MAX_SHORT_MSG_SIZE, also increase MPICH_UNEX_BUFFER_SIZE, which is the total size of the buffers that hold unexpected short messages. If the matching receiver has been pre-posted, the message is immediately transferred to the pre-posted buffer. Otherwise, the message is dropped and the actual message transfer occurs once the matching receive is posted.
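The advice to raise both settings together might look like the following sketch. The specific values are hypothetical and chosen only to illustrate the pairing; suitable values depend on the application.

```shell
# Hypothetical values: raise the eager-message limit above its 128,000-byte default ...
export MPICH_MAX_SHORT_MSG_SIZE=256000

# ... and raise the unexpected-message buffer space with it (a trailing M means megabytes).
export MPICH_UNEX_BUFFER_SIZE=120M

echo "short msg limit: $MPICH_MAX_SHORT_MSG_SIZE bytes, unexpected buffers: $MPICH_UNEX_BUFFER_SIZE"
```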

MPICH_MAX_VSHORT_MSG_SIZE

Specifies, in bytes, the maximum message size to be considered for the vshort path. The maximum allowed is 16384 bytes.

Default: 1024

MPICH_PTL_OTHER_EVENTS

Sets the number of entries in the event queue that is used to receive all other MPI-related Portals events.

Default: 2048

MPICH_PTL_SEND_CREDITS

Sets the number of send credits from one process to another (send credits are processed by the MPI progress engine of the receiving process). Note: The value -1 sets the number of send credits equal to the size of the unexpected event queue divided by the number of processes in the job, which should prevent queue overflow in any situation. The default value 0 disables this mode.

Setting the environment variable MPICH_PTL_SEND_CREDITS enables a flow control mechanism to prevent the Portals event queue associated with the MPI unexpected receive message queue from being exhausted in applications when sends from other processes can run ahead of the receives on a particular process.

Default: 0
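To give a feel for the value -1, the following sketch computes the resulting number of send credits from the rule stated above: the size of the unexpected event queue divided by the number of processes. The function name send_credits is hypothetical, and truncating integer division is an assumption of this sketch, not a documented detail.

```shell
# send_credits UNEX_EVENTS NPROCS
# Model of the -1 setting: unexpected event queue size divided by the job size.
# Truncating integer division is an assumption of this sketch.
send_credits() {
    echo $(( $1 / $2 ))
}

# With the default MPICH_PTL_UNEX_EVENTS value of 20480:
send_credits 20480 512     # prints 40
send_credits 20480 1000    # prints 20
```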

MPICH_PTL_UNEX_EVENTS

Sets the number of event queue entries for unexpected MPI point-to-point messages.

Default: 20480

MPICH_RANK_REORDER_METHOD

Overrides the default MPI rank placement scheme. If this variable is not set, the default launcher placement policy is used. To display the MPI rank placement and launching information, set PMI_DEBUG to 1. MPICH_RANK_REORDER_METHOD accepts the following values:

1

Specifies SMP-style placement. For a multi-core node, this places sequential MPI ranks on the same node. For example, for a yod-launched 8-process (4-node) MPI job on dual-core nodes, the placement would be:
NODE    0    1    2    3
RANK   0&1  2&3  4&5  6&7
Note that aprun uses SMP-style placement by default.
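The SMP-style table above follows a simple rule: the node index is the rank divided by the number of cores per node. The sketch below reproduces the table for 8 ranks on dual-core nodes. The function name smp_node is hypothetical; it models the placement and is not part of the launcher.

```shell
# smp_node RANK CORES_PER_NODE
# Sequential ranks fill one node before moving to the next.
smp_node() {
    echo $(( $1 / $2 ))
}

# Select SMP-style placement for the job.
export MPICH_RANK_REORDER_METHOD=1

# Print the expected placement for 8 ranks on dual-core nodes.
r=0
while [ "$r" -lt 8 ]; do
    echo "rank $r -> node $(smp_node "$r" 2)"
    r=$(( r + 1 ))
done
```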

2

Specifies folded-rank placement on dual-core nodes. Instead of rank placement starting over on the first node when half of the MPI processes have been placed, this option places process N/2 on the last node and continues back toward the first node. For example, for a yod-launched 8-process (4-node) job on dual-core nodes, the placement would be:
NODE    0    1    2    3
RANK   0&7  1&6  2&5  3&4
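The folded table above can be modeled as follows: the first N/2 ranks are placed on nodes 0 through N/2-1 in order, and the remaining ranks are placed walking back from the last node. The function name folded_node is hypothetical, and the sketch assumes dual-core nodes and an even number of ranks.

```shell
# folded_node RANK NRANKS
# Assumes dual-core nodes and an even NRANKS, so there are NRANKS/2 nodes.
folded_node() {
    if [ "$1" -lt $(( $2 / 2 )) ]; then
        echo "$1"                   # first half: rank i goes to node i
    else
        echo $(( $2 - 1 - $1 ))     # second half: fold back from the last node
    fi
}

# Select folded-rank placement for the job.
export MPICH_RANK_REORDER_METHOD=2

# Print the expected placement for 8 ranks on 4 dual-core nodes.
r=0
while [ "$r" -lt 8 ]; do
    echo "rank $r -> node $(folded_node "$r" 8)"
    r=$(( r + 1 ))
done
```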

3

Specifies a custom rank placement defined in a file named MPICH_RANK_ORDER. The file MPICH_RANK_ORDER must be in the current running directory and readable by all ranks. The order in which the ranks are listed in the file determines which ranks are placed closest to each other. This is most helpful for dual-core nodes.

For example:

0-15

Places the ranks in SMP-style order (see above).

15-0

Places ranks 15&14 on the first node, 13&12 on next, etc.

0,4,1,5,2,6,3,7

Places ranks 0&4 on the first node, 1&5 on the next, 2&6 together, and 3&7 together.

You can use combinations of ranges (8-15) or individual rank numbers in the file. The number of ranks listed in the file must match the number of processes launched.

A # denotes the beginning of a comment. The comment will continue to the end of the line. A comment can start in the middle of a line.
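Because the number of ranks in the file must match the number of processes launched, it can help to expand the file and count its entries before launching. The following sketch writes a sample MPICH_RANK_ORDER file for a 16-process job into a temporary directory and expands it. The function name expand_rank_order and the file contents are hypothetical; the parser is a minimal model of the syntax described above, not the parser MPICH-2 uses.

```shell
# expand_rank_order FILE
# Prints the ranks named in FILE, one per line, in placement order.
expand_rank_order() {
    # Drop comments (from # to end of line), split on commas, remove blanks.
    sed 's/#.*//' "$1" | tr ',' '\n' | tr -d ' \t' | while read -r item; do
        case "$item" in
            '') ;;                          # nothing left after comment removal
            *-*)                            # a range such as 8-15 or 15-0
                first=${item%-*}
                last=${item#*-}
                if [ "$first" -le "$last" ]; then step=1; else step=-1; fi
                n=$first
                while :; do
                    echo "$n"
                    [ "$n" -eq "$last" ] && break
                    n=$(( n + step ))
                done
                ;;
            *) echo "$item" ;;              # a single rank number
        esac
    done
}

# Write a sample file (hypothetical contents) in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/MPICH_RANK_ORDER" <<'EOF'
# Sample placement for a 16-process job
0,4,1,5,2,6,3,7,15-8   # interleaved pairs, then a descending range
EOF

# Count the ranks so the total can be compared with the job size.
count=$(( $(expand_rank_order "$workdir/MPICH_RANK_ORDER" | wc -l) ))
echo "ranks listed: $count"
```

For an actual job, the file would be placed in the directory from which the job is launched, with MPICH_RANK_REORDER_METHOD set to 3.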

There is also a timing backoff available, which is useful when launching a large number of processes. Because the data file MPICH_RANK_ORDER has to be read by each process, contention may occur.

The two backoff environment variables are:

MPICH_RANK_FILE_BACKOFF

Specifies the number of milliseconds for backoff.

MPICH_RANK_FILE_GROUPSIZE

Specifies the number of ranks in each group.

The following example indicates that the first 512 ranks will read the file, while all remaining ranks wait for 1 second (1000 milliseconds). Then the second group of 512 ranks read the file, while the remaining ranks wait an additional 1 second and so on, until all ranks have read the file.
export MPICH_RANK_FILE_BACKOFF=1000
export MPICH_RANK_FILE_GROUPSIZE=512
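The schedule described above can be written as a small formula: a rank waits for its group index (rank divided by the group size) multiplied by the backoff time. The function name rank_wait_ms is hypothetical, and the assumption that groups are formed from consecutive rank numbers follows the wording of the example rather than any documented guarantee.

```shell
# rank_wait_ms RANK BACKOFF_MS GROUPSIZE
# Model: group index (RANK / GROUPSIZE) multiplied by the backoff in milliseconds.
rank_wait_ms() {
    echo $(( ($1 / $3) * $2 ))
}

rank_wait_ms 0    1000 512    # prints 0     (first group reads immediately)
rank_wait_ms 511  1000 512    # prints 0
rank_wait_ms 512  1000 512    # prints 1000  (second group waits 1 second)
rank_wait_ms 1500 1000 512    # prints 2000  (third group waits 2 seconds)
```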

If MPICH_RANK_REORDER_METHOD is not set, the default launcher placement policy is used. Setting the environment variable PMI_DEBUG to 1 displays additional debug information as well as the MPI rank placement and launching information.

MPICH_REDUCE_SHORT_MSG

Adjusts the message-size cut-off point at which a reduce-scatter algorithm is used. A binomial tree algorithm is used for smaller messages.

Default: When building with Portals, the default value is 65536 bytes; otherwise, the default is 2046.

MPICH_RMA_BUFFER_SIZE

Overrides the size of the buffers allocated to the queue for messages associated with RMA operations. This value is interpreted as the number of bytes or, if the string ends in M, megabytes to be allocated to buffers associated with this queue.

Default: 6M

MPICH_UNEX_BUFFER_SIZE

Overrides the size of the buffers allocated to the MPI unexpected receive queue. This value is interpreted as the number of bytes or, if the string ends in M, megabytes to be allocated to buffers associated with this queue.

Default: 60M
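The size syntax shared by MPICH_RMA_BUFFER_SIZE and MPICH_UNEX_BUFFER_SIZE can be illustrated with a small conversion function. The function name buffer_bytes is hypothetical, and treating 1M as 1048576 bytes is an assumption of this sketch; this page does not say whether a megabyte means 2^20 or 10^6 bytes.

```shell
# buffer_bytes VALUE
# Converts a size such as 60M or 4096 to a byte count.
# Assumption: 1M = 1048576 bytes (2^20); not confirmed by this man page.
buffer_bytes() {
    case "$1" in
        *M) echo $(( ${1%M} * 1048576 )) ;;   # strip the M suffix and scale
        *)  echo "$1" ;;                      # plain number: already in bytes
    esac
}

buffer_bytes 60M     # prints 62914560
buffer_bytes 6M      # prints 6291456
buffer_bytes 4096    # prints 4096
```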

MPICH_USE_BUILTIN_DIMS_CREATE

Setting this environment variable causes the MPICH-2 MPI_Dims_create algorithm to be used, which does not comply with the MPI standard. The default MPI_Dims_create algorithm is consistent with the MPI standard.

Default: Not enabled

MPICH_VSHORT_OFF

Disables the vshort path optimization.

Default: Not set (the vshort path optimization is on)

MPICH_VSHORT_BUFFERS

Specifies the number of buffers to be pre-allocated for the send-side buffering of messages for the vshort protocol. Each buffer is 16384 bytes in size.

Default: 32

Flow-controlled versions of the MPI_ALLTOALLV and MPI_ALLTOALLW operations are enabled. Currently, the algorithm is implemented only for intra-communicators, and by default it is enabled when the size of the communicator is greater than 120.

The values chosen are conservative settings that should not exhaust the number of Seastar CAM entries that control remote DMA. Software retry is needed when there is exhaustion of CAM entries.

Note: This algorithm also applies to medium size (greater than 256 but less than 32768 bytes) MPI_ALLTOALL MPI operations.

MPICH_ALLTOALLVW_FCSIZE

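Sets the communicator size above which the flow-controlled algorithm is enabled. By default, the algorithm is enabled when the size of the communicator is greater than 120.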

MPICH_ALLTOALLVW_SENDWIN

Sets the send window size. When flow control is enabled, send and receive windows are established that can allow up to 80 Isend operations.

MPICH_ALLTOALLVW_RECVWIN

Sets the receive window size. When flow control is enabled, send and receive windows are established that can allow up to 100 Irecv operations.
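A sketch of how the three flow-control variables might be set together follows. All three values are hypothetical and chosen only for illustration; the window sizes are kept below the 80 Isend and 100 Irecv figures quoted above.

```shell
# Hypothetical tuning values for the flow-controlled Alltoallv/Alltoallw algorithm.
export MPICH_ALLTOALLVW_FCSIZE=256     # communicator size setting
export MPICH_ALLTOALLVW_SENDWIN=40     # send window size
export MPICH_ALLTOALLVW_RECVWIN=50     # receive window size

echo "fcsize=$MPICH_ALLTOALLVW_FCSIZE sendwin=$MPICH_ALLTOALLVW_SENDWIN recvwin=$MPICH_ALLTOALLVW_RECVWIN"
```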

SEE ALSO

Cray XT Series Programming Environment User's Guide

cc(1), CC(1), ftn(1), f77(1), yod(1)