NERSC logo National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
  IBM SP Parallel Scaling Overview - Parallel I/O

IBM SP Parallel Scaling Overview - Parallel I/O


Abstract

Here we will address performance and scaling concerns when moving data from memory to disk.

An important first step in examining a slow running or poorly scaling IO code section is to consider the amount of metadata movement versus user data movement. Similarly to the guiding principle for MPI above, fewer large IO transactions are generally preferable. Frequently writing small temporary scratch files from each task may not noticeably impede a code running 128 way and yet may have significant impact at 1024 way.

Concurrency, data size and data topology are typically the determining factors in deciding how to accomplish I/O from a parallel code. In what follows the first two will be treated as free variables. Spaning the space of all data topologies is not possible here so the test below are restricted to the following rough sketches; largeblock contiguous, interleaved blocks, and scattered.

File Systems

NERSC's SP relies on GPFS for most user file I/O. Both home ($HOME) and scratch ($SCRATCH) filesystems on mounted globally on all nodes via GPFS.

Users have two main choices for parallel filesystems. All $HOME directories and each of the users $SCRATCH directory are in /usr/common/homefs and /usr/common/scratchfs. The resources mentioned in the hardware section (spindles, adapters, etc GETINFO) are allocated between these filesystems in an asymmetric way. While both filesystem are robust and global across the machine, $SCRATCH has greater bandwidth to disk.

Seaborg File I/O Resources

A small local disk filesystem (/tmp) exists on each node, but this space is tiny and to be used only for AIX and system temporary files. Fortran and other software which use the TMPDIR variable will write their scratch files to $SCRATCH, a large fast parallel file system in GPFS. In this paper, all considerations of how user file I/O scales are based on GPFS rather than local disk.

The overall parallel strategy of GPFS is to load balance I/O requests across the machine by distributing ownership of data blocks widely across a large number of servers. These participants in the GPFS filesystem are connected via the colony switch which provides considerable bandwidth for data transfers. Modern versions of GPFS (> 1.3) support memory mapped files and most other POSIX I/O functions.

Likewise it can be important to keep in mind that due to the distributed nature of GPFS nodes other that the GPFS server nodes, e.g. batch nodes acting as GPFS clients, can also fulfill requests for disk data. Each client node has a "buddy buffer" of up to 256 MB from which it may serve blocks requested by other clients if such a request permitted based on file system locks.

The GPFS filesystems on seaborg are built from 44 TB of SSA disks, served from 16 GPFS server nodes. A large amount of memory, 32 GB on each server node is available as a cache buffer.

GPFS Basics

Scalability results from the distribution of file data across a large number of nodes. The distribution is at the block level. Blocks are 512Kbytes. Switch data movement is through KLAPI.

Each node participating as a GPFS client can obtain a write lock for a range of blocks within a file and all or some portion of that data may reside in the client ndoes memory (rather than on disk). If a read request is made for that data the client node may provide the data directly or the transaction may be completed

GPFS supports byte range locking, which means that several tasks may read or write disjoint areas of a file without competing for exclusive locks to the file. No special coding, other than making sure the accesses do not overlap is required.

In order to implement parallel I/O well on seaborg it is useful to understand what parallel performance the GPFS filesystems are capable of, but also what types of file activitiy to avoid.

Despite it's distributed nature, GPFS must still use exclusive locks to maintain data coherence and the management of these locks is handled in a more centralized way than the way that the data itself is distributed. While not delving into a detailed description of GPFS metadata, mind that there is a metadata workload associated with each

Scaling of Directory Operations
Calling libc(2) functions from each MPI task (not shell/system calls).

From the above it should be obvious that directory creation or removal from each task of a parallel job is a severe impediment to scaling. Certain I/O operations do not scale well with concurrency as they involve large metadata workloads ( inode creation or destruction ~ concurrency) or tax GPFS's token management. Knowing which operations these are is useful when choosing the building blocks of a parallel I/O strategy.

As a rough sketch, this is demonstrated by using a neglibly small data size and testing the impact on concrrency alone on file operations.

pseudocode timings
mkdir(fname,S_IRWXU);
rmdir(fname);
fd = open(fname,O_TRUNC,S_IRWXU);
write(fd,byte,1);
close(fd);
fd = open(fname,O_RDONLY,S_IRWXU);
read(fd,byte,1);
close(fd);
unlink(fname);

The performance considerations arising from the above analysis have to do with near zero (1B) data size per task. While it is useful to understand the impact of concurrency itself, in all real applications tradeoffs between concurency and data size will dominate the decisions made about I/O strategies.

GPFS Performance Contiguous Writes

Parallel I/O Goals

Before turning to the identification of optimal strategies for parallel I/O it's worth clarifying that the important goals are. Typically the primary goal of a parallel I/O strategy is to increase the read/write bandwidths from memory to disk. Other considerations include minimizing on disk storage size, conserving inodes, overlapping I/O with computation, and preserving a particular organization of data within a file.

  • Maximize Performance
  • Conserve disk resources (blocks and/or inodes)
  • Maintain file organization or data structure

Realistically, many applications will benefit from a balance of these goals. Many applications and existing codes may have built in constraints or rely heavily on certain I/O strategies which impede performance under GPFS on the SP. By using the quantitative comparisons provided in the next section, researchers may determine at what point the burdens of modifying their code are overtaken by sufficient increases in performance.

Parallel I/O access patterns come in many types, e.g.,

  • Structured / Unstructured
  • Synchronized / Asynchronous
  • Transactional and database access patterns.

Applications also vary greatly in their built in assumptions about I/O patterns. In the comparisons that follow we will first focus on the simple block I/O pattern in which n tasks each move a block of double precision numbers to and from disk.

Memory address space

The assumption that the the data is contiguous in the memory address space may not be valid for all applications, but the access times for discontiguous memory organization will typically be orders of magnitude smaller than the I/O times involved. For this reason the primary concern is the structure of the data in the file offset space.

Parallel I/O Strategies

There are many ways to organize the movement of data between memory and disk. Below are diagrams showing four strategies which will be compared in the following sections. Movement of data through MPI is shown in blue and disk I/O is shown in black.

1) serial 2) multiple file
3) POSIX I/O 4) MPI I/O

Which strategy is optimal in terms of performance depends largely on the organization of the data on disk and on the total concurrency of the parallel code. Treating a large number of I/O patterns is not feasible so in this writing we will stick to the three identified previously, treating them in turn.

Contiguous File Structure

Hyperslab decomposition of grids is a common type of partitioning which leads to logically contiguous and typically large regions of data in memory and disk.

E.g. task i owns the data in a multidimensional tensor from some lower_index(i) to some upper_index(i) of the slowest running index.

Contiguous memory spaces
Fortran grid(..., lower_index(i)) through grid(..., upper_index(i))
C grid[lower_index(i)] through grid[upper_index(i)]

Contiguous file spaces
multiple files POSIX I/O

In this case each task does its nbytes of I/O to a separate file or a unique region of a common file.

Multiple FilesPOSIX I/O
  t0 = MPI_Wtime();
  MPI_Barrier(MPI_COMM_WORLD);
  fp=fopen(fname_rank,"w");
  fwrite(data,nbyte,1,fp);
  fclose(fp);
  MPI_Barrier(MPI_COMM_WORLD);
  t1 = MPI_Wtime();
  t0 = MPI_Wtime();
  MPI_Barrier(MPI_COMM_WORLD);
  fd=open(fname_global,O_CREAT|O_RDWR, S_IRUSR|S_IWUSR);
  lseek(fd,(off_t)(rank*nbyte)-1,SEEK_SET);
  write(fd,data,1);
  close(fd);
  MPI_Barrier(MPI_COMM_WORLD);
  t1 = MPI_Wtime();

The performance of these strategies are depicted below.

Optimal I/O strategies for contiguous data

In summary for contiguous I/O patterns the performance have little to do with the topology of data on disk but rather controlling the number of tasks writing concurrently.

Noncontiguous Parallel I/O

This section examines the case of multiple contiguous sections of data. Here both the total size and number of sections are relevant to how to best perform I/O. That is both the concurrency and problem size determine the best I/O algorithm. Block cyclic decompositions common in distributed linear algebra lead to this sort of blocked file structure.

Noncontiguous file structure
decreasing block size ------>

For suffciently small block size the metadata and file locking tasks required from GPFS are expected to impede performance. Each of the I/O operations, taken individually is more complicated for the filesystem than if coordianted at a global level by MPI-I/O. For this reason MPI-I/O should be able to provide the benefit when the block size is small enough.

POSIX-I/O MPI-I/O
   t0 = MPI_Wtime();
   fp=fopen(fname,"w");
   MPI_Barrier(MPI_COMM_WORLD);
   for(i=0;i<n/bn;i++) {
    fseek(fp,
     (off_t)((i*size + rank)*bn*sizeof(DATA_T)),
     SEEK_SET);
    fwrite(data+i*bn,bn*sizeof(DATA_T),1,fp);
   }
   fclose(fp);
   MPI_Barrier(MPI_COMM_WORLD);
   t1= MPI_Wtime();
   t0 = MPI_Wtime();
   MPI_Type_vector(n/bn, bn, size*bn, MPI_DOUBLE, &vectype);
   MPI_Type_commit(&vectype);
   MPI_Type_size(vectype,&bvect); bvect/=sizeof(int);
   MPI_File_open(MPI_COMM_WORLD, fname, 
    MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
   MPI_File_set_view(fh, rank*bn*sizeof(double), MPI_BYTE,
	 vectype, "native", MPI_INFO_NULL);
/*   MPI_File_preallocate(fh,nbyte*size); */
   MPI_File_write_all(fh, data, bvect, MPI_INT, &s);
   MPI_File_sync(fh);
   MPI_File_close(&fh);
   MPI_Type_free(&vectype);
   MPI_Barrier(MPI_COMM_WORLD);
   t1 = MPI_Wtime();

Testing the above strategies on seaborg's $SCRATCH filesystem shows the following results.

MPI_Size Parallel Write Strategies and Rates
16
32
64
128
256
In each graph: x axis is the block size, y axis is total data size. First graph shows winning I/O strategy.

Aspects of parallel I/O are demonstrated above:

  • I/O performance increases as the block size increases. When possible use large/contiguous transfers to disk. An IBM specific file hint exists which when used helps recover the performance difference between the two strategies for very large block sizes. It's notable that the relation between block size and performance is not however monotonic.

  • The important role played by MPI-I/O in aggregating small I/O requests together is seen in the the write performance of MPI-I/O extends further into regions of smaller block size than POSIX I/O.

  • Even in regions of reasonable I/O performance, the average performance for a fixed problem size decreases with concurrency. This is a commonly encountered isoefficeincy issue when scaling up parallel applications. If the increase in parallelism is not matched by an increase in the total data involved then the I/O requests will necessarily be smaller and as a result less efficient.



LBNL Home
Page last modified: Mon, 13 Sep 2004 20:28:59 GMT
Page URL: http://www.nersc.gov/news/reports/technical/seaborg_scaling/io.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov

Privacy and Security Notice
DOE Office of Science