National Energy Research Scientific Computing Center 2004 Annual Report

Science-Driven Services
NERSC’s Science-Driven Services are designed to strike a balance between meeting the special needs of leading-edge computational research projects and providing responsive, comprehensive services for routine operations. Major issues addressed in NERSC’s five-year plan include working with users to narrow the gap between the peak performance of terascale computing systems and the performance realized by scientific applications; providing special support for large-scale projects while at the same time maintaining the high level of support for all users; and helping users make productive use of the flood of scientific data from simulations and experiments.
Aspects of this strategy discussed below include support for large-scale projects, development of new data storage and retrieval strategies, and ongoing improvements in NERSC’s everyday operations and services based on feedback from clients.
Support for Large-Scale Projects
NERSC works directly with scientists on major projects that require extensive scientific computing capabilities, such as the SciDAC and INCITE collaborations. These projects are often characterized by large collaborations, the development of community codes, and the involvement of computer scientists and applied mathematicians. In addition to high-end computing, these large projects handle issues in data management, data analysis, and data visualization, as well as automation features for resource management.
NERSC provides its highest level of support to these researchers, including special service coordination for queues, throughput, increased limits, etc.; and specialized consulting support, which may include algorithmic code restructuring to increase performance, I/O optimization, visualization support—whatever it takes to make the computation scientifically productive. The three INCITE projects for 2005 are good examples of this kind of support.
The INCITE project “Magneto-Rotational Instability and Turbulent Angular Momentum Transport,” led by Fausto Cattaneo of the University of Chicago, is attempting to understand the forces that help newly born stars and black holes increase in size by simulating laboratory experiments that study magnetically caused instability. “With the help of NERSC staff, we were able to tune our software for Seaborg’s hardware and realize performance improvements that made additional simulations possible,” Cattaneo said. NERSC also provided crucial help in creating animated visualizations of the simulation results, which involve the formation of complex, three-dimensional structures that need to be seen to be understood.
For the INCITE project “Direct Numerical Simulation of Turbulent Nonpremixed Combustion,” Jacqueline Chen, Evatt Hawkes, and Ramanan Sankaran of Sandia National Laboratories have performed the first 3D direct numerical simulations of a turbulent nonpremixed flame with detailed chemistry. After analyzing and optimizing the code’s performance with the help of NERSC staff, the researchers improved the code’s efficiency by 45%. The simulations generated 10 TB of raw data, and NERSC consultants helped the researchers figure out the best strategy for efficiently transferring all that data from NERSC systems to the researchers’ local cluster. “The assistance we received from the NERSC computing staff in optimizing our code and with terascale data movement has been invaluable,” Chen said. “The INCITE award has enabled us to extend our computations to three dimensions so that we may investigate interactions between turbulence, mixing, and finite-rate detailed chemistry in combustion.”
The “Molecular Dynameomics” INCITE project, led by Valerie Daggett of the University of Washington, is an ambitious attempt to use molecular dynamics simulations to characterize and catalog the folding/unfolding pathways of representative proteins from all known protein folds. David Beck, a graduate student in the Daggett lab, worked with NERSC consultants to optimize the performance of the group’s code on Seaborg. “The INCITE award gave us a unique opportunity to improve the software, as well as do good science,” Beck said. Improvements included load balancing, which sped up the code by 20%, and parallel efficiency, which reached 85% on 16-processor nodes. The INCITE award enabled the team to do five times as many simulations as they had previously completed using other computing resources. “We are quite satisfied with our experience at NERSC,” Daggett commented.
Archiving Strategies for Genome Researchers
When researchers at the Production Genome Facility of DOE’s Joint Genome Institute (JGI) realized they were generating data faster than they could find space to store the files, a collaboration with NERSC’s Mass Storage Group developed strategies for improving the reliability of data storage while also making retrieval easier.
JGI is one of the world’s leading facilities in the scientific quest to unravel the genetic data that make up living things. With advances in automatic sequencing of genomic information, scientists at the JGI’s Production Genome Facility (PGF) found themselves overrun with sequence data, as their production capacity had grown so rapidly that data had overflowed the existing storage capacity (Figure 13). Since the resulting data are used by researchers around the world, PGF has to ensure that the data are reliably archived as well as easily retrievable.
Figure 13. JGI’s automated sequencing facilities threatened to produce more genomic data than they could store or manage.
Figure 14. NERSC’s High-Performance Storage System (HPSS) is capable of storing and retrieving all of JGI’s genomic data efficiently.
As one of the world’s largest public DNA sequencing facilities, the PGF produces 2 million files per month of trace data (25 to 100 KB each), 100 assembled projects per month (50 MB to 250 MB), and several very large assembled projects per year (~50 GB). In aggregate, this averages about 2,000 GB per month.
In addition to the amount of data, a major challenge is the way the data are produced. Data from the sequencing of many different organisms are produced in parallel each day, resulting in a daily archive that spreads the data for a particular organism over many tapes.
DNA sequences are considered the fundamental building blocks for the rapidly expanding field of genomics. Constructing a genomic sequence is an iterative process. The trace fragments are assembled, and then the sequence is refined by comparing it with other sequences to confirm the assembly. Once the sequence is assembled, information about its function is gleaned by comparing and contrasting the sequence with other sequences from both the same organism and other organisms. Current sequencing methods generate a large volume of trace files that have to be managed—typically 100,000 files or more. And to check for errors in the sequence or make detailed comparisons with other sequences, researchers often need to refer back to these traces. Unfortunately, these traces are usually provided as a group of files with no information as to where the traces occur in the sequence, making the researchers’ job more difficult.
This problem was compounded by the PGF’s lack of sufficient online storage, which made organization (and subsequent retrieval) of the data difficult and led to unnecessary replication of files. This situation required significant staff time to move files and reorganize filesystems to find sufficient space for ongoing production needs; and it required auxiliary tape storage that was not particularly reliable.
Staff from NERSC’s Mass Storage Group and the PGF agreed to work together to address two key issues facing the genome researchers. The most immediate goal was for NERSC’s HPSS to become the archive for the JGI data, replacing the less-reliable local tape operation and freeing up disk space at the PGF for more immediate production needs (Figure 14). The second goal was to collaborate with JGI to improve the data handling capabilities of the genome sequencing and data distribution processes.
NERSC storage systems are robust and available 24 hours a day, seven days a week, as well as highly scalable and configurable. NERSC has high-quality, high-bandwidth connectivity to the other DOE laboratories and major universities provided by ESnet.
Most of the low-level data produced by the PGF are now routinely archived at NERSC, with ~50 GB of raw trace data being transferred from JGI to NERSC each night.
The techniques used in developing the archiving system allow it to be scaled up over time as the amount of data continues to increase; up to billions of files can be handled with these techniques. The data have been aggregated into larger collections, each of which holds tens of thousands of files in a single file in the NERSC storage system. These data can now be accessed as one large file, or each individual file can be accessed without retrieving the whole aggregate.
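The report does not describe the container format used for these aggregates. As an illustration only, the general idea of packing many small files into one large file while keeping each member individually retrievable can be sketched in Python as follows. The function names, the in-memory byte stream, and the offset index are all assumptions made for this sketch and do not represent the actual NERSC or JGI software.

```python
import io


def pack_files(files):
    """Concatenate many small payloads into one aggregate.

    `files` maps a (hypothetical) trace file name to its bytes.
    Returns the aggregate bytes and an index of name -> (offset, length).
    """
    buffer = io.BytesIO()
    index = {}
    for name, payload in files.items():
        # Record where this member starts and how long it is.
        index[name] = (buffer.tell(), len(payload))
        buffer.write(payload)
    return buffer.getvalue(), index


def read_member(stream, index, name):
    """Read one member by seeking to its offset, without reading the rest."""
    offset, length = index[name]
    stream.seek(offset)
    return stream.read(length)


# Illustrative data only: three tiny stand-ins for trace files.
traces = {
    "trace_0001": b"ACGTACGT",
    "trace_0002": b"TTGACC",
    "trace_0003": b"GGCATTAGC",
}
aggregate, trace_index = pack_files(traces)
one_trace = read_member(io.BytesIO(aggregate), trace_index, "trace_0002")
```

In this sketch the aggregate can be moved or stored as a single object, while the index lets a reader seek directly to one member. A production archive would also need to store the index durably alongside the aggregate and verify checksums.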
Not only will the new techniques be able to handle future data, they also helped when the PGF staff discovered raw data that had been previously processed by software that had an undetected bug. The staff were able to retrieve the raw data from NERSC and reprocess it in about a month and a half, rather than go back to the sequencing machines and produce the data all over again—which would have taken about six months. In addition to saving time, this also saved money—a rough estimate is that the original data collection comprised up to 100,000 files per day at a cost of $1 per file, which added up to $1.2 million for processing six months’ worth of data. Comparing this figure to the cost of a month and a half of staff time, the estimated savings are about $1 million—and the end result is a more reliable archive.
User Survey Provides Valuable Feedback
The results from the 2005 user survey show generally high satisfaction with NERSC’s systems and support. Areas with the highest user satisfaction include account support services, the reliability and uptime of the HPSS mass storage system, and HPC consulting. The largest increases in satisfaction over last year’s survey include the NERSC CVS server, the Seaborg batch queue structure, PDSF compilers, Seaborg uptime, available computing hardware, and network connectivity.
Areas with the lowest user satisfaction include batch wait times on both Seaborg and Jacquard, Seaborg’s queue structure, PDSF disk stability, and Jacquard’s performance and debugging tools. Only three areas were rated significantly lower this year: PDSF overall satisfaction and uptime, and the amount of time taken to resolve consulting issues. The introduction of three major systems in the last year, combined with a reduction in consulting staff, explains the latter.
Eighty-two users answered the question “What does NERSC do well?” Forty-seven respondents stated that NERSC gives them access to powerful computing resources without which they could not do their science; 32 mentioned excellent support services and NERSC’s responsive staff; 30 pointed to very reliable and well managed hardware; and 11 said “Everything.”
Sixty-five users responded to “What should NERSC do differently?” The areas of greatest concern are the interrelated issues of queue turnaround times (24 comments), job scheduling and resource allocation policies (22 comments), and the need for more or different computational resources (17 comments). Users also voiced concerns about data management, software, group accounts, staffing, and allocations.
As in the past, comments from the previous survey led to changes in 2005, including a restructuring of Seaborg’s queuing policies, the addition of the new Jacquard and Bassi clusters, the upgrade of ESnet’s connectivity to NERSC to 10 gigabits per second, and the installation of additional visualization software.
The complete survey results can be found at https://www.nersc.gov/news/survey/2005/.