1998 Annual Report
Grand Challenge Projects

High-Throughput Analysis Engine for Large-Scale Genome Annotation

E. Uberbacher and R. Mural, Oak Ridge National Laboratory
M. Zorn and S. Spengler, Lawrence Berkeley National Laboratory

The Genome Channel provides online access to genome sequence information.
The graphics-based system allows users to zoom in on areas of interest
(in this example, human chromosome X).

Research Objectives

Interpretation of the human genome represents the next grand challenge at the interface of computing and biology. Many genome sequencing projects are producing sequence data at a rate that exceeds current analysis capabilities, so new methods and infrastructure are needed to analyze and manage these data effectively. Our overall objective is to design and implement a distributed computational framework for the genome community that will provide users with services, tools, and infrastructure for high-quality analysis and annotation of large amounts of genomic sequence data.

The Analysis and Annotation Engine has three main components: a number of services, a broker that oversees task distribution, and a data warehouse. The services are implemented through distributed object technology. We will use state-of-the-art computational technologies, algorithms, and data management techniques to provide biologists with as much information about a sequence as is feasible at any given time, and to provide mechanisms for updating descriptions of genomic regions over time. This framework will make maximal use of existing tools and database systems and will integrate services across many resources. It will address software sharing and reuse, generic interfaces to analysis tools, and methods for interoperation between analysis systems. These issues have not been adequately addressed in informatics development to date.
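The relationship between services, broker, and data warehouse can be illustrated with a minimal in-process sketch. It is not the project's implementation: the report describes distributed objects, whereas this example uses plain Python objects in one process, and every name in it (Annotation, Warehouse, Broker, gc_content, sequence_length) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    """One result produced by one service for one sequence (hypothetical type)."""
    service: str
    sequence_id: str
    result: object


@dataclass
class Warehouse:
    """Stands in for the data warehouse: keeps annotations grouped by sequence."""
    records: Dict[str, List[Annotation]] = field(default_factory=dict)

    def store(self, annotation: Annotation) -> None:
        self.records.setdefault(annotation.sequence_id, []).append(annotation)


class Broker:
    """Distributes each submitted sequence to every registered service."""

    def __init__(self, warehouse: Warehouse) -> None:
        self.warehouse = warehouse
        self.services: Dict[str, Callable[[str], object]] = {}

    def register(self, name: str, service: Callable[[str], object]) -> None:
        self.services[name] = service

    def submit(self, sequence_id: str, sequence: str) -> List[Annotation]:
        # Run services in registration order and store each result.
        for name, service in self.services.items():
            self.warehouse.store(Annotation(name, sequence_id, service(sequence)))
        return self.warehouse.records.get(sequence_id, [])


def gc_content(sequence: str) -> float:
    """Toy analysis service: fraction of G and C bases in the sequence."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def sequence_length(sequence: str) -> int:
    """Toy analysis service: number of bases in the sequence."""
    return len(sequence)


warehouse = Warehouse()
broker = Broker(warehouse)
broker.register("gc_content", gc_content)
broker.register("length", sequence_length)
annotations = broker.submit("seq1", "ATGCGC")
```

Because each service is registered behind the same call signature, a new analysis tool can be added without changing the broker, which is one simple reading of the "generic interfaces to analysis tools" goal stated above.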

Computational Approach

In a phased development over the course of the project, we will construct a framework for an Analysis and Annotation Engine. The work covers both the design and construction of a genome analysis environment based on distributed object technology and the issues of interoperability that such an environment raises. Data collection agents visit the web sites of various genome centers and retrieve new sequence data for processing by the analysis services.

The analysis services include high-performance implementations of a number of commonly used sequence analysis programs, together with new algorithms and software developed specifically to make the annotation as complete as possible. Data mining services collect links and relevant information from outside databases and web sites. A series of data marts provides data management and storage.

Accomplishments

The team developed a prototype web-based framework, The Genome Channel, that shows the current progress of the international sequencing effort and allows navigation through the data down to individual sequences and gene annotations. The team at NERSC is working on the following tasks: (1) developing a CORBA-based analysis framework to facilitate automation of the genome annotation process; (2) preparing to utilize the NERSC T3E for production analysis; (3) developing specialized software and databases, such as a protein fold predictor to gauge possible structural folds for a predicted gene, an alternative splicing database to sharpen the prediction of alternatively spliced genes, and a motif processor to identify possible functional motifs in the sequence.
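As an illustration of what a motif processor does, the sketch below scans a protein sequence for one well-known pattern, the N-glycosylation site written in PROSITE notation as N-{P}-[ST]-{P}. The report does not describe how the project's motif processor works or which motifs it uses, so the motif table, the function name scan_motifs, and the assumption that the input is a protein sequence are all illustrative.

```python
import re
from typing import List, Tuple

# Hypothetical motif table. The single entry encodes the PROSITE pattern
# N-{P}-[ST]-{P}; the lookahead lets overlapping sites be reported.
MOTIFS = {
    "N_GLYCOSYLATION": re.compile(r"(?=(N[^P][ST][^P]))"),
}


def scan_motifs(protein: str) -> List[Tuple[str, int, str]]:
    """Return (motif name, 1-based start position, matched residues) per hit."""
    hits = []
    sequence = protein.upper()
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(sequence):
            hits.append((name, match.start() + 1, match.group(1)))
    return hits


hits = scan_motifs("MKNGSAPNPTQ")
print(hits)  # [('N_GLYCOSYLATION', 3, 'NGSA')]
```

The second asparagine in the example (position 8) is followed by a proline, so the pattern correctly rejects it.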

Significance

The direct significance of this work is that it enhances biological knowledge by providing comprehensive annotation of the human genome and other genomes. The Genome Channel overview also shows the current progress of the international human genome sequencing efforts.

