1998 Annual Report
Grand Challenge Projects
High-Throughput Analysis Engine for Large-Scale Genome Annotation
E. Uberbacher and R. Mural, Oak Ridge National Laboratory
The graphics-based system allows users to zoom in on areas of interest (in this example, human chromosome X).
Research Objectives
Interpretation of the human genome represents the next grand challenge
at the interface of computing and biology. Many genome sequencing
projects are producing sequence data at a rate that exceeds current
analysis capabilities. New methods and infrastructure need to
be implemented for effective analysis and management of this data.
Our overall objective is to design and implement a distributed
computational framework for the genome community that will provide
users with services, tools, and infrastructure for high-quality
analysis and annotation of large amounts of genomic sequence data.
The main components of the Analysis and Annotation Engine consist
of a number of services, a broker that oversees task distribution,
and a data warehouse, with services implemented through distributed
object technology. We will use state-of-the-art computational
technologies, algorithms, and data management techniques to provide
biologists with as much information about a sequence as is feasible
at any given time and to provide mechanisms for updating descriptions
of genomic regions over time. This framework will make maximal
use of existing tools and database systems and integrate services
across many resources. It will address issues of software sharing
and reuse, generic interfaces to analysis tools, and methods for
analysis system interoperation. These issues have not been addressed
adequately in informatics developments.
Computational Approach
In a phased development over the course of the project, we will construct a framework for an Analysis and Annotation Engine, addressing both the design and construction of a genome analysis environment based on distributed object technology and issues of interoperability. Data collection agents visit web sites of various genome centers and retrieve new sequence data for processing by the analysis services.
The analysis includes high-performance implementations
of a number of commonly used sequence analysis programs and development
of specific new algorithms and software to provide the most complete
annotation possible. Data mining services deal with collecting links and
relevant information from outside databases and web sites. A series
of data marts serves as data management and storage facilities.
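The relationship between the broker, the analysis services, and the data marts can be illustrated with a minimal sketch. It is not the project's implementation: the report describes services built on distributed object technology, whereas this example runs everything in one Python process. All names (`Broker`, `DataMart`, `Annotation`, `gc_content`, `start_codons`) are hypothetical, and the two toy services stand in for real sequence analysis programs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    """One result produced by one analysis service (hypothetical type)."""
    service: str
    result: object


@dataclass
class DataMart:
    """Toy in-memory store keyed by sequence identifier (an assumption;
    the report does not describe the actual storage layer)."""
    records: Dict[str, List[Annotation]] = field(default_factory=dict)

    def store(self, seq_id: str, annotation: Annotation) -> None:
        self.records.setdefault(seq_id, []).append(annotation)


class Broker:
    """Hands each submitted sequence to every registered service
    and stores the results in the data mart."""

    def __init__(self, mart: DataMart) -> None:
        self.mart = mart
        self.services: Dict[str, Callable[[str], object]] = {}

    def register(self, name: str, service: Callable[[str], object]) -> None:
        self.services[name] = service

    def submit(self, seq_id: str, sequence: str) -> List[Annotation]:
        for name, service in self.services.items():
            self.mart.store(seq_id, Annotation(name, service(sequence)))
        return self.mart.records.get(seq_id, [])


def gc_content(sequence: str) -> float:
    """Toy service: fraction of G and C bases in the sequence."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def start_codons(sequence: str) -> List[int]:
    """Toy service: 0-based positions of every ATG triplet."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


if __name__ == "__main__":
    broker = Broker(DataMart())
    broker.register("gc_content", gc_content)
    broker.register("start_codons", start_codons)
    for annotation in broker.submit("demo", "ATGCATGC"):
        print(annotation.service, annotation.result)
```

In this sketch a new analysis tool is added by registering another callable with the broker, which loosely mirrors the generic tool interfaces the report aims for. A distributed version would replace the direct function calls with remote invocations.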
Accomplishments
The team developed a prototype web-based framework, The Genome
Channel, that shows the current progress of the international
sequencing effort and allows navigation through the data down
to individual sequences and gene annotations. The team at NERSC
is working on the following tasks: (1) developing a CORBA-based
analysis framework to facilitate automation of the genome annotation
process; (2) preparing to utilize the NERSC T3E for production
analysis; (3) developing specialized software and databases, such
as a protein fold predictor to gauge possible structural folds
for a predicted gene, an alternative splicing database to sharpen
the prediction of alternatively spliced genes, and a motif processor
to identify possible functional motifs in the sequence.
Significance
The direct significance of this work is to enhance biological
knowledge by providing comprehensive annotation of the human and
other genomes. The Genome Channel overview also shows the current
progress of the international human genome sequencing efforts.