1998 Annual Report
Biological and Environmental Research
Protein Fold Prediction in the Context of Fine-Grained Classifications
Inna Dubchak, Chris Mayor, Sylvia Spengler, and Manfred Zorn, Lawrence Berkeley National Laboratory
Sample fold predictions: (left) representative of beta-trefoil fold; (right) representative of lipocalin fold.
Research Objectives
A key to understanding the function of biological macromolecules such as proteins is determining their three-dimensional structure. Large-scale sequencing projects produce a massive number of putative protein sequences, while the number of known three-dimensional protein structures grows much more slowly. This creates both a need and an opportunity for extracting structural knowledge from sequence databases. Predicting a protein's fold and implied function from its amino acid sequence is a problem of great interest and importance.

Computational Approach
We have developed a neural network (NN) based expert system which, given a classification of protein folds, can assign a protein to a folding class using its primary sequence. It addresses the inverse protein folding problem from a taxonometric rather than a threading perspective. Recent classifications suggest the existence of 300 to 500 different folds. The occurrence of several representatives for each fold allows extraction of the common features of its members. Our method (1) provides a global description of a protein sequence in terms of the biochemical and structural properties of the constituent amino acids; (2) combines the descriptors using NNs, allowing discrimination of members of a given folding class from members of all other folding classes; and (3) uses a voting procedure among predictions based on different descriptors to decide on the final assignment. The level of generalization in this method is higher than in direct sequence-sequence and sequence-structure comparison approaches. Two sequences belonging to the same folding class can differ significantly at the amino acid level, but the vectors of their global descriptors will be located very close together in parameter space. Thus, utilizing these aggregate properties for fold recognition has an advantage over using detailed sequence comparisons.
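The descriptor and voting steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the report does not list the descriptors used, so the amino acid composition vector, the three-way hydrophobicity grouping, and the function names (`amino_acid_composition`, `group_composition`, `descriptor_distance`, `vote`) are all assumptions chosen for illustration. The NN stage is omitted; the per-descriptor fold predictions are assumed to be given.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative three-way hydrophobicity grouping (an assumption, not taken
# from the report).
HYDROPHOBICITY_GROUPS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}


def amino_acid_composition(sequence):
    """Global descriptor: fraction of each of the 20 standard amino acids."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def group_composition(sequence, groups=HYDROPHOBICITY_GROUPS):
    """Global descriptor: fraction of residues falling in each property group."""
    seq = sequence.upper()
    counts = [sum(1 for r in seq if r in members) for members in groups.values()]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no residues from the given groups")
    return [c / total for c in counts]


def descriptor_distance(u, v):
    """Euclidean distance between two descriptor vectors in parameter space."""
    return math.dist(u, v)


def vote(predictions):
    """Return the fold predicted most often, or None on an empty list or a tie."""
    tally = Counter(predictions).most_common()
    if not tally:
        return None
    if len(tally) > 1 and tally[0][1] == tally[1][1]:
        return None
    return tally[0][0]
```

For example, `vote(["beta-trefoil", "lipocalin", "beta-trefoil"])` returns `"beta-trefoil"`. Two sequences with the same residues in a different order, such as `"AACD"` and `"CADA"`, have identical composition vectors, so `descriptor_distance` between them is zero even though a position-by-position comparison would find them different. This is the sense in which global descriptors generalize beyond detailed sequence comparison.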
Accomplishments
To simplify the fold recognition problem and increase the reliability of predictions, we approached a reduced fold recognition problem in which the choice is limited to two folds. Our prediction scheme demonstrated high accuracy in extensive testing on independent sets of proteins. In order to expand the protein database for machine learning and increase the efficiency of recognition, we are porting the Stuttgart Neural Network Simulator (SNNS, http://www-ra.informatik.uni-tuebingen.de/SNNS/) to the T3E. As a first step, SNNS was compiled and run on the NERSC C90 and J90. The fold predictor is available on the web, where users can submit a sequence and receive a report with a detailed prediction of possible folds for that sequence.

Significance
The prediction procedure is simple, efficient, and incorporated into easy-to-use software. It has been applied to fold predictions in the context of fine-grained classifications such as the 3D_ali database and the Structural Classification of Proteins (SCOP) database.

Publications
I. Dubchak, I. Muchnik, C. Mayor, and S.-H. Kim, "Recognition of a protein fold in the context of the SCOP classification," Proteins (submitted, 1998).
I. Dubchak, I. Muchnik, and S.-H. Kim, "Prediction of folds for proteins of unknown function in three microbial genomes," Microbial and Comparative Genomics 3, 171-175 (1998).