BMC Bioinformatics

台中勝華科技 紀品顥

BMC Bioinformatics Software Open Access A fast SCOP fold classification system using content-based E-Predict algorithm Pin-Hao Chi1, Chi-Ren Shyu*1 and Dong Xu2 Address: 1Medical and Biological Digital Library Research Lab, Department of Computer Science, University of Missouri, Columbia, MO 65211, USA and 2Digital Biology Laboratory, Department of Computer Science and Life Sciences Center, University of Missouri, Columbia, MO 65211, USA Email: Pin-Hao Chi - pinhao@diglib1.cecs.missouri.edu; Chi-Ren Shyu* - shyuc@missouri.edu; Dong Xu - xudong@missouri.edu * Corresponding author Abstract Background: Domain experts manually construct the Structural Classification of Protein (SCOP) database to categorize and compare protein structures. Even though using the SCOP database is believed to be more reliable than classification results from other methods, it is labor intensive. To mimic human classification processes, we develop an automatic SCOP fold classification system to assign possible known SCOP folds and recognize novel folds for newly-discovered proteins. Results: With a sufficient amount of ground truth data, our system is able to assign the known folds for newly-discovered proteins in the latest SCOP v1.69 release with 92.17% accuracy. Our system also recognizes the novel folds with 89.27% accuracy using 10 fold cross validation. The average response time for proteins with 500 and 1409 amino acids to complete the classification process is 4.1 and 17.4 seconds, respectively. By comparison with several structural alignment algorithms, our approach outperforms previous methods on both the classification accuracy and efficiency. Conclusion: In this paper, we build an advanced, non-parametric classifier to accelerate the manual classification processes of SCOP. With satisfactory ground truth data from the SCOP database, our approach identifies relevant domain knowledge and yields reasonably accurate classifications. Our system is publicly accessible at http://ProteinDBS.rnet.missouri.edu/EPredict. php. Background Protein structure classification is well-known to be an important research topic in computational and molecular biology. Through the use of structural classification, life science researchers and biologists are able to study evolutionary evidence from similar proteins that have been conserved in multiple species. In addition, similar 3-D conformations of enzyme active sites and binding sites may correlate with biochemical functions [1]. In recent years, structural genomics projects [2-5] have aimed to link protein sequences to possible functions via highthroughput techniques such as X-ray crystallography and nuclear magnetic resonance (NMR) that determine 3-D protein structures. With a large-scale set of newly-discovered structures, a system that classifies similar protein structures with high efficiency and accuracy becomes an indispensable requirement to the study of structure-tofunction relationships. Published: 26 July 2006 BMC Bioinformatics 2006, 7:362 doi:10.1186/1471-2105-7-362 Received: 29 December 2005 Accepted: 26 July 2006 This Article is available from: http://www.biomedcentral.com/1471-2105/7/362 c 2006 Chi et al; licensee BioMed Central Ltd. This is an Open Access Article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. BMC Bioinformatics 2006, 7:362 http://www.biomedcentral.com/1471-2105/7/362 Page 2 of 17 (page number not for citation purposes) Several classification systems categorize proteins based on structural similarities. The Class, Architecture, Topology, Homologous Superfamily (CATH) database [6] is constructed by applying the Secondary Structure Alignment Program (SSAP) [7], which consists of a double dynamic programming technique to find the optimal structural alignment of two proteins. The Fold Classification based on Structure-Structure Alignment of Proteins (FSSP) database [8] is built based on the Distance Alignment (DALI) [9] algorithm that applies Monte Carlo heuristics to compare structural similarities from 2-D distance matrices mapped from 3-D protein structures. Generally, these systems rely on the structural alignment algorithms to measure the similarity of two proteins, which is known to be of complexity NP-Hard [10]. To reduce the computational effort of scanning large-scale protein databases, those structural alignment algorithms need to apply heuristics with trade-offs which may return divergent results from the same query protein. At present, the Structural Classification of Protein (SCOP) database [11], which is manually constructed by human experts, is believed to contain the most accurate structural classifications. In the SCOP database, proteins with similar domain structures are usually clustered into the same fold hierarchy. Even though manual classification provides reliable results, it is labor intensive. As of May 30th, 2006, 10864 newly-discovered proteins deposited in the Protein Data Bank (PDB) [12] have not been classified in the latest SCOP v1.69 release. The number of newly-discovered proteins is increasing continuously. Recent studies [13,14] apply a consensus scheme to classify the SCOP folds for newly-discovered proteins by intersecting multiple classification results from classical structural alignment algorithms such as DALI [9], Combinatorial Extension (CE) [15] and VAST [16]. These consensus approaches yield higher classification accuracies than each individual method. However, a combination of structural alignment algorithms is computationally expensive. To accelerate the manual classification process of SCOP, there is an urgent need to develop a fast, automated SCOP fold classification system with a reasonably high accuracy. By extending our recent works with the real-time tertiary structure retrieval system, ProteinDBS [17-19], we have already studied an efficient model of association rule (AR) mining to identify relevant structural patterns in proteins for SCOP domain and fold classifications [20]. In this paper, we further develop a nonparametric classifier to conduct the SCOP fold classifications with better accuracy and efficiency. Our contribution is to introduce a real-time classification model, EPredict, that applies the E_Measure metric [21] from the Information Retrieval (IR) field to assign the known SCOP folds and recognize the novel folds for newly-discovered proteins. In the past, a number of systems have been developed to assign a protein structure to an existing fold or recognize it as a novel fold. For example, DALI [22] uses Z-score of the best structural match to either assign a structure to a known fold (Z>2) or novel fold (Z≤2). Other programs, such as CE [15] and VAST [23] can perform similar tasks. However, the computational effort associated with those methods prevents a user from exploring the protein structure database in real time. Results There are two important tasks for the SCOP fold classifications. 1) Known SCOP Fold Assignments: the algorithm assigns newly-discovered protein structures into the known SCOP folds. 2) Novel SCOP Fold Recognitions: the algorithm detects whether or not newly-discovered protein structures should be categorized into the novel folds. Given two SCOP database releases v1 and v2 (v1 ⊂ v2), denotes a set of newly-discovered proteins in v2 that have not been identified in v1. The proteins from will be partitioned into either the known SCOP folds of v1 ( ), or the novel folds that have not been determined prior to v2 ( ), where . In our experiments, we measure the classification accuracy for proteins from , and then we gauge the accuracy for classifying proteins from . Finally, we report the efficiency of SCOP fold classifications. Assigning newly-discovered proteins to the known folds We conduct three experiments for classifying newly-discovered proteins into the known folds. The first experiment compares our classification model, E-Predict, with several methods reported in a recent work [13] such as CE, DALI, VAST and CBOOST. Our test data shown in Table 1 is the same test set used in their work, which has proteins with average sequence identities equal to 16.88% and average sequence similarities equal to 20.76% by conducting all against all pairwise alignments using EMBOSSAlign [24] algorithm. The same ground truth data with their work includes proteins from the entire SCOP vl.59 release. To evaluate the accuracy, we use a general metric, Correct Classification Rate (CCR), which is defined as follows: )v v 1 2 2v v 1 2 2v v known 1 2 , ,v v novel 1 2 , ,,,,,,v v known v v novel v v 1 2 1 2 1 , , 2 ∪ v v known 1 2 , ,v v novel 1 2 , CCR The number of correctly classified proteins The tota T l number of test proteins l11 BMC Bioinformatics 2006, 7:362 http://www.biomedcentral.com/1471-2105/7/362 Page 3 of 17 (page number not for citation purposes) Figure 1 shows that E-Predict outperforms DALI, CE, and VAST, exhibiting an accuracy of 64.86%. Can et al. [13] have proposed a method, named CBOOST, which utilizes a decision tree to integrate DALI, CE, and VAST, achieving the same accuracy of 64.86%. It is worth mentioning that the computationally expensive structural alignment algorithms of CBOOST may not be able to efficiently classify a large number of newly-discovered proteins generated from on-going, high-throughput structure determination projects. The second experiment exhaustively evaluates the accuracy of E-Predict on several general test sets from to . In Table 2 and Table 3, our test proteins in are selected from the known SCOP folds of v2, which also maintain at least one protein chain and 10 proteins in v1, respectively. Figure 2(a) shows that E-Predict achieves 72% to 82% classification accuracies for the general test sets of seven SCOP releases. According to Figure 3, there exists a large number of SCOP folds with small sizes. When a newly-discovered protein belongs to a small-size fold, there is a limited amount of ground truth data available. In machine learning, classifiers usually require sufficient ground truth data to guarantee the accuracy. Figure 2(b) demonstrates that E-Predict is able to achieve much higher accuracies, 90% to 96%, for the general test sets of seven SCOP releases with more than 10 ground truth proteins. In the future, when newlydiscovered protein structures are categorized into those small-size SCOP folds, the accuracy of E-Predict could be further improved. The third experiment evaluates the accuracy of E-Predict on non-redundant test sets, which are obtained from randomly sampling one protein chain among each SCOP superfamily. In Table 2 and Table 3, a non-redundant test set is defined by randomly selecting one protein from each SCOP superfamily of the general test set . According to SCOP [11], proteins between two different SCOP superfamilies have low sequence similarities, which suggest that test proteins in our non-redundant sets should maintain low sequence similarities. Table 4 measures the degree of sequence redundancy for 10 pairs of proteins, which are randomly sampled from the nonredundant set with the average sequence identity and sequence similarity equal to 12.55% and 21.17%, respectively. In addition, the experiment using the non-redundant test sets avoids the case that some folds in the general test sets predominate the classification accuracy with relatively more test proteins. For example, there are 900 out of 1000 test proteins in a general test from the same SCOP fold f1. The quantity of this fold may affect the accuracy significantly when a majority of these 900 proteins are correctly classified. In Figure 2(a), E-Predict presents a reduction of accuracies on several sets of non-redundant proteins in comparison ov general v known 1 55 1 57 . , . , .v general v known 1 67 1 69 . , . , .v v known 1 2 , ,v non redundant v known 1 2 , , , ,v general v known 1 2 , , ,v nonredundant v known 1 67 1 69 . , . , . The tFeisgtu pCrroeor rt1eecitn sC lians sTifaicbaleti o1n Rate of assigning the known folds for The Correct Classification Rate of assigning the known folds for test proteins in Table 1. Table 1: A test set that contains 37 protein chains from [13]. pdb_id fold_id pdb_id fold_id pdb_id fold_id pdb_id fold_id pdb_id fold_id 1gyz_A 63569 1key_A 48370 1key_B 48370 1key_C 48370 1key_D 48370 1lkv_X 48370 1ldk_A 48370 1ifr_A 48725 1ivt_A 48725 1gyv_A 48725 1gyu_A 48725 1iu1_A 48725 1iu1_B 48725 1gyw_A 48725 1gyw_B 48725 1l6p_A 48725 1lpl_A 50036 1k3b_A 50875 1gyh_A 50933 1gyh_B 50933 1gyh_C 50933 1gyh_D 50933 1gyh_E 50933 1gyh_F 50933 1gyd_B 50933 1gye_B 50933 1jof_A 50964 1jof_B 50964 1jof_C 50964 1jof_D 50964 1jof_E 50964 1jof_F 50964 1jof_G 50964 1Jof_H 50964 1l2q_A 51350 1ln4_A 55199 1kuu_A 56234 5v v known 1 59 1 61 . . , BMC Bioinformatics 2006, 7:362 http://www.biomedcentral.com/1471-2105/7/362 Page 4 of 17 (page number not for citation purposes) with the general test sets in Table 2, which includes smallsize folds. This gap demonstrates that the impact of some SCOP folds with outnumbered proteins in the general test sets improves the overall accuracy. Figure 2(b) shows that E-Predict exhibits similar accuracies on seven sets of the non-redundant proteins in comparison with the general test sets in Table 3, which have at least 10 ground truth proteins. This suggests that with a sufficient amount of ground truth data non-redundant proteins can still be classified with a reasonably high accuracy. Recognizing the novel folds for newly-discovered proteins We measure the accuracies of classifying six sets of proteins with the novel folds from to , which are listed in Table 2. We accumulate labeled proteins from the prior SCOP releases to obtain more ground fv v novel 1 57 1 59 . . , .v v novel 1 67 1 69 . . , Table 3: The number of proteins in general and non-redundant test sets in which are selected from the known SCOP folds of v2 with at least 10 protein chains in v1. test set size (#proteins) test set size (#proteins) 1832 158 1901 168 2136 166 1947 189 2062 198 4735 302 2298 263 2v v known 1 2 , ,v general v known 1 55 1 57 . , . , .v nonredundant v known 1 55 1 57 . , . , . .v general v known 1 57 1 59 . , . , .v nonredundant v known 1 57 1 59 . , . , . .v general v known 1 59 1 61 . , . , .v nonredundant v known 1 59 1 61 . , . , . .v general v known 1 61 1 63 . , . , .v nonredundant v known 1 61 1 63 . , . . .v general v known 1 63 1 65 . , . , .v nonredundant v known 1 63 1 65 . , . , . .v general v known 1 65 1 67 . , . , .v nonredundant v known 1 65 1 67 . , . , . .v general v known 1 67 1 69 . , . , .v nonredundant v known 1 67 1 69 . , . , . Table 2: The number of proteins in a test set of novel folds, general and non-redundant test sets in which are selected from the known SCOP folds of v2 with at least one protein chain in v1. test set size (#proteins) test set size (#proteins) test set size (#proteins) 4192 442 - - 4047 431 94 4547 468 10 5226 491 190 5445 494 48 10521 736 215 5604 585 86 5v v known 1 2 , ,v general v known 1 55 1 57 . , . , .v nonredundant v known 1 55 1 57 . , . , . .v general v known 1 57 1 59 . , . , .v nonredundant v known 1 57 1 59 . , . , ...v v novel 1 57 1 59 . . , .v general v known 1 59 1 61 . , . , .v nonredundant v known 1 59 1 61 . , . , ...v v novel 1 59 1 61 . . , .v general v known 1 61 1 63 . , . , .v nonredundant v known 1 61 1 63 . , . ...v v novel 1 61 1 63 . . , .v general v known 1 63 1 65 . , . , .v nonredundant v known 1 63 1 65 . , . , ...v v novel 1 63 1 65 . . , .v general v known 1 65 1 67 . , . , .v nonredundant v known 1 65 1 67 . , . , ...v v novel 1 65 1 67 . . , .v general v known 1 67 1 69 . , . , .v nonredundant v known 1 67 1 69 . , . , ...v v novel 1 67 1 69 . . , BMC Bioinformatics 2006, 7:362 http://www.biomedcentral.com/1471-2105/7/362 Page 5 of 17 (page number not for citation purposes) truth data. For example, when an experiment is conducted with test proteins from , our ground truth data is composed of new proteins from . We compare our E-Predict algorithm with two prevalent classification methods, Nearest Neighbor search (NN) [25] and C4.5 Decision Tree (DT) [26]. Figure 4 presents a plot of CCR against six test sets to , which are listed in Table 2. From computational results, E-Predict outperforms NN and C4.5 DT. There is a noticeable reduction in accuracy when classifying proteins in . This is probably because the test set, , is harder to be correctly predicted than the other sets. To address the issue that accuracies may be biased by particular new structures, we conduct 10 fold cross validation that sequentially selects 10% of ground truth data from as a test set and the rest of 90% of ground truth data as a training set for 10 times. In the 10 fold experiment, our approach achieves 89.27% accuracy of the novel fold recognitions. Efficiency For efficiency, we measure the average response time of the entire classification process, including feature extraction, nearest neighbor search on an M-tree [27], and the computation of the SCOP folds by the E-Predict algorithm. The classification process performs one-against-all structural comparisons by scanning the entire SCOP database. Our system runs on a Fedora-Core Linux system with Dual Xeon IV 2.4 GHz processors and 2 GB RAM. A large-scale test set is chosen from the SCOP vl.69 release with 51911 protein chains which have more than 20 amino acids. Figure 5 shows the average response time of fold classifications for various protein chain sizes. When the protein size increases, the E-Predict algorithm demands more computational resources to extract features from larger distance matrices. When the protein chain size reaches a certain threshold, the Linux system may swap huge distance matrices into the virtual memory resulting in a significant I/O time. This effect is reflected in Figure 5 with long computation times for the protein chain size larger than 1099 amino acids, where more memory is required to prevent page swapping. On average, classifying a newly-discovered protein to a SCOP fold cv v novel 1 67 1 69 . . , .v v 1 55 1 67 . . .v v novel 1 57 1 59 . . , .v v novel 1 67 1 69 . . , .v v novel 1 65 1 67 . . , .v v novel 1 65 1 67 . . , .v v 1 55 1 69 . . SFTChigeOu aPrme f o3ludnst ino ft hpero StCeiOnsP i nv 1t.h6e9 froeldlesa saegainst the number of The amount of proteins in the folds against the number of SCOP folds in the SCOP v1.69 release. gTreehdneue Cnradolra ranentcd tt eCnsolatn ss-serifteic dianut in owdna hRnicath tte ea osrtef s asesetsl ieigncn t ienwdgh tifcrhohe m akr netoh sewe nlke ncfootewlddns f frSooCrm Ov atPhr iefoo ukldsn soS owCfnO v 2SP Cw rOietlheP a afstoe llsde sau ssoti fno vgn Ee -Pprreodtiectin o cnh (aain) gine nve1r (aTl aabnlde 2n)o n(b-) (FTigabulree 3 2) 2 with at least 10 protein chains in v1 The Correct Classification Rate of assigning the known folds for various SCOP releases using E-Predict on (a) general and nonredundant test set in which are selected from the known SCOP folds of v2 with at least one protein chain in v1 (Table 2) (b) general and non-redundant test set in which are selected from the known SCOP folds of v2 with at least 10 protein chains in v1 (Table 3). (v v known 1 2 , ,v v known 1 2 , BMC Bioinformatics 2006, 7:362 http://www.biomedcentral.com/1471-2105/7/362 Page 6 of 17 (page number not for citation purposes) takes 3.5 seconds. In our test set, the longest protein chain, comprised of 1409 amino acids, completes the classification process in 17.4 seconds. Discussion Our approach yields better accuracy and efficiency compared to the structure alignment algorithms. The accuracy is achieved by analyzing the ranked SCOP folds of a nearest neighbors search using the E-Predict algorithm. In addition, efficiency results from using an M-tree [27] for fast nearest neighbor searches. In the following subsections, we compare our performance with the structural alignment algorithms in terms of efficiency and accuracy. Performance in efficiency Since structural alignment algorithms usually apply dynamic programming techniques to align each pair of amino acids in two proteins, they demand a huge amount of computational resources. Instead of aligning amino acids, our E-Predict model transforms relevant protein structure information into high-level features, and similar protein structures are then retrieved from a high-dimensional feature space by a nearest neighbors search in the M-tree. Our approach is able to return the classification result in seconds. Since performing the structural alignment algorithms with multiple pairwise alignments of a newly-discovered structure against the known protein structures from the SCOP database is known to be computationally expensive [10], the response times for the structural alignment algorithms are not plotted in Figure 5. The accuracy of assigning newly-discovered proteins to the known folds For the assignment of proteins to the known SCOP folds, the E-Predict algorithm mainly contributes to the accuracy. Traditional structural alignment methods usually apply heuristics to reduce computational efforts of aligning a large combination of amino acids in two proteins. Different heuristics could return diverse results from the same set of proteins since these algorithms might be trapped in local optimal solutions. Even though a consensus method that combines classification results of multiple structural alignment algorithms outperforms each individual structural alignment approach [13], it is computationally cFTlhiagseus iprfyreoin 5tge itne scth pairno tseizines against the average response time of The protein chain sizes against the average response time of classifying test proteins. Table 4: The sequence redundancy in a set that contains 10 pairs of proteins, which are randomly sampled from pairs pdb_id1 super family_id1 pdb_id2 super family_id2 sequence identity sequence similarity 01 1osd_A 55008 1uta_A 110997 2.10% 3.50% 02 1ug8_A 82708 1vm0_A 82704 12.80% 26.80% 03 1v5n_A 57889 1rq8_A 75471 13.60% 23.50% 04 1veu_B 103196 1j3m_A 103247 22.40% 34.20% 05 1tu1_B 55724 1smb_A 55797 6.80% 10.80% 06 1thq_A 56925 1xfs_B 55961 18.10% 28.40% 07 1vki_B 55826 1sk3_A 55846 17.70% 30.50% 08 1tf1_D 55781 1pp6_E 55676 10.30% 17.50% 09 1ucd_A 55895 1vkw_A 55469 9.00% 14.70% 10 1tt4_A 55931 1vkp_A 55909 12.70% 21.80% Avg. 12.55% Avg. 21.17% Av nonredundant v known 1 67 1 69 . , . , . TfFohilgdeus C rfooerr r4 epcrt oCtleaisnssif iicna tvioanri oRuatse Ss CoOf rPe croeglenaizsiensg the novel SCOP The Correct Classification Rates of recognizing the novel SCOP folds for proteins in various SCOP releases. BMC Bioinformatics 2006, 7:362 http://www.biomedcentral.com/1471-2105/7/362 Page 7 of 17 (page number not for citation purposes) expensive. Instead of performing structural alignments, our model maps both known proteins from the SCOP database and newly-discovered protein structures into 33- D feature vectors. With a nearest neighbor search for a newly-discovered structure t in the high-dimensional feature space, there may exist multiple candidate folds, which are associated with nearest neighbor proteins in the vicinity of t. One way to assign a SCOP fold to t is to choose the fold of the nearest neighbor protein in the high-dimensional feature space. Since it is possible that hundreds of folds are partially overlapped in the highdimensional feature space, the nearest neighbor of t may be an outlier that deviates from the majority of proteins in its fold. To avoid selecting an outlier, we apply the E_Measure metric that considers the ranks of at least two nearest neighbor proteins for each fold. The algorithm rewards a SCOP fold in which proteins are highly ranked and penalizes a fold with proteins in the lower ranks. Hence, when the SCOP fold includes only a single highly ranked protein with the other proteins from this fold ranked much lower, the algorithm is able to avoid assigning this fold to t based on the penalty of low ranking. From computational results, E_Measure has a vital impact on the classification accuracy. Misclassifications of assigning newly-discovered proteins to the known folds Within the framework of ProteinDBS [17-19], our model, E-Predict, transforms a 3-D protein structure into a 33-D feature vector that represents the geometric properties of folded proteins. Applying these features to measure the structural similarity of proteins, E-Predict outperforms several classification methods that apply the structural alignment algorithm using the test set in Table 1. E-Predict also yields reasonably high accuracy for several test sets in Table 3 with sufficient ground truth data. However, misclassifications still exist. The limited amount of 33-D ground truth data available for training contributes to the classification errors. As more ground truth data becomes available in small-size SCOP folds, a higher classification accuracy is expected. The second reason for misclassifications is due to the overlapping of folds in the high-dimensional feature space. To further separate overlapping folds, our system needs more relevant features to detect the protein 3-D folding with sufficient discriminating power. Another possible reason for misclassifications is that SCOP may categorize a partial segment of a PDB protein chain (substructure) into a domain. Since our approach measures the global similarity of distance matrices for classification, users need to submit the portion of the protein chain identified in the SCOP domain to ensure a correct classification. In Figure 6, we measure the correlation between the classification accuracy and a structure variation value, S, for a query protein t and the best matched protein of t in our classified SCOP fold. For a pair of proteins (p1, p2), the structural variation S is defined as follows: where RMSD means the root mean square deviation of aligned segments, and NA denotes the number of amino acids in the aligned segments of two proteins. Np1 and Np2 represent the number of amino acid residues in p1 and p2, respectively. These measurements are computed using SARF [28]. The smaller S value can be interpreted as a better structural match for two proteins p1 and p2. Two proteins that have a high structural similarity can usually be superimposed with longer aligned residue segments and a small RMSD value, resulting in a small S value. For example, the SARF algorithm aligns a query protein t with 100 amino acids and its best matched protein p1 with 100 amino acids and returns structurally similar segments with 90 amino acid residues and 0.3 Å of RMSD. Their structure variation value S is computed as 0.3/ ( ) = 0.67. When S is smaller than 6, we expect the E-Predict algorithm to maintain above 90% classification accuracy. This statistic is obtained from the classification of 41262 testing proteins. The accuracy of recognizing the novel folds for newlydiscovered proteins Since no protein has been labeled with the novel folds in our 33-D ground truth data, the novel fold recognition becomes a challenging problem. To address this issue, we introduce three features: E_Measure evaluation score, structural variation value, and Euclidean distance measurement. These features measure structural similarity S p p RMSD N N N A p p ( 1, 2) /( ), 1 2 112 2 222 90 100 1100 sCFtoirgrurucertcute r C6alla vssairfiicaattioionn vRaaluteess of classifying test proteins against Correct Classification Rates of classifying test proteins against structural variation values. BMC Bioinformatics 2006, 7:362 http://www.biomedcentral.com/1471-2105/7/362 Page 8 of 17 (page number not for citation purposes) between a newly-discovered protein and the nearest neighbor protein in a candidate known fold suggested by the E-Predict algorithm. Then, our method applies the EPredict algorithm as a classifier to identify meaningful patterns from ground truth data, which has been obtained by the aggregation of proteins in several prior SCOP releases. Computational results show that using these three features benefits the classification accuracy. Misclassifications of recognizing the novel folds for newlydiscovered proteins To recognize the novel folds for newly-discovered protein structures, our classification model exploits three relevant features. With the assumption that protein structures in the novel folds usually present low structural similarities to proteins in the known folds, a high E_Measure evaluation score, a high Euclidean distance, and a high structural variation value are expected for newly-discovered protein structures from the novel folds. Due to noise in ground truth data and imperfect features, a few proteins in the novel folds may have a low structural variation value, a low E_Measure score, or a low Euclidean distance measurement. Even though our approach presents an improved accuracy over NN and C4.5 DT, there is still a need to discover more relevant features for better recognition performance. Conclusion We have developed an automatic SCOP fold classification system that is able to assign the known SCOP folds and recognize the novel folds for newly-discovered proteins. For the known fold assignments, the algorithm transforms protein structures into 33-D feature vectors and constructs an M-tree to index these feature vectors for fast retrievals. The E-Predict algorithm is then applied to classify newly-discovered proteins in the known SCOP folds. For the novel fold recognitions, the algorithm utilizes three relevant features that are related to structural similarity of proteins. From the computational results, our method outperforms several structural alignment algorithms such as DALI, CE and VAST, achieving reasonably high classification accuracy and efficiency. This research can help accelerate the classification process of the SCOP database and benefit the biomedical research community to further study biochemical functions of proteins with similar 3-D structures. Methods Our classification model, E-Predict, contains three primary functions. First, E-Predict assigns newly-discovered protein structures to the known SCOP folds. Second, E-Predict recognizes the novel folds for newly-discovered protein structures. Third, E-Predict indexes the high-dimensional protein data for a fast nearest neighbors search. Assigning newly-discovered proteins to the known folds According to the SCOP hierarchical setting, proteins that share similar secondary structure arrangements are usually classified in the fold level [11]. The entire process of assigning newly-discovered proteins to the known folds is shown in Figure 7. The labeling procedure transforms protein structures from the SCOP database into 33-D feature vectors, which are labeled with their corresponding SCOP folds. These labeled proteins are then used as our ground truth data. The testing procedure converts newly-discovered proteins into feature vectors and submits these unlabeled vectors into a classifier to obtain possible SCOP fold assignments. In the following, we discuss several components of the entire process such as distance matrix generation, feature extraction, and classifier design. Mapping 3-D backbone structures into 2-D distance matrices Proteins are polypeptide chains, which are chained by 20 types of amino acids. Instead of considering the side chains of amino acids, many computational biology papers [6,9,15] use the Cα atom to describe each amino acid. In our model, the Cα backbone of the kth protein chain with n amino acids can be represented by a set of vectors, Ωk = { }, where denotes the 3-D coordinate of the ith Cα atom. Protein backbones can be transformed into 2-D distance matrices. For Ωk, the corresponding distance matrix, Dk, is defined as D[i, j] = dist( , ), 1 ≤ i, j ≤ n, in a Euclidean space. A distance geometry method [29] shows that the 2-D distance matrix is generally sufficient to recover the original 3-D structure in polynomial time. Several examples in the literature convert protein backbone structures into distance matrices and then detect the structural similarity from them [9,30,31]. Since 2-D distance matrices maintain sufficient 3-D structural information, similar protein backbones are expected to have similar distance matrices. Figure 8 shows 3-D protein backbone structures and their corresponding 2-D distance matrices sampled from the SCOP Heme-dependent peroxidases and Acid proteases folds. Within each SCOP fold, the proteins maintain high similarities in both 3-D backbone structures and 2-D distance matrices. Variations in distance matrices are detectable by comparing structures that belong to different folds. Feature extraction In the area of content-based image retrieval (CBIR), several computational techniques have been developed to retrieve visually similar images from databases for a query image [32-34]. When each element of the distance matrix is interpreted as a grayscale pixel, the distance matrix can Ck Ck Ck n kkkkk