A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins
- Yu-Cheng Liu†1,
- Meng-Han Yang†2,
- Win-Li Lin1,
- Chien-Kang Huang4 and
- Yen-Jen Oyang2, 3, 5
© Liu et al; licensee BioMed Central Ltd. 2009
Published: 3 December 2009
Proteins are dynamic macromolecules that may undergo conformational transitions upon changes in environment. Because laboratory observations show that protein flexibility is correlated with essential biological functions, scientists have been designing various types of predictors for identifying structurally flexible regions in proteins. These predictors fall into two major categories. One category attempts to identify conformationally flexible regions through analysis of protein tertiary structures. The other works entirely from analysis of the polypeptide sequences. As the availability of protein tertiary structures is generally limited, the design of predictors that rely solely on sequence information is crucial for advancing molecular biology research.
In this article, we propose a novel approach to design a sequence-based predictor for identifying conformationally ambivalent regions in proteins. The novelty in the design stems from incorporating two classifiers based on two distinctive supervised learning algorithms that provide complementary prediction powers. Experimental results show that the overall performance delivered by the hybrid predictor proposed in this article is superior to the performance delivered by the existing predictors. Furthermore, the case study presented in this article demonstrates that the proposed hybrid predictor is capable of providing the biologists with valuable clues about the functional sites in a protein chain. The proposed hybrid predictor provides the users with two optional modes, namely, the high-sensitivity mode and the high-specificity mode. The experimental results with an independent testing data set show that the proposed hybrid predictor is capable of delivering sensitivity of 0.710 and specificity of 0.608 under the high-sensitivity mode, while delivering sensitivity of 0.451 and specificity of 0.787 under the high-specificity mode.
Though experimental results show that the hybrid approach designed to exploit the complementary prediction powers of distinctive supervised learning algorithms works more effectively than conventional approaches, there remains considerable room for improvement in the achieved performance. In this respect, it is of interest to investigate the effects of exploiting additional physicochemical properties that are related to conformational ambivalence. Furthermore, it is of interest to investigate the effects of incorporating recently developed machine learning approaches, e.g. the random forest design and the multi-stage design. As conformational transition plays a key role in carrying out several essential types of biological functions, the design of more advanced predictors for identifying conformationally ambivalent regions in proteins deserves continued attention.
Proteins are dynamic macromolecules that may undergo conformational transitions upon changes in environment, such as pH or temperature, or upon interactions with other macromolecules. It has been observed in laboratories that conformational transition plays a key role in carrying out several essential types of biological functions, including enzyme catalysis, macromolecule recognition, binding, and signal transduction. For instance, the GTPase HRas protein, whose gene acts as an oncogene in bladder cancer, shows different conformations in the Switch II region when the protein switches between the RAS-GTP state and the RAS-GDP state [3–6]. Another example is the U1 snRNP A from Homo sapiens: the conformation of one portion of its RNA binding region switches from a helix in the unbound state to a loop in the bound state [7, 8]. Conformational switches sometimes even cause diseases. For instance, the prion protein (PrP) causes mad cow disease when a specific secondary structure element changes from a helix to a β-sheet.
As conformational flexibility is related to protein functions and interactions, scientists have been designing various types of predictors for identifying conformationally flexible regions in proteins [10–12]. These predictors fall into two major categories. The first, which addresses a problem first posed by Young et al., identifies polypeptide segments that may fold into different secondary structure elements in different environments based on sequence analysis. The other major category attempts to identify conformationally flexible regions through analysis of protein tertiary structures. As the availability of protein tertiary structures is generally limited, the design of predictors that rely solely on sequence information is crucial for advancing molecular biology research.
In this article, we propose a novel approach to design a sequence-based predictor for identifying conformationally ambivalent regions in proteins. The novelty in the design stems from incorporating two classifiers based on two distinctive supervised learning algorithms that provide complementary prediction powers. These two machine learning algorithms are the relaxed variable kernel density estimation (RVKDE) algorithm [13, 14] and the QUICKRBF algorithm that we have recently proposed. With these two classifiers, the proposed hybrid predictor can operate under either the high-sensitivity mode or the high-specificity mode, depending on the user's application. Experimental results show that the overall performance delivered by the proposed hybrid predictor is superior to the performance delivered by the existing predictors. Furthermore, the case study presented in this article demonstrates that the proposed hybrid predictor is capable of providing biologists with valuable clues about the functional sites in a protein chain.
Overview of the proposed hybrid predictor
The basis of the mechanism described above for merging the outputs of the QUICKRBF based classifier and the RVKDE based classifier is to adopt a more cautious stance in predicting a residue to be conformationally ambivalent. During our study, we tried several alternative mechanisms and decided to employ the one described above because of its effects observed in the cross validation procedure. The detailed design of the QUICKRBF based classifier and the RVKDE based classifier, as well as the cross validation procedure employed to set the parameters of the classifiers, will be elaborated in the section entitled "Methods".
Generation of the training data set
Both the learning processes of the RVKDE based classifier and the QUICKRBF based classifier in Fig. 1 have been carried out with the training data set generated by the following procedure.
(1) All the protein chains in the PDB  (released on 01-April-2008) that have the same entry name and primary accession number in SwissProt (release 55.1 of 18-March-2008) are grouped. In the end, there are a total of 11084 groups of protein chains.
(2) The BLAST package  is invoked to check the redundancy among the groups of protein chains. It is guaranteed that no two protein chains in different groups have a sequence identity higher than 25%. In the end, 3496 groups of protein chains remain.
(3) For each of the 3496 groups of protein chains, the CLUSTALW package is invoked to carry out multiple-sequence alignment on the protein chains in the group, and the DSSP package is invoked to label each residue in the protein chains with one of the following 3 types of secondary structure: helix, sheet, and coil. Then, one protein chain is randomly selected from each group as the representative. We further checked the sequence identity between the 3496 representatives and the collection of 170 testing protein chains described in the next subsection, and removed 92 representatives because each had a homologous testing protein chain with a BLAST-computed sequence identity higher than 20%. Finally, each residue in the remaining 3404 representative protein chains was examined to determine whether the residue and all the residues in other protein chains that are aligned with it have been labelled with the same type of secondary structure. A conformationally ambivalent region is defined to be a segment of 3 or more consecutive residues within which each residue and its aligned residues have discrepant types of secondary structure.
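The definition in step (3) can be illustrated with a minimal sketch. It assumes that each chain in a group has already been aligned and reduced to a string of 3-state labels ('H' for helix, 'E' for sheet, 'C' for coil), with '-' marking alignment gaps and the representative chain listed first. The function name, the label alphabet, and the treatment of gaps (gap columns of the representative are skipped, and gaps in other chains are ignored) are illustrative assumptions, not details taken from the procedure above.

```python
def ambivalent_regions(aligned_ss, min_len=3, gap="-"):
    """Return (start, end) residue index pairs, end exclusive, on the representative.

    aligned_ss: list of equal-length strings over {'H', 'E', 'C', gap};
    aligned_ss[0] is the representative chain of the group.
    """
    rep = aligned_ss[0]
    if any(len(s) != len(rep) for s in aligned_ss):
        raise ValueError("aligned label strings must have equal length")

    # One flag per residue of the representative: True when the labels
    # aligned to that residue are not all the same.
    discrepant = []
    for col, rep_label in enumerate(rep):
        if rep_label == gap:
            continue  # not a residue of the representative
        labels = {s[col] for s in aligned_ss if s[col] != gap}
        discrepant.append(len(labels) > 1)

    # Collect runs of at least min_len consecutive discrepant residues.
    regions = []
    start = None
    for idx, flag in enumerate(discrepant + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            if idx - start >= min_len:
                regions.append((start, idx))
            start = None
    return regions


if __name__ == "__main__":
    group = ["HHHHHCCCEE",
             "HHCCCCCCEE",
             "HHHHHCCEEE"]
    print(ambivalent_regions(group))  # [(2, 5)]
```

In this toy group, residues 2 to 4 of the representative are helix in two chains and coil in the third, so they form one region. Residue 7 is also discrepant, but an isolated residue is shorter than the 3-residue minimum and is not reported.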
One may wonder why we discarded those rows in the PSSM that correspond to residue types that are neither charged nor polar. The reason is that we conducted an analysis on the propensity of residue types in conformationally ambivalent regions and found that the propensity of hydrophobic residues is essentially the same in conformationally ambivalent regions and in rigid regions. On the other hand, conformationally ambivalent regions contain a significantly higher percentage of charged and/or polar residues than rigid regions. Therefore, we duplicated those rows in the PSSM that correspond to charged residues.
Generation of the independent testing data set
The experiments reported in this article have been conducted with an independent testing data set derived from the collection of protein conformational ambivalence regions created by Boden et al. (http://pprowler.itee.uq.edu.au/sspred/). According to Boden's description, this collection of 170 protein chains was extracted from MolMovDB (http://www.molmovdb.org/) [21, 22], which is a database that records the motion of macromolecules, especially proteins, as reported in the literature indexed in PubMed. As mentioned earlier, it was guaranteed that none of these 170 testing protein chains has a BLAST-computed sequence identity higher than 20% with any of the training protein chains described in the previous subsection.
In generating the independent testing data set, we followed the procedure elaborated above for generating the training data set in order to associate each residue in the testing protein chains with a feature vector and labelled each residue as conformationally ambivalent or not based on the annotations in the MolMovDB database. In the end, the testing data set generated contains 5807 positive samples and 54823 negative samples.
The F-score is the harmonic mean of sensitivity and precision and is a widely used metric in machine learning research for providing a balanced assessment of the performance of a predictor.
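As a minimal sketch of how these metrics relate to one another, the function below computes sensitivity, specificity, precision, and the F-score from the four cells of a confusion matrix. The function name and the counts in the usage example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the four metrics from confusion-matrix counts.

    tp, fp, tn, fn: numbers of true positives, false positives,
    true negatives, and false negatives.
    """
    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # fraction of positives recovered
    specificity = ratio(tn, tn + fp)   # fraction of negatives recovered
    precision = ratio(tp, tp + fp)     # fraction of predicted positives that are correct
    # Harmonic mean of sensitivity and precision.
    f_score = ratio(2.0 * sensitivity * precision, sensitivity + precision)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the arithmetic.
    for name, value in classification_metrics(tp=40, fp=20, tn=130, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Because negative samples greatly outnumber positive samples in the testing data set, precision and the F-score are informative alongside sensitivity and specificity: even a moderate false positive rate produces many false positives in absolute terms.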
Comparison with Boden's predictor of protein conformational ambivalence
Table 1. Performance of Boden's predictor with different cut-off values of entropy
Table 2. Performance comparison between the hybrid predictor and Boden's predictor
- The hybrid predictor (high-sensitivity mode)
- The hybrid predictor (high-specificity mode)
- Boden's predictor (3-class mode) with entropy threshold = 0.4
- Boden's predictor (3-class mode) with entropy threshold = 0.65
- Boden's predictor (8-class mode) with entropy threshold = 0.52
- Boden's predictor (8-class mode) with entropy threshold = 0.69
The numbers in Table 2 reveal that when the hybrid predictor and Boden's predictor deliver comparable sensitivity, the hybrid predictor can deliver higher specificity and precision. Furthermore, when the hybrid predictor and Boden's predictor deliver comparable specificity, the hybrid predictor can deliver higher sensitivity and precision.
Comparison with Kuznetsov's predictor of protein conformational ambivalence
In this section, the performance of the hybrid predictor proposed in this article is compared with that of the sequence-based predictor proposed by Kuznetsov [11, 26]. It must be noted that Kuznetsov employed a different definition of protein conformational ambivalence. By Kuznetsov's definition, a residue in a protein chain is said to be flexible if its phi angle varies by more than 62 degrees or its psi angle varies by more than 68 degrees between two different conformations. In order to accommodate Kuznetsov's definition, we labelled the residues in our collection of training protein chains based on Kuznetsov's definition and then trained our hybrid predictor with this separately generated training data set.
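A minimal sketch of this labelling rule is given below. The thresholds of 62 and 68 are those quoted above. Treating the angles as periodic, so that 170 and -170 degrees differ by 20 degrees rather than 340, is an assumption of this sketch, and the function names are hypothetical.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def is_flexible(phi1, psi1, phi2, psi2, phi_cutoff=62.0, psi_cutoff=68.0):
    """Label one residue from its backbone dihedrals in two conformations.

    The residue is called flexible when phi changes by more than
    phi_cutoff degrees or psi changes by more than psi_cutoff degrees.
    """
    return (angular_difference(phi1, phi2) > phi_cutoff
            or angular_difference(psi1, psi2) > psi_cutoff)


if __name__ == "__main__":
    # A helix-like (-60, -45) versus a strand-like (-120, 130) conformation:
    # the phi change is 60 degrees, the psi change is 175 degrees.
    print(is_flexible(-60.0, -45.0, -120.0, 130.0))
```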
The testing data set used in this experiment was derived from Boden's collection of testing protein chains. Again, in order to accommodate Kuznetsov's definition of protein conformational ambivalence, we labelled the residues in the testing protein chains based on Kuznetsov's definition. Furthermore, in order to carry out a fair comparison, we removed those testing protein chains in Boden's collection that have a sequence identity higher than 20% with one or more protein chains in Kuznetsov's training set. In the end, 137 out of the 170 testing protein chains in Boden's collection were used for carrying out the benchmark reported in this section.
Table 3. Performance comparison between the hybrid predictor and Kuznetsov's predictor
- The hybrid predictor with only the RVKDE based classifier enabled
- The hybrid predictor with both the QUICKRBF and RVKDE based classifiers enabled
- Kuznetsov's predictor with false positive rate = 20
- Kuznetsov's predictor with false positive rate = 10
In this article, we propose a novel approach to design a sequence-based predictor for identifying conformationally ambivalent regions in proteins. The novelty in the design stems from incorporating two classifiers based on two distinctive supervised learning algorithms that provide complementary predictive powers. Experimental results show that the overall performance delivered by the proposed hybrid predictor is superior to the performance delivered by the existing predictors. Furthermore, the case study presented in this article demonstrates that the hybrid predictor is capable of providing biologists with valuable clues about the functional sites in a protein chain.
Nevertheless, experimental results also show that there remains considerable room for improvement in the performance of the predictor. Therefore, it is of great interest to investigate how to design more advanced predictors, and in particular how physicochemical properties of polypeptide segments can be more effectively exploited. In this study, we have only exploited the information in the PSSM, and a natural extension is to investigate the effects of incorporating the other physicochemical properties of polypeptide segments recently exploited by related studies [29–31]. Furthermore, it is of interest to investigate the effects of incorporating recently developed machine learning approaches, e.g. the random forest design and the multi-stage design [32, 33]. As conformational transition plays a key role in carrying out several essential types of biological functions, the design of more advanced predictors deserves continued attention.
Design of the proposed hybrid predictor
where f_j(v) is the function corresponding to the j-th output node and is a linear combination of k radial basis functions with center μ_i and bandwidth σ_i; w_ji is the weight associated with the link between the j-th output node and the i-th hidden node. For data classification applications, the RBF network has one output node for each class of samples, and a query sample is predicted to belong to the class whose output node yields the maximum value. The tasks that the learning algorithm of an RBF network carries out include: (1) determining the centers of the activation functions associated with the hidden nodes; (2) setting the parameters associated with the activation functions; (3) optimizing the weights associated with the links between the hidden layer and the output layer.
In our implementation of the QUICKRBF package, the user can specify the number of hidden nodes to be incorporated, and the learning algorithm then places the activation functions at a set of randomly selected training samples. Our experience suggests that the classification accuracy of an RBF network is not sensitive to how the bandwidths associated with the activation functions are set, as long as the weights in Equation (1) are optimized. Therefore, the QUICKRBF algorithm simply employs a default value and resorts to the Cholesky decomposition to optimize the weights in Equation (1).
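The learning scheme described above can be sketched in a few dozen lines. The sketch is not the QUICKRBF package itself. It assumes Gaussian activation functions with a single fixed bandwidth, one-hot class targets, and a least-squares fit of the weights through the normal equations, to which a small ridge term is added so that the Cholesky factorization is well defined. All function and parameter names are hypothetical.

```python
import math
import random


def gaussian(v, center, sigma):
    """Gaussian radial basis function centered at `center`."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(v, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][m] * low[j][m] for m in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def cholesky_solve(low, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(low)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][m] * y[m] for m in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[m][i] * x[m] for m in range(i + 1, n))) / low[i][i]
    return x


def train_rbf(samples, labels, n_classes, n_hidden, sigma=1.0, ridge=1e-6, seed=0):
    """Place centers at randomly selected training samples and fit the weights."""
    rng = random.Random(seed)
    centers = rng.sample(samples, n_hidden)
    phi = [[gaussian(v, c, sigma) for c in centers] for v in samples]
    # Normal equations: (Phi^T Phi + ridge * I) w_j = Phi^T t_j for each class j.
    gram = [[sum(row[p] * row[q] for row in phi) + (ridge if p == q else 0.0)
             for q in range(n_hidden)] for p in range(n_hidden)]
    low = cholesky(gram)  # factorized once, reused for every output node
    weights = []
    for j in range(n_classes):
        rhs = [sum(row[p] for row, y in zip(phi, labels) if y == j)
               for p in range(n_hidden)]
        weights.append(cholesky_solve(low, rhs))
    return centers, weights


def predict_rbf(v, centers, weights, sigma=1.0):
    """Return the class whose output node yields the maximum value."""
    hidden = [gaussian(v, c, sigma) for c in centers]
    outputs = [sum(w * h for w, h in zip(w_j, hidden)) for w_j in weights]
    return max(range(len(outputs)), key=outputs.__getitem__)


if __name__ == "__main__":
    xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
    ys = [0, 0, 0, 0, 1, 1, 1, 1]
    centers, weights = train_rbf(xs, ys, n_classes=2, n_hidden=len(xs))
    print(predict_rbf((0.5, 0.5), centers, weights),
          predict_rbf((5.5, 5.5), centers, weights))
```

Since the centers and the bandwidth are fixed before the weights are fitted, the matrix on the left-hand side is shared by all output nodes. It is therefore factorized once and reused for each class, which keeps training cheap in this formulation.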
2) R(s_i) is the maximum distance between s_i and its k nearest training instances;
3) Γ(·) is the Gamma function;
4) β and k are parameters to be set either through cross validation or by the user.
where |S_j| is the number of class-j training instances, and (v) is the kernel density estimator corresponding to class-j training instances. In our current implementation, in order to improve the efficiency of the classifier, we include only a limited number, denoted by k', of the nearest class-j training instances of v while evaluating (v).
As Equation (2) shows, there are 3 parameters, β, k, and k', associated with a RVKDE based classifier. In principle, d should be equal to the dimension of the feature vectors associated with the samples. However, because there may exist correlations among features, d is treated as a parameter to be set during the learning process. As a result, to create a RVKDE based classifier, there are a total of 4 parameters to be set. On the other hand, to create a QUICKRBF based classifier, the user only needs to determine the number of hidden nodes to be incorporated.
Parameter settings of the proposed hybrid predictor for the experiment reported in Table 2
Number of hidden nodes
Parameter settings of the proposed hybrid predictor for the experiment reported in Table 3
Number of hidden nodes
Other papers from the meeting have been published as part of BMC Bioinformatics Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics, available online at http://www.biomedcentral.com/1471-2105/10?issue=S15.
We would like to thank both Dr. Mikael Boden and Dr. Igor B. Kuznetsov for making their benchmark data sets available.
This research has been supported by the National Science Council and National Taiwan University. Funding for open access charge: National Science Council, NSC 97-2627-P-001-002.
This article has been published as part of BMC Genomics Volume 10 Supplement 3, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S3.
- Creighton TE: Protein folding. Biochem J. 1990, 270 (1): 1-16.
- Ambroggio XI, Kuhlman B: Design of protein conformational switches. Current opinion in structural biology. 2006, 16 (4): 525-530. 10.1016/j.sbi.2006.05.014.
- Goodsell DS: The molecular perspective: the ras oncogene. The oncologist. 1999, 4 (3): 263-264.
- Downward J: Targeting RAS signalling pathways in cancer therapy. Nature reviews. 2003, 3 (1): 11-22. 10.1038/nrc969.
- Vetter IR, Wittinghofer A: The guanine nucleotide-binding switch in three dimensions. Science (New York, NY). 2001, 294 (5545): 1299-1304.
- Sprang SR: G proteins, effectors and GAPs: structure and mechanism. Current opinion in structural biology. 1997, 7 (6): 849-856. 10.1016/S0959-440X(97)80157-1.
- Lutz CS, Cooke C, O'Connor JP, Kobayashi R, Alwine JC: The snRNP-free U1A (SF-A) complex(es): identification of the largest subunit as PSF, the polypyrimidine-tract binding protein-associated splicing factor. RNA (New York, NY). 1998, 4 (12): 1493-1499.
- Ellis JJ, Jones S: Evaluating conformational changes in protein structures binding RNA. Proteins. 2008, 70 (4): 1518-1526. 10.1002/prot.21647.
- Prusiner SB, Scott MR, DeArmond SJ, Cohen FE: Prion protein biology. Cell. 1998, 93 (3): 337-348. 10.1016/S0092-8674(00)81163-0.
- Boden M, Bailey TL: Identifying sequence regions undergoing conformational change via predicted continuum secondary structure. Bioinformatics (Oxford, England). 2006, 22 (15): 1809-1814. 10.1093/bioinformatics/btl198.
- Kuznetsov IB: Ordered conformational change in the protein backbone: prediction of conformationally variable positions from sequence and low-resolution structural data. Proteins. 2008, 72 (1): 74-87. 10.1002/prot.21899.
- Young M, Kirshenbaum K, Dill KA, Highsmith S: Predicting conformational switches in proteins. Protein Sci. 1999, 8 (9): 1752-1764. 10.1110/ps.8.9.1752.
- Oyang YJ, Hwang SC, Ou YY, Chen CY, Chen ZW: Data classification with radial basis function networks based on a novel kernel density estimation algorithm. IEEE Transactions on Neural Networks. 2005, 16 (1): 225-236. 10.1109/TNN.2004.836229.
- Oyang YJ, Ou YY, Hwang SC, Chen CY, Chang DTH: Data classification with a relaxed model of variable kernel density estimation. Proceedings of the International Joint Conference on Neural Networks (IJCNN). 2005, 1-5: 2831-2836.
- Ou YY, Oyang YJ, Chen CY: A novel radial basis function network classifier with centers set by hierarchical clustering. Proceedings of the International Joint Conference on Neural Networks (IJCNN). 2005, 1-5: 1383-1388.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic acids research. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of molecular biology. 1990, 215 (3): 403-410.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-2637. 10.1002/bip.360221211.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Gerstein M, Krebs W: A database of macromolecular motions. Nucleic acids research. 1998, 26 (18): 4280-4290. 10.1093/nar/26.18.4280.
- Flores S, Echols N, Milburn D, Hespenheide B, Keating K, Lu J, Wells S, Yu EZ, Thorpe M, Gerstein M: The Database of Macromolecular Motions: new features added at the decade mark. Nucleic acids research. 2006, 34 (Database issue): D296-301. 10.1093/nar/gkj046.
- Boden M, Yuan Z, Bailey TL: Prediction of protein continuum secondary structure with probabilistic models based on NMR solved structures. BMC bioinformatics. 2006, 7: 68. 10.1186/1471-2105-7-68.
- Chou PY, Fasman GD: Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins. Biochemistry. 1974, 13 (2): 211-222. 10.1021/bi00699a001.
- Chou PY, Fasman GD: Prediction of protein conformation. Biochemistry. 1974, 13 (2): 222-245. 10.1021/bi00699a002.
- Kuznetsov IB, McDuffie M: FlexPred: a web-server for predicting residue positions involved in conformational switches in proteins. Bioinformation. 2008, 3 (3): 134-136.
- Fletcher JI, Swarbrick JD, Maksel D, Gayler KR, Gooley PR: The structure of Ap(4)A hydrolase complexed with ATP-MgF(x) reveals the basis of substrate binding. Structure. 2002, 10 (2): 205-213. 10.1016/S0969-2126(02)00696-2.
- Swarbrick JD, Bashtannyk T, Maksel D, Zhang XR, Blackburn GM, Gayler KR, Gooley PR: The three-dimensional structure of the Nudix enzyme diadenosine tetraphosphate hydrolase from Lupinus angustifolius L. Journal of molecular biology. 2000, 302 (5): 1165-1177. 10.1006/jmbi.2000.4085.
- Tomovic A, Oakeley EJ: Computational structural analysis: multiple proteins bound to DNA. PLoS ONE. 2008, 3 (9): e3243. 10.1371/journal.pone.0003243.
- Kuznetsov IB, Rackovsky S: On the properties and sequence context of structurally ambivalent fragments in proteins. Protein Sci. 2003, 12 (11): 2420-2433. 10.1110/ps.03209703.
- Gunasekaran K, Nussinov R: How different are structurally flexible and rigid binding sites? Sequence and structural features discriminating proteins that do and do not undergo conformational change upon ligand binding. Journal of molecular biology. 2007, 365 (1): 257-273. 10.1016/j.jmb.2006.09.062.
- Hirose S, Shimizu K, Kanai S, Kuroda Y, Noguchi T: POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics (Oxford, England). 2007, 23 (16): 2046-2053. 10.1093/bioinformatics/btm302.
- Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z: Length-dependent prediction of protein intrinsic disorder. BMC bioinformatics. 2006, 7: 208. 10.1186/1471-2105-7-208.
- Press WH: Numerical Recipes in C. 1992, Cambridge: Cambridge University Press, second edition.
- Artin E: The Gamma Function. 1964, New York: Holt, Rinehart and Winston.
- Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE: UCSF Chimera--a visualization system for exploratory research and analysis. Journal of computational chemistry. 2004, 25 (13): 1605-1612. 10.1002/jcc.20084.
- Jmol: an open-source Java viewer for chemical structures in 3D. [http://www.jmol.org]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.