Volume 11 Supplement 4
Predicting RNA-binding residues from evolutionary information and sequence conservation
© Huang et al; licensee BioMed Central Ltd. 2010
Published: 2 December 2010
RNA-binding proteins (RBPs) play crucial roles in post-transcriptional control of RNA. RBPs are designed to efficiently recognize specific RNA sequences after they are transcribed from DNA. To satisfy diverse functional requirements, RNA-binding proteins are composed of multiple blocks of RNA-binding domains (RBDs) presented in various structural arrangements to provide versatile functions. The ability to computationally predict RNA-binding residues in an RNA-binding protein can help biologists choose important residues for site-directed mutagenesis in wet-lab experiments.
The proposed prediction framework, named “ProteRNA”, combines an SVM-based classifier with conserved residue discovery by WildSpan to identify the residues that interact with RNA in an RNA-binding protein. Although these conserved residues can be either functionally or structurally conserved, they provide clues to the important residues in a protein sequence. On the independent testing dataset, ProteRNA delivers an overall accuracy of 89.78%, an MCC of 0.2628, an F-score of 0.3075, and an F0.5-score of 0.3546.
This article presents the design of a sequence-based predictor that aims to identify the RNA-binding residues in an RNA-binding protein by combining machine learning and pattern mining approaches. RNA-binding proteins perform diverse functions when interacting with different categories of RNAs because they are composed of multiple copies of RNA-binding domains presented in various structural arrangements, which expands their functional repertoire. Furthermore, predicting RNA-binding residues in an RNA-binding protein can help biologists choose important residues for site-directed mutagenesis in wet-lab experiments.
RNA-binding proteins (RBPs) are designed to efficiently recognize specific RNA sequences after they are transcribed from the DNA sequences. Protein-RNA interactions are fundamental to cellular processes, including the assembly and function of ribonucleoprotein particles (RNPs), such as ribosomes and spliceosomes, and the post-transcriptional regulation of gene products. To satisfy diverse functional requirements, RNA-binding proteins are composed of multiple blocks of RNA-binding domains (RBDs) presented in various structural arrangements to provide versatile functionality [1, 2]. Although RNA structure is hierarchical, that is, the primary sequence determines the secondary structure, which in turn determines the tertiary structure, the tertiary structure of RNA is not as stable as the secondary structure and is hard to predict. However, sequence conservation in RNA-binding domains has been discovered in RNA-binding proteins [4, 5, 6, 7]. With the recent growth in the number of protein-RNA complexes in the Protein Data Bank (PDB) and the Nucleic Acid Database (NDB), both RNA-binding pockets [10, 11, 12, 13, 14, 15, 16, 17, 18] and the themes of RNA-protein recognition [18, 19, 20] have been investigated through structural analysis.
Most recent work on predicting RNA-binding residues uses a support vector machine (SVM) with evolutionary information derived from the protein sequence. Wang and Brown (2006) developed the web service BindN to predict DNA and RNA binding sites using sequence features that represent structural characteristics, including relative solvent accessible surface area, side chain pKa, hydrophobicity index and molecular mass of an amino acid. Tong et al. (2008) proposed the hybrid RISP (RNA-Interaction Site Prediction) method, which adjusts the cutoff value of the SVM discrimination function to improve prediction performance. Kumar et al. (2008) developed Pprint using evolutionary profiles of the position-specific scoring matrices (PSSMs) and amino acid composition; they also adjusted the cutoff value of the SVM discrimination function to improve prediction performance. Wang et al. (2008) developed PRINTR using additional structural information from protein-RNA complexes. Cheng et al. (2008) developed RNAProB by smoothing PSSM profiles to take into account the correlation and dependency among the neighboring residues of each amino acid in a protein. Spriggs et al. (2009) developed PiRaNhA, which uses a support vector machine with a PSSM profile and three amino acid properties, namely interface propensity (IP), predicted solvent accessibility (pA) and hydrophobicity (H), to recognize RNA-binding residues. Other machine learning approaches, such as neural networks and the Naïve Bayes classifier, have also been applied to predict RNA-binding residues. Jeong et al. (2004) applied an artificial neural network (ANN)-based method with amino acid sequence and predicted secondary structure information, and improved its performance with post-processing procedures such as state-shifting and filtering isolated interacting residues from the prediction. An improved version by Jeong et al. (2006) used evolutionary information extracted from PSI-BLAST profiles and CLUSTALW alignments. Terribilini et al. (2006) applied a Naïve Bayes classifier with amino acid sequence information to predict RNA-interacting residues and presented the results through the web service RNABindR. The ability to computationally predict RNA-binding residues in an RNA-binding protein can help biologists choose residues for site-directed mutagenesis in wet-lab experiments.
Caragea et al. explored the problem of assessing the performance of classifiers trained on macromolecular sequence data, with an emphasis on cross-validation and data selection methods. Comparing window-based and sequence-based k-fold cross-validation, they found that window-based cross-validation can yield overly optimistic estimates of classifier performance relative to the estimates obtained using sequence-based cross-validation. RNAProB, BindN, RISP, PRINTR and PiRaNhA report performance with window-based k-fold cross-validation, while Pprint and RNABindR report performance with sequence-based k-fold cross-validation. The predictors evaluated with window-based k-fold cross-validation show better performance than those evaluated with sequence-based k-fold cross-validation. The reason is that, in window-based k-fold cross-validation, data instances in the testing fold may be predicted from data instances in the training fold with sub-sequence identity higher than 25%. Moreover, for data with class imbalance, the metrics that measure classification performance must be chosen carefully. Matthews correlation coefficient (MCC), F-score and F0.5-score are widely applied to assess prediction performance. MCC measures prediction quality with consideration of both under- and over-predictions. F-score and F0.5-score combine precision and sensitivity on the positive class into a single measure.
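The difference between the two cross-validation schemes comes down to the unit that is assigned to a fold. The following is a minimal sketch of sequence-based fold assignment, in which every window of a protein chain follows its chain into the same fold. The function names, the toy chain identifiers and the round-robin assignment are illustrative assumptions, not the procedure used by any of the predictors discussed above.

```python
import random


def sequence_based_folds(chain_ids, k=5, seed=0):
    """Assign whole protein chains (not individual windows) to k folds."""
    unique = sorted(set(chain_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Round-robin assignment after shuffling keeps fold sizes balanced.
    return {chain: i % k for i, chain in enumerate(unique)}


def split_windows(windows, fold_of_chain, test_fold):
    """Split (chain_id, residue_index) windows by the fold of their chain."""
    train = [w for w in windows if fold_of_chain[w[0]] != test_fold]
    test = [w for w in windows if fold_of_chain[w[0]] == test_fold]
    return train, test


# Toy data: six hypothetical chains with four residue windows each.
windows = [(c, i) for c in ["A", "B", "C", "D", "E", "F"] for i in range(4)]
folds = sequence_based_folds([c for c, _ in windows], k=3, seed=1)
train, test = split_windows(windows, folds, test_fold=0)
```

With this assignment no chain contributes windows to both the training and the testing side of a split, which is the property that window-based cross-validation lacks.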
In this article, we propose the prediction framework “ProteRNA”, which combines an SVM-based classifier using evolutionary profiles with conserved residue discovery based on sequence conservation to identify RNA-interacting residues in an RNA-binding protein. In the SVM-based classifier, the feature vectors include the position-specific scoring matrix computed by PSI-BLAST and secondary structure information predicted by PSIPRED. To exploit the sequence conservation information, we incorporate WildSpan (http://biominer.bime.ntu.edu.tw/wildspan/), which was developed to discover functional signatures and diagnostic patterns of proteins directly from a set of unaligned protein sequences. The most distinguishing feature of WildSpan is that it links short motifs (local conserved regions) with large flexible gaps to deliver the most frequently observed discontinuous patterns present in related proteins. WildSpan has been embedded in many applications [35, 36, 37, 38, 39] to discover functionally important residues; therefore, we apply WildSpan to discover conserved residues and treat them as RNA-binding residues, so that more RNA-binding residues in a protein sequence are detected. The independent testing dataset collected for performance evaluation contains 33 RNA-binding proteins with less than 30% sequence identity to the training data. On the independent testing dataset, ProteRNA delivers an overall accuracy of 89.78%, an MCC of 0.2628, an F-score of 0.3075, and an F0.5-score of 0.3546. We emphasize MCC, F-score and F0.5-score because they provide the biochemist with a confidence level for designing an experiment to confirm whether a predicted binding residue is really involved in interaction with the RNA.
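To make the feature description concrete, the sketch below builds a per-residue feature vector from a PSSM and a predicted secondary structure string. The sliding window, its half-width, the zero padding at the sequence ends and the one-hot encoding of the three states H, E and C are assumptions made for illustration; the text above does not specify how ProteRNA encodes these features.

```python
def window_features(pssm, ss, center, half_width=2):
    """Concatenate PSSM rows and one-hot secondary structure states
    for the residues in a window around `center`.

    pssm: list of per-residue score rows (e.g. 20 values per residue)
    ss:   string of predicted states, one of 'H', 'E', 'C' per residue
    Window positions that fall outside the sequence are zero-padded.
    """
    ss_states = "HEC"
    n_cols = len(pssm[0])
    feats = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(pssm):
            feats.extend(pssm[pos])
            feats.extend(1.0 if ss[pos] == s else 0.0 for s in ss_states)
        else:
            feats.extend([0.0] * (n_cols + len(ss_states)))
    return feats


# Toy input: three residues, 20 PSSM columns each (values are made up).
toy_pssm = [[float(i + 1)] * 20 for i in range(3)]
toy_ss = "HEC"
vec = window_features(toy_pssm, toy_ss, center=0, half_width=1)
```

Each window position contributes 23 values here (20 PSSM scores plus 3 state indicators), so a window of three residues yields a vector of length 69.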
Results and discussion
In this section, we report the experiments conducted to evaluate the performance of our proposed approach, ProteRNA, which combines an SVM-based classifier using evolutionary profiles with conserved residue discovery based on sequence conservation. In order to avoid bias, we repeated the 5-fold cross-validation procedure 20 times to observe prediction performance on the training dataset RB147 (see Materials and Methods for details). For each run, we applied sequence-based 5-fold cross-validation; that is, protein chains were randomly divided into 5 folds, with one fold used for testing and the remaining 4 folds for training. For this study, LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm) was used for training and classification, and WildSpan was used for detecting conserved residues from homologous protein sequences. We use an independent testing dataset containing 33 protein chains to compare ProteRNA with other predictors such as PiRaNhA, Pprint, BindN, and PRIP.
Performance evaluation by five-fold cross-validation
Prediction performance evaluated by the 5-fold cross-validation using the training dataset, RB147
Sensitivity | Specificity | Precision | Accuracy | MCC | F-score | F0.5-score
38.85% ± 0.46% | 97.01% ± 0.09% | 75.99% ± 0.48% | 85.93% ± 0.08% | 0.4732 ± 0.0036 | 0.5170 ± 0.0040 | 0.6343 ± 0.0034
44.84% ± 0.37% | 93.56% ± 0.09% | 62.10% ± 0.25% | 84.28% ± 0.06% | 0.4378 ± 0.0027 | 0.5208 ± 0.0027 | 0.5766 ± 0.0022
Statistical information of the training dataset, RB147, in terms of RNA-binding residues (columns: number of RNA-binding residues, total number of residues, ratio of RNA-binding residues)
Prediction performance breakdown in terms of the categories of RNA using the training dataset, RB147
As is known, rRNA is the major group in the training dataset. Comparing the number of RNA-binding proteins by interaction target (e.g. rRNA, tRNA, mRNA), we find that tRNA generally has the most interaction partners, followed by mRNA, while rRNA has the fewest. ProteRNASVM tends to predict negative for proteins in the mRNA group and to over-predict either the positive class or the negative class in the tRNA group. However, ProteRNAWildSpan shows no difference between categories of RNAs because it relies on homologous proteins discovered in Swiss-Prot. In addition, ProteRNAWildSpan detects conserved residues as binding residues in regions that ProteRNASVM does not predict; therefore, we apply WildSpan to detect conserved residues, because these residues have a higher probability of playing roles in interactions with RNAs.
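One simple way to picture how the two components complement each other is a per-residue union of their positive calls. The sketch below is an assumption made for illustration only: the text says that WildSpan's conserved residues cover regions the SVM does not predict, but it does not state the exact combination rule, and the function name and 0-based indexing are hypothetical.

```python
def combine_predictions(svm_positive, conserved_positions, length):
    """Return 0/1 labels per residue: 1 if the SVM predicts binding
    or the residue lies in a conserved pattern (assumed union rule).

    svm_positive, conserved_positions: sets of 0-based residue indices.
    Indices outside the sequence are ignored.
    """
    positive = set(svm_positive) | set(conserved_positions)
    return [1 if i in positive else 0 for i in range(length)]


# Toy example with made-up positions on a 10-residue chain.
labels = combine_predictions({2, 5}, {5, 7, 12}, length=10)
```

Under a union rule the combined predictor can only gain positives relative to the SVM alone, which trades some precision for sensitivity.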
Comparison with other predictors by independent testing
Comparison of ProteRNA with other predictors using the independent testing dataset, RB33
Comparison of the top 10 ranking predictions with results from other predictors: (a) ranked by MCC (MCC of rank 1, MCC of rank 10); (b) ranked by precision (precision of rank 1, precision of rank 10)
This article presents the design of a sequence-based predictor that aims to identify the RNA-binding residues in an RNA-binding protein using machine learning and pattern mining approaches. RNA-binding proteins play different roles when interacting with different categories of RNAs and thus perform diverse functions. These proteins contain multiple copies of RNA-binding domains presented in various structural arrangements, which expands their functional repertoire. Therefore, it is still difficult to predict RNA-binding residues in an RNA-binding protein. Furthermore, predicting RNA-binding residues in an RNA-binding protein can help biologists choose residues for site-directed mutagenesis in wet-lab experiments.
In the experiments reported in this article, ProteRNA used not only the evolutionary profile with predicted secondary structure but also sequence conservation information. Although the conserved residues can be functionally or structurally conserved, they provide clues to the important residues in a protein sequence. On the independent testing dataset, ProteRNA delivers an overall accuracy of 89.78%, an MCC of 0.2628, an F-score of 0.3075, and an F0.5-score of 0.3546. It is anticipated that the prediction accuracy delivered by ProteRNA will continue to improve as the number of protein-RNA complexes deposited in the PDB grows and the number of training samples that can be exploited increases accordingly. Nevertheless, it is the computational biologists’ primary interest to develop more advanced prediction mechanisms. In this respect, we believe that, as the number of protein-RNA complexes deposited in the PDB increases, we can obtain more insights into the key physicochemical properties that play essential roles in protein-RNA interactions and will then be able to develop more advanced prediction mechanisms accordingly. In addition, we will exploit the experience gained in this study to design specific predictors for other families of proteins interacting with RNA. We believe that different families of proteins may have very different characteristics. Therefore, for a specific type of protein, a specifically designed predictor should be able to deliver superior performance compared to a general-purpose predictor.
Materials and methods
Datasets for ProteRNA
(a) Training dataset - RB147
(b) Independent Testing Dataset - RB33
In order to evaluate prediction performance among different prediction models, we collected a new independent testing dataset by extracting all structures of protein-RNA complexes that were added to the PDB after January 2006. Protein chains with a resolution better than 3.5 Å and a sequence length longer than 40 amino acids were retained. We then performed redundancy reduction using BLASTclust to ensure that no chain showed a sequence similarity of more than 30% to any other chain in the dataset or to any chain in the training dataset; as a result, 33 protein-RNA complexes were selected to create a dataset called RB33. The list of PDB IDs in RB33 is shown in Table 6(b). Based on a cut-off distance of 5 Å, RB33 contains a total of 9,785 amino acids, of which 882 are RNA-binding residues and 8,903 are non-binding residues.
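The distance-based labelling can be sketched as follows. This is a minimal illustration that assumes a residue counts as RNA-binding when any of its atoms lies within the cut-off of any RNA atom; the atom-level criterion, the inclusive comparison and the input layout are assumptions, since the text only gives the 5 Å cut-off. The last line reproduces the class ratio implied by the RB33 counts quoted above.

```python
import math


def binding_residues(protein_atoms, rna_atoms, cutoff=5.0):
    """Return the set of residue ids with at least one atom within
    `cutoff` angstroms of any RNA atom.

    protein_atoms: list of (residue_id, (x, y, z)) tuples
    rna_atoms:     list of (x, y, z) tuples
    """
    binding = set()
    for res_id, xyz in protein_atoms:
        if res_id in binding:
            continue  # residue already labelled by another of its atoms
        if any(math.dist(xyz, r) <= cutoff for r in rna_atoms):
            binding.add(res_id)
    return binding


# Toy coordinates (made up): residues 1 and 3 are near the RNA atom.
protein = [(1, (0.0, 0.0, 0.0)), (1, (1.0, 0.0, 0.0)),
           (2, (10.0, 0.0, 0.0)), (3, (0.0, 3.9, 0.0))]
rna = [(0.0, 0.0, 3.0)]
near = binding_residues(protein, rna)

# Fraction of binding residues in RB33 from the counts in the text.
rb33_ratio = 882 / 9785  # about 9% of residues are RNA-binding
```

The low proportion of positive examples is the class imbalance that motivates reporting MCC and F-scores alongside accuracy.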
Framework for predicting RNA-interacting residues
For the WildSpan component (ProteRNAWildSpan), we follow the protein-based mining mode suggested by the authors, which requires at most 150 unique homologous proteins with sequence identity ranging from 30% to 90%; these are collected by searching the Swiss-Prot sequence database with PSI-BLAST (blastpgp -j 6). We then ran WildSpan with its default parameters to obtain patterns. WildSpan cannot generate any pattern if too few homologous proteins are found in the Swiss-Prot protein sequence database or if the homologous proteins are too similar to one another.
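The homolog selection step can be sketched as a simple filter over search hits. The input layout (accession, sequence, percent identity), the inclusive identity bounds and the interpretation of "unique" as distinct sequences are assumptions for illustration; parsing of actual PSI-BLAST output is not shown.

```python
def select_homologs(hits, min_identity=30.0, max_identity=90.0, max_hits=150):
    """Keep up to `max_hits` hits with distinct sequences whose percent
    identity to the query lies in [min_identity, max_identity].

    hits: iterable of (accession, sequence, percent_identity),
          assumed to be in search-result order.
    """
    selected, seen = [], set()
    for accession, sequence, identity in hits:
        if not (min_identity <= identity <= max_identity):
            continue  # too distant or too similar
        if sequence in seen:
            continue  # duplicate sequence under another accession
        seen.add(sequence)
        selected.append(accession)
        if len(selected) == max_hits:
            break
    return selected


# Toy hits with made-up accessions and sequences.
toy_hits = [
    ("P1", "MKVA", 95.0),   # too similar
    ("P2", "MKTA", 62.5),
    ("P3", "MKTA", 62.5),   # same sequence as P2
    ("P4", "MRTG", 31.0),
    ("P5", "AAAA", 12.0),   # too distant
]
chosen = select_homologs(toy_hits)
```

If the returned list is very short, pattern mining has little to work with, which matches the failure case described above.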
Significance and performance evaluation
MCC is used to measure prediction performance with the consideration of both under- and over-predictions, where MCC = 1 denotes a perfect prediction, MCC = 0 indicates a completely random assignment, and MCC = -1 means a perfectly reverse correlation.
Precision is used to assess prediction power on the positive class. The F-score (F1-score, β = 1) is the harmonic mean of precision and sensitivity. The F0.5-score (β = 0.5) weights precision twice as much as sensitivity.
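The measures described here can be computed from the four confusion-matrix counts. The sketch below uses the standard definitions of sensitivity, specificity, precision, accuracy, MCC and the F-beta score; the function name, the zero-division handling and the toy counts are assumptions and do not come from the paper's results.

```python
import math


def binary_metrics(tp, fp, tn, fn, beta=1.0):
    """Compute common binary classification measures from counts of
    true/false positives and negatives. Undefined ratios return 0.0."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    b2 = beta * beta
    f_denom = b2 * precision + sensitivity
    f_beta = (1 + b2) * precision * sensitivity / f_denom if f_denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "mcc": mcc,
        "f_beta": f_beta,
    }


# Toy counts (made up) for an imbalanced problem.
f1 = binary_metrics(tp=40, fp=10, tn=940, fn=10, beta=1.0)
f05 = binary_metrics(tp=40, fp=10, tn=940, fn=10, beta=0.5)
```

Setting beta below 1 shifts the score toward precision, which suits the use case of selecting a small number of residues for experimental follow-up.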
List of abbreviations
RBP: RNA-binding protein
RBD: RNA-binding domain
PSSM: Position-specific scoring matrix
NDB: Nucleic Acid Database
PDB: Protein Data Bank
SVM: Support vector machine
This research has been supported by the National Science Council and National Taiwan University. Funding for article processing charges: National Science Council, NSC 98-2627-B-002-018.
This article has been published as part of BMC Genomics Volume 11 Supplement 4, 2010: Ninth International Conference on Bioinformatics (InCoB2010): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S4.
- Lunde BM, Moore C, Varani G: RNA-binding proteins: modular design for efficient function. Nat Rev Mol Cell Biol. 2007, 8 (6): 479-490.
- Burd CG, Dreyfuss G: Conserved structures and diversity of functions of RNA-binding proteins. Science. 1994, 265 (5172): 615-621.
- Tinoco I, Bustamante C: How RNA folds. J Mol Biol. 1999, 293 (2): 271-281.
- Yan KS, Yan S, Farooq A, Han A, Zeng L, Zhou MM: Structure and conserved RNA binding of the PAZ domain. Nature. 2003, 426 (6965): 468-474.
- Soulard M, Della Valle V, Siomi MC, Pinol-Roma S, Codogno P, Bauvy C, Bellini M, Lacroix JC, Monod G, Dreyfuss G: hnRNP G: sequence and characterization of a glycosylated RNA-binding protein. Nucleic Acids Res. 1993, 21 (18): 4210-4217.
- Swanson MS, Nakagawa TY, LeVan K, Dreyfuss G: Primary structure of human nuclear ribonucleoprotein particle C proteins: conservation of sequence and domain structures in heterogeneous nuclear RNA, mRNA, and pre-rRNA-binding proteins. Mol Cell Biol. 1987, 7 (5): 1731-1739.
- Zahler AM, Lane WS, Stolk JA, Roth MB: SR proteins: a conserved family of pre-mRNA splicing factors. Genes Dev. 1992, 6 (5): 837-847.
- Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S: The Protein Data Bank. Acta Crystallogr D Biol Crystallogr. 2002, 58 (Pt 6 No 1): 899-907.
- Berman HM, Olson WK, Beveridge DL, Westbrook J, Gelbin A, Demeny T, Hsieh SH, Srinivasan AR, Schneider B: The nucleic acid database. A comprehensive relational database of three-dimensional structures of nucleic acids. Biophys J. 1992, 63 (3): 751-759.
- Bahadur RP, Zacharias M, Janin J: Dissecting protein-RNA recognition sites. Nucleic Acids Res. 2008, 36 (8): 2705-2716.
- Allers J, Shamoo Y: Structure-based analysis of protein-RNA interactions using the program ENTANGLE. J Mol Biol. 2001, 311 (1): 75-86.
- Jones S, Daley DT, Luscombe NM, Berman HM, Thornton JM: Protein-RNA interactions: a structural analysis. Nucleic Acids Res. 2001, 29 (4): 943-954.
- Treger M, Westhof E: Statistical analysis of atomic contacts at RNA-protein interfaces. J Mol Recognit. 2001, 14 (4): 199-214.
- Morozova N, Allers J, Myers J, Shamoo Y: Protein-RNA interactions: exploring binding patterns with a three-dimensional superposition analysis of high resolution structures. Bioinformatics. 2006, 22 (22): 2746-2752.
- Ellis JJ, Broom M, Jones S: Protein-RNA interactions: structural analysis and functional classes. Proteins. 2007, 66 (4): 903-911.
- Kim H, Jeong E, Lee SW, Han K: Computational analysis of hydrogen bonds in protein-RNA complexes for interaction patterns. FEBS Lett. 2003, 552 (2-3): 231-239.
- Towfic F, Caragea C, Dobbs D, Gemperline DC, Feihong W, Honavar V: Structural characterization of RNA-binding sites of proteins: Preliminary results. Bioinformatics and Biomedicine Workshops, 2007 BIBMW 2007 IEEE International Conference on: 2-4 Nov. 2007. 2007, 60-66.
- Zhou P, Zou J, Tian F, Shang Z: Geometric similarity between protein-RNA interfaces. Journal of Computational Chemistry. 2009, 30 (16): 2738-2751.
- Draper DE: Themes in RNA-protein recognition. J Mol Biol. 1999, 293 (2): 255-270.
- Draper DE: Protein-RNA Recognition. Annual Review of Biochemistry. 1995, 64 (1): 593-620.
- Wang L, Brown S: BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res. 2006, 34 (Web Server issue): W243-
- Tong J, Jiang P, Lu ZH: RISP: a web-based server for prediction of RNA-binding sites in proteins. Comput Methods Programs Biomed. 2008, 90 (2): 148-153.
- Kumar M, Gromiha MM, Raghava GP: Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins. 2008, 71 (1): 189-194.
- Wang Y, Xue Z, Shen G, Xu J: PRINTR: prediction of RNA binding sites in proteins using SVM and profiles. Amino Acids. 2008, 35 (2): 295-302.
- Cheng CW, Su EC, Hwang JK, Sung TY, Hsu WL: Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinformatics. 2008, 9 (Suppl 12): S6-
- Spriggs RV, Murakami Y, Nakamura H, Jones S: Protein function annotation from sequence: prediction of residues interacting with RNA. Bioinformatics. 2009, 25 (12): 1492-1497.
- Spriggs RV, Jones S: RNA-binding residues in sequence space: Conservation and interaction patterns. Computational Biology and Chemistry. 2009, 33 (5): 397-403.
- Jeong E, Chung IF, Miyano S: A neural network method for identification of RNA-interacting residues in protein. Genome Inform. 2004, 15 (1): 105-116.
- Jeong E, Miyano S: A weighted profile based method for protein-RNA interacting residue prediction. Lecture Notes in Computer Science. 2006, 3939: 123-
- Terribilini M, Lee J, Yan C, Jernigan R, Honavar V, Dobbs D: Prediction of RNA binding sites in proteins from amino acid sequence. RNA. 2006, 12 (8): 1450-
- Terribilini M, Sander JD, Lee JH, Zaback P, Jernigan RL, Honavar V, Dobbs D: RNABindR: a server for analyzing and predicting RNA-binding sites in proteins. Nucleic Acids Res. 2007, 35 (Web Server issue): W578-584.
- Caragea C, Sinapov J, Honavar V, Dobbs D: Assessing the Performance of Macromolecular Sequence Classifiers. Bioinformatics and Bioengineering, 2007 BIBE 2007 Proceedings of the 7th IEEE International Conference on: 2007. 2007, 320-326.
- McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics. 2000, 16 (4): 404-405.
- Hsu C-M, Chen C-Y, Hsu C-C, Liu B-J: Efficient Discovery of Structural Motifs from Protein Sequences with Combination of Flexible Intra- and Inter-block Gap Constraints. 2006, 530-539.
- Chen CY, Tsai HK, Hsu CM, May Chen MJ, Hung HG, Huang GT, Li WH: Discovering gapped binding sites of yeast transcription factors. Proc Natl Acad Sci U S A. 2008, 105 (7): 2527-2532.
- Hsu CM, Chen CY, Liu BJ: MAGIIC-PRO: detecting functional signatures by efficient discovery of long patterns in protein sequences. Nucleic Acids Res. 2006, 34 (Web Server issue): W356-361.
- Su CT, Chen CY, Hsu CM: iPDA: integrated protein disorder analyzer. Nucleic Acids Res. 2007, 35 (Web Server issue): W465-472.
- Chien TY, Chang DT, Chen CY, Weng YZ, Hsu CM: E1DS: catalytic site prediction based on 1D signatures of concurrent conservation. Nucleic Acids Res. 2008, 36 (Web Server issue): W291-296.
- Hsu CM, Chen CY, Liu BJ, Huang CC, Laio MH, Lin CC, Wu TL: Identification of hot regions in protein-protein interactions by sequential pattern mining. BMC Bioinformatics. 2007, 8 (Suppl 5): S8-
- Towfic F, Caragea C, Gemperline D, Dobbs D, Honavar V: Struct-NB: Predicting Protein-RNA Binding Sites Using Structural Features. International Journal of Data Mining and Bioinformatics. 2008
- Soler N, Fourmy D, Yoshizawa S: Structural insight into a molecular switch in tandem winged-helix motifs from elongation factor SelB. J Mol Biol. 2007, 370 (4): 728-741.
- Hoang C, Chen J, Vizthum Caroline A, Kandel JM, Hamilton Christopher S, Mueller EG, Ferré-D'Amaré AR: Crystal Structure of Pseudouridine Synthase RluA: Indirect Sequence Readout through Protein-Induced RNA Structure. 2006, 24 (4): 535-545.
- Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armananzas R, Santafe G, Perez A: Machine learning in bioinformatics. Brief Bioinform. 2006, 7 (1): 86-112.
- Bhaskar H, Hoyle DC, Singh S: Machine learning in bioinformatics: a brief survey and recommendations for practitioners. Comput Biol Med. 2006, 36 (10): 1104-1125.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.