Computational prediction of associations between long non-coding RNAs and proteins
© Lu et al.; licensee BioMed Central Ltd. 2013
Received: 20 May 2013
Accepted: 17 September 2013
Published: 24 September 2013
Skip to main content
© Lu et al.; licensee BioMed Central Ltd. 2013
Received: 20 May 2013
Accepted: 17 September 2013
Published: 24 September 2013
Though most of the transcripts are long non-coding RNAs (lncRNAs), little is known about their functions. lncRNAs usually function through interactions with proteins, which implies the importance of identifying the binding proteins of lncRNAs in understanding the molecular mechanisms underlying the functions of lncRNAs. Only a few approaches are available for predicting interactions between lncRNAs and proteins. In this study, we introduce a new method lncPro.
By encoding RNA and protein sequences into numeric vectors, we used matrix multiplication to score each RNA–protein pair. This score can be used to measure the interactions between an RNA–protein pair. This method effectively discriminates interacting and non-interacting RNA–protein pairs and predicts RNA–protein interactions within a given complex. Applying this method on all human proteins, we found that the long non-coding RNAs we collected tend to interact with nuclear proteins and RNA-binding proteins.
Compared with the existing approaches, our method shortens the time for training matrix and obtains optimal results based on the model being used. The ability of predicting the associations between lncRNAs and proteins has also been enhanced. Our method provides an idea on how to integrate different information into the prediction process.
Recent studies show that only a small part of the human transcriptome is involved in the protein-coding process . Long non-coding RNAs (lncRNAs) comprise the majority of transcripts; however, little is known about the function of lncRNAs . The GENCODE project discovered more than 14,000 lncRNA transcripts from approximately 9,000 gene loci . The lncRNA database collected a few hundred of high-confidence, experimentally validated lncRNAs . For example, the Xist RNA of humans is a large RNA sequence (19 kb) that has remained untranslated . Xist RNA has been proven to have an important function in regulating X-inactivation [6, 7]. Although an increasing list of evidence demonstrates that lncRNAs may be involved in multiple biological processes, including epigenetic regulation, chromatin remodeling, and cell proliferation and differentiation, the molecular mechanisms underlying the functions of lncRNAs still remain largely elusive.
In general, lncRNAs function with their binding proteins. Thus, in order to understand the molecular mechanisms underlying the functions of lncRNAs, the binding proteins of lncRNAs need to be identified. Several experimental approaches such as RNA immunoprecipitation followed by mass spectrometry analysis have been developed to identify the lncRNA binding proteins. Recently, Bellucci et al.  developed a computational method CatRAPID for predicting RNA–protein interactions. They further used the method on the predictions of protein interactions in the Xist network . The method involves encoding protein-RNA pairs into feature vectors, identifying a matrix, and calculating the interaction score through matrix computation. Their results showed that the method is powerful and may be used to predict RNA–protein interactions from sequences.
This study aims to explore a new method, lncPro, for predicting lncRNA–protein interactions. lncPro yields a score using amino acid and nucleotide sequences. This score can be used to measure the interaction between a pair of lncRNA and protein. Fisher’s linear discriminant method was used to compute the matrix directly. Based on the mathematical model being used, the result was found to be theoretically optimal in the sense of discriminant. Applying lncPro on all human proteins, we found that the long non-coding RNAs we collected tend to interact with nuclear proteins and RNA-binding proteins. A convenient online server (http://cmbi.bjmu.edu.cn/lncpro) has been developed for lncPro.
A training set containing many pairs of proteins and RNAs is needed. Information on the interactivity of each pair is also required.
Complexes used in the training set
Complex ID (PDB database)
1GIY, 3HUW, 3I8I
Canis lupus familiaris
After collecting RNA–protein pairs, a criterion is needed to determine whether a pair is interactive or non-interactive. The “least atom distance” was used as the criterion : assume that R is an RNA molecule and P is a protein molecule. If there exists an atom r of R and an atom p of P such that the distance between r and p is less than 5 Å, the pair (R and P) is considered to be interactive. Otherwise, the pair is non-interactive. The distance cutoff 5 Å was borrowed from the PRIDB database . RNA–protein pairs in the training set can now be classified based on interactivity. Each pair in the training set was checked, yielding 355 interactive pairs and 371 non-interactive pairs. Since some sequences in this dataset are similar, this set will be called the redundant training set in the following sections. If these highly similar sequences were assigned to the training set and test set respectively, the evaluation of the performance would be biased. The CD-HIT tool (http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin) was used to compute sequence similarity for both protein and RNA sequences. Pairs that share both the protein sequence and the RNA sequence are considered to be similar, and thus removed from the training set. The similarity cutoff was set at 90% for both protein and RNA. After redundancy removal, a training set containing 649 protein-RNA pairs, including 322 interactive pairs and 327 non-interactive pairs, was obtained. This set will be called the non-redundant training set.
RNAsubopt from the Vienna Package  was used to predict the secondary structure of RNAs. RNAsubopt provides n possible forms of secondary structure with the lowest free energy. Different selections of n may lead to differences in performance. The Discriminative Power (Details of discriminative power can be found in the “Measuring the performance of the method” section) is used as an indicator of the performance. The Discriminative Power does not change significantly when n is set at values of 5, 6, 7, or 8 (4-fold cross validation gave discriminative power values of 90.5, 90.3, 91.2, and 91.6, respectively). Since a larger n value brings more computation, n was set at six. When an RNA sequence was given, six results were obtained in the form of dots and brackets. Each bracket was replaced with “1”, and each dot was replaced with “0”. Thereafter, six binary sequences were added to obtain a new feature vector with integers between zero and six.
Predator  software was used to predict the secondary structure of proteins. Chou-Fasman propensities were used to encode each amino acid . By replacing each amino acid in the sequence with the corresponding Chou-Fasman propensity, the sequence was transformed into a numerical feature vector.
Purine and pyrimidine contact information from a set of 41 RNA–protein complexes  was used to encode the RNA numerical feature vectors for hydrogen bonding and Van der Waals’ interaction. The RNA feature vectors showed the propensities of the atoms to form hydrogen bonds and Van der Waals’ interaction. The performance evaluation may be biased if the set of 41 RNA–protein complexes for purine and pyrimidine contact information have significant overlaps with our training set. Among these 41 complexes, only one complex (1JJ2) is shared with our training set and 12 RNA-protein pairs are involved. We further used the CD-HIT tool to check whether there are significant similarities for the other sequences. When the cutoff was set at 0.9 for both RNA and protein, only three RNA-protein pairs are shared between our training set and those 41 complexes. Therefore, these 41 RNA–protein complexes for purine and pyrimidine contact information do not have significant overlap with the training set.
For the proteins, hydrogen bonding feature vectors were encoded using Grantham’s propensities  and Zimmerman’s propensities . Feature vectors of Van der Waals’ interaction were encoded using Kyte-Doolittle  and Bull-Breese propensities . Together with the secondary structure feature vector, each protein sequence was encoded into five numerical feature vectors.
Each protein sequence and each RNA sequence were transformed into five and three numerical feature vectors, respectively. However, these vectors cannot be used for direct computations because the dimension of each vector depends on the length of the corresponding RNA or protein sequence, which makes it impossible to find a fixed matrix M to conduct the computation. Therefore, the vectors need to be transformed in order to unify the dimension.
Where L is the length of the original feature vector.
Here, the first 10 terms of the Fourier series were used as the new numerical feature vector. Dimension 10 was selected because the results did not improve significantly with a higher dimension and the computation is faster when the dimension is set lower. When the dimension was set at 10, 15, and 20, we acquired Discriminative Power (Details of discriminative power can be found in the “Measuring the performance of the method” section) values 90.3%, 89.8%, and 91.9% respectively. After transforming each RNA–protein pair into eight 10-dimensional numerical feature vectors (three for RNA and five for protein), the feature vector encoding process was completed.
For each pair of feature vectors r and p (representing the RNA feature vector and the protein feature vector, respectively), we want to train a matrix M and use the score < p|M|r > to measure the interaction between r and p. M will be a 100-D matrix because the dimension of vectors was set at 10. If we unsystematically search the matrix in the 100-D Euclidean space, the efficiency and accuracy would be low. The efficiency and accuracy will be further degraded when a higher dimension is used.
Let us analyze the expansion of < p|M|r>. Without loss of generality, the situation of dimension two is used to clarify the idea:
Assuming that p = (p 1 , p 2 ), r = (r 1 , r 2 )T,
Here, k = (M 1, M 2, M 3, M 4), x = (p 1 r 1, p 1 r 2, p 2 r 1, p 2 r 2)T.
Here, m i represents the mean of each category. . The subscript denotes two different classes.
Where k = (M 1, M 2, …, M 100) and x = (p 1 r 1, p 1 r 2, …, p 10 r 10)T; both are 100-dimensional vectors. After transforming each pair of r and p in the training set into (p 1 r 1, p 1 r 2, …, p 10 r 10)T, Fisher’s linear discriminant method was applied to these 100-dimensional vectors. Subsequently, the optimal direction k was computed directly.
Thus, the problem of finding a matrix M in the 100-dimensional Euclidean space was transformed into another equivalent problem of finding a vector k. Using Fisher’s linear discriminant method, k was computed directly from known data, thereby simplifying the process. Based on the mathematical model being used, the obtained result was found to be theoretically optimal.
However, Equation (6) will lead to cross terms such as < p 1 |M|r 2 >. Assuming that p 1 indicates the protein secondary structure information and that r 2 indicates the RNA hydrogen bonding information, then the computation of < p 1 |M|r 2 > is nonsensical because combining different kinds of information is theoretically meaningless. Thus, we selected another combining method. We computed five scores using each feature vector pairs, respectively, which included: the protein and RNA secondary structures, protein Grantham’s propensities and RNA hydrogen bonding, protein Zimmerman’s propensities and RNA hydrogen bonding, protein Kyte-Doolittle propensities and RNA Van der Waals’ interaction, and protein Bull-Breese propensities and RNA Van der Waals’ interaction. Here the Grantham and Zimmerman scores are both characterizing the protein hydrogen bonds, the Kyte-Doolittle and Bull-Breese scores are both characterizing the protein Van der Waals’ interaction.
Where X is the raw score, Y is the transformed score, c 1 = k * m 1 , c 2 = k * m 2 , and c = (c 1 + c 2 )/2 is the mean of c 1 and c 2 . In c 1 = k * m 1 , and c 2 = k * m 2 , m 1 and m 2 are the mean vectors of positive and negative sets, respectively. c = (c 1 + c 2 )/2 can be considered as the separate point of interactive pairs and non-interactive pairs. If X> > c, then Y is near 100. If X < <c, then Y is near zero. If X = c, then Y = 50. Thus, the cutoff will be decided at 50, naturally.
According to this formula, we can transform all five scores into a scale ranging from 0 to 100. We considered two options to combine the five scores. The first is the arithmetic mean of the five scores. The second is to set weights at 1/3, 1/6, 1/6, 1/6, and 1/6, respectively, considering the Grantham and Zimmerman scores are both characterizing hydrogen bonds and the Kyte-Doolittle and Bull-Breese scores are both characterizing the Van der Waals’ interaction. We observed the prediction of different cases. The former method is more accurate than the latter. Therefore, the arithmetic mean of the five scores was used as the final score.
After all pairs were reordered with the score provided by our method, DP = 1 if all the interactive pairs are ordered before the non-interactive ones. If all the non-interactive pairs are ordered before interactive ones, then DP = 0. Thus, DP can be used to determine whether the method could discriminate the positive and negative sets well.
Furthermore, each pair was given a score between 0 and 100 using the method above. The cutoff is naturally decided as 50 according to the formula we used. Thus, a pair is interactive if it has a score over 50; otherwise, the pair is non-interactive. Using this cutoff, we can calculate the accuracy of the prediction.
Here, TP is the number of true positives; FP is the number of false positives; TN is the number of true negatives; FN is the number of false negatives.
First, we used cross-validation to test the classification ability of the method on the training set. Then the ncRNA–Protein Interaction (NPInter) database and several complexes were used to test the ability of the prediction. Finally, we used the method on all human proteins.
Since the CD-HIT tool has been used to delete similar sequences, a 4-fold cross-validation could be performed on the non-redundant training set. The mean DP was taken as the final result. Our method obtained a DP value 90.3% on the non-redundant set. Since there are some similar pairs in the redundant set, a higher discriminative power value 94.3% was obtained on the redundant set. Compared with the results of CatRAPID (78% on the non-redundant set and 90% on the redundant set), this method showed a better ability to discriminate interactive and non-interactive pairs. The Matthews Correlation Coefficient (MCC) was also computed. We obtained an MCC value of 0.74 on the non-redundant set and a higher MCC value of 0.83 on the redundant set. This is consistent with the previous DP results.
CatRAPID used 7 Å as the cutoff between interacting and non-interacting RNA-protein pairs and RNA sequences shorter than 100 bp were also kept in the training set. To compare with CatRAPID, the method was also tested on the other set when distance cutoff was set at 7 Å and RNA sequences shorter than 100 bp were kept in the training set. A 4 fold cross-validation was performed. We obtained DP values of 90.2% and 91.6% on the non-redundant set and the redundant set, respectively, which were more accurate than CatRAPID. MCC values were 0.77 on the non-redundant set and 0.78 on the redundant set.
Discriminative power for each score (non-redundant set)
Discriminative power for each score (redundant set)
We further checked whether a strict sequence similarity cutoff for the non-redundant training set would influence the prediction performance dramatically. When a strict sequence similarity cutoff was used (0.3 for protein and 0.8 for RNA), 21 pairs were deleted from the non-redundant training set. A DP value of 87.1% was then obtained. The result was still more accurate than CatRAPID (78% on the non-redundant set).
ncRNA binds to the protein;
ncRNA regulates the mRNA;
ncRNA indirectly regulates a gene;
ncRNA is regulated by the protein;
ncRNA as a factor affects the protein’s function;
The protein as a factor affects the ncRNA’s function;
Genetic linkages between the ncRNA and the protein;
Special linkages between the ncRNA and the protein.
All pairs belong to a positive set (interactive). We considered groups 1, 5, and 6 to have direct evidence of interaction. The rest of the groups were considered to have indirect evidence. Since the RNAsubopt software will only process the first 4095 nucleotides if the input RNA is too long, sequences longer than 4095 were deleted. Finally, 74 pairs with direct evidence of interaction and 41 pairs with indirect evidence were obtained. The CD-HIT tool was used to check if there is significant similarity between the training set and the pairs from NPInter. When both cutoffs were set at 0.9 for protein and RNA sequences, CD-HIT showed that there was no overlapping. Among the 74 pairs with direct evidence, 48 pairs have scores over 50, suggesting that 65% of the interactive pairs were predicted. Among the 41 pairs with indirect evidence, we predicted 27 pairs, which accounted for 66%.
Both pairs with direct and indirect evidence (115 pairs) were used as the positive set. Then we used another 115 pairs randomly from the non-redundant negative training set as the negative set. A DP value of 91.7% was obtained, which was higher than the result of CatRAPID (85%). The MCC value was 0.64, which also indicates acceptable classification ability.
By shuffling pairs in the non-redundant negative training set, we built a randomized set containing 327 pairs. The mean value of predicted interaction scores on this set was 32.71 and the distribution of the predicted interaction scores is displayed in Additional file 3: Figure S1. We further randomly selected 115 of them as the negative set. When this new negative set was used, a DP value of 86.7% was obtained for the positive set, which includes both pairs with direct and indirect evidence (115 pairs). These results confirmed that the classification performance of lncPro is satisfactory.
We also studied the performance of the method on several complexes. First, we took MRP, RNase P, PRC-2, and LSD1/CoREST/REST complex into consideration. The RNA sequences of MRP and RNase P were obtained from the Functional RNA Database (fRNAdb; http://www.ncrna.org/frnadb). The sequence of HOTAIR and MEG3 were downloaded from the National Center for Biotechnology Information database (http://www.ncbi.nlm.nih.gov/). The protein sequences were obtained from the Uniprot database (http://www.uniprot.org/).
The human MRP complex contains one RNA sequence and ten protein sequences: hPop1, hPop5, Rpp14, Rpp20, Rpp21, Rpp25, Rpp29, Rpp30, Rpp38, and Rpp40. RNase P is another complex inside the human body. RNase P shares proteins with the MRP system but has a different RNA sequence. Previous studies focused on MRP and RNase P [20, 21]. The results can be summarized as follows. hPop1, Rpp20, Rpp21, Rpp25, Rpp29, and Rpp38 have direct interactions with the corresponding RNA, whereas hPop5, Rpp14, Rpp30, and Rpp40 have relatively weak interactions. We applied the method on these two complexes. The results are shown in Table 4. Considering hPop5, Rpp14, Rpp30, and Rpp40 are non-interactive, we then correctly predicted 70% (7 of 10) of the interactions for the MRP and RNase P complexes.
Interaction scores of MRP and RNase P
RNase P result
RNase P exp*
Interaction scores of PRC-2 with HOTAIR and MEG3
Interaction scores of LSD1/CoREST/REST complex and HOTAIR
Interaction scores with roX1 and roX2
Most well-studied lncRNAs are located in the nucleus of cells. Recently, non-coding RNAs have been found to be predominantly localized in the nucleus . Therefore, it is natural to obtain higher interaction scores between nuclear proteins and non-coding RNAs. We used our method to see whether the scores of the nuclear proteins are indeed higher than those of other proteins.
Following a similar procedure, we also compared the CDF of the RNA-binding protein scores to that of the overall protein scores for each given lncRNA. Among the 20224 protein sequences of Homo sapiens we collected, 639 were annotated as the RNA-binding proteins according to the Cellular Component of Gene Ontology (GO:0003723, RNA binding, downloaded from BioMart, Additional file 4: Table S3). 56.9% (41 out of 72) sequences have significant p-values (Additional file 6: Table S5). Thus, more than half of these lncRNAs have significantly higher predicted interaction scores with known RNA-binding proteins.
In this study, we introduced a new method lncPro for the prediction of protein associations with lncRNAs. Compared with existing methods, our method shortened the time for training matrix M. This matrix was also found to be theoretically optimal based on the model being used. Our method is computational friendly and does not lead to nonsensical cross terms. The comparison results with CatRAPID also show that our method has enhanced abilities of predicting the associations between lncRNAs and proteins. Specifically, we found the human lncRNAs we collected tend to interact with nuclear proteins and RNA-binding proteins.
However, the process of finding proteins that directly interact with a given lncRNA is still unsatisfactory because of the large number of proteins. Only when the complex is provided can the prediction of interaction within the complex be more accurate. Also, most RNA-protein pairs in our training set exist in the ribosome, so the training data might not be general enough. Though we tested the method on some non-ribosome complexes and it performed well. We should still be aware of the limited range of cases being used. To conduct a more accurate prediction, further work needs to be performed and more information should be considered. The use of Fisher’s linear discriminant method provides direction on how to incorporate different information into the prediction process.
Another thing is that lncPro meets some computational issues when dealing with very long non-coding RNAs. This is limited by the computational ability of RNA secondary structure prediction program. We will update lncPro when new software is available. When we study a long RNA sequence, sometimes we are only interested in certain sections of this sequence. lncPro can be applied if the particular section is not too long.
This work is supported by the National Basic Research Program [2011CBA01104, 2012CB316504], the National High-tech R&D Program [2012AA020401] of China, and the National Natural Science Foundation of China [31371337 and 31030041]. The funding agencies played no active role in the study, design, data collection, analysis, decision to publish, or preparation of the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.