A Bayesian approach for identifying miRNA targets by combining sequence prediction and gene expression profiling
© Yufei et al. 2010
Published: 01 December 2010
MicroRNAs (miRNAs) are single-stranded non-coding RNAs shown to play important regulatory roles in a wide range of biological processes and diseases. The functions and regulatory mechanisms of most miRNAs are still poorly understood, in part because of the difficulty in identifying miRNA regulatory targets. To this end, computational methods have evolved as important tools for genome-wide target screening. Although considerable work in the past few years has produced many target prediction algorithms, most of them are based solely on sequence, and their accuracy is still poor. In contrast, gene expression profiling from miRNA transfection experiments can provide additional information about miRNA targets. However, most existing research assumes that down-regulated mRNAs are the targets. Given that the primary function of miRNA is the inhibition of protein translation, mRNA down-regulation is neither a sufficient nor a necessary condition for being a target.
A novel Bayesian approach is proposed in this paper that integrates sequence level prediction with expression profiling of miRNA transfection. This approach does not restrict targets to be down-expressed and thus improves on the performance of existing target prediction algorithms. The proposed algorithm was tested on simulated data, proteomics data, and IP pull-down data and was shown to achieve better performance than existing approaches for target prediction. All the related materials, including source code, are available at http://compgenomics.utsa.edu/expmicro.html.
The proposed Bayesian algorithm properly integrates sequence pairing data and mRNA expression profiles for miRNA target prediction. This algorithm is shown to have better prediction performance than existing algorithms.
MicroRNAs (miRNAs) are single-stranded non-coding RNAs about 19 to 25 nucleotides in length. MiRNAs are known to inhibit target translation or cleave target mRNA by binding to complementary sites in the 3’ untranslated region (UTR) of targets. The importance of miRNA regulation lies in the fact that a single miRNA is estimated to regulate hundreds of targets. As a result, miRNAs have been shown, or are speculated, to play many important post-transcriptional regulatory roles in a wide range of biological processes and diseases including development, stress responses, viral infection, and cancer [2–5]. Despite rapid advances in miRNA research, the detailed functions and regulatory mechanisms of most miRNAs are still poorly understood. To gain a better understanding, an important task is to identify the regulatory targets of miRNAs. However, current knowledge of targets lags far behind that of miRNAs. In the miRNA registry miRBase, 969 human miRNAs are annotated; in contrast, only 815 targets of 121 human miRNAs are recorded in the most up-to-date target database, miRecords. Given that the number of targets of each miRNA could be in the hundreds, the reported number of verified targets accounts for only a very small fraction of the potential human targets. This fact underscores the urgent need for effective target identification methods, and, for genome-wide target discovery, computational prediction preceding experimental testing is a preferable, efficient strategy. Considerable advances have been made in computational target prediction, and many prediction algorithms have been proposed, mainly based on various important features of miRNA:target nucleotide sequence interaction. Although different algorithms utilize different sets of features, a few important features, including “seed region complementarity”, “binding free energy”, and “sequence conservation”, are among the most common ones.
Depending on how these features are derived, the algorithms using sequence binding data can be further categorized as rule based or data driven. In the rule based algorithms, features are determined from prior knowledge of miRNA binding; these algorithms include TargetScan, miRanda, PITA, DIANA-microT, RNAhybrid, microInspector, MovingTargets, and Nucleus. In contrast, in the data driven algorithms, the features are partially or entirely determined by the algorithm itself from the training data, that is, the existing sequence binding data of verified positive and negative miRNA:target pairs. The data driven algorithms include MirTarget [15, 16], PicTar, miTarget, rna22, NBmiRTar, Targeting, and SVMicrO. Given sufficient training data, the data driven algorithms hold the promise of outperforming the rule based algorithms, since they can uncover important features from data that cannot be easily observed otherwise.
Despite these efforts, the existing algorithms using sequence data alone still have poor prediction specificity and sensitivity [23, 24]. The first reason for the deficient performance is the poor understanding of the precise mechanisms underlying miRNA:target interaction [25–27]; as a result, the features adopted in the rules are not yet as specific and sensitive as needed. Secondly, the verified positive and negative training data essential for good performance of data driven algorithms are particularly lacking, and the limited verified data can hardly capture the important features of the different aspects of miRNA:target interactions, which hampers the ability of data driven algorithms to select discriminative features. These facts motivated us to incorporate data other than sequence pairing to further improve the prediction performance of existing algorithms.
Microarray profiling of differential gene expression after miRNA transfection is a widely adopted approach to investigating the impact of miRNA regulation. Such gene expression profiles have been used in a variety of studies for predicting miRNA targets. However, the majority of existing research relies on the assumption that miRNA targets are down-expressed in the microarray and thus searches for potential targets within the intersection of the sequence level predictions and the down-regulated genes [29, 30]. Given that the primary function of miRNA is translation inhibition, with target mRNA degradation being the secondary mode of regulation, down-expression of mRNA is neither a sufficient nor a necessary condition for miRNA regulation. Therefore, this practice is unlikely to greatly reduce the high false positive rate; on the contrary, it further degrades the prediction sensitivity.
where the second equality follows from the assumption that e and S are independent, and α(e) and β(S) are the APPs of t given e and S, respectively. Although e and S are not independent in reality, this assumption reduces the complexity of the modeling and the subsequent computation. Additionally, the Naïve Bayes formulation has been shown to achieve satisfactory performance even when the data are correlated. We discuss next the models and approaches for calculating α(e) and β(S), respectively.
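As a minimal sketch of this combination step, the snippet below multiplies the two APPs to obtain an integrated score, which is how the score is described later in the text. The function name `combined_score`, the gene labels, and the probability values are hypothetical and chosen only for illustration. The product is treated as an unnormalized ranking score, on the assumption that any constant normalizing factor does not change the ordering of candidates.

```python
def combined_score(alpha_e: float, beta_s: float) -> float:
    """Product of the APP given expression and the APP given sequence.

    This is an unnormalized score intended for ranking candidates.
    """
    for name, p in (("alpha_e", alpha_e), ("beta_s", beta_s)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability, got {p}")
    return alpha_e * beta_s


# Hypothetical candidates: (gene label, alpha(e), beta(S)).
candidates = [("geneA", 0.9, 0.2), ("geneB", 0.6, 0.7), ("geneC", 0.1, 0.95)]

# Rank candidates from the highest to the lowest integrated score.
ranked = sorted(candidates, key=lambda c: combined_score(c[1], c[2]), reverse=True)
ranked_names = [name for name, _, _ in ranked]
print(ranked_names)
```

In this toy ranking, a candidate with moderate support from both data types can outrank one that is strongly supported by only one of them, which is the intended effect of the product form.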
where α_0 and α_1 are the parameters to be trained.
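The logistic mapping from a sequence prediction score to an APP can be sketched as follows. The exact form of the function is not reproduced in this text, so the snippet assumes the standard two-parameter logistic, 1 / (1 + exp(−(α_0 + α_1 S))). The function name `beta` and the parameter values in the example are hypothetical; in practice α_0 and α_1 would be fitted to training data.

```python
import math


def beta(score: float, alpha0: float, alpha1: float) -> float:
    """Map a prediction score S to a probability with a logistic function.

    Assumed form: 1 / (1 + exp(-(alpha0 + alpha1 * S))).
    The two branches avoid overflow for large |alpha0 + alpha1 * S|.
    """
    z = alpha0 + alpha1 * score
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical parameters and scores, for illustration only.
for s in (-2.0, 0.0, 2.0):
    print(s, beta(s, alpha0=-1.0, alpha1=1.5))
```

A positive α_1 makes the APP increase monotonically with the score, so the ordering of the sequence level predictions is preserved.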
Microarray Data Source of Negative Samples
where μ and σ² are the means and variances of the respective Gaussian components, the subscripts + and − denote the positive (t = 1) and negative (t = 0) targets, π_+ + π_− = 1, and θ represents the collection of the model parameters. Given model (3), the goal is to uncover the mixture components from the expression data, which is equivalent to estimating the parameters from the expression data. Note that since the number of positive targets is only in the hundreds, π_+ is very small, which means that the positive-target component is much weaker than the negative-target component and is likely to be completely buried in the mixture. This is illustrated by Figure 3-(a), which plots the histogram of the genome-wide expression of 11988 human mRNAs for transfection of hsa-miR-124. Since the true targets of a miRNA account for only a very small portion of the entire genome, the histogram looks more like a single Gaussian than a mixture of two. Unless additional information about the expression of positive data is available, the estimation of the positive component from the mixture is under-determined and there could be a large number of suboptimal solutions. Fortunately, the expression data of experimentally validated targets are available. These expression levels, although limited in quantity, can be used to aid the estimation of the positive component.
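To make the "buried component" point concrete, the following sketch evaluates a two-component Gaussian mixture of the kind described above and compares it with the negative component alone. All numerical values are hypothetical, and the second argument of each Gaussian is assumed to be a variance. The function `positive_membership` applies Bayes' rule to the mixture; it is shown only for illustration and is not necessarily the α(e) used in the paper.

```python
import math


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(e, pi_pos, mu_pos, var_pos, mu_neg, var_neg):
    """Two-component mixture: pi_pos * N(+) + (1 - pi_pos) * N(-)."""
    return (pi_pos * normal_pdf(e, mu_pos, var_pos)
            + (1.0 - pi_pos) * normal_pdf(e, mu_neg, var_neg))


def positive_membership(e, pi_pos, mu_pos, var_pos, mu_neg, var_neg):
    """Probability that e was drawn from the positive component (Bayes' rule)."""
    pos = pi_pos * normal_pdf(e, mu_pos, var_pos)
    return pos / mixture_pdf(e, pi_pos, mu_pos, var_pos, mu_neg, var_neg)


# Hypothetical parameters: about 1% positives, slightly down-shifted.
params = dict(pi_pos=0.01, mu_pos=-0.5, var_pos=0.5, mu_neg=0.0, var_neg=0.4)

# Near the mode, the mixture is almost identical to the negative component.
ratio_at_zero = mixture_pdf(0.0, **params) / normal_pdf(0.0, 0.0, 0.4)
print("mixture / negative-only density at e = 0:", ratio_at_zero)

for e in (-2.0, -1.0, 0.0, 1.0):
    print(e, positive_membership(e, **params))
```

With such a small π_+, the mixture density differs from the negative-only density by well under one percent near the mode, which is why a histogram alone gives little information about the positive component.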
Under the Bayesian framework, the goal of estimating model parameters θ is to obtain the posterior distribution
p(θ | e) ∝ p(e | θ) p(θ)    (4)
N = 209 in our case, ē_p and s² are the sample mean and variance of e_p, and all other parameters with subscript 0 are the same as those in (5), which define the noninformative prior. Next, for the noninformative priors in (5) and (6), the parameters are chosen as:
μ_− = 0, σ_− = 5, μ_0 = 0, κ_0 = 0.2, α_0 = 0.2, β_0 = 0.2.
Lastly, the parameters of the Dirichlet prior are chosen as γ_{+,0} = 200 and γ_{−,0} = 20000, which reflects the common belief that a miRNA regulates about 200 targets.
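The effect of these choices can be sketched with two small calculations. The prior equations (5) and (6) are not reproduced in this text, so the snippet assumes a standard conjugate Normal-Inverse-Gamma prior with hyperparameters (μ_0, κ_0, α_0, β_0) and applies the textbook update from a sample of validated-target expression values. The function names and the sample values are hypothetical. The second calculation is the prior mean of the mixture weights under the Dirichlet prior.

```python
from statistics import fmean


def nig_update(samples, mu0, kappa0, alpha0, beta0):
    """Textbook Normal-Inverse-Gamma update for a Gaussian with unknown
    mean and variance. Returns (mu_n, kappa_n, alpha_n, beta_n)."""
    n = len(samples)
    mean = fmean(samples)
    ss = sum((x - mean) ** 2 for x in samples)  # sum of squared deviations
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n


def dirichlet_mean(gammas):
    """Prior mean of the mixture weights under a Dirichlet prior."""
    total = sum(gammas)
    return [g / total for g in gammas]


# Hypothetical fold changes of validated targets (not real data).
validated = [-1.0, -1.0, -1.0, -1.0]
posterior = nig_update(validated, mu0=0.0, kappa0=0.2, alpha0=0.2, beta0=0.2)
print("updated hyperparameters:", posterior)

# Prior mean of (pi_+, pi_-) for gamma_{+,0} = 200 and gamma_{-,0} = 20000.
prior_weights = dirichlet_mean([200.0, 20000.0])
print("prior mean weights:", prior_weights)
```

Because κ_0 is small, even a handful of validated targets pulls the prior mean of the positive component close to their sample mean, while the Dirichlet parameters put the prior mean of π_+ at roughly 1%.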
Since the likelihood assumes the mixture model in (3), the posterior distribution cannot be obtained analytically. A Variational Bayes Expectation Maximization (VBEM) algorithm is applied to estimate the desired distributions.
where, as above, the inequality is due to Jensen's inequality, and q(π) and q(ϕ) are the free distributions introduced to approximate the unknown posterior distributions p(π|e) and p(ϕ|e). The distributions q(·) (or their parameters) are determined so as to maximize the lower bound (9). Using variational derivatives and a coordinate ascent procedure, the optimization can be carried out iteratively; the (j + 1)-th iteration operates as follows:
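The update equations themselves are not reproduced in this text. As a rough illustration of the kind of coordinate ascent involved, the sketch below implements a generic textbook variational Bayes procedure for a one-dimensional Gaussian mixture with a Dirichlet prior on the weights and a conjugate prior on each component. It is not the authors' algorithm: it applies the same prior to both components and does not use the informative prior built from validated targets. The function names, the synthetic data, and the initial means are all assumptions made for the example.

```python
import math
import random


def digamma(x: float) -> float:
    """Digamma function for x > 0: recurrence, then asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return shift + math.log(x) - 0.5 / x - series


def vb_gmm(data, init_means, gamma0, m0=0.0, kappa0=0.2, a0=0.2, b0=0.2, n_iter=50):
    """Generic variational Bayes for a 1-D Gaussian mixture.

    Returns (weights, means, variances) as posterior point summaries.
    """
    k = len(init_means)
    # Initialise responsibilities by hard assignment to the nearest mean.
    resp = []
    for x in data:
        nearest = min(range(k), key=lambda j: abs(x - init_means[j]))
        resp.append([1.0 if j == nearest else 0.0 for j in range(k)])

    for _ in range(n_iter):
        # Update the hyperparameters of q(pi) and of each component.
        gam, m, kap, a, b = [], [], [], [], []
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            xbar = sum(r[j] * x for r, x in zip(resp, data)) / nj
            ss = sum(r[j] * (x - xbar) ** 2 for r, x in zip(resp, data))
            gam.append(gamma0[j] + nj)
            kap.append(kappa0 + nj)
            m.append((kappa0 * m0 + nj * xbar) / (kappa0 + nj))
            a.append(a0 + 0.5 * nj)
            b.append(b0 + 0.5 * ss
                     + 0.5 * kappa0 * nj * (xbar - m0) ** 2 / (kappa0 + nj))

        # Update the responsibilities from the expected log terms.
        psi_total = digamma(sum(gam))
        const = [digamma(gam[j]) - psi_total
                 + 0.5 * (digamma(a[j]) - math.log(b[j])) - 0.5 / kap[j]
                 for j in range(k)]
        prec = [a[j] / b[j] for j in range(k)]  # expected precisions
        resp = []
        for x in data:
            log_rho = [const[j] - 0.5 * prec[j] * (x - m[j]) ** 2 for j in range(k)]
            top = max(log_rho)
            unnorm = [math.exp(v - top) for v in log_rho]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])

    weights = [g / sum(gam) for g in gam]
    variances = [b[j] / a[j] for j in range(k)]  # inverse expected precision
    return weights, m, variances


# Synthetic, well-separated data (not from the paper).
random.seed(0)
data = ([random.gauss(-3.0, 0.5) for _ in range(300)]
        + [random.gauss(0.0, 0.4) for _ in range(700)])
weights, means, variances = vb_gmm(data, init_means=(-2.0, 0.0), gamma0=(1.0, 1.0))
print(weights, means, variances)
```

The synthetic components here are deliberately well separated so that the procedure recovers them easily. With a very small positive weight, as in the setting discussed above, a sketch like this would depend heavily on the prior, which is the motivation for building an informative prior from validated targets.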
Distributions and parameters used to generate test data: N(0.75, 0.5); N(–0.5, 0.5); N(–0.75, 0.5); N(0, 0.4)
GMM parameters estimated by VBEM: N(–0.4714, 0.5573); N(0.0044, 0.3994)
The 3'UTR sequences of the human genome were downloaded from the UCSC Genome Browser MySQL database. Prediction of genome-wide targets of hsa-miR-124 and hsa-miR-1 based on the sequence pairing data was carried out by SVMicrO. The prediction scores were recorded for each mRNA and were then mapped to the APPs of being targets using the logistic function β(S) defined in (2). The gene expression profiles for transfection of hsa-miR-124 or hsa-miR-1 were obtained from published data, and the APPs of targets given the expression fold changes were calculated based on the function α(e) defined in (12) with heuristics. The integrated score was calculated based on (1) as the product of β(S) and α(e).
As mentioned before, most of the literature considers the overlap between sequence level prediction and down-regulated mRNAs for target prediction. The performance of this overlapping scheme was also evaluated. In Figure 11 and Figure 12, the black dot indicates the precision and recall of the method that considers the intersection of the SVMicrO predictions and the down-regulated mRNAs as targets. First, this overlapping method is outperformed by the proposed combined method. Secondly, the performance of the overlapping method is not consistent. In particular, for hsa-miR-124, the performance is slightly improved compared to SVMicrO, while for hsa-miR-1 the performance greatly deteriorates. By investigating the detailed prediction results, we found that some of the experimentally validated targets were not down-regulated but were predicted as positive by SVMicrO. Examples include NM080430, NM001078174, NM144706, NM001040402 and so on for hsa-miR-124 and NM002622 for hsa-miR-1. These positive predictions by SVMicrO were reverted to negative by the overlapping approach. This is the very reason why the precision cannot be increased. Therefore, the conclusion can be drawn once more that searching down-regulated mRNAs for targets is not an effective approach. Our proposed method provides a proper model for the true distribution of miRNA targets and, as a result, achieves improved performance.
In this paper, we presented a novel algorithm for miRNA target prediction that integrates sequence level prediction results with microarray expression profiling of miRNA transfection. A Gaussian mixture model was designed to model the gene expression profiles of the positive and negative targets, and a Bayesian algorithm was devised to integrate the data. The validation results on both proteomics and IP pull-down data demonstrated the superior performance of the proposed algorithm.
Since our algorithm integrates sequence data with microarray measurements of miRNA transfection, target prediction can be carried out only for the miRNAs for which both types of data are available. Since microarray measurements of genome-wide miRNA transfection are not yet available, it is still infeasible to conduct genome-wide prediction using this algorithm. However, as miRNA transfection becomes increasingly popular and indispensable for miRNA target identification, integrating the two data types becomes increasingly desirable. In an effort to provide prediction results, we retrieved around 20 miRNA over-expression microarray datasets from the GEO database. The prediction results can be found at http://expmicro.cbi.utsa.edu.
Subsequent work will focus on two aspects: first, extending the predictions to more miRNAs once both types of data become accessible, and second, improving the mathematical model to further increase the performance.
Hui Liu and Lin Zhang are supported by the talent introduction project of China University of Mining Technology, the set-sail project of China University of Mining Technology and Fok Ying-Tung Education Foundation for Young Teachers (121066). Yidong Chen is supported by NCI Cancer Center grant P30 CA054174-17 and NIH CTSA 1UL1RR025767-01. Shou-Jiang Gao is supported by NIH grants CA096512 and CA124332. Yufei Huang is supported by NSF grant CCF-0546345 and NIH grant CA096512. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM).
This article has been published as part of BMC Genomics Volume 11 Supplement 3, 2010: The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.