MicroRNA-disease association prediction by matrix tri-factorization

Background Biological evidence has shown that microRNAs(miRNAs) are greatly implicated in various biological progresses involved in human diseases. The identification of miRNA-disease associations(MDAs) is beneficial to disease diagnosis as well as treatment. Due to the high costs of biological experiments, it attracts more and more attention to predict MDAs by computational approaches. Results In this work, we propose a novel model MTFMDA for miRNA-disease association prediction by matrix tri-factorization, based on the known miRNA-disease associations, two types of miRNA similarities, and two types of disease similarities. The main idea of MTFMDA is to factorize the miRNA-disease association matrix to three matrices, a feature matrix for miRNAs, a feature matrix for diseases, and a low-rank relationship matrix. Our model incorporates the Laplacian regularizers which force the feature matrices to preserve the similarities of miRNAs or diseases. A novel algorithm is proposed to solve the optimization problem. Conclusions We evaluate our model by 5-fold cross validation by using known MDAs from HMDD V2.0 and show that our model could obtain the significantly highest AUCs among all the state-of-art methods. We further validate our method by applying it on colon and breast neoplasms in two different types of experiment settings. The new identified associated miRNAs for the two diseases could be verified by two other databases including dbDEMC and HMDD V3.0, which further shows the power of our proposed method.


Background
MicroRNAs(miRNAs), a class of small, endogenous, noncoding RNAs including approximately 22 nucleotides, could regulate post-transcription of gene expression and RNA silencing by binding specific target messenger RNAs through base-pairing interactions [1,2]. Since the first miRNA named lin-4 was found twenty years ago by Victor Ambros [3], with the development of technology, an increasing number of studies found that miRNAs play important roles in various stages of biological processes *Correspondence: liminli@mail.xjtu.edu.cn School of Mathematics and Statistics, Xi'an Jiaotong University, Xianning West 28, Xi'an, China [3], such as cell development [4], proliferation [5] and viral infection [6]. Meanwhile, biological experiments indicate that miRNAs are involved in close relationships with the emergence and development processes of various human diseases [7]. For example, the study in [8] showed that a chromosomal translocation at 12q5 could influence the expression of let-7 and finally could cause the repress of the oncogene High Mobility Group A2(Hmga2). Another example is that mir-7 could influence epidermal growth factor receptor (EGFR) expression and protein kinase B activity in head and neck cancer(HNC) [9]. Furthermore, the work in [10] showed that mir-15a is a potential marker to differentiate between benign and malignant renal tumors in biopsy and urine samples. It is very important to identify miRNA-disease associations (MDAs) for the research on disease mechanism and discovering disease biomarkers. Due to the high costs of the current biological technologies, computational methods are useful by prioritizing candidate miRNAs for specific diseases.
The main information used for predicting MDAs mainly includes miRNA similarities, disease similarities and the known MDAs. Generally, miRNA similarities could be computed by using functional or sequence information of miRNAs, disease similarities could be obtained by using the phenotype terms, and the known MDAs could be obtained from databases such as HMDD [11]. The main challenge in MDA prediction is how to optimally utilize these information and predict MDAs with a high accuracy. Based on these information, many computational models are proposed to predict new MDAs.
The existing methods follow two lines. The first line is to determine the link probabilistically by using random walk. For example, RWRMDA [12] adopts random walk on the miRNA functional similarity network. It first gives each miRNA an initial probability in the miRNA functional similarity network(MFSN), and then use a random walk algorithm until the probability get stable. However, this method cannot predict new disease without any known related miRNAs. Thus Shi et al. [13] use the random walk algorithm on miRNA target and disease genes at the same time to map the protein-protein interaction (PPI) network, and then they construct a bipartite miRNAdisease network by using p-values in the PPI network and identify co-regulated modules by hierarchical clustering analysis. Later Xuan et al. [14] develop the MIDP method by using the prior information of nodes. They first divide the diseases related to the miRNAs into labeled nodes and unlabeled nodes, and establish the transition matrices for the two categories of nodes. Then by using the random walk algorithm on the two weighted transition matrices, the final miRNA ranking could be obtained. Liu et al. [15] proposed a random walk method to predict the associations by combining the multiple data sources.
The second line is to formulate the problem as machine learning problems such as classification, matrix completion. For classification formulation, examples include the RLSMDA [16] and the MTDN [17] methods. RLSMDA [16] develops Regularized Least Squares algorithm by training two classifiers from the miRNA space and the disease space. However, how to choose the parameter of RLSMDA and how to combine the classifiers need to be studied furthermore. Xu et al. [17] introduce the MTDN approach based on miRNA target-dysregulated network to prioritize novel disease miRNAs. The method first constructs the network by combining computational target prediction with miRNA and mRNA expression profiles in tumor and non-tumor tissues, and then applies a support vector machine classifier to distinguish positive miRNAdisease associations from negative ones by extracting the feature of network topologic information. However, it is hard to obtain the negative miRNA-disease associations. Another option using machine learning is matrix completion such as MCLPMDA [18] , IMCMDA [19], CMFMDA [20] and PMAMCA [21]. MCLPMDA [18] constructs new miRNA and disease similarity matrices by matrix completion algorithm firstly, and then uses label propagation algorithm to predict miRNAs. Chen et al. [19] propose a method named IMCMDA based on nonnegative matrix factorization, whose main idea is to complete the missing miRNA-disease association based on the known associations and miRNA and disease similarity. CMFMDA [20] and PMAMCA [21] both factorize the association matrix into two matrices which representing the features for miRNAs and diseases, respectively.
In this study, we propose a novel computational method MTFMDA to predict new MDAs by matrix trifactorization, to follow the idea of matrix completion. The main idea of MTFMDA is that we factorize the complete MDA matrix to three matrices, a feature matrix P for miRNAs, a feature matrix Q for diseases, and a low-rank matrix D representing relationships between miRNA features and disease features. Laplacian regularizers are used for the feature matrices P and Q by using two types of miRNA similarities, and two types of disease similarities, respectively. Optimal matrices P, D and Q are learnt by using the known MDAs and the Laplacian regularizers, and then the MDA matrix is completed by PDQ T and thus new MDAs can be identified.
The contributions in this work are listed as follows: 1. We propose a new MDA prediction model by matrix tri-factorization model, which combines the two types of miRNA similarities, two types of disease similarities, and the known miRNA-disease associations, and predict new MDAs by completing the MDA matrix. We develop an algorithm for solving the optimization problem. 2. We evaluate our MTFMDA model by 5-fold cross-validation and obtain higher accuracies than other state-of-art methods. 3. We apply our method on two diseases to identify related miRNAs, and our prediction results could be supported by other databases. This further validates the effectiveness of our model MTFMDA.

MiRNA functional similarity and sequence similarity
The functional similarities among miRNAs can be calculated by the method proposed in [22], and we download the similarity data from http://www.cuilab.cn. Since miRNA's function is closely relevant to the miRNA sequence [23], we also obtain the miRNA sequence similarity from http://www.mirbase.org/ftp.shtml. The integrated similarities among miRNAs are defined as the average of the functional similarity and the sequence similarity, and the integrated similarity matrix for miRNAs is denoted as S m .

Two disease semantic similarities
To calculate disease semantic similarities, Wang [22] and Xuan [24] propose two methods based on the Medical Subject Headings (MeSH) descriptors which could be downloaded from the National Library of Medicine (http://www.nlm.nih.gov/). Wang's method [22] first calculates the semantic value and contribution value of a disease, and then uses these two values to compute the semantic similarity between two diseases. Unlike Wang's method, Xuan et al. [25] improves the calculation method of semantic value. It also uses semantic value and contribution value to calculate the semantic similarity. We use the integrated similarity in our work by averaging the two types of semantic similarities, and denote the integrated similarity matrix for diseases as S d .

Our proposed method via matrix tri-factorization Problem statement and notations
We are now given the integrated similarity matrix S m ∈ R n m ×n m among n m miRNAs m 1 , · · · , m n m , and the integrated similarities S d ∈ R n d ×n d among n d diseases d 1 , · · · , d n d . We are also given the miRNA-disease association (MDA) indicator matrix A ∈ R n m ×n d defined as follows 0, association between i-th miRNA m i and j-th disease d j is unknown.
We denote = i, j |A ij = 1 to be the indices for the miRNA-disease pairs which are known to be associated, and c = i, j |A ij = 0 to be all the pairs whose associations are unknown. For any matrix M, we denote R (M) by only keeping its part and forcing its c part to be zeros, that is, Our aim in this work is to complete the c part in matrix A, and recover the complete matrixÃ.

MTFMDA model
We propose our MTFMDA method by considering the following three aspects. First, the unknown complete miRNA-disease association (MDA) matrixÃ can be factorized into three matrices, a feature matrix for miRNAs P ∈ R n m ×r m , a feature matrix for diseases Q ∈ R n d ×r d , and the feature relationship matrix D ∈ R r m ×r d . The factoriza-tionÃ = PDQ T implies that the column vectors inÃ lie in the subspace spanned by the column vectors in P, and the row vectors inÃ lie in the subspace spanned by the column vectors in Q. D is generally required to be low rank, and P and Q are orthonormal matrices satisfying P T P = I and Q T Q = I. Second, the completeÃ should recover the known associations between miRNAs and diseases, i.e, the part of the difference matrix A −Ã should be zero or as small as possible. Third, the feature vectors in P and Q should preserve the similarity information hidden in the S m and S d , respectively, and thus two Laplacian regularizers should be used for preserving the geometric structure. By considering the above three aspects, we propose the following MTFMDA model where λ 1 , λ 2 and λ 3 are the regularization parameters to control the trade-offs. The first term is to recover the known MDAs in A. In the second term, L m = D m − S m is the Laplacian matrix for the miRNAs, where D m is a diagonal matrix with the i-th diagonal element being the sum of i-th row in S m . In the third term, L d is the Laplacian matrix for diseases, defined in the same way as L m . Once the optimal P, D and Q are solved in the optimization problem, the completed MDA matrixÃ can be obtained byÃ = PDQ T . The flowchart of our method is shown in Fig.1.

Optimization algorithm
In order to solve optimization problem above, we develop an alternate iteration algorithm to update P, D and Q alternately.
Step 1: Fix P and Q, solve D By fixing P and Q in the optimization problem (1), the subproblem to solve D can be obtained as follows: The sub-problem can be solved by an accelerated gradient descent algorithm [26] with the following iterations, Step 2: Fix D and Q, solve P. By fixing D and Q in optimization problem (1), we obtain the sub-problem of P as follows: Similarly to solving D in step 1, we could also use the accelerated gradient descent (APG) model to update P as follows: Wen's algorithm proposed in [28] is used to solve the second Eq. in (5).
Step 3: Fix D and P, solve Q. By fixing D and P in optimization problem (1), we obtain the sub-problem of Q as follows: It can be seen that the sub-problem (6) to solve for Q is the same with the sub-problem (4) to solve for P. Thus we skip the details.
Overall, the framework of our algorithm is shown as follows:

Results
In this section, we will first evaluate our method on the known associations collected from HMDD V2.0 database by 5-fold cross validation. Then we further evaluate our method by using the probability of recovering a true association in the top-t predictions for new diseases. We also estimate the contribution of the performance for the known association matrix, integrated miRNA and disease similarity.

Comparing methods
We compare our MTFMDA method with the following seven methods.
3. whileÃ, P, Q and D not converged 4. Fixed P and Q, update D by using Eq. (3); 5. Fixed D and Q, update P by using Eq. (5); 6. Fixed P and D, update Q by using the same way with updating P; 7.Ã = PDQ T ; 8. end • RLSMDA [16].The model is a semi-supervised learning method, and it develops a Regularized Least Squares algorithm by training two classifiers from the miRNA space and the disease space (we use parameter ω = 0.9). • RWRMDA [12]. The model is a random walk method which infers potential miRNA-disease interactions by implementing random walk on the miRNA-miRNA functional similarity network (we use parameters r = 0.2, threshold = 10 −6 ). • IMCMDA [19]. The method is a matrix completion model by nonnegative matrix factorization using the same datasets with our method (we use parameter r=100). • NCPMDA [29]. The method is a non-parametric universal network-based method that combines miRNA space and disease space. • KBMFMDA [30]. The model combines kernel-based nonlinear dimensionality reduction, matrix factorization and binary classification. The main idea of the method is to project miRNAs and diseases into a subspace and estimate the association network in the subspace. • CMFMDA [20]. The model factorizes the association matrix into two parts which represent miRNA and disease information, respectively. SVD factorization is used to initialize the two parts. • PMAMCA [21]. The method divides the association matrix into two latent matrices, and solve the matrix factorization by using the recommend system. Note that this method doesn't use miRNA and disease similarity matrices.

Evaluating our method by cross-validation
We first evaluate the performance of our MTFMDA method by the 5-fold cross validation framework. We use the data with 3693 associations between 368 miRNAs and 383 diseases collected from the HMDD V2.0 database. For the 5-fold cross validation, we divide 368 miRNAs into five folds. We take one fold as the test set, and take the rest as the training set. Each fold is taken as the test set once in turn. After obtaining the complete MDA matrixÃ, we rank the scores for all the test pairs of miRNA-diseases. If the rank of an miRNA-disease pair exceeds a given threshold, then the pair is considered to have an association. In our method, we set the parameters as λ 1 = λ 2 = 10 and λ 3 = 1. The dimension parameters r m and r d are set as the one sixth of n m and n d , respectively. We first plot Receiver Operation Characteristics (ROC) curves for all the methods to check the true positive rates and false positive rates. In the ROC curve, the xaxis is the true positive rate (TPR) and the y-axis is the false positive rate (FPR). The ROC curves for the all the methods are plotted in Fig. 2. We can see that our MTFMDA could obtain the best ROC curve. We then perform 50 runs of 5-fold cross validation, and calculate the AUC (  achieves the highest AUC value and performs better than other methods. We further analyze the differences of inference capability between our method and others. Note that for each method we obtain 50 AUC values for the 50 runs of 5-fold cross validation. Thus, the paired t-test can be used to check whether our method is significantly better than other methods. The p-values between our method and other five methods are reported in Table 1.
The results show that our method MTFMDA performs significantly better than other methods. We further plot the Precision-Recall curves in Fig. 3 for all the methods, and we can also see that our method performs better than all the other comparing methods.

Probability to recover true associated miRNAs for new diseases
We further evaluate our MTFMDA method by the probability of recovering a true association in the top-t predictions for a new disease. The probability can measure whether the method can predict potential related miRNA for a new disease. The measurement has been used in many other publications such as [19,31,32]. In detail, for each test disease, we first mask its known associated miRNAs as zero in matrix A, and then apply our model to obtain the ranks of the masked true associated miRNAs. Thus for all the 383 diseases, we could obtain the ranks of the true associated miRNAs among all the miRNAs. We could then plot the cumulative distribution function (CDF), where x-axis represents the top-t predicted miR-NAs, and y-axis represents the probability of recovering a true association in the top t predictions. Other methods could also plot the curves, except the RWRMDA, which cannot predict the new diseases. The CDFs for the five methods are shown in Fig. 4. From the figure, we can see that though NCPMDA performs better than ours for the top 13 miRNAs, the probability of recovering a true association in the top t predictions does not change much when t from 13 to 40. When t is from 13 to 40, our method could recover true associated miRNAs with highest probabilities.

Contributions of different data sources
We use three data sets in our model, miRNA similarity S m , disease similarity S d and known miRNA-disease association matrix A. To examine their contributions to the performance of our model, we first change each of the three matrices to a random matrix, and then apply our MTFMDA model to check the AUC values based on 5fold cross validation. If one data type contributes the most, then the corresponding AUC should decrease a lot when changing to the data to be random. We change S m , S d and A to be random in turn, and the resulting AUCs are reported in Table 2. As shown in Table 2, the average AUC value based on the random miRNA similarity matrix is much lower than the other two types in our model, and thus the miRNA similarity contributes the most to the performance of our model. We also see that the disease similarity contributes the least in our MTFMDA model.

Discussion
We apply two diseases including colon and breast neoplasms to verify the effectiveness of the MTFMDA for miRNA-disease association prediction. For each disease, we apply our MTFMDA model to predict the top-t associated miRNAs, and then examine these predictions by two datasets: dbDEMC [33] and HMDD V3.0 [34].  Table 3 shows the predicted top 50 miRNAs for the colon neoplasms using our MTFMDA model. Colon neoplasms is an out-of-control cell growth, and it is the second leading cause of death in cancer and the most common tumor in the gastrointestinal tract [35,36]. Many factors will cause the neoplasms, such as old age, unhealthy lifestyle and heredity [37]. Recently, more and more evidence proved that some miRNAs are related to the colon neoplasms. For example, Zhang et al. [38] showed that mir-21, mir-17 and mir-19a promote the metastasis and spread of colon neoplasms. Coincidentally, the expression levels of miR-106a of normal human are higher than colon cancer patients [39]. Shi et al. [40] found that mir-145 could down-regulate the IRS-1 protein in the colon neoplasms cells and inhibit cells growth through targeting the IRS-1 3'-untranslated region. From our prediction results shown in Table 3, we can see that all the predicted top 10 miRNAs by our method can be confirmed by both the dbDEMC and HMDD V3.0 databases, and 46 among the top 50 miRNAs can be confirmed by dbDEMC and HMDD V3.0 databases. This validates the effectiveness of our MTFMDA method.
For the breast neoplasms, we evaluate our method in another way. We mask all the known associated miRNAs with breast neoplasms and apply our MTFMDA method to obtain the predicted top 50 associated miRNAs for the breast neoplasms, shown in Table 4. From this table, we can see that, all the top 10 miRNAs are confirmed by the two databases, and 49 of the top 50 miRNAs can be confirmed by the two databases. Through the Table 4, we found that hsa-mir-155 ranks the first, and it has been found that this miRNA could affect many cancers in recent studies, such as breast neoplasms, colon neoplasms and esophageal neoplasms [41][42][43].
Overall, the case studies on colon and breast neoplasms further validate the effectiveness of our MTFMDA method for predicting miRNA-disease associations.

Conclusion
Identifying potential miRNA-disease associations could help understand the pathogenesis of the disease from a genetic perspective. In this work, we propose a computational method MTFMDA to predict new MDAs by using an idea of matrix tri-factorization. Different from other matrix completion methods, we factorize the complete MDA matrix to three matrices including a feature matrix for miRNAs, a feature matrix for diseases and a low-rank matrix representing the relationships between miRNA features and disease features. Experiments show that our method performs better for predicting miRNAs associated with new diseases. As we have shown, based on the 5-fold cross validation, the comparisons on the ROC curves, AUCs and Precision-Recall curves show that our MTFMDA performs better than the other methods. Furthermore, the experiments to predict associated miR-NAs for colon and breast neoplasms also demonstrate the effectiveness of our method. However, this research only takes the average of two types of similarities of miRNAs and diseases, but not consider how to combine the two similarities optimally. This could be our future topic to work on.