Graph embedding ensemble methods based on the heterogeneous network for lncRNA-miRNA interaction prediction
BMC Genomics volume 21, Article number: 867 (2020)
Abstract
Background
Researchers have found that lncRNA-miRNA regulatory paradigms modulate gene expression patterns and drive major cellular processes. Identifying lncRNA-miRNA interactions (LMIs) is critical for revealing the mechanisms of biological processes and complicated diseases. Because conventional wet experiments are time-consuming, labor-intensive, and costly, a few computational methods have been proposed to expedite the identification of lncRNA-miRNA interactions. However, little attention has been paid to fully exploiting the structural and topological information of the lncRNA-miRNA interaction network.
Results
In this paper, we propose novel lncRNA-miRNA prediction methods based on graph embedding and ensemble learning. First, we calculate lncRNA-lncRNA sequence similarity and miRNA-miRNA sequence similarity, and combine them with the known lncRNA-miRNA interactions to construct a heterogeneous network. Second, we adopt several graph embedding methods to learn embedded representations of lncRNAs and miRNAs from the heterogeneous network, and construct ensemble models using two ensemble strategies. For the first strategy, we treat individual graph embedding based models as base predictors and integrate their predictions, yielding a method named GEEL-PI. For the second, we construct a deep attention neural network (DANN) to integrate various graph embeddings, yielding an ensemble method named GEEL-FI. The experimental results demonstrate that both GEEL-PI and GEEL-FI outperform other state-of-the-art methods. The effectiveness of the two ensemble strategies is validated by further experiments. Moreover, the case studies show that GEEL-PI and GEEL-FI can find novel lncRNA-miRNA associations.
Conclusion
The study reveals that methods based on graph embedding and ensemble learning are effective for integrating heterogeneous information derived from the lncRNA-miRNA interaction network and can achieve better performance on the LMI prediction task. In conclusion, GEEL-PI and GEEL-FI are promising for lncRNA-miRNA interaction prediction.
Background
Non-coding RNAs (ncRNAs), including long non-coding RNA (lncRNA), miRNA, and snRNA, are a category of RNAs that are not translated into functional proteins. A surge of studies has revealed that ncRNAs have regulatory functions in biological processes [1,2,3,4]. LncRNAs are a class of ncRNAs with more than 200 nucleotides (nt), playing important roles in gene imprinting, immune response, and chromatin remodeling [1, 2]. MiRNAs are a category of single-stranded, endogenous, evolutionarily conserved ncRNAs of 20-25 nt, which are involved in diverse biological processes, such as the regulation of metabolism, cell differentiation, gene expression, embryonic development, and apoptosis [3,4,5]. LncRNA-miRNA regulatory paradigms modulate gene expression patterns that drive major cellular processes (e.g., cell proliferation, cell differentiation, and cell death), which are central to mammalian physiologic and pathologic processes [6]. Furthermore, it has been found that both lncRNAs and miRNAs relate closely to severe diseases [7, 8]. Therefore, characterizing the various functions of lncRNAs and miRNAs is key to revealing the mechanisms of the associated biological processes and diseases.
LncRNAs and miRNAs produce complicated effects through their interactions with other biological molecules such as DNAs, RNAs, and proteins; thus, research on lncRNA-biomolecule interactions helps to portray the functions of lncRNAs and miRNAs [9,10,11]. Recently, some studies have demonstrated that lncRNAs can act as decoys or sponges to regulate miRNAs’ behavior [12], indicating that identifying lncRNA-miRNA interactions (LMIs) helps to understand the functions of lncRNAs and miRNAs.
In earlier research, unknown LMIs were identified through wet experiments. However, because wet methods are laborious, costly, and time-consuming, it is now more common to refine the candidate list through in silico prediction before further validation experiments, in order to accelerate the identification of LMIs.
Recently, plenty of computational approaches have been proposed to predict LMIs. Huang et al. [13] propose EPLMI, a two-way diffusion model for lncRNA-miRNA interaction prediction, which treats the known LMIs as a bipartite network. Huang et al. [14] develop GBCF, which builds a Bayesian collaborative filtering model using sequence, expression profiles, and target genes. Hu et al. [15] introduce INLMI, a model based on the sequence similarity network and the expression similarity network. Zhang et al. [16] propose SLNPM, which constructs an integrated similarity-based graph from LMIs and genomic sequences and runs a label propagation process on the graph for LMI prediction. These pioneering methods perform well, but some limitations remain. On the one hand, some of the existing methods (e.g., EPLMI, GBCF, and INLMI) rely heavily on biological features of lncRNAs and miRNAs, such as target gene information or expression profiles, which are not available for all lncRNAs (miRNAs). On the other hand, the structure of the LMI network is not fully exploited in previous methods, even though effectively utilizing the structural and topological information of the LMI network is crucial for link inference.
Graph embedding learning (a.k.a. network representation learning), which preserves the structural properties of a graph while mapping its nodes into a low-dimensional space, has attracted widespread attention recently. Several graph embedding methods have been exploited to reveal unknown associations between biomedical entities [17,18,19]. Motivated by this previous work in bioinformatics, we use graph embedding methods to capture information from the LMI network.
Ensemble learning is one of the research hotspots in machine learning and pattern recognition. To date, ensemble learning methods have been increasingly used in computational biology because of their advantages in handling small samples, complex data structures, and high dimensionality [20]. Ensemble learning aggregates multiple machine learning models to achieve high overall prediction accuracy and good generalization [21], and it usually performs better than individual methods. Inspired by pioneering works [22,23,24,25,26,27], we adopt ensemble strategies to integrate individual predictions and embeddings to enhance the performance of LMI prediction.
In this paper, we propose novel LMI prediction methods based on graph embedding and ensemble strategies. First, we calculate similarity based on lncRNA sequences and miRNA sequences and construct a heterogeneous network by combining them with the known LMIs. Second, we utilize five graph embedding methods (i.e., Laplacian Eigenmaps [28], HOPE [29], GraRep [30], DeepWalk [31], and GAE [32]) to capture structural information from the heterogeneous network and learn the representations of lncRNAs and miRNAs. We then represent each lncRNA-miRNA pair by merging the lncRNA’s representation with the miRNA’s representation, and build ensemble models based on the pair features. As an extension of our previous work [33], we consider two ensemble strategies. For the first, we treat all the individual graph embedding based models as base predictors and integrate their predictions to develop a prediction method named GEEL-PI. For the second, we construct a deep attention neural network (DANN) to learn lncRNA-miRNA pair representations by combining various graph embeddings, and develop a method named GEEL-FI. The experimental results demonstrate that the proposed methods GEEL-PI and GEEL-FI can predict lncRNA-miRNA interactions with higher accuracy than other state-of-the-art methods. Moreover, the effectiveness of the prediction integration and the attention network is supported by extensive experiments. Furthermore, we conduct case studies to validate predicted LMIs that do not exist in our dataset. In conclusion, both GEEL-PI and GEEL-FI are useful for predicting LMIs. Our contributions can be summarized as:

(1)
We consider a variety of graph embedding methods to learn the embedded representations from the lncRNA-miRNA heterogeneous network.

(2)
We introduce a deep attention neural network to learn high-level, sophisticated representations by focusing on different aspects of embedded representations.

(3)
We consider two different ensemble strategies in this work. Then we design comprehensive experiments to compare them and analyze their effectiveness.
Results and discussion
Evaluation metrics
In this paper, we implement 5-fold cross-validation (5-CV) to evaluate our models. The following metrics are adopted in our experiments: the area under the precision-recall curve (AUPR), the area under the receiver-operating characteristic curve (AUC), F-measure (F1), accuracy (ACC), recall (REC), specificity (SPEC), and precision (PRE).
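The threshold-based metrics listed above follow directly from the confusion matrix. The sketch below is a minimal illustration rather than the evaluation code used in this study: the function name `threshold_metrics`, the 0.5 cut-off, and the toy scores are all assumptions. AUC and AUPR are ranking metrics and would normally be computed with a library such as scikit-learn.

```python
def threshold_metrics(y_true, y_score, threshold=0.5):
    """Compute ACC, REC, SPEC, PRE and F1 from binary labels and scores.

    The 0.5 threshold is an illustrative assumption.
    """
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    def safe_div(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    pre = safe_div(tp, tp + fp)
    rec = safe_div(tp, tp + fn)
    return {
        "ACC": safe_div(tp + tn, tp + fp + tn + fn),
        "REC": rec,
        "SPEC": safe_div(tn, tn + fp),
        "PRE": pre,
        "F1": safe_div(2 * pre * rec, pre + rec),
    }


# Toy example: 3 interacting pairs followed by 3 non-interacting pairs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
metrics = threshold_metrics(labels, scores)
```

In this toy case two of the three positives and two of the three negatives are classified correctly, so all five metrics equal 2/3.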
Parameter settings
In this study, both GEEL-PI and GEEL-FI have two major components: graph embedding and ensemble learning. Here, we introduce the parameter settings.
Parameter settings for graph embedding methods
In this study, both GEEL-PI and GEEL-FI adopt five graph embedding methods (LE, GraRep, HOPE, DeepWalk, and GAE) to learn representations of lncRNAs and miRNAs. The graph embedding methods are implemented with BioNEV [19].
Here, we discuss the parameter settings of the five graph embedding methods. First, we fix the representation dimension θ of all the graph embedding methods to 120 and then consider the method-specific parameters. For GraRep, we consider the k-th transition probability matrix with k-step ∈ {1, 2, 3, 4}. For DeepWalk, we fix the walk length t to 80, and consider combinations of window size w ∈ {10, 20, 30, 40} and walks per vertex γ ∈ {10, 20, 30, 40}. For GAE, we consider the autoencoder and the variational autoencoder respectively, and select the size of hidden layers β ∈ {32, 64, 128, 256, 512, 1024}. For each of the aforementioned graph embedding methods, we adopt the parameters that achieve the highest AUPR score.
Parameter settings for ensemble methods
In this paper, we propose two ensemble strategies: prediction combination for GEEL-PI and an attention neural network for GEEL-FI. The detailed parameter settings are described below.
For GEEL-PI, Random Forest and Logistic Regression are implemented with “scikit-learn” [34] using the default hyperparameters. For the logistic regression, we additionally adopt L2 regularization with default parameters.
For GEEL-FI, we tune the following parameters: (1) the number of hidden layers μ and the size of hidden layers β in DANN; (2) the embedded representation vectors ε involved in the feature fusion; (3) the dimension of lncRNA-miRNA pair features θ; and (4) the number of estimators η in the Random Forest classifier.
In the attention layer of DANN, we design two groups of attention weights for individual lncRNA-miRNA pair features. For the fully-connected layers, we consider different combinations of the parameters: number of hidden layers μ ∈ {1, 2, 3, 4} and size of hidden layers β ∈ {480, 240, 120, 60, 30}. We then use grid search to optimize these parameters according to their performance on 5-CV. Finally, we design a two-hidden-layer neural network, and the sizes of the two layers are 120 and 60 respectively.
As for the embedded representation vectors ε, we consider combinations of embedded representation vectors for the merged lncRNA-miRNA pair features. For individual graph embedding methods, we run 5-CV 20 times. In light of the AUC and AUPR scores, we order the five graph embedding methods as GraRep, LE, GAE, HOPE, DeepWalk, and then select the top K features as the candidates for lncRNA-miRNA pair features. We visualize the trend of AUC scores over the combination of top K features in Fig. 1 (a). The fused feature based on the top 2 graph embedding methods (i.e., GraRep and LE) achieves the best performance. Hence, we adopt ε = {GraRep, LE}.
We consider the dimension of lncRNA-miRNA pair features θ ∈ {80, 120, 160, 240, 280, 320}, taking both the AUPR and AUC scores into consideration. As presented in Fig. 1 (b), fused features of 160 dimensions have a higher AUPR score and those of 240 dimensions have a higher AUC score. In the subsequent experiment, pair features of 160 dimensions achieve better performance; thus we set θ = 160.
Finally, we vary the number of estimators η in Random Forest from 80 to 2000. As shown in Fig. 1 (c), once the number of estimators reaches 2000, the AUPR score shows little further improvement. Considering computational efficiency and time costs, we set η = 2000.
Based on the analysis above, we adopt μ = 2, β = {240, 120}, ε = {GraRep, LE}, θ = 160 and η = 2000 for GEEL-FI. All the parameters used in the graph embedding ensemble methods are summarized in Table 1.
Comparison with state-of-the-art methods
Here, we compare our models with several state-of-the-art methods, including EPLMI [13], INLMI [15], and SLNPM [35]. EPLMI infers link probability according to the similarity between lncRNA and miRNA expression profiles. Specifically, EPLMI constructs a bipartite network using known lncRNA-miRNA interactions and exploits lncRNA (miRNA) expression profile information via the network for LMI prediction. INLMI integrates the sequence similarity and the expression similarity, and adopts a two-way diffusion algorithm to infer LMIs. SLNPM predicts LMIs by running a label propagation algorithm on the similarity graphs of the two kinds of biomedical entities respectively. EPLMI and SLNPM are implemented according to the descriptions in their publications, and we evaluate the above models on our dataset using 5-fold cross-validation.
As shown in Table 2, GEEL-FI achieves the best AUPR score (0.7011) and the best AUC score (0.9578), and GEEL-PI achieves the second-best AUPR score (0.7004) and AUC score (0.9537); both significantly outperform the other state-of-the-art methods. The substantial improvement of our models could be attributed to two factors: (1) GEEL-PI and GEEL-FI make the best of the structural properties implied in the lncRNA-miRNA heterogeneous network by employing graph embedding; (2) GEEL-PI and GEEL-FI adopt ensemble strategies (i.e., prediction integration and feature integration) to integrate multi-view information.
In computational experiments, the top-ranked predictions are critical for reflecting the performance of models. Here, we calculate the recall and precision of the aforementioned models on top-ranked predictions ranging from the top 100 to the top 1000. As presented in Fig. 2 (a), both GEEL-PI and GEEL-FI achieve the best recall scores over all thresholds. For instance, among the top 500 predictions, GEEL-PI and GEEL-FI achieve recall scores of 0.5719 and 0.5706, whereas the recall scores for SLNPM, EPLMI, and INLMI are 0.5283, 0.0921, and 0.0884 respectively. Similarly, both GEEL-PI and GEEL-FI achieve better precision scores than the other benchmark methods, as given in Fig. 2 (b). For example, 86% of the top 500 predictions of both GEEL-PI and GEEL-FI are real interactions, whereas the figures for SLNPM, EPLMI, and INLMI are only 80%, 10%, and 10%. Therefore, both GEEL-PI and GEEL-FI are preferable for LMI prediction compared with other state-of-the-art methods.
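The top-ranked evaluation described above can be sketched as follows. This is a minimal illustration under assumed inputs: the helper name `precision_recall_at_k` and the toy labels and scores are hypothetical and are not taken from the experiments.

```python
def precision_recall_at_k(y_true, y_score, k):
    """Precision and recall among the k highest-scoring candidate pairs."""
    # Rank candidate pairs by predicted score, highest first.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    total_positives = sum(y_true)
    precision = hits / k
    recall = hits / total_positives if total_positives else 0.0
    return precision, recall


# Toy example: 8 candidate pairs, 4 of which are real interactions.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.85, 0.40, 0.35, 0.30, 0.20, 0.10]
p_at_3, r_at_3 = precision_recall_at_k(labels, scores, k=3)
```

Here two of the top three predictions are real interactions, giving a precision of 2/3 and a recall of 0.5 at k = 3.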
Effect of ensemble learning
In this paper, we adopt two ensemble strategies to integrate heterogeneous information and develop our methods: GEEL-PI and GEEL-FI. In the following, we evaluate the performance of the base predictors and our methods over 20 runs of 5-CV and discuss how the ensemble strategies improve performance.
As demonstrated in Table 3, the graph embedding based models generally produce satisfactory performance, achieving AUPR scores > 0.65 and AUC scores > 0.92. In terms of the standard deviations over 20 runs of experiments, all these prediction models lead to stable results. The experimental results indicate that graph embedding methods can efficiently capture inherent properties of the lncRNA-miRNA heterogeneous network for LMI inference.
Further, we integrate the above five graph embedding based methods by ensemble strategies to enhance the accuracy of the model. GEEL-PI integrates the prediction scores from the five graph embedding-based predictors, achieving an AUPR score of 0.7004 and an AUC score of 0.9537. GEEL-FI attentively integrates lncRNA and miRNA representations to obtain distinctive lncRNA-miRNA pair features, achieving an AUPR score of 0.7011 and an AUC score of 0.9578. Both GEEL-PI and GEEL-FI achieve superior performance compared with the base predictors, which indicates that our ensemble strategies can contribute to higher accuracy for LMI prediction.
To evaluate the generalization ability of our ensemble models, we design an experiment that varies the sparsity of the heterogeneous network by removing a certain proportion of links. In the experiments, we randomly delete 10%, 20%, 30%, and 40% of the LMIs in the heterogeneous network. Then, we build the base predictors and the ensemble models on the networks with fewer interactions. Table 4 reports the AUPR scores of the different prediction methods. As we can observe, the ensemble models GEEL-PI and GEEL-FI produce higher AUPR scores than all the base predictors as the ratio of removed links ranges from 10% to 40%. More importantly, when the network becomes sparser, the performance of the ensemble models is less affected than that of the individual predictors. For instance, when the proportion of removed interactions increases from 10% to 20%, the AUPR scores of LE, GraRep, HOPE, DeepWalk, GAE, GEEL-PI, and GEEL-FI decrease by 2.7%, 2.1%, 2.1%, 2.3%, 4.3%, 1.7%, and 1.7% respectively, which supports the generalization ability and robustness of our ensemble models.
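The link-removal step of this sparsity experiment can be sketched as below. The function name `remove_interactions`, the fixed seeds, and the random toy matrix are assumptions made for illustration; the study's own sampling procedure is not described in more detail than above.

```python
import numpy as np


def remove_interactions(A, fraction, seed=0):
    """Return a copy of the interaction matrix with a fraction of its 1-entries set to 0."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(A)                  # positions of known interactions
    n_remove = int(round(fraction * len(rows)))
    drop = rng.choice(len(rows), size=n_remove, replace=False)
    A_sparse = A.copy()
    A_sparse[rows[drop], cols[drop]] = 0
    return A_sparse


# Toy interaction matrix: 20 hypothetical lncRNAs x 10 hypothetical miRNAs.
rng = np.random.default_rng(42)
A = (rng.random((20, 10)) < 0.3).astype(int)

# One sparser network per removal ratio used in the experiment.
sparser = {f: remove_interactions(A, f) for f in (0.1, 0.2, 0.3, 0.4)}
```

Each sparser matrix would then replace the interaction matrix when the embeddings and predictors are rebuilt.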
In conclusion, integrating individual graph embedding based models with ensemble learning can effectively improve accuracy, generalization ability, and robustness in LMI prediction.
Effect of attention network
In the design of GEEL-FI, we use a deep attention neural network to integrate graph embeddings as the ensemble strategy. DANN learns lncRNA-miRNA pair features by capturing the different aspects of representation vectors. To validate the effectiveness of the attention mechanism, we evaluate the performance of GEEL-FI and a purpose-built comparison method on LMI prediction.
To validate the effect of the attention network on feature fusion, we design a comparison variant, GEEL-F, which merges diverse embedded lncRNA and miRNA representations directly, without considering the different importance of embedded representations. For the i-th lncRNA and j-th miRNA, the merged representation of the lncRNA is defined as \( {L}_i=\sum \limits_{k\in S}{l}_i^k \) and the merged representation of the miRNA is defined as \( {M}_j=\sum \limits_{k\in S}{m}_j^k \), where S is the set of lncRNA and miRNA representations learned by graph embedding methods. The lncRNA-miRNA pair feature is computed as F_{ij} = [L_{i}; M_{j}]. We construct GEEL-FI and GEEL-F based on the learned graph embeddings. To validate the effectiveness of our attention mechanism at a larger scale, we choose K embeddings for the fused feature. Here we respectively adopt S = {GraRep}, {GraRep, GAE}, {GraRep, HOPE, DeepWalk}, {GraRep, HOPE, DeepWalk, LE} and {LE, GraRep, HOPE, DeepWalk, GAE} for K = 1, 2, 3, 4, 5 as our benchmarks to compare the performance of GEEL-F and GEEL-FI for LMI prediction. As shown in Fig. 3, for K = 1, 2, 3, 4, 5, GEEL-FI achieves AUPR scores of 0.6810, 0.6838, 0.6539, 0.6538 and 0.6670, outperforming the GEEL-F scores of 0.6805, 0.6725, 0.6493, 0.6487 and 0.6541 respectively. The experimental result demonstrates that the attention mechanism can contribute to better performance for LMI prediction. Therefore, we can conclude that our deep attention neural network can effectively merge multiple embedded lncRNA and miRNA representations and learn better lncRNA-miRNA pair features for LMI prediction.
To further probe how the attention network captures different aspects of embedded representations, we fix K at 5 and run 5-CV 20 times. Then we visualize the attention weights of lncRNA representations and miRNA representations learned by the attention neural network. In Fig. 4, we can observe that (1) for lncRNAs, DANN generally pays much attention to the GAE-based embeddings, and for miRNAs, it assigns higher attention weights to GraRep-based embeddings, which indicates that graph embeddings based on neural network and matrix factorization methods are effective in LMI prediction; (2) furthermore, attention weights vary with lncRNA sequences and miRNA sequences in each fold, which suggests that DANN can adaptively adjust its attention to learn distinctive lncRNA-miRNA pair features according to specific lncRNA and miRNA data.
Consequently, our deep attention neural network can learn high-level, sophisticated representations of lncRNA-miRNA pairs and enhance the performance of GEEL-FI on LMI prediction.
Case studies
The primary goal of computational methods is to refine the candidate list and guide further validation experiments. Here, we conduct case studies to demonstrate the practical capability of the proposed methods for unknown LMI inference. First, we train the model on our dataset. Then, we employ our model to score unlabeled lncRNA-miRNA pairs. Finally, we validate the prediction results against a comprehensive dataset, starBase [36]. We list the top 10 LMIs in Table 5. As we can observe, both GEEL-PI and GEEL-FI correctly infer 8 LMIs among their top 10 predictions. For instance, our proposed models accurately predict that lncRNA lnc-ACER2-1:1 interacts with miRNA hsa-miR-106a-5p. ACER2 is one of the human alkaline ceramidases, and can produce lncRNA lnc-ACER2-1. MiRNA hsa-miR-106a-5p participates in various biological processes and is involved in severe diseases (e.g., gastric carcinoma and glioblastoma) [37, 38]. Some researchers have discovered that the expression of hsa-miR-106a-5p is downregulated in breast tissues, and that ACER2 could serve as a target gene of hsa-miR-106a-5p [39]. In contrast, the interaction between lnc-COL6A3-5:1 and hsa-miR-4500 remains to be confirmed. In general, both GEEL-PI and GEEL-FI are effective tools for indicating novel interactions between lncRNAs and miRNAs.
Conclusions
LncRNAs and miRNAs are critical to cellular processes, and inferring their interactions contributes to revealing the mechanisms of complicated diseases. In this paper, we propose novel graph embedding ensemble learning methods: GEEL-PI and GEEL-FI. Comparison with other state-of-the-art methods demonstrates that both GEEL-PI and GEEL-FI achieve higher accuracy for LMI prediction. The adoption of graph embedding methods overcomes the limitations of traditional features and lets our models efficiently capture the inherent structural properties of the LMI heterogeneous network. Further experiments indicate that ensemble learning and the attention mechanism are powerful means of enhancing the accuracy, generalization ability, and robustness of LMI prediction models. Moreover, case studies are performed to demonstrate the practical capability of our methods. In conclusion, both GEEL-PI and GEEL-FI are promising for LMI prediction.
Datasets and methods
Datasets
We collect 8091 experimentally verified lncRNA-miRNA interactions from the lncRNASNP dataset [40]. After removing duplicated interactions, we obtain 5118 interactions between 780 lncRNAs and 275 miRNAs. We then download lncRNA sequences from the NONCODE dataset [41] and miRNA sequences from the miRBase dataset [42] separately. Ultimately, we compile our dataset with 3784 interactions between 642 lncRNAs and 275 miRNAs.
Heterogeneous network
To model the complicated relationship between biomedical entities, we design a lncRNA-miRNA heterogeneous network by integrating the known LMIs with the sequence similarity, as shown in Fig. 5 (a).
Given r lncRNAs and t miRNAs, the interaction matrix can be denoted by A ∈ ℝ^{r × t}, where A(i, j) = 1 if the i-th lncRNA and j-th miRNA interact, and A(i, j) = 0 otherwise. Our previous work [35] indicates that the pairwise similarity between biomedical entities (i.e., lncRNA and miRNA sequence similarity) can help to infer interactions. Therefore, as in our previous work, we extract the 5-spectrum feature [43] from each lncRNA (miRNA) sequence and then calculate similarity by the linear neighborhood similarity measure (LNS) [35]. In this way, we acquire the lncRNA similarity matrix S_{l} ∈ ℝ^{r × r} and the miRNA similarity matrix S_{m} ∈ ℝ^{t × t}, where S(i, j) is the similarity score between the i-th and j-th lncRNAs (miRNAs). Further, for each biomedical entity, we consider its top 10 most similar entities as its immediate neighbors, and obtain the adjacency matrices W_{l} ∈ ℝ^{r × r} and W_{m} ∈ ℝ^{t × t} from S_{l} and S_{m} separately. Ultimately, we regard biomedical entities (i.e., lncRNAs and miRNAs) as nodes and their relationships (i.e., LMIs, lncRNA-lncRNA similarity and miRNA-miRNA similarity) as edges to construct the heterogeneous network H:
\[ H=\left[\begin{array}{cc}{W}_l& A\\ {A}^T& {W}_m\end{array}\right] \]
where A^{T} denotes the transpose of the matrix A.
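The construction of the heterogeneous network can be sketched in a few lines of NumPy. The sketch assumes that H is arranged as a block matrix with W_l and W_m on the diagonal and A and its transpose off the diagonal, and that the top-k neighborhoods are stored as binary, possibly asymmetric adjacency matrices; neither detail is confirmed here. The function names and the random toy similarities are hypothetical, and the LNS similarity computation itself is not reproduced.

```python
import numpy as np


def top_k_adjacency(S, k=10):
    """Keep, for every entity, edges to its k most similar other entities."""
    n = S.shape[0]
    S = S.astype(float).copy()
    np.fill_diagonal(S, -np.inf)            # never pick an entity as its own neighbor
    k = min(k, n - 1)
    neighbors = np.argsort(-S, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), neighbors.ravel()] = 1.0
    return W


def build_heterogeneous_network(A, S_l, S_m, k=10):
    """Assemble H from the interaction matrix and the two similarity matrices."""
    W_l = top_k_adjacency(S_l, k)
    W_m = top_k_adjacency(S_m, k)
    return np.block([[W_l, A], [A.T, W_m]])


# Toy example: r = 6 lncRNAs, t = 4 miRNAs, 2 neighbors per entity.
rng = np.random.default_rng(0)
r, t = 6, 4
A = (rng.random((r, t)) < 0.4).astype(float)
S_l = rng.random((r, r))
S_l = (S_l + S_l.T) / 2
S_m = rng.random((t, t))
S_m = (S_m + S_m.T) / 2
H = build_heterogeneous_network(A, S_l, S_m, k=2)
```

The resulting H has r + t rows and columns, so every lncRNA and miRNA becomes one node of a single graph that the embedding methods can consume.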
Graph embedding methods
To fully exploit the topological properties of the heterogeneous network, we choose graph embedding methods from three categories [19] (i.e., matrix factorization, random walk, and neural network).
From the matrix factorization-based category, we adopt Laplacian Eigenmaps (LE) [28], GraRep [30] and HOPE [29]. LE computes a low-dimensional representation of the dataset, optimally preserving local neighborhood information by using the Laplacian of the graph [28]. GraRep integrates global structural information of the graph into the learning process and learns high-order proximity [30]. HOPE can preserve high-order proximities of large-scale graphs and is capable of capturing the asymmetric transitivity [29].
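To make the idea behind Laplacian Eigenmaps concrete, the sketch below embeds a small graph using the eigenvectors of its normalized Laplacian. It is a textbook-style illustration written with NumPy, not the BioNEV implementation used in this study; the function name, the symmetrization step, and the toy graph are assumptions.

```python
import numpy as np


def laplacian_eigenmaps(W, dim):
    """Embed the nodes of an undirected graph with adjacency matrix W."""
    W = np.maximum(W, W.T)                              # symmetrize (assumption)
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    # Skip the trivial eigenvector (eigenvalue close to 0) and keep the next `dim`.
    U = eigenvectors[:, 1:dim + 1]
    # Rescaling gives solutions of the generalized problem L y = lambda D y.
    return d_inv_sqrt[:, None] * U


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
W = np.zeros((6, 6))
for u, v in edges:
    W[u, v] = W[v, u] = 1.0
embedding = laplacian_eigenmaps(W, dim=2)
```

On this toy graph the first embedding coordinate takes one sign on the first triangle and the opposite sign on the second, which is the local-neighborhood preservation the method aims for.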
From the random walk-based category, we select DeepWalk [31]. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences [31].
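The walk-generation stage of DeepWalk can be sketched with the standard library alone. The helper name `truncated_random_walks`, the node labels, and the toy graph are hypothetical; the subsequent step, in which the walks are treated as sentences and passed to a skip-gram model, is not shown.

```python
import random


def truncated_random_walks(adjacency, walk_length, walks_per_vertex, seed=0):
    """Generate fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    walks = []
    for _ in range(walks_per_vertex):
        rng.shuffle(nodes)                       # visit start nodes in random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:                # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy heterogeneous graph with two lncRNA nodes and two miRNA nodes.
graph = {
    "lnc1": ["miR1", "lnc2"],
    "lnc2": ["lnc1"],
    "miR1": ["lnc1", "miR2"],
    "miR2": ["miR1"],
}
walks = truncated_random_walks(graph, walk_length=5, walks_per_vertex=3)
```

In the parameter settings of this study the corresponding values are a walk length of 80 and between 10 and 40 walks per vertex.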
We consider Graph Auto-Encoder (GAE) [32] as a representative of the neural network-based methods. GAE obtains low-dimensional node representations by reconstructing the heterogeneous network with consideration of the first-order and second-order proximities.
By employing the aforementioned graph embedding methods, the topological and inherent properties of the heterogeneous network are captured, and the learned distinctive representations are then used in the downstream task, as shown in Fig. 5 (a).
Graph embedding ensemble learning based on prediction integration
In this section, we introduce a graph embedding ensemble learning method based on prediction integration (GEEL-PI). We build base predictors based on individual graph embedding methods, and further combine their predictions with an ensemble strategy to infer LMIs.
To build a base predictor, we first acquire the low-dimensional representations of miRNAs and lncRNAs using the corresponding graph embedding method. Then we represent each lncRNA-miRNA pair as the concatenation of the two kinds of embeddings and build a Random Forest predictor on these pair features. We adopt Random Forest because of its high efficiency.
Following the steps outlined above, we can construct five base predictors based on the corresponding graph embedding methods. The five graph embedding methods are heterogeneous and capture inherent structural properties from different aspects; thus they may demonstrate different generalization abilities on datasets. Therefore, it is natural to integrate the predictors by using ensemble strategies. Theoretically, ensemble learning builds a model ϕ : (f_{1}(x), f_{2}(x), …, f_{n}(x)) → {0, 1}, which maps the outcomes of n base predictors to a label. Specifically, we use logistic regression as the mapping function ϕ, which is simple but can model the nonlinear relationship between base predictors and labels. In this way, we construct GEEL-PI for LMI prediction as described in Fig. 5 (b).
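The prediction-integration idea can be sketched with scikit-learn as a small stacking pipeline: one Random Forest per embedding "view", followed by a logistic regression over the base scores. The synthetic pair features, the three views, the 50 trees, and the use of out-of-fold scores to train the meta-learner are all assumptions made for illustration; how the base predictions are produced for training ϕ is not specified above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_pairs, n_views, dim = 300, 3, 8
y = rng.integers(0, 2, size=n_pairs)
# Hypothetical pair features: one matrix per graph embedding method.
views = [rng.normal(size=(n_pairs, dim)) + 0.8 * y[:, None] for _ in range(n_views)]

base_models, meta_features = [], []
for X in views:
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    # Out-of-fold probabilities keep training labels from leaking into the meta-learner.
    oof = cross_val_predict(rf, X, y, cv=5, method="predict_proba")[:, 1]
    meta_features.append(oof)
    base_models.append(rf.fit(X, y))

# Logistic regression (L2-regularized by default) plays the role of the mapping phi.
Z = np.column_stack(meta_features)
meta = LogisticRegression().fit(Z, y)


def predict_interaction(pair_views):
    """Score pairs by stacking the base predictors' probabilities."""
    z = np.column_stack(
        [model.predict_proba(X)[:, 1] for model, X in zip(base_models, pair_views)]
    )
    return meta.predict_proba(z)[:, 1]


scores = predict_interaction(views)
```

With five views instead of three, each built from one of the embeddings above, the same structure mirrors the five base predictors described in this section.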
Graph embedding ensemble learning based on feature integration
In this section, we introduce a graph embedding ensemble learning method based on feature integration (GEEL-FI). We construct a deep attention neural network to learn lncRNA-miRNA pair representations, and further develop a classifier for LMI prediction.
The deep attention neural network contains an attention layer and deep fully-connected neural layers, as given in Fig. 5 (c). First, we use an attention mechanism to integrate the different embedded representations. Because heterogeneous lncRNA and miRNA features could be correlated and contain redundant information, merging them directly may negatively affect the performance of conventional classifiers. An attention mechanism can assign importance weights to different representations, determining the most relevant aspects while disregarding noise and redundancies in the input [44]. Motivated by its successful applications in many fields [45,46,47,48,49,50,51], we adopt an attention mechanism to integrate heterogeneous genomic representations. Then we use a deep neural network (DNN) for feature refinement. A DNN allows computational models with multiple processing layers to learn representations of lncRNAs and miRNAs with multiple levels of abstraction. Moreover, deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change the internal parameters that are used to compute the representation in each layer from the representation in the previous layer [52]. Therefore, we construct a DANN to adaptively capture the importance of each embedding feature and learn distinctive high-level representations for LMI prediction.
Specifically, given the i-th lncRNA and j-th miRNA, the five embedding methods yield five lncRNA representations and five miRNA representations. Let \( {l}_i^k \) and \( {m}_j^k \) (k = 1, 2, 3, 4, 5) denote the embeddings from LE, GraRep, HOPE, DeepWalk and GAE, with i = 1, 2, …, r and j = 1, 2, …, t. These representations are fed into the attention networks. Let L_{i} denote the integrated feature for the i-th lncRNA, and M_{j} denote the integrated feature for the j-th miRNA. The merged representations of the lncRNA and miRNA are defined as:
\[ {L}_i=\sum \limits_{k=1}^5{\boldsymbol{a}}_{\boldsymbol{k}}^{\boldsymbol{l}}{l}_i^k,\qquad {M}_j=\sum \limits_{k=1}^5{\boldsymbol{a}}_{\boldsymbol{k}}^{\boldsymbol{m}}{m}_j^k \]
where \( {\boldsymbol{a}}_{\boldsymbol{k}}^{\boldsymbol{l}} \) denotes an attention weight measuring the importance of embedded representation k with respect to the i-th lncRNA, and \( {\boldsymbol{a}}_{\boldsymbol{k}}^{\boldsymbol{m}} \) is an attention weight measuring the importance of embedded representation k with respect to the j-th miRNA.
Next, we concatenate the i-th lncRNA representation L_{i} and the j-th miRNA representation M_{j} to obtain the lncRNA-miRNA pair feature F_{ij}, which characterizes the interaction between the i-th lncRNA and j-th miRNA:
\[ {F}_{ij}=\left[{L}_i;{M}_j\right] \]
where [L_{i}; M_{j}] is the concatenation of the two vectors.
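The fusion step can be illustrated with NumPy. The sketch assumes that the merged representation is a weighted sum of the per-method embeddings with one scalar attention weight per method; the function names, the weight values, and the random toy embeddings are hypothetical. Setting every weight to 1 recovers the direct summation used by the GEEL-F variant.

```python
import numpy as np


def merge_embeddings(embeddings, weights):
    """Weighted sum of K embedding matrices, each of shape (n_entities, dim)."""
    stacked = np.stack(embeddings)                  # (K, n_entities, dim)
    weights = np.asarray(weights, dtype=float)      # (K,)
    return np.tensordot(weights, stacked, axes=1)   # (n_entities, dim)


def pair_feature(L, M, i, j):
    """Concatenate the merged lncRNA and miRNA representations: F_ij = [L_i; M_j]."""
    return np.concatenate([L[i], M[j]])


# Toy setting: 4 lncRNAs, 3 miRNAs, 2 embedding methods, 5 dimensions each.
rng = np.random.default_rng(0)
lnc_embs = [rng.normal(size=(4, 5)) for _ in range(2)]
mir_embs = [rng.normal(size=(3, 5)) for _ in range(2)]

a_l = [0.7, 0.3]   # hypothetical attention weights for lncRNA embeddings
a_m = [0.4, 0.6]   # hypothetical attention weights for miRNA embeddings
L = merge_embeddings(lnc_embs, a_l)
M = merge_embeddings(mir_embs, a_m)
F_12 = pair_feature(L, M, 1, 2)

# With all weights equal to 1 the merge reduces to the plain sum of GEEL-F.
L_plain = merge_embeddings(lnc_embs, [1.0, 1.0])
```

The pair feature has twice the embedding dimension, because the lncRNA and miRNA halves are placed side by side rather than added.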
To learn preferable representations of lncRNA-miRNA interactions, we consider the interacting lncRNA-miRNA pairs as positive instances and non-interacting lncRNA-miRNA pairs as negative instances to build a deep neural network. For the i-th lncRNA and j-th miRNA, the lncRNA-miRNA pair feature F_{ij} is fed into the deep fully connected layers as follows:
\[ {h}_{ij}^{(1)}=\mathrm{ReLU}\left({W}_1{F}_{ij}+{b}_1\right),\qquad {h}_{ij}^{(l)}=\mathrm{ReLU}\left({W}_l{h}_{ij}^{(l-1)}+{b}_l\right),\quad l=2,\dots, L \]
where \( {h}_{ij}^{(l)} \) denotes the output of the l-th hidden layer; L denotes the number of hidden layers; ReLU is an activation function [53]; and W_{l} and b_{l} are the weight matrix and bias vector for the l-th layer, respectively.
The prediction score between the i-th lncRNA and j-th miRNA, \( {\hat{\rho}}_{ij} \), is computed as:
\[ {\hat{\rho}}_{ij}=\mathrm{Sigmoid}\left(W{h}_{ij}^{(L)}+b\right) \]
where Sigmoid is an activation function; W and b are the weight matrix and bias vector, respectively.
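We adopt the following binary cross entropy as the loss function:
\[ \mathcal{L}=-\sum \limits_{i=1}^r\sum \limits_{j=1}^t\left[{p}_{ij}\log {\hat{\rho}}_{ij}+\left(1-{p}_{ij}\right)\log \left(1-{\hat{\rho}}_{ij}\right)\right] \]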
where \( \mathcal{L} \) denotes the loss function; r and t are the total numbers of lncRNAs and miRNAs respectively; and p_{ij} is a label, with p_{ij} = 1 if the i-th lncRNA and j-th miRNA interact and p_{ij} = 0 otherwise.
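The forward pass and the loss can be sketched in NumPy as below. This is an assumed, minimal form of the network: ReLU hidden layers followed by a sigmoid output, with the loss taken as an unnormalized sum over pairs. The layer sizes, the random weights, and the function names are hypothetical, and pairs are stored as rows, so each layer computes `h @ W + b` rather than the column-vector form.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward(F, hidden_layers, W_out, b_out):
    """Pass pair features through the ReLU hidden layers and a sigmoid output unit."""
    h = F
    for W, b in hidden_layers:
        h = relu(h @ W + b)
    return sigmoid(h @ W_out + b_out).ravel()


def binary_cross_entropy(p, rho, eps=1e-12):
    """Summed binary cross entropy between labels p and predicted scores rho."""
    rho = np.clip(rho, eps, 1 - eps)        # guard against log(0)
    return -np.sum(p * np.log(rho) + (1 - p) * np.log(1 - rho))


# Toy batch: 6 pair features of dimension 8, hidden layers of size 4 and 2.
rng = np.random.default_rng(0)
F = rng.normal(size=(6, 8))
hidden_layers = [
    (rng.normal(scale=0.5, size=(8, 4)), np.zeros(4)),
    (rng.normal(scale=0.5, size=(4, 2)), np.zeros(2)),
]
W_out, b_out = rng.normal(scale=0.5, size=(2, 1)), np.zeros(1)
p = np.array([1, 0, 1, 0, 0, 1], dtype=float)

rho = forward(F, hidden_layers, W_out, b_out)
loss = binary_cross_entropy(p, rho)
```

As a sanity check, two pairs with opposite labels that both receive a score of 0.5 contribute a loss of 2 ln 2.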
The attention weights \( {\boldsymbol{a}}_{\boldsymbol{k}}^{\boldsymbol{l}} \) and \( {\boldsymbol{a}}_{\boldsymbol{k}}^{\boldsymbol{m}} \) can therefore be updated by the backpropagation algorithm [54] and gradient descent according to the loss function \( \mathcal{L} \) above. The update procedure can be described as:
where α is the learning rate of the neural network.
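A gradient descent update of this kind, a_k ← a_k − α ∂ℒ/∂a_k, can be illustrated on a deliberately simplified model. The sketch below replaces the deep network with a single sigmoid unit on the fused feature, which is an assumption made only so that the gradient can be written by hand. In the full model the same update would use gradients obtained by backpropagation through all layers. The names `sgd_step` and `score`, the vector `w`, and all numeric values are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce(p: float, p_hat: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one pair, clipped to avoid log(0)."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    return -(p * math.log(p_hat) + (1.0 - p) * math.log(1.0 - p_hat))


def score(a: Sequence[float], embs: Sequence[Sequence[float]], w: Sequence[float]) -> float:
    """Sigmoid of w . (sum_k a_k * e_k): a one-unit stand-in for the deep network."""
    fused = [sum(a_k * e[d] for a_k, e in zip(a, embs)) for d in range(len(w))]
    return sigmoid(sum(w_d * f_d for w_d, f_d in zip(w, fused)))


def sgd_step(a, embs, w, label: float, alpha: float) -> List[float]:
    """One update a_k <- a_k - alpha * dLoss/da_k for a single training pair."""
    err = score(a, embs, w) - label  # dLoss/dz for sigmoid plus cross-entropy
    grads = [err * sum(w_d * e_d for w_d, e_d in zip(w, e)) for e in embs]
    return [a_k - alpha * g_k for a_k, g_k in zip(a, grads)]


embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
w = [0.5, -0.25]   # fixed, hypothetical output weights
a = [0.2] * 5      # initial attention weights
label = 1.0        # an interacting (positive) pair
alpha = 0.1        # learning rate

losses = []
for _ in range(20):
    losses.append(bce(label, score(a, embs, w)))
    a = sgd_step(a, embs, w, label, alpha)
losses.append(bce(label, score(a, embs, w)))
```

In this toy setting the loss is convex in the attention weights, so a small learning rate lowers it at every step, and the embeddings that align best with `w` receive the largest weight increases.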
Finally, to improve the performance of LMI prediction, we build a Random Forest classifier on the pair features.
In summary, we use DANN to integrate the multiple features obtained by the graph embedding methods, learn better representations of lncRNA-miRNA pairs, and thereby construct GEELFI.
Availability of data and materials
The lncRNASNP dataset used in this study is freely available at http://bioinfo.life.hust.edu.cn/lncRNASNP. The lncRNA sequences were downloaded from NONCODE, which is available at http://www.noncode.org. The main miRNA sequence datasets are available at http://www.mirbase.org.
Abbreviations
GEELPI: Graph embedding ensemble learning based on prediction integration
GEELFI: Graph embedding ensemble learning based on feature integration
LMIs: lncRNA-miRNA interactions
nt: Nucleotides
DANN: Deep attention neural network
DNN: Deep neural network
5CV: 5-fold cross validation
AUC: Area under ROC curve
AUPR: Area under precision-recall curve
SEN: Sensitivity
SPEC: Specificity
PREC: Precision
ACC: Accuracy
F: F-measure
References
Turner M, Galloway A, Vigorito E. Noncoding RNA and its associated proteins as regulatory elements of the immune system. Nat Immunol. 2014;15(6):484–91.
Fatica A, Bozzoni I. Long noncoding RNAs: new players in cell differentiation and development. Nat Rev Genet. 2014;15(1):7–21.
Miska EA. How microRNAs control cell division, differentiation and death. Curr Opin Genet Dev. 2005;15(5):563–8.
Xu P, Guo M, Hay BA. MicroRNAs and the regulation of cell death. Trends Genet. 2004;20(12):617–24.
Lu M, Zhang Q, Deng M, Miao J, Guo Y, Gao W, Cui Q. An analysis of human MicroRNA and disease associations. PLoS One. 2008;3(10):e3420.
Yoon JH, Abdelmohsen K, Gorospe M. Functional interactions among microRNAs and long noncoding RNAs. In: Seminars in cell & developmental biology: 2014. Amsterdam: Elsevier; 2014. p. 9–14.
Chakravarty D, Sboner A, Nair SS, Giannopoulou E, Rubin MA. The oestrogen receptor alpha-regulated lncRNA NEAT1 is a critical modulator of prostate cancer. Nat Commun. 2014;5:5383.
Latronico MVG, Catalucci D, Condorelli G. Emerging role of MicroRNAs in cardiovascular biology. Circ Res. 2007;101(12):1225–36.
Qian L, Jianguo H, Nanjiang Z, Ziqiang Z, Ali Z, Zhaohui L, Fangting W, YinYuan M. LncRNA loc285194 is a p53-regulated tumor suppressor. Nucleic Acids Res. 2013;41(9):4976–87.
Xu MD, Wang Y, Weng W, Wei P, Qi P, Zhang Q, Tan C, Ni SJ, Dong L, Yang Y. A positive feedback loop of lncRNA-PVT1 and FOXM1 facilitates gastric cancer growth and invasion. Clin Cancer Res. 2016;23(8):2071.
Berghoff EG, Clark MF, Sean C, Ivelisse C, Leib DE, Kohtz JD. Evf2 (Dlx6as) lncRNA regulates ultraconserved enhancer methylation and the differential transcriptional control of adjacent genes. Development. 2013;140(21):4407–16.
Gong J, Liu W, Zhang J, Miao X, Guo AY. lncRNASNP: a database of SNPs in lncRNAs and their potential functions in human and mouse. Nucleic Acids Res. 2015;43(Database issue):D181.
Huang YA, Chan KCC, You ZH. Constructing prediction models from expression profiles for large scale lncRNA-miRNA interaction profiling. Bioinformatics. 2018;34(5):812–9.
Huang ZA, Huang YA, You ZH, Zhu Z, Sun Y. Novel link prediction for large-scale miRNA-lncRNA interaction network in a bipartite graph. BMC Med Genomics. 2018;11(6):17–27.
Hu P, Huang YA, Chan KCC, You ZH. Discovering an integrated network in heterogeneous data for predicting lncRNA-miRNA interactions; 2018.
Zhang W, Tang G, Zhou S, Niu Y. LncRNA-miRNA interaction prediction through sequence-derived linear neighborhood propagation method with information combination. BMC Genomics. 2019;20(Suppl 11):946.
Wang YB, You ZH, Li X, Jiang TH, Chen X, Zhou X, Wang L. Predicting protein-protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Mol BioSyst. 2017;13(7):1336–44.
Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics. 2018;34(13):i457–66.
Yue X, Wang Z, Huang J, Parthasarathy S, Moosavinasab S, Huang Y, Lin SM, Zhang W, Zhang P, Sun H. Graph embedding on biomedical networks: methods, applications and evaluations. Bioinformatics. 2020;36(4):1241–51.
Yang P, Hwa Yang Y, Zhou BB, Zomaya AY. A review of ensemble methods in bioinformatics. Curr Bioinformatics. 2010;5(4):296–308.
Polikar R. Ensemble based systems in decision making. IEEE Circuits and systems magazine. 2006;6(3):21–45.
Chen X, Yan CC, Zhang X, Zhang X, Dai F, Yin J, Zhang Y. Drug-target interaction prediction: databases, web servers and computational models. Brief Bioinform. 2016;17(4):696–712.
Zhang W, Niu Y, Xiong Y, Zhao M, Yu R, Liu J. Computational prediction of conformational B-cell epitopes from antigen primary structures by ensemble learning. PLoS One. 2012;7(8):e43575.
Zhang W, Liu F, Luo L, Zhang J. Predicting drug side effects by multi-label learning and ensemble learning. BMC Bioinformatics. 2015;16(1):365.
Zhang W, Yue X, Tang G, Wu W, Huang F, Zhang X. SFPEL-LPI: sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions. PLoS Comput Biol. 2018;14(12):e1006616.
Gong Y, Niu Y, Zhang W, Li X. A network embedding-based multiple information integration method for the MiRNA-disease association prediction. BMC Bioinformatics. 2019;20(1):468.
Zhang W, Jing K, Huang F, Chen Y, Li B, Li J, Gong J. SFLLN: a sparse feature learning ensemble method with linear neighborhood regularization for predicting drug–drug interactions. Inf Sci. 2019;497:189–201.
Belkin M. Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv Neural Inf Proces Syst. 2001;14(6):585–91.
Ou M, Peng C, Jian P, Zhang Z, Zhu W. Asymmetric transitivity preserving graph embedding, vol. 2016. New York City: Acm Sigkdd International Conference; 2016.
Cao S, Wei L, Xu Q. GraRep: learning graph representations with global structural information, vol. 2015. New York City: Acm International on Conference on Information & Knowledge Management; 2015.
Perozzi B, Al-Rfou R, Skiena S. DeepWalk: online learning of social representations, vol. 2014. New York City: Acm Sigkdd International Conference on Knowledge Discovery & Data Mining; 2014.
Kipf TN, Welling M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 2016.
Zhou S, Yue X, Xu X, Liu S, Zhang W, Niu Y. LncRNA-miRNA interaction prediction from the heterogeneous network through graph embedding ensemble learning. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): 2019. New York: IEEE; 2019. p. 622–7.
Pedregosa F, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J. Scikit-learn: machine learning in Python. J Mach Learn Res. 2013;12(10):2825–30.
Zhang W, Tang G, Wang S, Chen Y, Zhou S, Li X. Sequence-derived linear neighborhood propagation method for predicting lncRNA-miRNA interactions. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): 2018; 2018.
JunHao L, Shun L, Hui Z, LiangHu Q, JianHua Y. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 2014;42(Database issue):D92.
Xiao B, Guo J, Miao Y, Jiang Z, Rong H, Zhang Y, Li D, Zhong J. Detection of miR-106a in gastric carcinoma and its clinical significance. Clin Chim Acta. 2009;400(1):97–102.
Yang G, Zhang R, Chen X, Mu Y, Jing A, Chen S, Liu Y, Shi C, Sun L, Rainov NG. MiR-106a inhibits glioma cell growth by targeting E2F1 independent of p53 status. J Mol Med. 2011;89(10):1037–50.
Sabit H, Cevik E, Tombuloglu H, Farag K, Said O. miRNA profiling in MCF-7 breast cancer cells: seeking a new biomarker. J Biomed Sci. 2019;8:3.
Jing G, Wei L, Jiayou Z, Xiaoping M, AnYuan G. lncRNASNP: a database of SNPs in lncRNAs and their potential functions in human and mouse. Nucleic Acids Res. 2015;43(Database issue):D181.
Changning L, Baoyan B, Geir S, Lun C, Wei D, Yong Z, Dongbo B, Yi Z, Runsheng C. NONCODE: an integrated knowledge database of noncoding RNAs. Nucleic Acids Res. 2005;33(Database issue):D112–5.
Sam GJ, Grocock RJ, Stijn VD, Alex B, Enright AJ. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006;34(suppl 1):140–4.
Leslie C, Eskin E, Noble WS. The spectrum kernel: A string kernel for SVM protein classification. In: Biocomputing 2002. World Scientific; 2001. p. 564–75.
Chaudhari S, Polatkan G, Ramanath R, Mithal V. An attentive survey of attention models. arXiv preprint arXiv:1904.02874 2019.
Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 2014.
Cheng Z, Ding Y, He X, Zhu L, Song X, Kankanhalli MS. A^3NCF: an adaptive aspect attention model for rating prediction, vol. 2018. California: IJCAI; 2018. p. 3748–54.
Maharjan S, Montes M, González FA, Solorio T. A genre-aware attention model to improve the likability prediction of books. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: 2018; 2018: 3381–3391.
Han C, Shen F, Liu L, Yang Y, Shen HT. Visual spatial attention network for relationship detection. In: Proceedings of the 26th ACM international conference on multimedia. Seoul, Republic of Korea: Association for Computing Machinery; 2018. p. 510–8.
Hong Z, Zeng X, Wei L, Liu X. Identifying enhancer-promoter interactions with neural network based on pre-trained DNA vectors and attention mechanism. Bioinformatics. 2020;36(4):1037–43.
Shen T, Zhou T, Long G, Jiang J, Pan S, Zhang C. DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding. In: AAAI. 2018;2018.
Ying H, Zhuang F, Zhang F, Liu Y, Xu G, Xie X, Xiong H, Wu J. Sequential recommender system based on hierarchical attention network. IJCAI International Joint Conference on Artificial Intelligence 2018, 2018-July:3926–3932.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.
Maas AL, Hannun AY, Ng AY: Rectifier nonlinearities improve neural network acoustic models. In: Proc icml: 2013; 2013: 3.
Hecht-Nielsen R. Theory of the backpropagation neural network. In: Neural networks for perception: Elsevier; 1992. p. 65–93.
Acknowledgments
Not applicable.
About this supplement
This article has been published as part of BMC Genomics Volume 21 Supplement 13, 2020: Selected articles from the 2019 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2019): genomics (part 1). The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume21supplement13.
Funding
Publication costs are funded by the National Natural Science Foundation of China (61772381, 62072206, 61572368), the National Key Research and Development Program (2018YFC0407904), and the Huazhong Agricultural University Scientific & Technological Self-innovation Foundation. The funding bodies were not involved in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.
Author information
Contributions
WZ designed the study. CSZ and SZ prepared the data and implemented the algorithm. CSZ, YQ, SZ, SCL, WZ, and YQN drafted the manuscript. All of the authors reviewed and approved the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhao, C., Qiu, Y., Zhou, S. et al. Graph embedding ensemble methods based on the heterogeneous network for lncRNA-miRNA interaction prediction. BMC Genomics 21 (Suppl 13), 867 (2020). https://doi.org/10.1186/s1286402007238x