 Research
 Open access
GDCLNcDA: identifying noncoding RNA-disease associations via contrastive learning between deep graph learning and deep matrix factorization
BMC Genomics volume 24, Article number: 424 (2023)
Abstract
Noncoding RNAs (ncRNAs) have drawn wide attention in recent years because they play vital roles in life activities. As a good complement to wet-experiment methods, computational prediction methods can greatly reduce experimental costs. However, high false-negative rates and insufficient use of multi-source information can limit the performance of computational prediction methods. Furthermore, many computational methods lack robustness and generalization across different datasets. In this work, we propose an effective end-to-end computational framework, called GDCLNcDA, which combines deep graph learning and deep matrix factorization (DMF) with contrastive learning to identify latent ncRNA-disease associations on diverse multi-source heterogeneous networks (MHNs). The MHNs include different similarity networks and proven associations among ncRNAs (miRNAs, circRNAs, and lncRNAs), genes, and diseases. Firstly, GDCLNcDA employs a deep graph convolutional network and multiple attention mechanisms to adaptively integrate the multiple sources of the MHNs and reconstruct the ncRNA-disease association graph. Then, GDCLNcDA utilizes DMF to predict latent disease-associated ncRNAs based on the reconstructed graphs, reducing the impact of false negatives in the original associations. Finally, GDCLNcDA uses contrastive learning (CL) to generate a contrastive loss between the reconstructed graphs and the predicted graphs, improving the generalization and robustness of the framework. The experimental results show that GDCLNcDA outperforms highly related computational methods. Moreover, case studies demonstrate the effectiveness of GDCLNcDA in identifying associations between diverse ncRNAs and diseases.
Introduction
The central dogma of molecular biology describes how genetic information is transmitted through RNA to the corresponding protein. Noncoding RNAs (ncRNAs) are a large segment of the transcriptome without apparent protein-coding roles: they are functional RNA molecules that are not translated into proteins [1]. Thus, for decades there was a view that ncRNAs are transcriptional noise [2]. With breakthroughs in biotechnology, ncRNAs have caught the extensive attention of many researchers and completely changed the view of biological scientists on RNA function [3]. Hundreds of studies have found that ncRNAs occupy a vital position in life activities, acting as key regulators of gene expression and being involved in the occurrence and development of many diseases [4]. Nowadays, microRNAs (miRNAs), circular RNAs (circRNAs), and long noncoding RNAs (lncRNAs) are the most commonly studied disease-associated ncRNAs [3, 5].
More than 1,500 miRNAs have been found in the human genome to date. They are about 21-23 nucleotides long, and each miRNA has hundreds of target mRNAs. miRNAs are involved in almost every process in human cells; therefore, researchers believe that every disease has a miRNA component [6]. miR-140-5p and miR-146a can target Sirt2, Nrf2, TAF9b/P53, and other pathways, and they play an important role in doxorubicin-induced cardiotoxicity [7, 8]. In the last few years, great efforts have been made to discover latent miRNA-disease associations. For instance, Chen et al. [9] use a matrix decomposition method to discover disease-associated miRNAs. Peng et al. [10] construct a heterogeneous network (HN) of miRNA-gene-disease with their similarity networks; an autoencoder (AE) and a convolutional neural network (CNN) are used to select the feature combination and to predict the final label for each miRNA-disease pair, respectively. Jiang et al. [11] design a similarity kernel fusion (SKF) method to integrate diverse similarity kernels of miRNAs and diseases, which is more effective for predicting miRNA-disease associations. Li et al. [12] combine linear and nonlinear features of miRNAs and diseases to find latent associations, where the linear features are formed from the disease-lncRNA and miRNA-lncRNA correlation profiles and the nonlinear features are extracted by a graph attention network (GAT).
circRNAs can act as miRNA or protein inhibitors, which has attracted increasing attention from researchers [13]. They have a closed, single-strand, continuous circular form; without 3' or 5' polyadenylated tails, they resist extracellular enzyme-mediated degradation [14]. In Crohn's disease, hsa_circRNA_103765 can affect tumor necrosis factor-\(\alpha\) via induced cell apoptosis [15]. Lei et al. [16] develop a computational path-weighted method for inferring circRNA-disease associations by integrating similarity networks and an interaction network; more specifically, they calculate a linkage score for each circRNA-disease pair based on the paths linking them. Wei et al. [17] reconstruct the association matrix between circRNAs and diseases based on diverse similarity networks, and use it as the basis of a link prediction task via nonnegative matrix factorization. Wang et al. [18] propose a machine learning framework for discovering latent circRNA-disease links via a fusion of circRNA sequences and disease ontology. Li et al. [19] use GAT and random walk with restart (RWR) to extract low-order and high-order neighbor representations from the similarity networks of circRNAs and diseases, respectively; two graph autoencoders (GAEs) then predict circRNA-disease associations by integrating these representations.
lncRNAs are antisense RNA molecules of more than 200 nucleotides. They can regulate the transcription and expression of genes and are involved in cancer development or suppression by specifically binding to noncoding regions of target genes [20]. For example, the overexpression of lncRNA CTA-929C8 in brain tissue may lead to Alzheimer's disease; its expression there is about 1000 times that in other normal tissues [21]. Wang et al. [22] design a weighted matrix factorization method to infer disease-associated lncRNAs. To be specific, the algorithm assigns initial weights to the inter-association and intra-association matrices within the network, then collaboratively decomposes these matrices into low-rank equivalents to uncover the inherent relationships among the nodes. Zhang et al. [23] propose a multi-feature coding approach that builds linkage characteristics of lncRNA-disease samples by combining six similarity characteristics, and develop an attention CNN to infer possible associations between lncRNAs and diseases. Wu et al. [24] utilize a GAE to extract low-dimensional vertex representations and a random forest (RF) to identify possible lncRNA-disease relationships. Zhao et al. [25] utilize GAT to learn vertex representations based on homogeneous and heterogeneous subgraphs; to obtain more semantic information, they apply an attention mechanism to assign weights to numerous metapath-based subgraphs, and for the final prediction task they use neural inductive matrix completion (NIMC) to rebuild the linkages among lncRNAs and diseases.
Although there have been many efforts to analyze the underlying associations between various ncRNAs and diseases, some challenges remain [17, 26, 27]: (1) high false-negative rates in association data; (2) insufficient utilization of multi-source information; (3) noise from both multi-source information and multi-stage methods; (4) insufficient robustness and generalization of the methods.
Firstly, as is well known, traditional wet experiments not only consume substantial resources but are also inefficient and susceptible to external conditions. At present, plentiful computational methods depend critically on the associations between ncRNAs and diseases verified by wet experiments. Unfortunately, existing open ncRNA-disease databases use 1 and 0 to indicate whether a relationship exists, with very few "1" values denoting known associations and very many "0" values denoting unknown associations rather than confirmed non-associations. We call this phenomenon the false-negative problem; there are many false-negative associations in ncRNA-disease databases, which impact the performance and interpretability of computational methods [17, 28]. Secondly, abundant previous works enhance performance by fusing the similarity networks of ncRNAs and diseases with a simple average or linear weighting strategy; those works ignore that different sources of information may contribute differently to the same prediction task [26, 29]. Thirdly, many works use multi-stage methods to integrate multi-source information, and some of these methods also rely on handcrafted intermediate results; moreover, most similarity information contains noise [10, 26]. These issues affect the effectiveness and interpretability of the methods. Finally, most works focus on two specific bioentities of interest (e.g., lncRNA and disease), which may prevent a model from achieving good results on different datasets with the same set of parameters. Therefore, the robustness and generalization of methods need to be improved.
In conclusion, reducing false negatives in the original association data, making full and reasonable use of the multiple sources of information from bioentities, and differentiating the significance of the various sources of information can enhance the predictive capability for ncRNA-disease associations. Furthermore, it is vital to improve the robustness and generalization of methods. However, there is no complete and effective end-to-end framework addressing these challenges.
In this work, to overcome these challenges, we design GDCLNcDA, which uses \(\textbf{G}\)raph learning models and \(\textbf{D}\)eep matrix factorization based on \(\textbf{C}\)ontrastive \(\textbf{L}\)earning for \(\textbf{Nc}\)RNA-\(\textbf{D}\)isease \(\textbf{A}\)ssociation identification. It is an end-to-end computational framework for integrating diverse multi-source information on different HNs. Different from our previous work MHDMF [28], GDCLNcDA introduces deep graph learning (the deep graph convolutional network GCNII [30]) and employs multiple attention mechanisms, including a graph attention network (GAT) and multi-channel attention, to enhance the characteristics within and between similarity networks. GDCLNcDA also uses DMF to identify potential associations while further adding contrastive learning (CL), which gives the framework better generalization and robustness. In addition, we evaluate GDCLNcDA on more multi-source heterogeneous networks (MHNs) containing more bioentities. GDCLNcDA has the following advantages:

1.
We design an end-to-end computational framework, GDCLNcDA, which is the first to introduce GCNII to fuse the multi-source information of different ncRNAs and diseases based on three different multi-layer heterogeneous networks. Furthermore, GDCLNcDA is the first to use CL in a chained framework. These multi-layer heterogeneous networks include miRNAs, circRNAs, lncRNAs, genes, and diseases. GDCLNcDA consists of four parts: (1) constructing multiple MHNs of ncRNAs and diseases, (2) reconstructing diverse association graphs, (3) establishing various predicted association graphs, and (4) generating a contrastive loss between the reconstructed graphs and the predicted graphs.

2.
GDCLNcDA efficiently integrates GCNII, multiple attention mechanisms, DMF, and CL into an end-to-end framework for identifying underlying associations. GDCLNcDA reduces false-negative associations via multi-source GCNII and multiple attention mechanisms, which are used to reconstruct the ncRNA-disease association graphs. GDCLNcDA introduces DMF to take both explicit and implicit feedback into consideration when generating ncRNA-disease predictive graphs based on the reformulated association graphs. In addition, GDCLNcDA further utilizes CL to improve generalization and robustness by generating a contrastive loss between the reconstructed graphs and the predictive graphs.

3.
To assess the capability of GDCLNcDA, we compare it with seven state-of-the-art methods under 5-fold cross-validation (5-CV) and 10-fold cross-validation (10-CV) on three different MHNs, and GDCLNcDA achieves the first-rank results. This shows that GDCLNcDA extends easily to different datasets and has better generalization and robustness. Then, we implement ablation experiments to prove the effectiveness of each part and of the different MHNs, and a parameter analysis of GDCLNcDA to illustrate the choice of parameters. Finally, case studies are performed on miRNA, circRNA, and lncRNA, with two corresponding diseases each.
Multi-source heterogeneous networks
miRNA-gene-disease associations
For miRNA-disease, the positive set of miRNA-disease associations is downloaded from the Human MicroRNA Disease Database (HMDD v2.0) [31]. The miRNA-gene associations are downloaded from the miRWalk2.0 database [32]. The disease-gene associations are downloaded from DisGeNET [33]. We intersect the datasets to remove genes that have no relation with diseases and miRNAs. Meanwhile, we also download the semantic trees of diseases from the U.S. National Library of Medicine (MeSH) [34]. We filter out miRNA-disease associations whose corresponding names are absent from the MeSH descriptors or the miRBase records. Then, we obtain 4266 associations between 285 miRNAs and 197 diseases, with 1789 genes associated with these miRNAs and diseases.
circRNA-gene-disease associations
For circRNA-disease, we download the positive associations of circRNA-disease from the CircR2Disease database [35], the circRNA-gene associations from http://cssb2.biology.gatech.edu/knowgene/search.html, and the disease-gene associations from http://cssb2.biology.gatech.edu/knowgene/. We remove diseases and circRNAs whose corresponding names are absent from the MeSH descriptors or from the records in the Circinteractome and circBank databases. After filtering, there are 418 genes linked with 515 circRNAs, 61 genes linked with 82 diseases, and 563 associations between circRNAs and diseases.
lncRNA-gene-disease associations
For lncRNA-disease, we obtain the positive lncRNA-disease linkages from the LncRNADisease database [36], the lncRNA-gene linkages from the lncReg database [37], and the disease-gene linkages from the DisGeNET database. After removing duplicate and missing data, we collect 577 linkages among 276 lncRNAs and 125 diseases, with 3043 linked genes.
Multi-source information
We integrate multi-source information to build three different types of ncRNA-disease MHNs. The MHNs include the Hamming profile, sequence, and Gaussian interaction profile kernel (GIPK) similarities of the three types of ncRNAs; the Hamming profile, semantic, and GIPK similarities of diseases; and the experimentally validated miRNA-disease, circRNA-disease, lncRNA-disease, miRNA-gene, circRNA-gene, lncRNA-gene, and disease-gene associations. In this work, all similarity networks of ncRNAs and diseases are treated as edge-weighted graphs. The association matrices of ncRNA-gene and disease-gene are treated as node features for the edge-weighted graphs of ncRNAs and diseases, respectively. All the similarity calculations are given in the Supplementary Material.
Hamming profile similarity
The Hamming profile measures the similarity of a pair of vectors by counting the number of corresponding elements at which the two vectors differ [38]. Following the biological assumption that similar ncRNAs tend to be linked with similar diseases, we treat Hamming profile similarity as topological information derived from the known associations among ncRNAs and diseases. The larger the Hamming distance between two profiles, the lower the similarity of the corresponding ncRNAs or diseases. For diseases, the Hamming profile similarity kernel DHS\((d_{i}, d_{j})\) is defined as follows:
where \(\textbf{m}(d_{i}), \textbf{m}(d_{j})\) represent the binary vectors of diseases \(d_{i}\) and \(d_{j}\), which correspond to the \(i^{th}\) and \(j^{th}\) columns of the ncRNA-disease association matrix \(\textbf{M}\).
For ncRNAs, the Hamming profile similarity kernel NHS\((nc_{i}, nc_{j})\) is defined as follows:
where \(\textbf{m}(nc_{i}), \textbf{m}(nc_{j})\) are binary vectors of ncRNAs \(nc_{i}, nc_{j}\), which correspond to the \(i^{th}, j^{th}\) row in the association matrix \(\textbf{M}\).
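As a concrete illustration, the Hamming-profile computation can be sketched in NumPy. The exact kernel above is not reproduced here, so the common formulation, similarity = 1 − (number of mismatches)/(vector length), is assumed:

```python
import numpy as np

def hamming_profile_similarity(M):
    """Hamming-profile similarities from a binary ncRNA-disease matrix M.

    Rows of M are ncRNA profiles, columns are disease profiles. Assumed
    formulation: similarity = 1 - normalized Hamming distance.
    """
    M = np.asarray(M, dtype=int)
    n_nc, n_d = M.shape
    # Disease similarity DHS: compare columns (length-n_nc binary vectors).
    mismatch_d = (M[:, :, None] != M[:, None, :]).sum(axis=0)
    DHS = 1.0 - mismatch_d / n_nc
    # ncRNA similarity NHS: compare rows (length-n_d binary vectors).
    mismatch_nc = (M[:, None, :] != M[None, :, :]).sum(axis=2)
    NHS = 1.0 - mismatch_nc / n_d
    return NHS, DHS
```

For instance, two diseases whose association columns differ in one of three entries get a similarity of \(1 - 1/3\).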
Gaussian interaction profile kernel similarity
The Gaussian interaction profile kernel (GIPK) can capture topological features of the interaction network of biological entity pairs, and similar bioentities cluster well in the space it describes. Therefore, GIPK is a reasonable and widely used measure of bioentity similarity. Here, the GIPK similarity \(DGS(d_{i}, d_{j})\) between diseases \(d_{i}\) and \(d_{j}\) is defined as follows:
\(DGS(d_{i}, d_{j}) = \exp \left( -\beta _{d}\left\| \textbf{m}(d_{i}) - \textbf{m}(d_{j})\right\| ^{2}\right)\)
where \(\beta _{d}\) is a regulation parameter for controlling the kernel bandwidth:
\(\beta _{d} = \beta _{d}^{\prime } \Big / \left( \frac{1}{N_{d}} \sum _{i=1}^{N_{d}}\left\| \textbf{m}(d_{i})\right\| ^{2}\right)\)
where \(N_{d}\) is the number of all diseases and \(\beta _{d}^{\prime }\) is the original bandwidth parameter.
Similarly, the GIPK similarity \(NGS\left( nc_{i}, nc_{j}\right)\) between ncRNAs \(nc_{i}\) and \(nc_{j}\) can be obtained as follows:
\(NGS(nc_{i}, nc_{j}) = \exp \left( -\beta _{nc}\left\| \textbf{m}(nc_{i}) - \textbf{m}(nc_{j})\right\| ^{2}\right)\)
where \(N_{nc}\) is the number of all ncRNAs.
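The GIPK computation above can be sketched in NumPy; the adaptive bandwidth follows the standard formulation, with \(\beta^{\prime}\) defaulting to 1 for illustration:

```python
import numpy as np

def gipk_similarity(M, beta_prime=1.0):
    """GIPK similarity between the row entities of the binary association
    matrix M (e.g. ncRNAs in an ncRNA-disease matrix); pass M.T for the
    column entities (diseases).

    K(i, j) = exp(-beta * ||m_i - m_j||^2), beta = beta_prime / mean(||m_i||^2).
    """
    M = np.asarray(M, dtype=float)
    norms = (M ** 2).sum(axis=1)                    # ||m_i||^2 for each row
    beta = beta_prime / norms.mean()                # adaptive kernel bandwidth
    # Squared Euclidean distances between all pairs of row profiles.
    sq_dist = norms[:, None] + norms[None, :] - 2.0 * M @ M.T
    return np.exp(-beta * np.maximum(sq_dist, 0.0))
```

The `np.maximum(sq_dist, 0.0)` guard only clips tiny negative values caused by floating-point round-off.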
Disease semantic similarity
In the last decade, the effectiveness of the disease semantic similarity of Wang et al. [39] has been proved by many previous works, and it is widely used for identifying latent associations between ncRNAs and diseases. In the MeSH disease descriptors, the relationships among different diseases can be described by their corresponding Directed Acyclic Graph (DAG) structures, where each node is a disease and each directed edge is an association between diseases. The more similar two diseases are, the larger the common part of their DAGs. We obtain the disease semantic similarity \(DSS1(d_{i}, d_{j})\) between diseases \(d_{i}\) and \(d_{j}\) by Eq. (7) as follows:
where \(N\left( d_{i}\right)\) represents the node set of the DAG of disease \(d_{i}\), and \(C 1_{d_{i}}\left( t\right)\) represents the semantic contribution of a node \(t \in N\left( d_{i}\right)\) to \(d_{i}\). For \(d_{i}\) itself, \(C 1_{d_{i}}\left( t\right) = 1\); for any other node \(t\), \(C 1_{d_{i}}\left( t\right) = \max \left\{ 0.5 * C 1_{d_{i}}\left( t^{\prime }\right) \mid t^{\prime } \in \text{ children } \text{ of } t\right\}\), which increases as the distance from \(t\) to \(d_{i}\) decreases.
A disease that occurs in many DAGs is common, and vice versa. The above method treats every disease in the same layer as making the same semantic contribution; however, the semantic contribution of uncommon diseases should be higher than that of common diseases [40]. Following previous work [41], we distinguish the semantic contribution values of uncommon diseases by Eq. (8) as follows:
where \(C 2_{d_{i}}\left( t\right)\), the semantic contribution of \(t\) to \(d_{i}\), is defined as Eq. (9):
Inspired by previous work [41], we calculate the final disease semantic similarity \(DSS(d_{i}, d_{j})\) between diseases \(d_{i}\) and \(d_{j}\) by integrating the results of the above two semantic similarity calculations, as follows:
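The first variant of the contribution scheme (the \(C1\) values with decay factor 0.5, and the \(DSS1\) ratio of shared contributions to total semantic values) can be sketched as follows. The `parents` dictionary is a toy stand-in for the MeSH DAG, and only the \(DSS1\) variant is implemented:

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Wang-style semantic contributions C1 on a disease DAG.

    `parents` maps each term to the set of its direct parents. The disease
    contributes 1 to itself; each hop toward the root keeps the best
    child's contribution multiplied by `delta` (0.5 above).
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        nxt = []
        for t in frontier:
            for p in parents.get(t, ()):
                cand = delta * contrib[t]
                if cand > contrib.get(p, 0.0):   # max over children of p
                    contrib[p] = cand
                    nxt.append(p)
        frontier = nxt
    return contrib

def semantic_similarity(d1, d2, parents):
    """DSS1(d1, d2): shared contributions over the total semantic values."""
    c1 = semantic_contributions(d1, parents)
    c2 = semantic_contributions(d2, parents)
    shared = set(c1) & set(c2)
    num = sum(c1[t] + c2[t] for t in shared)
    return num / (sum(c1.values()) + sum(c2.values()))
```

On a toy DAG where two diseases share only a common parent, each contributes 0.5 through that parent, so the similarity is \(1.0 / 3.0\).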
ncRNA sequence similarity
To make use of ncRNA sequence information, we compute the ncRNA sequence similarity scores \(NSS\left( nc_{i}, nc_{j}\right)\) with the Smith-Waterman (SW) [42] method. This pairwise sequence alignment method is provided by Biopython, a Python toolkit. The sequence information of miRNAs, circRNAs, and lncRNAs is downloaded from the miRBase [34] database, the CircInteractome [43] and circBank [44] databases, and the LncRNADisease [36] database, respectively. In this work, NSS denotes the ncRNA sequence similarity network; the weight of each edge in NSS is normalized to the range [0, 1] as follows:
where \(NSS\left( nc_{i}, nc_{j}\right)\) denotes the SmithWaterman score between ncRNA \(nc_{i}\) and \(nc_{j}\).
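For illustration, a minimal pure-Python Smith-Waterman scorer is shown below with a toy match/mismatch/gap scheme (the paper relies on Biopython's implementation). The \([0, 1]\) normalization by \(\sqrt{SW(i, i)\,SW(j, j)}\) is one common choice and is an assumption here, since the normalization formula itself is not reproduced above:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local-alignment (Smith-Waterman) score between two RNA sequences,
    using a toy scoring scheme for illustration."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are floored at zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

def normalized_similarity(a, b):
    """Assumed [0, 1] normalization: SW(a, b) / sqrt(SW(a, a) * SW(b, b))."""
    return smith_waterman(a, b) / (smith_waterman(a, a) * smith_waterman(b, b)) ** 0.5
```

A self-alignment attains the maximum score, so the normalized self-similarity is exactly 1.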
Methods
Model framework
We design a widely effective computational framework, GDCLNcDA, for identifying latent associations between diseases and different types of ncRNAs. In effect, the more varieties of data there are, the more complementary information there is, and many previous works have shown that exploiting multi-source information does help computational methods improve their performance. In this work, our end-to-end framework utilizes multi-source information from three large MHNs to reduce the influence of false-negative associations and to relieve the noise that a multi-stage method may introduce.
Figures 1 and 2 show the overall flow of GDCLNcDA, which consists of four parts: (1) constructing multiple MHNs of ncRNAs and diseases (Fig. 1), (2) reconstructing association graphs (matrices) (Fig. 2A and B), (3) establishing predicted association graphs (matrices) (Fig. 2C), and (4) generating a contrastive loss between the reconstructed graphs and the predicted graphs (Fig. 2C). For the first part, we construct three multi-layer MHNs including similarity profiles and interaction profiles of miRNAs, circRNAs, lncRNAs, genes, and diseases. For reconstructing the association graphs, we use GAT to reduce the impact of noise in the similarity networks and enhance the characteristics within each similarity network; GCNII encodes the different similarity and interaction profiles, and a channel attention mechanism enhances the characteristics between the similarity networks. For establishing the predicted association graphs, we employ DMF to predict the latent associations based on the reconstructed association graphs. Furthermore, we introduce a novel contrastive optimization module to generate a collaborative contrastive loss over the reconstructed association graphs and the predicted graphs.
Graph attention mechanism
The graph attention network (GAT) [45] is a novel convolution-style neural network. It is a valid method for graph representation learning that addresses weaknesses of previous graph-convolution-based approaches. In GAT, nodes can attend over their neighborhoods' features, and in neighborhoods of different sizes GAT can implicitly assign different significances to different nodes. In this work, we use GAT to capture the characteristics within the multiple homogeneous similarity networks of ncRNAs and diseases.
In GDCLNcDA, GAT is adopted to obtain shallow embeddings of each similarity network of ncRNAs and diseases for the downstream tasks, which reduces the effect of noise in the similarity networks; after GAT, we obtain attention-based similarity networks in which the characteristics within each similarity network are enhanced. GAT first utilizes a masked self-attention mechanism to learn the significance of each node's neighbors. More specifically, it applies a linear transformation to nodes \(i, j\) (each node denoting a disease here) in a similarity graph \(\mathcal {G}\) and employs self-attention on the nodes through a shared attentional mechanism, a mapping function \(f_{a}(\cdot )\), which calculates the attention coefficient \(w_{ij}^{gat}\) as follows:
\(w_{ij}^{gat} = f_{a}\left( \textbf{W}_{gat}\textbf{f}_{i}, \textbf{W}_{gat}\textbf{f}_{j}\right)\)
where \(\textbf{F}_{d} = \{\textbf{f}_{1}, \textbf{f}_{2}, \cdots , \textbf{f}_{N_{d}}\}, \textbf{f}_{i}, \textbf{f}_{j} \in \mathbb {R}^{F_{d}}\) are the input features of the disease nodes, \(N_{d}\) is the number of disease nodes, \(F_{d}\) is the feature dimensionality of each node, and \(\textbf{W}_{gat} \in \mathbb {R}^{F_{d} \times N_{d}}\) is a shared weight matrix. The GAT output for diseases is \(\textbf{F}^{'}_{d} = \{\mathbf {f^{'}}_{1}, \mathbf {f^{'}}_{2}, \cdots , \mathbf {f^{'}}_{N_{d}}\}, \mathbf {f^{'}}_{i} \in \mathbb {R}^{N_{d}}\).
In this formulation, every node can attend to every other node, and all structural information is dropped. By introducing masked attention, we compute \(w_{ij}^{gat}\) only for nodes \(j \in \mathcal {N}_{i}\), where \(\mathcal {N}_{i}\) denotes the \(1^{st}\)-order neighbors of node \(i\) in \(\mathcal {G}\). To make the coefficients comparable across different nodes, we normalize the significance of the different neighbor nodes with the softmax function:
\(\alpha _{ij} = \operatorname {softmax}_{j}\left( w_{ij}^{gat}\right) = \frac{\exp \left( w_{ij}^{gat}\right) }{\sum _{k \in \mathcal {N}_{i}} \exp \left( w_{ik}^{gat}\right) }\)
In this work, we apply the LeakyReLU nonlinearity; fully expanded, the coefficients computed by the attention mechanism can be formulated as follows:
\(\alpha _{ij} = \frac{\exp \left( \operatorname {LeakyReLU}\left( \textbf{a}^{T}\left[ \textbf{W}_{gat}\textbf{f}_{i} \Vert \textbf{W}_{gat}\textbf{f}_{j}\right] \right) \right) }{\sum _{k \in \mathcal {N}_{i}} \exp \left( \operatorname {LeakyReLU}\left( \textbf{a}^{T}\left[ \textbf{W}_{gat}\textbf{f}_{i} \Vert \textbf{W}_{gat}\textbf{f}_{k}\right] \right) \right) }\)
where \({\textbf{a}} \in \mathbb {R}^{2N_{d}}\) is a weight vector to parameterize the attention layer. \(\cdot ^{T}\) represents matrix transposition and \(\Vert\) represents the concatenation operation.
Subsequently, we obtain the aggregated features of each node by linearly combining the normalized attention coefficients with the node features; the aggregated features pass through a potentially nonlinear activation function \(\sigma (\cdot )\) to become the final node features. The GAT output \(\mathbf {f^{'}}_{i}\) is then formed as follows:
\(\mathbf {f^{'}}_{i} = \sigma \left( \sum _{j \in \mathcal {N}_{i}} \alpha _{ij} \textbf{W}_{gat}\textbf{f}_{j}\right)\)
In this work, to reduce the variance of self-attention and stabilize the learning of node importance, we further employ multi-head attention. Specifically, we concatenate the node features produced by \(K\) independent self-attention heads, so that Eq. (15) can be rewritten as follows:
\(\mathbf {f^{'}}_{i} = \overset{K}{\underset{k=1}{\Vert }} \sigma \left( \sum _{j \in \mathcal {N}_{i}} \alpha _{ij}^{k} \textbf{W}_{gat}^{k}\textbf{f}_{j}\right)\)
where \(\Vert\) denotes the concatenation operation, \(\alpha _{i j}^{k}\) denotes the normalized attention coefficients calculated by the \(k^{th}\) self-attention head \((a_{gat}^{k})\), and \(\textbf{W}_{gat}^{k}\) denotes the corresponding weight matrix. Correspondingly, we obtain the final GAT output for diseases \(\textbf{F}^{'}_{d} \in \mathbb {R}^{N_{d} \times N_{d}}\), as well as for ncRNAs \(\textbf{F}^{'}_{nc} \in \mathbb {R}^{N_{nc} \times N_{nc}}\), where \(N_{nc}\) is the number of ncRNA nodes. We treat these GAT outputs as the attention-adjacency matrices of ncRNAs \(\textbf{A}_{nc}\) and diseases \(\textbf{A}_{d}\) for the downstream reconstruction task; they are also called the attention-based similarity networks of ncRNAs \(\mathcal {G}^{a}_{nc}\) and diseases \(\mathcal {G}^{a}_{d}\).
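The masked-attention computation described above can be sketched for a single head in NumPy; the \(\tanh\) output stands in for the unspecified \(\sigma(\cdot)\), and the adjacency matrix is assumed to include self loops:

```python
import numpy as np

def gat_layer(F, A, W, a, neg_slope=0.2):
    """One single-head GAT layer on an edge-weighted similarity graph.

    F: (N, F_in) node features; A: (N, N) adjacency with self loops
    (nonzero entries mark first-order neighbors); W: (F_in, F_out) shared
    linear map; a: (2 * F_out,) attention vector.
    """
    H = F @ W                                   # shared linear transform W_gat f
    f_out = H.shape[1]
    # a^T [h_i || h_j] decomposes into a_left . h_i + a_right . h_j.
    e = (H @ a[:f_out])[:, None] + (H @ a[f_out:])[None, :]
    e = np.where(e > 0, e, neg_slope * e)       # LeakyReLU
    e = np.where(A > 0, e, -np.inf)             # masked attention: neighbors only
    e = e - e.max(axis=1, keepdims=True)        # numerical stability
    att = np.exp(e)
    att = att / att.sum(axis=1, keepdims=True)  # row-wise softmax over N_i
    return np.tanh(att @ H)                     # aggregate and activate
```

With zero attention parameters, every neighbor receives equal weight, so the layer reduces to mean aggregation over each neighborhood.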
Deep graph convolution network
The graph convolutional network (GCN) and its variants are vital components of graph learning that can obtain low-dimensional vector embeddings of nodes [46]. Although they show excellent performance in a variety of application areas on real-world datasets, most recent models, such as GCN [47] and GAT [45], are shallow and achieve their best performance with 2-layer models. Stacking more graph convolution layers and adding nonlinearity causes a phenomenon called over-smoothing, which tends to degrade these models' performance. Chen et al. [30] develop GCNII, which effectively relieves over-smoothing by using initial residual and identity mapping techniques. In this work, we utilize GCNII for similarity-specific learning: one GCNII is trained for each attention-based similarity network in the association graph reformulation component.
In GDCLNcDA, we treat every attention-based similarity network as an edge-weighted graph \(\mathcal {G}^{a}_{nc} = (\mathcal {V}_{nc}, \mathcal {E}_{nc})\) or \(\mathcal {G}^{a}_{d} = (\mathcal {V}_{d}, \mathcal {E}_{d})\). A GCNII model has two inputs: (1) the attention-adjacency matrices \(\textbf{A}_{nc} \in \mathbb {R}^{N_{nc} \times N_{nc}}\) and \(\textbf{A}_{d} \in \mathbb {R}^{N_{d} \times N_{d}}\), which describe the graph structure, where \(N_{nc}\) is the number of ncRNAs and \(N_{d}\) is the number of diseases; (2) the node feature matrices \(\textbf{X} \in \mathbb {R}^{N_{nc} \times F_{nc}}\) and \(\textbf{Y} \in \mathbb {R}^{N_{d} \times F_{d}}\), where \(F_{nc}\) and \(F_{d}\) are the feature dimensionalities of ncRNAs and diseases, respectively. We treat the ncRNA-gene and disease-gene associations as the feature matrices of the ncRNA-ncRNA and disease-disease edge-weighted graphs, respectively. Each attention-based similarity network is trained with one GCNII, which is built by stacking multiple convolutional layers. For ncRNAs, the embedding of the \(l^{th}\) layer, \(l = \{1, 2, \cdots , L\}\), is defined as follows:
\(\textbf{E}^{X}_{l} = \sigma \left( \left( (1-\alpha _{l}) \tilde{\textbf{P}}_{nc} \textbf{E}^{X}_{l-1} + \alpha _{l} \textbf{E}^{X}_{0}\right) \left( (1-\beta _{l}) \textbf{I}_{n} + \beta _{l} \textbf{W}_{l}\right) \right)\)
For diseases, the embedding of the \(l^{th}\) layer can be written as follows:
\(\textbf{E}^{Y}_{l} = \sigma \left( \left( (1-\alpha _{l}) \tilde{\textbf{P}}_{d} \textbf{E}^{Y}_{l-1} + \alpha _{l} \textbf{E}^{Y}_{0}\right) \left( (1-\beta _{l}) \textbf{I}_{n} + \beta _{l} \textbf{W}_{l}\right) \right)\)
where \(\alpha _{l}\) and \(\beta _{l}\) are hyperparameters. To ensure that the final embedding of every node retains a fraction \(\alpha _{l}\) of the input features even when many layers are stacked, we set \(\alpha _{l} = 0.2\) here. Setting \(\beta _{l}\) ensures that the decay of the weight matrix adaptively increases as more layers are stacked; here \(\beta _{l} = \log (\lambda / l + 1) \approx \lambda / l\), where \(\lambda\) is a hyperparameter.
\(\tilde{\textbf{P}} = \tilde{\textbf{D}}^{-1/2}\left( \textbf{A} + \textbf{I}_{n}\right) \tilde{\textbf{D}}^{-1/2}\) is the graph convolution matrix with the renormalization trick, where \(\tilde{\textbf{D}}\) is the diagonal degree matrix of \(\textbf{A} + \textbf{I}_{n}\) and \(\textbf{I}_{n}\) is the identity matrix used for identity mapping. We obtain the final deep graph learning embeddings of ncRNAs \(\textbf{E}^{X} \in \mathbb {R}^{N_{nc} \times f_{nc}}\) and diseases \(\textbf{E}^{Y} \in \mathbb {R}^{N_{d} \times f_{d}}\) from the multiple sources of information, where \(f_{nc}\) and \(f_{d}\) are the dimensionalities of the embeddings.
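A single GCNII propagation step, with its initial-residual (\(\alpha_l\)) and identity-mapping (\(\beta_l\)) terms, can be sketched as follows; the ReLU activation for \(\sigma(\cdot)\) is an assumption:

```python
import numpy as np

def gcnii_layer(H_l, H_0, A, W_l, l, alpha=0.2, lam=1.0):
    """One GCNII step:

    H_{l} -> ReLU( ((1-alpha) P H_l + alpha H_0) ((1-beta_l) I + beta_l W_l) )

    with P = D^{-1/2} (A + I) D^{-1/2} (renormalization trick) and
    beta_l = log(lam / l + 1) ~ lam / l.
    """
    n = A.shape[0]
    A_hat = A + np.eye(n)                            # add self loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    P = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    beta = np.log(lam / l + 1.0)
    smoothed = (1.0 - alpha) * (P @ H_l) + alpha * H_0           # initial residual
    out = smoothed @ ((1.0 - beta) * np.eye(H_l.shape[1]) + beta * W_l)  # identity mapping
    return np.maximum(out, 0.0)                      # ReLU
```

Because a fraction \(\alpha\) of the initial embedding \(H_0\) is mixed back in at every layer, stacking many such layers no longer collapses all node embeddings to a common value.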
Multichannel attention mechanism
Many previous works use a simple average or a linear weighting strategy to integrate multiple sources of similarity information, which ignores the different contributions of the different sources [48]. In this work, we apply a multi-channel attention mechanism to capture the characteristics between the multiple similarity networks of ncRNAs and diseases.
As shown in Fig. 2C, the embedding tensor \(\mathscr {T}\) is stacked from all the similarity embedding matrices produced by the deep multi-source graph learning above, and each embedding matrix is treated as one channel of an attention layer. We then model the significance of each channel (similarity) to increase or decrease the contribution of the diverse source similarities. \(C_{nc}\) and \(C_{d}\) are the numbers of channels for ncRNAs and diseases, respectively. By squeezing the embedding tensors of ncRNAs \(\mathscr {T}_{X} = [\textbf{E}_{1}^{X}, \textbf{E}_{2}^{X}, \cdots , \textbf{E}_{C_{nc}}^{X}], \mathscr {T}_{X} \in \mathbb {R}^{N_{nc} \times f_{nc} \times C_{nc}}\), and diseases \(\mathscr {T}_{Y} = [\textbf{E}_{1}^{Y}, \textbf{E}_{2}^{Y}, \cdots , \textbf{E}_{C_{d}}^{Y}], \mathscr {T}_{Y} \in \mathbb {R}^{N_{d} \times f_{d} \times C_{d}}\), we get the one-dimensional (1D) features of ncRNAs \(\mathscr {F}_{X} \in \mathbb {R}^{1 \times 1 \times C_{nc}}\) and diseases \(\mathscr {F}_{Y} \in \mathbb {R}^{1 \times 1 \times C_{d}}\). Specifically, for the \(c^{th}_{nc}\) and \(c^{th}_{d}\) embedding matrices of ncRNAs \(\textbf{E}_{c_{nc}}^{X}\) and diseases \(\textbf{E}_{c_{d}}^{Y}\), the values \(f_{c_{nc}}, f_{c_{d}}\) in \(\mathscr {F}_{X}, \mathscr {F}_{Y}\) are calculated as follows:
The significance of each channel is then computed as attention weights by the attention mechanism:
where \(\textbf{W} = \{\textbf{W}_{1}, \textbf{W}_{2}\}\) is the training parameter, \(f_{C_{nc}}^{a}, f_{C_{d}}^{a}\) are values in \(\mathscr {F}_{X}^{a} \in \mathbb {R}^{1 \times 1 \times C_{nc}}, \mathscr {F}_{Y}^{a} \in \mathbb {R}^{1 \times 1 \times C_{d}}\), which are attentional 1D features of ncRNA and disease, respectively.
Finally, we obtain the normalized channel embeddings with attention weights as follows:
As aforementioned, we get the enhanced channel embeddings of ncRNAs \(\tilde{\mathscr {T}}_{X} = [\tilde{\textbf{E}}_{1}^{X}, \tilde{\textbf{E}}_{2}^{X} , \dots , \tilde{\textbf{E}}_{C_{nc}}^{X}]\) and diseases \(\tilde{\mathscr {T}}_{Y} = [\tilde{\textbf{E}}_{1}^{Y}, \tilde{\textbf{E}}_{2}^{Y} , \dots , \tilde{\textbf{E}}_{C_{d}}^{Y}]\).
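The squeeze-and-attend steps above can be sketched as follows; the two-layer attention MLP with a ReLU bottleneck and a sigmoid, parameterized by \(\textbf{W}_{1}\) and \(\textbf{W}_{2}\), is an assumed instantiation of the attention layer:

```python
import numpy as np

def channel_attention(T, W1, W2):
    """Squeeze-and-attend over similarity channels.

    T: (N, f, C) tensor stacking one (N, f) embedding matrix per similarity
    channel; W1: (C, C_r) and W2: (C_r, C) are the attention-layer weights.
    Returns the rescaled channel embeddings and the channel weights.
    """
    squeezed = T.mean(axis=(0, 1))                   # (C,): one scalar per channel
    hidden = np.maximum(squeezed @ W1, 0.0)          # ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(hidden @ W2)))   # sigmoid channel significance
    return T * weights[None, None, :], weights
```

Channels carrying informative similarities can thus be amplified while noisy ones are suppressed, instead of averaging every source with equal weight.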
The association graph reconstruction
We employ a CNN to generate the final embeddings of ncRNAs \(\textbf{X}_{nc}^{\prime }\) and diseases \(\textbf{Y}_{d}^{\prime }\) based on the enhanced multi-channel embeddings; \(\textbf{X}_{nc}^{\prime }\) and \(\textbf{Y}_{d}^{\prime }\) are represented as follows:
where \(\textbf{W}_{k}^{nc} \in \mathbb {R}^{f_{nc} \times 1}\) and \(\textbf{W}_{k}^{d} \in \mathbb {R}^{f_{d} \times 1}\), and \(f_{nc}\) and \(f_{d}\) are the numbers of features from the GCNII embeddings.
Then, we reconstruct the ncRNA-disease association graph \(\textbf{ReG} \in \mathbb {R}^{N_{nc} \times N_{d}}\) by Matrix Factorization (MF), which can be described as:
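The two steps above can be sketched as follows. This is a simplified numpy sketch: the CNN is reduced to a learned linear combination of channels, the filter matrices stand in for the filters \(\textbf{W}_{k}^{nc}, \textbf{W}_{k}^{d}\), and the ncRNA and disease embeddings are assumed to share a feature dimensionality so that the MF inner product is defined.

```python
import numpy as np

def reconstruct_association_graph(T_nc, T_d, W_nc, W_d):
    """Collapse the enhanced channel embeddings with learned filters,
    then reconstruct the ncRNA-disease graph by matrix factorization.

    T_nc : (N_nc, f, C_nc) and T_d : (N_d, f, C_d) enhanced tensors.
    W_nc : (C_nc, k) and W_d : (C_d, k) filter weights (hypothetical
    shapes standing in for the paper's CNN filters).
    """
    # CNN step (sketch): combine channels into final embeddings
    # X' of shape (N_nc, f*k) and Y' of shape (N_d, f*k).
    X = np.tensordot(T_nc, W_nc, axes=([2], [0])).reshape(T_nc.shape[0], -1)
    Y = np.tensordot(T_d, W_d, axes=([2], [0])).reshape(T_d.shape[0], -1)
    # MF step: inner products give the reconstructed graph ReG.
    return X @ Y.T

rng = np.random.default_rng(1)
T_nc = rng.standard_normal((6, 4, 3))   # 6 ncRNAs, 4 features, 3 channels
T_d = rng.standard_normal((5, 4, 2))    # 5 diseases, 4 features, 2 channels
ReG = reconstruct_association_graph(
    T_nc, T_d, rng.standard_normal((3, 2)), rng.standard_normal((2, 2)))
```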
Deep matrix factorization
Matrix Factorization (MF) is a latent factor model with an outstanding capacity for information mining in recommender tasks [49]. Many previous works successfully utilize MF methods to predict linkages between biological entities [3, 50, 51]. However, the associations between biological entities are very sparse, which degrades the performance of computational methods. To alleviate this problem, many methods add relevant similarity information to assist the prediction task [52]. Nevertheless, modeling only the linear features extracted by MF is insufficient to capture the complicated associations between ncRNAs and diseases. Deep matrix factorization (DMF) captures nonlinear features between ncRNAs and diseases based on all explicit and implicit feedback, and thereby improves prediction performance.
There are three steps in this part. Firstly, we extract the row and column vectors of the reconstructed associations \(\textbf{ReG}\) as the original features of ncRNA \(\textbf{ReG}_{i*}\) and disease \(\textbf{ReG}_{*j}\), respectively. \(\textbf{ReG}_{i*}\) and \(\textbf{ReG}_{*j}\) contain the association patterns of ncRNA \(nc_{i}\) and disease \(d_{j}\), and are considered as the associations between the \(i^{th}\) ncRNA and all diseases, and between the \(j^{th}\) disease and all ncRNAs, respectively. The original ncRNA-disease association matrix \(\textbf{M}\) has a high false-negative rate, because a 1 denotes a known link with experimental backing (explicit feedback), while a 0 denotes an unknown link rather than a confirmed non-link (implicit feedback). We obtain predicted scores for some unknown relations in \(\textbf{ReG}\) to reduce the false negatives, while retaining the original “1” values of the ncRNA-disease associations. The implicit feedback is thus represented by nonzero values between 0 and 1, rather than by 0 only, and we further exploit this implicit feedback, composed of association patterns, to enhance performance. Secondly, we treat \(\textbf{ReG}_{i*}\) and \(\textbf{ReG}_{*j}\) as inputs to multiple fully connected layers, projecting ncRNAs and diseases into a latent structured space. More specifically, the feature of ncRNA \(\textbf{x}_{i}\) (and likewise the feature of disease \(\textbf{y}_{j}\)) is generated as follows:
where \(h_{l^{\prime }}\ (l^{\prime } = 1, \dots , L^{\prime }-1)\) denotes the \(l^{\prime th}\) hidden layer and \(L^{\prime }\) denotes the number of hidden layers. \(\textbf{W}^{\prime }_{l^{\prime }}\) and \(\textbf{b}_{l^{\prime }}\) are the weight matrix and the bias term of the \(l^{\prime th}\) hidden layer, respectively. \(f_{\theta }(\cdot )\) is a nonlinear activation function; we use the Rectified Linear Unit (ReLU) here.
Thirdly, we obtain the final features of ncRNA \(\textbf{X}_{nc} = \{ \textbf{x}_{1}, \textbf{x}_{2}, \dots , \textbf{x}_{m}\}\) and disease \(\textbf{Y}_{d} = \{\textbf{y}_{1}, \textbf{y}_{2}, \dots , \textbf{y}_{n}\}\). We then obtain the final ncRNA-disease association predicted graph \(\textbf{PrG} \in \mathbb {R}^{N_{nc} \times N_{d}}\) by MF as below:
The higher the value of \(\textbf{PrG}_{ij}\), the more likely an association between ncRNA \(nc_{i}\) and disease \(d_{j}\), and vice versa.
In GDCLNcDA, we use the mean square error as the loss function, which is achieved by minimizing the Frobenius norm of the difference between \(\textbf{PrG}\) and \(\textbf{M}\). The loss function is given as follows:
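The three DMF steps and the Frobenius-norm loss can be sketched in numpy as follows. The tower depths and widths are illustrative assumptions, and the bias terms \(\textbf{b}_{l^{\prime }}\) are omitted for brevity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dmf_predict(ReG, Ws_nc, Ws_d):
    """DMF sketch: project row/column association patterns of the
    reconstructed graph ReG through small MLP towers, then score every
    ncRNA-disease pair by an inner product.

    Ws_nc / Ws_d : lists of weight matrices for the two towers
    (illustrative; the paper also uses bias terms).
    """
    X = ReG        # rows ReG_{i*}: association pattern of each ncRNA
    Y = ReG.T      # columns ReG_{*j}: association pattern of each disease
    for W in Ws_nc:
        X = relu(X @ W)
    for W in Ws_d:
        Y = relu(Y @ W)
    return X @ Y.T  # predicted association graph PrG

rng = np.random.default_rng(2)
ReG = rng.random((6, 5))                       # toy reconstructed graph
PrG = dmf_predict(ReG,
                  [rng.standard_normal((5, 8)), rng.standard_normal((8, 4))],
                  [rng.standard_normal((6, 8)), rng.standard_normal((8, 4))])
# The DMF loss against the original associations M would then be:
# loss = np.linalg.norm(PrG - M, 'fro') ** 2
```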
Co-contrastive learning
Contrastive Learning (CL) demonstrates an excellent unsupervised capability in graph representation learning [53,54,55,56]. Initially, Velickovic et al. [53] and Sun et al. [57] learn expressive representations of graphs or nodes by maximizing the mutual information at different granularities between graph-level and substructure-level representations. Peng et al. [58] obtain the mutual information between the inputs and the representations of nodes and edges by employing two discriminators. You et al. [59,60,61] propose various augmentations for graph-level representation learning.
In this work, we use CL to learn the mutual information between representations of nodes and edges from the reconstructed association graph and the predicted association graph, rather than contrasting different augmented views of examples. The purpose of CL here is to improve the generalization ability of our framework and to supervise the learning of the latent linkage prediction task. The co-contrastive learning loss \(Loss_{CL}\) for each positive pair \((\textbf{reg}_{i}, \textbf{prg}_{i})\) of the reconstructed and predicted association graphs is defined as follows:
where \(\textbf{reg}_{i}\) is the embedding of node \(i\) in \(\textbf{ReG}\), treated as the anchor, and \(\textbf{prg}_{i}\) is the corresponding embedding in \(\textbf{PrG}\), which is the positive sample. We treat the embeddings of all other nodes in both graphs as negatives (here, positives and negatives correspond to pairs with and without relations, respectively). \(\mathcal {T}\) is an augmentation function, and the critic is \(\phi (\textbf{reg}, \textbf{prg}) = sim(g(\textbf{reg}), g(\textbf{prg}))\), where \(sim(\cdot )\) is the cosine similarity and \(g(\cdot )\) is a linear projection that enhances the expressive power of the critic function [30].
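An InfoNCE-style instantiation of such a co-contrastive loss can be sketched as follows. This is a simplified sketch: the temperature `tau` and the omission of the linear projection \(g(\cdot)\) are assumptions for brevity, and only the cross-graph negatives are scored.

```python
import numpy as np

def contrastive_loss(reg, prg, tau=0.5):
    """InfoNCE-style loss between node embeddings of the reconstructed
    graph (reg, anchors) and the predicted graph (prg, positives).
    Matched rows (reg_i, prg_i) are positive pairs; all other rows of
    prg serve as negatives for reg_i.
    """
    def normalize(Z):
        return Z / np.linalg.norm(Z, axis=1, keepdims=True)
    a, b = normalize(reg), normalize(prg)
    sim = a @ b.T / tau                                # cosine similarities
    logits = sim - sim.max(axis=1, keepdims=True)      # stabilize softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Numerator of InfoNCE lives on the diagonal (matched pairs)
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(3)
reg = rng.standard_normal((4, 3))   # toy embeddings of 4 nodes in ReG
prg = rng.standard_normal((4, 3))   # corresponding embeddings in PrG
loss = contrastive_loss(reg, prg)
```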
Finally, the optimization objective of our framework consists of three parts: the multi-source graph learning loss, the DMF loss, and the contrastive loss. The final loss function is given as follows:
Experiments
In this section, we conduct experiments to address the following questions: (1) Is GDCLNcDA a feasible and efficient general method for identifying latent associations among multiple types of ncRNAs and diseases? (2) Is it useful to integrate deep graph learning, DMF, and co-contrastive learning into an end-to-end framework? (3) Is it beneficial to use the information of larger MHNs?
Comparison with highly related methods
To demonstrate the feasibility and efficiency of GDCLNcDA, we compare the GDCLNcDA framework with seven other advanced methods from recent years. 5-fold cross-validation (5CV) and 10-fold cross-validation (10CV) are performed to evaluate the performance of GDCLNcDA and the seven methods on the same MHNs. All known associations between ncRNAs and diseases are treated as positive samples, and unknown associations are treated as candidate samples. In K-fold cross-validation (K is 5 or 10), (step 1) all proved associations are shuffled randomly and divided into K groups; (step 2) each unique group in turn is taken as the test dataset and the remaining groups as the training dataset; (step 3) step 2 is repeated K times, each time with a different group. Our results are the averages over the K groups of results. Following the articles of the baselines, the parameters of these methods are tuned to the optimum on our datasets. For our GDCLNcDA, the number of GCNII layers is set to 5, the CNN feature dimensionality to 96, the number of DMF layers to 2, the DMF feature dimensionality to 96, and the learning rate to 0.001; the adaptive moment estimation (Adam) optimizer is used. It is worth noting that our experiments on the three different MHNs are all based on this single set of parameters. We also use the area under the receiver operating characteristic curve (AUC) and the area under the precision/recall curve (AUPRC) to assess the performance of the eight methods. All experiments are repeated 10 times to obtain a sound estimate of the prediction results.
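The three cross-validation steps above can be sketched as follows; the fold count and toy pair list are illustrative.

```python
import numpy as np

def kfold_positive_splits(pairs, k=5, seed=0):
    """Split the proved ncRNA-disease pairs into K folds: shuffle,
    partition into K groups, and yield (train, test) index arrays with
    each group serving as the test set exactly once.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))          # step 1: shuffle
    folds = np.array_split(idx, k)             # step 1: K groups
    for i in range(k):                         # steps 2-3: rotate test fold
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

pairs = list(range(23))   # toy stand-in for the known associations
splits = list(kfold_positive_splits(pairs, k=5))
```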
Baselines
\(\mathbf {MDASKF}\) [11]: Diverse similarity-kernel integration for miRNA-disease relation prediction. MDASKF develops Similarity Kernel Fusion (SKF) to integrate different similarity kernels of miRNAs and diseases extracted in two subspaces, respectively, and then performs the Laplacian regularized least-squares method to predict potential miRNA-disease relations.
\(\textbf{NIMCGCN}\) [62]: Neural Inductive Matrix Completion (NIMC) with GCN for miRNA-disease relationship identification. NIMCGCN is the first model to use a GCN to learn miRNA and disease representations from their corresponding similarity networks. The learned representations are then treated as inputs to a novel NIMC method to complete the miRNA-disease relationship matrix.
\(\textbf{MMGCN}\) [26]: A multi-source GCN with an attention mechanism for miRNA-disease link prediction. MMGCN learns embeddings of miRNAs and diseases via GCN encoders over their various corresponding similarity views, respectively. It further employs an attention mechanism to differentiate the embeddings from different views for the prediction task.
\(\textbf{DMFCDA}\) [63]: DMF for circRNA-disease linkage inference. DMFCDA employs a projection layer to learn the underlying features of circRNAs and diseases from the original circRNA-disease linkages only. By modeling nonlinear linkages, it can learn complex information from the data and take both explicit and implicit feedback into consideration.
\(\textbf{DMFMSF}\) [27]: DMF with SVD and SKF for ncRNA-disease relation discovery. DMFMSF first uses SKF to integrate three similarities of ncRNAs and diseases, respectively. Then, it extracts linear and nonlinear characteristics by Singular Value Decomposition (SVD) and DMF. Finally, it combines the linear and nonlinear characteristics to discover potential ncRNA-disease relations.
\(\mathbf {CKAHGRTMF}\) [3]: A three-matrix factorization model with hypergraph-regular terms for ncRNA-disease relationship prediction. It assesses the degree of association through a bilateral projection matrix and two latent characteristic matrices of ncRNA and disease, respectively. It further uses two graph-regular terms on the ncRNA and disease characteristics to enhance prediction performance.
\(\textbf{MHDMF}\) [28]: Multi-source GCN and DMF for miRNA-disease association identification. MHDMF learns and enhances embeddings of miRNAs and diseases by GCN and channel attention from their diverse corresponding similarity networks, respectively. Finally, it uses DMF to identify latent associations based on the embeddings.
Performance comparison
Tables 1, 2, 3, and 4 present all comparison results, illustrating the feasibility and effectiveness of GDCLNcDA. Our GDCLNcDA framework performs best among the compared methods. As the results of GDCLNcDA under 5CV and 10CV differ only slightly, GDCLNcDA is more robust than the other methods. More importantly, the GDCLNcDA framework performs stably on different MHNs and generalizes well across different datasets.
Different from the traditional similarity-network integration methods (MDASKF, DMFMSF and CKAHGRTMF), GDCLNcDA does not integrate similarity information through a simple averaging or linear weighting strategy. It automatically learns the information of each similarity network through deep graph learning and effectively distinguishes the contributions of the different similarity sources to the prediction task through the attention mechanism. The GDCLNcDA framework can thus integrate multi-source similarities in a more principled way. Different from the multi-stage methods (DMFMSF and CKAHGRTMF), our framework takes an end-to-end approach to training and prediction. This enables the model to automatically learn relevant and discriminative features from the raw input data: instead of relying on handcrafted features, the model can extract representations and patterns directly from the data, potentially capturing more intricate and nuanced information. Furthermore, it optimizes all model parameters jointly, considering the entire pipeline from input to output. This holistic optimization can improve performance, as the model can adapt its internal representations and decision-making to the end objective rather than optimizing individual components separately. Different from the graph-learning-based methods (NIMCGCN and MMGCN), our framework utilizes more information from larger MHNs and captures richer, more comprehensive representations. It also uses attention mechanisms to strengthen both the node features within each similarity network and the contributions across different similarity networks. GDCLNcDA can therefore effectively integrate information from multiple sources and improve the overall understanding of the data.
We use contrastive learning in this framework to extract semantically meaningful representations by maximizing the similarity between positive pairs and minimizing the similarity between negative pairs. This encourages the framework to capture essential features and discard irrelevant or noisy information, resulting in rich and informative representations that generalize well to downstream tasks. Different from the DMF-based methods (DMFCDA and DMFMSF), our GDCLNcDA decreases the false negatives in the original associations on which MF relies. We further integrate more information as additional data into the reconstructed graph. Multi-source information often provides complementary views of the data, capturing different aspects or modalities. Contrastive learning can reduce the need for large amounts of labeled data in the target domain, and accordingly reduces the impact of false negatives. Together, these improve the ability of GDCLNcDA to generalize and handle complex patterns and variations. In brief, the feasibility and effectiveness of GDCLNcDA in identifying underlying ncRNA-disease associations are verified by the comparison results above.
Ablation experiments
Performance of GDCLNcDA and its variants
In this section, we examine whether the integration of deep graph learning, DMF, and contrastive learning within the GDCLNcDA framework is necessary for the ncRNA-disease association identification task. We carry out an ablation experiment by splitting and recombining the components of our framework. The experiment is conducted under 5CV.
The variant methods we framed include GDCLNcDA, GDCLNcDA\(\_\)GCNII, GDCLNcDA\(\_\)GATGCNII, GDCLNcDA\(\_\)DMF, GDCLNcDA\(\_\)GCNII+DMF, and GDCLNcDA\(\_\)GCNII+DMF+CL.

GDCLNcDA\(\_\)GCNII denotes that only GCNII and channel attention are performed to extract and strengthen the embeddings for the final identification task.

GDCLNcDA\(\_\)GATGCNII denotes that only GAT and GCNII are performed to enhance and generate the embeddings for the final identification task.

GDCLNcDA\(\_\)DMF denotes that only DMF is used for the final identification task, without any additional information.

GDCLNcDA\(\_\)GCNII+DMF denotes that GCNII is used first to reconstruct the association graph, and then DMF is used for the final identification task based on the reconstructed graph.

GDCLNcDA\(\_\)GCNII+DMF+CL denotes that GCNII is used first to reconstruct the association graph. Then, DMF is used to generate the predicted graph. CL is used to obtain the loss between the reconstructed and predicted graphs, which is used to update and optimize the entire framework.
Table 5 reports the results of GDCLNcDA and its variant methods. GDCLNcDA attains the best performance among all methods. Comparing GDCLNcDA\(\_\)GCNII with GDCLNcDA\(\_\)GATGCNII, the latter uses an attention mechanism within each similarity network; this result demonstrates that enhancing the features within each similarity network is useful for the identification task. Comparing GDCLNcDA\(\_\)GCNII, GDCLNcDA\(\_\)DMF and GDCLNcDA\(\_\)GCNII+DMF, the last one combines GDCLNcDA\(\_\)GCNII and GDCLNcDA\(\_\)DMF; this result demonstrates that association reconstruction can reduce some real false negatives in the original associations. Comparing GDCLNcDA\(\_\)GCNII+DMF with GDCLNcDA\(\_\)GCNII+DMF+CL, the latter adds the contrastive loss to the framework; this result demonstrates that contrastive learning between GCNII and DMF is conducive to improving the generalization and performance of the framework. GDCLNcDA achieves the best performance among these variants, which illustrates that each component within GDCLNcDA is essential.
Performance of GDCLNcDA on different heterogeneous networks
To show the benefit of using the information of larger MHNs, we perform another ablation experiment by applying the GDCLNcDA framework to different MHNs. All numerical experiments are carried out under the same number of iterations and 5CV. Table 6 lists all results for the associations between miRNAs, circRNAs, lncRNAs, and their corresponding diseases and genes. These results show whether the integration of diverse interaction information is beneficial for ncRNA-disease association identification: GDCLNcDA achieves outstanding performance on larger MHNs and becomes more powerful as multiple kinds of interaction information are added.
Parameter analysis of GDCLNcDA
In this section, we analyze several parameters within the GDCLNcDA framework to demonstrate their impact. This experiment is conducted under 5CV. In the following, only one parameter is varied at a time while the others are fixed.
GCNII layer
We utilize GCNII to obtain multi-source embeddings for ncRNAs and diseases. The number of GCNII layers \(l\) is selected from \(\{4, 5, 6, 7\}\). As shown in Fig. 3(a), the performance of GDCLNcDA is only slightly influenced when the number of GCNII layers changes. When the number of layers is 5, we obtain the optimal performance. In the network topology of biological entities, the biological significance is greatly reduced if the distance between two biological entities is too large.
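Each GCNII layer combines graph propagation with an initial residual connection to the first representation \(H^{(0)}\) and an identity mapping on the weights [30]. A minimal numpy sketch of one layer, with the hyperparameters \(\alpha\) and \(\lambda\) assumed at commonly used values, is:

```python
import numpy as np

def gcnii_layer(A_hat, H, H0, W, l, alpha=0.1, lam=0.5):
    """One GCNII propagation step: (1-alpha) * A_hat @ H mixes neighbor
    information, alpha * H0 is the initial residual, and the weight is
    interpolated with the identity using beta_l = log(lam / l + 1).
    A_hat is the normalized adjacency matrix; l is the layer index.
    """
    beta = np.log(lam / l + 1.0)
    support = (1.0 - alpha) * (A_hat @ H) + alpha * H0
    out = support @ ((1.0 - beta) * np.eye(W.shape[0]) + beta * W)
    return np.maximum(out, 0.0)   # ReLU activation

rng = np.random.default_rng(4)
A_hat = np.eye(6)                         # trivial adjacency, shape check only
H0 = rng.standard_normal((6, 4))          # initial node embeddings
W = rng.standard_normal((4, 4))
H = gcnii_layer(A_hat, H0, H0, W, l=1)
```

The initial residual and identity mapping are what let GCNII stack 5 or more layers without the over-smoothing that limits plain GCNs, consistent with the observation that 5 layers is optimal here.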
Dimensionality of CNN features
The CNN dimensionality determines the size of the final embeddings of ncRNAs and diseases. After generating these embeddings, the framework performs the subsequent association graph reconstruction task. The CNN dimensionality is selected from \(\{48, 64, 96, 128\}\); as shown in Fig. 3(b), the performance of GDCLNcDA changes only slightly under different dimensionalities. When the CNN dimensionality is 96, we obtain the optimal performance.
DMF layer
The number of DMF layers directly affects the result of the identification task. The number of DMF layers is selected from \(\{1, 2, 3\}\); when it is 2, we obtain the optimal performance, as shown in Fig. 3(c).
Dimensionality of DMF feature
We use DMF to extract the features of ncRNA-disease associations via latent features in a common low-dimensional space. Therefore, the DMF dimensionality of the latent features is crucial for the generated predicted graph. The DMF dimensionality is selected from \(\{32, 48, 64, 96\}\); when it is 96, we obtain the optimal performance, as shown in Fig. 3(d).
Learning rate
As the learning rate controls the step size of gradient descent, it is a significant hyperparameter in deep learning. The step size is one of the factors that determine whether the algorithm can reach the optimal solution. A bad learning rate can lead to a number of problems: for example, the model may be unstable and unable to converge, fall easily into local optima, or converge slowly. The learning rate is selected from \(\{0.1, 0.01, 0.001, 0.0001\}\); when it is 0.001, we obtain the optimal performance, as shown in Fig. 3(e).
Case studies
We illustrate the ability of GDCLNcDA with case studies of ncRNA-disease association identification. The case studies assess GDCLNcDA on two specific diseases each for miRNA, circRNA, and lncRNA. More explicitly, we choose diverse cancers: lung neoplasms and brain cancer for miRNA, cervical cancer and breast cancer for circRNA, and ovarian cancer and kidney cancer for lncRNA. In this work, we rank the predicted scores of the unknown associations from those MHNs.
Table 7 displays the top-10 candidate miRNAs; the predicted associations are further verified against ① dbDEMC [64], ② HMDD v3.2 [65], and ③ MNDR2.0 [66]. The HMDD v3.2 database is the updated version of the HMDD v2.0 database [31], from which we download the positive set for our miRNA-disease MHN. More specifically, the top-10 candidate miRNAs we identified, which did not appear in HMDD v2.0 but are validated in HMDD v3.2, further illustrate the effectiveness of our GDCLNcDA framework.
Table 8 displays the top-10 candidate circRNAs; the predicted associations are further verified using ④ circMine [67] and ⑤ Lnc2Cancer3.0 [68]. Table 9 displays the top-10 candidate lncRNAs; the predicted associations are further verified using ⑥ LncRNADisease v2.0 [69], ③ MNDR2.0, and ⑤ Lnc2Cancer3.0.
Conclusion
The central dogma of molecular biology describes how genetic information is transmitted through RNA to the corresponding protein. As ncRNAs are not translated into proteins, they were long treated as transcriptional noise. With the development of biotechnology, ncRNAs have attracted wide attention. In the past few years, increasing experimental evidence has demonstrated that ncRNAs are closely related to the development of diverse human diseases. However, the relationships verified by wet experiments are insufficient to further explore the pathogenic mechanisms of disease at the molecular level. Therefore, it is essential to develop computational methods for studying ncRNA-disease associations.
In this work, we develop a novel end-to-end framework called GDCLNcDA, which achieves excellent performance on three MHNs covering three varieties of ncRNA (miRNA, circRNA, and lncRNA). Different from previous works, we construct multiple MHNs of the three varieties of ncRNA, diseases, and genes, and use deep graph learning and multiple attention mechanisms to reconstruct the associations between ncRNAs and diseases, on which DMF generates the predicted associations. Furthermore, we add contrastive learning between the reconstructed and predicted associations to improve the generalization of our framework. The feasibility and availability of GDCLNcDA are also proved by our experiments.
GDCLNcDA can not only efficiently make use of the restricted verified associations to predict latent relations, but also fuse the multi-source information of MHNs to reliably weaken the false negatives in the ncRNA-disease associations. The experimental results show that GDCLNcDA outperforms the state-of-the-art methods we compared under 5CV and 10CV. Additionally, diverse ablation experiments provide evidence of the utility of the different modules within GDCLNcDA and the efficacy of the MHN construction. Finally, we conduct case studies to further demonstrate the potential of GDCLNcDA in identifying underlying candidate disease-related ncRNAs.
Availability of data and materials
For miRNA-disease, the positive set of miRNA-disease associations is downloaded from the HMDD v2.0 database [31]: http://cmbi.bjmu.edu.cn/hmdd. The miRNA-gene associations are downloaded from the miRWalk2.0 database [32]: http://mirwalk.umm.uni-heidelberg.de/. The disease-gene associations are downloaded from DisGeNET [33]: https://www.disgenet.org/.
For circRNA-disease, we download the positive circRNA-disease associations from the CircR2Disease database [35]: http://bioinfo.snnu.edu.cn/CircR2Disease/, the circRNA-gene associations from http://cssb2.biology.gatech.edu/knowgene/search.html, and the disease-gene associations from http://cssb2.biology.gatech.edu/knowgene/.
For lncRNA-disease, we obtain the lncRNA-disease positive linkages from the LncRNADisease database [36]: https://www.cuilab.cn/lncrnadisease, the lncRNA-gene linkages from the lncReg database [37]: https://www.lncrnablog.com/tag/lncreg/, and the disease-gene linkages from the DisGeNET database.
All disease semantic similarities are downloaded from MeSH [34]: http://www.nlm.nih.gov.
The code of GDCLNcDA is provided on GitHub (https://github.com/AINING96/GCL_NcDA).
References
Yanofsky C. Establishing the triplet nature of the genetic code. Cell. 2007;128(5):815–8.
Mohanty V, GoekmenPolar Y, Badve S, Janga S. Role of lncRNAs in health and disease – size and shape matter. Brief Funct Genom. 2015;14(2):115–29.
Wang H, Tang J, Ding Y, Guo F. Exploring associations of noncoding RNAs in human diseases via threematrix factorization with hypergraphregular terms on center kernel alignment. Brief Bioinform. 2021;22(5):bbaa409.
Mattick J, Makunin I. Noncoding RNA. Hum Mol Genet. 2006;15(suppl_1):R17–R29.
Zheng J, Qian Y, He J, Kang Z, Deng L. Graph Neural Network with SelfSupervised Learning for Noncoding RNADrug Resistance Association Prediction. J Chem Inf Model. 2022;62(15):3676–84.
Diederichs S. Noncoding RNA and disease. RNA Biol. 2012;9(6):701–2.
Pan J, Tang Y, Yu J, Zhang H, Zhang J, Wang C, et al. miR146a attenuates apoptosis and modulates autophagy by targeting TAF9b/P53 pathway in doxorubicininduced cardiotoxicity. Cell Death Dis. 2019;10(9):1–15.
Zhao L, Qi Y, Xu L, Tao X, Han X, Yin L, et al. MicroRNA1405p aggravates doxorubicininduced cardiotoxicity by promoting myocardial oxidative stress via targeting Nrf2 and Sirt2. Redox Biol. 2018;15:284–96.
Chen X, Yin J, Qu J, Huang L. MDHGI: matrix decomposition and heterogeneous graph inference for miRNAdisease association prediction. PLoS Comput Biol. 2018;14(8):1006418.
Peng J, Hui W, Li Q, Chen B, Hao J, Jiang Q, et al. A learningbased framework for miRNAdisease association identification using neural networks. Bioinformatics. 2019;35(21):4364–71.
Jiang L, Ding Y, Tang J, Guo F. MDASKF: similarity kernel fusion for accurately discovering miRNAdisease association. Front Genet. 2018;9:618.
Li G, Fang T, Zhang Y, Liang C, Xiao Q, Luo J. Predicting miRNAdisease associations based on graph attention network with multisource information. BMC Bioinformatics. 2022;23(1):244.
Lan W, Dong Y, Chen Q, Zheng R, Liu J, Pan Y, et al. KGANCDA: predicting circRNAdisease associations based on knowledge graph attention network. Brief Bioinform. 2022;23(1):bbab494.
Chen B, Huang S. Circular RNA: an emerging noncoding RNA as a regulator and biomarker in cancer. Cancer Lett. 2018;418:41–50.
Ye Y, Zhang L, Hu T, Yin J, Xu L, Pang Z, et al. CircRNA_103765 acts as a proinflammatory factor via sponging miR30 family in Crohn’s disease. Sci Rep. 2021;11(1):1–14.
Lei X, Fang Z, Chen L, Wu F. PWCDA: path weighted method for predicting circRNAdisease associations. Int J Mol Sci. 2018;19(11):3410.
Wei H, Liu B. iCircDAMF: identification of circRNAdisease associations based on matrix factorization. Brief Bioinform. 2020;21(4):1356–67.
Wang L, Wong L, Li Z, Huang Y, Su X, Zhao B, et al. A machine learning framework based on multisource feature fusion for circRNAdisease association prediction. Brief Bioinform. 2022;23(5):bbac388.
Li G, Lin Y, Luo J, Xiao Q, Liang C. GGAECDA: Predicting circRNAdisease associations using graph autoencoder based on graph representation learning. Comput Biol Chem. 2022;99:107722.
Hardin H, Helein H, Meyer K, Robertson S, Zhang R, Zhong W, et al. Thyroid cancer stemlike cell exosomes: regulation of EMT via transfer of lncRNAs. Lab Investig. 2018;98(9):1133–42.
Faghihi M, Modarresi F, Khalil A, Wood D, Sahagan B, Morgan T, et al. Expression of a noncoding RNA is elevated in Alzheimer’s disease and drives rapid feedforward regulation of βsecretase. Nat Med. 2008;14(7):723–30.
Wang Y, Yu G, Wang J, Fu G, Guo M, Domeniconi C. Weighted matrix factorization on multirelational data for LncRNAdisease association prediction. Methods. 2020;173:32–43.
Zhang Y, Ye F, Gao X. MCANET: multifeature coding and attention convolutional neural network for predicting lncRNAdisease association. IEEE/ACM Trans Comput Biol Bioinforma. 2021.
Wu Q, Xia J, Ni J, Zheng C. GAERF: predicting lncRNAdisease associations by graph autoencoder and random forest. Brief Bioinform. 2021;22(5):bbaa391.
Zhao X, Zhao X, Yin M. Heterogeneous graph attention network based on metapaths for lncRNA–disease association prediction. Brief Bioinform. 2022;23(1):bbab407.
Tang X, Luo J, Shen C, Lai Z. Multiview multichannel attention graph convolutional network for miRNA–disease association prediction. Brief Bioinform. 2021;22(6):bbab174.
Xie G, Chen H, Sun Y, Gu G, Lin Z, Wang W, et al. Predicting circRNADisease Associations Based on Deep Matrix Factorization with Multisource Fusion. Interdisc Sci Comput Life Sci. 2021;13(4):582–94.
Ai N, Liang Y, Yuan H, OuYang D, Liu X, Xie S, et al. MHDMF: Prediction of miRNAdisease associations based on Deep Matrix Factorization with Multisource Graph Convolutional Network. Comput Biol Med. 2022;149:106069.
Ata SK, Fang Y, Wu M, Shi J, Kwoh CK, Li X. Multiview collaborative network embedding. ACM Trans Knowl Discov Data (TKDD). 2021;15(3):1–18.
Chen M, Wei Z, Huang Z, Ding B, Li Y. Simple and deep graph convolutional networks. PMLR; 2020. p. 1725–1735.
Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2014;42(D1):D1070–4.
Dweep H, Gretz N. miRWalk2.0: a comprehensive atlas of microRNA-target interactions. Nat Methods. 2015;12(8):697.
Piñero J, Bravo À, QueraltRosinach N, GutiérrezSacristán A, DeuPons J, Centeno E, et al. DisGeNET: a comprehensive platform integrating information on human diseaseassociated genes and variants. Nucleic Acids Res. 2016:gkw943.
Lipscomb C. Medical subject headings (MeSH). Bull Med Libr Assoc. 2000;88(3):265.
Fan C, Lei X, Fang Z, Jiang Q, Wu F. CircR2Disease: a manually curated database for experimentally supported circular RNAs associated with various diseases. Database. 2018;2018.
Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, et al. LncRNADisease: a database for longnoncoding RNAassociated diseases. Nucleic Acids Res. 2012;41(D1):D983–6.
Zhou Z, Shen Y, Khan M, Li A. LncReg: a reference resource for lncRNAassociated regulatory networks. Database. 2015;2015.
Charikar M. Similarity estimation techniques from rounding algorithms. 2002. p. 380–388.
Wang J, Du Z, Payattakool R, Yu P, Chen C. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81.
Wang L, You ZH, Huang YA, Huang DS, Chan KC. An efficient approach based on multisources information to predict circRNAdisease associations using deep convolutional neural network. Bioinformatics. 2020;36(13):4038–46.
Pasquier C, Gardès J. Prediction of miRNAdisease associations with a vector space model. Sci Rep. 2016;6(1):1–10.
Cock P, Antao T, Chang J, Chapman B, Cox C, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3.
Dudekula D, Panda A, Grammatikakis I, De S, Abdelmohsen K, Gorospe M. CircInteractome: a web tool for exploring circular RNAs and their interacting proteins and microRNAs. RNA Biol. 2016;13(1):34–42.
Liu M, Wang Q, Shen J, Yang B, Ding X. Circbank: a comprehensive database for circRNA with standard nomenclature. RNA Biol. 2019;16(7):899–905.
Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y. Graph attention networks. arXiv preprint arXiv:1710.10903. 2017.
Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics. 2018;34(13):i457–66.
Kipf T, Welling M. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. 2016.
Wang X, Wang R, Shi C, Song G, Li Q. Multicomponent graph convolutional collaborative filtering. In: Proceedings of the AAAI conference on artificial intelligence, vol. 34. 2020. p. 6267–6274.
Luo X, Zhou M, Xia Y, Zhu Q. An efficient nonnegative matrixfactorizationbased approach to collaborative filtering for recommender systems. IEEE Trans Ind Inform. 2014;10(2):1273–84.
Zhong Y, Xuan P, Wang X, Zhang T, Li J, Liu Y, et al. A non-negative matrix factorization based method for predicting disease-associated miRNAs in miRNA-disease bilayer network. Bioinformatics. 2018;34(2):267–77.
Fu G, Wang J, Domeniconi C, Yu G. Matrix factorization-based data fusion for the prediction of lncRNA-disease associations. Bioinformatics. 2018;34(9):1529–37.
Li L, Gao Z, Wang Y, Zhang M, Ni J, Zheng C, et al. SCMFMDA: Predicting microRNA-disease associations based on similarity constrained matrix factorization. PLoS Comput Biol. 2021;17(7):e1009165.
Veličković P, Fedus W, Hamilton W, Liò P, Bengio Y, Hjelm D. Deep Graph Infomax. In: International Conference on Learning Representations (ICLR). 2019.
Xia J, Wu L, Chen J, Hu B, Li S. SimGRACE: A simple framework for graph contrastive learning without data augmentation. In: Proceedings of the ACM Web Conference (WWW). 2022. p. 1070–9.
Zhu Y, Xu Y, Yu F, Liu Q, Wu S, Wang L. Graph contrastive learning with adaptive augmentation. In: Proceedings of the Web Conference (WWW). 2021. p. 2069–80.
Xia J, Wu L, Wang G, Chen J, Li S. ProGCL: Rethinking hard negative mining in graph contrastive learning. In: International Conference on Machine Learning (ICML). PMLR; 2022. p. 24332–46.
Sun F, Hoffmann J, Verma V, Tang J. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000. 2019.
Peng Z, Huang W, Luo M, Zheng Q, Rong Y, Xu T, et al. Graph representation learning via graphical mutual information maximization. In: Proceedings of The Web Conference (WWW). 2020. p. 259–70.
You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y. Graph contrastive learning with augmentations. Adv Neural Inf Process Syst. 2020;33:5812–23.
You Y, Chen T, Shen Y, Wang Z. Graph contrastive learning automated. In: International Conference on Machine Learning (ICML). PMLR; 2021. p. 12121–32.
You Y, Chen T, Wang Z, Shen Y. Bringing your own view: Graph contrastive learning without prefabricated data augmentations. In: Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM). 2022. p. 1300–9.
Li J, Zhang S, Liu T, Ning C, Zhang Z, Zhou W. Neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction. Bioinformatics. 2020;36(8):2538–46.
Lu C, Zeng M, Zhang F, Wu F, Li M, Wang J. Deep matrix factorization improves prediction of human circRNA-disease associations. IEEE J Biomed Health Inform. 2020;25(3):891–9.
Yang Z, Wu L, Wang A, Tang W, Zhao Y, Zhao H, et al. dbDEMC 2.0: updated database of differentially expressed miRNAs in human cancers. Nucleic Acids Res. 2017;45(D1):D812–8.
Huang Z, Shi J, Gao Y, Cui C, Zhang S, Li J, et al. HMDD v3.0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 2019;47(D1):D1013–7.
Cui T, Zhang L, Huang Y, Yi Y, Tan P, Zhao Y, et al. MNDR v2.0: an updated resource of ncRNA–disease associations in mammals. Nucleic Acids Res. 2018;46(D1):D371–4.
Zhang W, Liu Y, Min Z, Liang G, Mo J, Ju Z, et al. circMine: a comprehensive database to integrate, analyze and visualize human disease-related circRNA transcriptome. Nucleic Acids Res. 2022;50(D1):D83–92.
Gao Y, Shang S, Guo S, Li X, Zhou H, Liu H, et al. Lnc2Cancer 3.0: an updated resource for experimentally supported lncRNA/circRNA cancer associations and web tools based on RNA-seq and scRNA-seq data. Nucleic Acids Res. 2021;49(D1):D1251–8.
Bao Z, Yang Z, Huang Z, Zhou Y, Cui Q, Dong D. LncRNADisease 2.0: an updated database of long non-coding RNA-associated diseases. Nucleic Acids Res. 2019;47(D1):D1034–7.
Acknowledgements
Not applicable.
Funding
The authors wish to thank the editors and reviewers. This work was supported in part by the major key project of Peng Cheng Laboratory under grant PCL2023AS12, and by Macau Science and Technology Development Fund grant No. 0056/2020/FJ from the Macau Special Administrative Region of the People’s Republic of China.
Author information
Authors and Affiliations
Contributions
NA designed the framework, conducted the experiments, and wrote the manuscript. YL, HLY, and DOY modified the manuscript. All authors revised and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ai, N., Liang, Y., Yuan, H. et al. GDCLNcDA: identifying non-coding RNA-disease associations via contrastive learning between deep graph learning and deep matrix factorization. BMC Genomics 24, 424 (2023). https://doi.org/10.1186/s12864-023-09501-3
DOI: https://doi.org/10.1186/s12864-023-09501-3