Node-adaptive graph Transformer with structural encoding for accurate and robust lncRNA-disease association prediction
BMC Genomics volume 25, Article number: 73 (2024)
Abstract
Background
Long noncoding RNAs (lncRNAs) are integral to a plethora of critical cellular biological processes, including the regulation of gene expression, cell differentiation, and the development of tumors and cancers. Predicting the relationships between lncRNAs and diseases can contribute to a better understanding of the pathogenic mechanisms of disease and provide strong support for the development of advanced treatment methods.
Results
We present NAGTLDA, an innovative Node-Adaptive Graph Transformer model for predicting unknown LncRNA-Disease Associations. First, we use the node-adaptive feature smoothing (NAFS) method to learn the local feature information of nodes, and we encode the structural information of the fused similarity networks of diseases and lncRNAs using Structural Deep Network Embedding (SDNE). Next, a Transformer module captures potential association information between the network nodes. Finally, a Transformer module with two multi-head attention layers learns a global-level fusion of the embeddings. The network structure encoding is added as a structural inductive bias to compensate for the message-passing mechanism that the Transformer lacks. In 5-fold cross-validation, NAGTLDA achieved an average AUC of 0.9531 and an average AUPR of 0.9537, significantly higher than state-of-the-art methods. In case studies on 4 diseases, 55 out of 60 predicted lncRNA-disease associations have been validated in the literature. The results demonstrate the enormous potential of the graph Transformer structure to incorporate graph structural information for uncovering unknown lncRNA-disease associations.
Conclusions
Our proposed NAGTLDA model can serve as a highly efficient computational method for predicting biological information associations.
Background
According to a large number of cell biology experiments, lncRNAs are RNA molecules that are not involved in protein coding and are longer than approximately 200 nucleotides [1,2,3,4]. Initially, most researchers regarded lncRNAs as an unimportant by-product of transcription. However, as experimental results have accumulated, researchers have found that lncRNAs play important roles in many cellular processes. They are involved in regulating the cell cycle, regulating embryonic development, controlling gene expression in space and time, and determining cell fate [5]. Moreover, clinical studies of human diseases have shown that lncRNAs are closely linked to many human cancers [6, 7] and have a decisive role in human cardiovascular physiology and pathology [8]. Researchers therefore regard lncRNAs as a crucial factor in the study of human diseases and have taken up the relationships between diseases and lncRNAs as a new research direction. Exploring these relationships deepens our understanding of disease mechanisms [9] and helps trace the causes of diseases to their genetic roots. Understanding the interactions between lncRNAs and diseases also allows us to intervene in and regulate the expression of disease-related genes and to find new targets and strategies [10] for treatment. Because the expression levels of some lncRNAs are particularly prominent in certain diseases, lncRNAs can serve as potential biomarkers and play an important role in early detection and treatment. In drug discovery, exploring the relationships between diseases and lncRNAs can help identify new and more effective drugs.
In addition, human genetic diseases [11] are closely associated with lncRNAs. Investigating lncRNAs helps elucidate certain genetic diseases that stem from gene mutations, thereby expediting researchers' investigations into genetic disorders. However, studying these links in clinical experiments requires considerable time and material resources and is difficult to apply on a large scale. Therefore, designing a novel computational model to compute the associations between diseases and lncRNAs is of great importance for advancing bioinformatics. Two challenges arise in practice: (1) Large datasets contain a low percentage of positive samples, and the resulting sparsity reduces the model's ability to predict positive samples effectively. (2) The available disease and lncRNA association data are limited and lack a cohesive fusion of biological association data, and similarity calculations rely heavily on association matrices.
Many methods for calculating lncRNA-disease associations have been developed, and their accuracy and reliability have been verified by biological experiments. To support better computational methods, researchers have collected large quantities of data to create benchmark databases. Gene Reference Into Function (GRIF) [12], DisGeNET [13], and Disease Ontology (DO) [14] are three standard databases related to diseases. RNADisease v4.0 [15], Lnc2Cancer [16] and LncRNADisease [17] are three standard databases related to lncRNA-disease associations. These databases also make it possible to move beyond the earlier view that one lncRNA corresponds to one disease, and to run global computations and experiments with a proposed method on the benchmark datasets they provide.
Numerous computational techniques for exploring disease-lncRNA interactions have emerged with the continual advancement of technology. The available computational methods can be classified into bioinformatics network-based methods [18] and deep learning-based methods [19].
Bioinformatics network-based models combine known associations and the corresponding similarities into heterogeneous networks and apply message-passing mechanisms and random walks on the constructed network to compute potential associations. For example, the KRWRH model [20] integrated disease similarities, lncRNA similarities, and known associations into a heterogeneous network and applied random walk with restart to compute associations between lncRNAs and diseases. The RWRHLD model [21] combined three sources into a heterogeneous network: observed lncRNA-disease relationships, a lncRNA-lncRNA crosstalk network, and a disease similarity network. Links between diseases and lncRNAs are then inferred by random walk with restart. The IRWRLDA model [22] improves upon traditional random walks by considering both lncRNA similarity and disease similarity when setting the initial probabilities, and it can infer new associations even when a disease has no known association with any lncRNA. The SIMCLDA model [23] applied matrix completion and principal component analysis to infer potential associations. The NCPLDA model [24] used network consistency projection to compute new associations between lncRNAs and diseases. The GrwLDA model [25] generated a global network by combining known lncRNA-disease interaction information, disease fusion similarity, and lncRNA fusion similarity, and used this network to explore novel associations between diseases and lncRNAs. The LRWRHLDA model [26] integrated multiple heterogeneous and homogeneous networks into a three-layer bioinformatics network and used RWR to mine interactions.
The LRWHLDA model [27] mines the relationships between diseases and lncRNAs with a localized random walk that takes full advantage of the network topology. The LncRDNetFlow model [28] integrated three interaction networks (a disease interaction network, a lncRNA interaction network, and a protein interaction network) into a three-layer heterogeneous network to obtain disease and lncRNA feature data. Nevertheless, none of these methods can comprehensively learn and fuse local and global information, nor can they learn deeper network features.
Deep learning-based lncRNA-disease association prediction models have shown significant performance improvements over earlier shallow models. The CNNLDA model [29] reorganized multiple sources of similarity and introduced miRNA datasets so that the neural network could learn more information. It used convolutional neural networks to learn node embeddings and infer the associations between diseases and lncRNAs. The BiGAN model [30] combined the similarities of lncRNAs and diseases and adopted a bidirectional generative adversarial network to infer their associations. The MCANet model [31] used embedding learning over multiple feature sources, ensuring that each node has a unique vector representation, and used attention-based convolutional neural networks to mine direct interactions between lncRNAs and diseases. The ACLDA model [32] constructed a network based on meta-paths over lncRNAs, miRNAs, and diseases and combined CNNs and autoencoders for association prediction. The VADLP model [33] constructed multi-layer graphs to integrate multiple similarities and employed variance autoencoders and CNNs for lncRNA-disease interaction inference. The gGATLDA model [34] applied attention mechanisms at the graph level: during graph construction, each disease-lncRNA pair is extracted to form a subgraph for computing the lncRNA-disease relationship. The MLMKDNN model [35] proposed a deep multi-kernel learning method comprising feature matrix construction, kernel space mapping, and deep neural network fusion. The kernel space mapping transforms the feature matrix so that deep neural networks can integrate it effectively. The MLGCNET model [36] employed a multi-layer graph autoencoder to obtain representation vectors of diseases and lncRNAs.
The MGATE model [37] applied a multi-channel self-attentive encoder to learn latent embeddings of diseases and lncRNAs from multiple angles of the graph. The GANLDA model [38] incorporated multi-source data as initial features, adopted GAT to aggregate feature information from nodes and their neighbors, and finally used a multi-layer perceptron to screen the associations. However, when graph neural networks are made deep, node learning tends to over-smooth, leaving minimal differences between the vector representations of nodes.
A new trend is to combine Transformers and graph neural networks to process graph data. This approach combines the parallelizability of Transformers and the advantages of their multi-head attention mechanism with graph neural network methods to design new neural network models for graph data. Microsoft introduced Graphormer [39], which, for the first time, used Transformers for graph-level tasks. It integrated centrality encoding, spatial encoding, and edge encoding into the Transformer, successfully incorporating graph structural information, and improved performance on widely used benchmark datasets for graph representation learning. Following this trend, the GraphGPS framework [40] combined graph neural networks and Transformers. It used an MLP to learn graph information and fed it into both the graph neural network and the Transformer for graph representation learning. Fusing the results of the two models leads to highly competitive outcomes.
Although these methods have achieved relatively good results in lncRNA-disease association prediction, they still have the following limitations: (1) Graph-based methods do not maintain good performance and robustness on large sparse datasets, and over-smoothing of node features can occur [41]. Their learning ability is limited when confronted with complex heterogeneous graphs comprising different nodes and edges [42, 43]. (2) Traditional deep learning-based and bioinformatics network-based approaches do not capture both local and global information, and do not learn node features by fusing the information encoded in the graph structure. (3) Existing methods also fuse features with a simple linear combination [23, 24, 26, 38]. Incorporating an adaptive and efficient fusion approach has the potential to significantly improve model performance and robustness.
Based on these limitations of existing methods and the inherent advantages of the Transformer model, we propose an innovative lncRNA-disease association prediction model named NAGTLDA. First, we construct a heterogeneous network from observed associations and compute the integrated similarity of diseases and lncRNAs to create their respective integrated similarity networks. Next, we employ node-adaptive feature smoothing (NAFS) [44] to perform local-level node embedding on the heterogeneous network and the integrated similarity networks. Simultaneously, we use Structural Deep Network Embedding (SDNE) [45] to encode the structural information of the integrated similarity networks. Furthermore, we use the Transformer model for global-level embedding learning, leveraging its inherent global perspective to unearth potential association information. Finally, we employ the Transformer model to perform a global-level fusion of all learned embeddings and incorporate the structural inductive bias of the network. This fusion approach makes fuller use of all captured information, thereby improving the inference of associations between diseases and lncRNAs. Our proposed model outperforms existing models in terms of performance and scalability.
In summary, our research makes the following key contributions:

We employ the NAFS method for feature embedding learning without the need for explicit training, and we utilize SDNE to encode the network structure.

We employ both local-level and global-level approaches for feature embedding, enabling the model to effectively uncover potential association information.

To improve the Transformer model for learning graph node information, we learn the network's structural information as an inductive bias.

We propose a Transformer fusion mechanism, which introduces the Transformer model for node embedding and fusion of multiple features and topology information, enriching the representation of lncRNAs and diseases.
Methods
Known human lncRNA-disease associations
In our experiment, we used a benchmark dataset to assess the effectiveness of our model. This dataset was obtained from previous research by Fu et al. [46] on lncRNA-disease association prediction and includes 240 lncRNAs, 412 diseases, and 2697 experimentally validated lncRNA-disease interactions from the Lnc2Cancer [16], LncRNADisease [17], and GeneRIF [47] databases. We denote the numbers of lncRNAs and diseases as \({N}_{l}\) and \({N}_{d}\), respectively. We constructed an adjacency matrix \(A\in {R}^{{N}_{l}\times {N}_{d}}\) based on the observed interactions between lncRNAs and diseases, where \(A(l(i), d(j))=1\) if there exists an identified relationship between lncRNA \(l(i)\) and disease \(d(j)\); otherwise \(A(l(i), d(j))=0\).
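As a minimal sketch of how the adjacency matrix \(A\) could be assembled from a list of known pairs, consider the following Python snippet. The function name `build_association_matrix` and the toy identifiers are illustrative assumptions and are not taken from the benchmark dataset.

```python
def build_association_matrix(lncrnas, diseases, known_pairs):
    """Return A as a list of rows; A[i][j] = 1 if (lncrnas[i], diseases[j]) is a known pair."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    A = [[0] * len(diseases) for _ in lncrnas]
    for lnc, dis in known_pairs:
        A[row[lnc]][col[dis]] = 1
    return A


# Toy identifiers (hypothetical): 3 lncRNAs, 2 diseases, 3 known associations.
lncrnas = ["l1", "l2", "l3"]
diseases = ["d1", "d2"]
pairs = [("l1", "d1"), ("l2", "d2"), ("l3", "d1")]
A = build_association_matrix(lncrnas, diseases, pairs)
```

Rows index lncRNAs and columns index diseases, matching the \({N}_{l}\times {N}_{d}\) shape described above.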
LncRNA functional similarity
There are multiple methods for expressing the similarity between lncRNAs, and one common method is based on their associated diseases. By comparing the diseases associated with different lncRNAs, their functional similarity can be assessed. In this experiment, we adopted the lncRNA functional similarity calculation method proposed by Chen et al. [48]. Suppose lncRNA \({l}_{1}\) is linked to disease category \(D(i)=\left\{{d}_{i1}, {d}_{i2}, {d}_{i3},\cdots , {d}_{in}\right\}\), and lncRNA \({l}_{2}\) is linked to disease category \(D(j)=\left\{{d}_{j1}, {d}_{j2}, {d}_{j3},\cdots , {d}_{jm}\right\}\). The similarity score between disease \({d}_{k}\in D(i)\) and disease category \(D(j)\) is calculated as:
\[S({d}_{k}, D(j))=\max_{d\in D(j)}DS({d}_{k}, d) \tag{1}\]
where \(DS({d}_{k}, d)\) represents the semantic similarity between diseases \({d}_{k}\) and d. Based on the semantic similarity between the diseases and the associations between the lncRNAs and the disease categories, the functional similarity of lncRNAs is calculated from the disease-to-category scores \(S(d, D(\cdot ))\) as follows:
\[FS({l}_{1}, {l}_{2})=\frac{\sum_{d\in D(i)}S(d, D(j))+\sum_{d\in D(j)}S(d, D(i))}{n+m} \tag{2}\]
where n and m denote the numbers of diseases in disease categories \(D(i)\) and \(D(j)\), that is, \(|D(i)|=n\) and \(|D(j)|=m\), respectively.
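The calculation can be sketched in Python as follows, assuming the best-match average form of this similarity: each disease is scored by its most similar counterpart in the other category, and the scores are averaged over n + m. The nested dictionary `DS` of semantic similarities and the toy disease names are hypothetical.

```python
def disease_to_set_similarity(d_k, disease_set, DS):
    """Score of disease d_k against a disease category: its best semantic match."""
    return max(DS[d_k][d] for d in disease_set)


def functional_similarity(D_i, D_j, DS):
    """Functional similarity of two lncRNAs from their associated disease categories."""
    if not D_i or not D_j:
        return 0.0  # assumption: no associated diseases means no evidence of similarity
    total = sum(disease_to_set_similarity(d, D_j, DS) for d in D_i)
    total += sum(disease_to_set_similarity(d, D_i, DS) for d in D_j)
    return total / (len(D_i) + len(D_j))


# Toy symmetric semantic similarity table (hypothetical values).
DS = {
    "a": {"a": 1.0, "b": 0.5, "c": 0.2},
    "b": {"a": 0.5, "b": 1.0, "c": 0.4},
    "c": {"a": 0.2, "b": 0.4, "c": 1.0},
}
fs = functional_similarity(["a", "b"], ["b", "c"], DS)
```

In this toy case the four best-match scores are 0.5, 1.0, 1.0, and 0.4, so the similarity is their sum divided by 4.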
Disease semantic similarity
To compute the semantic similarity between diseases, their Medical Subject Headings (MeSH) descriptors can be used [49], and each disease can be represented as a Directed Acyclic Graph (DAG) [50]. Specifically, the hierarchical relationship of a disease can be represented as \(DAG({d}_{i})=(T({d}_{i}), E({d}_{i}))\), where \(T({d}_{i})\) represents \({d}_{i}\) and all its ancestor nodes, and \(E({d}_{i})\) is the set of edges from ancestor nodes to descendant nodes. Computing disease semantic similarity can be divided into three stages. In the first stage, for any disease \({d}_{j}\) in \(DAG({d}_{i})\), its contribution \({D}_{{d}_{i}}({d}_{j})\) to the semantic value of disease \({d}_{i}\) is computed using the following formula:
\[{D}_{{d}_{i}}({d}_{j})=\begin{cases}1, & {d}_{j}={d}_{i}\\ \max\left\{\gamma \cdot {D}_{{d}_{i}}({d}^{\prime }) \mid {d}^{\prime }\in \text{children of } {d}_{j}\right\}, & {d}_{j}\ne {d}_{i}\end{cases} \tag{3}\]
where \(\gamma\) is a hyperparameter of the semantic contribution formula, set to 0.5. The second stage computes the total semantic value \({DV}_{{d}_{i}}\) of the disease by summing the contributions \({D}_{{d}_{i}}(d)\) of all diseases d in \(T({d}_{i})\):
\[{DV}_{{d}_{i}}=\sum_{d\in T({d}_{i})}{D}_{{d}_{i}}(d) \tag{4}\]
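The third stage computes the semantic similarity between diseases \({d}_{i}\) and \({d}_{j}\) using the following formula, in which \({D}_{{d}_{i}}(t)\) denotes the semantic contribution of disease t to \({d}_{i}\) and \({DV}_{{d}_{i}}\) the total semantic value of \({d}_{i}\):
\[DS({d}_{i}, {d}_{j})=\frac{\sum_{t\in T({d}_{i})\cap T({d}_{j})}\left({D}_{{d}_{i}}(t)+{D}_{{d}_{j}}(t)\right)}{{DV}_{{d}_{i}}+{DV}_{{d}_{j}}} \tag{5}\]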
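The three stages can be sketched in Python as follows, assuming the usual DAG-based scheme in which an ancestor's contribution is the largest \(\gamma\)-discounted contribution among its children. The `parents` dictionary (child to list of parents) and the toy term names are hypothetical and stand in for the MeSH hierarchy.

```python
def semantic_contributions(disease, parents, gamma=0.5):
    """Stage 1: contribution of every term in T(disease) to the semantics of `disease`."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = gamma * contrib[node]
            # Keep the largest discounted contribution reaching each ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d_i, d_j, parents, gamma=0.5):
    """Stages 2 and 3: total semantic values and shared-ancestor similarity."""
    c_i = semantic_contributions(d_i, parents, gamma)
    c_j = semantic_contributions(d_j, parents, gamma)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


# Toy hierarchy: x and y share the parent p, whose parent is root.
parents = {"x": ["p"], "y": ["p"], "p": ["root"]}
sim_xy = semantic_similarity("x", "y", parents)
```

For the toy hierarchy, each of x and y has contributions 1, 0.5, and 0.25 (total 1.75), and the shared terms p and root contribute 1.5 to the numerator, so the similarity is 1.5 / 3.5.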
Gaussian interaction profile (GIP) kernel similarity for lncRNAs and diseases
Gaussian kernel similarity is a common similarity measure that maps data into a high-dimensional space and computes the similarity between data points. The calculated lncRNA functional similarity and disease semantic similarity are both relatively sparse, so other similarities are needed to compensate for this deficiency. We therefore introduce GIP kernel similarity, which makes the similarity between nodes more pronounced and facilitates the prediction of associations between nodes. The GIP kernel similarity \(LK({l}_{i}, {l}_{j})\) between lncRNAs \({l}_{i}\) and \({l}_{j}\) and the GIP kernel similarity \(DK({d}_{i}, {d}_{j})\) between diseases \({d}_{i}\) and \({d}_{j}\) are calculated as follows:
\[LK({l}_{i}, {l}_{j})=\exp\left(-{r}_{l}{\Vert IP({l}_{i})-IP({l}_{j})\Vert }^{2}\right) \tag{6}\]
\[DK({d}_{i}, {d}_{j})=\exp\left(-{r}_{d}{\Vert IP({d}_{i})-IP({d}_{j})\Vert }^{2}\right) \tag{7}\]
where, following [51], \(IP({l}_{i})\) and \(IP({l}_{j})\) represent the i-th and j-th rows of the known lncRNA-disease interaction matrix A, and \(IP({d}_{i})\) and \(IP({d}_{j})\) represent the i-th and j-th columns of A. \({r}_{l}\) and \({r}_{d}\) are the kernel bandwidth control parameters and are defined by the following formulas:
\[{r}_{l}=1\Big/\left(\frac{1}{{N}_{l}}\sum_{i=1}^{{N}_{l}}{\Vert IP({l}_{i})\Vert }^{2}\right) \tag{8}\]
\[{r}_{d}=1\Big/\left(\frac{1}{{N}_{d}}\sum_{i=1}^{{N}_{d}}{\Vert IP({d}_{i})\Vert }^{2}\right) \tag{9}\]
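A minimal Python sketch of the GIP kernel follows, assuming the commonly used normalization in which the bandwidth is the reciprocal of the mean squared norm of the interaction profiles. The function name `gip_kernel` and the toy matrix are illustrative assumptions.

```python
import math


def gip_kernel(profiles):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    bandwidth = 1.0 / mean_sq_norm  # assumes at least one non-zero profile
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            K[i][j] = math.exp(-bandwidth * sq_dist)
    return K


# Toy association matrix: rows are lncRNAs, columns are diseases.
A = [[1, 0], [0, 1], [1, 0]]
LK = gip_kernel(A)                               # lncRNA kernel uses the rows of A
DK = gip_kernel([list(col) for col in zip(*A)])  # disease kernel uses the columns of A
```

Identical interaction profiles give a kernel value of 1, which is why the first and third toy lncRNAs are maximally similar here.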
Integrated similarity networks for lncRNAs and diseases
Previously, we introduced GIP kernel similarity to compensate for the sparsity of lncRNA functional similarity and disease semantic similarity. Based on these similarities, we calculate the integrated similarity matrix between diseases and lncRNAs using the following formula:
where \(IL({l}_{i}, { l}_{j} )\) represents the integrated similarity matrix between lncRNAs, and \(ID({d}_{i}, {d}_{j})\) represents the similarity matrix between diseases. To better utilize the integrated similarity matrices of lncRNAs and diseases, we use them to obtain their corresponding integrated similarity networks. We set two thresholds \(\alpha\) and \(\beta\) to calculate the similarity network, and their formulas are expressed as follows:
where \({I}_{net}\) represents the network obtained from the integrated similarity matrix of lncRNAs. If the similarity value between \({l}_{i}\) and \({l}_{j}\) is not less than or equal to threshold \(\alpha\), then \({I}_{net}({l}_{i}, { l}_{j})\) = 1. Otherwise, \({I}_{net}({l}_{i}, { l}_{j})\) = 0. \({D}_{net}\) denotes the network obtained from the integrated similarity matrix of diseases. If the similarity value between \({d}_{i}\) and \({d}_{j}\) is not less than or equal to threshold \(\beta\), then \({D}_{net}({d}_{i}, {d}_{j})\)=1. Otherwise, \({D}_{net}({d}_{i}, {d}_{j})\)=0.
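The thresholding step can be sketched as below. The sketch reads "not less than or equal to" as a strict inequality, which is an assumption about the intended rule; the function name and toy values are hypothetical.

```python
def similarity_to_network(S, threshold):
    """Binarize a similarity matrix: 1 where the similarity exceeds the threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in S]


# Toy integrated similarity matrix and threshold.
IL = [[1.0, 0.3, 0.8], [0.3, 1.0, 0.1], [0.8, 0.1, 1.0]]
I_net = similarity_to_network(IL, 0.3)
```

A value exactly equal to the threshold is mapped to 0 under this reading; switching `>` to `>=` gives the other interpretation.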
LncRNA-disease heterogeneous network
We constructed a lncRNA-disease heterogeneous network that includes the lncRNA similarity matrix, the disease similarity matrix, and the known lncRNA-disease association matrix A:
\[{G}_{net}=\left[\begin{array}{cc}IL & A\\ {A}^{T} & ID\end{array}\right] \tag{14}\]
where \({A}^{T}\) represents the transpose of the lncRNA-disease interaction matrix.
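One way to assemble such a network is sketched below, assuming a two-by-two block layout with the lncRNA similarity in the top-left block, A in the top-right, its transpose in the bottom-left, and the disease similarity in the bottom-right. The layout and the function name are assumptions for illustration.

```python
def heterogeneous_network(IL, ID, A):
    """Stack [[IL, A], [A^T, ID]] into one (N_l + N_d) x (N_l + N_d) matrix."""
    A_T = [list(col) for col in zip(*A)]
    top = [list(IL[i]) + list(A[i]) for i in range(len(IL))]
    bottom = [A_T[j] + list(ID[j]) for j in range(len(ID))]
    return top + bottom


# Toy inputs: 2 lncRNAs and 1 disease.
IL = [[1.0, 0.5], [0.5, 1.0]]
ID = [[1.0]]
A = [[1], [0]]
G = heterogeneous_network(IL, ID, A)
```

With symmetric similarity blocks the result is symmetric, so it can be treated as the adjacency matrix of an undirected graph.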
NAGTLDA
This section provides a detailed introduction to our proposed model, NAGTLDA, which accurately mines lncRNA-disease associations. The NAGTLDA process is shown in Fig. 1, which depicts the workflow and the sequence of steps in the NAGTLDA framework. The framework comprises the following parts: (1) using NAFS to learn local-level node feature embeddings, (2) using SDNE to encode the structure of the networks, (3) using a Transformer model with a multi-head attention layer to learn global-level node feature embeddings, (4) using a Transformer model with two multi-head attention layers to learn the embedding fusion at the global level, and (5) predicting the association score between diseases and lncRNAs.
Local-level node feature embedding (node-adaptive feature smoothing)
In recent years, GCN [52] has become very popular among graph neural networks (GNNs) because it can learn the features of all nodes in a graph from both node features and graph structure. However, using GCN to aggregate multi-order neighbor information in large graphs leads to over-smoothing and incurs high computational cost and memory consumption. To address this issue, Zhang et al. [44] proposed NAFS, which aggregates and updates the features of nodes in a graph. Compared with GCN, NAFS addresses these limitations and requires no additional training, which greatly simplifies model training and avoids vanishing and exploding gradients during backpropagation.
Since our model uses NAFS for node feature embedding on all three graphs (\({I}_{net},{D}_{net } \,and\, {G}_{net}\)), we use \({G}_{net}\) as an example and abbreviate it as G. We denote the number of nodes in G as n and the number of edges as m. NAFS consists of four steps. The first step computes the over-smoothing distance \({D}_{i}(k)\) of node i after k smoothing steps on the feature matrix X:
\[{D}_{i}(k)=Dis\left({[{\widehat{G}}^{k}X]}_{i}, {[{\widehat{G}}^{\infty }X]}_{i}\right) \tag{15}\]
where \({[{\widehat{G}}^{k}X]}_{i}\) represents the i-th row of the matrix, that is, the smoothed representation of the i-th node. \(Dis(\cdot )\) is a distance function, which can be implemented as the Euclidean distance. \(\widehat{G}={\widetilde{D}}^{r-1}\widetilde{G}{\widetilde{D}}^{-r}\), where \(\widetilde{D}\) denotes the degree matrix of the graph, r is a hyperparameter of the model, and \(\widetilde{G}\) represents the adjacency matrix of the undirected graph with self-loops added. \({\widehat{G}}^{\infty }\) is calculated as follows:
\[{\widehat{G}}_{i,j}^{\infty }=\frac{{({d}_{i}+1)}^{r}{({d}_{j}+1)}^{1-r}}{2m+n} \tag{16}\]
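where \({d}_{i}\) represents the degree of node i. The second step computes the smoothing weight \({\varphi }_{i}(k)\) of node i at step k by applying a softmax over the over-smoothing distances \({D}_{i}(k)\):
\[{\varphi }_{i}(k)=\frac{{e}^{{D}_{i}(k)}}{\sum_{l=0}^{K}{e}^{{D}_{i}(l)}} \tag{17}\]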
where K represents the maximum number of smoothing steps. The third step is to calculate the smoothing weight matrix, which is computed as follows:
\[W(k)=Diag(\varphi (k)) \tag{18}\]
where \(\varphi (k)\in {R}^{n}\) and \(Diag(\cdot )\) represents a diagonal matrix. We denote the initial input feature representation as \({X}^{(0)}\). After \(l\) rounds of smoothing, the node feature matrix \({X}^{(l)}=\widehat{G}{X}^{(l-1)}\) builds on the features of the previous round. After the maximum of K rounds of smoothing, \({X}^{(K)}\) aggregates information from the widest neighborhood, and we obtain a collection of feature matrices \(\left\{{X}^{(0)}, \,{X}^{(1)}, \,{X}^{(2)}, \,\cdots , \,{X}^{(K)}\right\}\). Finally, the smoothed feature \(\widehat{X}\) is computed as follows:
\[\widehat{X}=\sum_{k=0}^{K}W(k){X}^{(k)} \tag{19}\]
The definition of \({X}^{(0)}\) is as follows:
GCN uses the normalized adjacency matrix \(\widehat{G}={\widetilde{D}}^{r-1}\widetilde{G}{\widetilde{D}}^{-r}\). Setting r = 0.5 yields the symmetric normalized adjacency matrix \({\widetilde{D}}^{-1/2}\widetilde{G}{\widetilde{D}}^{-1/2}\) [52] as the feature extractor. In NAFS, however, a set of values \(\left\{{r}_{1}, \,{r}_{2}, \,{r}_{3}, \,\cdots , \,{r}_{U}\right\}\) results in a more diverse set of feature embeddings. The value of r controls the normalization weight of each edge, so different r values lead to distinct node feature embeddings for the same graph. We obtain a set of smoothed features \(\left\{\widehat X^{(0)},\widehat X^{(1)},\widehat X^{(2)},\cdots,\widehat X^{(U)}\right\}\) based on this set of different r values, and we combine the different smoothed features into \({\widehat{Z}}_{G}=({\widehat{X}}^{(0)}\otimes {\widehat{X}}^{(1)}\cdots \otimes {\widehat{X}}^{(U)})\in {{\varvec{R}}}^{({N}_{l}+{N}_{d})\times ({N}_{l}+{N}_{d})}\). Here, \(\otimes\) represents a combination method, which can be the max function, concatenation, or the mean function.
First, we input the heterogeneous network \({G_{net}} \in {R^{({N_l} + {N_d}) \times ({N_l} + {N_d})}}\) and the initial features \({X^{(0)}} \in {R^{({N_l} + {N_d}) \times ({N_l} + {N_d})}}\) of the network nodes, which correspond to lncRNA and disease entities. We compute a smoothing weight matrix \(W(k)\) for each step k according to Eq. (18) and then use a list of values \(\left\{{r}_{1}, \,{r}_{2}, \,{r}_{3},\,\cdots,\,{r}_{U}\right\}\). For each value of r in the list, we derive a new node feature embedding of the network structure from Eq. (19), denoted as \({\hat X^{(u)}} \in {R^{({N_l} + {N_d}) \times ({N_l} + {N_d})}}\). The feature embeddings obtained from all values of r are fused to obtain the final feature embedding \({\hat Z_G} \in {R^{({N_l} + {N_d}) \times ({N_l} + {N_d})}}\). The final NAFS is expressed as follows:
where U denotes the length of the list of r values and \(\otimes\) represents the fusion mode of the features (mean).
Similarly, we use NAFS to obtain the node features of the lncRNA-integrated similarity network, \({\hat Z_L} \in {\varvec{R}^{{N_l} \times{N_l}}}\), and of the disease-integrated similarity network, \({\widehat{Z}}_{D}\in {{\varvec{R}}}^{{N}_{d}\times {N}_{d}}\). We apply an affine transformation to the node features in \({\widehat{Z}}_{L}\) so that \({\widehat{Z}}_{L}\) and \({\widehat{Z}}_{D}\) have the same dimension:
where \({W}_{LD}\in {{\varvec{R}}}^{{N}_{d}\times {N}_{l}}\) and \({b}_{LD}\in {{\varvec{R}}}^{{N}_{d}}\) are trainable parameters. We concatenate \({\widehat{Z}}_{L}^\prime\) and \({\widehat{Z}}_{D}\) to form a new node feature matrix \({\widehat{Z}}_{LD}=\left[\begin{array}{c}{\widehat{Z}}_{L}^\prime\\ {\widehat{ Z}}_{D}\end{array}\right]\in {R}^{({N}_{l}+{N}_{d})\times {N}_{d}}\).
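The smoothing procedure can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the reference NAFS implementation: the distance is Euclidean, the weights are a plain softmax over the over-smoothing distances (any temperature scaling is omitted), the input graph is assumed to have no self-loops, and the fusion over r values is the mean. All function names are hypothetical.

```python
import math


def matmul(A, B):
    """Dense matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def nafs_smooth(G, X, r=0.5, K=3):
    """Node-adaptive smoothing of features X on graph G for one value of r."""
    n = len(G)
    # Add self-loops; deg[i] is then d_i + 1 and sum(deg) equals 2m + n.
    G_tilde = [[G[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in G_tilde]
    G_hat = [[deg[i] ** (r - 1) * G_tilde[i][j] * deg[j] ** (-r) for j in range(n)]
             for i in range(n)]
    total = sum(deg)
    G_inf = [[deg[i] ** r * deg[j] ** (1 - r) / total for j in range(n)] for i in range(n)]
    X_inf = matmul(G_inf, X)  # fully over-smoothed features
    steps = [X]               # X^(0), X^(1), ..., X^(K)
    for _ in range(K):
        steps.append(matmul(G_hat, steps[-1]))
    smoothed = []
    for i in range(n):
        # Distance of node i from its over-smoothed state at every step.
        dist = [math.dist(steps[k][i], X_inf[i]) for k in range(K + 1)]
        exps = [math.exp(d) for d in dist]
        z = sum(exps)
        weights = [e / z for e in exps]  # softmax over steps
        smoothed.append([sum(weights[k] * steps[k][i][c] for k in range(K + 1))
                         for c in range(len(X[0]))])
    return smoothed


def nafs(G, X, r_values=(0.3, 0.5), K=3):
    """Fuse the smoothed features obtained for several r values by their mean."""
    feats = [nafs_smooth(G, X, r, K) for r in r_values]
    return [[sum(f[i][c] for f in feats) / len(feats) for c in range(len(X[0]))]
            for i in range(len(X))]


# Toy path graph 0 - 1 - 2 with one-hot initial features.
G = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Z = nafs(G, X)
```

Because the weights are computed per node, a node that is still far from its over-smoothed state at a given step gives that step more weight. No parameters are trained anywhere in the sketch.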
Network structure encoding
We learn the structural encoding of the network as the structural inductive bias and pass it to the downstream Transformer module for processing. Here, we encode the network structure using the SDNE approach of Wang et al. [45] to extract additional information from the network.
In the model, we encode the structure of the networks \({I}_{net}\mathrm{ \,and \,}{D}_{net}\). Here we use \({I}_{net}\) as an example to illustrate SDNE. SDNE is composed of an encoder and a decoder: the encoder maps the input network through multiple nonlinear functions, and the decoder applies multiple nonlinear functions to reconstruct the network. In \({I}_{net}=(V, E)\), the adjacency matrix of the network is denoted by \(M\), and \(V\) denotes the set of lncRNA nodes within the network, where \(|V|={N}_{l}\). Then, the mapping and reconstruction of the network is performed as follows:
\[{y}_{i}^{(1)}=\sigma \left({W}_{l}^{(1)}{M}_{i}+{b}^{(1)}\right),\quad {y}_{i}^{(k)}=\sigma \left({W}_{l}^{(k)}{y}_{i}^{(k-1)}+{b}^{(k)}\right),\ k=2,\cdots ,K \tag{23}\]
where \({M}_{i}\) denotes the initial feature of the i-th lncRNA in the network, \(\upsigma \left(\cdot \right)\) denotes the activation function, \({W}_{l}^{\left(1\right)}\in {{\varvec{R}}}^{{n}_{1}\times {N}_{l}}, {b}^{(1)}\in {{\varvec{R}}}^{{n}_{1}}\), \({W}_{l}^{(k)}\in {{\varvec{R}}}^{{n}_{k}\times {n}_{k-1}}\) and \({b}^{(k)}\in {{\varvec{R}}}^{{n}_{k}}\) are the trainable parameters, and K is the number of hidden layers in the encoder and in the decoder. Once \({y}_{i}^{(k)}\) is obtained, the decoder maps it back to obtain the output \({\widehat{M}}_{i}\). To help SDNE capture the network structure more accurately, second-order similarity and first-order similarity are both used to construct the SDNE loss function, so that the error between the reconstructed network and the original network is smaller. The SDNE loss function \({L}_{sdne}\) is calculated as follows:
Here, \(\odot\) represents the Hadamard product. \({b}_{i}={\{{b}_{i,j}\}}_{j=1}^{{N}_{l}}\): if \(M(i,j)=0\), \({b}_{i,j}=1\); otherwise, \({b}_{i,j}=\beta >1\). \(M\) represents the adjacency matrix of the network, \(M(i, j)\) represents its entry in the i-th row and j-th column, and \(\alpha\) is a hyperparameter. \({L}_{reg}\) is a regularization term that prevents overfitting, which is calculated as follows:
The input is a network \(G = (V,E)\), where V denotes the set of nodes and E denotes the set of edges. The network structure is encoded following Eq. (23) and then decoded by a decoding module using Eq. (26). The loss of the reconstructed network structure is computed with Eq. (24) for the first-order loss function, Eq. (25) for the second-order loss function, and Eq. (27) for the regularization function, which improves the accuracy of the encoded network structure. Finally, the encoder output \(y_i^{(k)}\) is returned. \({I}_{net}\) and \({D}_{net}\) denote the lncRNA-integrated similarity network and the disease-integrated similarity network. The final expression of the SDNE is as follows:
where \(\widehat{M}\in {{\varvec{R}}}^{{N}_{l}\times {n}_{p} } \,{\text{and}} \,\widehat{D}\in {{\varvec{R}}}^{{N}_{d}\times {n}_{p}}\), \({n}_{p}=K/2\), and K denotes the number of hidden layers in the decoder and encoder. We combine \(\widehat{M}\) and \(\widehat{D}\) into a new network structure coding \(SF=\left[\begin{array}{c}\widehat{M}\\ \widehat{D}\end{array}\right]\in {{\varvec{R}}}^{({N}_{l}+{N}_{d})\times {n}_{p}}\).
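To illustrate how the two similarity terms enter the SDNE objective, the following sketch evaluates the loss on a toy network. It assumes the standard SDNE form (a weighted reconstruction error for the second-order term plus \(\alpha\) times an embedding distance over linked nodes for the first-order term) and omits the regularization term; the function name and all values are hypothetical.

```python
def sdne_loss(M, Y, M_hat, alpha=1.0, beta=5.0):
    """Second-order reconstruction loss plus alpha times the first-order loss."""
    n = len(M)
    second = 0.0
    first = 0.0
    for i in range(n):
        for j in range(n):
            # Errors on observed links are penalized beta times more strongly.
            weight = beta if M[i][j] != 0 else 1.0
            second += ((M_hat[i][j] - M[i][j]) * weight) ** 2
            if M[i][j] != 0:
                # Linked nodes should have nearby embeddings.
                first += M[i][j] * sum((u - v) ** 2 for u, v in zip(Y[i], Y[j]))
    return second + alpha * first


# Toy network with one edge, 1-dimensional embeddings, and an imperfect reconstruction.
M = [[0, 1], [1, 0]]
Y = [[0.0], [1.0]]
M_hat = [[0.0, 0.5], [0.5, 0.0]]
loss = sdne_loss(M, Y, M_hat)
```

In the toy case each of the two link entries contributes 6.25 to the second-order term and 1 to the first-order term, for a total of 14.5.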
Globallevel embedding
In our model, we account for the limitations of the information contained in the locallevel nodes. Therefore, we introduce a Transformer [53] module to learn globallevel node features and deeply explore the unknown associations between diseases and lncRNAs from a global perspective. The Transformer is utilized in the domain of graph neural networks and has significant implications for the future development of graph neural networks. In NAGTLDA, we only need the Transformer encoder to learn the feature embedding of the globallevel nodes.
We take the node features \({\widehat{Z}}_{G}\) of the heterogeneous network as input to the Transformer, which is first processed through the multi-head attention layer as follows:
where \({W}_{i}^{q}\), \({W}_{i}^{k}\), \({W}_{i}^{v}\in {R}^{({N}_{l}+{N}_{d})\times (({N}_{l}+{N}_{d})/{n}_{head})}\) are the parameters to be trained in the model and \({n}_{head}\) represents the number of attention heads. We obtain a set of head outputs \(\{{H}_{1}, {H}_{2}, \cdots , {H}_{{n}_{head}}\}\), and finally, we obtain the output H from the multi-head attention:
where \({W}^{H}\in {R}^{({N}_{l}+{N}_{d})\times {n}_{h}}\) is a trainable parameter and \(\oplus\) represents the concatenation operation. The output of the multi-head attention is then propagated through a feedforward network, which is defined as follows:
where \(\sigma (\cdot )\) represents a nonlinear activation function (LeakyReLU) and i denotes the number of hidden layers in the feedforward network. Given the initial input \(H\), we obtain the output X of the feedforward network:
where \({W}^{F1}\in {R}^{{n}_{h}\times {n}_{d}}\), \({W}^{F2}\in {R}^{{n}_{d}\times {n}_{h}}\), \({b}^{F1}\in {R}^{{n}_{d}}\) and \({b}^{F2}\in {R}^{{n}_{h}}\) are the training parameters.
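The encoder step described above (per-head projections, concatenation of the heads, projection by \({W}^{H}\), then a two-layer feedforward network with LeakyReLU) can be sketched in a few lines of pure Python. This is a minimal illustration that assumes the standard scaled dot-product attention of the Transformer; residual connections and layer normalization are omitted, the weights are random, and all function names and toy sizes are hypothetical rather than taken from the NAGTLDA code.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention_head(Z, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head; returns (output, weights)."""
    Q, K, V = matmul(Z, Wq), matmul(Z, Wk), matmul(Z, Wv)
    scale = math.sqrt(len(Q[0]))
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, V), weights


def leaky_relu(v, slope=0.01):
    return v if v > 0 else slope * v


def encoder_layer(Z, heads, W_H, W_F1, b_F1, W_F2, b_F2):
    """Multi-head attention followed by a two-layer feedforward network."""
    head_outputs = [attention_head(Z, *h)[0] for h in heads]
    # Concatenate the head outputs feature-wise, then project with W_H.
    concat = [sum((out[i] for out in head_outputs), []) for i in range(len(Z))]
    H = matmul(concat, W_H)
    hidden = [[leaky_relu(v + b) for v, b in zip(row, b_F1)] for row in matmul(H, W_F1)]
    return [[v + b for v, b in zip(row, b_F2)] for row in matmul(hidden, W_F2)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical): 4 nodes, 6 input features, 2 heads.
rng = random.Random(0)
n_nodes, d_in, n_head, n_h, n_d = 4, 6, 2, 6, 8
d_head = d_in // n_head
Z = rand_matrix(n_nodes, d_in, rng)
heads = [tuple(rand_matrix(d_in, d_head, rng) for _ in range(3)) for _ in range(n_head)]
W_H = rand_matrix(n_head * d_head, n_h, rng)
W_F1, b_F1 = rand_matrix(n_h, n_d, rng), [0.0] * n_d
W_F2, b_F2 = rand_matrix(n_d, n_h, rng), [0.0] * n_h

X = encoder_layer(Z, heads, W_H, W_F1, b_F1, W_F2, b_F2)
_, attn = attention_head(Z, *heads[0])
```

Each row of `attn` is a probability distribution over all nodes, so every node can draw information from every other node in a single layer, which is the global view that the local smoothing step lacks.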
Global-level embedding fusion
We have acquired local-level and global-level embeddings. Because combining these embeddings with straightforward concatenation or summation would be inefficient, we continue to employ the Transformer's decoder to carry out global-level node embedding fusion. The Transformer does not use the message-passing mechanism of graph computation; therefore, the structural inductive bias of the network is introduced into the Transformer to compensate for the missing message-passing mechanism, which improves the results of the model. Here, we employ two multi-headed attention layers: the first handles the node embeddings, and the second incorporates the structural inductive bias of the network to learn the final node embedding representation.
First, we use the first multi-head attention layer to process the concatenation of the global-level embedding X and the local-level embedding \({\widehat{Z}}_{LD}\). By applying the multi-head attention Eqs. (30), (31), and (32) along with the feedforward network Eq. (33), we obtain a new node embedding \({X}^{F}\in {R}^{({N}_{l}+{N}_{d})\times {n}_{h}^{\prime}}\).
Then, we use the second multi-head attention layer to incorporate the structural inductive bias of the network. After concatenating the structural inductive bias SF and the node embedding \({X}^{F}\), we similarly apply Eqs. (30), (31), and (32) for multi-head attention and Eq. (33) for the feedforward network to obtain a new representation of the node embedding \({X}^{S}\).
We utilized the rich information of the heterogeneous network and the topological structure of the integrated similarity networks for lncRNAs and diseases to perform node feature embedding learning at both the local level and the global level. Simultaneously, we learned the structural information of the network. Finally, we fused them using the Transformer structure to obtain the final node embedding representation \({X}^{S}\in {R}^{({N}_{l}+{N}_{d})\times f}\).
Predicting the association score between lncRNAs and diseases
We express the final node embedding as \({X}^{S}=\left[\begin{array}{c}{X}_{L}^{S}\\ {X}_{D}^{S}\end{array}\right]\), where \({X}_{L}^{S}\in {R}^{{N}_{l}\times f}\) is the final node feature embedding of lncRNAs and \({X}_{D}^{S}\in {R}^{{N}_{d}\times f}\) is the final node feature embedding of diseases. The lncRNA-disease interaction matrix \(\widehat{A}\) is reconstructed using a bilinear decoder, which is defined as follows:
where \({W}^{B}\) represents the trainable parameter matrix. We treat the lncRNA-disease link prediction task as a binary classification problem, so binary cross-entropy loss is selected as the loss function for association prediction, which is calculated as follows:
where (i, j) denotes an lncRNA-disease pair, and the sets of negative and positive samples are represented by \({I}^{-}\) and \({I}^{+}\), respectively. The overall loss function of our model can be described as follows:
where \({L}_{l\_p}\) stands for the loss function of the reconstructed association matrix, whereas \({L}_{sdne}^{1}\) and \({L}_{sdne}^{2}\) are the SDNE loss functions for the structures of the disease-integrated similarity and lncRNA-integrated similarity networks, respectively. We used the Adam optimizer [54] for the overall optimization of our model. To balance negative and positive samples during training, a number of negative samples equal to the number of positive samples is randomly chosen for training. The training process of NAGTLDA is shown in Algorithm 1.
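The prediction step can be sketched as follows. The code assumes a common form of the bilinear decoder, \(\widehat{A}=\mathrm{sigmoid}({X}_{L}^{S}{W}^{B}{({X}_{D}^{S})}^{T})\), and a binary cross-entropy loss averaged over the sampled pairs in \({I}^{+}\) and \({I}^{-}\); the sigmoid, the averaging, the function names, and the toy embeddings are assumptions made for illustration, not details confirmed by the text.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def bilinear_decoder(X_L, W_B, X_D):
    """Score every lncRNA-disease pair as sigmoid(x_l^T W_B x_d)."""
    scores = []
    for x_l in X_L:
        # Project the lncRNA embedding once: x_l^T W_B.
        proj = [sum(x_l[k] * W_B[k][j] for k in range(len(x_l)))
                for j in range(len(W_B[0]))]
        scores.append([sigmoid(sum(p * d for p, d in zip(proj, x_d))) for x_d in X_D])
    return scores


def bce_loss(scores, positives, negatives):
    """Binary cross-entropy averaged over sampled positive and negative pairs."""
    eps = 1e-12  # guards against log(0)
    total = -sum(math.log(scores[i][j] + eps) for i, j in positives)
    total -= sum(math.log(1.0 - scores[i][j] + eps) for i, j in negatives)
    return total / (len(positives) + len(negatives))


# Toy embeddings (hypothetical): 2 lncRNAs, 2 diseases, f = 2, W_B = identity.
X_L = [[1.0, 0.0], [0.0, 1.0]]
X_D = [[2.0, 0.0], [0.0, -2.0]]
W_B = [[1.0, 0.0], [0.0, 1.0]]

A_hat = bilinear_decoder(X_L, W_B, X_D)
loss = bce_loss(A_hat, positives=[(0, 0)], negatives=[(1, 1)])
```

With these toy values the positive pair scores about 0.88 and the negative pair about 0.12, so both contribute the same small loss; a learned \({W}^{B}\) lets the model weight interactions between different embedding dimensions instead of relying on a plain dot product.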
Results
Experimental setting
During our experiments, we employed 5-fold cross-validation (5CV) to test the performance of our proposed model. We partitioned the disease-lncRNA pairs into five equal subsets and used a four-to-one ratio for training and testing, which yielded five cross-validation iterations. In each round, we removed all known associations in the test set and evaluated the performance of the trained model on the test samples. As the main performance evaluation metrics, we adopted AUPR (area under the precision-recall curve) and AUC (area under the receiver operating characteristic curve). Additionally, we considered five auxiliary metrics: recall, accuracy (ACC), F1-score, precision (Prec.), and specificity (Spec.). Detailed results of the 5CV experiment are presented in Table 1. Our model achieved an average accuracy of 0.8785 and an average recall of 0.9088 on the experimental dataset. The average specificity and precision reached 0.8483 and 0.8578, respectively, while the average F1-score reached 0.882. The AUC and AUPR curves for our model are shown in Fig. 2; the average AUC and AUPR were 0.9531 and 0.9537, respectively. The results of the 5CV experiment demonstrate the excellent performance of our proposed model in disease-lncRNA interaction prediction tasks.
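For readers who want to reproduce this style of evaluation, the sketch below shows one way to split samples into five folds and to compute the reported metrics from labels and predicted scores. The 0.5 decision threshold, the pairwise definition of AUC, and all function names and toy values are assumptions for illustration; the text does not specify how the authors binarized their scores.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def safe_div(a, b):
    return a / b if b else 0.0


def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, recall, precision, specificity and F1 at a fixed threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    recall, precision = safe_div(tp, tp + fn), safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores (hypothetical).
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

folds = k_fold_indices(10, k=5)
m = classification_metrics(y_true, y_score)
auc_value = auc(y_true, y_score)
```

Unlike accuracy or F1, the AUC does not depend on the chosen threshold, which is one reason it is commonly used as a primary metric alongside AUPR.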
The model includes several hyperparameters: the final embedding dimension (dim), maximum smoothing steps (k), learning rate (lr), encoding dimension for SDNE (nhid), number of Transformer layers (L1 and L2), number of attention heads for multi-head attention (Head1 and Head2), r value for NAFS, and weight decay for the optimizer. The hyperparameter settings are presented in Table 2. The optimal parameter values are shown in bold and were chosen based on the model AUC.
Parameter analysis
During the process of setting hyperparameters, we found that certain parameter values have a noticeable impact on the model performance. For instance, we analyzed the dimension of the final node features, as shown in Fig. 3. We compared different dimension values (\(dim\in \{32, 64 ,128, 256, 512\}\)) and found that the AUC and AUPR values are highest when dim = 64. Selecting an appropriate dimension to represent node features is crucial. If the dimension is too small, nodes may not be clearly distinguishable; if it is too large, the features can contain a significant amount of redundant information. Therefore, the embedding dimension is an important hyperparameter for the model.
Then, we analyzed the maximum number of smoothing steps in NAFS, as shown in Fig. 4. The maximum number of smoothing steps determines how many hops of neighbors are aggregated when neighbor nodes are aggregated, which is equivalent to aggregating multi-order neighbors. We found that the AUC and AUPR values are highest when hops = 7: they increase as hops approaches 7 from below and decrease when hops exceeds 7. After each smoothing step, the node features contain all the information from the previous smoothing steps, so the number of smoothing steps is also very important for learning the feature embedding.
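The link between smoothing steps and neighbor hops can be made concrete with a small sketch. It assumes the smoothing operator \(\widehat{A}={\widetilde{D}}^{r-1}\widetilde{A}{\widetilde{D}}^{-r}\) with \(\widetilde{A}=A+I\), as used in feature-smoothing methods such as NAFS, and it omits the node-adaptive weighting of the different steps; the function names and the toy path graph are hypothetical.

```python
def normalized_adjacency(A, r=0.5):
    """Return D^(r-1) (A + I) D^(-r), with D the degree matrix of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[deg[i] ** (r - 1) * A_tilde[i][j] * deg[j] ** (-r) for j in range(n)]
            for i in range(n)]


def smooth(A, X, k, r=0.5):
    """Return [X^(0), ..., X^(k)], where X^(t) = A_hat X^(t-1)."""
    A_hat = normalized_adjacency(A, r)
    steps = [X]
    for _ in range(k):
        prev = steps[-1]
        steps.append([[sum(A_hat[i][m] * prev[m][c] for m in range(len(prev)))
                       for c in range(len(prev[0]))]
                      for i in range(len(A_hat))])
    return steps


# Path graph 0 - 1 - 2 - 3; only node 0 carries a non-zero feature.
A = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
X = [[1.0], [0.0], [0.0], [0.0]]

steps = smooth(A, X, k=3)
```

In this toy graph the signal from node 0 reaches node 2 only after two steps and node 3 only after three, which illustrates why the number of steps controls the order of the aggregated neighborhood. Too many steps make all node features converge toward similar values, consistent with the decline reported beyond seven hops.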
In our model, we introduced the Transformer module, whose multi-head attention mechanism provides a global perspective and enables global-level embedding learning. We used two instances of the Transformer module, and we found that different combinations of layer numbers (L1 and L2) have a significant impact on the model's performance. Fig. 5a shows the effect of different layer numbers on the model's AUC, while Fig. 5b shows the effect of different values of L1 and L2 on AUPR. The highest AUC value is achieved when (L1, L2) is set to (10, 20), while the highest AUPR value is achieved when it is set to (15, 10). Different combinations of the numbers of attention heads, Head1 and Head2, also affect the prediction performance of the model. As depicted in Fig. 6a and b, both the highest AUC value and the highest AUPR value are observed when (Head1, Head2) is set to (8, 64).
Performance comparison with different ratios
The proportions of negative and positive samples in each fold of cross-validation can also affect the model's performance. Therefore, we set the ratio of positive samples to negative samples in each fold to {1:1, 1:5, 1:10, random}. The detailed outcomes are presented in Fig. 7. When the ratio = 1:1, indicating balanced positive and negative samples, the AUC and AUPR values are the highest at 0.9531 and 0.9537, respectively, but the corresponding accuracy is the lowest. When the ratio = 1:5, the AUC and AUPR values are slightly lower than those of the ratio = 1:1, but the accuracy is slightly higher. When the ratio = 1:10, the AUC value is the lowest, but the accuracy is higher than for the previous ratios. When the ratio is set to random, the AUC value ranks third and the AUPR value is the lowest, but the accuracy is the highest at 0.9783.
We speculate that these results are due to the low proportion of positive samples in the experimental dataset. Balancing the positive and negative samples in each fold leads to the smallest quantity of training data in each fold, resulting in the lowest model accuracy. More generally, as the number of negative samples per positive sample decreases, the quantity of training data in each fold also decreases, leading to a decrease in accuracy.
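The sampling ratios compared above can be sketched as follows. The code draws unlabeled pairs as negatives at a chosen positive-to-negative ratio and, as a reference point, computes the accuracy of a trivial classifier that labels every pair negative. The function names and the toy problem size are hypothetical, and the baseline is a general property of the accuracy metric, not a result from the paper.

```python
import random


def sample_negatives(positives, n_lnc, n_dis, ratio, seed=0):
    """Draw ratio * len(positives) unlabeled pairs to serve as negatives."""
    known = set(positives)
    candidates = [(i, j) for i in range(n_lnc) for j in range(n_dis)
                  if (i, j) not in known]
    n_neg = min(len(candidates), ratio * len(positives))
    return random.Random(seed).sample(candidates, n_neg)


def all_negative_accuracy(n_pos, n_neg):
    """Accuracy of a trivial classifier that labels every pair negative."""
    return n_neg / (n_pos + n_neg)


# Toy problem (hypothetical): 4 lncRNAs, 5 diseases, 3 known associations.
positives = [(0, 0), (1, 2), (2, 1)]
neg_1_1 = sample_negatives(positives, 4, 5, ratio=1)
neg_1_5 = sample_negatives(positives, 4, 5, ratio=5)

baselines = {r: all_negative_accuracy(1, r) for r in (1, 5, 10)}
```

The trivial baseline rises from 0.5 at 1:1 to about 0.83 at 1:5 and about 0.91 at 1:10. Accuracy is therefore sensitive to class balance, which is worth keeping in mind when comparing accuracy across ratios, whereas AUC and AUPR depend on how well positives are ranked above negatives.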
Performance comparison with other methods
In our experiments, we compared our model with six state-of-the-art computational methods on the benchmark dataset D1 using 5CV. The methods are as follows:

HGATLDA (2022) [55]: A meta-path-based heterogeneous graph attention network framework that predicts interactions between diseases and lncRNAs by constructing disease, lncRNA, and gene heterogeneous networks.

SFGAE (2022) [56]: A graph autoencoder is used for feature learning of nodes, and self-feature representations of miRNAs and diseases are constructed for association prediction between miRNAs and diseases.

VGAELDA (2021) [57]: An end-to-end computational model based on a variational autoencoder and a graph autoencoder that predicts the relationships between diseases and lncRNAs.

LAGCN (2020) [58]: A layer-attentive graph convolutional network that integrates multi-source similarity to construct a heterogeneous network for association prediction between drugs and diseases.

LDA-LNSUBRW (2020) [59]: A computational method based on unbalanced bi-random walk and linear neighborhood similarity for association prediction between diseases and lncRNAs.

CNNLDA (2019) [29]: A dual convolutional neural network model with an attention mechanism that integrates multiple sources of data to uncover associations between diseases and lncRNAs.
The benchmark dataset D1 was downloaded from Lnc2Cancer [16], LncRNADisease [17] and GeneRIF [47]. It was sourced from the previous research conducted by Fu et al. [46] on lncRNA-disease association prediction. The dataset comprises 240 lncRNAs, 412 diseases, and 2,697 experimentally validated lncRNA-disease interactions. The semantic similarity data for all diseases are obtained from MeSH.
In the experiments on the benchmark dataset D1, we compared the models using two evaluation metrics, AUC and AUPR. The experimental results are presented in Table 3, where the highest results are highlighted. Our proposed NAGTLDA model achieves the highest AUC and AUPR values. This improvement can be attributed to the use of a Transformer for global learning when learning node features. NAGTLDA outperforms LDA-LNSUBRW by 8.92% in AUC and 5.51% in AUPR. Figure 8 shows the AUC and AUPR curves of all models obtained through 5CV experiments, in which NAGTLDA outperforms the other models. To highlight the performance disparity between NAGTLDA and existing state-of-the-art methods, we conducted a significance analysis of their AUC values, shown in Fig. 9 (* denotes P < 0.05, ** denotes P < 0.01, *** denotes P < 0.001). The differences between NAGTLDA and the other methods are consistently significant, ranging from P < 0.05 to P < 0.001. This performance improvement benefits the discovery of unknown lncRNA-disease associations. Hence, we infer that our proposed model demonstrates excellent performance and serves as an effective computational approach for predicting disease-lncRNA associations.
Compared with these state-of-the-art methods, our model exhibits a significant performance advantage, as confirmed in the experiments above. The enhancement in performance can be attributed to the following unique contributions: NAFS is utilized to learn local features of nodes, simplifying the model training process and enhancing effectiveness. Moreover, the incorporation of network structure encoding enhances the efficiency of graph node information learning. Lastly, the application of the Transformer architecture allows for the learning of global information of nodes in the graph. The global and local features are then adaptively and efficiently fused using a multi-head attention approach, resulting in comprehensive feature information for diseases and lncRNAs.
Performance on other datasets
To further validate the performance and generalization ability of the NAGTLDA model, we performed experiments on a larger lncRNA-disease association dataset D2 and a miRNA-disease association dataset D3, as shown in Table 4.

D2: We screened the data from databases of known lncRNA-disease associations, including LncRNADisease v2.0 [60] and Lnc2Cancer v3.0 [61], known lncRNA-miRNA associations from Encori [62] and NPInter V4.0 [63], and known miRNA-disease associations from HMDD v3.2 [64]. All disease names were converted to standard MeSH disease terms to facilitate the calculation of semantic similarity between the diseases. After removing redundant data, the final merged dataset contained 861 lncRNAs, 432 diseases, and 4516 known lncRNA-disease associations. The features used to compute the semantic similarity of diseases in the model are obtained from MeSH.

D3: The known miRNA-disease association data were downloaded from the HMDD v3.2 database [64]; after screening, we obtained 788 miRNAs, 374 diseases, and 8968 known associations. The features used to compute the semantic similarity of diseases in the model are obtained from MeSH.
We conducted 5-fold cross-validation experiments on the D2 and D3 datasets, and the results are presented in Table 5. Comparing the experimental outcomes of the original dataset with the D2 dataset, we observed that the model performs better on D2. This improved performance can be attributed to the incorporation of the Transformer structure into the NAGTLDA model, enhancing its performance on larger datasets. The Transformer, originally designed for large-scale natural language processing tasks, brings notable advantages to our model, allowing it to excel on larger datasets.
On the D3 dataset, we achieved AUC and AUPR values exceeding 0.94, while the F1-score reached 0.8746. These outcomes indicate that our model possesses strong generalization capabilities. It not only performs well in predicting lncRNA-disease associations, which is the primary focus of our study, but also demonstrates high performance on other noncoding RNA datasets.
We established independent validation sets to assess the performance of our model, following the methodology outlined by Fu et al. [65]. For the D1 dataset, which contains 2697 positive samples, we first selected 20% of the positive samples and the same number of negative samples to construct an independent balanced validation set (B-validation set); the remaining samples were used for training. Subsequently, we randomly extracted 20% of the samples from the D1 dataset to create an unbalanced independent validation set (Unb-validation set), while the remaining samples served as the training set. The experimental results on these two independent validation sets are summarized in Table 6. Compared with the benchmark dataset, performance on the independent validation sets decreased in terms of the two primary metrics, AUC and AUPR, although the model still achieved relatively good results. Furthermore, the AUC and AUPR on the unbalanced independent validation set were slightly lower than those on the balanced validation set. This trend was observed in both balanced and unbalanced datasets, suggesting the need to explore strategies for choosing an optimal ratio of positive and negative samples so that the model is trained more comprehensively.
After comparing NAGTLDA with other state-of-the-art models on the D1 dataset, we extended our evaluation to two larger datasets, D2 and D3. We analyzed the significance of their AUC values, as illustrated in Figs. 10 and 11, to assess computational efficiency and scalability across models. NAGTLDA differed significantly from the other models on both datasets, with particularly noteworthy results on the D2 dataset, where the significance compared to other state-of-the-art models reached P < 0.001.
The reasons for the strong scalability of our model are as follows: (1) Our model applies SDNE to learn the structure coding based on the specific network. (2) We leverage the graph Transformer structure to learn global-level features, which can adaptively learn the features of nodes and has a very powerful learning capability. (3) We add NAFS to learn local features, which makes the model more scalable by flexibly learning the information of different nodes.
However, our proposed model has some limitations on large datasets. Large datasets are commonly imbalanced in positive and negative samples, which requires introducing multi-source features to compensate for the sparsity of positive samples. Moreover, the model has many hyperparameters, and applying it to large datasets may cause overfitting because of the large number of parameters.
Feature visualization
To display the effectiveness of our proposed model more concretely and graphically, we visualize the lncRNA-disease pair features learned by the model. We used t-SNE [66] to reduce the dimensionality of the lncRNA-disease pair features and plotted them in a two-dimensional plane to compare the learned pair features with the original pair features. Figure 12 shows the original pair features (left) and the learned pair features (right). In the visualization, negative and positive samples are distinguished by dots of different colors, and the positive and negative lncRNA-disease pairs learned by NAGTLDA are more concentrated and more distinguishable than in the original features. This also indicates that our model is meaningful and interpretable for disease and lncRNA feature learning.
Ablation experiments
To assess the influence of each module on the model performance and its importance, three sets of ablation experiments were performed for validation.
The first set of ablation experiments is to remove a module from the initial model to construct a comparison model, and each new comparison model is described as follows:

Remove T1: Remove the Transformer module that performs global-level embedding of heterogeneous networks.

Remove lncRNA-NAFS: Remove the NAFS module that performs local-level embedding of the lncRNA-integrated similarity network.

Remove disease-NAFS: Remove the NAFS module that performs local-level embedding of the disease-integrated similarity network.

Remove lncRNA-SDNE: Remove the SDNE module that encodes the structure of the lncRNA-integrated similarity network.

Remove disease-SDNE: Remove the SDNE module that encodes the structure of the disease-integrated similarity network.
The results obtained from the experiments are presented in Fig. 13 and Table 7, and the original NAGTLDA model achieves better results than the comparison models. For example, NAGTLDA outperforms the Remove disease-SDNE variant by 0.0181 in AUC and 0.0133 in AUPR. We observe that encoding the network structure information has the most significant impact on the overall model performance. Thus, while acquiring node-level information within the network is important, a comprehensive understanding of the network's structural information is also a vital component. The overall performance of each model formed by removing a module is lower than that of the original model, which demonstrates the effectiveness of using the Transformer layer for global-level embedding, NAFS for local-level embedding, and SDNE for network structure encoding.
The second set of ablation experiments was conducted by replacing the method used for local-level embedding in the model with the classical graph neural networks GCN and GAT to construct the comparison models NAGTLDA_gcn and NAGTLDA_gat. As shown in Table 8 and Fig. 14, NAGTLDA performs better than the variant models. Specifically, NAGTLDA is 0.0106 higher than NAGTLDA_gcn in AUC, 0.0079 higher than NAGTLDA_gat in AUPR, and 0.0158 higher than NAGTLDA_gcn in accuracy. NAGTLDA also has the highest F1-score among the three models; because the F1-score is a benchmark indicator of the comprehensive ability of a model, the original model is the better choice. Combining the outcomes of the first set of ablation experiments and the present set of experiments, we conclude that using NAFS for embedding learning of node features is an efficient learning method, which also supports the effectiveness and efficiency of using NAFS in the whole model.
The third set of ablation experiments concerns NAFS. We input a set of r values to obtain a set of different node feature representations, which can be combined in different ways. NAGTLDA_concat, NAGTLDA_max and NAGTLDA_simple denote the use of the concatenation, max and simple operations, respectively, where the simple operation uses only a single r value. These variants are compared with the mean operation. The detailed experimental outcomes are presented in Fig. 15 and Table 9. Six of the seven evaluation metrics in the experimental results are the highest when the mean operation is used.
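The four ways of combining the per-r feature matrices can be sketched as a single helper. The function name `combine`, the interpretation of "simple" as taking the first matrix only, and the toy values are assumptions for illustration.

```python
def combine(features, how="mean"):
    """Combine several (n_nodes x d) feature matrices, one per r value."""
    n_nodes, dim = len(features[0]), len(features[0][0])
    if how == "simple":
        # Use the representation from a single r value only.
        return [row[:] for row in features[0]]
    if how == "concat":
        # Place the representations side by side: n_nodes x (d * number of r values).
        return [sum((F[i] for F in features), []) for i in range(n_nodes)]
    if how == "max":
        return [[max(F[i][j] for F in features) for j in range(dim)]
                for i in range(n_nodes)]
    if how == "mean":
        return [[sum(F[i][j] for F in features) / len(features) for j in range(dim)]
                for i in range(n_nodes)]
    raise ValueError(f"unknown operation: {how}")


# Two toy feature matrices, standing in for two different r values.
F_a = [[1.0, 2.0], [3.0, 4.0]]
F_b = [[3.0, 0.0], [1.0, 8.0]]

mean_feat = combine([F_a, F_b], "mean")
max_feat = combine([F_a, F_b], "max")
concat_feat = combine([F_a, F_b], "concat")
```

The mean and max operations keep the feature dimension fixed, whereas concatenation multiplies it by the number of r values, which increases the size of every downstream layer.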
Case study
In the previous sections, we tested and confirmed the effectiveness of NAGTLDA. Now, we evaluate NAGTLDA's ability to uncover unknown relationships between diseases and lncRNAs. We chose four common diseases from the dataset as case studies: prostate cancer, colon cancer, breast cancer, and colorectal cancer. We trained the model with the 2697 observed lncRNA-disease associations as training instances and then made predictions for unknown potential associations. We extracted the top 15 candidate lncRNAs for each disease and validated the results using three benchmark databases: LncRNADisease v2.0 [60], Lnc2Cancer 3.0 [61], and MNDR v3.1 [67].
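Extracting the top candidates for one disease amounts to ranking the corresponding column of the predicted score matrix while skipping associations that were already known during training. The sketch below illustrates this; the function name, the placeholder lncRNA names, and the scores are hypothetical and are not results from the paper.

```python
def top_candidates(scores, known, disease, names, k=15):
    """Rank lncRNAs for one disease by predicted score, skipping known pairs."""
    ranked = sorted(
        (i for i in range(len(scores)) if (i, disease) not in known),
        key=lambda i: scores[i][disease],
        reverse=True,
    )
    return [(names[i], scores[i][disease]) for i in ranked[:k]]


# Toy score matrix (4 lncRNAs x 2 diseases) with placeholder names.
names = ["lnc_a", "lnc_b", "lnc_c", "lnc_d"]
scores = [[0.95, 0.20],
          [0.40, 0.70],
          [0.80, 0.10],
          [0.10, 0.60]]
known = {(0, 0)}  # lnc_a is already associated with disease 0

candidates = top_candidates(scores, known, disease=0, names=names, k=2)
```

Here `lnc_a` has the highest score for disease 0 but is excluded because the pair is already known, so the candidates are `lnc_c` and `lnc_b`; in a case study, such candidates are then checked against external databases.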
The exact cause of colon cancer is still unknown, but studies have shown that the risk of developing the disease increases with age, obesity, and cancer in other parts of the body. Researchers have found that colon cancer is closely linked to several lncRNAs. For example, CYTOR, through binding to the corresponding protein, can contribute to the metastasis of colon cancer [68], and HOXB-AS3 expression can inhibit the growth of colon cancer [69]. The experimental outcomes are presented in Table 10, where 14 of the top 15 candidate lncRNAs have been confirmed.
Prostate cancer is the most prevalent malignancy of the male urological system and is highly prevalent in older men, but its etiology has not yet been fully identified. Researchers have found that prostate cancer is closely related to the expression of lncRNAs. For example, the expression of the lncRNAs MAGI2-AS3 and MEG3 inhibits the development of prostate cancer [70, 71], and the expression of MNX1-AS1 indirectly promotes the development of prostate cancer [72]. We used prostate cancer as the second disease in the case study, and the experimental outcomes are presented in Table 11. Thirteen of the top 15 candidate lncRNAs we identified have been confirmed by the relevant literature.
Breast cancer is the most common cancer among women. According to research, obesity, excessive alcohol consumption, and overnutrition all increase the incidence of breast cancer, but thus far, medical researchers have not found the exact cause of the cancer. With the continuing development of bioclinical technology, a growing number of lncRNAs related to breast cancer have been discovered. For example, the distant metastasis-free survival, overall survival, and progression-free survival of breast cancer patients are strongly associated with high expression of BCAR4, LUCAT1, and TINCR [73,74,75]. LINC00511 binds to the MMP13 protein to promote breast cancer cell migration and proliferation [76]. We used breast cancer as the third disease in the case study, and the experimental outcomes are presented in Table 12. All of the top 15 candidate lncRNAs have been validated by the relevant literature.
Colorectal cancer is the third most common malignancy in the world, and its incidence is relatively similar in men and women. Most cases are attributed to lifestyle habits, and a very small percentage are due to genetic factors. Colorectal cancer ranks second in the number of deaths caused by malignant tumors. Researchers have found through numerous clinical trials that ITGB8-AS1 combined with the corresponding signals can contribute to the growth and metastasis of colorectal cancer [77] and that GAS5 and YAP phosphorylation and degradation interact to inhibit the development of colorectal cancer [78]. We used colorectal cancer as the fourth disease in our case study, and the experimental outcomes are presented in Table 13, where 13 of the top 15 candidate lncRNAs we selected have been validated by the relevant literature.
Discussion
In the present paper, we designed a NAGTLDA computational model to make inferences about unknown interactions between lncRNAs and diseases. Based on the experimental results, our model demonstrates promising performance, particularly in handling large datasets. The high scalability across varying sizes of datasets can be ascribed to the utilization of the graph Transformer architecture for extracting feature representations. This architecture possesses a highly expressive and adaptive learning capability, enabling it to learn diverse networks effectively.
However, our proposed model and the current study have some limitations. The limitations of our model are as follows: (1) The main framework of our model is built upon the Transformer architecture, which requires considerable computational power during training, particularly in practical applications involving large datasets. (2) The numerous hyperparameters necessitate meticulous optimization and tuning, which increases the complexity of the training process. (3) Our model also relies on the initial similarity features of the nodes, which are calculated based on the association matrix. The field of lncRNA-disease association prediction also has limitations at present: (1) There are no true negative samples in the experimental data, because biological studies focus on finding true positive samples and pay little attention to negative samples. Negative samples may be correct, or they may be undetected false negatives. (2) The results of computational modeling do not correlate very well with biological experiments, and better integration of computational modeling and biological experiments would make the results more interpretable. In future research, we can start by studying the dataset and exploring how to better represent the correlations between entities, which should lead to more accurate discovery of unknown associations. In addition, as medical science and technology continue to advance, the discovery of more unknown lncRNAs, represented as isolated nodes, is anticipated. There is therefore a pressing need to develop more comprehensive models that can accurately predict the associations between these isolated nodes and experimentally verified disease nodes.
Conclusions
In this study, we first constructed a heterogeneous network of diseases and lncRNAs, an integrated similarity network for diseases, and an integrated similarity network for lncRNAs, and used NAFS to perform node-level embedding on each of the three networks. We also adopted SDNE to encode the structural information of the networks so that the constructed networks could be used more effectively. We then introduced the Transformer module for global-level embedding to explore potential unknown associations in the dataset, and used a Transformer fusion mechanism with two levels of attention to perform global-level embedding fusion on the learned embeddings and the network topology. Learning embeddings from both local and global perspectives allows potential associations to be identified more reliably. Finally, a bilinear decoder fuses the node embedding representations of diseases and lncRNAs to predict lncRNA-disease associations. We evaluated the model experimentally, and the results of 5-fold cross-validation (5CV) and the comparison with baseline models confirm its strong performance. In the case study, NAGTLDA successfully predicted associations that were absent from the dataset, such as NEAT1-colon cancer, SOX2-OT-prostate cancer, and WT1-AS-colorectal cancer. He et al. [79] investigated the function of NEAT1 in colon cancer and found that NEAT1 expression was significantly elevated in colon cancer cells, indicating that NEAT1 indirectly promotes the occurrence of colon cancer. Song et al. [80] demonstrated that knockdown of SOX2-OT inhibits the proliferation and metastasis of prostate cancer cells through interactions with other noncoding RNAs. This discovery provides a new therapeutic approach for the treatment of prostate cancer. Zhang et al. [81] demonstrated that WT1-AS was closely associated with overall survival in colorectal cancer. The correlation between WT1-AS and colorectal cancer was demonstrated through clinicopathological features and data modeling analysis, and WT1-AS can be used as a biomarker and therapeutic target for colorectal cancer prognosis. These literature-supported predictions indicate that our proposed model can help identify candidate therapeutic targets and provides a solid foundation for biological experiments and clinical practice.
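The bilinear decoding step mentioned above can be sketched as follows. This is a minimal illustration that assumes the common form score = sigmoid(z_l^T W z_d); the exact decoder used in NAGTLDA may differ, and the names `bilinear_score`, `z_lnc`, `z_dis`, and `weight`, as well as the toy values, are hypothetical.

```python
import math


def bilinear_score(z_lnc, weight, z_dis):
    """Return sigmoid(z_lnc^T W z_dis) for one lncRNA-disease pair.

    z_lnc and z_dis are embedding vectors (lists of floats) and weight is
    a matrix (list of rows) with len(z_lnc) rows and len(z_dis) columns.
    """
    # Project the disease embedding through W (matrix-vector product).
    projected = [sum(w * d for w, d in zip(row, z_dis)) for row in weight]
    # Dot product with the lncRNA embedding gives the raw association logit.
    logit = sum(l * p for l, p in zip(z_lnc, projected))
    # Squash the logit to a probability-like score in (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))


# Toy example (assumed values): 2-dimensional embeddings, identity weight matrix.
z_lnc = [1.0, 0.0]
z_dis = [1.0, 0.0]
weight = [[1.0, 0.0], [0.0, 1.0]]
score = bilinear_score(z_lnc, weight, z_dis)
```

In a trained model the weight matrix would be learned jointly with the embeddings; here it is fixed to the identity, so the score reduces to the sigmoid of the dot product of the two embeddings.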
Availability of data and materials
For lncRNA-disease associations, the D1 dataset was downloaded from Lnc2Cancer [13]: http://www.biobigdata.net/lnc2cancer, LncRNADisease [14]: http://cmbi.bjmu.edu.cn/lncrnadisease, and GeneRIF [38]: https://ftp.ncbi.nlm.nih.gov/gene/GeneRIF/. The D2 dataset was screened from databases of known lncRNA-disease associations, including LncRNADisease v2.0 [51]: http://www.rnanut.net/lncrnadisease and Lnc2Cancer v3.0 [52]: http://www.biobigdata.net/lnc2cancer; known lncRNA-miRNA associations from Encori [53]: http://starbase.sysu.edu.cn/ and NPInter v4.0 [54]: http://bigdata.ibp.ac.cn/npinter; and known miRNA-disease associations from HMDD v3.2 [55]: http://cuilab.cn/hmdd.
The miRNA-disease association dataset D3 was downloaded from the HMDD v3.2 database [55]: http://cuilab.cn/hmdd.
The semantic similarity data for all diseases were obtained from MeSH at http://www.nlm.nih.gov.
The code of NAGTLDA is provided on GitHub (https://github.com/ghli16/NAGTLDA).
Abbreviations
lncRNA: Long noncoding RNA
NAFS: Node-adaptive feature smoothing
SDNE: Structural Deep Network Embedding
GCN: Graph Convolutional Network
DAG: Directed acyclic graph
AUPR: Area under the precision-recall curve
AUC: Area under the receiver operating characteristic curve
5CV: 5-fold cross-validation
DO: Disease Ontology
CNN: Convolutional Neural Network
References
Derrien T, Johnson R, Bussotti G, et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 2012;22:1775–89.
Guttman M, Rinn JL. Modular regulatory principles of large noncoding RNAs. Nature. 2012;482:339–46.
Wang KC, Chang HY. Molecular mechanisms of long noncoding RNAs. Mol Cell. 2011;43:904–14.
Wapinski O, Chang HY. Long noncoding RNAs and human disease. Trends Cell Biol. 2011;21:354–61.
Chen X, Yan CC, Zhang X, et al. Long noncoding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2016;22:558–76.
Vincent-Salomon A, Ganem-Elbaz C, Manié E, et al. X inactive-specific transcript RNA coating and genetic instability of the X chromosome in BRCA1 breast tumors. Cancer Res. 2007;67:5134–40.
Chen W, Böcker W, Brosius J, et al. Expression of neural BC200 RNA in human tumours. J Pathol. 1997;183:345–51.
Congrains A, Kamide K, Oguro R, et al. Genetic variants at the 9p21 locus contribute to atherosclerosis through modulation of ANRIL and CDKN2A/B. Atherosclerosis. 2012;220:449–55.
Spagnolo P, Kropski JA, Jones MG, Lee JS, Rossi G, Karampitsakos T, et al. Idiopathic pulmonary fibrosis: disease mechanisms and drug development. Pharmacol Ther. 2021;222:107798.
Gavrilov K, Saltzman WM. Therapeutic siRNA: principles, challenges, and strategies. Yale J Biol Med. 2012;85:187–200.
Markowitz RHG, LaBella AL, Shi M, Rokas A, Capra JA, Ferguson JF, et al. Microbiome-associated human genetic variants impact phenome-wide disease risk. In: Proceedings of the National Academy of Sciences. 2022. p. 119.
Jimeno-Yepes AJ, Sticco JC, Mork JG, et al. GeneRIF indexing: sentence selection based on machine learning. BMC Bioinformatics. 2013;14:171.
Piñero J, Saüch J, Sanz F, et al. The DisGeNET cytoscape app: exploring and visualizing disease genomics data. Comput Struct Biotechnol J. 2021;19:2960–7.
Bello SM, Shimoyama M, Mitraka E, et al. Augmenting the disease ontology improves and unifies disease annotations across species. Dis Model Mech. 2018. https://doi.org/10.1242/dmm.032839.
Chen J, Lin J, Hu Y, et al. RNADisease v4.0: an updated resource of RNA-associated diseases, providing RNA-disease analysis, enrichment and prediction. Nucleic Acids Res. 2023;51:D1397–404.
Ning S, Zhang J, Wang P, et al. Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 2015;44:D980–5.
Chen G, Wang Z, Wang D, et al. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2012;41:D983–6.
Sheng N, Huang L, Lu Y, et al. Data resources and computational methods for lncRNA-disease association prediction. Comput Biol Med. 2023;153:106527–37.
Lei X, Mudiyanselage TB, Zhang YC. A comprehensive survey on computational methods of noncoding RNA and disease association prediction. Brief Bioinform. 2021;22(4):bbaa350.
Ganegoda GU, Li M, Wang W, et al. Heterogeneous network model to infer human disease-long intergenic noncoding RNA associations. IEEE Trans Nanobiosci. 2015;14:175–83.
Zhou M, Wang X, Li J, et al. Prioritizing candidate disease-related long noncoding RNAs by walking on the heterogeneous lncRNA and disease network. Mol BioSyst. 2015;11:760–9.
Chen X, You ZH, Yan GY, et al. IRWRLDA: improved random walk with restart for lncRNA-disease association prediction. Oncotarget. 2016;7:57919–31.
Lu C, Yang M, Luo F, et al. Prediction of lncRNA–disease associations based on inductive matrix completion. Bioinformatics. 2018;34:3357–64.
Li G, Luo J, Liang C, et al. Prediction of LncRNA-disease associations based on network consistency projection. IEEE Access. 2019;7:58849–56.
Gu C, Liao B, Li X, et al. Global network random walk for predicting potential human lncRNA-disease associations. Sci Rep. 2017;7:12442.
Wang L, Shang M, Dai Q, He P. Prediction of lncRNA-disease association based on a Laplace normalized random walk with restart algorithm on heterogeneous networks. BMC Bioinformatics. 2022;23(1):1–20.
Li J, Zhao H, Xuan Z, Yu JZ, Yang C, Liao B, et al. A novel approach for potential human LncRNA-disease association prediction based on local random walk. IEEE/ACM Trans Comput Biol Bioinform. 2021;18:1049–59.
Zhang JP, Zhang Z, Chen Z, Deng L. Integrating multiple heterogeneous networks for novel LncRNA-disease association inference. IEEE/ACM Trans Comput Biol Bioinform. 2019;16:396–406.
Xuan P, Cao Y, Zhang T, et al. Dual convolutional neural networks with attention mechanisms based method for predicting disease-related lncRNA genes. Front Genet. 2019;10:416.
Yang Q, Li X. BiGAN: LncRNA-disease association prediction based on bidirectional generative adversarial network. BMC Bioinformatics. 2021;22(1):357.
Zhang Y, Ye F, Gao X. MCA-Net: multi-feature coding and attention convolutional neural network for predicting lncRNA-disease association. IEEE/ACM Trans Comput Biol Bioinform. 2022;19:2907–19.
Xuan P, Gong Z, Cui H, et al. Fully connected autoencoder and convolutional neural network with attention-based method for inferring disease-related lncRNAs. Brief Bioinform. 2022;23(3):bbac089.
Sheng N, Cui H, Zhang T, et al. Attentional multi-level representation encoding based on convolutional and variance autoencoders for lncRNA–disease association prediction. Brief Bioinform. 2021;22:bbaa067.
Wang L, Zhong C. gGATLDA: lncRNA-disease association prediction based on graph-level graph attention network. BMC Bioinformatics. 2022;23(1):11.
Ai C, Yang H, Guo F, et al. A multi-layer multi-kernel neural network for determining associations between noncoding RNAs and diseases. Neurocomputing. 2022;493:91–105.
Wu Q, Cao R, Xia J, Ni J, Zheng CH, Su Y. Extra trees method for predicting LncRNA-disease association based on multi-layer graph embedding aggregation. IEEE/ACM Trans Comput Biol Bioinform. 2022;19:3171–8.
Sheng N, Huang L, Wang Y, Zhao J, Xuan P, Gao L, et al. Multi-channel graph attention autoencoders for disease-related lncRNAs prediction. Brief Bioinform. 2022;23(2):bbab604.
Lan W, Wu X, Chen Q, Peng W, Wang J, Chen YP. GANLDA: Graph attention network for lncRNA-disease associations prediction. Neurocomputing. 2022;469:384–93.
Ying C, Cai T, Luo S, et al. Do transformers really perform bad for graph representation? arXiv preprint. 2021;arXiv:2106.05234.
Rampášek L, Galkin M, Dwivedi VP, et al. Recipe for a general, powerful, scalable graph transformer. Adv Neural Inf Process Syst. 2022;35:14501–15.
Oono K, Suzuki T. Graph neural networks exponentially lose expressive power for node classification. In: International conference on learning representations. 2020.
Zhu J, Rossi RA, Rao A, et al. Graph neural networks with heterophily. AAAI. 2021;35:11168–76.
Chen D, O'Bray L, Borgwardt K. Structure-aware transformer for graph representation learning. In: Proceedings of the 39th International Conference on Machine Learning, PMLR. Vol. 162. 2022. p. 3469–89.
Zhang W, Sheng Z, Yang M, et al. NAFS: a simple yet tough-to-beat baseline for graph representation learning. In: Proceedings of the 39th International Conference on Machine Learning (ICML). Vol. 162. 2022. p. 26467–83.
Wang D, Cui P, Zhu W. Structural Deep Network Embedding. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. https://doi.org/10.1145/2939672.2939753.
Fu G, Wang J, Domeniconi C, et al. Matrix factorization-based data fusion for the prediction of lncRNA–disease associations. Bioinformatics. 2017;34:1529–37.
Lu Z, Bretonnel Cohen K, Hunter L. GeneRIF quality assurance as summary revision. Pac Symp Biocomput. 2006. https://doi.org/10.1142/9789812772435_0026.
Chen X, Yan CC, Luo C, et al. Constructing lncRNA functional similarity network based on lncRNA-disease associations and disease semantic similarity. Sci Rep. 2015;5:11338.
Wang D, Wang J, Lu M, et al. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26:1644–50.
Xuan P, Han K, Guo M. Prediction of microRNAs associated with human diseases based on weighted k most similar neighbors. PLoS One. 2013;8:e70204.
van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics. 2011;27:3036–43.
Davies H, Jones B. Attention all surveyors: our schools need you. Struct Surv. 1994;12:31–4.
Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. 2017.
Kingma D, Ba J. Adam: a method for stochastic optimization. Comput Sci. 2014. https://doi.org/10.48550/arXiv.1412.6980.
Zhao X, Zhao X, Yin M. Heterogeneous graph attention network based on meta-paths for lncRNA–disease association prediction. Brief Bioinform. 2022;23:bbab407.
Ma M, Na S, Zhang X, et al. SFGAE: a self-feature-based graph autoencoder model for miRNA–disease associations prediction. Brief Bioinform. 2022;23(5):bbac340.
Shi Z, Zhang H, Jin C, et al. A representation learning model based on variational inference and graph autoencoder for predicting lncRNA-disease associations. BMC Bioinformatics. 2021;22(1):136.
Yu Z, Huang F, Zhao X, et al. Predicting drug–disease associations through layer attention graph convolutional network. Brief Bioinform. 2021;22:bbaa243.
Xie G, Jiang J, Sun Y. LDA-LNSUBRW: lncRNA-disease association prediction based on linear neighborhood similarity and unbalanced bi-random walk. IEEE/ACM Trans Comput Biol Bioinform. 2020;22:1–1.
Bao Z, Yang Z, Huang Z, et al. LncRNADisease 2.0: an updated database of long noncoding RNA-associated diseases. Nucleic Acids Res. 2018;47:D1034–7.
Gao Y, Shang S, Guo S, et al. Lnc2Cancer 3.0: an updated resource for experimentally supported lncRNA/circRNA cancer associations and web tools based on RNA-seq and scRNA-seq data. Nucleic Acids Res. 2021;49:D1251–8.
Li JH, Liu S, Zhou H, et al. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein–RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 2014;42:D92–7.
Teng X, Chen X, Xue H, et al. NPInter v4.0: an integrated database of ncRNA interactions. Nucleic Acids Res. 2019;48:D160–5.
Huang Z, Shi J, Gao Y, et al. HMDD v3.0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 2019;47:D1013–7.
Fu Y, Yang R, Zhang L. Association prediction of CircRNAs and diseases using multi-homogeneous graphs and variational graph autoencoder. Comput Biol Med. 2022;151:106289.
van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–605.
Ning L, Cui T, Zheng B, et al. MNDR v3.0: mammal ncRNA–disease repository with increased coverage and annotation. Nucleic Acids Res. 2021;49:D160–4.
Yue B, Liu C, Sun H, et al. A positive feed-forward loop between LncRNA-CYTOR and Wnt/β-catenin signaling promotes metastasis of colon cancer. Mol Ther. 2018;26:1287–98.
Huang JZ, Chen M, Chen D, et al. A peptide encoded by a putative lncRNA HOXB-AS3 suppresses colon cancer growth. Mol Cell. 2017;68:171–184.e6.
Hu R, Wu P, Liu J. LncRNA MAGI2-AS3 inhibits prostate cancer progression by targeting the miR-142-3p. Horm Metab Res. 2022;54:754–9.
Wu M, Huang Y, Chen T, et al. LncRNA MEG3 inhibits the progression of prostate cancer by modulating miR-9-5p/QKI-5 axis. J Cell Mol Med. 2018;23:29–38.
Liang D, Tian C, Zhang X. lncRNA MNX1-AS1 promotes prostate cancer progression through regulating miR-2113/MDM2 axis. Mol Med Rep. 2022;26(1):231.
Godinho MFE, Sieuwerts AM, Look MP, et al. Relevance of BCAR4 in tamoxifen resistance and tumour aggressiveness of human breast cancer. Br J Cancer. 2010;103:1284–91.
Zheng A, Song X, Zhang L, et al. Long noncoding RNA LUCAT1/miR-5582-3p/TCF7L2 axis regulates breast cancer stemness via Wnt/β-catenin pathway. J Exp Clin Cancer Res. 2019;38(1):305.
Hou A, Zhang Y, Zheng Y, et al. LncRNA terminal differentiation-induced ncRNA (TINCR) sponges miR-302 to upregulate cyclin D1 in cervical squamous cell carcinoma (CSCC). Hum Cell. 2019;32:515–21.
Shi G, Cheng Y, Zhang Y, et al. Long noncoding RNA LINC00511/miR-150/MMP13 axis promotes breast cancer proliferation, migration and invasion. Biochim Biophys Acta Mol Basis Dis. 2021;1867:165957.
Lin X, Zhuang S, Chen X, et al. lncRNA ITGB8-AS1 functions as a ceRNA to promote colorectal cancer growth and migration through integrin-mediated focal adhesion signaling. Mol Ther. 2021;30:688–702.
Ni W, Yao S, Zhou Y, et al. Long noncoding RNA GAS5 inhibits progression of colorectal cancer by interacting with and triggering YAP phosphorylation and degradation and is negatively regulated by the m6A reader YTHDF3. Mol Cancer. 2019;18(1):143.
He Z, Deng J, Song A, Cui X, Ma Z, Zhang Z. NEAT1 promotes colon cancer progression through sponging miR-495-3p and activating CDK6 in vitro and in vivo. J Cell Physiol. 2019;234:19582–91.
Song X, Wang H, Wu J, Sun Y. Long noncoding RNA SOX2-OT knockdown inhibits proliferation and metastasis of prostate cancer cells through modulating the miR-452-5p/HMGB3 axis and inactivating Wnt/β-catenin pathway. Cancer Biother Radiopharm. 2020;35:682–95.
Zhang H, Wang Z, Wu J, Ma R, Feng J. Long noncoding RNAs predict the survival of patients with colorectal cancer as revealed by constructing an endogenous RNA network using bioinformation analysis. Cancer Med. 2019;8:863–73.
Acknowledgements
Not applicable.
Funding
This work is supported by the National Natural Science Foundation of China [grant numbers 62362034, 61862025] and the Natural Science Foundation of Jiangxi Province of China [grant numbers 20232ACB202010, 20212BAB202009, 20181BAB211016].
Author information
Contributions
GL: conceived the study, analyzed the results, drafted the article. PB: collected the data, designed and performed the experiments, drafted the article. CL: revised the article. JL: supervised the study, revised the article. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Li, G., Bai, P., Liang, C. et al. Node-adaptive graph Transformer with structural encoding for accurate and robust lncRNA-disease association prediction. BMC Genomics 25, 73 (2024). https://doi.org/10.1186/s12864-024-09998-2
DOI: https://doi.org/10.1186/s12864-024-09998-2