 Research
 Open access
 Published:
GNNMF: a multiview graph neural network for ATACseq motif finding
BMC Genomics volume 25, Article number: 300 (2024)
Abstract
Background
The Assay for TransposaseAccessible Chromatin using sequencing (ATACseq) utilizes the Transposase Tn5 to probe open chromatic, which simultaneously reveals multiple transcription factor binding sites (TFBSs) compared to traditional technologies. Deep learning (DL) technology, including convolutional neural networks (CNNs), has successfully found motifs from ATACseq data. Due to the limitation of the width of convolutional kernels, the existing models only find motifs with fixed lengths. A Graph neural network (GNN) can work on nonEuclidean data, which has the potential to find ATACseq motifs with different lengths. However, the existing GNN models ignored the relationships among ATACseq sequences, and their parameter settings should be improved.
Results
In this study, we proposed a novel GNN model named GNNMF to find ATACseq motifs via GNN and background coexisting probability. Our experiment has been conducted on 200 human datasets and 80 mouse datasets, demonstrated that GNNMF has improved the area of eight metrics radar scores of 4.92% and 6.81% respectively, and found more motifs than did the existing models.
Conclusions
In this study, we developed a novel model named GNNMF for finding multiple ATACseq motifs. GNNMF built a multiview heterogeneous graph by using ATACseq sequences, and utilized background coexisting probability and the iterloss to find different lengths of ATACseq motifs and optimize the parameter sets. Compared to existing models, GNNMF achieved the best performance on TFBS prediction and ATACseq motif finding, which demonstrates that our improvement is available for ATACseq motif finding.
Background
Transcription factors (TFs) are proteins that can bind to DNA sequences and play a crucial role in generegulated networks [1], cell cycle regulation [2], and human diseases [3]. The region where TFs bind to DNA sequences is called transcription factor binding sites (TFBSs), which are short and conserved DNA fragments [4]. Transcription regulation is carried out by the interplay between TFs and TFBSs in DNA sequences, and identifying TFBSs aids us to reveal functions of TFs and the cause of human diseases [5]. The aligned TFBSs of the same TF can be defined as a regulatory DNA motif, which can be represented by a position weight matrix (PWM) [6]. Assays for TransposaseAccessible Chromatin using sequencing (ATACseq) can probe open chromatin via Tn5, which is an effective method for locating TFBSs at a genomewide level [7, 8]. Compared to Chromatin immunoprecipitation sequencing (ChIPseq) and DNase I hypersensitive sites sequencing (DNaseseq) technologies, ATACseq can reveal more kinds of transcription factor binding regions [9]. An ATACseq footprint is a fragment of a DNA sequence that has not been cleaved due to the transcription factors binding. TOBIAS and HintATAC are feasible tools for identifying ATACseq footprints, TOBIAS utilizes bias correction [10] and footprinting scores to locate footprints, and HintATAC employs the hidden Markov model to locate footprints [11]. ATACseq footprints have been applied to TF network prediction [11], comparison of TF activity [12], identification of TFs enriched in peripheral blood mononuclear cells (PBMC)specific peaks [13], and ATACseq motifs finding [14].
There are several special tools for ATACseq motif scanning, such as TRACE, chromVAR, SnapATAC [15,16,17]. They scan input sequences by using known motifs, which limits their ability to find ATACseq motifs. DL technology, including convolutional neural networks (CNNs) and the recurrent neural networks has achieved successes in proteinprotein networks, generegulated networks, and motifs finding [18,19,20]. ATACseq motif finding includes two key steps: the first step is to predict an ATACseq sequence that contains TFBSs, i.e. TFBS prediction, and the second step is to find ATACseq motifs. Existing models such as scFAN, FactorNet, and DeepATAC employ CNNs to predict TFBSs and find motifs from ATACseq data [21, 22]. These models utilized convolutional kernels of CNNs to scan ATACseq sequences, and a fully connected layer to predict TFBSs. Due to the limitation of the width of convolutional kernels, these models only find motifs with fixed lengths. Moreover, these existing models do not consider the coexisting probability of TFBSs in an input sequence. In a related line of research, the graph neural network (GNN) can learn the key message from the graphstructured data [23]. GNNs can learn the embeddings of nodes via their neighboring nodes, and keep the connection of the graph unchanged and nodes’ embedding can also enable graphbased explanation and reasoning. GNNs work on nonEuclidean structured data and have been shown to perform well on molecular structures [24], proteinprotein networks [25], genegene networks [26], ATACseq motif finding [14], etc. MMGraph is an ATACseq motif predictor based on GNN and coexisting probability to find multiple motifs [14]. MMGraph utilizes ATACseq footprints to build a multiview heterogeneous graph by defining coexisting, Hamming, jaccard, and inclusive edges. Compared with CNNbased models, MMGraph utilizes ATACseq footprints to build the multiview heterogeneous graph and employs the coexisting probability to find multiple ATACseq motifs of different lengths. However, MMGraph still has certain defects. MMGraph sets the coexistence probability threshold to 0.5, which needs to be optimized. MMGraph uses only Hamming distance and coexisting probability to measure the relationships between kmers, which ignored the correlation between kmers.
To address the above issues, this study developed a novel model called GNNMF to find ATACseq motifs and predict TFBSs. GNNMF is based on MMGraph and has three improved aspects: built a multiview heterogonous graph, which contains four kinds of edges (coexisting edges, similarity edges, jaccard edges, and inclusive edges) and two types of nodes (kmer nodes and sequence nodes); defined the iterloss function to improve TFBS prediction accuracy; optimized the threshold of coexisting probability via defining coexisting probability between kmer nodes in negative sequences. We tested the effectiveness of the GNNMF model on 200 human ATACseq datasets and 80 mouse ATACseq datasets across nine evaluation metrics, including the area of eight metrics radar (AEMR), precision, recall, F1_score, accuracy (ACC), specificity, the Matthews correlation coefficient (MCC), the area under the receiver operating characteristic curve (AUC), and the area under the precisionrecall curve (PRC). According to the experimental results, GNNMF improved the ACC, MCC, and AUC by 9.6%, 18.2%, and 2.9%, respectively, on 200 human ATACseq datasets and by 2.36%, 6.35%, and 3.01%, respectively, on 80 mouse ATACseq datasets. In this study, AEMR is an overall score for evaluating the model’s ability to predict TFBSs. GNNMF improved the AEMR by 4.92% and 6.81% on 200 human and 80 mouse ATACseq datasets, respectively. Moreover, among all the models, GNNMF find the most 385 and 662 significant motifs from human ATACseq data and mouse ATACseq data , respectively.
Methods
Data processing
The 280 ATACseq datasets, including 200 human ATACseq datasets and 80 mouse ATACseq datasets, were randomly downloaded from the ENCODE project (Supplementary Table S1). To obtain corrected footprints, footprints of each dataset are obtained via TOBIAS and HintATAC tools with default parameters [11, 12]. All footprints are ranked in descending order by their scores. Then the top 1500 footprints are used to intersect with the footprints that HintATAC obtains, and the intersected footprints of each dataset are applied to the following analysis.
Each intersected footprint is pruned with 101 bps around its center by the bedtools [27], which is set as a positive sequence. The TFBS prediction is a binary classification, so we shuffle all bases within a positive sequence as a negative sequence [28]. This paper gives a positive sequence a label of ‘1’, and gives a negative sequence a label of ‘0’. Thus, we can obtain an ATACseq sequence set \(S\) that includes \(n\) sequences.
For each dataset, 80% of \(S\) is set as training data, 10% of \(S\) is set as validation data, and the remaining \(S\) is set as testing data. Then, the sequence \(s( \cdot )\) (\(s( \cdot ) \in S\)) is trimmed into kmers \(k( \cdot )\) by the step of one base to obtain a kmer set \(Ks( \cdot )\)(\(lenk = length(k( \cdot ))\)). \(s( \cdot )\) and kmer \(k( \cdot )\) will be utilized to build the multiview heterogeneous graph \(G\), where \(s( \cdot )\) and \(k( \cdot )\) are nodes.
Building the multiview heterogeneous graph
The multiview heterogeneous graph \(G\) contains two kinds of nodes (sequence node \(s( \cdot )\) and kmer node \(k( \cdot )\)), and four types of edges (coexisting edge, similarity edge, Jaccard edge and inclusive edge) (Fig. 1A). Coexisting edges (view 1) represent coexisting relationships between kmer nodes in a sequence, and the weights of coexisting edges are calculated by the coexisting probability [14]. Similarity edges (view 2) represent relations between kmer nodes and the weights of similarity edges are measured by Hamming distances [29]. Jaccard edges (view 3) represent relationships between kmer nodes, and the weights of Jaccard edges are measured by the Jaccard correlation coefficient. Inclusive edges represent inclusive relations between a given sequence \(s( \cdot )\) and its constituent kmers, and weights of inclusive edges are measured by TFIDF (term frequencyinverse document frequency) [30].
We let \(Ks(i)\) be a kmer set that contains all unique \(k( \cdot )\) within \(s(i)\). \(K\) is an \(m\) kmer set that contains all unique \(k( \cdot )\) in all \(Ks(i)\) of \(S\), where \(m\) is the total number of unique \(k( \cdot )\) within \(S\). Four edge types are defined to build \(G\), where \(s( \cdot )\) and \(k( \cdot )\) are nodes.
Coexisting edges measure the coexisting probability between \(k( \cdot )\) nodes. The coexisting edge weight between two kmers \(k(p)\) and \(k(j)\) is calculated by Formula (1):
where \(Wco\) represents a \(m \times m\) coexisting weight matrix, \(num(k( \cdot ))\) represents the total number of \(Ks( \cdot )\) that contain \(k( \cdot )\), and \(nums(k(p),k(j))\) is the total number of \(Ks( \cdot )\) that contain both \(k(p)\) and \(k(j)\).
Similarity edges measure mismatches between \(k( \cdot )\) nodes using the Hamming distance [31]. The similarity edge weight between \(k(p)\) and \(k(j)\) is calculated by Formula (2):
where \(Wsim\) represents a \(m \times m\) similarity weight matrix,\({\kern 1pt} {\kern 1pt} p{\kern 1pt} ,j \in [1,...,m]\). \(Ham\min g( \cdot , \cdot )\) represents the Hamming distance function.
Jaccard edges measure the correlation between \(k( \cdot )\) nodes using the Jaccard correlation coefficient. The Jaccard edge weight between two nodes \(k(p)\) and \(k(j)\) is calculated by Formula (3):
where \(Sk( \cdot )\) presents a sequence set that contains all sequences, including \(k( \cdot )\). \(Wjac\) represents a \(m \times m\) similarity weight matrix, \({\kern 1pt} p{\kern 1pt} ,j \in [1,...,m]\).
Inclusive edges measure the dependency degree between \(s( \cdot )\) and \(k( \cdot )\). The inclusive edge weight between \(s( \cdot )\) and \(k( \cdot )\) is calculated by transferring the concept of the term frequencyinverse document frequency (TFIDF) [32] to Formula (4):
where \(Winclu\) represents an \(m \times n\) inclusive weight matrix,\(tf(k(p),s(i))\) is the number of \(k(p)\) existing in \(s(i)\), \(i \in [1,...,n]\), and \(p \in [1,...,m]\).
GNNMF overview
In this study, we developed a threelayer GNN model named GNNMF to find ATACseq motifs. GNNMF decomposes the multiview heterogeneous graph \(G\) into four subgraphs, i.e. a coexisting subgraph, similarity subgraph, Jaccard subgraph, and inclusive subgraph (Fig. 1B). The first layer of GNNMF is utilized to learn the embedding of \(k( \cdot )\) as \(Ek(k( \cdot )) \in {\mathbb{R}}^{dc + ds + dj}\) where \(dc\), \(ds\) and \(dj\) are the embedding dimensions of \(Esim(k( \cdot ))\)(\(Esim(k( \cdot )) \in {\mathbb{R}}^{dc}\)), \(Eco(k( \cdot ))\)(\(Eco(k( \cdot )) \in {\mathbb{R}}^{ds}\)) and \(Ejac(k( \cdot ))\) (\(Ejac(k( \cdot )) \in {\mathbb{R}}^{dj}\)) from the coexisting subgraph, similarity subgraph and Jaccard subgraph, respectively. The second layer learns the embedding of \(s( \cdot )\) as \(Es(s( \cdot ))\) (\(Es(s( \cdot )) \in {\mathbb{R}}^{dsq}\)) from the inclusive subgraph, where \(dsq\) is the dimension of \(Es(s( \cdot ))\). The third layer is a fully connected layer, which is used to predict TFBSs.
We normalize \(Wco\), \(Wsim\) and \(Wjac\) as the initial embedding of \(k( \cdot )\) via Formula (5).
where \(Ect(k(p))\) (\(Ect(k(p)) \in {\mathbb{R}}^{dc}\)) is the initial embedding of \(k(p)\), based on \(Wco\);\(Est(k(p))\)(\(Est(k(p)) \in {\mathbb{R}}^{ds}\)) is the initial embedding of \(k(p)\), based on \(Wsim\);\(Ejt(k(p))\)(\(Ejt(k(p)) \in {\mathbb{R}}^{dj}\)) is the initial embedding of \(k(p)\), based on \(Wjac\).
GNNMF learns the embedding of \(k(p)\) as \(Eco(k(p))\) from the coexisting subgraph, and \(Eco(k(p))\) is calculated by Formula (7).
where \(W^{co}\)(\(W^{co} \in {\mathbb{R}}^{m \times dc}\)) is a training weight matrix, which is randomly initialized. \({\text{Re}} LU( \cdot )\) represents the rectified linear unit, which is an activation function.
GNNMF learns \(Esim(k(p))\) from the similarity subgraph via Formula (9).
where \(Esim(k(p))\) is the embedding of \(k(p)\) in the similarity subgraph. \(W^{sim}\)(\(W^{sim} \in {\mathbb{R}}^{m \times ds}\)) is a training weight matrix, which is randomly initialized.
GNNMF learns the embedding of \(k(p)\) as \(Ejac(k(p))\) from the Jaccard subgraph, and \(Ejac(k(p))\) is calculated by Formula (11).
where \(W^{jac}\)(\(W^{jac} \in {\mathbb{R}}^{m \times dj}\)) is a training weight matrix that is randomly initialized.
Then, \(Esim(k( \cdot ))\), \(Eco(k( \cdot ))\) and \(Ejac(k( \cdot ))\) are concatenated as \(Esc(k( \cdot ))\) (Formula (12)), which is fed into the second layer.
We stack \(Esc(k(p))\) into a matrix \(Msc\) (\(Msc \in {\mathbb{R}}^{m \times (ds + dc + dj)}\)). GNNMF learns the embedding of \(s( \cdot )\) as \(Esq(s( \cdot ))\) via the second layer, \(Esq(s( \cdot ))\) is calculated by Formula (14).
The third layer is a fully connected layer, which is used to predict TFBSs (Formula (16)).
where \(W\)(\(W \in {\mathbb{R}}^{n \times dsq}\)) is the training weight matrix, which is randomly initialized. \(b\) represents the bias, which is a \(n \times 1\) vector, \(sigmoid( \cdot )\) is a sigmoid function, and \(\hat{y}(s(i))\) represents the predicted label of \(s(i)\) by GNNMF.
The BCEloss is the binary crossentropy loss, which is used to calculate the binary crossentropy loss between the predicted probability and the true label. Based on BCEloss, we define the iterloss as the loss function of GNNMF [33]:
where \(y(s(i))\) represents the true label of sequence \(s(i)\) (Formula (19)); \(\hat{y}(s(i))\) represents the predicted label of sequence \(s(i)\); and \(epoch\) represents the iteration, \(BCEloss(s(i))^{0} = 0\).
Find multiple motifs via coexisting probability between kmers
GNNMF calculates the mutual information (MI) \(mi(p,i)\) between \(k(p)\) and \(s(i)\) by using \(Esc(k( \cdot ))\) and \(Esq(s( \cdot ))\) (Fig. 1C). This study divided \(S\) into \(S^{0}\) and \(S^{1}\) by \(label = 1/0\), as well as \(n^{label}\), \(m^{label}\), \(s^{label} ( \cdot )\), \(Ks^{label} ( \cdot )\) and \(K^{label}\). GNNMF calculates the \(MI^{label}\) matrix by \(mi(p,i)\), where \({\kern 1pt} k(p) \in K^{label}\), \(s(i) \in s^{label} ( \cdot )\), \(i \in [1,...,n^{label} ]\). We define \(mean(MI^{0} )\) as the background noise of \(MI^{1}\) and calculate the denoised MI matrix \(dnMI^{1}\), where \(mean(MI^{0} ) = (\sum\limits_{p = 1}^{{m^{0} }} {\sum\limits_{i = 1}^{{n^{0} }} {mi^{0} (p,i)} )/(m^{0} \times n^{0} )}\) and \(dnmi^{1} (p,i) = mi^{1} (p,i)  mean(MI^{0} )\). We define the kmer seed set \(Kseed(i)\) that contains all unique \(k(p)\) in \(s^{1} (i)\), when \(dnmi^{1} (p,i) > 0\),\(i \in [1,...,n^{1} ]\). Then, we locate the interval of \(k(p)\) in \(s^{1} (i)\) as \(itk(p) = \left[ {strk(p),strk(p) + lenk  1} \right]\), where \(k(p) \in Kseed(i)\),\(strk(p)\) is the starting position of \(k(p)\) in \(s^{1} (i)\). If multiple \(k(p)\) exist in \(s^{1} (i)\), there will be multiple \(itk(p)\). We obtain two \(k( \cdot )\) centered on \(itk(p)\) as \(kl(p) = [ck(p)  lenk + 1,ck(p)]\) and \(kr(p) = [ck(p){ + }1,ck(p) + lenk{]}\), where \(ck(p) = strk(p) + \left\lceil {(lenk  1)/2} \right\rceil\) and \(kl(p),kr(p) \in K\).
We calculate the coexisting probability \(wco(kl(p),kr(j))\) and background probability \(bwco(kl(p),kr(p))\) between \(kl(p)\) and \(kr(p)\), respectively. If \(wco(kl(p),kr(p)) < bwco(kl(p),kr(p))\), which means that \(kl(p)\) and \(kr(p)\) are strongly related, GNNMF merges \(kl(p)\) and \(kr(p)\) to a TFBS as \(tfbs(p) = [ck(p)  lenk{ + }1,ck(p) + lenk]\). For all \(k( \cdot ) \in Kseed(i)\), we can find multiple candidates \(tfbs(i)\) in \(s^{1} (i)\). If multiple \(tfbs(i)\) overlap, they are merged into a longer TFBS in \(s^{1} (i)\). Finally, GNNMF can find multiple \(tfbs( \cdot )\) with different lengths.
Evaluation metrics
Our evaluation metrics include precision, recall, F1_score, ACC, specificity, MCC, AUC, PRC, and AEMR [34]. AEMR is the area of a radar chart, which can be generated that consists of eight equiangular spokes with each spoke representing one of the scores defined above [14]. The higher the AEMR score is, the better the performance of model for TFBS prediction.
All the evaluation metrics are calculated via the following formulas:
where TP, TN, FP and FN represent the number of the true positive (TP), true negative (TN), false positive (FP), and false negative (FN), respectively.
The AUC is the area under the receiver operating characteristic curve, and the PRC is the area under the precisionrecall curve, which is a value between 0 and 1.
Experiment settings
This study trained GNNMF for 30 epochs using the Adam optimizer [35] (Fig. 2). The hyperparameters of GNNMF included the learning rate, \(dsim\), \(dco\), \(djac\), \(dsq\) and \(lenk\), and their ranges were tested in this study (Supplementary Table S2). We utilized the grid search to choose the optimal parameters and applied the combinations of those hyperparameters to GNNMF model. The AUC is used to measure the performance of GNNMF with different combinations on the testing set. The learning rate of GNNMF was set as 0.001 and was adjusted by the natural exponential decay with 0.001. When we trained GNNMF on the training set, the sequence nodes in validation set and the testing set were masked. The \(dsim\), \(dco\) and \(djac\) were set as 50, and \(dsq\) was set as 150 in our experiment. Considering the computational complexity and the performance of GNNMF, we set \(lenk = 5\). The scFAN, DeepATAC, FactorNet, MMGraph, and MMGraph+jac models are used as comparison tools, where MMGraph+jac combines MMGraph and jaccard graph. The evaluation metrics, including precision, recall, F1_score, ACC, specificity, MCC, AUC, PRC, and AEMR, are utilized to evaluate models’ performance on TFBS prediction. To explain the GNNMF model, ATACseq motifs are used to show features that GNNMF learned. We matched the found motifs to motifs that the HOCOMOCO database contains via the TOMTOM tool.
Results
TFBS prediction
In this study, we utilized nine metrics to evaluate the performance of all the models on 200 human ATACseq datasets and 80 mouse ATACseq datasets (Fig. 3). FactorNeT, scFAN, DeepATAC, MMGraph, and MMGraph+jac were selected as comparison tools, because they achieved feasible performances in the previous study. In the light of results, AEMRs of models based on the GNN model are higher than those of the models based on the CNN model, which demonstrates that the GNN model is more suitable for ATACseq TFBS prediction. On 200 human ATACseq datasets, GNNMF achieved the highest average AEMR score (2.13). Compared to existing models, GNNMF improved the AEMR by 4.92%. Among models based on CNNs, DeepATAC obtained an AEMR of 1.59 , which is higher than those of scFAN and FactorNeT. On 80 mouse ATACseq datasets, GNNMF achieved consistent results on 200 human ATACseq datasets, and improved the AEMR by 6.81% (Supplementary Figure S1). On 280 ATACseq datasets, GNNMF yielded the highest AEMR score, which indicated that our improvement was efficient.
On 200 human ATACseq datasets, GNNMF obtained the highest scores for all eight metrics and achieved average precision, recall, F1_score, ACC, specificity (Supplementary Figure S2), MCC, AUC, and PRC values of 0.863, 0.846, 0.843, 0.845, 0.944, 0.709, 0.942, and 0.949, respectively. In particular, GNNMF improved the ACC and MCC by 2.13% and 5.08%, respectively. Moreover, on 80 mouse ATACseq datasets, GNNMF achieved average precision, recall, F1_score, ACC, specificity, MCC, AUC, and PRC values of 0.844, 0.831, 0.828, 0.830, 0.917, 0.676, 0.931, and 0.946, respectively (Supplementary Table S3). Compared to existing models, GNNMF improved the ACC and MCC by 2.36% and 6.35%, respectively.
Finding multiple ATACseq motifs
ATACseq can reveal all opening chromatin, which means that there are multiple motifs in an ATACseq dataset. In this study, we developed the GNNMF model to find ATACseq motifs by employing the GNN and coexisting probability. Existing DL models to find ATACseq motifs, including FactorNet, scFAN, DeepATAC, and MMGraph, were selecteds as comparison tools. We tested the above models on 280 ATACseq datasets including 200 human ATACseq datasets and 80 mouse ATACseq datasets. This study matches the found motifs to the HOCOMOCO database via the TOMTOM v5.1.0 tool. The pvalue less than 0.05 indicated that the found motifs are significant. All motifs that each model found are listed in Table 1, 401 human ATACseq motifs were found. Among all the models, GNNMF found the most ATACseq motifs. Moreover, we tested all the models on 80 mouse ATACseq datasets, and the number of motifs in which each model found is listed in Supplementary Table S4. The pvalue depicts the degree to which the found motifs are significant. The pvalues of the found motifs of each model are shown in the violin plot, and the median value shows the model’s ability to find ATACseq motifs (Fig. 4). GNNMF achieved the highest median value among all the models. Based on motif finding results, GNNMF found the most motifs among all the models. The GNNbased models outperformed the CNNbased models, which indicated that the GNN is a feasible algorithm for finding ATACseq motifs. GNNMF is based on MMGraph, which is improved via the Jaccard edge, background probability, and the iterloss function. Our results demonstrated that our improvements were conducive to finding ATACseq motifs.
Discussion
This study developed a GNN model called GNNMF by improving the MMGraph model. We improved the MMGraph by using the Jaccard edge, iterloss function, and background probability. GNNMF decomposed the multiview heterogeneous graph into four subgraphs, and employed a threelayer GNN model to predict TFBSs. The first layer of GNNMF was used to learn the embedding of kmer nodes via the similarity subgraph, Jaccard subgraph, and coexisting subgraph. The second layer was used for the embedding of sequence nodes via the inclusive subgraph, and the third layer was used to predict TFBSs. GNNMF utilizes embeddings of kmer nodes and sequence nodes to define kmer seeds and employs the coexisting probability to find multiple motifs with different lengths. Existing models to find ATACseq motifs, including FactorNet, scFAN, DeepATAC, MMGraph and MMGraph +jac, were selected as comparison tools. We tested all the models on 200 human ATACseq datasets and 80 mouse ATACseq datasets. We utilized the precision, recall, F1_score, specificity, ACC, MCC, AUC, and PRC to evaluate the models’ ability in TFBS prediction. In light of our results, GNNMF achieved the highest precision, recall, F1_score, ACC, specificity, MCC, AUC, and PRC of 0.863, 0.846, 0.843, 0.845, 0.944, 0.709, 0.942, and 0.949, respectively, on 200 human datasets. On 88 mouse datasets, our proposed model obtained the highest precision, recall, F1_score, ACC, specificity, MCC, AUC, and PRC values of 0.844, 0.831, 0.828, 0.830, 0.917, 0.676, 0.931, and 0.946, respectively. Moreover, the GNNbased model achieved higher AEMR scores than models based on CNN. Our results demonstrated that the GNN has more potential for TFBS prediction on ATACseq datasets. Compared to MMGraph+jac, GNNMF utilizes the iterloss to predict TFBSs. The AEMR of GNNMF was higher than that of MMGraph+jac, which indicated the iterloss has an advantage over BCEloss in TFBS prediction.
GNNMF utilized the embedding of kmer nodes and sequence nodes, and calculated coexisting probability between kmer nodes to find multiple motifs with different lengths. MMGraph and MMGraph+jac used the same way to set the background probability threshold, and they set the background probability threshold to 0.5. However, GNNMF defines the background probability threshold via negative sequences. Our results indicated that GNNMF found more and higher quality motifs than MMGraph and MMGraph+jac. Therefore, the background probability threshold of the negative sequences aided GNNMF in finding ATACseq motifs. FactorNet, scFAN, and DeepATAC are all based on CNNs, they found ATACseq motifs by convolutional kernels in the first layer. But these models only found the fixed length of motifs, and the found motifs are similar. However, compared with models based on CNNs, models based on GNNs have great advantages. We utilized all the models to find ATACseq motifs and used pvalue to evaluate the motifs’ quality. The results demonstrated that GNNMF is the best model for finding multiple ATACseq motifs.
GNNMF employs a threelayer GNN to predict whether given sequences are bound by TFs. Some novel algorithms can be applied in TFBS prediction, such as graph attention networks [36], graphGAN [37], and graph autoencoders [38]. In the heterogeneous graph, there are many relationships between nodes and each kind of relationship represents the importance between two nodes. The attention mechanism can allocate and update the different weights to the nodes and edges of the heterogeneous graph, during the training process of the model. Moreover, TFs bind indirectly to motifs of other TFs, which coregulate targeted gene expression [39]. The cooperation of TFs acts as a vital role in the process of human disease [40]. ATACseq data can detect openaccessible DNA regions by probing open chromatin, meaning that ATACseq data contain multiple TFs [41]. By analyzing ATACseq data, we revealed interactions between TFs, and explored the inducement of human disease. Therefore, GNNMF is a potential tool for studying the cooperation among different TFs.
Conclusions
In this study, we developed a novel model named GNNMF for finding multiple ATACseq motifs. GNNMF built the multiview heterogeneous graph by using ATACseq sequences, employed a threelayer of GNN to predict TFBSs, and utilized coexisting probability to find ATACseq motifs. We conducted experiments on 200 human and 80 mouse ATACseq datasets to analyze the effectiveness of the proposed method. Our evaluation metrics included precision, recall, F1_score, ACC, specificity, MCC, AUC, PRC, and AEMR. In particular, GNNMF improved the AEMR by 4.92% and 6.81% on 200 human and 80 mouse ATACseq datasets, respectively. Meanwhile, GNNMF found multiple motifs from ATACseq data through the coexisting probability between kmers. Regarding ATACseq data, our proposed method found more and higher quality motifs, which demonstrated methods based on coexisting probability of kmers are more efficient than DL models. As a result, GNNMF achieved better performance than a few stateoftheart methods. This study made great contributions to finding motifs from ATACseq data.
Availability of data and materials
The datasets are provided within Supplementary Table S1, which can be downloaded from the ENCODE project. GNNMF is available at https://github.com/zhangsq06/GNNMF.git.
Abbreviations
 ATACseq:

Assay for TransposaseAccessible Chromatin using sequencing
 TFBSs:

transcription factor binding sites
 DL:

Deep learning
 CNNs:

convolutional neural networks
 GNN:

Graph neural network
 TFs:

Transcription factors
 PWM:

position weight matrix
 ChIPseq:

Chromatin immunoprecipitation sequencing
 DNaseseq:

DNase I hypersensitive sites sequencing
 AEMR:

area of eight metrics radar
 ACC:

accuracy
 MCC:

Matthews correlation coefficient
 AUC:

Area under the receiver operating characteristic curve
 TFIDF:

Term frequencyinverse document frequency
References
Madan Babu M, Teichmann SA. Evolution of transcription factors and the gene regulatory network in Escherichia coli. Nucleic Acids Res. 2003;31(4):1234–44.
Joaquin Á, Watson R. Cell cycle regulation by the BMyb transcription factor. Cell Mol Life Sci. 2003;60:2389–401.
Vishnoi K, Viswakarma N, Rana A, Rana B. Transcription factors in cancer development and therapy. Cancers (Basel). 2020;12(8):1–32.
Wang Y, Zhang S, Ma A, Wang C, Ma Q. Assessing deep learning algorithms in cis regulatory motif finding based on genomic sequencing data. Briefings in Bioinformatics. 2020;23(1):1–10.
Brenner S, Wersinger C, Gasser T. Transcriptional regulation of the αsynuclein gene in human brain tissue. Neurosci Lett. 2015;599:140–5.
Sinha S. On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics. 2006;22(14):e454–63.
Bajic M, Maher KA, Deal RB. Identification of open chromatin regions in plant genomes using ATACSeq. Methods Mol Biol. 2018;1675:183–201.
Sun Y, Miao N, Sun T. Detect accessible chromatin using ATACsequencing, from principle to applications. Hereditas. 2019;156(1):1–9.
Ma S, Zhang Y. Profiling chromatin regulatory landscape: Insights into the development of ChIPseq and ATACseq. Mol Biomed. 2020;1:1–13.
Bentsen M, Goymann P, Schultheis H, Klee K, Petrova A, Wiegandt R, Fust A, Preussner J, Kuenne C, Braun T. ATACseq footprinting unravels kinetics of transcription factor binding during zygotic genome activation. Nat Commun. 2020;11(1):4267.
Li Z, Schulz MH, Look T, Begemann M, Zenke M, Costa IG. Identification of transcription factor binding sites using ATACseq. Genome Biol. 2019;20:1–21.
Bentsen M, Goymann P, Schultheis H, Klee K, Petrova A, Wiegandt R, Fust A, Preussner J, Kuenne C, Braun T. ATACseq footprinting unravels kinetics of transcription factor binding during zygotic genome activation. Nat Commun. 2020;11(1):1–11.
Youn A, Marquez EJ, Lawlor N, Stitzel ML, Ucar D. BiFET: sequencing Biasfree transcription factor Footprint Enrichment Test. Nucleic Acids Res. 2018;47(2):e11–e11.
Zhang S, Yang L, Wu X, Sheng N, Fu Y, Ma A, Wang Y. MMGraph: a multiple motif predictor based on graph neural network and coexisting probability for ATACseq data. Bioinformatics. 2022;38(19):4636–8.
Ouyang N, Boyle AP. TRACE: transcription factor footprinting using chromatin accessibility data and DNA sequence. Genome Res. 2020;30(7):1040–6.
Schep AN, Wu B, Buenrostro JD, Greenleaf WJ. chromVAR: inferring transcriptionfactorassociated accessibility from singlecell epigenomic data. Nat Methods. 2017;14(10):975–8.
Fang R, Preissl S, Li Y, Hou X, Lucero J, Wang X, Motamedi A, Shiau AK, Zhou X, Xie F. Comprehensive analysis of single cell ATACseq data with SnapATAC. Nat Commun. 2021;12(1):1337.
Lei H, Quan C. A Shortest Dependency Path Based Convolutional Neural Network for ProteinProtein Relation Extraction. BioMed Res Int, 2016, (2016714). 2016, 2016:19.
Wang Y, Zhang S, Yang L, Yang S, Ma Q. Measurement of conditional relatedness between genes using fully convolutional neural network. Front Genet. 2019;10:1009.
Yang J, Ma A, Hoppe AD, Ang CW, Li Y, Zhang C, Wang Y, Liu B, Ma Q. Prediction of regulatory motifs from human Chipsequencing data using a deep learning framework. Nucleic Acids Res. 2019;15:15.
Fu L, Zhang L, Dollinger E, Peng Q, Nie Q, Xie X. Predicting transcription factor binding in single cells through deep learning. Science Advances. 2020;6(51):eaba9031.
Quang D, Xie XS. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotideresolution sequential data. Methods. 2019;166:40–47.
Scarselli F, Yong SL, Gori M, Hagenbuchner M, Tsoi AC, Maggini M. Graph Neural Networks for Ranking Web Pages. 2005.
Zhang S, Liu Y, Xie L: Molecular MechanicsDriven Graph Neural Network with Multiplex Graph for Molecular Structures. 2020.
Zhang D, Kabuka M. Multimodal deep representation learning for protein interaction identification and protein family classification. BMC Bioinformatics. 2019;20(16):1–14.
Jha K, Saha S, Singh H. Prediction of proteinprotein interaction using graph neural networks. Sci Rep. 2022;12(1):1–12.
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2.
Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA and RNAbinding proteins by deep learning. Nat Biotechnol. 2015;33:831–8.
Grabowski S, Kowalski TM: Algorithms for all‐pairs Hamming distance based similarity. Software: Practice and Experience 2021.
Mu’tasem J, Salim N. Stock market prediction based on term frequencyinverse document frequency. J Econ Bus Mgmt. 2016;4(3):183–7.
Norouzi M, Fleet DJ, Salakhutdinov RR. Hamming distance metric learning. Adv Neural Inform Process Syst. 2012;25:1–9.
Yuntao Z, Ling G, Yongcheng W. An improved TFIDF approach for text classification. J Zhejiang UnivSci A. 2005;6(1):49–55.
Malhotra R, Shakya A, Ranjan R, Banshi R. Software defect prediction using binary particle swarm optimization with binary cross entropy as the fitness function. In: Journal of Physics: Conference Series: 2021: IOP Publishing; 2021: 012003.
Mancini A, Vito L, Marcelli E, Piangerelli M, De Leone R, Pucciarelli S, Merelli E. Machine learning models predicting multidrug resistant urinary tract infections using “DsaaS.” BMC Bioinformatics. 2020;21(10):1–12.
Mehta S, Paunwala C, Vaidya B. CNN based traffic sign classification using Adam optimizer. 2019 international conference on intelligent Computing and Control Systems (ICCS). 2019:1293–8.
Velikovi P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph Attention Networks. 2017.
Wang H, Jia W, Wang J, Miao Z, Guo M: GraphGAN: Graph Representation Learning with Generative Adversarial Nets. IEEE Transactions on Knowledge and Data Engineering. 2017, PP(99).
Hong C, Chen L, Liang Y, Zeng Z. Stacked Capsule Graph Autoencoders for geometryaware 3D head pose estimation. Comp Vision Image Understand. 2021;1:103224.
Nie Y, Shu C, Sun X. Cooperative binding of transcription factors in the human genome. Genomics. 2020;112(5):3427–34.
Di Malta C, Cinque L, Settembre C. Transcriptional regulation of autophagy: Mechanisms and diseases. Front Cell Dev Biol. 2019;7(114).
Doganli C, Sandoval M, Thomas S, Hart D. Assay for TransposaseAccessible Chromatin with HighThroughput Sequencing (ATACSeq) Protocol for Zebrafish Embryos. Methods Mol Biol. 2017;1507:59.
Acknowledgements
We gratefully acknowledge generous support for this research from our team.
Funding
This work was supported by the Young Scientists Fund of the National Natural Science Foundation of China [62302218, 32300523], the National Natural Science Foundation of China [62072212].
Author information
Authors and Affiliations
Contributions
SZ conceived the idea. SZ and XW processed the data and conducted simulation and real dataset experiments. CZ and SZ wrote the manuscript. ZL and YW revised the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Humans, animals or plants have not been directly used in this study.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhang, S., Wu, X., Lian, Z. et al. GNNMF: a multiview graph neural network for ATACseq motif finding. BMC Genomics 25, 300 (2024). https://doi.org/10.1186/s12864024102180
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12864024102180