SDNN-PPI: self-attention with deep neural network effect on protein-protein interaction prediction
BMC Genomics volume 23, Article number: 474 (2022)
Protein-protein interactions (PPIs) dominate intracellular molecules to perform a series of tasks such as transcriptional regulation, information transduction, and drug signalling. The traditional wet experiment method to obtain PPIs information is costly and time-consuming.
In this paper, SDNN-PPI, a PPI prediction method based on self-attention and deep learning, is proposed. The method adopts amino acid composition (AAC), conjoint triad (CT), and auto covariance (AC) to extract global and local features of protein sequences, and leverages self-attention to enhance DNN feature extraction so that PPIs are predicted more effectively. To verify the generalization ability of SDNN-PPI, 5-fold cross-validation is performed on the intraspecific interaction datasets of Saccharomyces cerevisiae (core subset) and human, where the accuracy reaches 95.48% and 98.94%, respectively. Accuracies of 93.15% and 88.33% are obtained on the interspecific interaction datasets of human-Bacillus anthracis and human-Yersinia pestis, respectively. On the independent datasets Caenorhabditis elegans, Escherichia coli, Homo sapiens, and Mus musculus, the prediction accuracy is 100% in every case, which is higher than that of previous PPI prediction methods. To further evaluate the model, it is applied to the one-core and crossover networks, and the results show that it correctly predicts the interaction pairs in both networks.
In this paper, the AAC, CT and AC methods are used to encode protein sequences, and the SDNN-PPI method, based on a self-attention deep neural network, is proposed to predict PPIs. Satisfactory results are obtained on interspecific and intraspecific datasets, and good performance is also achieved in cross-species prediction. The method also correctly predicts the protein interactions, carrying cell and tumor information, contained in the one-core network and the crossover network. The SDNN-PPI proposed in this paper not only explores the mechanism of protein-protein interaction, but also provides new ideas for drug design and disease prevention.
Proteins are organic macromolecules made up of amino acids; they are essential components of cells and sustain life activities. They play an important role in biology because PPIs link various important physiological activities of cells, enabling a range of life activities such as apoptosis and immune response. In recent years, a large number of high-throughput experimental methods have emerged to study PPIs, such as yeast two-hybrid screening, mass spectrometry, hybridization methods, immunoprecipitation and protein microarrays. However, all of these are based on biological and chemical experiments, which require substantial manpower, funding and time. Therefore, artificial intelligence-based computational methods have emerged in bioinformatics [7, 8] and have become prevalent in predicting the interaction of proteins with other biological macromolecules [9, 10]. For PPIs in particular, abundant amino acid sequence data are available, which is sufficient to build computational PPI prediction models, and a growing number of researchers have been attracted to these methods. Sequence-based PPI prediction consists of two basic parts: a protein encoding method and a machine learning model.
With the rapid development of machine learning techniques [12–14] and the refinement of neural networks [15–18], several sequence-based machine learning models have been presented for PPIs prediction. Shen et al. first employed conjoint triad (CT) to extract features from protein sequences and predicted PPIs with a support vector machine model incorporating a kernel function, reaching 83.9% accuracy. Guo et al. proposed auto covariance (AC) to extract information from protein sequences and used a support vector machine model to predict PPIs in the Saccharomyces cerevisiae dataset with 88.09% accuracy. Yang et al. proposed local descriptors (LD) to represent protein sequences and successfully predicted potential PPIs on the Saccharomyces cerevisiae (core subset) dataset with a k-nearest neighbors model. You et al. utilized four categories of protein sequence information (AC, CT, LD, MAC) to encode proteins as feature vectors with a focus on dimensionality reduction, and proposed a new hierarchical PCA-EELM (principal component analysis-integrated extreme learning machine) model to predict protein interactions. In 2014, Barman et al. used support vector machine, naive Bayes and random forest models with 5-fold cross-validation to predict host-pathogen interactions. In 2016, An et al. proposed a new computational method called RVM-BiGP, combining the relevance vector machine (RVM) model and Bi-gram probabilities (BiGP), to efficiently handle imbalanced protein interaction datasets. In 2018, Goktepe et al. adopted PCA to fuse PSSM, Bi-gram, AAC, pseudo-amino acid composition (PseAAC) and weighted jump-order joint triple to obtain approximate features, and then used SVM to complete PPIs prediction. Song et al. used the position specific scoring matrix (PSSM) to obtain evolutionary information and proposed a new feature fusion algorithm that combines discrete cosine transform (DCT), fast Fourier transform (FFT) and singular value decomposition (SVD). In 2019, Chen et al. selected features from PseAAC, autocorrelation descriptor (AD), CT and LD with an elastic net, and predicted PPIs in several datasets with LightGBM. In 2020, Yu et al. proposed a combination of PseAAC, pseudo-position-specific scoring matrix (PsePSSM), reduced sequence and index-vectors (RSIV), and AD to encode protein sequences, and predicted potential PPIs on the Saccharomyces cerevisiae (core subset) dataset with the GTB-PPI model.
Although machine learning methods can make predictions based on best-fitting models, they remain limited in their ability to learn deep-level features effectively. In recent years, deep learning architectures [8, 29–32] have provided strong support for solving relevant problems in bioinformatics. In 2017, Wang et al. extracted protein sequence features from PSSM and reconstructed them through a stacked auto-encoder, after which prediction was completed with a new probabilistic classification vector machine (PCVM). Du et al. proposed a deep neural network model, DeepPPI, to improve the performance of PPIs prediction using AAC, dipeptide composition (DC), LD and other protein descriptors, and demonstrated the superiority of the model on several datasets. Wang et al. combined deep neural networks (DNNs) with a new local composition ternary description (LCTD) feature representation and proposed the DNN-LCTD method, which predicted PPIs on the Saccharomyces cerevisiae (core subset) dataset with an accuracy of 93.12%. In 2018, Hashemifar et al. combined deep Siamese-like convolutional neural networks with random projection to construct the DPPI model, which predicts PPIs from protein evolutionary information. In 2019, Zhang et al. proposed a deep model called EnsDNN, which extracted protein interaction information from AC, LD and multi-scale continuous and discontinuous local descriptors (MCD) and achieved 95.29% accuracy on the Saccharomyces cerevisiae (core subset) dataset. You et al. proposed a highly efficient method to detect PPIs by integrating a new protein sequence substitution matrix feature representation with an ensemble weighted sparse representation classifier. Yao et al. designed a new protein sequence representation method, Res2vec, and combined effective feature embedding with deep learning techniques to develop the DeepFE-PPI framework, which achieved good performance in PPIs prediction. In 2020, Li et al. represented proteins using AC, CT, LD and PseAAC, and built an ensemble model to complete PPIs prediction. In 2021, Yu et al. used PseAAC, AD, multivariate mutual information (MMI), composition-transition-distribution (CTD), amino acid composition PSSM (AAC-PSSM), and dipeptide composition PSSM (DPC-PSSM) to construct the GcForest-PPI model.
Inspired by the above discussion, this paper proposes a protein-protein interaction prediction method, SDNN-PPI. First, protein sequence information is encoded with AAC, CT, and AC. Second, to carry out effective feature extraction, a deep neural network is combined with self-attention to adjust the weights of the sequence and further emphasize the key features, so that the resulting network model fully extracts protein sequence information. Finally, 5-fold cross-validation is applied to 2 intraspecific and 2 interspecific datasets, and the model is further tested on 4 independent datasets, all of which yield high accuracy. To further evaluate the merits of the model, the effectiveness of the method is tested on the one-core network and the crossover network. The experimental results show that SDNN-PPI outperforms other state-of-the-art methods and is highly competitive.
Materials and methods
In this study, multiple high-confidence PPI datasets were used to measure the performance of SDNN-PPI, including the intraspecific datasets Saccharomyces cerevisiae core subset (S.cerevisiae core subset) and Human, and the interspecific datasets Human-Bacillus anthracis (Human-B.anthracis) and Human-Yersinia pestis (Human-Y.pestis). The composition of the four datasets is shown in Table 1. In addition, four independent datasets, Caenorhabditis elegans (C.elegans), Escherichia coli (E.coli), Homo sapiens (H.sapiens) and Mus musculus (M.musculus), are used to test PPIs prediction, and the predictive performance of the method is further validated on two significant PPI networks. One is the one-core CD9 network, which contains 16 PPIs, and the other is the crossover network, which consists of 96 PPIs. A further dataset, Saccharomyces cerevisiae, is used in the independent test experiments. To keep the positive and negative samples balanced, negative samples were randomly selected in the same number as the positive samples, so that the ratio of positive to negative samples was 1:1.
Feature extraction techniques
Protein sequences differ in length, whereas the input to the neural network used in the experiment has a fixed size. Protein sequences of different lengths therefore have to be transformed into feature vectors of fixed length before they are fed into the network layers. In this paper, a feature fusion strategy is used to convert protein sequences into feature vectors based on AAC, CT and AC. AAC has the advantage of capturing the proportion of each amino acid in the entire protein sequence from a global perspective. CT regards any three consecutive amino acids as a unit and takes the characteristics of each amino acid and its adjacent amino acids into consideration, but ignores the information in discontinuous amino acid fragments. AC, which is based on physicochemical properties, extracts not only discontinuous fragment information but also the interaction features of long-distance amino acids by considering the proximity effect of amino acids. In summary, this method extracts global amino acid features through AAC, uses CT to compensate for the lack of short-range amino acid interactions in AAC, and uses AC to extract the local features of amino acids with proximity effects. The resulting protein information is more comprehensive and provides strong support for the downstream feature extraction.
Amino acid composition (AAC)
The amino acid composition method normalizes the frequency of occurrence of each amino acid in the protein and is a concise protein feature extraction method. Specifically, the frequency of each of the twenty amino acids in a protein sequence is counted, and each protein sequence is converted into a 1 × 20-dimensional feature vector. The feature extraction formula is as follows:
$$ f(x) = \frac{n_x}{N} \tag{1} $$
where $n_x$ represents the number of occurrences of amino acid $x$ in the protein sequence and $N$ represents the total number of amino acids in the protein.
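The AAC encoding can be illustrated with a minimal Python sketch. The function name `aac_features` and the choice to skip non-standard residue symbols before counting are assumptions made for this illustration, not details taken from the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aac_features(sequence: str) -> list[float]:
    """Return the 20-dimensional amino acid composition vector n_x / N."""
    # Assumption: non-standard symbols (e.g. X, B, Z) are dropped before counting.
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    # One frequency per amino acid, in the fixed order of AMINO_ACIDS.
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    vector = aac_features("MKVLAAGA")
    print(len(vector), vector[0])  # vector[0] is the frequency of alanine
```

Because every entry is a count divided by the same total, the 20 values always sum to 1 regardless of sequence length.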
Conjoint triad (CT)
The conjoint triad method takes an amino acid together with its left and right neighbours as a unit, and divides the 20 amino acids into 7 clusters according to the dipoles and volumes of the amino acid side chains (as shown in Table 2). Amino acids belonging to the same cluster are treated as identical. The resulting feature is therefore a 343-dimensional vector (7 × 7 × 7) of normalized triad frequencies. The formula is:
$$ f(C) = \frac{N_C}{N - 2} \tag{2} $$
where $C$ represents a triad, $N_C$ represents the number of occurrences of this triad, $N$ represents the number of amino acids in the protein, and the denominator reflects that a protein sequence of length $N$ contains $N - 2$ triads.
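A short Python sketch of the CT encoding follows. Table 2 is not reproduced in this text, so the seven clusters below are the grouping commonly given for the conjoint triad method of Shen et al. and are assumed to match it; the function name `ct_features` is also illustrative.

```python
from itertools import product

# Assumed clustering (commonly used for conjoint triad); Table 2 may differ.
CLUSTERS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLUSTER_OF = {aa: idx for idx, group in enumerate(CLUSTERS) for aa in group}


def ct_features(sequence: str) -> list[float]:
    """Return the 343-dimensional conjoint triad vector N_C / (N - 2)."""
    # Map each residue to its cluster index; unknown symbols are skipped.
    classes = [CLUSTER_OF[aa] for aa in sequence.upper() if aa in CLUSTER_OF]
    n_triads = len(classes) - 2
    if n_triads < 1:
        raise ValueError("need at least three standard residues")
    all_triads = list(product(range(7), repeat=3))  # 7 * 7 * 7 = 343 triads
    counts = dict.fromkeys(all_triads, 0)
    # Slide a window of three consecutive residues along the sequence.
    for i in range(n_triads):
        counts[tuple(classes[i:i + 3])] += 1
    return [counts[triad] / n_triads for triad in all_triads]


if __name__ == "__main__":
    print(len(ct_features("MKVLAAGADERC")))
```

Reducing the alphabet to 7 clusters keeps the vector at 343 entries; counting triads over all 20 amino acids directly would need 8000.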
Auto covariance (AC)
The auto covariance method mainly considers the proximity effect of amino acids, that is, the interaction between an amino acid and a fixed number of surrounding amino acids. The interaction is described by seven physicochemical properties: hydrophobicity (H1), hydrophilicity (H2), net charge index (NCI), polarity (P1), polarizability (P2), solvent-accessible surface area (SASA), and side-chain volume. Each residue in the sequence is replaced by its values for the seven properties, which are normalized to zero mean and unit standard deviation (SD), as shown in Formula (3):
$$ f'_{i,j} = \frac{f_{i,j} - \bar{f}_j}{S_j} \tag{3} $$
where $f_{i,j}$ represents the value of the $j$-th property of the $i$-th amino acid, $\bar{f}_j$ represents the average value of the $j$-th property over the 20 amino acids, and $S_j$ represents the corresponding standard deviation. The formula for calculating AC is as follows:
$$ AC(lag, j) = \frac{1}{N - lag} \sum_{i=1}^{N - lag} \left( f'_{i,j} - \frac{1}{N} \sum_{k=1}^{N} f'_{k,j} \right) \left( f'_{i+lag,j} - \frac{1}{N} \sum_{k=1}^{N} f'_{k,j} \right) \tag{4} $$
where $f'_{i,j}$ now denotes the normalized value of property $j$ at sequence position $i$, $lag$ represents the distance between residues, and $N$ represents the length of the protein sequence. In this paper, $j$ ranges over the 7 physicochemical properties and the maximum lag is set to 30, which gives 7 × 30 = 210 AC features. This choice of lag avoids two extremes: a lag that is too small covers only very close residues and makes it difficult to capture useful sequence features, whereas a lag that is too large introduces noise.
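The normalization and auto covariance steps can be sketched in Python as below. The property values in `TOY_TABLE` are synthetic placeholders, not real physicochemical scales, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used. The function names are illustrative.

```python
from statistics import mean, pstdev

AAS = "ACDEFGHIKLMNPQRSTVWY"
# Synthetic two-property table for illustration only (NOT real property scales).
TOY_TABLE = {aa: [float(k), float((3 * k) % 5)] for k, aa in enumerate(AAS)}


def normalize_properties(table):
    """Scale each property to zero mean and unit SD across the amino acids."""
    residues = sorted(table)
    n_props = len(table[residues[0]])
    scaled = {aa: [] for aa in residues}
    for j in range(n_props):
        column = [table[aa][j] for aa in residues]
        mu, sd = mean(column), pstdev(column)  # assumption: population SD
        for aa in residues:
            scaled[aa].append((table[aa][j] - mu) / sd)
    return scaled


def ac_features(sequence, scaled, max_lag=30):
    """Return AC(lag, j) for lag = 1..max_lag and every property j."""
    values = [scaled[aa] for aa in sequence.upper() if aa in scaled]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    features = []
    for j in range(len(values[0])):
        track = [v[j] for v in values]
        centre = mean(track)  # mean of property j along this sequence
        for lag in range(1, max_lag + 1):
            total = sum((track[i] - centre) * (track[i + lag] - centre)
                        for i in range(n - lag))
            features.append(total / (n - lag))
    return features


if __name__ == "__main__":
    scaled = normalize_properties(TOY_TABLE)
    print(len(ac_features("ACDEFGHIKL", scaled, max_lag=3)))
```

With seven properties and `max_lag=30`, the same function returns 210 values, matching the AC dimension used in the paper.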
PPIs model based on self-attention combined with deep neural network
A simple neural network receives data at the input layer, transforms the data through multiple hidden layers, and finally computes the result at the output layer. Each neuron in a hidden or output layer is connected to all neurons in the previous layer, as shown in Fig. 1A. Each neuron computes a weighted sum of its inputs and applies a nonlinear activation function to compute its output f(x) (Fig. 1B). The most commonly used activation function is the Rectified Linear Unit (ReLU), which sets negative signals to 0 and lets positive signals pass unchanged. The deep neural network (DNN) is an artificial neural network inspired by the neural network of the brain; it consists of multiple interconnected computing units (neurons) and extracts high-level abstractions from data. With its powerful feature extraction ability, DNN is widely used in speech recognition, PPIs prediction, and other fields. A DNN takes the received data as input, transforms it in a non-linear way, and produces the result at the last layer [43, 44]. To avoid overfitting, a dropout layer is also added to drop some neurons during training, as shown in Fig. 2.
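As a small illustration of the neuron computation described above, the following Python sketch evaluates one fully connected layer with ReLU activation. The weights and inputs are arbitrary example numbers, and the function names are chosen for this sketch only.

```python
def relu(x: float) -> float:
    """Rectified Linear Unit: negative inputs become 0, positive ones pass through."""
    return x if x > 0.0 else 0.0


def dense_relu(inputs, weights, biases):
    """One fully connected layer: every neuron sees all inputs of the previous layer."""
    return [relu(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


if __name__ == "__main__":
    # Two neurons, two inputs; values are arbitrary.
    print(dense_relu([1.0, -2.0], [[1.0, 1.0], [2.0, -1.0]], [0.5, 0.0]))
```

In this example the first neuron's weighted sum is negative and is clipped to 0, while the second neuron's positive sum passes through unchanged.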
The self-attention mechanism (Fig. 3) is a model framework proposed by the Google team in 2017. It reduces the dependence on external information and is better at capturing the internal correlations of data or features, especially long-distance dependencies. As shown in Fig. 3, the weights are obtained by calculating the similarity of Q and K after linear transformation, the softmax function is then used to normalize the weights, and finally the attention output is obtained by applying the weights to V. The output of the self-attention module is thus the weighted sum of the feature vectors over all amino acids, and its core formula is:
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V \tag{5} $$
where $\sqrt{d_k}$ is the scaling factor that controls the magnitude of the dot product, and Q, K and V represent the query, key and value of the amino acids, respectively.
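The scaled dot-product attention described above can be sketched in plain Python for small matrices. In a trained model Q, K and V are learned linear projections of the input; here they are passed in directly, which is a simplification for illustration.

```python
import math


def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    shift = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    scale = math.sqrt(len(K[0]))  # sqrt(d_k), with d_k the key dimension
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output


if __name__ == "__main__":
    Q = [[0.0, 0.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[2.0, 4.0], [6.0, 8.0]]
    print(scaled_dot_product_attention(Q, K, V))
```

In the usage example the query is equally similar to both keys, so the output is simply the average of the two value vectors.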
Based on the excellent performance of deep neural networks and the attention mechanism, this paper proposes a DNN that combines multiple fully connected layers with self-attention to predict PPIs, named SDNN-PPI. Deep networks are good at synthesizing various kinds of information, but as the number of layers increases, the risk of overfitting rises and the focus on key data weakens. Therefore, self-attention is used in the feature extraction layer to attend dynamically to the key residues in the sequence, adjust the weights, and capture the features of single residues, which supports the prediction process and helps to avoid the local optima caused by DNN overfitting. In addition, self-attention has a strong ability to extract internal features and is widely used to capture long-range dependencies between tokens in sequential data. In the prediction stage, the self-attention mechanism is therefore used to enhance the feature extraction of protein pairs and to further exploit the potential relationships between residues, so that more accurate information is obtained. The SDNN-PPI model is shown in Fig. 4. In addition to the input layer, it mainly includes three modules, namely the feature extraction layer, the feature fusion layer, and the PPIs prediction layer.
(1) Input layer: The model is based on two proteins (P1, P2) as input, and converts the protein sequences into feature vectors through the three encoding methods of AAC, CT, and AC. Finally, each protein sequence is encoded into a vector with dimension of 573, which consists of 20 AAC features, 343 CT features, and 210 AC features respectively.
(2) Feature extraction layer: SDNN-PPI is composed of two channels, each of which extracts the hidden information of one protein. Each channel is composed of six fully connected layers (1024-512-256-128-64-32) together with a self-attention layer that adjusts the global weight of the sequence. To avoid vanishing gradients and overfitting, Batch Normalization and Dropout layers are added after each dense layer. The formula is expressed as:
Where P represents the feature vector of protein sequence, and f represents the output through the full connection layer.
(3) Feature fusion layer: The feature fusion layer connects the protein information (F1’, F2’) obtained by the two channels from the feature extraction layer. The formula is expressed as:
(4) Prediction layer: The prediction layer is composed of three fully connected layers (32-16-8) and a self-attention layer. The self-attention layer, which is placed after the first dense layer, helps the model explore the relationships within protein pairs. A final single neuron with a sigmoid activation function then converts the input from the previous layer into an output score. The formula is as follows:
where s denotes dense layer with one unit activated by sigmoid function.
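The layer sizes listed in steps (1) to (4) can be traced with a small Python sketch that follows the vector dimension through the network and counts the dense-layer parameters. It assumes that feature fusion is a plain concatenation of the two channel outputs, and it ignores the self-attention, Batch Normalization and Dropout layers, whose exact configuration is not given in the text. The helper `dense_parameters` is hypothetical.

```python
FEATURE_DIM = 20 + 343 + 7 * 30               # AAC + CT + AC (7 properties, lag 30)
CHANNEL_LAYERS = [1024, 512, 256, 128, 64, 32]  # one feature-extraction channel
PREDICTION_LAYERS = [32, 16, 8, 1]              # last unit is the sigmoid output


def dense_parameters(input_dim, layer_sizes):
    """Return (weights + biases, output width) for a stack of dense layers."""
    total, width = 0, input_dim
    for size in layer_sizes:
        total += width * size + size  # weight matrix plus bias vector
        width = size
    return total, width


channel_params, channel_out = dense_parameters(FEATURE_DIM, CHANNEL_LAYERS)
fused_dim = 2 * channel_out  # assumption: fusion concatenates both channels
head_params, head_out = dense_parameters(fused_dim, PREDICTION_LAYERS)

if __name__ == "__main__":
    print("input dimension per protein:", FEATURE_DIM)
    print("channel output:", channel_out, "fused:", fused_dim)
    print("dense parameters per channel:", channel_params)
    print("dense parameters in prediction layers:", head_params)
```

Under these assumptions each protein enters as a 573-dimensional vector, leaves its channel as 32 values, and the fused 64-dimensional pair vector is reduced to a single interaction score.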
The following metrics are used in this article: accuracy (ACC), sensitivity (Sens), specificity (Spec), precision (Prec), Matthews correlation coefficient (MCC), and the area under the ROC curve (AUC). They are used to assess the feasibility and robustness of PPI prediction methods. The definitions are as follows:
$$ ACC = \frac{TP + TN}{TP + TN + FP + FN} $$
$$ Sens = \frac{TP}{TP + FN} $$
$$ Spec = \frac{TN}{TN + FP} $$
$$ Prec = \frac{TP}{TP + FP} $$
$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$
where TP (True Positive) is the number of interacting protein pairs correctly predicted as interacting, TN (True Negative) is the number of non-interacting protein pairs correctly predicted as non-interacting, FP (False Positive) is the number of non-interacting protein pairs predicted as interacting, and FN (False Negative) is the number of interacting protein pairs predicted as non-interacting.
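The threshold-based metrics above can be computed directly from the four confusion-matrix counts, as in the following Python sketch. The function name and the example counts are illustrative; AUC is omitted because it requires the predicted scores rather than the counts. The sketch assumes that both classes occur and that at least one positive prediction is made, so that no denominator is zero.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute ACC, Sens, Spec, Prec and MCC from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "Sens": tp / (tp + fn),
        "Spec": tn / (tn + fp),
        "Prec": tp / (tp + fp),
        # By convention MCC is reported as 0 when its denominator vanishes.
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


if __name__ == "__main__":
    # Arbitrary example counts, not results from the paper.
    for name, value in classification_metrics(90, 80, 20, 10).items():
        print(f"{name}: {value:.4f}")
```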
In order to demonstrate the statistical significance of SDNN-PPI, the kappa coefficient is also reported. The kappa coefficient measures the consistency of two variables and can be used to evaluate classification accuracy. Its value usually lies between 0 and 1: a value between 0.0 and 0.20 indicates slight agreement, 0.21 to 0.40 fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and a value above 0.81 almost perfect agreement. Its calculation formula is as follows:
$$ \kappa = \frac{p_0 - p_e}{1 - p_e} $$
where $p_0$ is the accuracy (the observed agreement) and $p_e$ is the agreement expected by chance.
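For a binary classifier, the kappa coefficient can be computed from the same confusion-matrix counts. The sketch below uses the standard chance-agreement term for Cohen's kappa, which is assumed here because the paper's own expression for the chance term is not reproduced in this text; the interpretation bands follow the ranges quoted above. Function names and example counts are illustrative.

```python
def cohen_kappa(tp: int, tn: int, fp: int, fn: int) -> float:
    """Cohen's kappa for a 2 x 2 confusion matrix."""
    n = tp + tn + fp + fn
    p0 = (tp + tn) / n  # observed agreement, i.e. accuracy
    # Chance agreement: product of predicted and actual marginals per class.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (p0 - pe) / (1 - pe)  # undefined if pe == 1 (single-class data)


def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to the agreement bands quoted in the text."""
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    k = cohen_kappa(90, 80, 20, 10)  # arbitrary example counts
    print(round(k, 3), interpret_kappa(k))
```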
Results and discussion
This part evaluates and discusses the performance of the model. Firstly, the encoding method used in this work is selected and shown to achieve good results. Secondly, the framework of the model is determined. Then, the results on the two intraspecific and two interspecific datasets are presented. Subsequently, SDNN-PPI is compared with existing advanced algorithms on the intraspecific and interspecific datasets to evaluate the validity of the model. Then, four independent datasets are used to demonstrate the robustness of the model. Finally, the PPI networks further demonstrate the potential capability of the model in predicting disease development.
Encoding method selection
In this paper, the AAC, CT and AC encoding methods were used to construct 573-dimensional feature vectors that encode proteins and capture both global and local features. LD, which encodes each protein sequence into a 630-dimensional vector, can also be used to encode the local characteristics of proteins. To verify the encoding scheme, LD was therefore included in our experiments as an additional optional encoding method for protein pairs, and the S.cerevisiae (core subset) dataset was used to search for the best encoding combination based on the experimental results of the model. To exclude the influence of the SDNN-PPI architecture itself on the results, a standard and very concise two-channel self-attention model was used for encoding selection. The input proteins A and B were encoded into feature vectors by the method given in the first column of Table 3. The two vectors were then fed into two identical feature extraction layers, which used only the self-attention mechanism. The two extracted protein vectors were fused, and the final result was obtained from fully connected prediction layers. As shown in Table 3, compared with the other 10 combination schemes, the AAC+CT+AC encoding scheme achieved the best results on all 6 evaluation indicators. Adding LD to the AAC+CT+AC scheme did not improve the results, possibly because LD does not extract features accurately enough from excessively long protein sequences.
Model ablation experiment
To verify the effect of different network structures on the performance of SDNN-PPI, two network structures were first designed: (a) a dual-channel network that extracts the information of each protein separately (DNN-PPI a), which is the SDNN-PPI model without the self-attention part, and (b) a single-channel network in which the two proteins are concatenated directly (DNN-PPI b). As can be seen from the first two lines of Table 4, the dual-channel model was superior to the single-channel model: the ACC, Spec, Sens, Prec, MCC, and AUC values of DNN-PPI a were 3.12%, 2.79%, 5.84%, 2.92%, 6.22%, and 1.49% higher than those of DNN-PPI b, respectively. Secondly, with the dual-channel model fixed, the contribution of self-attention was studied by varying one component at a time: (c) self-attention added in the feature extraction layer (SDNN-PPI a), (d) self-attention added in the prediction layer (SDNN-PPI b), (e) self-attention added in both the feature extraction layer and the prediction layer (SDNN-PPI), and (f) the dual-channel network without self-attention (DNN-PPI a). The S.cerevisiae (core subset) dataset was used to evaluate all of these networks. As shown in Table 4, SDNN-PPI performed best, so this model was chosen as the final framework.
Performance of the SDNN-PPI
When a model is trained on a dataset, an unreasonable division of the data easily leads to overfitting. Compared with the traditional division into one fixed training set and one fixed test set, cross-validation reduces this problem, so this paper uses 5-fold cross-validation to evaluate the model. The experimental data are randomly divided into 5 parts; in each round, 4 parts are used as the training set and the remaining part as the test set, so that every part serves as the test set once, and the results on the 5 test sets are averaged. Tables 5, 6, 7 and 8 present the cross-validation results of this method. In addition, the performance of this method is compared with several advanced methods, and the results are shown in Tables 10, 11, 12 and 13.
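The 5-fold procedure can be sketched with the Python standard library as follows. The function name, the fixed random seed and the round-robin assignment of shuffled samples to folds are choices made for this illustration and are not taken from the paper's code.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, test_indices); each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random division of the data
    folds = [indices[i::k] for i in range(k)]  # k parts of near-equal size
    for i in range(k):
        # The i-th part is the test set, the other k - 1 parts form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


if __name__ == "__main__":
    for round_no, (train, test) in enumerate(k_fold_indices(23), start=1):
        print(round_no, len(train), len(test))
```

The reported score is then the mean of the metric over the five test folds.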
As can be seen from Table 5, SDNN-PPI has excellent prediction performance on the intraspecific datasets. The average ACC, Spec, Sens, Prec, MCC and AUC on S.cerevisiae (core subset) are 95.48%, 97.23%, 93.80%, 97.13%, 91.02% and 98.63%, respectively. Similarly, the average results on the Human dataset are ACC 98.94%, Spec 99.10%, Sens 98.77%, Prec 99.02%, MCC 97.57%, and AUC 99.60%, as shown in Table 6. For the interspecific datasets, as shown in Tables 7 and 8, SDNN-PPI achieves 93.15% and 88.33% accuracy on Human-B.anthracis and Human-Y.pestis, respectively. These experimental results show that the prediction of PPIs by SDNN-PPI is effective and robust. Table 9 presents the statistical significance of SDNN-PPI on the four datasets. As described above, a kappa between 0.61 and 0.80 indicates substantial agreement, and a kappa above 0.81 indicates almost perfect agreement. The kappa values of all 4 datasets in Table 9 are greater than 0.61 and 3 of them are greater than 0.81, indicating that the results are statistically significant.
Compared with other methods
Many methods have been proposed to predict protein-protein interactions. To evaluate the predictive performance of the constructed model more objectively, its results were compared with those of other models on the same datasets. The comparison results on the intraspecific datasets S.cerevisiae (core subset) and Human are shown in Tables 10 and 11, and those on the interspecific datasets Human-B.anthracis and Human-Y.pestis are shown in Tables 12 and 13. For the compared methods, the data in the tables are taken from the original publications, and N/A means that the value is not available there. Values in bold indicate the best value in each column.
As can be seen from Table 10, the ACC, Spec, Sens, Prec, MCC, and AUC of SDNN-PPI are 95.48%, 97.23%, 93.80%, 97.13%, 91.02% and 98.63%, respectively. Compared with other methods, its ACC is 0.04% to 2.18% higher. According to Table 11, the ACC, Spec, Sens, Prec, MCC, and AUC of SDNN-PPI on the Human dataset are 98.94%, 99.10%, 98.77%, 99.02%, 97.57% and 99.60%, respectively, and the accuracy is clearly improved over the other methods. Although SDNN-PPI is not the best on every indicator, it is the best on more than half of the indicators on the S.cerevisiae (core subset) and Human datasets, which shows that the method is competitive and outperforms the other methods on multiple indicators.
As can be seen from Table 12, the ACC, Sens, Prec, MCC and AUC of SDNN-PPI on the Human-B.anthracis dataset are 93.15%, 96.61%, 90.44%, 86.57% and 98.23%, respectively, and its ACC is significantly higher than that of the other methods. According to Table 13, the ACC, Sens, Prec, MCC and AUC of SDNN-PPI on the Human-Y.pestis dataset are 88.33%, 93.92%, 84.63%, 77.26% and 95.74%, respectively, and its ACC is 1.03% to 12.23% higher than that of the other methods. Therefore, the SDNN-PPI method achieves competitive results on the interspecific datasets. Note that the two tables do not include a Spec column because the compared models did not report Spec values.
Performance on independent data sets
To further verify the generalization ability of SDNN-PPI, Saccharomyces cerevisiae was selected as the training set, and C.elegans, E.coli, H.sapiens and M.musculus were selected as independent test sets. The number of interaction pairs in each independent test set is given under "test pairs" in Table 14, and the results are evaluated by ACC. The Saccharomyces cerevisiae set consists of 17257 positive pairs and 48594 negative pairs, from which equal numbers of positive and negative samples are randomly selected to train the model. The prediction results are shown in Table 14: the accuracy of SDNN-PPI on all four independent datasets is 100%. This shows that SDNN-PPI achieves good predictive performance on the four independent test sets, indicating that the proposed model can characterize important PPIs information and make cross-species predictions. In other words, a PPIs prediction model trained on one species can be transferred to other species.
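The balanced training set described above, with equal numbers of randomly selected positive and negative pairs, can be sketched as follows. The function name, the seed and the integer stand-ins for protein pairs are hypothetical; how the original code draws its samples is not stated in the text.

```python
import random


def balanced_sample(positives: list, negatives: list, seed: int = 0) -> list:
    """Return shuffled (pair, label) tuples with a 1:1 positive-to-negative ratio."""
    rng = random.Random(seed)
    size = min(len(positives), len(negatives))  # size of the smaller class
    # Draw the same number of samples from each class without replacement.
    labelled = [(pair, 1) for pair in rng.sample(positives, size)]
    labelled += [(pair, 0) for pair in rng.sample(negatives, size)]
    rng.shuffle(labelled)
    return labelled


if __name__ == "__main__":
    pos = list(range(5))        # stand-ins for interacting pairs
    neg = list(range(100, 112))  # stand-ins for non-interacting pairs
    data = balanced_sample(pos, neg)
    print(len(data), sum(label for _, label in data))
```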
Performance on PPI networks
Studying PPI networks is also of great significance for understanding other information about proteins, since the corresponding biological topological properties can be examined. In this paper, SDNN-PPI was tested on two important PPI networks, namely the one-core network and the crossover network of the Wnt-related pathway. The one-core PPI network is built around a single core protein, CD9, which interacts with many other proteins. CD9 is a tetraspanin protein that plays an important role in cell viability and tumor suppression. The network is composed of CD9 as the core protein and 17 other genes.
The second is a typical crossover and multicore network constructed from 78 genes. This pathway network plays a crucial role in tumor growth and tumor formation. AAC, CT, and AC were used to encode proteins into 573-dimensional feature vectors. The Saccharomyces cerevisiae dataset was used as the training set, and the one-core network and the crossover network of the Wnt-related pathway were used as the test sets. The prediction results for the one-core network are shown in Fig. 5, and those for the crossover network in Fig. 6. Solid lines represent true predictions and dashed lines represent false predictions. The figures show that all interacting proteins are correctly identified. Table 15 shows the prediction results of various methods on the two network datasets. The results show that the proposed method produces comparable or better results than existing models. In summary, SDNN-PPI is a model with high generalization ability that obtains competitive results on multiple datasets and effectively improves the prediction accuracy of PPIs.
The study of PPIs is of great significance for understanding cellular regulation and signal transduction, as well as for exploring and elucidating the mechanism of protein interactions in cells. In this paper, we proposed SDNN-PPI, a self-attention-based deep learning neural network prediction method for PPIs. The protein sequences were encoded by AAC, CT and AC methods, and excellent accuracy was obtained in the intraspecific data sets (S.cerevisiae core subset and Human) and interspecies data sets (Human-B.anthracis and Human-Y.pestis). In order to further verify the universality of SDNN-PPI, the evaluation of C.elegans, E.coli, H.sapiens and M.musculus data sets also achieved competitive accuracy, indicating that the method can also achieve good performance in cross-species prediction. The PPI network prediction based on one-core and crossover network correctly predicted the protein interaction containing cell and tumor information on the network. Therefore, comprehensive evaluations demonstrated that SDNN-PPI method could provide a new way to solve problems in signaling pathway research, drug-target prediction and disease pathogenesis research [51–54]. Although protein sequences are transformed into vectors through various encoding methods, the acquisition of comprehensive protein characteristic information is still insufficient. How to better mine the structural information, evolutionary information set of protein pairs and the relationship between protein residues is leading us to the next research direction. At the same time, DNA computing and DNA storage [55, 56] have been applied in more fields [57, 58], and the storage of known protein information and structure may also play a role in promoting biological evolution.
Availability of data and materials
The data and code underlying this article are available at https://github.com/xueleecs/SDNN-PPI. All data sets used in the article are available at https://github.com/xueleecs/SDNN-PPI/tree/main/Data.
amino acid composition
position specific scoring matrix
gradient tree boosting
local composition ternary description
multi-scale continuous and discontinuous local descriptors
multivariate mutual information
dipeptide composition PSSM
Deep neural network
Humphreys IR, Pei JM, Baek M, Krishnakumar A, Anishchenko I, Ovchinnikov S, Zhang J, Ness TJ, Banjade S, Bagde SR, Stancheva VG, Li XH, Liu KX, Zheng Z, Barrero DJ, Roy U, Kuper J, Fernandez IS, Szakal B, Branzei D, Rizo J, Kisker C, Greene EC, Biggins S, Keeney S, Miller EA, Fromme JC, Hendrickson TL, Cong Q, Baker D. Computed structures of core eukaryotic protein complexes. Science. 2021; 374(6573):1340. https://doi.org/10.1126/science.abm4805.
Bacon K, Blain A, Bowen J, Burroughs M, McArthur N, Menegatti S, Rao BM. Quantitative yeast-yeast two hybrid for the discovery and binding affinity estimation of protein-protein interactions. ACS Synth Biol. 2021; 10(3):505–14. https://doi.org/10.1021/acssynbio.0c00472.
Woodall DW, Dillon TM, Kalenian K, Padaki R, Kuhns S, Semin DJ, Bondarenko PV. Non-targeted characterization of attributes affecting antibody-fc gamma riiia v158 (cd16a) binding via online affinity chromatography-mass spectrometry. Mabs. 2022; 14(1). https://doi.org/10.1080/19420862.2021.2004982.
Hu L, Wang XJ, Huang YA, Hu PW, You ZH. A survey on computational models for predicting protein-protein interactions. Brief Bioinform. 2021; 22(5). https://doi.org/10.1093/bib/bbab036.
Susila H, Nasim Z, Jin S, Youn G, Jeong H, Jung J-Y, Ahn JH. Profiling protein-dna interactions by chromatin immunoprecipitation in arabidopsis. Methods Mol Biol (Clifton, NJ). 2021; 2261:345–56. https://doi.org/10.1007/978-1-0716-1186-9_21.
Ma JF, Wu C, Hart GW. Analytical and biochemical perspectives of protein o-glcnacylation. Chem Rev. 2021; 121(3):1513–81. https://doi.org/10.1021/acs.chemrev.0c00884.
Liu W, Jiang Y, Peng L, Sun XG, Gan WQ, Zhao Q, Tang HR. Inferring gene regulatory networks using the improved markov blanket discovery algorithm. Interdiscip Sci-Comput Life Sci. 2022; 14(1):168–81. https://doi.org/10.1007/s12539-021-00478-9.
Wang H, Zhao J, Su Y, Zheng C-H. sccdg: A method based on dae and gcn for scrna-seq data analysis. IEEE/ACM Trans Comput Biol Bioinforma. 2021; PP. https://doi.org/10.1109/tcbb.2021.3126641.
Hu H, Zhang L, Ai HX, Zhang H, Fan YT, Zhao Q, Liu HS. Hlpi-ensemble: Prediction of human lncrna-protein interactions based on ensemble strategy. RNA Biol. 2018; 15(6):797–806. https://doi.org/10.1080/15476286.2018.1457935.
Zhang L, Yang PY, Feng HW, Zhao Q, Liu HS. Using network distance analysis to predict lncrna-mirna interactions. Interdiscip Sci-Comput Life Sci. 2021; 13(3):535–45. https://doi.org/10.1007/s12539-021-00458-z.
Chou KC, Cai YD. Predicting protein-protein interactions from sequences in a hybridization space. J Proteome Res. 2006; 5(2):316–22. https://doi.org/10.1021/pr050331g.
Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ. Next-generation machine learning for biological networks. Cell. 2018; 173(7):1581–92. https://doi.org/10.1016/j.cell.2018.05.015.
Fang WW, Yao XN, Zhao XJ, Yin JW, Xiong NX. A stochastic control approach to maximize profit on service provisioning for mobile cloudlet platforms. IEEE Trans Syst Man Cybern-Syst. 2018; 48(4):522–34. https://doi.org/10.1109/tsmc.2016.2606400.
Li HH, Liu JX, Liu RW, Xiong NX, Wu KF, Kim TH. A dimensionality reduction-based multi-step clustering method for robust vessel trajectory analysis. Sensors. 2017; 17(8). https://doi.org/10.3390/s17081792.
Song T, Pang S, Hao S, Rodriguez-Paton A, Zheng P. A parallel image skeletonizing method using spiking neural p systems with weights. Neural Process Lett. 2019; 50(2):1485–502.
Song T, Zeng X, Zheng P, Jiang M, Rodriguez-Paton A. A parallel workflow pattern modeling using spiking neural p systems with colored spikes. IEEE Trans Nanobioscience. 2018; 17(4):474–84.
Song T, Zheng P, Wong MLD, Wang X. Design of logic gates using spiking neural p systems with homogeneous neurons and astrocytes-like control. Inf Sci. 2016; 372:380–91. https://doi.org/10.1016/j.ins.2016.08.055.
Song T, Rodriguez-Paton A, Zheng P, Zeng XX. Spiking neural p systems with colored spikes. IEEE Trans Cogn Dev Syst. 2018; 10(4):1106–15. https://doi.org/10.1109/tcds.2017.2785332.
Shen JW, Zhang J, Luo XM, Zhu WL, Yu KQ, Chen KX, Li YX, Jiang HL. Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci U S A. 2007; 104(11):4337–41. https://doi.org/10.1073/pnas.0607879104.
Guo YZ, Yu LZ, Wen ZN, Li ML. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res. 2008; 36(9):3025–30. https://doi.org/10.1093/nar/gkn159.
Yang L, Xia JF, Gui J. Prediction of protein-protein interactions from protein sequence using local descriptors. Protein Pept Lett. 2010; 17(9):1085–90. https://doi.org/10.2174/092986610791760306.
You ZH, Lei YK, Zhu L, Xia JF, Wang B. Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinformatics. 2013; 14. https://doi.org/10.1186/1471-2105-14-s8-s10.
Barman RK, Saha S, Das S. Prediction of interactions between viral and host proteins using supervised machine learning methods. PLoS ONE. 2014; 9(11). https://doi.org/10.1371/journal.pone.0112034.
An JY, Meng FR, You ZH, Chen X, Yan GY, Hu JP. Improving protein-protein interactions prediction accuracy using protein evolutionary information and relevance vector machine model. Protein Sci. 2016; 25(10):1825–33. https://doi.org/10.1002/pro.2991.
Goktepe YE, Kodaz H. Prediction of protein-protein interactions using an effective sequence based combined method. Neurocomputing. 2018; 303:68–74. https://doi.org/10.1016/j.neucom.2018.03.062.
Song XY, Chen ZH, Sun XY, You ZH, Li LP, Zhao Y. An ensemble classifier with random projection for predicting protein-protein interactions using sequence and evolutionary information. Appl Sci-Basel. 2018; 8(1). https://doi.org/10.3390/app8010089.
Chen C, Zhang QM, Ma Q, Yu B. Lightgbm-ppi: Predicting protein-protein interactions through lightgbm with multi-information fusion. Chemometr Intell Lab Syst. 2019; 191:54–64. https://doi.org/10.1016/j.chemolab.2019.06.003.
Yu B, Chen C, Zhou HY, Liu BQ, Ma Q. Gtb-ppi: Predict protein-protein interactions based on l1-regularized logistic regression and gradient tree boosting. Genomics Proteomics Bioinforma. 2020; 18(5):582–92. https://doi.org/10.1016/j.gpb.2021.01.001.
Quang D, Xie XH. Danq: a hybrid convolutional and recurrent deep neural network for quantifying the function of dna sequences. Nucleic Acids Res. 2016; 44(11). https://doi.org/10.1093/nar/gkw226.
Pang SC, Zhang Y, Song T, Zhang XD, Wang X, Rodriguez-Paton A. Amde: a novel attention-mechanism-based multidimensional feature encoder for drug-drug interaction prediction. Brief Bioinform. 2022; 23(1). https://doi.org/10.1093/bib/bbab545.
Wang S, Jiang MJ, Zhang SG, Wang XF, Yuan Q, Wei ZQ, Li Z. Mcn-cpi: Multiscale convolutional network for compound-protein interaction prediction. Biomolecules. 2021; 11(8). https://doi.org/10.3390/biom11081119.
Wang S, Song T, Zhang S, Jiang M, Wei Z, Li Z. Molecular substructure tree generative model for de novo drug design. Brief Bioinform. 2022. https://doi.org/10.1093/bib/bbab592.
Wang YB, You ZH, Li X, Jiang TH, Chen X, Zhou X, Wang L. Predicting protein-protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Mol BioSyst. 2017; 13(7):1336–44. https://doi.org/10.1039/c7mb00188f.
Du XQ, Sun SW, Hu CL, Yao Y, Yan YT, Zhang YP. Deepppi: Boosting prediction of protein-protein interactions with deep neural networks. J Chem Inf Model. 2017; 57(6):1499–510. https://doi.org/10.1021/acs.jcim.7b00028.
Wang J, Zhang L, Jia LY, Ren YZ, Yu GX. Protein-protein interactions prediction using a novel local conjoint triad descriptor of amino acid sequences. Int J Mol Sci. 2017; 18(11). https://doi.org/10.3390/ijms18112373.
Hashemifar S, Neyshabur B, Khan AA, Xu JB. Predicting protein-protein interactions through sequence-based deep learning. Bioinformatics. 2018; 34(17):802–10. https://doi.org/10.1093/bioinformatics/bty573.
Zhang L, Yu GX, Xia DW, Wang J. Protein-protein interactions prediction based on ensemble deep neural networks. Neurocomputing. 2019; 324:10–19. https://doi.org/10.1016/j.neucom.2018.02.097.
You ZH, Huang WZ, Zhang SW, Huang YA, Yu CQ, Li LP. An efficient ensemble learning approach for predicting protein-protein interactions by integrating protein primary sequence and evolutionary information. IEEE/ACM Trans Comput Biol Bioinforma. 2019; 16(3):809–17. https://doi.org/10.1109/tcbb.2018.2882423.
Yao Y, Du XQ, Diao YY, Zhu HX. An integration of deep learning with feature embedding for protein-protein interaction prediction. Peerj. 2019; 7. https://doi.org/10.7717/peerj.7126.
Li FF, Zhu F, Ling XH, Liu Q. Protein interaction network reconstruction through ensemble deep learning with attention mechanism. Front Bioeng Biotechnol. 2020; 8. https://doi.org/10.3389/fbioe.2020.00390.
Yu B, Chen C, Wang XL, Yu ZM, Ma AJ, Liu BQ. Prediction of protein-protein interactions based on elastic net and deep forest. Expert Syst Appl. 2021; 176. https://doi.org/10.1016/j.eswa.2021.114876.
Kosesoy I, Gok M, Oz C. A new sequence based encoding for prediction of host-pathogen protein interactions. Comput Biol Chem. 2019; 78:170–77. https://doi.org/10.1016/j.compbiolchem.2018.12.001.
Angermueller C, Parnamaa T, Parts L, Stegle O. Deep learning for computational biology. Mol Syst Biol. 2016; 12(7). https://doi.org/10.15252/msb.20156651.
Webb S. Deep learning for biology. Nature. 2018; 554(7693):555–57. https://doi.org/10.1038/d41586-018-02174-z.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. In Advances in Neural Information Processing Systems. 2017; 30:6000–10.
Lei YP, Li SY, Liu ZY, Wan FP, Tian TZ, Li S, Zhao D, Zeng JY. A deep-learning framework for multi-level peptide-protein interaction prediction. Nat Commun. 2021; 12(1). https://doi.org/10.1038/s41467-021-25772-4.
Dey L, Mukhopadhyay A. Compact genetic algorithm-based feature selection for sequence-based prediction of dengue-human protein interactions. IEEE/ACM Trans Comput Biol Bioinforma. 2021; PP. https://doi.org/10.1109/tcbb.2021.3066597.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977; 33(1):159–74. https://doi.org/10.2307/2529310.
Tang W, Hu J, Zhang H, Wu P, He H. Kappa coefficient: a popular measure of rater agreement. Shanghai Arch Psychiatry. 2015; 27(1):62–7. https://doi.org/10.11919/j.issn.1002-0829.215010.
Chen C, Zhang QM, Yu B, Yu ZM, Lawrence PJ, Ma Q, Zhang Y. Improving protein-protein interactions prediction accuracy using xgboost feature selection and stacked ensemble classifier. Comput Biol Med. 2020; 123. https://doi.org/10.1016/j.compbiomed.2020.103899.
Li L, Gao Z, Wang YT, Zhang MW, Ni JC, Zheng CH. Scmfmda: Predicting microrna-disease associations based on similarity constrained matrix factorization. PLoS Comput Biol. 2021; 17(7). https://doi.org/10.1371/journal.pcbi.1009165.
Su YS, Liu CL, Niu YY, Cheng F, Zhang XY. A community structure enhancement-based community detection algorithm for complex networks. IEEE Trans Syst Man Cybern-Syst. 2021; 51(5):2833–46. https://doi.org/10.1109/tsmc.2019.2917215.
Tian Y, Su XC, Su YS, Zhang XY. Emodmi: A multi-objective optimization based method to identify disease modules. IEEE Trans Emerg Top Comput Intell. 2021; 5(4):570–82. https://doi.org/10.1109/tetci.2020.3014923.
Cai LJ, Lu CC, Xu JL, Meng YJ, Wang P, Fu XZ, Zeng XX, Su YS. Drug repositioning based on the heterogeneous information fusion graph convolutional network. Brief Bioinform. 2021; 22(6). https://doi.org/10.1093/bib/bbab319.
Cao B, Li X, Zhang X, Wang B, Zhang Q, Wei X. Designing uncorrelated address constrain for dna storage by dmvo algorithm. IEEE/ACM Trans Comput Biol Bioinforma. 2020. https://doi.org/10.1109/TCBB.2020.3011582.
Wu J, Zheng Y, Wang B, Zhang Q. Enhancing physical and thermodynamic properties of dna storage sets with end-constraint. IEEE Trans Nanobioscience. 2021; PP. https://doi.org/10.1109/tnb.2021.3121278.
Zhou SH. A real-time one-time pad dna-chaos image encryption algorithm based on multiple keys. Opt Laser Technol. 2021; 143. https://doi.org/10.1016/j.optlastec.2021.107359.
Song T, Wang X, Li X, Zheng PJO. A programming triangular DNA origami for doxorubicin loading and delivering to target ovarian cancer cells. Oncotarget. 2017; 5. https://doi.org/10.18632/oncotarget.23733.
Wang YB, You ZH, Yang S, Li X, Jiang TH, Zhou X. A high efficient biological language model for predicting protein-protein interactions. Cells. 2019; 8(2). https://doi.org/10.3390/cells8020122.
Sharma A, Singh B. Ae-lgbm: Sequence-based novel approach to detect interacting protein pairs via ensemble of autoencoder and lightgbm. Comput Biol Med. 2020; 125. https://doi.org/10.1016/j.compbiomed.2020.103964.
An JY, You ZH, Zhou Y, Wang DF. Sequence-based prediction of protein-protein interactions using gray wolf optimizer-based relevance vector machine. Evol Bioinforma. 2019; 15. https://doi.org/10.1177/1176934319844522.
We thank our partners who provided all the help during the research process and the team for their great support.
This work was supported by National Key Research and Development Project of China (2021YFA1000102, 2021YFA1000103), Natural Science Foundation of China (Grant Nos. 61873280, 61972416), Taishan Scholarship (tsqn201812029), Foundation of Science and Technology Development of Jinan (201907116), Shandong Provincial Natural Science Foundation(ZR2021QF023), Fundamental Research Funds for the Central Universities (21CX06018A), Spanish project PID2019-106960GB-I00, Juan de la Cierva IJC2018-038539-I.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Cite this article
Li, X., Han, P., Wang, G. et al. SDNN-PPI: self-attention with deep neural network effect on protein-protein interaction prediction. BMC Genomics 23, 474 (2022). https://doi.org/10.1186/s12864-022-08687-2