SDNN-PPI: self-attention with deep neural network effect on protein-protein interaction prediction

Li, Xue; Han, Peifu; Wang, Gan; Chen, Wenqi; Wang, Shuang; Song, Tao

doi:10.1186/s12864-022-08687-2

Research
Open access
Published: 27 June 2022

BMC Genomics volume 23, Article number: 474 (2022)

Abstract

Background

Protein-protein interactions (PPIs) govern how intracellular molecules perform tasks such as transcriptional regulation, information transduction, and drug signalling. Traditional wet-lab experiments for obtaining PPI information are costly and time-consuming.

Results

In this paper, we propose SDNN-PPI, a PPI prediction method based on self-attention and deep learning. The method uses amino acid composition (AAC), conjoint triad (CT), and auto covariance (AC) to extract global and local features of protein sequences, and leverages self-attention to enhance the feature extraction of a deep neural network (DNN) so that PPIs are predicted more effectively. To verify the generalization ability of SDNN-PPI, the model is evaluated with 5-fold cross-validation on the intraspecific interaction datasets of Saccharomyces cerevisiae (core subset) and human, where it reaches accuracies of 95.48% and 98.94%, respectively. On the interspecific interaction datasets of human-Bacillus anthracis and human-Yersinia pestis, it obtains accuracies of 93.15% and 88.33%, respectively. On the independent datasets of Caenorhabditis elegans, Escherichia coli, Homo sapiens, and Mus musculus, the prediction accuracy is 100% in every case, which is higher than that of previous PPI prediction methods. To further evaluate the strengths and weaknesses of the model, it is applied to a one-core network and a crossover network, and the results show that it correctly predicts the interaction pairs in both networks.

Conclusion

In this paper, AAC, CT, and AC are used to encode protein sequences, and the SDNN-PPI method, a deep neural network with self-attention, is proposed to predict PPIs. Satisfactory results are obtained on interspecific and intraspecific datasets, and good performance is also achieved in cross-species prediction. The method also correctly predicts the protein interactions, which carry cell and tumor information, in the one-core network and the crossover network. SDNN-PPI not only helps explore the mechanism of protein-protein interaction, but also provides new ideas for drug design and disease prevention.

Introduction

Proteins are organic macromolecules made up of amino acids. They are essential components of cells and sustain life activities. Through PPIs, proteins link many important physiological activities of cells [1], enabling processes such as apoptosis and immune response. In recent years, a large number of high-throughput experimental methods have emerged to study PPIs, such as yeast two-hybrid screening [2], mass spectrometry [3], hybridization methods [4], immunoprecipitation [5], and protein microarrays [6]. However, all of these rely on biological and chemical experiments, which demand considerable labor, money, and time. Artificial intelligence-based computational methods have therefore emerged in bioinformatics [7, 8] and have become prevalent for predicting the interaction of proteins with other biological macromolecules [9, 10]. For PPIs in particular, amino acid sequence data are abundant enough to build computational prediction models [11], and these methods have attracted a growing number of researchers. Sequence-based PPI prediction consists of two basic parts: a protein encoding method and a machine learning model.

With the rapid development of machine learning techniques [12–14] and the refinement of neural networks [15–18], several sequence-based machine learning models have been presented for PPI prediction. Shen et al. [19] first employed conjoint triad (CT) to extract features from protein sequences and predicted PPIs with a kernel-based support vector machine (SVM), reaching 83.9% accuracy. Guo et al. [20] proposed auto covariance (AC) to extract information from protein sequences and used an SVM to predict PPIs in the Saccharomyces cerevisiae dataset with 88.09% accuracy. Yang et al. [21] proposed local descriptors (LD) to represent protein sequences and successfully predicted potential PPIs on the Saccharomyces cerevisiae (core subset) dataset with a k-nearest neighbors model. You et al. [22] encoded proteins as feature vectors using four categories of protein sequence information (AC, CT, LD, MAC), with a focus on dimensionality reduction, and proposed a hierarchical PCA-EELM (principal component analysis-ensemble extreme learning machine) model to predict protein interactions. In 2014, Barman et al. [23] used SVM, Naive Bayes, and random forest classifiers with 5-fold cross-validation to predict host-pathogen interactions. In 2016, An et al. [24] proposed a computational method called RVM-BiGP, which combines the relevance vector machine (RVM) model with bi-gram probabilities (BiGP) to handle imbalanced protein interaction datasets efficiently. In 2018, Goktepe et al. [25] adopted PCA to fuse position-specific scoring matrix (PSSM), Bi-gram, AAC, pseudo-amino acid composition (PseAAC), and weighted jump-order joint triple features into approximate features, then used an SVM to predict PPIs. Song et al. [26] used the PSSM to obtain evolutionary information and proposed a new feature fusion algorithm that combines discrete cosine transform (DCT), fast Fourier transform (FFT), and singular value decomposition (SVD). In 2019, Chen et al. [27] selected features from PseAAC, autocorrelation descriptor (AD), CT, and LD encodings with an elastic net and predicted PPIs on several datasets with LightGBM. In 2020, Yu et al. [28] combined PseAAC, pseudo-position-specific scoring matrix (PsePSSM), reduced sequence and index-vectors (RSIV), and AD to encode protein sequences and predicted potential PPIs on the Saccharomyces cerevisiae (core subset) dataset with the GTB-PPI model.

Although machine learning methods can make predictions based on best-fitting models, they remain limited in how effectively they learn deep feature representations. In recent years, deep learning architectures [8, 29–32] have provided strong support for solving related problems in bioinformatics. In 2017, Wang et al. [33] extracted protein sequence features from the PSSM, reconstructed them with a stacked auto-encoder, and completed the prediction with a new probabilistic classification vector machine (PCVM). Du et al. [34] proposed a deep neural network model, DeepPPI, which improves PPI prediction using AAC, dipeptide composition (DC), LD, and other protein descriptors, and demonstrated the superiority of the model on several datasets. Wang et al. [35] combined deep neural networks (DNNs) with a new local conjoint triad descriptor (LCTD) feature representation and proposed the DNN-LCTD method, which predicts PPIs on the Saccharomyces cerevisiae (core subset) dataset with an accuracy of 93.12%. In 2018, Hashemifar et al. [36] combined deep Siamese-like convolutional neural networks with random projection to construct the DPPI model, which predicts PPIs using protein evolutionary information. In 2019, Zhang et al. [37] proposed a deep model called EnsDNN, which extracts protein interaction information from AC, LD, and multi-scale continuous and discontinuous local descriptors (MCD) and achieves 95.29% accuracy on the Saccharomyces cerevisiae (core subset) dataset. You et al. [38] proposed a highly efficient method to detect PPIs by integrating a new protein sequence substitution matrix feature representation with an ensemble weighted sparse representation model classifier. Yao et al. [39] designed a new protein sequence representation method, Res2vec, and combined effective feature embedding with deep learning techniques to develop the DeepFE-PPI framework, which achieved good performance in PPI prediction. In 2020, Li et al. [40] represented proteins using AC, CT, LD, and PseAAC, and built an ensemble model to predict PPIs. In 2021, Yu et al. [41] used PseAAC, AD, multivariate mutual information (MMI), composition-transition-distribution (CTD), amino acid composition PSSM (AAC-PSSM), and dipeptide composition PSSM (DPC-PSSM) to build the GcForest-PPI model.

Inspired by the above discussion, this paper proposes a protein-protein interaction prediction method, SDNN-PPI. First, protein sequence information is encoded with AAC, CT, and AC. Second, to extract features effectively, a deep neural network is combined with self-attention, which adjusts the weights of the sequence features and emphasizes the key ones, yielding a network model that fully exploits protein sequence information. Finally, the method is evaluated with 5-fold cross-validation on two intraspecific and two interspecific datasets, and on four independent datasets, achieving high accuracy on all of them. To further evaluate the merits of the model, the method is also tested on a one-core network and a crossover network. The experimental results show that SDNN-PPI outperforms other state-of-the-art methods and is highly competitive.
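As an illustration of the first step, the sketch below computes the AAC encoding in its standard form, that is, the fraction of each of the 20 standard amino acids in a sequence. The function names `aac` and `pair_features`, the decision to ignore non-standard residues, and the concatenation of the two proteins' vectors into one pair vector are assumptions made for this example; the exact formulation used by SDNN-PPI is the one given in the feature extraction section.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed (alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the 20-dimensional amino acid composition of a sequence.

    Each entry is the fraction of one standard amino acid. Residues outside
    the 20 standard letters (e.g. 'X') are ignored; this is an assumption.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def pair_features(seq_a, seq_b):
    """Concatenate the AAC vectors of two proteins (hypothetical pairing scheme)."""
    return aac(seq_a) + aac(seq_b)
```

For a protein pair, `pair_features` yields a 40-dimensional vector; CT and AC features would be appended in the same way before the vector is passed to the network.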

Materials and methods

Data sets

In this study, multiple high-confidence PPI datasets were used to measure the performance of SDNN-PPI, including the intraspecific datasets Saccharomyces cerevisiae core subset (S.cerevisiae core subset) [20] and Human [38], and the interspecific datasets Human-Bacillus anthracis (Human-B.Anthracis) [42] and Human-Yersinia pestis (Human-Y.pestis) [42]. The composition of the four datasets is shown in Table 1. In addition, four independent datasets [27], Caenorhabditis elegans (C.elegans), Escherichia coli (E.coli), Homo sapiens (H.sapiens), and Mus musculus (M.musculus), are used to test PPI prediction. The predictive performance of the method is further validated on two significant PPI networks [41]: the one-core CD9 network, which contains 16 PPIs, and the crossover network, which consists of 96 PPIs. A further dataset, Saccharomyces cerevisiae [34], is used for independent test experiments. To keep the datasets balanced, negative samples are randomly selected in the same number as the positive samples, so that the ratio of positive to negative samples is 1:1.
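The 1:1 balancing can be sketched as follows. The text only states that negative samples are randomly selected, so the procedure here (drawing random protein pairs that do not occur among the positives) is an assumption for illustration, and the function name `sample_negatives` and the protein identifiers are hypothetical.

```python
import random


def sample_negatives(positive_pairs, seed=0):
    """Draw as many non-interacting pairs as there are positive pairs.

    Assumption: a negative sample is any pair of distinct proteins that does
    not occur among the positives. The selection used for the benchmark
    datasets may apply further criteria.
    """
    rng = random.Random(seed)
    positives = {frozenset(pair) for pair in positive_pairs}
    proteins = sorted({p for pair in positive_pairs for p in pair})

    # Make sure enough candidate pairs exist, otherwise the loop cannot end.
    n = len(proteins)
    distinct_positives = sum(1 for pair in positives if len(pair) == 2)
    available = n * (n - 1) // 2 - distinct_positives
    if available < len(positives):
        raise ValueError("not enough candidate pairs for a 1:1 ratio")

    negatives = set()
    while len(negatives) < len(positives):
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


positives = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
negatives = sample_negatives(positives)
```

With three positive pairs over six proteins, `negatives` holds three distinct pairs, none of which appears in `positives`.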

Table 1 Compositions of the four benchmark data sets
