Identification of self-interacting proteins by integrating random projection classifier and finite impulse response filter
BMC Genomics volume 20, Article number: 928 (2019)
Abstract
Background
Identification of protein-protein interactions (PPIs) is crucial for understanding biological processes and investigating the cellular functions of genes. Self-interacting proteins (SIPs) are those in which two or more identical copies of a protein interact with each other, and they are a specific type of PPI. SIP detection has attracted growing attention and several prediction models have been proposed, but problems remain. Hence, there is an urgent need for an efficient computational model for SIP prediction.
Results
In this study, we developed an effective model to predict SIPs, called RP-FIRF, which combines the Random Projection (RP) classifier with the Finite Impulse Response Filter (FIRF). More specifically, each protein sequence was first transformed into a Position Specific Scoring Matrix (PSSM) by using Position Specific Iterated BLAST (PSI-BLAST). Then, to extract discriminative SIP features and improve prediction performance, the FIRF method was applied to the PSSM. The RP classifier was then used to perform the classification and predict novel SIPs. We evaluated the performance of the proposed RP-FIRF model and compared it with the state-of-the-art support vector machine (SVM) on human and yeast datasets. The proposed model achieved high average accuracies of 97.89% and 97.35% on the human and yeast datasets, respectively, using five-fold cross-validation. To further evaluate the proposed method, we also compared it with six other existing methods; the experimental results demonstrated that our model outperforms these previous approaches.
Conclusion
Experimental results show that self-interacting proteins are accurately predicted by the proposed model on the human and yeast datasets, which indicates that the model can predict SIPs effectively. Thus, the RP-FIRF model is an automatic decision support method that should provide useful insights into the recognition of SIPs.
Background
Proteins are a significant component of all cells and tissues of an organism. They are organic macromolecules composed of amino acid chains of different lengths, and they are the basic material of life and the main carriers of life activity. Proteins often associate with partner proteins, which is called protein-protein interaction (PPI) [1]. Self-interacting proteins (SIPs) are a particular type of PPI in which identical proteins, encoded by the same gene, interact with each other. SIPs play an important role in cellular functions and cellular signal transduction. Most chemical reactions in living systems depend on the activity of enzymes, which in essence involves a large number of protein self-interactions. However, it remains difficult for researchers to determine whether proteins can interact with each other. Some protein functions, such as the transport of ions and small molecules across cell membranes, depend on homo-oligomers [2]. In particular, homo-oligomerization also allows proteins to form large structures without increasing genome size, while improving error control during synthesis [3]. In past years, many researchers have elucidated the overall properties of these proteins. Ispolatov et al. found that the average number of homodimers of SIPs is more than double that of non-SIPs in protein interaction networks (PINs) [4]. Clarifying the function of SIPs is crucial for understanding the regulation of protein function and whether proteins can interact with each other, and thus for better understanding the mechanisms of disease [5]. Liu et al. analyzed the properties of SIPs from various aspects and applied a logistic regression framework to develop a SIP prediction model that integrates multiple features [6]. Self-interaction can also improve the stability of a protein and prevent its denaturation by reducing its surface area [7].
So far, a large number of methods for PPI detection have been proposed [8,9,10]. For instance, Zhang et al. summarized various computational methods based on present knowledge and proposed an algorithm that integrates structural information with other functional clues [11]. Zou et al. presented novel fingerprint features and a dimensionality reduction strategy for predicting TATA binding proteins, which improved prediction accuracy [12]. Hamp et al. introduced a new technique to predict PPIs based on evolutionary profiles and a profile-kernel support vector machine [13]. Wan et al. developed an ensemble multi-label classifier for human protein subcellular location prediction with imbalanced source data [14]. Song et al. designed a predictor to identify DNA-binding proteins based on unbalanced classification [15]. Pitre et al. put forward a PPI prediction engine named PIPE, which is capable of predicting PPIs for any target pair of yeast Saccharomyces cerevisiae proteins from their primary structure, without any additional information [16]. Xia et al. presented a sequence-based multi-classifier system that employs an autocorrelation descriptor to encode an interacting protein pair and uses rotation forest as the classifier to infer PPIs [17]. Li et al. provided a scored human PIN with severalfold more interactions and better functional biological relevance than comparable resources by means of data integration and quality control [18].
Although these approaches can detect PPIs well [19], they are not good enough to predict SIPs, mainly for the following reasons: (1) They rely on correlations between the two proteins of a pair, for example co-expression, co-localization and co-evolution, and this information is of no use for SIPs. (2) The datasets used to predict PPIs differ from those for SIPs: the former are balanced, whereas the latter are unbalanced. (3) PPI prediction datasets contain no interactions between identical partners. For these reasons, these computational approaches are not suitable for predicting SIPs. Hence, it is becoming increasingly important to develop an effective computational method to predict SIPs.
In this paper, we put forward a model that combines a random projection (RP) classifier with the Finite Impulse Response Filter (FIRF) to predict SIPs from protein sequence information. The main ideas of the proposed method comprise the following four aspects: (1) PSI-BLAST is used to convert each protein sequence into a Position Specific Scoring Matrix (PSSM); (2) the FIRF method is employed to calculate feature values from the PSSM of each protein sequence; (3) Principal Component Analysis (PCA) is applied to reduce the dimension of the feature values obtained from the FIRF method and to remove noisy features, so that the pattern in the data is revealed; (4) the RP classifier is trained on the resulting training set. More specifically, the PSSM of each protein sequence is first converted into a 400-dimensional feature vector by the FIRF method to extract helpful information; then, to remove the influence of noise, the dimension is reduced from 400 to 300 by PCA; finally, classification is performed on the yeast and human datasets with the RP classifier. The experimental results show that this method outperforms the SVM-based method and other previous methods, which indicates that the presented method is suitable for predicting SIPs and performs well.
Results and discussion
Five-fold cross-validation on human and yeast datasets
The performance of the proposed method was estimated on the human and yeast datasets. To ensure fairness and reduce overfitting, we used five-fold cross-validation on both datasets. In detail, we split the human dataset, which consists of the extracted feature values, into five non-overlapping parts; four parts were used as the training set and the remaining part was used as the independent test set. Repeating this procedure five times gave the results for our model. To illustrate the rationality, robustness and stability of our algorithm, we also applied the RP-FIRF method to the yeast dataset.
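The splitting procedure can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the usual convention that each of the five parts serves exactly once as the test set, and the function name, the seed and the sample count of 23 are hypothetical.

```python
import random

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs over n_folds non-overlapping parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random assignment to parts
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test_idx = folds[i]                       # one part is held out
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Hypothetical toy size; the real datasets are far larger.
splits = list(five_fold_splits(n_samples=23))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Averaging a metric over the five (train, test) pairs gives the mean and standard deviation figures of the kind reported for each dataset.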
To guarantee the impartiality and objectivity of the test, the same parameter settings were used for the human and yeast datasets. We tuned the parameters of the RP classifier to obtain the best results. We used B1 = 10 blocks of independent projections to classify the training and test sets, set the size of each block to B2 = 30, and applied the k-Nearest Neighbor (KNN) base classifier with the leave-one-out test error estimate, where k = seq (1, 30, by = 8).
We then tested the RP-FIRF prediction method on the two datasets; the results based on five-fold cross-validation are shown in Tables 1 and 2. Table 1 shows that, on the human dataset, the proposed method achieved an average Accuracy (Acc), Sensitivity (Sen), Precision (PE), and Matthews correlation coefficient (MCC) of 97.89, 74.46, 100.00, and 85.31%, with standard deviations of 0.17, 2.18, 0.00, and 1.29%, respectively. Similarly, Table 2 shows that, on the yeast dataset, the average Accuracy is 97.35%, the average Sensitivity is 77.03%, the average Precision is 99.62%, and the average MCC is 86.31%, with standard deviations of 0.15, 1.17, 0.52, and 0.79%, respectively.
As shown above, our method detects SIPs well because of the appropriate feature extraction technique and classifier. The feature extraction technique plays a critical part in improving prediction accuracy, for three reasons: (1) The PSSM describes the protein sequence in numerical form. It gives a score for each amino acid at each position of the target protein sequence, so it not only represents the protein sequence but also preserves as much helpful information as possible. Accordingly, a PSSM contains almost all the information of a protein sequence needed for detecting SIPs. (2) The finite impulse response filter (FIRF) feature extraction method further improves the performance of the proposed model. (3) To reduce the negative influence of noise, PCA was employed to reduce the dimension of the data while preserving the integrity of the FIRF feature vector, so that the helpful information in the data is retained. In short, the experimental results show that the RP-FIRF model is well suited to SIP prediction.
Comparison of the proposed model with the SVM-based method
Although the RP-FIRF model achieved an accuracy of more than 90%, its effectiveness still needs to be tested and verified further. From the point of view of classification, the support vector machine (SVM) is a generalized linear classifier, and the SVM-based method is widely known in many fields of scientific research. Therefore, it is necessary to compare the prediction accuracy of our RP-FIRF model with that of the SVM-based method, using the same feature values on the two datasets mentioned above. We employed the LIBSVM package [20] to implement classification in the experiment. Our first task was to tune the main parameters of the SVM classifier. A radial basis function (RBF) was chosen as the kernel function, and its two parameters were tuned by grid search and set to c = 0.6 and g = 0.02.
As shown in Tables 3 and 4, we trained the RP-FIRF model and the SVM-based model on the yeast and human datasets, respectively, and compared them using five-fold cross-validation. Table 3 shows that the mean Accuracy, Sensitivity, Precision, and MCC of the SVM classifier on the yeast dataset are 92.32, 32.89, 100.00, and 53.07%, respectively. In contrast, the RP-FIRF method reached 97.35% average Accuracy, 77.03% average Sensitivity, 99.62% average Precision, and 86.31% average MCC on the yeast dataset. Likewise, Table 4 shows that the average Accuracy, Sensitivity, Precision, and MCC of the SVM classifier on the human dataset are 96.21, 54.44, 100.00, and 72.30%, whereas the proposed model achieved 97.89% average Accuracy, 74.46% average Sensitivity, 100.00% average Precision, and 85.31% average MCC. It is clear that the overall prediction results of the RP classifier are much better than those of the SVM classifier.
Meanwhile, receiver operating characteristic (ROC) curves, which are used to analyze binary classification systems (those with only two outcome categories), have been widely applied in many fields such as bioinformatics [21], forecasting of natural hazards [22], machine learning [23] and data mining [24]. Therefore, we also used ROC curves as a comprehensive index of the trade-off between sensitivity and specificity. The area under the curve (AUC) shows the discriminating capability of the classifier. The closer the curve is to the top-left corner, the higher the prediction accuracy; the farther away it is, the lower the accuracy. In other words, the larger the AUC, the stronger the discriminating capability.
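As a small illustration of how a ROC curve and its AUC are obtained from classifier scores, the sketch below sweeps a decision threshold over the scores and integrates the curve with the trapezoidal rule. The labels and scores are invented toy values, and the function names are illustrative rather than taken from any library.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering a threshold over the scores."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
        i = j
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))

# Toy example: 1 = interacting, 0 = non-interacting (invented values).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(labels, scores)
auc = area_under_curve(fpr, tpr)
print(round(auc, 4))
```

In this toy case 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9; a perfect ranking would give 1.0 and a random one about 0.5.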
Fig. 1 shows the ROC curves of the RP and SVM classifiers on the human dataset: the AUC of the SVM classifier is 0.7754 and that of the RP classifier is 0.8842. The ROC curves of the RP and SVM classifiers on the yeast dataset are plotted in Fig. 2: the AUC of the SVM classifier is 0.6631 and that of the RP classifier is 0.8896. On both datasets, the AUC of the RP classifier is significantly larger than that of the SVM classifier, so the RP method is an accurate and robust technique for SIP detection.
Comparison of the proposed model with other previous methods
To further verify that our approach obtains better results, we compared the RP-FIRF model with other existing methods on the two datasets mentioned above. Tables 5 and 6 list the comparison results on the two datasets. Table 5 shows that, on the yeast dataset, the RP-FIRF model achieved the highest average accuracy, 97.35%, compared with the other six methods (ranging from 66.28 to 87.46%). Likewise, the other six methods obtained a lower MCC (ranging from 15.77 to 28.42%) than the 86.31% of our proposed model on the same dataset. Similarly, Table 6 shows that the overall results of our prediction approach also surpass those of the other six methods on the human dataset. In summary, compared with the other six approaches on the yeast and human datasets, our RP-FIRF model improves the overall prediction accuracy. This illustrates that a good feature extraction tool and a suitable classifier are very important for a prediction model, and that our method is superior to the other six approaches and well suited to SIP prediction.
Conclusion
In this study, a machine learning model was put forward to predict SIPs based on protein primary sequence. The model, termed RP-FIRF, was developed by combining the Finite Impulse Response Filter with the Random Projection classifier. The main improvements of this method are attributable to the following aspects: (1) FIRF is a reasonable representation method that effectively extracts discriminative features and can process and analyze protein sequence data well. (2) The RP classifier is well suited to predicting SIPs and achieves a high recognition accuracy. The experimental results of the presented model on the yeast and human datasets revealed that the performance of the RP method is significantly better than that of the SVM-based method and six other previous methods. This shows that integrating the FIRF method with the RP classifier can significantly improve the accuracy of SIP prediction. Overall, we have predicted a reliable set of SIPs suitable for further computational as well as experimental analyses. In future research, more effective feature extraction methods and machine learning approaches will be explored for detecting SIPs.
Materials and methodology
Datasets
In our study, we constructed the datasets mainly from the UniProt database [29], which contains 20,199 curated human protein sequences. PPI-related information can be obtained from many different resources, such as DIP [30], BioGRID [31], IntAct [32], InnateDB [33] and MatrixDB [34]. From these databases, we collected SIP records in which the two interacting partners are identical protein sequences and the interaction type is characterized as “direct interaction”. On this basis, we obtained 2994 human self-interacting protein sequences for constructing the experimental datasets.
From the 2994 human SIPs, we selected the datasets used in the experiment to assess the performance of the RP-FIRF model in three steps [28]: (1) we removed protein sequences that may be fragments and retained only sequences between 50 and 5000 residues long from the whole human proteome; (2) to build the human positive dataset, we selected high-quality SIPs that meet at least one of the following conditions: (a) the self-interaction was revealed by at least one small-scale experiment or two kinds of large-scale experiments; (b) the protein is annotated as a homo-oligomer (including homodimer and homotrimer) in UniProt; (c) the self-interaction has been reported by more than two publications; (3) to build the human negative dataset, we removed from the whole human proteome all types of SIPs (including proteins annotated as ‘direct interaction’ and the broader ‘physical association’) as well as SIPs recorded in the UniProt database. In total, the final human dataset consisted of 1441 SIPs and 15,938 non-SIPs [28].
Following the same strategy, we generated a yeast dataset to further assess the cross-species ability of the RP-FIRF model. Finally, 710 SIPs formed the yeast positive dataset and 5511 non-SIPs constituted the yeast negative dataset [28].
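The class balance of the two datasets matters when reading accuracy figures. As a quick arithmetic check that uses only the counts stated above, the short sketch below computes the share of the larger class, which is the accuracy a trivial classifier would reach by labelling every protein as a non-SIP. The function name is illustrative.

```python
def majority_baseline(n_positive, n_negative):
    """Accuracy of a trivial classifier that always predicts the larger class."""
    return max(n_positive, n_negative) / (n_positive + n_negative)

human = majority_baseline(1441, 15938)   # SIPs and non-SIPs in the human dataset
yeast = majority_baseline(710, 5511)     # SIPs and non-SIPs in the yeast dataset
print(f"human: {human:.4f}, yeast: {yeast:.4f}")
```

About 91.7% of the human proteins and 88.6% of the yeast proteins are non-SIPs, so accuracy alone says little on these unbalanced datasets; sensitivity and MCC are the more informative measures.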
Assessment tools
In machine learning, the confusion matrix, also known as an error matrix, is commonly used to evaluate classification models [35, 36]. It records the actual and predicted classifications of a two-class classifier, as shown in Table 7.
To assess the stability and effectiveness of the present model, we computed five parameters: Accuracy (Acc), Sensitivity (Sen), Specificity (Sp), Precision (PE) and Matthews correlation coefficient (MCC). These parameters are defined as follows:
\[ Acc = \frac{TP + TN}{TP + FP + TN + FN} \]
\[ Sen = \frac{TP}{TP + FN} \]
\[ Sp = \frac{TN}{TN + FP} \]
\[ PE = \frac{TP}{TP + FP} \]
\[ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]
where TP (true positives) is the number of true interacting pairs correctly predicted, FP (false positives) is the number of true non-interacting pairs falsely predicted as interacting, TN (true negatives) is the number of true non-interacting pairs correctly predicted, and FN (false negatives) is the number of true interacting pairs falsely predicted as non-interacting. On the basis of these parameters, a ROC curve was plotted to evaluate the performance of the random projection method, and the area under the curve (AUC) was calculated to measure the performance of the classifier.
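The five measures follow directly from the four confusion-matrix counts. The sketch below uses their standard definitions; the counts in the example are invented to mimic an unbalanced dataset and are not taken from the paper's tables.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Standard Acc, Sen, Sp, PE and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)                 # share of true positives recovered
    sp = tn / (tn + fp)                  # share of true negatives recovered
    pe = tp / (tp + fp)                  # share of predicted positives that are right
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Acc": acc, "Sen": sen, "Sp": sp, "PE": pe, "MCC": mcc}

# Hypothetical counts: 100 positives, 900 negatives, no false positives.
metrics = classification_metrics(tp=80, fp=0, tn=900, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these invented counts the precision is 100% and the accuracy 98% even though a fifth of the positives are missed, which is the same pattern of high precision and lower sensitivity seen in the reported results.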
Position specific scoring matrix
Position Specific Scoring Matrix (PSSM) is a helpful technique that has been employed to detect distantly related proteins [37]. Accordingly, each protein sequence was transformed into a PSSM by using PSI-BLAST [38]. A given protein sequence is thereby converted into an H × 20 PSSM, which can be represented as follows:
\[ PSSM = \begin{bmatrix} C_{1,1} & C_{1,2} & \cdots & C_{1,20} \\ C_{2,1} & C_{2,2} & \cdots & C_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ C_{H,1} & C_{H,2} & \cdots & C_{H,20} \end{bmatrix} \]
where H denotes the length of the protein sequence and 20 is the number of amino acid types, since every sequence is composed of 20 different amino acids. For the query protein sequence, C_{αβ} is the score that the PSSM assigns to the βth amino acid at position α. C_{αβ} can be described as:
\[ C_{\alpha\beta} = \sum_{k=1}^{20} p(\alpha,k) \times q(\beta,k) \]
where p(α,k) represents the occurrence frequency of the kth amino acid at position α, and q(β,k) is the value of Dayhoff’s mutation matrix between the βth and kth amino acids. Different scores reflect different degrees of conservation: a higher score indicates a strongly conserved position, whereas a lower score indicates a weakly conserved position.
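The score formula is a weighted sum over amino acids, which a few lines of Python make concrete. The example below is a toy: it uses a three-letter alphabet instead of 20 amino acids, and both the frequency table and the mutation matrix are invented numbers, not Dayhoff values.

```python
def pssm_scores(freq, mutation):
    """C[a][b] = sum over k of freq[a][k] * mutation[b][k]."""
    return [[sum(p * q for p, q in zip(freq_row, mut_row))
             for mut_row in mutation]
            for freq_row in freq]

# Toy inputs (hypothetical): 2 sequence positions, 3 residue types.
freq = [[1.0, 0.0, 0.0],      # position 1: only residue type 0 observed
        [0.5, 0.5, 0.0]]      # position 2: types 0 and 1 equally frequent
mutation = [[2, -1, 0],
            [-1, 3, 1],
            [0, 1, 4]]
scores = pssm_scores(freq, mutation)
print(scores)
```

Each row of the result corresponds to one sequence position and each column to one residue type, so a protein of length H yields H rows.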
Overall, the PSSM has become increasingly important in SIP prediction research. Specifically, we employed PSI-BLAST to obtain the PSSM of each protein sequence for detecting SIPs. To obtain better scores and a large set of homologous sequences, the E-value threshold of PSI-BLAST was set to 0.001 (the E-value reported for a result is the number of alignments with an equal or better score expected by chance) and three iterations were used in this experiment [39, 40]. We thus obtain a matrix of M × 20 elements based on the PSSM, where M is the number of residues of a protein and 20 denotes the 20 types of amino acids.
Finite impulse response filters
In digital signal processing (DSP) [41], the finite impulse response filter (FIRF) is one of the most commonly used components; it performs signal pre-modulation, frequency band selection and filtering. FIRFs are widely employed in many fields such as communications [42], image processing [43], pattern recognition [44] and wireless sensor networks [45]. Many DSP methods have been applied in fundamental research in cytology, brain neurology, genetics and other fields. In our work, we applied FIRF to process the features of protein sequences that are used to predict SIPs. The FIRF method can highlight many important features of the problem and capture its details. We designed the filter with the Fourier series method as follows.
First, the frequency response function corresponding to the FIRF transfer function can be described as:
\[ H(e^{jw}) = \sum_{n=0}^{N-1} h(n) e^{-jwn} \]
where h(n) is the impulse response sequence and N is the number of samples of the frequency response H (e^{jw}). Given the frequency response H_{d} (e^{jw}) of the ideal filter, the design goal is to make H (e^{jw}) approximate H_{d} (e^{jw}) as closely as possible.
Next, h_{d}(n) is obtained by the inverse Fourier transform of H_{d} (e^{jw}):
\[ h_{d}(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} H_{d}(e^{jw}) e^{jwn} dw \]
If h_{d}(n) has finite length, it can be used directly. If h_{d}(n) has infinite length, we truncate it by multiplying it by a window function sequence w(n) of finite length:
\[ h(n) = h_{d}(n) w(n) \]
This gives the unit sample response of the designed FIR filter. The frequency response of the resulting filter is then checked against the design requirements.
The integral square error (ISE) between the frequency response of the ideal filter and that of the designed filter is defined as follows:
\[ ISE = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| H_{d}(e^{jw}) - H(e^{jw}) \right|^{2} dw \]
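The Fourier series (window) design procedure can be sketched in Python. The text does not state the filter type, cut-off frequency, filter length or window that were used, so the ideal low-pass response, the cut-off of π/2, the 21 taps and the Hamming window below are all assumptions chosen only for illustration.

```python
import cmath
import math

def ideal_lowpass_sample(m, cutoff):
    """h_d at offset m from the centre tap for an ideal low-pass filter."""
    if m == 0:
        return cutoff / math.pi
    return math.sin(cutoff * m) / (math.pi * m)

def design_fir(num_taps, cutoff):
    """Truncate the ideal response with a Hamming window: h(n) = h_d(n) w(n)."""
    centre = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal_lowpass_sample(n - centre, cutoff) * window)
    return taps

def frequency_response(taps, w):
    """H(e^{jw}) = sum over n of h(n) e^{-jwn}."""
    return sum(h * cmath.exp(-1j * w * n) for n, h in enumerate(taps))

taps = design_fir(num_taps=21, cutoff=math.pi / 2)   # assumed design values
gain_dc = abs(frequency_response(taps, 0.0))         # passband check
gain_nyquist = abs(frequency_response(taps, math.pi))  # stopband check
print(f"|H(0)| = {gain_dc:.4f}, |H(pi)| = {gain_nyquist:.4f}")
```

Evaluating the designed response at a passband and a stopband frequency, as in the last lines, is one simple way to check a filter against its design requirements; the ISE could be approximated by sampling the same function on a dense frequency grid.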
We cannot directly extract feature values from the PSSM, because each protein sequence has a different length and amino acid composition. To avoid feature vectors of unequal length, we multiply the transpose of the PSSM by the PSSM to obtain a 20 × 20 matrix. We then employ the FIRF technique to transform the PSSM of each protein sequence into a feature matrix of the same 20 × 20 size, whose values are arranged as a 400-dimensional vector. In this way, every protein sequence from the two datasets mentioned above was transformed into a 400-dimensional vector by the FIRF approach.
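The length-normalizing step can be illustrated on its own. The sketch below covers only the transpose product and the flattening to 400 values; how the FIR filter is applied to the matrix is not specified in enough detail to reproduce, so it is left out. The random integer matrices stand in for real PSSMs and the function name is hypothetical.

```python
import random

def transpose_product_features(pssm):
    """Flatten PSSM^T x PSSM (20 x 20) into a 400-dimensional vector."""
    n_cols = len(pssm[0])
    gram = [[sum(row[i] * row[j] for row in pssm) for j in range(n_cols)]
            for i in range(n_cols)]
    return [value for gram_row in gram for value in gram_row]

rng = random.Random(0)
# Stand-in PSSMs for two proteins of different length (60 and 350 residues).
short_pssm = [[rng.randint(-5, 5) for _ in range(20)] for _ in range(60)]
long_pssm = [[rng.randint(-5, 5) for _ in range(20)] for _ in range(350)]

short_vec = transpose_product_features(short_pssm)
long_vec = transpose_product_features(long_pssm)
print(len(short_vec), len(long_vec))
```

Both proteins yield vectors of the same length regardless of sequence length, which is what allows them to be fed to the same classifier.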
To remove the influence of noise and improve SIP prediction, we applied Principal Component Analysis (PCA) to the two datasets mentioned above, reducing the dimension of the feature vectors from 400 to 300. In this way, a smaller amount of information represents the whole data and the complexity is reduced, which helps to reduce the generalization error.
Random projection classifier
In mathematics and statistics, random projection (RP) is a technique for reducing the dimensionality of a set of points that lie in Euclidean space. It has been shown that N points in N-dimensional space can almost always be projected onto a space of dimension C log N with control over the ratio of distances and the error [46, 47]. RP has been successfully applied to the reconstruction of frequency-sparse signals [48], face recognition [49], protein subcellular localization [50] and textual and visual information retrieval [51].
We formally describe the RP classifier as follows. First, let
\[ \Gamma = \{A_{i}\}_{i=1}^{N}, \quad A_{i} \in R^{n} \]
be the original high-dimensional dataset, where n is the dimension and N is the number of samples in the dataset. The goal of dimensionality reduction is to embed the vectors from the high-dimensional space R^{n} into a lower-dimensional space R^{q}, where q << n. The output is defined as follows:
\[ \tilde{\Gamma} = \{\tilde{A}_{i}\}_{i=1}^{N}, \quad \tilde{A}_{i} \in R^{q} \]
where q is close to the intrinsic dimensionality of Γ. The vectors Ã_{i} are regarded as the embedding vectors.
To reduce the dimension of Γ by random projection, a set of random vectors γ = {r_{i}}_{i=1}^{n} must first be constructed, where r_{i} ∈ R^{q}. The random basis can be obtained by two common choices [46]:
(1) The vectors {r_{i}}_{i=1}^{n} are normally distributed over the q-dimensional unit sphere.
(2) The components of the vectors {r_{i}}_{i=1}^{n} are drawn from a Bernoulli +1/−1 distribution and the vectors are normalized so that ‖r_{i}‖_{l_{2}} = 1 for i = 1, …, n.
Then, a q × n matrix R is formed whose columns are the vectors in γ. The embedding Ã_{i} of A_{i} is obtained by
\[ \tilde{A}_{i} = R \cdot A_{i} \]
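The second choice of random basis and the embedding step can be sketched as follows. The dimensions q = 3 and n = 8, the seed and the input vector are arbitrary illustration values, and the function names are hypothetical.

```python
import math
import random

def bernoulli_projection_matrix(q, n, rng):
    """q x n matrix whose n columns are +1/-1 vectors scaled to unit l2 norm."""
    scale = 1.0 / math.sqrt(q)   # a +1/-1 vector of length q has norm sqrt(q)
    return [[rng.choice((-1.0, 1.0)) * scale for _ in range(n)]
            for _ in range(q)]

def embed(R, a):
    """Compute R . a, mapping a vector from R^n to R^q."""
    return [sum(r * v for r, v in zip(row, a)) for row in R]

rng = random.Random(0)
q, n = 3, 8                                   # illustration sizes only
R = bernoulli_projection_matrix(q, n, rng)
a = [float(v) for v in range(n)]              # stand-in for one sample A_i
a_tilde = embed(R, a)
column_norms = [math.sqrt(sum(R[i][j] ** 2 for i in range(q)))
                for j in range(n)]
print(len(a_tilde), [round(c, 6) for c in column_norms])
```

Because the entries are only signs and a common scale factor, this choice needs no floating-point random numbers, which is one reason it is popular in practice.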
In our proposed method, the random projection classifier is trained on a training set, and the members of the ensemble are built from random projections.
The dimension of the target space is set to a fraction of the dimension of the space in which the training samples reside. We build an n × N matrix G whose columns are the column vectors in Γ, the training set defined above.
Then, we construct k random matrices {R_{i}}_{i=1}^{k} of size q × n, where q and n are as defined above and k is the number of classifiers. The columns of the matrices are normalized so that their l_{2} norm is 1.
Next, the training sets {T_{i}}_{i=1}^{k} are constructed by projecting G onto the k random matrices {R_{i}}_{i=1}^{k}:
\[ T_{i} = R_{i} \cdot G, \quad i = 1, \ldots, k \]
The training sets are fed into an inducer, which outputs a set of classifiers {ℓ_{i}}_{i=1}^{k}. To classify a new sample I with classifier ℓ_{i}, we first embed I into the reduced space R^{q} by projecting it with the random matrix R_{i}:
\[ \tilde{I} = R_{i} \cdot I \]
where Ĩ is the embedding of I; the classification of I is obtained from the classification of Ĩ by ℓ_{i}. In this ensemble method, the random projection classifier applies a data-driven voting threshold to the classification outcomes of all classifiers {ℓ_{i}}_{i=1}^{k} for Ĩ to produce the final classification result.
In this experiment, the random projections were divided into B1 = 10 non-overlapping blocks, each of size B2 = 30, and within each block the projection that achieved the smallest estimate of the test error was selected. We used the k-Nearest Neighbor (KNN) classifier as the base classifier with the leave-one-out test error estimate, where k = seq (1, 30, by = 8). The prior probability of interacting pairs in the training sample set was taken as the voting parameter. Our classifier aggregates the results of applying the base classifier to the selected projections and uses the data-driven voting threshold to make the final decision.
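The block-wise selection and voting scheme can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the projections have Gaussian entries, the toy data are two synthetic clusters, the block counts are reduced to B1 = 3 and B2 = 5 to keep the run short, and the voting threshold is simply the share of positives in the training set. All function names are hypothetical.

```python
import math
import random

def project(R, x):
    """Embed vector x (length n) with a q x n matrix R."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]

def random_matrix(q, n, rng):
    """q x n matrix with Gaussian entries (an assumed choice of projection)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(q)]

def knn_predict(xs, ys, x, k, skip=None):
    """Majority vote of the k nearest neighbours; index `skip` is left out."""
    nearest = sorted((math.dist(t, x), y)
                     for i, (t, y) in enumerate(zip(xs, ys)) if i != skip)[:k]
    return 1 if 2 * sum(y for _, y in nearest) > len(nearest) else 0

def loo_error(xs, ys, k):
    """Leave-one-out error of k-NN on a (projected) training set."""
    wrong = sum(knn_predict(xs, ys, x, k, skip=i) != y
                for i, (x, y) in enumerate(zip(xs, ys)))
    return wrong / len(xs)

def fit(xs, ys, q, b1, b2, ks, rng):
    """In each of b1 blocks, keep the projection (and k) with the lowest LOO error."""
    n = len(xs[0])
    chosen = []
    for _ in range(b1):
        best = None
        for _ in range(b2):
            R = random_matrix(q, n, rng)
            proj = [project(R, x) for x in xs]
            for k in ks:
                err = loo_error(proj, ys, k)
                if best is None or err < best[0]:
                    best = (err, R, proj, k)
        chosen.append(best[1:])
    return chosen

def predict(chosen, ys, x, threshold):
    """Average the base-classifier votes and compare with the voting threshold."""
    votes = [knn_predict(proj, ys, project(R, x), k) for R, proj, k in chosen]
    return 1 if sum(votes) / len(votes) >= threshold else 0

# Synthetic toy data: 20 points per class in 6 dimensions (not protein features).
rng = random.Random(0)
dim = 6
xs = [[rng.gauss(5.0 * label, 0.5) for _ in range(dim)]
      for label in (0, 1) for _ in range(20)]
ys = [label for label in (0, 1) for _ in range(20)]

ks = list(range(1, 31, 8))          # Python equivalent of R's seq(1, 30, by = 8)
model = fit(xs, ys, q=2, b1=3, b2=5, ks=ks, rng=rng)
threshold = sum(ys) / len(ys)       # assumed: share of positives in training data
pred_pos = predict(model, ys, [5.0] * dim, threshold)
pred_neg = predict(model, ys, [0.0] * dim, threshold)
print(ks, pred_pos, pred_neg)
```

The grid `seq (1, 30, by = 8)` expands to k = 1, 9, 17 and 25. Selecting the best projection per block rather than averaging over all of them is what keeps poorly separating projections out of the final vote.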
Availability of data and materials
Not applicable.
References
De Las Rivas J, Fontanillo C. Protein–protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol. 2010;6(6):e1000807.
Marianayagam NJ, Sunde M, Matthews JM. The power of two: protein dimerization in biology. Trends Biochem Sci. 2004;29(11):618–25.
Hashimoto K, et al. Caught in self-interaction: evolutionary and functional mechanisms of protein homooligomerization. Phys Biol. 2011;8(3):035007.
Ispolatov I, et al. Binding properties and evolution of homodimers in protein–protein interaction networks. Nucleic Acids Res. 2005;33(11):3629–35.
Wang YB, et al. Detection of interactions between proteins by using Legendre moments descriptor to extract discriminatory information embedded in PSSM. Molecules. 2017;22(8):1366.
Liu Z, et al. Proteome-wide prediction of self-interacting proteins based on multiple properties. Mol Cell Proteomics. 2013;12(6):1689–700.
Miller S, et al. The accessible surface area and stability of oligomeric proteins. Nature. 1987;328(6133):834.
You ZH, Li X, Chan KC. An improved sequence-based prediction protocol for protein-protein interactions using amino acids substitution matrix and rotation forest ensemble classifiers. Neurocomputing. 2017;228:277–82.
You Z, et al. A SVM-based system for predicting protein-protein interactions using a novel representation of protein sequences. In: Intelligent Computing Theories. Berlin Heidelberg: Springer; 2013. p. 629–37.
You ZH, et al. Prediction of protein-protein interactions from amino acid sequences using extreme learning machine combined with auto covariance descriptor. In: 2013 IEEE Workshop on Memetic Computing (MC). IEEE; 2013. p. 80–5.
Zhang QC, et al. Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature. 2012;490(7421):556.
Zou Q, et al. Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy. BMC Syst Biol. 2016;10(4):114.
Hamp T, Rost B. Evolutionary profiles improve protein–protein interaction prediction from sequence. Bioinformatics. 2015;31(12):1945–50.
Wan S, Duan Y, Zou Q. HPSLPred: an ensemble multi-label classifier for human protein subcellular location prediction with imbalanced source. Proteomics. 2017;17(17–18):1700262.
Song L, et al. nDNA-prot: identification of DNA-binding proteins based on unbalanced classification. BMC Bioinformatics. 2014;15(1):298.
Pitre S, et al. PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics. 2006;7(1):365.
Xia JF, Han K, Huang DS. Sequence-based prediction of protein-protein interactions by means of rotation forest and autocorrelation descriptor. Protein Pept Lett. 2010;17(1):137–45.
Li T, et al. A scored human protein–protein interaction network to catalyze genomic interpretation. Nat Methods. 2017;14(1):61.
Wang YB, et al. Predicting protein–protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Mol BioSyst. 2017;13(7):1336–44.
Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2(3):27.
Wang L, et al. A computational-based method for predicting drug–target interactions by using stacked autoencoder deep neural network. J Comput Biol. 2018;25(3):361–73.
Peres D, Cancelliere A. Derivation and evaluation of landslide-triggering thresholds by a Monte Carlo approach. Hydrol Earth Syst Sci. 2014;18(12):4913–31.
Li JQ, et al. PSPEL: in silico prediction of self-interacting proteins from amino acids sequences using ensemble learning. IEEE/ACM Trans Computat Biol Bioinform. 2017;14(5):1165–72.
Wang Y, et al. Predicting protein interactions using a deep learning method-stacked sparse autoencoder combined with a probabilistic classification vector machine. Complexity. 2018;2018.
Du X, et al. A novel feature extraction scheme with ensemble coding for protein–protein interaction prediction. Int J Mol Sci. 2014;15(7):12731–49.
Zahiri J, et al. PPIevo: protein–protein interaction prediction from PSSM based evolutionary information. Genomics. 2013;102(4):237–42.
Zahiri J, et al. LocFuse: human protein–protein interaction prediction via classifier fusion using protein localization information. Genomics. 2014;104(6):496–503.
Liu X, et al. SPAR: a random forest-based predictor for self-interacting proteins with fine-grained domain information. Amino Acids. 2016;48(7):1655–65.
Consortium U. UniProt: a hub for protein information. Nucleic Acids Res. 2014;43(D1):D204–12.
Salwinski L, et al. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl_1):D449–51.
Chatr-Aryamontri A, et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 2017;45(D1):D369–79.
Orchard S, et al. The MIntAct project—IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2013;42(D1):D358–63.
Breuer K, et al. InnateDB: systems biology of innate immunity and beyond—recent updates and continuing curation. Nucleic Acids Res. 2012;41(D1):D1228–33.
Launay G, et al. MatrixDB, the extracellular matrix interaction database: updated content, a new navigator and expanded functionalities. Nucleic Acids Res. 2014;43(D1):D321–7.
Stehman SV. Selecting and interpreting measures of thematic classification accuracy. Remote Sens Environ. 1997;62(1):77–89.
Provost FJ, Fawcett T, Kohavi R. The case against accuracy estimation for comparing induction algorithms. In: ICML; 1998.
Gribskov M, McLachlan AD, Eisenberg D. Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci. 1987;84(13):4355–8.
Altschul SF, Koonin EV. Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases. Trends Biochem Sci. 1998;23(11):444–7.
Chen ZH, et al. Prediction of self-interacting proteins from protein sequence information based on random projection model and fast Fourier transform. Int J Mol Sci. 2019;20(4):930.
Chen ZH, et al. An improved deep Forest model for predicting self-interacting proteins from protein sequence using wavelet transformation. Front Genet. 2019;10.
Zhao H, Qiu G, Yao L, et al. Design of fractional order digital FIR differentiators using frequency response approximation. In: Proceedings. 2005 International Conference on Communications, Circuits and Systems, 2005. IEEE; 2005.
Haigh PA, et al. Multi-band carrier-less amplitude and phase modulation for bandlimited visible light communications systems. IEEE Wirel Commun. 2015;22(2):46–53.
Gastal ESL, Oliveira MM. High-order recursive filtering of non-uniformly sampled signals for image and video processing. Comput Graph Forum. 2015;34(2):81–93.
Sengupta N, Kasabov N. Spike-time encoding as a data compression technique for pattern recognition of temporal data. Inf Sci. 2017;406:133–45.
Shi X, et al. Infinite impulse response graph filters in wireless sensor networks. IEEE Signal Process Lett. 2015;22(8):1113–7.
Schclar A, Rokach L. Random projection ensemble classifiers. In: International Conference on Enterprise Information Systems. Berlin, Heidelberg: Springer; 2009. p. 309–16.
Song XY, et al. An ensemble classifier with random projection for predicting protein–protein interactions using sequence and evolutionary information. Appl Sci. 2018;8(1):89.
Donoho DL. Compressed sensing. IEEE Trans Inf Theory. 2006;52(4):1289–306.
Ma C, et al. Random projection-based partial feature extraction for robust face recognition. Neurocomputing. 2015;149:1232–44.
Wan S, Mak MW, Kung SY. R3P-Loc: a compact multi-label predictor using ridge regression and random projection for protein subcellular localization. J Theor Biol. 2014;360:34–45.
Hong R, et al. Learning visual semantic relationships for efficient visual retrieval. IEEE Trans Big Data. 2015;1(4):152–61.
Acknowledgments
The authors would like to thank all the guest editors and anonymous reviewers for their constructive advice.
About this supplement
This article has been published as part of BMC Genomics Volume 20 Supplement 13, 2019: Proceedings of the 2018 International Conference on Intelligent Computing (ICIC 2018) and Intelligent Computing and Biomedical Informatics (ICBI) 2018 conference: genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-20-supplement-13.
Funding
This work is supported in part by the National Science Foundation of China under Grants 61373086 and 61572506.
Author information
Contributions
ZHC and ZHY conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the manuscript. LPL, YBW and YQ designed, performed and analyzed experiments and wrote the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Chen, ZH., You, ZH., Li, LP. et al. Identification of self-interacting proteins by integrating random projection classifier and finite impulse response filter. BMC Genomics 20 (Suppl 13), 928 (2019). https://doi.org/10.1186/s12864-019-6301-1
DOI: https://doi.org/10.1186/s12864-019-6301-1