Datasets
In our study, we constructed the datasets mainly from the UniProt database [29], which contains 20,199 curated human protein sequences. PPI-related information can be obtained from many different resources, such as DIP [30], BioGRID [31], IntAct [32], InnateDB [33] and MatrixDB [34]. From these relational databases, we assembled the SIP datasets, each entry comprising two identical interacting protein sequences whose interaction type was annotated as “direct interaction”. On this basis, we constructed the experimental datasets from 2994 human self-interacting protein sequences.
From the 2994 human SIPs, we selected the experimental datasets and assessed the performance of the RP-FIRF model in three main steps [28]: (1) we removed protein sequences that may be fragments and retained, from the whole human proteome, only sequences between 50 and 5000 residues in length (a minimal sketch of this filter is shown below); (2) to build the human positive dataset, we formed a high-quality SIP set in which each protein meets at least one of the following conditions: (a) the self-interaction was revealed by at least one small-scale experiment or two sorts of large-scale experiments; (b) the protein has been annotated as a homo-oligomer (including homodimer and homotrimer) in UniProt; (c) the self-interaction has been reported by more than two publications; (3) for the human negative dataset, we removed all types of SIPs from the whole human proteome (including proteins annotated as ‘direct interaction’ and the broader ‘physical association’) together with SIPs detected in the UniProt database. In summary, we obtained the final human dataset, composed of 1441 SIPs and 15,938 non-SIPs [28].
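The length filter in step (1) can be sketched as follows, assuming the proteome is available as a FASTA file (the file names and the use of Biopython are illustrative, not part of the original pipeline description):

```python
# Sketch of step (1): keep sequences of 50-5000 residues.
# Assumes Biopython; "human_proteome.fasta" is a placeholder file name.
from Bio import SeqIO

kept = [
    rec for rec in SeqIO.parse("human_proteome.fasta", "fasta")
    if 50 <= len(rec.seq) <= 5000
]
SeqIO.write(kept, "human_proteome_filtered.fasta", "fasta")
```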
Following the construction of the human dataset, we further assessed the cross-species ability of the RP-FIRF model by repeating the same strategy to generate a yeast dataset. Finally, 710 SIPs were assigned to the yeast positive dataset and 5511 non-SIPs to the yeast negative dataset [28].
Assessment tools
In machine learning, the confusion matrix, also known as an error matrix, is commonly employed to evaluate classification models [35, 36]. It summarizes the actual and predicted classifications of a two-class classifier, as shown in Table 7.
In our study, to assess the stability and effectiveness of the present model, we computed the values of five measures: accuracy (Acc), sensitivity (Sen), specificity (Sp), precision (PE) and Matthews correlation coefficient (MCC). These measures are defined as follows:
$$ Acc=\frac{TP+ TN}{TP+ FP+ TN+ FN} $$
(1)
$$ Sen=\frac{TP}{TP+ FN} $$
(2)
$$ Sp=\frac{TN}{FP+ TN} $$
(3)
$$ PE=\frac{TP}{FP+ TP} $$
(4)
$$ MCC=\frac{\left( TP\times TN\right)-\left( FP\times FN\right)}{\sqrt{\left( TP+ FN\right)\times \left( TN+ FP\right)\times \left( TP+ FP\right)\times \left( TN+ FN\right)}} $$
(5)
where TP (true positives) is the number of true interacting pairs correctly predicted, FP (false positives) is the number of true non-interacting pairs falsely predicted as interacting, TN (true negatives) is the number of true non-interacting pairs correctly predicted, and FN (false negatives) is the number of true interacting pairs falsely predicted as non-interacting. On the basis of these counts, a ROC curve was plotted to evaluate the performance of the random projection method, and the area under the curve (AUC) was calculated to measure the performance of the classifier.
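For concreteness, the five measures of Eqs. (1)-(5) can be computed from the confusion-matrix counts as in the following sketch (illustrative only; the counts come from the classifier's predictions):

```python
import math

def evaluate(tp, fp, tn, fn):
    """Compute Acc, Sen, Sp, PE and MCC (Eqs. 1-5) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)            # sensitivity (recall)
    sp  = tn / (fp + tn)            # specificity
    pe  = tp / (fp + tp)            # precision
    mcc = ((tp * tn) - (fp * fn)) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return acc, sen, sp, pe, mcc
```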
Position specific scoring matrix
In our experiment, the Position Specific Scoring Matrix (PSSM) is a useful technique that was employed to detect distantly related proteins [37]. Accordingly, each protein sequence was transformed into a PSSM using PSI-BLAST [38]. A given protein sequence is thereby converted into an H × 20 PSSM, which can be represented as follows:
$$ M=\left\{{M}_{\alpha \beta}:\alpha =1\cdots H,\kern0.5em \beta =1\cdots 20\right\} $$
(6)
where H denotes the length of the protein sequence and 20 is the number of amino acid types, since every sequence is composed of 20 different amino acids. For the query protein sequence, the score Cαβ assigned by the PSSM corresponds to the β-th amino acid at position α. Cαβ can be described as:
$$ {C}_{\alpha \beta}={\sum}_{k=1}^{20}p\left(\alpha, k\right)\times q\left(\beta, k\right) $$
(7)
where p(α,k) represents the occurrence frequency of the k-th amino acid at position α, and q(β,k) is the Dayhoff mutation matrix value between the β-th and k-th amino acids. Different scores thus reflect different degrees of positional conservation: a higher score indicates a strongly conserved position, whereas a weakly conserved position receives a lower value.
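In matrix form, Eq. (7) is the product of the H × 20 frequency matrix and the transpose of the 20 × 20 mutation matrix. A minimal NumPy sketch (with placeholder matrices p and q standing in for the real frequency and Dayhoff values) follows:

```python
import numpy as np

H = 120                      # example sequence length (placeholder)
p = np.random.rand(H, 20)    # p[a, k]: frequency of amino acid k at position a (placeholder)
q = np.random.rand(20, 20)   # q[b, k]: Dayhoff mutation matrix entry (placeholder)

# Eq. (7): C[a, b] = sum_k p[a, k] * q[b, k], i.e. C = p @ q.T
C = p @ q.T                  # H x 20 PSSM scores
```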
Overall, the PSSM has become increasingly important in SIP prediction research. Specifically, we employed PSI-BLAST to obtain the PSSM of each protein sequence for detecting SIPs. To achieve better scores and a larger set of homologous sequences, the E-value parameter of PSI-BLAST, which reflects the number of sequence alignments expected by chance for a given result, was set to 0.001, and three iterations were selected in this experiment [39, 40]. Each protein thereby yields a matrix of M × 20 elements based on the PSSM, where M represents the number of residues of the protein and 20 denotes the 20 types of amino acids.
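With NCBI BLAST+ installed, a PSSM for one query sequence can be generated along these lines (the database name and file paths are illustrative placeholders):

```python
import subprocess

# Run PSI-BLAST with the settings used here: E-value 0.001, three iterations.
# "swissprot" and the file names are placeholders for the actual database/paths.
subprocess.run([
    "psiblast",
    "-query", "protein.fasta",
    "-db", "swissprot",
    "-num_iterations", "3",
    "-evalue", "0.001",
    "-out_ascii_pssm", "protein.pssm",
], check=True)
```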
Finite impulse response filters
In the field of digital signal processing (DSP) [41], the finite impulse response filter (FIRF) is one of the most commonly used components; it performs signal pre-modulation, frequency band selection and filtering. FIRFs are widely employed in many fields such as communications [42], image processing [43], pattern recognition [44] and wireless sensor networks [45]. Many DSP methods have also been applied in fundamental research in cytology, brain neurology, genetics and other fields. In our work, we applied the FIRF to process the characteristics of protein sequences for predicting SIPs, so that many important features of the problem are fully highlighted. We design the filter using the Fourier series (window) method, as detailed below.
First, the frequency response of the FIRF transfer function can be described as:
$$ H\left({e}^{jw}\right)=\sum \limits_{n=0}^{N-1}h(n){e}^{- jwn} $$
(8)
where h(n) is the impulse response sequence and N represents the number of samples of the frequency response H(ejw). Given the frequency response Hd(ejw) of the ideal filter, we let H(ejw) approximate Hd(ejw) as closely as possible:
$$ {H}_d\left({e}^{jw}\right)=\sum \limits_{n=-\infty}^{\infty }{h}_d(n){e}^{- jwn} $$
(9)
Then hd(n) can be obtained by the inverse Fourier transform of Hd(ejw):
$$ {h}_d(n)=\frac{1}{2\pi }{\int}_{-\pi}^{\pi }{H}_d\left({e}^{jw}\right){e}^{jw n} dw $$
(10)
In general, hd(n) has infinite length; we therefore truncate it by applying a finite-length window function sequence w(n):
$$ h(n)={h}_d(n)w(n) $$
(11)
According to the above formulas, we obtain the unit sample response of our designed FIR filter. Whether the filter meets the design requirements can be checked with the following formula:
$$ H\left({e}^{jw}\right)= DTFT\left[h(n)\right] $$
(12)
The integral square error (ISE) between the frequency responses of the ideal filter and our designed filter is defined as follows:
$$ {\varepsilon}^2=\frac{1}{2\pi }{\int}_{-\pi}^{\pi }{\left|{H}_d\left({e}^{jw}\right)-H\left({e}^{jw}\right)\right|}^2 dw $$
(13)
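As a concrete illustration of Eqs. (8)-(13), the following NumPy sketch designs a low-pass FIR filter by the window method; the filter length, cutoff frequency and Hamming window are our own example choices, not parameters stated above:

```python
import numpy as np

N = 21                         # filter length (example choice)
wc = 0.3 * np.pi               # cutoff of the ideal low-pass filter (example choice)
n = np.arange(N)
m = n - (N - 1) / 2            # centre the ideal response for linear phase

# Eq. (10): ideal impulse response h_d(n) = sin(wc*m) / (pi*m), written via sinc
hd = (wc / np.pi) * np.sinc(wc * m / np.pi)

# Eq. (11): truncate with a finite-length window sequence w(n) (Hamming here)
w = np.hamming(N)
h = hd * w

# Eq. (12): check the design via the DTFT of h(n), sampled on a frequency grid
freqs = np.linspace(-np.pi, np.pi, 1024)
Hw = np.array([np.sum(h * np.exp(-1j * f * n)) for f in freqs])
Hd = np.where(np.abs(freqs) <= wc, np.exp(-1j * freqs * (N - 1) / 2), 0.0)

# Eq. (13): integral square error, approximated numerically
ise = np.trapz(np.abs(Hd - Hw) ** 2, freqs) / (2 * np.pi)
```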
In our study, we cannot directly extract fixed-length eigenvalues from a protein because each protein sequence has a different amino acid composition. To prevent the generation of feature vectors of unequal length, we multiply the transpose of the PSSM by the PSSM itself to obtain a 20 × 20 matrix. We then employ the FIRF technique to transform the PSSM of each protein sequence into a feature vector of the same size as the 20 × 20 matrix, whose values are arranged as a 400-dimensional vector. Eventually, every protein sequence from the two datasets mentioned above was transformed into a 400-dimensional vector by the FIRF approach.
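The fixed-length feature construction can be sketched as follows; note that only the PSSMᵀ·PSSM product and the flattening specified above are shown, while the FIRF transform applied in the full method is omitted here because its exact parameters are not given in the text:

```python
import numpy as np

def pssm_to_feature(pssm: np.ndarray) -> np.ndarray:
    """Map an H x 20 PSSM to a fixed 400-dimensional feature vector.

    PSSM^T @ PSSM gives a 20 x 20 matrix regardless of sequence
    length H, which is then flattened to 400 values. (The FIRF
    filtering step of the full method is omitted in this sketch.)
    """
    sq = pssm.T @ pssm          # 20 x 20, independent of H
    return sq.ravel()           # 400-dimensional vector
```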
To remove the influence of noise and improve the SIP prediction results, we applied Principal Component Analysis (PCA) to remove noisy features from the two datasets, reducing their dimension from 400 to 300. In this way, a smaller amount of information represents the whole data, lowering the model complexity and thereby the generalization error.
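This reduction step corresponds to a standard PCA fit, sketched here with scikit-learn (the feature matrix X is a random placeholder for the 400-dimensional FIRF features):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 400)       # placeholder for the 400-D FIRF feature matrix
pca = PCA(n_components=300)        # keep 300 components, as in our setup
X_reduced = pca.fit_transform(X)   # shape: (500, 300)
```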
Random projection classifier
In mathematics and statistics, Random Projection (RP) is a technique for reducing the dimensionality of points lying in Euclidean space, on which our classifier is built. It has been shown that N points in an N-dimensional space can almost always be mapped to a space of dimension C log N with controlled distortion of pairwise distances [46, 47]. RP has been successfully applied to the reconstruction of frequency-sparse signals [48], face recognition [49], protein subcellular localization [50], and textual and visual information retrieval [51].
We formally describe the RP classifier in detail as follows. First, let
$$ \varGamma ={\left\{{A}_i\right\}}_{i=1}^N,\kern0.5em {A}_i\in {R}^n $$
(14)
be the original high-dimensional dataset, where n represents the high dimension and N denotes the size of the dataset. The goal of dimensionality reduction is to embed the vectors from the high-dimensional space Rn into a lower-dimensional space Rq, where q ≪ n. The output data are defined as follows:
$$ \overset{\sim }{\varGamma }={\left\{\overset{\sim }{A_i}\right\}}_{i=1}^N,\overset{\sim }{A_i}\in {R}^q $$
(15)
where q is close to the intrinsic dimensionality of Γ. The vectors of Γ̃ are regarded as the embedded vectors.
To reduce the dimension of Γ via the random projection method, a random vector set γ = {ri}, i = 1, …, k, must first be constructed, where ri ∈ Rq. The random basis can be obtained by one of two common choices [46], both sketched in the example after this list:

(1) The vectors {ri} are normally distributed over the q-dimensional unit sphere.

(2) The components of the vectors {ri} are drawn from a Bernoulli +1/−1 distribution, and the vectors are normalized so that ‖ri‖ℓ2 = 1 for i = 1, …, k.
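Both constructions can be written in a few lines of NumPy; the sketch below follows the shapes used in our notation (q and n values are examples), and also shows the embedding of Eq. (16):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_projection(q: int, n: int) -> np.ndarray:
    """Choice (1): Gaussian columns, normalized onto the unit sphere."""
    R = rng.standard_normal((q, n))
    return R / np.linalg.norm(R, axis=0)      # each column has unit l2 norm

def bernoulli_projection(q: int, n: int) -> np.ndarray:
    """Choice (2): +1/-1 (Bernoulli) entries, columns scaled to unit l2 norm."""
    R = rng.choice([-1.0, 1.0], size=(q, n))
    return R / np.sqrt(q)                     # each column then has l2 norm 1

# Eq. (16): embed a sample A_i from R^n into R^q (example dimensions)
q, n = 300, 400
R = gaussian_projection(q, n)
A_i = rng.standard_normal(n)
A_tilde = R @ A_i                             # embedded vector in R^q
```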
Then, the columns of the q × n matrix R consist of the vectors in γ. The embedding Ãi of Ai is obtained by
$$ \overset{\sim }{A_i}=R\cdot {A}_i $$
(16)
In our proposed method, the random projection classifier is trained on a training set, and the ensemble is enriched with components based on random projection.
Next, the size of the target space was set to a fraction of the space in which the training members reside. We built an n × N matrix G whose columns are the column vectors in Γ, the training set given in Eq. (14).
$$ G=\left({A}_1|{A}_2|...|{A}_N\right) $$
(17)
Then, we construct k random matrices {Ri}, i = 1, …, k, each of size q × n, where q and n are as introduced above and k is the number of classifiers. The columns of these matrices are normalized so that their ℓ2 norm is 1.
We then construct the training sets {Ti}, i = 1, …, k, by projecting G onto the k random matrices {Ri}:
$$ {T}_i={R}_i\cdot G,\kern0.5em i=1,...,k $$
(18)
The training sets are fed into an inducer, whose outputs are a set of classifiers {ℓi}, i = 1, …, k. To classify a new sample I with classifier ℓi, we first embed I into the reduced space Rq by mapping it through the random matrix Ri as follows:
$$ \overset{\sim }{I}={R}_i\cdot I $$
(19)
where Ĩ is the embedding of I; the classification of I is then obtained from the classification of Ĩ by ℓi. In this ensemble method, the random projection classifier applies a data-driven voting threshold to the outcomes of all the classifiers {ℓi}, i = 1, …, k, on Ĩ to produce the ultimate classification result for I.
In this experiment, the random projections were split into B1 = 10 non-overlapping blocks, and from each block of size B2 = 30 the projection achieving the smallest estimate of the test error was carefully selected. We used the k-nearest neighbour (KNN) classifier as the base classifier with the leave-one-out test error estimate, where the number of neighbours k ranges over seq(1, 30, by = 8), i.e. k ∈ {1, 9, 17, 25}. The prior probability of interacting pairs in the training set was taken as the voting parameter. Our classifier integrates the results of applying the base classifier to the selected projections, using the data-driven voting threshold to confirm the final decision.
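A simplified sketch of this ensemble follows, under assumptions we make explicit: B1 = 10 blocks with B2 = 30 candidate projections per block and KNN base classifiers as described above, but with binary 0/1 labels and a fixed 0.5 voting threshold standing in for the data-driven, prior-based threshold of the actual method:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)

def rp_ensemble_fit(X, y, q=5, B1=10, B2=30, ks=(1, 9, 17, 25)):
    """For each of B1 blocks, try B2 random projections and keep the
    (projection, KNN) pair with the smallest leave-one-out error."""
    n = X.shape[1]
    members = []
    for _ in range(B1):
        best = None
        for _ in range(B2):
            R = rng.standard_normal((q, n))
            R /= np.linalg.norm(R, axis=0)        # unit-norm columns
            XR = X @ R.T                          # project the training set
            for k in ks:
                clf = KNeighborsClassifier(n_neighbors=k)
                err = 1 - cross_val_score(clf, XR, y, cv=LeaveOneOut()).mean()
                if best is None or err < best[0]:
                    best = (err, R, k)
        _, R, k = best
        members.append((R, KNeighborsClassifier(n_neighbors=k).fit(X @ R.T, y)))
    return members

def rp_ensemble_predict(members, X, threshold=0.5):
    """Average the members' 0/1 votes and apply the voting threshold
    (fixed at 0.5 here; the actual method tunes it from the data)."""
    votes = np.mean([clf.predict(X @ R.T) for R, clf in members], axis=0)
    return (votes >= threshold).astype(int)
```

The exhaustive leave-one-out loop is slow but mirrors the selection criterion described above; in practice the error estimation could be vectorized or replaced with a cheaper cross-validation scheme.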