Prediction of subcellular location of apoptosis proteins by incorporating PsePSSM and DCCA coefficient based on LFDA dimensionality reduction

Background Apoptosis is associated with several human diseases, including cancer, autoimmune disease, neurodegenerative disease and ischemic damage. Information on the subcellular localization of apoptosis proteins is very important for understanding the mechanism of programmed cell death and for drug development. The prediction of the subcellular localization of apoptosis proteins nevertheless remains a challenging task. Results In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. Firstly, features are extracted from the protein sequences by combining the pseudo-position specific scoring matrix (PsePSSM) and the detrended cross-correlation analysis coefficient (DCCA coefficient); then the dimensionality of the extracted feature information is reduced by local Fisher discriminant analysis (LFDA). Finally, the optimal feature vectors are input to the SVM classifier to predict the subcellular location of the apoptosis proteins. Overall prediction accuracies of 99.7, 99.6 and 100% are achieved on the three benchmark datasets, respectively, by the most rigorous jackknife test, which is better than other state-of-the-art methods. Conclusion The experimental results indicate that our method can significantly improve the prediction accuracy of subcellular localization of apoptosis proteins, and that it is accurate enough to become a promising tool for further proteomics studies. The source code and all datasets are available at https://github.com/QUST-BSBRC/PsePSSM-DCCA-LFDA/.


Background
Proteins keep the cell system operating in a highly ordered manner [1]. At the cellular level, proteins work only in specific locations. To fulfil its function, a protein needs the specific chemical environment and set of interaction partners that its subcellular location provides [2]. Apoptosis is the physiological death of cells and is closely related to intracellular control [3]. Cancer and autoimmune disease can occur when apoptosis is blocked, whereas ischemic damage or neurodegenerative disease can occur when unwanted apoptosis takes place [4]. Studying the proteins involved in the apoptotic process can help us understand the pathogenesis of these diseases and provide a variety of therapeutic targets. Information on the subcellular localization of apoptosis proteins is very valuable, because it helps us understand the function of apoptosis proteins and the mechanisms of cell apoptosis, and supports drug development [5]. Constructing a protein subcellular location prediction model with machine learning methods is therefore a valuable but challenging task.
Great advances have been made in apoptosis protein subcellular localization prediction. Zhou and Doctor [36] constructed a dataset of 98 apoptosis proteins and, using the amino acid composition and the covariance discriminant method, reached an overall prediction accuracy of 72.5% by the jackknife test. Huang et al. [37] obtained an accuracy of 77.6% by combining the protein instability index with the support vector machine. However, the prediction capacity of this method was unbalanced. In particular, for proteins of the other class (excluding cytoplasmic, membrane and mitochondrial proteins), the prediction accuracy did not exceed 50%. Bulashevska and Eils [38] used Bayesian classifiers based on a Markov chain model to construct an ensemble classifier, and further improved the jackknife prediction accuracy on the 98 apoptosis proteins. Zhang et al. [15] constructed a new apoptosis protein dataset of 225 proteins. They used the encoding approach based on grouped weight as the feature extraction method for protein sequences and the support vector machine as classifier (named EBGW_SVM). The overall prediction accuracy was 83.1% using the jackknife test. This feature extraction method takes into account the distribution of residues with the same characteristic, but ignores the physical and chemical properties of the protein sequence. Chen and Li [39] constructed a dataset containing 317 apoptosis protein sequences and obtained higher prediction accuracy under the jackknife test by combining the support vector machine with the increment of diversity (named ID_SVM). Similarly, Ding et al. [40] used the fuzzy K-nearest neighbor (FKNN) algorithm, and the overall prediction accuracy was 90.9% on the CL317 dataset. Qiu et al. [41] used the DWT_SVM method to obtain high prediction accuracies of 97.5, 87.6 and 88.8% for the CL317, ZW225 and ZD98 datasets, respectively, by the jackknife test.
The above methods ignore the biological information of the protein sequence, so prediction methods based on the homologous similarity of protein sequences and on protein functional domains have been proposed. Yu et al. [42] proposed a novel pseudo-amino acid model which extracted the sequence characteristics of proteins using amino acid substitution matrices and the auto covariance transformation, and used the support vector machine as classifier. The prediction accuracies obtained by the jackknife test were 90.0 and 87.1% on the CL317 and ZW225 datasets, respectively. Liu et al. [43] used tri-gram encoding based on PSSM as the feature extraction method, then used the SVM-RFE algorithm to reduce the feature vectors, and finally input the best feature vectors to the SVM classifier. The prediction accuracies were 95.9, 97.8 and 96.9% on the CL317, ZW225 and ZD98 datasets, respectively. Dai et al. [44] took into account the difference between the N-segment and C-segment of the protein in subcellular location prediction, proposed a model based on golden ratio segmentation, and achieved a better predictive effect. Xiang et al. [45] introduced evolutionary conservation information to represent protein sequences, and divided the position-specific scoring matrix (PSSM) into several blocks according to the golden section ratio. Using the SVM classifier, the overall accuracies on the ZD98 and CL317 datasets were 98.98 and 91.11%, respectively. Liang et al. [46] combined the Geary autocorrelation function and the detrended cross-correlation coefficient, both based on PSSM, to extract features from the protein sequences of the CL317, ZW225 and ZD98 datasets. Under the jackknife test, the overall prediction accuracies were 89.0, 84.4 and 91.8%, respectively.
With a single type of feature, it is difficult to achieve a major breakthrough in the prediction of subcellular localization. At present, researchers usually combine multiple feature extraction methods to obtain more comprehensive protein sequence information. However, the feature vectors obtained by fusing a variety of features are usually very high-dimensional. High-dimensional data contain a lot of redundant information, which may seriously affect the performance of the classifier. Dimensionality reduction methods can help eliminate redundant information and are widely used in data classification and pattern recognition. Many researchers have introduced dimensionality reduction methods into subcellular localization prediction, such as SVD (singular value decomposition) [47], backward feature selection [48], CFS (correlation-based feature selection) [49], forward selection [50], PSO (particle swarm optimization) [51], mLASSO (multi-label least absolute shrinkage and selection operator) [52], GA (genetic algorithm) [53] and so on.
In this paper, we present a new method for predicting the subcellular localization of apoptosis proteins, called PsePSSM-DCCA-LFDA. Firstly, sequence information is obtained from the apoptosis protein sequences by combining the PsePSSM algorithm and the DCCA coefficient. Then, the LFDA method is used to reduce the dimension and remove noise information from the original high-dimensional space. Finally, SVM is used as the classifier to predict protein subcellular localization. Using the jackknife test, the optimal parameters of the model are determined by comparing different ξ values, different S values, different dimensionality reduction methods and different dimensions, and the PsePSSM-DCCA-LFDA prediction model is established. Under the most rigorous jackknife test, the overall prediction accuracies are 99.7, 99.6 and 100% for the CL317, ZW225 and ZD98 datasets, respectively. The results show that the PsePSSM-DCCA-LFDA method achieves better prediction performance than other existing methods.

Pseudo-position specific scoring matrix (PsePSSM)
In order to obtain the evolutionary information of the protein sequences, the protein sequences of the CL317, ZW225 and ZD98 datasets are aligned against the non-redundant (NR) database (ftp://ftp.ncbi.nih.gov/blast/db/) using the PSI-BLAST program [54], which yields the position specific scoring matrix (PSSM) [55] of each protein sequence. The NR database contains 85,107,862 protein sequences. We run PSI-BLAST with three iterations and an E-value of 0.001. The BLOSUM62 matrix is used as the substitution matrix for generating the PSSM. For a protein sequence P, the PSSM can be expressed as in Eq. (1):
$$P_{PSSM} = \begin{bmatrix} E_{1,1} & E_{1,2} & \cdots & E_{1,20} \\ E_{2,1} & E_{2,2} & \cdots & E_{2,20} \\ \vdots & \vdots & & \vdots \\ E_{L,1} & E_{L,2} & \cdots & E_{L,20} \end{bmatrix} \tag{1}$$
where L is the total number of amino acids in the protein sequence and $E_{i,j}$ represents the evolutionary information of the amino acids in the protein sequence. The rows of the PSSM correspond to the amino acid positions in the protein sequence, and the columns correspond to the 20 amino acid types to which a residue may mutate. The PSSM values range from −9 to 11. Since the protein sequences in the CL317, ZW225 and ZD98 datasets differ in length, their PSSMs differ in dimension, which complicates the subsequent analysis. In this paper, the PsePSSM [56] algorithm is used to extract the features of the protein sequences, transforming the PSSMs of different protein sequences into vectors of uniform length.
First, the elements of the PSSM are normalized by Eq. (2), so that every value lies between 0 and 1.
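where x is the original PSSM value. Then, a protein sequence can be expressed using PsePSSM as follows:
$$P_{PsePSSM} = [\bar{P}_1, \bar{P}_2, \cdots, \bar{P}_{20}, \theta_1^{1}, \theta_2^{1}, \cdots, \theta_{20}^{1}, \cdots, \theta_1^{\xi}, \theta_2^{\xi}, \cdots, \theta_{20}^{\xi}]^T \tag{3}$$
where $\bar{P}_j = \frac{1}{L}\sum_{i=1}^{L} P_{i,j}$ $(j = 1, 2, \cdots, 20)$ is the average score of all amino acid residues in the protein P that are mutated to amino acid type j, and $\theta_j^{\xi} = \frac{1}{L-\xi}\sum_{i=1}^{L-\xi}\left(P_{i,j} - P_{(i+\xi),j}\right)^2$ $(j = 1, 2, \cdots, 20;\ \xi < L;\ \xi \neq 0)$ is the sequence-order information of the protein sequence. Here $P_{i,j}$ is the normalized PSSM value, j is the amino acid type and ξ is the contiguous distance.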
From the above, the PsePSSM algorithm generates a (20 + 20 × ξ)-dimensional feature vector for each protein sequence.
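As an illustration, the PsePSSM construction can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' MATLAB implementation: the normalization is assumed to be the logistic function, which maps any score into the interval from 0 to 1, and the function names `logistic` and `pse_pssm` are hypothetical.

```python
import math


def logistic(x):
    """Map a raw PSSM score into (0, 1); assumed form of the normalization."""
    return 1.0 / (1.0 + math.exp(-x))


def pse_pssm(pssm, xi):
    """Return the (20 + 20 * xi)-dimensional PsePSSM vector of an L x 20 PSSM.

    `pssm` is a list of L rows with one score per amino acid type.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if not 0 <= xi < length:
        raise ValueError("xi must satisfy 0 <= xi < L")
    norm = [[logistic(v) for v in row] for row in pssm]

    # First block: mean normalized score of each column.
    features = [sum(row[j] for row in norm) / length for j in range(n_cols)]

    # Following blocks: sequence-order terms for lags 1, 2, ..., xi.
    for lag in range(1, xi + 1):
        for j in range(n_cols):
            total = sum((norm[i][j] - norm[i + lag][j]) ** 2
                        for i in range(length - lag))
            features.append(total / (length - lag))
    return features
```

With ξ = 3 the returned vector has 20 + 20 × 3 = 80 entries, whatever the length of the sequence.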

Detrended cross-correlation analysis coefficient
From the evolutionary information of a protein sequence, we obtain the corresponding position specific scoring matrix (PSSM), as shown in Eq. (1). To extract more protein sequence information from the PSSM, we use the detrended cross-correlation analysis coefficient (DCCA coefficient) method [57][58][59]. The DCCA coefficient is based on detrended covariance: non-stationary signals are fitted linearly by least squares and the fitted trend is removed. The evolutionary information expressed in the form of the PSSM is used as the attribute, and each amino acid type is considered as one property, so that the PSSM is regarded as the time series of all properties. Since the PSSM of each protein sequence is of size L×20, we treat the 20 columns of the PSSM as 20 non-stationary time series [46,60].
After normalizing the PSSM using Eq. (2), consider any two different columns $\{m_i\}$ and $\{n_i\}$ of the PSSM $(i = 1, 2, \cdots, L)$, where L is the length of the protein sequence. First, Eq. (4) is used to calculate the new time series $M_k$ and $N_k$.
Then the time series $M_k$ and $N_k$ are divided into $L-S$ overlapping segments, each containing $S+1$ values, and a least-squares linear fit is applied to each segment to obtain the fitted values $\tilde{M}_{i,k}$ and $\tilde{N}_{i,k}$. Eq. (5) is then used to calculate the covariance of each segment.
In particular, when the two series are identical, Eq. (5) reduces to the detrended variance $f^2_{xx}(S, i)$ of segment i. Next, the covariance over all $L-S$ segments (the whole time series) is calculated using Eq. (6); when the two series are identical, this likewise reduces to the detrended variance $f^2_{xx}(S)$. Finally, the DCCA coefficient of two different time series $\{m_i\}$ and $\{n_i\}$ is calculated using Eq. (7).
As can be seen from Eq. (7), $\rho_{DCCA}$ depends on the length L of the protein sequence and on the length S+1 of each overlapping segment. Its value satisfies $-1 \le \rho_{DCCA} \le 1$, where 1 represents perfect cross-correlation, 0 indicates no cross-correlation, and −1 represents perfect anti-cross-correlation [61]. Since the 20 columns of the PSSM form 20 × 19/2 = 190 distinct pairs, the DCCA coefficient algorithm generates a 190-dimensional feature vector for each protein sequence.
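The procedure above can be sketched in Python as follows. This is a hedged illustration, not the authors' code. It assumes that the new time series of Eq. (4) are cumulative sums of the columns, that the segment covariance is the mean product of residuals, and that the coefficient is the detrended covariance divided by the square root of the product of the two detrended variances, as in the usual DCCA formulation. All function names are hypothetical, and degenerate pairs with zero detrended variance are mapped to 0 by assumption.

```python
import math
from itertools import accumulate, combinations


def detrend(segment):
    """Residuals of a least-squares straight-line fit to one segment."""
    n = len(segment)
    t_mean = (n - 1) / 2.0
    y_mean = sum(segment) / n
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    s_ty = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(segment))
    slope = s_ty / s_tt
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(segment)]


def segment_residuals(series, S):
    """Detrended residuals of the L - S overlapping segments of S + 1 points."""
    profile = list(accumulate(series))  # assumed profile: M_k = m_1 + ... + m_k
    return [detrend(profile[i:i + S + 1]) for i in range(len(profile) - S)]


def detrended_covariance(res_a, res_b):
    """Average over all segments of the mean product of residuals."""
    per_segment = [sum(a * b for a, b in zip(ra, rb)) / len(ra)
                   for ra, rb in zip(res_a, res_b)]
    return sum(per_segment) / len(per_segment)


def dcca_from_residuals(res_m, res_n):
    """DCCA coefficient from precomputed segment residuals."""
    f_mn = detrended_covariance(res_m, res_n)
    f_mm = detrended_covariance(res_m, res_m)
    f_nn = detrended_covariance(res_n, res_n)
    denom = math.sqrt(f_mm * f_nn)
    return f_mn / denom if denom > 0 else 0.0  # degenerate case: assumed 0


def dcca_coefficient(m, n, S):
    """DCCA coefficient of two equally long series for segment parameter S."""
    if len(m) != len(n) or not 2 <= S < len(m):
        raise ValueError("need equal lengths and 2 <= S < L")
    return dcca_from_residuals(segment_residuals(m, S), segment_residuals(n, S))


def dcca_features(pssm, S):
    """One coefficient per pair of PSSM columns (190 values for 20 columns)."""
    if not 2 <= S < len(pssm):
        raise ValueError("need 2 <= S < L")
    columns = [list(col) for col in zip(*pssm)]
    residuals = [segment_residuals(col, S) for col in columns]
    return [dcca_from_residuals(residuals[a], residuals[b])
            for a, b in combinations(range(len(columns)), 2)]
```

Because a straight line is removed from every segment, subtracting the column mean before forming the cumulative sums would only add a linear term to the profile and would therefore leave the coefficient unchanged.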

Local Fisher discriminant analysis
This paper uses a supervised dimensionality reduction method, local Fisher discriminant analysis (LFDA) [62]. LFDA takes the form of an embedding transformation that can be computed easily by solving a generalized eigenvalue problem. Let the protein data matrix be $X = [x_1, x_2, \cdots, x_n]$, $x_i \in R^d$, where n is the number of protein samples and d is the dimension of the extracted protein sequence features. Let $y_i \in \{1, 2, \cdots, c\}$ be the class label of $x_i$ and $n_\ell$ the number of samples in class $\ell$, so that $\sum_{\ell=1}^{c} n_\ell = n$. The local within-class scatter matrix $S^{(w)}$ and the local between-class scatter matrix $S^{(b)}$ are calculated using Eqs. (8) and (9).
It is worth noting that A is an affinity matrix, whose element $A_{i,j}$ is the affinity between $x_i$ and $x_j$. In this paper, we use the affinity matrix $A_{i,j} = \exp(-\|x_i - x_j\|^2 / (\sigma_i \sigma_j))$ defined by Zelnik-Manor and Perona [63]. Here $\sigma_i = \|x_i - x_i^{(K)}\|$ represents the local scaling of the data samples around $x_i$, where $x_i^{(K)}$ is the K-th nearest neighbor of $x_i$. The experiments in [63] showed that K = 7 gives good results for high-dimensional data, so K = 7 is used in this article.
The LFDA transformation matrix $T_{LFDA}$ is then solved for, and the matrix Z after dimensionality reduction is obtained from Eq. (11). In this way, we eliminate the redundant information contained in the high-dimensional data obtained by feature extraction from the original protein sequences. In other words, the feature matrix X, extracted from the apoptosis protein sequences by fusing the PsePSSM algorithm and the DCCA coefficient algorithm, is mapped by the transformation matrix $T_{LFDA}$ to the reduced matrix Z.
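For reference, the standard LFDA formulation from the literature is summarized below. It is given here as an assumption about the form of the scatter matrices and the transformation; the notation of the original equations may differ, and the weight matrices $W^{(w)}$ and $W^{(b)}$ and the reduced dimension r are symbols introduced only for this summary.

```latex
\begin{align}
S^{(w)} &= \frac{1}{2}\sum_{i,j=1}^{n} W^{(w)}_{i,j}\,(x_i - x_j)(x_i - x_j)^{T},
&
W^{(w)}_{i,j} &=
\begin{cases}
A_{i,j}/n_\ell & \text{if } y_i = y_j = \ell,\\
0 & \text{if } y_i \neq y_j,
\end{cases}
\\
S^{(b)} &= \frac{1}{2}\sum_{i,j=1}^{n} W^{(b)}_{i,j}\,(x_i - x_j)(x_i - x_j)^{T},
&
W^{(b)}_{i,j} &=
\begin{cases}
A_{i,j}\left(1/n - 1/n_\ell\right) & \text{if } y_i = y_j = \ell,\\
1/n & \text{if } y_i \neq y_j,
\end{cases}
\\
T_{LFDA} &= \arg\max_{T \in R^{d \times r}}
\operatorname{tr}\!\left[\left(T^{T} S^{(w)} T\right)^{-1} T^{T} S^{(b)} T\right],
&
Z &= T_{LFDA}^{T} X .
\end{align}
```

In this formulation, the columns of $T_{LFDA}$ are the generalized eigenvectors of $S^{(b)}\varphi = \lambda S^{(w)}\varphi$ associated with the r largest eigenvalues. Sample pairs from the same class that lie far apart receive a small affinity and therefore contribute little to either scatter matrix, which is what preserves the local structure within each class.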

Support vector machine
Support vector machine (SVM) is a supervised machine learning method based on statistical learning theory, proposed by Vapnik et al. [64]. Because of its excellent learning and generalization ability, especially its ability to deal with high-dimensional sparse vectors, it has become a hotspot in the field of data mining and machine learning. In recent years, SVM has also been widely used in bioinformatics. In proteomics research, it has been widely used to predict membrane protein types [65,66], G protein-coupled receptors [67,68], protein structure [69][70][71][72][73], protein-protein interactions [74][75][76], protein subcellular localization [77][78][79][80], protein post-translational modification sites [81][82][83][84] and other aspects of protein structure and function.
SVM solves a two-class classification problem. Let $D = \{(x_i, y_i) \mid i = 1, 2, \cdots, n\}$ be a training set, where $x_i \in R^d$ is the d-dimensional feature vector of sample i and $y_i \in \{+1, -1\}$ is the class label of sample i. SVM maps samples that are not linearly separable in the low-dimensional input space into a high-dimensional feature space in which they become linearly separable.
In this study, we choose the radial basis function (RBF) kernel to perform prediction, because it is the most widely used kernel function and performs well on nonlinear problems [17,18,[41][42][43][44][45][46]. It is defined as $K(x_i, y_j) = \exp(-\gamma \|x_i - y_j\|^2)$, where γ is the kernel width parameter, and $x_i$ and $y_j$ are the feature vectors of the i-th and j-th protein sequences, respectively. The regularization parameter C and the kernel parameter γ are optimized on the CL317 and ZW225 datasets by K-fold cross validation with a grid search strategy, so as to obtain the highest overall prediction accuracy, using the LIBSVM software [85], which can be freely downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. In this paper, C is only allowed to take values between $2^{-5}$ and $2^{15}$, and γ only between $2^{-15}$ and $2^{5}$.
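The kernel and the search grid can be sketched as follows. This is an illustrative sketch only: the text gives the ranges of C and γ but not the grid spacing, so the step of 2 in the exponent is an assumption, and `rbf_kernel` and `parameter_grid` are hypothetical helper names rather than LIBSVM functions.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def parameter_grid(c_exponents=range(-5, 16, 2), gamma_exponents=range(-15, 6, 2)):
    """Candidate (C, gamma) pairs, C in [2^-5, 2^15] and gamma in [2^-15, 2^5].

    The exponent step of 2 is assumed; only the ranges come from the text.
    """
    return [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in gamma_exponents]
```

Each candidate pair would then be scored by cross validation, and the pair with the highest overall accuracy retained.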
SVM was originally designed for two-class classification, but CL317, ZW225 and ZD98 are multi-class datasets. At present, three kinds of strategies can be used for multi-class classification: one-versus-one (OVO), one-versus-rest (OVR) [86] and directed acyclic graph SVM (DAGSVM) [87]. The LIBSVM software implements the one-versus-one (OVO) strategy. The OVO strategy builds a classifier between every pair of categories, so if k is the number of classes, then k(k − 1)/2 classifiers are constructed. During the testing phase, each test sample is submitted to all classifiers, k(k − 1)/2 classification results are obtained, and the final result is determined by voting: the category with the most votes is the final class. It is worth noting that when two categories receive the same number of votes, we simply choose the one that appears first in the vote as the final category.
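The voting step can be illustrated with a short Python sketch. The pairwise classifier is abstracted as a hypothetical callback `pairwise_decide(a, b, sample)` that returns whichever of the two classes wins, and "appearing first" is interpreted here as first in the order of the class list, which is an assumption.

```python
from itertools import combinations


def ovo_predict(classes, pairwise_decide, sample):
    """Predict a class by one-versus-one voting over all k(k - 1)/2 pairs."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(a, b, sample)] += 1
    # max() returns the first maximal element, so a tie is resolved in favour
    # of the class listed first.
    return max(classes, key=lambda c: votes[c])
```

For the four classes of ZW225, for example, the loop evaluates 4 × 3/2 = 6 pairwise classifiers per test sample.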

Performance evaluation and model building
In statistical prediction, four validation tests are often used to evaluate prediction performance: the self-consistency test, the independent dataset test, k-fold cross-validation and the jackknife test [78,80]. In this paper, the jackknife test [88,89] is used to examine the performance of the prediction model. The jackknife test requires every sample in the dataset to be tested. Specifically, each sample in turn is held out as an independent test sample while the remaining samples are used as the training set to build a prediction model, until all samples in the dataset have been tested.
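The jackknife procedure amounts to the following loop, sketched here in Python. The `train` and `predict` callables stand in for any classifier; the nearest-neighbour pair shown is only a placeholder for illustration and is not the SVM used in this paper. All names are hypothetical.

```python
def jackknife_predictions(features, labels, train, predict):
    """Hold out each sample once, train on the rest, predict the held-out one."""
    predictions = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        predictions.append(predict(model, features[i]))
    return predictions


def train_nearest_neighbour(train_x, train_y):
    """Placeholder 'training': simply memorize the labelled samples."""
    return list(zip(train_x, train_y))


def predict_nearest_neighbour(model, sample):
    """Return the label of the memorized sample closest to `sample`."""
    def sq_dist(x):
        return sum((a - b) ** 2 for a, b in zip(x, sample))
    return min(model, key=lambda pair: sq_dist(pair[0]))[1]
```

A dataset of n samples requires n rounds of training, which is why the jackknife test is costly but leaves no freedom in how the data are split.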
We use four standard performance measures to evaluate the model: sensitivity (Sens), specificity (Spec), Matthews correlation coefficient (MCC) and overall accuracy (OA). They are defined in terms of TP, TN, FP and FN, where TP is the number of correctly identified positives, TN is the number of correctly identified negatives, FP is the number of negatives identified as positives, and FN is the number of positives identified as negatives. In addition, receiver operating characteristic (ROC) curves are used to assess the generalization performance of the model. The AUC is the area under the ROC curve, which plots the TP rate against the FP rate, and is a quantitative indicator of the robustness of the model. Its value ranges from 0 to 1. For convenience, the method used to predict apoptosis protein subcellular localization is called PsePSSM-DCCA-LFDA in this paper. To provide an intuitive picture, the flowchart of the PsePSSM-DCCA-LFDA method is shown in Fig. 1. We have implemented it in MATLAB R2014a.
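As a hedged illustration, the four measures can be computed as below, using their conventional definitions: Sens = TP/(TP + FN), Spec = TN/(TN + FP), the usual MCC formula, and OA as the fraction of correctly predicted samples. For a multi-class dataset, the counts are taken for one class against all the others. The function names are hypothetical, and returning 0 when a denominator vanishes is a convention assumed here.

```python
import math


def class_counts(true_labels, predicted_labels, positive):
    """Return (TP, TN, FP, FN) with `positive` opposed to all other classes."""
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def sensitivity(tp, tn, fp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tp, tn, fp, fn):
    return tn / (tn + fp) if tn + fp else 0.0


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def overall_accuracy(true_labels, predicted_labels):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

Sens, Spec and MCC are reported per class, whereas OA summarizes the whole dataset in a single number.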
The PsePSSM-DCCA-LFDA prediction model is detailed below:
1) Input the CL317 and ZW225 datasets, respectively, which contain the apoptosis protein sequences and the class label of each protein;
2) Generate the (20 + 20 × ξ)-dimensional feature vector with the PsePSSM algorithm and the 190-dimensional feature vector with the DCCA coefficient. By combining the two methods, the two apoptosis protein datasets yield feature matrices X of size 317 × (190 + (20 + 20 × ξ)) and 225 × (190 + (20 + 20 × ξ)), respectively;
3) Use the LFDA method to solve for the transformation matrix $T_{LFDA}$ and obtain the matrix Z, in which the redundant information in the apoptosis protein sequence features has been removed;
4) Input the matrix Z obtained after dimensionality reduction into the SVM classifier, and perform the protein subcellular localization prediction by the jackknife test;
5) According to the prediction accuracy, select the optimal parameters of the model, including the values of ξ and S, the dimensionality reduction algorithm and the reduced dimension;
6) Calculate the Sens, Spec, MCC, OA and AUC values of the model, and evaluate its prediction performance;
7) Use the independent testing dataset ZD98 to test the PsePSSM-DCCA-LFDA prediction model.

Selection of optimal parameters ξ and S
In this study, features are extracted from the apoptosis protein sequences by fusing the PsePSSM algorithm and the DCCA coefficient algorithm. Both algorithms have parameters that control how effectively the feature information of the protein sequence is extracted, so choosing the best parameters of these two feature extraction algorithms is very important for constructing a protein subcellular localization prediction model. We use the CL317 and ZW225 datasets as the research objects and select the best parameters of the model according to the prediction accuracy under different parameter values. In the PsePSSM algorithm, the ξ value reflects the sequence-order information of the amino acid residues in the protein sequence. If ξ is set too large, the dimension of the feature vector is too high and carries more redundant information, which harms the prediction. If ξ is set too small, the feature vector contains very little sequence-order information, and the features of the protein sequences in the apoptosis protein datasets cannot be extracted comprehensively. To find the optimal ξ value, the apoptosis protein datasets CL317 and ZW225 are classified by SVM for each candidate ξ value. The SVM uses the radial basis function (RBF) kernel and the results are evaluated by the jackknife test. The prediction accuracy of each protein class and the overall prediction accuracy (OA) for the two datasets are shown in Tables 1 and 2. Table 1 shows that the OA of the CL317 dataset varies with the ξ value.
The highest prediction accuracy for mitochondrial proteins is 70.6%, reached when ξ = 3, which is 32.4 and 14.7% higher than when ξ is 0 and 1, respectively. The prediction accuracy for membrane proteins is 87.3% when ξ = 10. The OA of the CL317 dataset reaches 83.6 and 83.9% when ξ = 3 and ξ = 10, respectively, higher than for the other ξ values. Table 2 shows that the OA of the ZW225 dataset also varies with the ξ value. For cytoplasmic proteins, the highest prediction accuracy is 81.4%, obtained when ξ is 1, 3, 4 or 6. For membrane proteins, the highest prediction accuracy is 93.3%, obtained when ξ = 0, which is 7.9% higher than when ξ = 5. In terms of overall prediction accuracy, the OA of the ZW225 dataset reaches 77.3 and 77.8% when ξ is 3 and 6, respectively, which is 6.6 and 7.1% higher than when ξ = 0.
To select the optimal parameter of the PsePSSM algorithm in the subcellular prediction model of apoptosis proteins, the CL317 and ZW225 datasets are used as the training datasets. Fig. 2 shows how the OA changes with the ξ value on the CL317 and ZW225 datasets. It can be seen from Fig. 2 that the prediction accuracy of both datasets changes with ξ. In addition, the CL317 and ZW225 datasets reach their highest accuracy when ξ = 3 and ξ = 6, respectively. In order to unify the model parameters, ξ = 3 is chosen in the model. The PsePSSM algorithm therefore yields a feature vector of dimension 20 + 20 × ξ = 20 + 20 × 3 = 80 for each protein sequence.
In feature extraction with the DCCA coefficient, the choice of the S value has a crucial influence on the construction of the model. S determines the length of each overlapping segment in the detrended cross-correlation analysis. Because the shortest protein sequence in the benchmark datasets has length 50, the maximum value allowed for S is 49. To find the optimal S value in the model, S is set to values from 5 to 49 in turn. For each S value, the apoptosis protein datasets CL317 and ZW225 are classified by SVM with the RBF kernel, and the results are evaluated by the jackknife test. The prediction accuracy of each protein class and the overall prediction accuracy for the two datasets are shown in Tables 3 and 4. Table 3 shows that the OA of the CL317 dataset varies with the S value. The accuracy for cytoplasmic proteins reaches 97.3% when S = 30 and S = 35. The highest prediction accuracy for mitochondrial proteins is 92.7%, reached when S = 45 and S = 49, which is 12.7% higher than when S = 5. The accuracy for nuclear proteins is 67.3% when S = 49, which is 21.1% higher than when S = 5. The accuracy for secreted proteins is 88.2% when S = 25. In terms of overall prediction accuracy, the OA of the CL317 dataset is 85.8% when S = 35, which is 9.8% higher than when S = 5. Table 4 shows that the OA of the ZW225 dataset also varies with the S value. The accuracy for cytoplasmic proteins reaches 87.1% when S = 49. The highest prediction accuracy for membrane proteins is 91.0%, reached when S is 20, 25 or 40. The accuracy for nuclear proteins reaches 75.6% when S = 40 and S = 45. In terms of overall prediction accuracy, the OA of the ZW225 dataset is 82.7% when S = 40, which is higher than for the other S values.
In our current study, the two apoptosis protein datasets CL317 and ZW225 are selected as the training datasets. To determine the optimal parameter of the DCCA coefficient algorithm in the model, Fig. 3 shows how the OA on the CL317 and ZW225 datasets changes with S. It can be seen from Fig. 3 that the two datasets reach their highest overall prediction accuracy at different S values. The average overall prediction accuracy of the CL317 and ZW225 datasets is highest when S = 40, so S = 40 is chosen as the optimal parameter of the DCCA coefficient algorithm. With this setting, the DCCA coefficient method yields a 190-dimensional feature vector for each protein sequence.

Selection of dimensionality reduction method and optimal dimension
As the dimension of a dataset grows, classification becomes more difficult and can eventually suffer from the curse of dimensionality. High-dimensional data are therefore reduced in dimension first, and the reduced data are then input into the learning system. To achieve good protein subcellular localization prediction accuracy, the PsePSSM and DCCA coefficient features are first fused to represent the protein sequences. With the parameters selected above (ξ = 3 and S = 40), each protein sequence is represented by an 80 + 190 = 270-dimensional feature vector. Four dimensionality reduction methods, PCA, Laplacian Eigenmaps, AKPCA [92] and LFDA (local Fisher discriminant analysis), are then compared in terms of their effect on the overall prediction accuracy of protein subcellular localization. In this study, we use the SVM with the radial basis kernel function for classification, and the results are evaluated by the jackknife test. The overall prediction accuracies of subcellular localization on the two apoptosis protein datasets, obtained with different dimensionality reduction methods and different dimensions, are shown in Tables 5 and 6.
As can be seen from Table 5, for the CL317 dataset, the choice of dimensionality reduction method and of dimension has a significant effect on the accuracy of protein subcellular prediction. With Laplacian Eigenmaps and AKPCA, the CL317 dataset obtains its highest overall prediction accuracy, 84.9 and 89.6%, respectively, at dimension 50. With PCA, the highest overall prediction accuracy is 89.3%, obtained at dimension 40. With LFDA, the highest overall prediction accuracy is 99.7%, obtained at dimension 10. The best dimension for the CL317 dataset thus differs between dimensionality reduction methods. Comparing the overall prediction accuracy across methods and dimensions, LFDA with dimension 10 gives the highest overall prediction accuracy, 10.1% higher than AKPCA with dimension 50. This can be seen more intuitively in Fig. 4: for the CL317 dataset, the model achieves its highest overall prediction accuracy when LFDA is selected and the dimension is 10. As can be seen from Table 6, for the ZW225 dataset, the choice of dimensionality reduction method and of dimension also has a significant effect on the accuracy of protein subcellular prediction. With PCA, the highest overall prediction accuracy on the ZW225 dataset is 85.8%, obtained at dimension 30. With Laplacian Eigenmaps, the highest overall prediction accuracy is 82.2%, obtained at dimension 30 or 50.
With AKPCA, the highest overall prediction accuracy is 86.2%, obtained at dimension 40. With LFDA, the highest overall prediction accuracy is 99.6%, obtained at dimensions 10, 20, 30, 40, 50, 60, 70 and 80. This indicates that the choice of the optimal dimension is closely related to the dimensionality reduction method used. Comparing the overall prediction accuracy across methods and dimensions, LFDA with dimension 10 gives the highest overall prediction accuracy, 17.4% higher than Laplacian Eigenmaps with dimension 30. This can be seen more intuitively in Fig. 5: for the ZW225 dataset, the model achieves its highest overall prediction accuracy when LFDA is selected and the dimension is 10, 20, 30, 40, 50, 60, 70 or 80. Since the two apoptosis protein datasets CL317 and ZW225 are both used as training sets, and in order to unify the parameters of the model, the LFDA dimensionality reduction method is adopted in this paper with an optimal dimension of 10.

Effect of feature extraction algorithm on results
A feature extraction method converts the character representation of a protein sequence into a numerical representation, using the corresponding feature vector to represent the protein sequence information. The PsePSSM method captures homology and sequence-order information of the amino acids in the protein sequences. The DCCA coefficient method is an extension of DCCA and of DFA (detrended fluctuation analysis). Here, only the evolutionary information represented in the form of the PSSM is adopted as the property under consideration. Combining the PsePSSM algorithm and the DCCA coefficient method yields more protein sequence information, but the resulting high-dimensional features contain more redundant variables, which can degrade the model. The LFDA dimensionality reduction method uses the local within-class scatter matrix and the local between-class scatter matrix to remove redundant information, based on the feature information of the protein sequences in the dataset and the corresponding class labels. In this paper, the optimal feature extraction algorithm is selected by comparing the influence of different feature extraction methods on the prediction results. The prediction results for the two apoptosis protein datasets CL317 and ZW225 are shown in Tables 7 and 8. Furthermore, we use ROC curves to analyze the robustness of the model under different feature extraction algorithms. The ROC curve applies to positive versus negative (two-class) classification, whereas apoptosis protein subcellular localization prediction is a multi-class problem. We therefore first use the one-versus-rest (OVR) strategy to transform the multi-class problem into two-class problems: one class is selected as the positive samples and the other classes as the negative samples [69]. The true positive rates and false positive rates of these two-class problems are then averaged to give the final result [43].
Figures 6 and 7 show the ROC curves obtained with four different feature extraction methods for the CL317 and ZW225 datasets, respectively. Table 7 shows that the OA on the CL317 dataset differs between feature extraction algorithms. The OA of the PsePSSM algorithm reaches 83.6%, which is 1.9% lower than that of the DCCA coefficient algorithm. The OA of the PsePSSM-DCCA algorithm is 86.8%, which is 3.2 and 1.3% higher than that of the PsePSSM and DCCA coefficient algorithms, respectively. LFDA is then used to reduce the dimensionality of the fused PsePSSM-DCCA features. With LFDA, the accuracy of each class improves markedly and the OA on the CL317 dataset reaches 99.7%. For the PsePSSM algorithm, the accuracy of secreted proteins is 70.6%, which is lower than with the DCCA coefficient, PsePSSM-DCCA and PsePSSM-DCCA-LFDA algorithms. The accuracy of secreted proteins is 100% with the PsePSSM-DCCA-LFDA algorithm, which is 29.4% higher than with the PsePSSM method. Fig. 6 shows that PsePSSM-DCCA-LFDA has the largest area under the ROC curve, with an AUC of 0.9842. The AUC values of PsePSSM, DCCA coefficient and PsePSSM-DCCA are 0.9591, 0.9520 and 0.9587, respectively.
Table 8 shows that the OA and the accuracy of each class on the ZW225 dataset differ between feature extraction algorithms. The OA of the PsePSSM algorithm reaches 77.3%, which is 5.4, 7.6 and 22.3% lower than that of the DCCA coefficient, PsePSSM-DCCA and PsePSSM-DCCA-LFDA algorithms, respectively. The OA of the PsePSSM-DCCA algorithm is 84.9%, which is 7.6 and 2.2% higher than that of the PsePSSM and DCCA coefficient algorithms, respectively. When PsePSSM-DCCA is used for feature extraction and LFDA for dimensionality reduction, the prediction accuracy of all four kinds of proteins in the ZW225 dataset improves remarkably, and the OA of the model reaches 99.6%. Fig. 7 shows that PsePSSM-DCCA-LFDA has the largest area under the ROC curve, with an AUC of 0.9805. The AUC values of PsePSSM, DCCA coefficient and PsePSSM-DCCA are 0.9380, 0.9386 and 0.9464, respectively. Based on this comparison of the prediction results and robustness of the model on the CL317 and ZW225 datasets with four different feature extraction methods, we choose PsePSSM-DCCA-LFDA as the feature extraction method in this paper.

Performance of prediction model
In the PsePSSM-DCCA-LFDA prediction model, protein sequence information is extracted by fusing the PsePSSM and DCCA coefficient methods, the dimensionality of the fused features is reduced by LFDA, and the subcellular localization of the apoptosis proteins is then predicted by SVM. According to the above analysis, ξ = 3 is selected for PsePSSM and S = 40 is selected for the DCCA coefficient. LFDA is used to reduce the dimension of the dataset, with an optimal dimension of 10. The RBF is selected as the kernel function of the SVM. The most rigorous jackknife test is used to evaluate the model on the CL317 and ZW225 datasets; the main results are shown in Table 9.
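The jackknife (leave-one-out) test used here can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' implementation: `jackknife_accuracy` and `nearest_centroid` are hypothetical names, the feature vectors are toy values, and a nearest-centroid rule stands in for the RBF-kernel SVM so that the example runs with the standard library only.

```python
def jackknife_accuracy(features, labels, fit_predict):
    """Leave-one-out accuracy: each sample is predicted by a model
    trained on all the other samples."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


def nearest_centroid(train_x, train_y, query):
    """Stand-in classifier (NOT an SVM): predict the class whose mean
    feature vector is closest to the query in squared Euclidean distance."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1

    def dist2(label):
        return sum((s / counts[label] - q) ** 2
                   for s, q in zip(sums[label], query))

    return min(sums, key=dist2)


# Toy usage: two well-separated hypothetical classes.
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels = ["a", "a", "a", "b", "b", "b"]
oa = jackknife_accuracy(features, labels, nearest_centroid)
```

Because `fit_predict` receives only the training samples of each round, any step that uses class labels (for example a supervised dimensionality reduction) could be refit inside it, so that the held-out sample never influences the model that predicts it.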
As can be seen from Table 9, the OA on the CL317 dataset is 99.7% by the jackknife test. The sensitivity of each class is 100% except for cytoplasmic proteins, for which it is 99.1%. The specificity of each class is 100% except for mitochondrial proteins. The OA on the ZW225 dataset is 99.6% by the jackknife test. The sensitivity, specificity and MCC of mitochondrial and nuclear proteins are 100%, 100% and 1, respectively. The sensitivity of cytoplasmic proteins is 100%, and their specificity and MCC are 99.4% and 0.99, respectively.
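The per-class quantities quoted above can be computed from a multi-class confusion matrix by treating each class one-versus-rest. The sketch below assumes the usual definitions of sensitivity, specificity, MCC and OA, which are not restated in this section. The function names and the confusion matrix are illustrative only; the counts are toy values, not the values behind Table 9.

```python
import math


def per_class_metrics(confusion, k):
    """Sensitivity, specificity and MCC of class k (one-versus-rest).

    confusion[i][j] = number of class-i samples predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                 # class k predicted as another class
    fp = sum(row[k] for row in confusion) - tp  # other classes predicted as k
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


def overall_accuracy(confusion):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


# Toy confusion matrix for three hypothetical classes (made-up counts).
toy = [[9, 1, 0],
       [0, 10, 0],
       [0, 0, 5]]
sens0, spec0, mcc0 = per_class_metrics(toy, 0)
sens1, spec1, mcc1 = per_class_metrics(toy, 1)
oa = overall_accuracy(toy)
```

In this toy case a single misclassified sample lowers the sensitivity of its true class and the specificity of the class it was assigned to, while every other per-class value stays at 100%.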

Comparison with other methods
In this section, to demonstrate the effectiveness of the proposed method PsePSSM-DCCA-LFDA, we compare it with other recently reported prediction methods on the same apoptosis protein datasets. All methods are evaluated using the jackknife cross-validation test. Tables 10 and 11 detail the comparison of the proposed method with other prediction methods on the CL317 and ZW225 datasets, respectively.
As can be seen from Table 10, the OA on the CL317 dataset is 99.7% with PsePSSM-DCCA-LFDA, which is 2.2-17% higher than that of other prediction methods. The overall accuracy of our method is higher than that of ID [93], ID_SVM [39], DF_SVM [21], FKNN [40] and so on. The sensitivity for each protein class is also listed. For example, the sensitivity of mitochondrial, nuclear, secreted, endoplasmic and membrane proteins reached 100% with our method, while the corresponding values for ID [93] are 85.3, 82.7, 88.2, 83.0 and 81.8%, respectively. For cytoplasmic proteins, the sensitivity of our method is 99.1%, which is also the highest and is 12.7% higher than that of the Auto_Cova [42] method. For the CL317 dataset, our proposed method has thus achieved satisfactory prediction results.
As can be seen from Table 11, the OA on the ZW225 dataset is 99.6% with PsePSSM-DCCA-LFDA, which is approximately 16.5, 15.6, 13.8 and 12.5% higher than that of EBGW_SVM [15], DF_SVM [21], FKNN [40] and Auto_Cova [42], respectively. In particular, for the most difficult case, mitochondrial proteins, the prediction accuracy improves to 100% with our method, which is 40% higher than that of EBGW_SVM [15] and 36% higher than that of DF_SVM [21]. This indicates that the model performs very well in predicting mitochondrial proteins among apoptosis proteins. In general, for the ZW225 dataset, our proposed method has achieved satisfactory prediction results.
To further validate the practical prediction ability of the model, we test it on the independent testing dataset ZD98. As before, ξ = 3 is selected for PsePSSM and S = 40 is selected for the DCCA coefficient. LFDA is used to reduce the dimension of the dataset, with an optimal dimension of 10. The RBF is selected as the kernel function of the SVM. The ZD98 dataset is evaluated by the jackknife cross-validation method and the results are compared with those of other reported prediction methods. Table 12 shows the prediction results for subcellular localization on the ZD98 dataset.
As can be seen from Table 12, the OA on the ZD98 dataset obtained with PsePSSM-DCCA-LFDA is 3.1-11.2% higher than that of other methods: it is 9.2% higher than that of ID [93] and 11.2% higher than that of ID_SVM [39] and DWT_SVM [41]. In addition, the sensitivity of mitochondrial, cytoplasmic and membrane proteins reached 100% with our method, while the corresponding values for ID_SVM [39] are 84.6, 95.3 and 93.3%, respectively. For mitochondrial proteins in particular, the prediction accuracy of DWT_SVM [41] is 53.9%, which is 46.1% lower than that of our method. This shows that our method predicts mitochondrial proteins well. For the other proteins, the accuracy of the algorithm proposed in this paper is 100%, which is 41.7% higher than that of the ID_SVM method [39]. In conclusion, these results indicate that the prediction model we construct can significantly improve the prediction accuracy of protein subcellular localization and achieves satisfactory prediction results.

Discussion
In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. For PsePSSM, ξ = 3 is selected, and for the DCCA coefficient, S = 40 is selected. LFDA is used to reduce the dimension of the dataset, with an optimal dimension of 10. The RBF is selected as the kernel function of the SVM. By the most rigorous jackknife test, the overall prediction accuracies are 99.7, 99.6 and 100% for the CL317, ZW225 and ZD98 datasets, respectively, which is better than other state-of-the-art methods. The OA on the CL317 dataset is 99.7% with PsePSSM-DCCA-LFDA, which is 2.2-17% higher than that of other prediction methods. The OA on the ZW225 dataset is 99.6% with PsePSSM-DCCA-LFDA, which is 1.8-16.5% higher than that of other prediction methods. The OA on the ZD98 dataset with PsePSSM-DCCA-LFDA is 3.1-11.2% higher than that of other methods.
PsePSSM-DCCA-LFDA demonstrated good performance in predicting apoptosis protein subcellular localization, better than the state-of-the-art methods. This is mainly due to the following reasons: 1. Both the PsePSSM algorithm and the DCCA coefficient method extract feature information from the PSSM corresponding to the protein sequences. Although both algorithms mine the evolutionary information of protein sequences in order to obtain the best numerical representation of the sequences, they do so differently. PsePSSM feature extraction takes into account the sequence-order information of the protein sequence. The DCCA coefficient treats the columns of the PSSM as non-stationary time series, removes their local trends by least-squares fitting, and measures the cross-correlation between the detrended columns. 2. LFDA can effectively remove redundant information from the features of the protein sequences without losing important information in the apoptosis protein sequences. 3. The SVM classification algorithm can handle high-dimensional data, helps avoid over-fitting and effectively discards non-support vectors.
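To make the difference between the two feature extractors concrete, the sketch below computes both from a PSSM given as a list of rows (one row per residue, one column per amino acid type). It is a minimal illustration under stated assumptions, not the authors' code. The PsePSSM part assumes the commonly used definition (column means followed by mean squared differences at lags 1 to ξ, giving 20 + 20ξ values for a 20-column PSSM) and omits any PSSM normalization. The DCCA part assumes the standard detrended cross-correlation coefficient with overlapping boxes of s + 1 points and linear detrending. All function names are hypothetical.

```python
from itertools import accumulate


def pse_pssm(pssm, max_lag):
    """Column means plus lagged mean squared differences (sequence-order terms)."""
    length, width = len(pssm), len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    feats = [sum(row[j] for row in pssm) / length for j in range(width)]
    for lag in range(1, max_lag + 1):
        for j in range(width):
            feats.append(sum((pssm[i][j] - pssm[i + lag][j]) ** 2
                             for i in range(length - lag)) / (length - lag))
    return feats


def _detrended(window):
    """Residuals of a least-squares straight-line fit to the window."""
    n = len(window)
    t_mean = (n - 1) / 2.0
    v_mean = sum(window) / n
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(window)) / s_tt
    return [v - v_mean - slope * (t - t_mean) for t, v in enumerate(window)]


def _f2(profile_a, profile_b, s):
    """Detrended covariance averaged over all overlapping boxes of s + 1 points."""
    n_boxes = len(profile_a) - s
    total = 0.0
    for start in range(n_boxes):
        ra = _detrended(profile_a[start:start + s + 1])
        rb = _detrended(profile_b[start:start + s + 1])
        total += sum(a * b for a, b in zip(ra, rb)) / (s + 1)
    return total / n_boxes


def dcca_coefficient(x, y, s):
    """DCCA cross-correlation coefficient of two equally long series, in [-1, 1]."""
    if len(x) != len(y) or s < 2 or s >= len(x):
        raise ValueError("need len(x) == len(y) and 2 <= s < len(x)")
    px, py = list(accumulate(x)), list(accumulate(y))  # integrated profiles
    denom = (_f2(px, px, s) * _f2(py, py, s)) ** 0.5
    return _f2(px, py, s) / denom if denom else 0.0


# Toy usage: two short made-up series standing in for PSSM columns.
x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 7.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
rho = dcca_coefficient(x, y, 3)
```

If the coefficient is computed for every pair of the 20 PSSM columns, this yields 20 × 19 / 2 = 190 values per protein; whether the authors use exactly this pairing is an assumption here.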
Protein subcellular localization information can help explain disease mechanisms and provide a theoretical basis and possible solutions. Medical studies have found that abnormal subcellular localization of proteins occurs when cells are cytopathic. Further, abnormally localized proteins provide molecular markers for the early diagnosis of diseases and can become molecular targets for the design of new drugs, which helps achieve the goal of curing diseases.
Currently, our method is still trained on small datasets: CL317, ZW225 and ZD98 are widely used benchmark datasets, and it is difficult to collect large-scale experimentally verified data. In the next step, we will build a large-scale protein subcellular localization dataset for prediction research.

Conclusion
With the advent of the big data age, the gap between the number of proteins in public databases and their functional annotations is widening. A critical challenge of bioinformatics is to develop automated methods for quickly and accurately determining the structures and functions of proteins [94]. In this paper, a novel method for protein subcellular localization prediction is proposed. We use the LFDA dimensionality reduction method and the SVM algorithm to predict apoptosis protein subcellular localization. Firstly, we fuse the PsePSSM and DCCA coefficient methods to carry out feature extraction on the protein sequences. Then, the dimension of the extracted feature vectors is reduced using the LFDA method, and the subcellular localization of the apoptosis proteins is predicted by the SVM algorithm. By the jackknife test, the OA on the three benchmark datasets reaches 99.7, 99.6 and 100%, respectively. The results show that the PsePSSM-DCCA-LFDA method performs well in comparison with other methods that use the same benchmark datasets. Since a user-friendly and publicly accessible web-server is one of the important factors in building a practical predictive system [78,88], we will, for the convenience of researchers, develop a web-server or standalone version of the prediction method presented in this paper.