Proposing a highly accurate protein structural class predictor using segmentation-based features
© Dehzangi et al.; licensee BioMed Central Ltd. 2014
Published: 24 January 2014
Prediction of the structural classes of proteins can provide important information about their functionalities as well as their major tertiary structures. It is also considered an important step towards solving the protein structure prediction problem. Despite all the efforts that have been made so far, finding a fast and accurate computational approach to protein structural class prediction remains a challenging problem in bioinformatics and computational biology.
In this study we propose segmented distribution and segmented auto covariance feature extraction methods to capture local and global discriminatory information from the evolutionary profiles and predicted secondary structure of proteins. By applying SVM to our extracted features, for the first time we enhance the protein structural class prediction accuracy to over 90% and 85% for two popular low-homology benchmarks that have been widely used in the literature. We report 92.2% and 86.3% prediction accuracies for the 25PDB and 1189 benchmarks, which are respectively up to 7.9% and 2.8% better than previously reported results for these two benchmarks.
By proposing segmented distribution and segmented auto covariance feature extraction methods to capture local and global discriminatory information from evolutionary profiles and predicted secondary structure of the proteins, we are able to enhance the protein structural class prediction performance significantly.
The protein structural class prediction problem is defined as categorizing a given protein into one of four structural classes, namely all-α, all-β, α + β, and α/β [1]. Knowledge of the structural classes of proteins can also provide important information about their functionalities and overall folding types [2, 3]. Therefore, protein structural class prediction is considered an important step towards solving the protein structure prediction problem. Despite its importance, finding a fast and accurate computational approach that works when the sequence similarity rate is low remains an unsolved problem in bioinformatics and computational biology.
During the past two decades, a wide range of studies using machine learning-based methods have been conducted to solve this problem [4, 5]. These studies can be categorized into two groups. The first group consists of studies that have tried to address this problem by proposing novel classification techniques [6, 7]. They proposed a wide range of classification techniques based on different learning algorithms, such as Bayesian-based learners [8], meta-classifiers [9–13], Support Vector Machines (SVM) [14–17], Artificial Neural Networks (ANN) [18–20], and ensemble classifiers [21–25]. Among the classification techniques used to tackle this problem, the SVM classifier has attained the best results [5, 22, 26, 27]. The second group consists of studies that have mainly focused on proposing novel features that capture local and global discriminatory information, such as sequence-based information [10, 28–30], pseudo amino acid composition [31–33], physicochemical-based information [15, 22, 28, 34–36], and structural-based information [5, 33, 37–40]. The most important enhancements in protein structural class prediction accuracy have come from these feature extraction techniques rather than from exploring the impact of classification techniques. These recent enhancements were mainly due to features extracted from Position Specific Scoring Matrix (PSSM) profiles as well as structural information extracted from the predicted secondary structure of proteins.
The most significant enhancement obtained by relying solely on the PSSM for feature extraction was achieved by [16, 26, 40]. These studies used PSSM profiles to extract sequence order information based on the concepts of dipeptide composition, auto covariance, and amino acid composition. They treated the entire protein sequence as a single entity when extracting these features. Hence, the auto covariance and dipeptide composition calculated along the entire protein sequence were used as its local descriptor. Further enhancement in protein structural class prediction accuracy has been achieved by including structural information extracted from the secondary structure of the proteins predicted by PSIPRED. Adding these features to the features extracted from the PSSM significantly improved the prediction accuracy, especially when the sequence similarity rate was low [27, 37, 43]. As with the features extracted from the PSSM, the whole protein was treated as a single entity when extracting these features. Despite all the recent efforts to extract effective features that capture local and global discriminatory information from evolutionary and structural profiles, the protein structural class prediction accuracy has not improved significantly since the study of Mizianty and Kurgan in 2009 [5, 6].
In this study, we propose segmented auto covariance and segmented distribution feature extraction methods to capture more local sequence order information from evolutionary and structural profiles. We also employ the occurrence and composition feature groups to capture global sequence order information from the evolutionary and structural profiles. First, by relying solely on the PSSM profiles for feature extraction, we enhance the protein structural class prediction accuracy by over 15% and 5% for the 25PDB and 1189 benchmarks, respectively, compared to similar studies. These enhancements highlight the potential discriminatory information embedded in the PSSM that has not been adequately explored in the literature. Then, by applying our proposed feature extraction techniques to structural information derived from the secondary structure predicted by SPINE-X, we achieve up to 92.2% and 86.3% prediction accuracies for the 25PDB and 1189 benchmarks, respectively, which is 7.9% and 2.8% better than previously reported results found in the literature [5, 6, 27].
To evaluate the prediction performance of our proposed approaches, we employ two benchmarks, namely 25PDB and 1189. These two benchmarks have been widely used for the protein structural class prediction problem. The 25PDB benchmark consists of 1673 proteins with less than 25% sequence similarity on average (homology ranging between 22% and 45%). This benchmark was extracted from 25% PDBSELECT, which includes high-resolution non-homologous proteins from the Protein Data Bank (PDB). Therefore, it is considered an appropriate representative of benchmarks consisting of proteins in the twilight zone (proteins with sequence similarities between 20% and 45%) for the protein structural class prediction problem. Hence, in this study, the 25PDB benchmark is used as the main source to investigate the effectiveness of our proposed model.
The properties of the 1189 and 25PDB benchmarks.
In this study, we use PSSM profiles to extract evolutionary-based information and the secondary structure predicted by SPINE-X to extract structural-based information. The PSSM is calculated by applying PSI-BLAST, with its cut-off value (E) set to 0.001, to our explored benchmarks (using NCBI's non-redundant (NR) protein database). Given a protein sequence, the PSSM gives the substitution probability of the amino acid at each position along the sequence with all 20 amino acids. The PSSM consists of two L × 20 matrices (L is the length of the protein and the columns of the matrices represent the 20 amino acids). The first matrix is called PSSM_cons and gives the log-odds of the substitution probabilities. The second matrix is called PSSM_prob and gives the normalized substitution probability for each amino acid.
We also use the secondary structure predicted by SPINE-X, which was recently proposed and attained better results than PSIPRED in predicting protein secondary structure (especially for the coded area). Given a protein sequence, SPINE-X produces an L × 3 matrix (referred to as SPINE-M for the rest of this study) containing, for each position along the protein sequence, the normalized probability that the amino acid at that position contributes to each of the three secondary structure elements, namely α-helix, β-strand, and coil. It also returns a transformed version of the protein sequence (also extracted from the SPINE-M) in which each amino acid is replaced with H (helix), E (strand), or C (coil) according to its tendency to take part in one of these secondary structure elements. We refer to this sequence as the structural consensus sequence. Because of its better performance, the secondary structure predicted by SPINE-X is expected to provide structural information for the protein structural class prediction problem that is at least as significant as that provided by PSIPRED.
where P_ij is the substitution probability of the amino acid at location i with the j-th amino acid in the PSSM_cons. In the second step, we replace the amino acid at the i-th location of the original protein sequence by the j-th amino acid to form the consensus sequence. Note that the PSSM_cons (normalized using the min-max method) is used in this study for feature extraction, as in the literature [26, 27].
After calculating the evolutionary consensus sequence, we count the occurrence of each amino acid (for all 20 amino acids) along this sequence and produce the corresponding feature group (AAO). Similarly, we count the occurrence of each secondary structure element (for all three elements) in the structural consensus sequence and produce the corresponding feature group (SSEO). The occurrence feature group is used in this study as the global descriptor of the proteins instead of the amino acid composition (occurrence of amino acids divided by the length of the protein sequence), since it retains the length information that is disregarded in the composition feature group.
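The consensus and occurrence steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that the first step selects, for each row, the column with the largest value, and the helper names and the column order of `AMINO_ACIDS` are assumptions introduced here.

```python
from collections import Counter

# Assumed column order of a PSI-BLAST profile; adjust to the actual file.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
STRUCTURE_ELEMENTS = "HEC"  # helix, strand, coil


def consensus_sequence(profile, alphabet):
    """For every row, emit the symbol whose column holds the largest score.

    Assumption: the consensus symbol is the arg-max of the row; ties go to
    the first column that reaches the maximum.
    """
    symbols = []
    for row in profile:
        best_column = max(range(len(row)), key=row.__getitem__)
        symbols.append(alphabet[best_column])
    return "".join(symbols)


def occurrence(sequence, alphabet):
    """Raw counts of each symbol (not divided by the sequence length)."""
    counts = Counter(sequence)
    return [counts.get(symbol, 0) for symbol in alphabet]


# Toy L x 3 profile standing in for a SPINE-M matrix (values are made up).
toy_profile = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
toy_consensus = consensus_sequence(toy_profile, STRUCTURE_ELEMENTS)
toy_sseo = occurrence(toy_consensus, STRUCTURE_ELEMENTS)
```

The same two functions apply to an L × 20 PSSM with `AMINO_ACIDS` as the alphabet to obtain an AAO-style vector. Because the counts are not divided by L, two proteins with the same composition but different lengths receive different feature vectors.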
where S_ij is the normalized probability of the occurrence of the j-th secondary structure element at location i of the protein sequence in the SPINE-M. It was shown that the semi-composition method provides more discriminatory information than the amino acid composition feature group extracted from the original protein sequence. This feature group also provides important global discriminatory information about the substitution probability of the amino acids as well as the normalized frequency of the secondary structure elements.
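A semi-composition feature group can be sketched as follows. The defining equation is not reproduced in this section, so the sketch assumes the usual form, in which the j-th feature is the sum of column j of the profile divided by the sequence length L; the function name is hypothetical.

```python
def semi_composition(profile):
    """Column-wise mean of an L x n profile (PSSM or SPINE-M).

    Assumption: semi-composition is (1 / L) * sum_i P_ij for each column j.
    """
    length = len(profile)
    n_columns = len(profile[0])
    return [sum(row[j] for row in profile) / length for j in range(n_columns)]


# Toy 2 x 3 profile (values are made up).
toy_semi_composition = semi_composition([[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]])
```

Unlike the occurrence counts, these values are length-normalized, so the two feature groups carry complementary global information.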
This method is proposed to add more local sequence order information about how the amino acids are distributed along the protein sequence, based on their substitution probabilities with each other (extracted from the PSSM) as well as their tendency to take part in one of the secondary structure elements (extracted from SPINE-M). We design the segmentation so that the segments of a protein sequence are of unequal lengths and each segment is represented by a distribution feature, computed as follows. First, for the PSSM, to extract the segmented distribution feature group (PSSM-SD), we compute the total sum of the substitution probabilities in the j-th column of the PSSM, $T_j = \sum_{i=1}^{L} P_{ij}$. Then, starting from the first row of the PSSM, we compute the partial sum of the substitution probabilities of amino acid j over the first i amino acids, $S_1 = \sum_{k=1}^{i} P_{kj}$. Using the distribution factor F_P (a parameter investigated in this study), we find the maximum value of the index i such that the partial sum S_1 is less than or equal to F_P% of the total sum T_j; we denote this index by I_1. Thus the first I_1 substitution probabilities contribute F_P% of the total sum T_j. We use I_1 to define the ending location of the first segment, while its beginning point is taken to be 1 (the first row of the PSSM). The distribution feature of this segment is given by I_1. In a similar manner, we find the number of leading amino acids of the protein sequence that contribute 2F_P%, 3F_P%, ..., 50% of T_j (that is, 50% of T_j counted from the first row of the PSSM). The resulting indices I_2, I_3, ..., I_{50/F_P} define the ending locations of segments 2, 3, ..., 50/F_P, respectively, while the beginning location of all these segments remains 1. Hence, the distribution features for these segments are I_2, I_3, ..., I_{50/F_P}.
Note that we have thus computed 50/F_P distribution features by processing the protein sequence from the first row of the PSSM in the downward direction. We repeat this process starting from the last row of the PSSM in the upward direction to get another set of 50/F_P features (covering the remaining 50% of T_j, counted from the end of the protein sequence, which corresponds to the last row of the PSSM). Thus, a total of 2 × (50/F_P) = 100/F_P distribution features are computed for each column of the PSSM.
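The forward and backward passes described above can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' code: it assumes non-negative (normalized) profile values so that partial sums never decrease, it takes the raw index as the feature, and the function names are hypothetical.

```python
def distribution_indices(column, factor):
    """Largest prefix lengths whose partial sums stay within
    factor%, 2*factor%, ..., 50% of the column total.

    Assumes non-negative values, so the first prefix that exceeds the
    threshold ends the search.
    """
    total = sum(column)
    features = []
    for step in range(1, 50 // factor + 1):
        threshold = total * step * factor / 100.0
        partial, index = 0.0, 0
        for value in column:
            if partial + value > threshold:
                break
            partial += value
            index += 1
        features.append(index)
    return features


def segmented_distribution(profile, factor=25):
    """100 / factor features per column: forward pass, then backward pass."""
    features = []
    for j in range(len(profile[0])):
        column = [row[j] for row in profile]
        features += distribution_indices(column, factor)
        features += distribution_indices(column[::-1], factor)
    return features


# Toy 6 x 2 profile (values are made up).
toy_profile = [[1, 4], [1, 0], [1, 0], [1, 0], [1, 2], [1, 2]]
toy_sd = segmented_distribution(toy_profile, factor=25)
```

With `factor=25` each column yields four features, and with 10 or 5 it yields 10 or 20. A column whose mass is concentrated near the start of the sequence produces small forward indices and large backward indices, which is the local information this feature group is meant to expose.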
Combining SPINE-seg and SPINE-AC, we build the SPINE-SAC feature group consisting of 3 × (2K_S + 2K_S + K_S) = 15K_S features in total (for each of the three columns, 4K_S features in SPINE-seg and K_S features in SPINE-AC).
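The 2K_S + 2K_S + K_S count per column can be illustrated with a short sketch. The auto covariance equation and the exact segment boundaries are not reproduced in this section, so the code below makes two assumptions: it uses the standard lag-k auto covariance (mean-centred products averaged over the available pairs), and it takes the segments to be two prefixes and two suffixes of a column whose lengths are supplied by the caller. All names are hypothetical.

```python
def auto_covariance(values, max_lag):
    """Lag-1..max_lag auto covariance of one profile column or segment."""
    n = len(values)
    if n == 0:
        return [0.0] * max_lag
    mean = sum(values) / n
    result = []
    for lag in range(1, max_lag + 1):
        if n <= lag:
            # Assumed convention: a segment shorter than the lag contributes 0.
            result.append(0.0)
            continue
        total = sum(
            (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
        )
        result.append(total / (n - lag))
    return result


def segmented_auto_covariance(column, prefix_lengths, suffix_lengths, max_lag):
    """Auto covariance of each prefix, each suffix, and the whole column."""
    features = []
    for length in prefix_lengths:
        features += auto_covariance(column[:length], max_lag)
    for length in suffix_lengths:
        features += auto_covariance(column[len(column) - length:], max_lag)
    features += auto_covariance(column, max_lag)
    return features


# Toy column with two prefix and two suffix segments (values are made up).
toy_column = [0.1, 0.4, 0.3, 0.9, 0.5, 0.2, 0.7, 0.6]
toy_sac = segmented_auto_covariance(toy_column, [2, 4], [2, 4], max_lag=4)
```

With two prefixes, two suffixes, and the whole column, each column produces 5 × K features, which matches the 15K_S total for the three SPINE-M columns.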
where γ is the kernel parameter, and x_i and x_j are input feature vectors. In this study, γ and the cost parameter C (also called the soft margin parameter) of the SVM classifier are optimized using the grid search algorithm implemented in the LIBSVM package. The grid search tries various pairs of γ and C values and selects the pair with the best classification accuracy (using 10-fold cross validation). The ranges searched for γ and C are the default ranges of the LIBSVM toolbox (from 2^-5 to 2^15 for C and from 2^-15 to 2^3 for γ). The search is simple because only two parameters (γ and C) need to be optimized. Despite its simplicity, it has been shown to be an effective method for optimizing these parameters.
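The kernel and the search grid can be sketched as follows. The kernel equation itself is not reproduced in this section; the sketch assumes it is the Gaussian radial basis function, which is the LIBSVM default and the usual kernel parameterized by γ. The exponent step of 2 is also an assumption, since the text gives only the endpoints of the ranges.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||x_i - x_j||^2) (assumed form)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def parameter_grid(c_exponents=range(-5, 16, 2), gamma_exponents=range(-15, 4, 2)):
    """All (C, gamma) pairs with C in 2^-5..2^15 and gamma in 2^-15..2^3."""
    return [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in gamma_exponents]


grid = parameter_grid()
```

Each pair in `grid` would be scored by 10-fold cross validation on the training benchmark, and the pair with the highest accuracy kept.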
We first investigate the effectiveness of our proposed feature extraction methods in capturing local and global discriminatory information from the PSSM. We compare their performance with similar studies that relied solely on the PSSM for feature extraction. In this step, we also explore the effective value of the distance factor (K_P) in the segmented auto covariance feature extraction method as well as the segmentation factor (F_P) in the segmented distribution method. To find the effective value for the segmented auto covariance method, we study K_P values between 1 and 10, as in earlier work. We also study three values of the segmentation factor F_P in the segmented distribution method (25, 10, and 5). In the second step, we conduct similar experiments using SPINE-X for feature extraction. We investigate the effectiveness of our proposed feature extraction methods on the SPINE-M as well as the effective values of K_S (between 1 and 10) and F_S (among the three values 25, 10, and 5) in the same manner. In the final step, we add the structural features extracted from the SPINE-M using our proposed methods to the features extracted from the PSSM and compare our results with the best results found in the literature for the protein structural class prediction problem [5, 6, 27].
To explore the impact of the distance factor on the segmented auto covariance method, 10-fold cross validation is adopted, as it has been widely used in similar studies [26, 45]. We also provide the performance results of k-fold cross validation as a function of k, for k = 2, 3, 4, ..., 10, in Additional File 1. In 10-fold cross validation, the benchmark is divided into ten non-overlapping subsets called folds. In each iteration, nine folds are combined for training and the remaining fold is used for testing. This process is repeated until each of the 10 folds has been used as the testing set. We also use Jackknife cross validation to report our overall prediction accuracy as well as the prediction accuracy achieved for each structural class individually, in order to compare them with previous studies. In this method, in each iteration, all but one sample are used for training while the remaining sample is used for testing. This process is repeated until every sample in the benchmark has been used as the testing sample. Jackknife is considered a computationally expensive approach to evaluation. Furthermore, its performance has been shown to be similar to that of 10-fold cross validation. Since it has been widely used to evaluate protein structural class prediction accuracy, it is also adopted in this study to enable us to directly compare our results with the state-of-the-art results found in the literature [5, 6, 26, 27]. We use the overall prediction accuracy (in percentage) as the main measurement so that our results can be compared directly with previously reported results. It is defined as Q = (C / N) × 100, where C is the number of correctly classified test proteins and N is the total number of proteins in the benchmark.
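The two evaluation protocols differ only in the number of folds, which the following sketch makes explicit. It is a simplified illustration: the folds are assigned by striding over the sample indices without shuffling or stratification, which is an assumption made for brevity, and the function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs with disjoint test folds."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    splits = []
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        splits.append((train, folds[held_out]))
    return splits


def jackknife_indices(n_samples):
    """Jackknife (leave-one-out) is k-fold cross validation with k = n_samples."""
    return k_fold_indices(n_samples, n_samples)


def overall_accuracy(true_labels, predicted_labels):
    """Percentage of samples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


ten_fold = k_fold_indices(1673, 10)  # 1673 is the size of the 25PDB benchmark
leave_one_out = jackknife_indices(4)
toy_accuracy = overall_accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"])
```

For a benchmark of 1673 proteins, 10-fold cross validation trains 10 models while Jackknife trains 1673, which is why the latter is described as computationally expensive.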
More information about these three measurements for the protein structural class prediction problem can be found in  and . We report the sensitivity, specificity, and MCC measures for all four structural classes for the best results obtained in this study.
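The three per-class measurements can be computed from one-versus-rest counts, as in the sketch below. The defining equations are not reproduced in this section, so the standard definitions of sensitivity, specificity, and the Matthews correlation coefficient (MCC) are assumed; the function name and the zero-denominator convention are choices made here.

```python
import math


def class_metrics(true_labels, predicted_labels, target):
    """Sensitivity (%), specificity (%), and MCC for one class versus the rest."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == target and predicted == target:
            tp += 1
        elif truth != target and predicted == target:
            fp += 1
        elif truth == target:
            fn += 1
        else:
            tn += 1
    sensitivity = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    specificity = 100.0 * tn / (tn + fp) if tn + fp else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, mcc


toy_metrics = class_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"], "a")
```

Calling the function once per structural class (all-α, all-β, α/β, α + β) gives the per-class values reported alongside the overall accuracy.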
Note that we optimized γ and C for K_P = 1 and F_P = 25 using grid search on the 1189 benchmark (to avoid over-tuning) and used the resulting values for the rest of this study (γ = 0.055 and C = 500). We determine the parameters used in this study, for feature extraction as well as for the employed classification technique, on the 1189 benchmark, while the 25PDB benchmark is not used at all for this purpose and is reserved to investigate the generality and effectiveness of our proposed model. However, our experiments indicate that there is no significant difference between the optimized parameters for the 25PDB and 1189 benchmarks for our extracted features.
As we can see in Figure 2 and Figure 3, our extracted feature vector significantly outperforms the previously reported PSSM-based results for all values of K_P (between 1 and 10). This shows the effectiveness of the proposed segmentation-based method in exploring the discriminatory information embedded in the PSSM, compared to treating the whole protein sequence as a single entity. It also shows that the segmented auto covariance method achieves high prediction accuracy even with very low values of K_P, since it is able to capture adequate local sequence order information (which also emphasizes the impact of the segmented distribution method). We report up to 89.6% prediction accuracy (using jackknife cross validation) by setting K_P to 4 (20 + 20 + 5 × K_P × 20 + 80 = 520 features in total for K_P = 4), which is 15.5% better than the 74.1% prediction accuracy achieved by reproducing the AAC_PSSM_AC experiment (using K_P = 9) for the 25PDB benchmark (Figure 2). Similarly, we achieve up to 79.7% prediction accuracy by setting K_P to 4, which is 5.1% better than the 74.6% prediction accuracy achieved by reproducing the AAC_PSSM_AC experiment (using K_P = 6) for the 1189 benchmark (Figure 3). The results do not differ significantly across the values of K_P between 1 and 10, which highlights that the gain comes from the segmentation technique rather than from the distance factor K_P. Since the best results for both the 25PDB and 1189 benchmarks are achieved with K_P = 4, this value is adopted as the distance factor for extracting segmented auto covariance features from the PSSM for the rest of this study.
We also repeat this experiment to explore the impact of the segmentation factor F_P in the segmented distribution feature extraction method. The prediction accuracies achieved by setting the segmentation factor to 10 and 5 are not better than those achieved by setting it to 25, and they even decrease as K_P increases. This highlights the sufficiency and effectiveness of adopting F_P = 25 as the segmentation factor compared to 10 and 5. In other words, using four segments provides adequate discriminatory information for this task more effectively than increasing the number of segments to 10 or 20.
The impact of the feature extraction groups proposed in this study (using the PSSM for feature extraction) on protein structural class prediction accuracy (in %).
Combination of features
PSSM-AAC + PSSM-SAC
PSSM-AAC + PSSM-SAC + PSSM-SD
PSSM-AAC + PSSM-SAC + PSSM-SD + AAO
In this step, we investigate the impact of our proposed feature extraction methods when SPINE-X is used for feature extraction. We build a feature vector based on the methods proposed in this study, relying solely on the SPINE-M for feature extraction. We extract the following feature groups: SSEO (occurrence of the secondary structure elements in the secondary structure predicted using SPINE-M (3 features)), SPINE-SSEC (semi-composition from SPINE-M (3 features)), SPINE-SAC (segmented auto covariance, where K_S is varied from 1 to 10 in 10 different experiments (K_S × 5 × 3 features)), and SPINE-SD (segmented distribution, where the segmentation factor is set to 25 (4 × 3 = 12 features)). The combination of these feature groups is referred to as SPINE-S (SSEO + SPINE-SSEC + SPINE-SD + SPINE-SAC = SPINE-S). The protein structural class prediction results in this subsection are obtained using the Jackknife cross validation method.
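The feature counts quoted for the PSSM-based and SPINE-M-based vectors follow one pattern, which the short check below reproduces. The function and parameter names are hypothetical; the arithmetic comes from the counts stated in the text (occurrence and semi-composition contribute one feature per column, segmented auto covariance 5 × K per column, and segmented distribution 100/F per column).

```python
def feature_count(n_columns, distance_factor, segmentation_factor):
    """Total size of occurrence + semi-composition + SAC + SD feature groups."""
    occurrence = n_columns
    semi_composition = n_columns
    segmented_auto_covariance = 5 * distance_factor * n_columns
    segmented_distribution = (100 // segmentation_factor) * n_columns
    return (
        occurrence
        + semi_composition
        + segmented_auto_covariance
        + segmented_distribution
    )


pssm_features = feature_count(n_columns=20, distance_factor=4, segmentation_factor=25)
spine_features = [feature_count(3, k_s, 25) for k_s in range(1, 11)]
```

For the PSSM with K_P = 4 and F_P = 25 this gives the 520 features quoted earlier, while the SPINE-S vector grows from 33 features at K_S = 1 to 168 at K_S = 10.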
The impact of the feature extraction groups proposed in this study (using the SPINE-M for feature extraction) on protein structural class prediction accuracy (in %).
Combination of features
SPINE-AAC + SPINE-SAC
SPINE-AAC + SPINE-SAC + SPINE-SD
SPINE-AAC + SPINE-SAC + SPINE-SD + SSEO
Comparison of the results reported for the 25PDB benchmark (in %)
Comparison of the results reported for the 1189 benchmark (in %)
Adding structural features to the evolutionary features extracted in our experiments improves the results by up to 2.4% and 6.6% over relying solely on evolutionary features for the 25PDB and 1189 benchmarks, respectively. This emphasizes the impact of the structural information extracted from SPINE-X on the protein structural class prediction problem.
The specificity (in percentage) and MCC measurements for the best results: (a) for the 25PDB benchmark; (b) for the 1189 benchmark
In this study we proposed novel segmented distribution and segmented auto covariance feature extraction methods to capture more local and global discriminatory information from the evolutionary profile and predicted secondary structure of proteins. We first extracted the corresponding features from the PSSM, in addition to the occurrence of the amino acids extracted from the evolutionary consensus sequence and the semi-composition extracted from the PSSM. Then, by applying SVM to the extracted features, we enhanced the protein structural class prediction accuracy for low-homology protein sequences (twilight zone) by up to 15.5% for the 25PDB benchmark and 5.1% for the 1189 benchmark compared to similar studies that relied solely on the PSSM for feature extraction. Our results support the idea that the sequence order information embedded in the PSSM has not been adequately explored in the literature.
Next, we added similar features extracted from the secondary structure predicted by SPINE-X (segmented distribution, segmented auto covariance of the normalized probability of secondary structure elements, occurrence of secondary structure elements extracted from the structural consensus sequence, and semi-composition of the secondary structure elements extracted from the SPINE-M) to the features previously extracted from the PSSM. By incorporating structural information, we achieved up to 92.2% and 86.3% prediction accuracy for the 25PDB and 1189 benchmarks, which is respectively up to 7.9% and 2.8% better than previously reported results in the literature for these two widely used benchmarks [5, 6, 27].
We are currently investigating the effectiveness of the techniques proposed in this study for protein fold recognition. We aim to develop a protein structural class and fold prediction server, which will be made publicly available in the near future. We also aim to apply state-of-the-art feature reduction techniques to our extracted features to investigate the possibility of further feature reduction for these tasks.
Publication of this article was funded by Griffith University and National ICT Australia (NICTA).
NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.
This article has been published as part of BMC Genomics Volume 15 Supplement 1, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S1.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.