In silico prediction of the granzyme B degradome
© Wee et al; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
Granzyme B is a serine protease which cleaves at unique tetrapeptide sequences. It is involved in several signaling cross-talks with caspases and functions as a pivotal mediator in a broad range of cellular processes such as apoptosis and inflammation. The granzyme B degradome constitutes proteins from a myriad of functional classes with many more expected to be discovered. However, the experimental discovery and validation of bona fide granzyme B substrates require time consuming and laborious efforts. As such, computational methods for the prediction of substrates would be immensely helpful.
We have compiled a dataset of 580 experimentally verified granzyme B cleavage sites and found distinctive patterns of residue conservation and position-specific residue propensities which could be useful for in silico prediction using machine learning algorithms. We trained a series of support vector machine (SVM) classifiers employing Bayes Feature Extraction to predict cleavage sites using sequence windows of diverse lengths and compositions. On independent test sets, the SVM classifiers achieved accuracy scores of 71.00% to 86.50% and AROC scores of 0.78 to 0.94. We have applied our prediction method to the Chikungunya viral proteome and identified several regulatory domains of viral proteins as potential sites of granzyme B cleavage, suggesting direct antiviral activity of granzyme B during host-viral innate immune responses.
We have compiled a comprehensive dataset of granzyme B cleavage sites and developed an accurate SVM-based prediction method utilizing Bayes Feature Extraction to identify novel substrates of granzyme B in silico. The prediction server is available online, together with reference datasets and supplementary materials.
Proteolysis - the specific and limited cleavage of proteins by enzymes called proteases - represents an important mechanism for post-translational control in all living organisms. Granzymes (short for granule enzymes) belong to a unique class of serine proteases which are known to mediate critical roles in the innate immune response against virus-infected or tumor cells through the induction of apoptotic cell death. Consequently, the enzymes have been implicated in the pathogenesis of several chronic inflammatory and cardiovascular disorders. Granzymes are released into the cytoplasm of the target cells through endocytosis of cytolytic granules released by cytotoxic T cells or natural killer cells. Once released into the target cells, granzymes go on to cleave specific cellular proteins and activate multiple signaling pathways leading to apoptotic cell death. Of the five human subtypes discovered to date (granzymes A, B, H, K and M), granzyme B has been the best studied. Like caspases, granzyme B recognizes specific tetrapeptide sequence motifs (P4-P3-P2-P1) and cleaves proteins after an aspartate residue at P1 [3, 4]. Besides cleaving specific proteins regulating apoptotic cell death, granzyme B has been reported to cleave proteins across a wide spectrum of other functional classes, ranging from nuclear and cytoskeletal components to membrane receptors and viral proteins.
To date, more than 500 granzyme B substrates have been characterized and many more are expected to be identified. While systematic experimental discovery and validation of bona fide substrates are necessary for elucidating the granzyme B degradome, many of the processes are time consuming and laborious. For these reasons, computational prediction of substrates could be immensely helpful in generating initial hypotheses and experimental leads. While a wide range of computational methods have been applied for substrate prediction of related proteases such as caspases [6, 7], only a limited number are available for prediction of granzyme B substrates. PeptideCutter is a general protease cleavage prediction server which predicts potential granzyme B cleavage sites using preferential tetrapeptide cleavage (P4-P3-P2-P1) specificities derived from in vitro combinatorial library studies by Thornberry et al. Backes et al. developed the GraBCas software, which extended the use of the in vitro specificities by incorporating position-specific scoring matrices and accounting for conserved residues at P1' and P2' positions. More recently, Barkan et al. advanced the field by applying the support vector machine (SVM) method to a set of experimentally verified cleavage sites using both sequence and structural features.
In this paper, we have compiled a dataset of 580 experimentally verified granzyme B cleavage sites and found distinctive patterns of residue conservation and position-specific residue propensities which could be useful for in silico prediction using machine learning algorithms. We trained a series of SVM classifiers employing Bayes Feature Extraction to predict cleavage sites using sequence windows of diverse lengths and compositions. On independent test sets, the SVM classifiers achieved accuracy scores of 71.00% to 86.50% and AROC scores of 0.78 to 0.94. We applied our prediction method to the Chikungunya viral proteome and identified several regulatory domains of viral proteins as potential sites of granzyme B cleavage, suggesting direct antiviral activity of granzyme B during host-viral innate immune responses. A web server, together with reference datasets and supplementary materials, can be accessed at http://www.casbase.org/grasvm/index.html.
Results and discussion
Sequence analysis of granzyme B cleavage sites
Using peptide combinatorial libraries, Thornberry and co-workers had previously identified distinctive sequence specificities governing protein cleavage of both caspase and granzyme B substrates. In particular, specific tetrapeptide sequences upstream of the cleavage site (P4-P3-P2-P1) of granzyme B targets serve as recognition sites for protein cleavage. The tetrapeptide “IEPD” was identified as the optimal tetrapeptide cleavage sequence in vitro. However, emerging data on granzyme B substrates suggest that the in vivo cleavage specificities are far more diverse, with numerous substrates possessing cleavage specificities extending beyond the tetrapeptide sequence [5, 10].
We compiled a comprehensive dataset of 580 unique granzyme B cleavage sites extracted from experimentally verified substrates as reported in the literature. Data were extracted from the substrates list compiled in Barkan et al., as well as the proteomic studies by Van Damme et al. In addition to the P4P1 cleavage site sequences, segments of different lengths and compositions centered on the P1 position were selected. In all, eight groups of sequences were obtained - P2P2’, P4P1, P4P2’, P4P4’, P6P6’, P8P8’, P10P10’ and P14P10'. We further extracted an equal number of “non-cleavage” sites by randomly selecting non-annotated tetrapeptide sequences (and other corresponding sequence segments) on the substrates. On the P10P10’ dataset, we computed Px (the relative position-specific residue propensity) of each amino acid at the different residue positions along the 20-mer sequence. Px was computed as the ratio of the frequency of occurrence of a particular residue in the cleavage site sequences to the frequency of the same residue in the non-cleavage site sequences at the particular position.
Average Px of amino acids: Average Px of each amino acid was calculated by averaging the Px values of the particular amino acid across all residue positions within the 20-mer sequence window (P10P10’)
SVM prediction of granzyme B cleavage sites
To account for these unique signatures of residue conservation and position-specific propensities for in silico prediction, we developed SVM prediction models incorporating the Bayes Feature Extraction (BFE) approach as described in Shao et al. Vector representation using the BFE approach was shown to significantly improve performance in several bio-computational problems - such as the prediction of protein methylation sites, caspase cleavage and linear B-cell epitopes - over simple binary encoding schemes. In BFE, feature vectors are encoded in a bi-profile manner, comprising positive position-specific and negative position-specific profiles. These profiles were generated by accounting for the frequency of occurrence of each amino acid at each position of the sequences in the positives pool (cleavage site sequences) and negatives pool (non-cleavage site sequences) respectively. Here, we trained a series of SVM classifiers on sequence windows of diverse lengths and compositions (P2P2’, P4P1, P4P2’, P4P4’, P6P6’, P8P8’, P10P10’ and P14P10') using simple binary encoding and BFE schemes (details in Materials and Methods). Datasets were segmented into training and independent test sets comprising 480 positives/480 negatives and 100 positives/100 negatives respectively. Using the RBF kernel, 10-fold cross-validation was implemented to acquire the optimal set of C and γ parameter values. SVM classifiers were subsequently trained on the entire training set using the optimized parameters and evaluated on the independent test sets.
Results of SVM prediction using simple binary encoding
Results of SVM prediction using Bayes Feature Extraction
Next, we compared our prediction method with GraBCas and the SVM models developed by Barkan et al. As the GraBCas algorithm primarily focuses on the detection of specific tetrapeptide motifs, we applied the algorithm to our P4P1 independent test set, which contains only the tetrapeptide cleavage site sequences. Using the recommended cut-off score of 0.12, GraBCas predicted only 61 out of 100 cleavage sites correctly (Sn = 61%). On the same dataset, our P4P1-SVM and P4P1-Bayes classifiers respectively predicted 77 out of 100 (Sn = 77%) and 79 out of 100 (Sn = 79%) cleavage sites correctly. The weaker sensitivity observed for GraBCas could be due to its use of position-specific scoring matrices (PSSMs) derived from a small, out-dated set of in vitro cleavage specificities, and to its absolute requirement for an Asp residue at P1 of the cleavage sites. To further evaluate the performance of the PSSM-based algorithm in our context, we constructed PSSMs derived from our entire dataset of cleavage sites, and found that the AROC scores of the PSSM-based predictors were generally poorer than those of our SVM-based classifiers (data not shown). In Barkan et al., the best SVM classifier recorded a true positive rate (TPR) of 0.79 and false positive rate (FPR) of 0.21 at the critical point on the receiver operating characteristic (ROC) curve when tested on an independent test set. In our SVM method, several classifiers encoded using the BFE scheme registered better prediction performance when measured by the same metrics: P10P10’-Bayes with a TPR of 0.86 and FPR of 0.14, as well as P14P10’-Bayes, P8P8’-Bayes and P6P6’-Bayes with TPRs of 0.85 and FPRs of 0.15.
Prediction of granzyme B cleavage of CHIKV proteome
To investigate the applicability of our computational method, we applied the SVM classifiers to the proteome of the Chikungunya virus (CHIKV) and searched for hitherto undiscovered granzyme B cleavage sites. CHIKV is a member of the alphavirus family and is known to be transmitted to humans via the bite of virus-carrying Aedes mosquitoes. Acute infection with CHIKV results in symptoms such as abrupt fever, skin rash and arthralgia. As CHIKV epidemics have been re-emerging in recent times, there have been concerted efforts directed toward developing relevant vaccines and drug therapies. During viral infections, granzyme B has been reported to mediate downstream cleavage of critical host regulatory proteins, leading to the induction of apoptotic cell death, and hence disruption of viral propagation. Although granzyme B-induced apoptotic cell death has long been considered the de facto mechanism for killing virus-infected cells, emerging evidence suggests that the enzyme could exert direct antiviral activity through cleavage of the viral proteins. For these reasons, it is reasonable to ask whether the CHIKV proteome may be directly regulated by granzyme B activity in this manner, and whether cleavage of specific CHIKV proteins would potentiate the host innate immune responses against viral infectivity.
Prediction of granzyme B cleavage of CHIKV proteome
Biological activity and function | Predicted cleavage sites
Non-structural: mRNA capping | 9, 11, 58, 525
Non-structural: NTPase, helicase and protease activities | 116, 247, 343
Non-structural: ADP-ribose phosphatase activity | 181, 350, 363, 506
Non-structural: RNA polymerase activity | 219, 371, 476, 540
Structural: virus-host cell fusion
Structural: virus-host cell attachment
Structural: protease, viral nucleocapsid formation
Structural: membrane permeabilization, budding of viral particles
In this paper, we constructed a comprehensive database of experimentally verified granzyme B cleavage sites for analysis and development of prediction methods. We discovered that flanking sequences of cleavage sites possess distinctive residue composition and position-specific propensity patterns which could be helpful in discriminating the cleavage sites from non-cleavage sites in silico. We have rigorously tested SVM classifiers employing simple binary encoding and the Bayes Feature Extraction schemes to predict granzyme B cleavage sites. The best classifiers compared favorably with existing algorithms: on the P4P1 independent test set, the P4P1-Bayes classifier achieved a sensitivity of 79% against 61% for GraBCas, and the P10P10’-Bayes classifier reached a TPR of 0.86 and FPR of 0.14 at the critical point of the ROC curve, compared with the 0.79 and 0.21 reported by Barkan et al. We applied our prediction method to the Chikungunya viral proteome and identified several regulatory domains of viral proteins as potential targets of granzyme B cleavage, suggesting a direct antiviral function of granzyme B during host-viral innate immune responses. To complement experimental research, we have implemented our prediction method on a web server which is freely accessible at http://www.casbase.org/grasvm/index.html. In the immediate future, we will be exploring the influence of cleavage site secondary structures, solvent accessibilities and other physicochemical properties on protease-substrate cleavage specificities, as well as their potential for enhancing the performance of our SVM prediction models. Computational prediction of granzyme B substrates will complement on-going experimental efforts and refine our understanding of the biochemistry of this fascinating protease and its relatives.
Materials and methods
We extracted a pool of 779 unique, experimentally verified cleavage sites from the literature. Of these, 723 sequences were derived from proteomic experimental studies conducted by Van Damme et al., with the remaining 56 from systematic in vitro and in vivo experiments as compiled in Barkan et al. We further extracted sequence segments of different lengths flanking the P1 cleavage sites. In all, eight datasets were constructed: the tetrapeptide cleavage site sequences (referred to as the P4P1 dataset) and sequences containing residues extended up to P14 and P10' (P2P2’, P4P2’, P4P4’, P6P6’, P8P8’, P10P10’ and P14P10' datasets). These sequences were assigned as positive examples for analysis as well as for development of the SVM method. An equal number of “non-cleavage sites” or negative examples were obtained by randomly extracting P1 residues on the substrates. Sequence segments of the aforementioned lengths and compositions were obtained as detailed earlier. All datasets of positive and negative sequences (779/779) were subsequently subjected to homology filtering using the CD-HIT clustering algorithm, where sequences bearing more than 85% sequence identity with any other sequence in the dataset were eliminated. The final datasets comprised 580 positive and 580 negative sequences (the complete list of cleavage sites is available in Additional File 1). For analysis, all 580 positives and 580 negatives from the P10P10’ dataset were used. For SVM model development, datasets were partitioned into training and test sets consisting of 480 positives/480 negatives and 100 positives/100 negatives respectively.
The relative position-specific residue propensity Px was computed as the ratio of the frequency of occurrence of a particular amino acid in the cleavage sites pool to its frequency of occurrence in the non-cleavage sites pool at a specific position on the sequence. Using the P10P10’ dataset, Px scores were calculated for every amino acid at each of the twenty residue positions and visualized on heat maps. Additionally, we constructed a sequence logo representation of the positive sequences from the P10P10’ dataset using WebLogo.
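The window construction described above can be illustrated with a minimal sketch. It assumes a 0-based index for the P1 residue, and it draws negatives uniformly from all non-annotated positions that can hold a full window; the function names, the `WINDOWS` table and the sampling details are illustrative choices, not the code used in this study.

```python
import random

# Number of residues taken on the non-prime (P) and prime (P') side for each dataset.
WINDOWS = {
    "P2P2'": (2, 2), "P4P1": (4, 0), "P4P2'": (4, 2), "P4P4'": (4, 4),
    "P6P6'": (6, 6), "P8P8'": (8, 8), "P10P10'": (10, 10), "P14P10'": (14, 10),
}

def extract_window(sequence, p1_index, n_left, n_right):
    """Return residues P{n_left}..P1 followed by P1'..P{n_right}'.

    p1_index is the 0-based position of the P1 residue (assumption).
    Returns None if the window would run past either end of the protein.
    """
    start = p1_index - n_left + 1
    end = p1_index + n_right + 1
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]

def sample_negative_sites(sequence, cleavage_p1_indices, k, n_left, n_right, seed=0):
    """Randomly pick up to k non-annotated P1 positions that admit a full window."""
    rng = random.Random(seed)
    annotated = set(cleavage_p1_indices)
    candidates = [
        i for i in range(len(sequence))
        if i not in annotated
        and extract_window(sequence, i, n_left, n_right) is not None
    ]
    return sorted(rng.sample(candidates, min(k, len(candidates))))
```

For a hypothetical substrate fragment `"AAAIEPDSGGG"` with P1 at index 6, `extract_window(seq, 6, *WINDOWS["P4P1"])` yields the tetrapeptide `"IEPD"`.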
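As a rough illustration of the Px calculation, the sketch below computes per-position residue frequencies in the positive and negative pools and takes their ratio. The text does not say how a zero frequency in the negative pool was handled, so returning `None` for that case is an assumption, as are the function names.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_frequencies(windows):
    """Frequency of each amino acid at each position of equal-length windows."""
    length = len(windows[0])
    freqs = []
    for pos in range(length):
        counts = Counter(w[pos] for w in windows)
        freqs.append({aa: counts[aa] / len(windows) for aa in AMINO_ACIDS})
    return freqs

def propensity(positives, negatives):
    """Px per position: frequency in cleavage sites / frequency in non-cleavage sites.

    Entries are None where the residue never occurs at that position in the
    negative pool (assumption: the ratio is treated as undefined).
    """
    pos_f = position_frequencies(positives)
    neg_f = position_frequencies(negatives)
    return [
        {aa: (p[aa] / n[aa] if n[aa] > 0 else None) for aa in AMINO_ACIDS}
        for p, n in zip(pos_f, neg_f)
    ]

def average_px(px, amino_acid):
    """Mean Px of one amino acid over the positions where it is defined."""
    values = [row[amino_acid] for row in px if row[amino_acid] is not None]
    return sum(values) / len(values) if values else None
```

A Px above 1 indicates that the residue is enriched at that position in cleavage sites relative to non-cleavage sites; a value below 1 indicates depletion.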
SVM vector representation
To encapsulate sequence information for SVM training and testing, input vectors were constructed using simple binary or bi-profile Bayes Features encoding. For simple binary encoding, each amino acid is represented by a vector of 20 dimensions, comprising binary values of zeroes and ones. For example, alanine was represented as [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1] and cysteine as [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0]. Hence, in this case, a 20-mer sequence will be represented by a vector of 400 dimensions (20 x 20). A detailed description of bi-profile vector encoding using Bayes Features is available in Shao et al. In short, feature vectors contain information from both positive position-specific and negative position-specific profiles. These profiles were generated by accounting for the frequency of occurrence of each amino acid at each position of the sequences in the positives pool (cleavage site sequences) and negatives pool (non-cleavage site sequences) respectively. Therefore, a 20-mer sequence (from the P10P10’ dataset) would be represented by a feature vector of 40 dimensions (20 x 2), containing information on the residues in both positive (cleavage site sequences) and negative (non-cleavage site sequences) spaces. For all sequence representations, P1 residues were excluded from the feature vectors.
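The two encodings can be sketched as follows. The residue ordering is an assumption chosen so that alanine and cysteine reproduce the vectors quoted above; the ordering of the remaining residues, and the reading of the bi-profile vector as "frequency of the observed residue at each position in the positive pool, followed by the same in the negative pool", are interpretations of the description rather than a confirmed implementation. The sketch does not drop the P1 residue; a caller following the exclusion rule above would remove it from each window first.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering; only A and C are given in the text

def binary_encode(window):
    """One-hot encode each residue into 20 dimensions (A -> last slot, C -> second last)."""
    vector = []
    for aa in window:
        one_hot = [0] * 20
        one_hot[19 - ALPHABET.index(aa)] = 1
        vector.extend(one_hot)
    return vector

def position_profile(windows):
    """Per-position amino acid frequencies for a pool of equal-length windows."""
    length = len(windows[0])
    profile = []
    for pos in range(length):
        column = [w[pos] for w in windows]
        profile.append({aa: column.count(aa) / len(column) for aa in ALPHABET})
    return profile

def bfe_encode(window, pos_profile, neg_profile):
    """Bi-profile vector: positive-pool frequencies, then negative-pool frequencies."""
    positive_part = [pos_profile[i][aa] for i, aa in enumerate(window)]
    negative_part = [neg_profile[i][aa] for i, aa in enumerate(window)]
    return positive_part + negative_part
```

Under this reading an n-residue window gives 20n binary features but only 2n bi-profile features, which matches the 400 and 40 dimensions stated for a 20-mer.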
SVM model development
C is the regularization variable that directs the trade-off between margin and classification error. We used the radial basis function (RBF) kernel and performed grid-based optimization for γ, which controls the capacity of the RBF kernel, and C using 10-fold cross-validation. In 10-fold cross-validation, the training set was randomly partitioned into ten subsets; one subset was held out as the test set while the other nine were used for training the classifier. The trained classifier was then evaluated on the held-out subset. This procedure was repeated ten times, each time holding out a different subset, so that every subset was used for both training and testing. The optimized γ and C values were then used to train the SVM classifier on the entire training set for independent testing on an out-of-sample test set. Graphical plots of optimization results are provided in Additional File 2.
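The cross-validated grid search can be outlined in a few lines. This is a generic sketch, not the authors' pipeline: `fit` and `predict` are hypothetical callbacks standing in for whatever SVM implementation is used, the candidate grids for C and γ are supplied by the caller because the values searched are not stated here, and ties between parameter pairs are broken by taking the first pair encountered.

```python
import random
from itertools import product

def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly partition sample indices into k roughly equal subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_accuracy(X, y, fit, predict, params, k=10, seed=0):
    """Hold out each subset once, train on the rest, and pool the accuracy."""
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], **params)
        predictions = predict(model, [X[i] for i in held_out])
        correct += sum(p == y[i] for p, i in zip(predictions, held_out))
    return correct / len(X)

def grid_search(X, y, fit, predict, c_values, gamma_values, k=10):
    """Return (best_accuracy, C, gamma) over the Cartesian grid of candidates."""
    best = None
    for C, gamma in product(c_values, gamma_values):
        score = cross_validated_accuracy(
            X, y, fit, predict, {"C": C, "gamma": gamma}, k
        )
        if best is None or score > best[0]:
            best = (score, C, gamma)
    return best
```

The selected pair would then be passed once more to `fit` on the full training set before scoring the independent test set.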
Evaluation of model performance
A set of statistical variables was established to evaluate the performance of the SVM classifier for the prediction of granzyme B cleavage sites:
(i) True Positives (TP), for the number of correctly classified cleavage sites.
(ii) False Positives (FP), for the number of incorrectly classified non-cleavage sites.
(iii) True Negatives (TN), for the number of correctly classified non-cleavage sites.
(iv) False Negatives (FN), for the number of incorrectly classified cleavage sites.
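The sensitivity (Sn) and accuracy values reported in the Results follow from these four counts. The sketch below uses the standard textbook definitions of sensitivity, specificity and accuracy, which are assumed here to match the ones used in this study; labels are coded 1 for cleavage sites and 0 for non-cleavage sites.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for labels coded 1 (cleavage) and 0 (non-cleavage)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Fraction of true cleavage sites that were predicted as cleavage sites."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-cleavage sites that were predicted as non-cleavage sites."""
    return tn / (tn + fp)

def accuracy(tp, fp, tn, fn):
    """Fraction of all sites that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)
```

For instance, recovering 61 of 100 cleavage sites gives `sensitivity(61, 39) == 0.61`, which is how the GraBCas result is expressed in the comparison above.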
In addition, we plotted the receiver operating characteristic curve (ROC) and computed the area under the curve (AROC) for threshold independent evaluation. To compare against the prediction model developed by Barkan et al., we further determined the critical points on the ROCs of our SVM classifiers, which are defined as the points where the ROC curves intersect the lines connecting coordinates (1, 0) and (0, 1) on the graphs.
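To make the critical point concrete: with FPR on the x-axis and TPR on the y-axis, the line joining (1, 0) and (0, 1) is TPR = 1 - FPR, so the critical point is where TPR + FPR = 1. The sketch below builds an ROC curve from decision scores, integrates it with the trapezoidal rule, and locates that crossing by linear interpolation. The interpolation step and the function names are illustrative assumptions; both classes must be present in the labels.

```python
def roc_curve(labels, scores):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve (AROC)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

def critical_point(points):
    """Point where the ROC curve meets the line TPR = 1 - FPR."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        g0 = x0 + y0 - 1
        g1 = x1 + y1 - 1
        if g0 <= 0 <= g1:
            if g1 == g0:
                return (x0, y0)
            t = -g0 / (g1 - g0)
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return None
```

At the critical point the two error rates are equal (FPR = 1 - TPR), which is why the values quoted in the comparison come in complementary pairs such as 0.86 and 0.14.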
This study was sponsored by a research grant from the Joint Council Office (JCO) of A*STAR Singapore.
This article has been published as part of BMC Genomics Volume 12 Supplement 3, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S3.
1. López-Otín C, Overall CM: Protease degradomics: a new challenge for proteomics. Nat Rev Mol Cell Biol. 2002, 3: 509-519. 10.1038/nrm858.
2. Chowdhury D, Lieberman J: Death by a thousand cuts: granzyme pathways of programmed cell death. Annu Rev Immunol. 2008, 26: 389-420. 10.1146/annurev.immunol.26.021607.090404.
3. Los M, Stroh C, Janicke RU, Engels IH, Schulze-Osthoff K: Caspases: more than just killers?. Trends Immunol. 2001, 22: 31-34. 10.1016/S1471-4906(00)01814-7.
4. Thornberry NA, Rano TA, Peterson EP, Rasper DM, Timkey T, Garcia-Calvo M, Houtzager VM, Nordstrom PA, Roy S, Vaillancourt JP, Chapman KT, Nicholson DW: A combinatorial approach defines specificities of members of the caspase family and granzyme B. Functional relationships established for key mediators of apoptosis. J Biol Chem. 1997, 272: 17907-11. 10.1074/jbc.272.29.17907.
5. Van Damme P, Maurer-Stroh S, Plasman K, Van Durme J, Colaert N, Timmerman E, De Bock PJ, Goethals M, Rousseau F, Schymkowitz J, Vandekerckhove J, Gevaert K: Analysis of protein processing by N-terminal proteomics reveals novel species specific substrate determinants of granzyme B orthologs. Mol Cell Proteomics. 2009, 8: 258-72.
6. Wee LJ, Tan TW, Ranganathan S: SVM-based prediction of caspase substrate cleavage sites. BMC Bioinformatics. 2006, 7 (Suppl 5): S14. 10.1186/1471-2105-7-S5-S14.
7. Piippo M, Lietzén N, Nevalainen OS, Salmi J, Nyman TA: Pripper: prediction of caspase cleavage sites from whole proteomes. BMC Bioinformatics. 2010, 11: 320. 10.1186/1471-2105-11-320.
8. Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A: Protein Identification and Analysis Tools on the ExPASy Server. The Proteomics Protocols Handbook. Edited by: Walker JM. 2005, Humana Press, 571-607.
9. Backes C, Kuentzer J, Lenhof HP, Comtesse N, Meese E: GraBCas: a bioinformatics tool for score-based prediction of Caspase- and Granzyme B-cleavage sites in protein sequences. Nucleic Acids Res. 2005, 33 (Web server issue): W208-W213.
10. Barkan DT, Hostetter DR, Mahrus S, Pieper U, Wells JA, Craik CS, Sali A: Prediction of protease substrates using sequence and structure features. Bioinformatics. 2010, 26: 1714-1722. 10.1093/bioinformatics/btq267.
11. Shao J, Xu D, Tsai SN, Wang Y, Ngai SM: Computational identification of protein methylation sites through bi-profile Bayes feature extraction. PLoS One. 2009, 4: e4920. 10.1371/journal.pone.0004920.
12. Song J, Tan H, Shen H, Mahmood K, Boyd SE, Webb GI, Akutsu T, Whisstock JC: Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinformatics. 2010, 26: 752-760.
13. Wee LJ, Simarmata D, Kam YW, Ng LF, Tong JC: SVM-based prediction of linear B-cell epitopes using Bayes feature extraction. BMC Genomics. 2010, 11 (Suppl 4): S21. 10.1186/1471-2164-11-S4-S21.
14. Pialoux G, Gauzère BA, Jauréguiberry S, Storbel M: Chikungunya, an epidemic arbovirosis. Lancet Infect Dis. 2007, 7: 319-327. 10.1016/S1473-3099(07)70107-X.
15. Romero V, Andrade F: Non-apoptotic functions of granzymes. Tissue Antigens. 2008, 71: 409-416. 10.1111/j.1399-0039.2008.01013.x.
16. Kam YW, Ong EK, Rénia L, Tong JC, Ng LFP: Immuno-biology of Chikungunya and implications for disease intervention. Microbes Infect. 2009, 11: 1186-1196. 10.1016/j.micinf.2009.09.003.
17. Huang Y, Niu B, Gao Y, Fu L, Li W: CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26: 680-682. 10.1093/bioinformatics/btq003.
18. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14: 1188-1190. 10.1101/gr.849004.
19. Chang CC, Lin CJ: LIBSVM: a library for support vector machines. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
20. Burges CJC: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery. 1998, 2: 121-167. 10.1023/A:1009715923555.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.