KinMutRF: a random forest classifier of sequence variants in the human protein kinase superfamily

Background The association between aberrant signal processing by protein kinases and human diseases such as cancer was established long time ago. However, understanding the link between sequence variants in the protein kinase superfamily and the mechanistic complex traits at the molecular level remains challenging: cells tolerate most genomic alterations and only a minor fraction disrupt molecular function sufficiently and drive disease. Results KinMutRF is a novel random-forest method to automatically identify pathogenic variants in human kinases. Twenty six decision trees implemented as a random forest ponder a battery of features that characterize the variants: a) at the gene level, including membership to a Kinbase group and Gene Ontology terms; b) at the PFAM domain level; and c) at the residue level, the types of amino acids involved, changes in biochemical properties, functional annotations from UniProt, Phospho.ELM and FireDB. KinMutRF identifies disease-associated variants satisfactorily (Acc: 0.88, Prec:0.82, Rec:0.75, F-score:0.78, MCC:0.68) when trained and cross-validated with the 3689 human kinase variants from UniProt that have been annotated as neutral or pathogenic. All unclassified variants were excluded from the training set. Furthermore, KinMutRF is discussed with respect to two independent kinase-specific sets of mutations no included in the training and testing, Kin-Driver (643 variants) and Pon-BTK (1495 variants). Moreover, we provide predictions for the 848 protein kinase variants in UniProt that remained unclassified. A public implementation of KinMutRF, including documentation and examples, is available online (http://kinmut2.bioinfo.cnio.es). The source code for local installation is released under a GPL version 3 license, and can be downloaded from https://github.com/Rbbt-Workflows/KinMut2. Conclusions KinMutRF is capable of classifying kinase variation with good performance. Predictions by KinMutRF compare favorably in a benchmark with other state-of-the-art methods (i.e. SIFT, Polyphen-2, MutationAssesor, MutationTaster, LRT, CADD, FATHMM, and VEST). Kinase-specific features rank as the most elucidatory in terms of information gain and are likely the improvement in prediction performance. This advocates for the development of family-specific classifiers able to exploit the discriminatory power of features unique to individual protein families. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2723-1) contains supplementary material, which is available to authorized users.


Background
Only a minor fraction of the large number of variants discovered with current high-throughput next generation sequencing (NGS) methodologies are causally implicated in disease onset [1][2][3][4][5][6]. The correct identification of the causative variants remains a challenging effort [7]. For a few examples there is sufficient experimental information associating variants and human maladies, and for an even smaller number of cases the underlying biochemical mechanism is known. However, for the vast majority of the sequence variants identified,~100,000 disease-associated variants, the functional information is missing [8]. The experimental characterization and functional annotation of those novel variants would require humongous resources. Nevertheless, this problem is very amenable to computational approaches [6]. Different methods to predict the probability of a variant being causaly implicated in a disease have been proposed during the last decade. A brief description of the most popular methods, along with relevant URLs and references, are listed in Additional file 1: Table  S1. A first group of methods applied deterministic rules to a reduced number of protein features to identify damaging mutations. For example, the widely cited methods SIFT [9] and MutationAssessor [10], MutPred [11], FATHMM [12], Panther [13] and PROVEAN [14] rely on different interpretations of signatures of evolutionary constraint to assess the pathogenicity of variants. A second group of methods (e.g. PMUT [15], SNAP [16], PolyPhen-2 [17], NetDiseaseSNP [18], LS-SNP [19], PhD-SNP [20], MutationTaster [21], VEST [22], SNPs&GO [23], SNPs3D [24], MuD [25], Can-Predict [26], CADD [27], PON-P2 [28] and nsSNPAnalyzer [29]) rely on advanced automatic machine learning approaches that integrate prior knowledge in the form of both sequence-based and structure-based features, under the assumption that pathogenic variants will disrupt normal protein function and structural stability. After a training process where the system is presented a set of previously characterized damaging and neutral variants, new variants can be classified based on the knowledge acquired. Each method implements a different machine learning approach: neural networks [15,16,18], Bayesian methods [17,21], support vector machines [19,20,23,24,27] or random forests [22,25,26,28,29]. Recently, some meta-predictor have been published, for instance, Meta-SNP [30] combines four of the most widely employed computational methods for prioritising missense single nucleotide variations, both Condel [31] and PON-P [32] integrate five classifiers, and PredictSNP [33] incorporates eight. Moreover, the SPRING [34] method is based on six functional effect scores calculated by existing methods (SIFT, Polyphen2, LRT, MutationTaster, GERP and PhyloP) and five association scores derived from a variety of genomic data sources (Gene Ontology, protein protein interactions, protein sequences, protein domain annotations and gene pathway annotations). Concomitantly, each predictor implements a distinctive set of features with a different scope and applicability. Some predictors are generally applicable to any protein, while a recent group of methods include properties that focus on a characteristic subset of variants (eg. Cancer variants predicted by CanPredict [26], Can-DrA [35] and CHASM [36]) or a protein family of interest under the assumption that family-specific features bring discriminative information that justifies the development of specialized methods. An interesting example of the latter are protein kinases [5,[37][38][39][40]. The protein kinase superfamily is very amenable to this approach. Protein kinases play a central role in the cell and consequently they have been studied in detail. As a consequence, a broad number of variants in members of the protein kinase superfamily have been reported in the literature in relation to disease [41], including some types of cancer [42]. In previous publications, we demonstrated the preferential distribution of both germline and somatic variants [43,44] around regions of functional and structural relevance and how this information can be used to develop a computational method [37] to predict the impact of variants on the function of protein kinases. The combination of the predictions from the classifier with annotations extracted from the literature and other sources, facilitates the mechanistical interpretation of the consequences of the variants [45].
Here, we introduce KinMutRF as a random forest-based classifier to predict the pathogenicity of novel variants. Although the core functionality builds up on our previous work [37], in this new implementation we redefine the sequence-derived features, using optimized ways to extract the signals encoded at the protein, domain and residue levels. To demonstrate the improved prediction capabilities of the KinMutRF, approach we benchmark our random forest classifier with other state-of-the-art prediction methods and we discuss the benefits and pitfalls of the development of a family-specific predictor in the light of our findings.

Training datasets
Variants affecting members of the protein kinase superfamily were downloaded from the UniProt/Swiss-Prot variant pages (release 2014_08 of 03-Sept-2014) [46], which compile variants in UniProtKB. The training datasets used in this work have been included with the Supplementary Materials.

Statistics to evaluate prediction performance
Accordig to best practices in the field [46][47][48], perfomances was assesed in terms of Accuracy, Precision, Recall, F-score and Mathew's correlation coefficient (MCC).
Where: TP: True positives, correctly predicted pathogenic variants; FP: False positives, neutral variants predicted as disease prone; TN: True negatives, correctly predicted neutral variants; and FN: False negatives, pathogenic variants predicted as neutral.

Description of the classification features
Variants were characterized with a battery of 25 features at the protein, domain and residue level (see details below). The distribution of variants in the training sets respect the classification features can be found in Fig. 1 (panels from c to l). Classification features were computed as follows:

Membership to kinase groups
We used the taxonomy proposed by Manning [49] implemented in UniProt to classify the protein kinases superfamily. This taxonomy considers three levels of abstraction: subfamilies, families and groups. The level of protein kinase groups are stablished according to sequence similarity, the presence of accessory domains, and by considering the different modes of regulation. For a detailed description of protein kinase groups in KinBase and the abbreviations used in this work, see reference [50] and the supplementary materials. A total of 15 protein kinase groups were considered in this analysis (Fig. 1, panels c and d) and the log odds ratio of their contribution to disease was calculated according to the following formula: kinase group ¼ log 2 disease var: in kinase group þ ξ ð Þ =disease var: neutral var: in kinase group þ ξ ð Þ =neutral var: Where "disease var." and "neutral var." refer to the total number of variants in UniProt classified as disase or neutral, respectively. The terms "disease var. in kinase group" and "neutral var. in kinase group" are the number of variants in a specific kinase group for each category. Note that a pseudo count of ξ = 10 -20 is considered to resolve kinase groups with no neutral variants.

Gene ontology terms (sumGOlor)
Gene Ontology (GO) annotations were used as a proxy for the functional relevance of protein kinases. Starting from the terms that annotate each kinase in UniProt the three subontologies (i. e. molecular function, biological process and cellular compartment) were followed to their roots to consider all parent nodes. The probabilities of observing each of these GO terms together with neutral and disease variants were compared with log-odds ratio (Fig. 1, panel l). Protein kinase are characterised by the sum of the individual contributions of their GO terms. Where "disease var." and "neutral var." refer to the total number of variants in UniProt classified as disase or neutral, respectively. The terms "disease var. annotated with GOi" and "neutral var. annotated with GOi" are the number of variants annoatated with a particular gene ontology term for each category, disease-associated or neutral. Note that a pseudo count of ξ = 10 -20 is considered to resolve cases where no neutral variants where annotated with GO i .

PFAM domains
For each of the 80 different domains defined by UniProt as found in the protein kinase superfamily, a log-odds ratio (details in Fig. 1, panels e and f ) of the frequency with which they harbour disease and neutral variants has been computed according to the following formula: Where "disease var." and "neutral var." refer to the total number of variants in UniProt classified as disase or neutral, respectively. The terms "disease var. in PFAMi" and "neutral var. in PFAMi" are the number of variants in a specific kinase PFAM domain for each category. Note that a pseudo count of ξ = 10 -20 is considered to resolve cases where no neutral variants where annotated with PFAMi.

Amino acid and their biochemical properties
The physic-chemical properties of the amino acids involved in variation often determine the propensity to disease. Our prediction features consider the native amino acid, the newly observed one, and the derived changes in some crucial biochemical properties. These include changes volume, Kyte-Doolittle hydrophobicity, C beta branching and formal charge represented as differences in the nominal values ( Fig. 1, panels g, j and k).

Residue conservation: SIFT
Variants are described with the precomputed SIFT [51] scores downloaded from dbNSFP [52] as a proxy for amino Conservation within a set of related sequences has traditionally been the strongest and most widely implemented features for the classification of variants.

Functional annotations in UniProt, FireDB and Phospho.ELM
The activity of protein kinases is affected by the alteration of functionally relevant residues involved, for example, in catalysis or phosphorilation. In the implementation of KinMutRF, residue annotations in UniProt [53] define functionally relevant amino acids. The residue annoations include the following categories: active sites (act_site), general (binding) or specialised binding (carbohyd, metal, np_bind), disulfid bonding, experimentally modified residues (mod_res), repeat regions (repeat), signal peptides (signal), transmembrane regions (transmem) and zinc fingers (zn_fing), among others broadly defined sites. An additional categories (any_uniprot) account for the residues being annotated with at least one of the previous categories. Similarly, phosphorilation sites from Phospho.ELM [54] and for the prediction of the catalytic and ligand-binding sites according to FireDB [55] are included (Fig. 1, panel h).

Construction of the training datasets
Variants affecting members of the protein kinase superfamily were extracted from the UniProt/Swiss-Prot variant pages [46], the compilation of variation available in UniProtKB. Every variant in this set is given a classification as neutral or pathogenic. In the few cases were the same variant was described by several instances, a single record was considered, selecting a pathogenic instance if ambiguous. Note that no additional reclassification attending to disease types or information from other sources was applied. After the filtering process, 1021 unique variants in 84 protein kinases form the disease dataset and 2668 variants in 450 proteins conform its neutral counterpart. In total, there were variants described and classified for 459 out of the 507 protein kinases described in UniProt, and 75 kinases span both categories of variants. The disease and neutral variant sets were used for training and evaluation of the machine learning classifier. The 848 variants affecting 299 kinases that are listed as unclassified in UniProt were left out from this analysis.
The training of the random forest-based classification kernel of KinMutRF followed a 10-fold cross-validation approach. As suggested by the best practices in the field [16,46], the 459 protein kinases for which classified variation data exists were distributed randomly in 10 different bins. All variants corresponding to an individual protein were assigned to the same bin. We incorporated this rule to avoid overestimating the performance of the classification; the contrary would constitute a circularity type 2 bias [47,56]. This bias might originate from similarities at the protein level (i.e. different variants from the same protein) between the training and evaluation sets. To ensure reproducibility of our results and to facilitate of other methods to be developed in the future, these training bins have been included with the Supplementary Materials (Additional file 2: Supplementary File S1). Then, each bin was iteratively used as evaluation set whereas the remaining nine were used as training instances. Results are accumulated until all bins had been used in the evaluation step. Following current standard practice in the field [47][48][49], we assessed the performance of the clasiffier with five different statistics: accuracy, precision, recall, f-score and Mathew's correlation coefficient (MCC) according to the formulas described in Methods.

Optimization of the prediction method
A machine learning classifier was trained to predict the pathogenicity of variants affecting the human kinome. In particular, a Random Forest kernel was selected after exploration of the many methods implemented in the Weka (v.3.6.11) package. To optimise the parametrization of the random forest classifier, we explored an increasing number of decision trees, ranging from 4 to 30 elements. Our results (Fig. 1, panels a and b) show that all performance statistics reach a steady plateau after an expected initial overhead and suggest that prediciton performance is not afffected by moderate alterations in the size of the forest. Subsequent analyses implement a configuration with 26 trees given the slightly better f-score in average in our preliminary analyses. i Distribution of SIFT scores; j Changes in volume caused by disease-associated and neutral variants; k Changes in hydrophobicity caused by disease-associated and neutral variants; l Accumulated Gene Ontology (GO) log odds-ratio. Note that, where relevant, disease-associated variants were represented in dark red whereas ochre was used for their neutral counterparts

Evaluation of classification performance in the training set
In a previous section we described the construction of the training datasets and how these were used in 10-fold crossvalidation experiment to assess the prediction capabilities of the KinMutRF classifier according to five common statistics. Accuracy accounts for the fraction of variants correctly predicted in function of the total number of variants. Due to the innate inbalance in the constitution of the datasets, with 1021 neutral variants and 2668 disease-associated variants respectively, a naïve classifier predicting every variant as the majority class would achieve a basal 72.32 % accuracy. Consequently, the evaluation of the classification should refer to the prediction of the positive class. In the case of a predictor of pathogenicity, this corresponds to the pathogenic mutations. Precision accounts for the proportion of correctly predicted disease-associated variants with respect to all the variants predicted as positive by the classifier. Recall, often referred as sensitivity, accounts for the proportion of correctly predicted disease-associated variants respect to all positive variants present in the dataset. These two statistics combine into a single one, the f-score, which is convenient for evaluation purposes. Finally, we considered the Mathew's correlation coefficient (MCC) accounts for the performance of both the disease and the neutral prediction. Despite accuracy, this statistic is robust even in cases with dispair class sizes. KinMutRF yields accurate results when both classes are considered (accuracy: 88.45 %, MCC: 0.68). Performance is also satisfactory when only the pathogenic set is considered. KinMutRF achieves a precision of 81.62 % and a recall of 75.22 %, that combined produce an f-score of 78.29 %. The implementation of KinMutRF overcomes our previous KinMut results implementing a support vector machine (SVM) kernel and a different set of prediction features [37,51] (Acc: 83.29 %, Prec: 60.03 %, Recall: 75.17 %, f-score: 66.7 % and MCC: 0.6). The improvement is particularly significant in terms of precision, the ability to predict correctly in the pathogenic variants, while a similar recall is maintained.

Most relevant features for classification
The contribution of individual features for the classification of the classes was assesed using the InfoGainAttri-buteEval module in Weka (v.3.6.11). Features are ranked according to the information gain resulting from the inclusion of individual features. The ranking of the classification features of KinMutRF is summarised in Table 1. One would expect that a family-specific predictor would benefit from the use of the information encoded by features that pertain only to the family of interest. Our ranking of features follows this intuition as the highest information gain (0.491) corresponds to the implementation of Gene Ontology terms that describe the function of each protein kinase and the fequency with which it has been reported in relation with disease and neutral variants (sumGOlor). This observation is coherent with Fig. 1 (panel l), where a clear separation between the accummulated GO log odds ratio of the two classes of variations (disease-associated and neutral). The evolutionary conservation of the residues, measured with SIFT, follows in the ranking. with an information gain of 0.179. In spite of not being a kinase-specific feature, this observation is coherent with the widespread use of SIFT as part of a full body of other classifiers and with the observations in Fig. 1 (panel i). Third and fourth position in this ranking are also occupied by kinase-specific features, namely the membership to a kinase group and the relevance of the kinase domains, produce information gains of 0.120 and 0.112 respectively. It is clear from the observaton of Fig. 1 (panels c, d, e and f) that there is a preferential distribution of disease-associated mutations respect to certain protein kinases and domains. One could argue that the inclusion of features that rely on existing knowledge (e.g. protein and domain specific features) might inherently bias the classification of variants. Albeit partially true from a benchmark perspective, the ability to derive correct predictions from related proteins is the ultimate goal of family-specific methods as the one under consideration here. A different reasoning is that genetic aberrations affecting uncharted regions of the variation-spacei.e. less characterised protein kinasesmight result difficult to characterise as predictions would be hindered by lack of data, or on a worst case scenario by the strong contribution of the few exisiting examples. We expect that the wealth of data coming from current sequencing efforts would quickly bridge this knowledge gap and that all elements of the human kinome would present a comparable amount of information. This is also true for the development of family-specific methods outside the protein kinase superfamily, currently limited by lack of sufficient variation information. The ranking of features is continued by other commonly used features. However, their contribution to the information gain is an order of magnitude smaller. These include recurrently implemented by methods that focus on alteration of protein stability (Additional file 1: Supplementary Table S1) such as the nature of the wildtype (0.044) and mutant (0.037) amino acids or the associated change in hydrophobicity (0.037). Last in the ranking appear features that assess the relevance of the residue in terms of catalysis and phosphorylation propensity. Their position in the ranking might be determined by their limited abundance. Nevertheless, these observations are coherent with previous observations that determined that disease-associated variants, independently of their somatic or germline character, did not allocated necessarily on catalytic sites but on the close proximity of these, under the hypothesis that the structural neighbourhood of the functional sites is also determinant for correct protein function [43,44,57].

Benchmark of the classifier respect to other methods
The capability of KinMutRF to correctly identify pathogenic variants was benchmarked to that of another eight state-ofthe-art approaches ( Table 2). Evaluation was studied according to the five performance measures described in Methods, KinMutRF yields very satisfactory predictions when the other methods are interrogated about the pathogenicity of the 3689 kinase variants for which UniProt provides a characterization. In fact, our methodology achieves the best accuracy (0.88) and precision (0.82) among the evaluated methods, indicative that the prediction of both neutral and pathogenic mutations is sufficiently reliable. This observation is supported by a Matthew's correlation coefficient (MCC) of 0.68, comparable to that achieved by the the best in this category, VEST [22]. Our f-score (0.78) is also comparable with the one achieved by VEST, that compensated the lack precison with increased recall. The difference in prediction performance might be bigger in practical terms, as the results of KinMutRF competitors correspond to an optimistic interpretation that might be boosted by a circularity type 1 bias [56]; the set used in the benchmark might include variants already presented to the classifiers during their own training phase [52]. This effect was taxatively avoided in the evaluation of KinMutRF.

Comparison to Kin-Driver manually curated kinase variants
To understand the prediction performance of KinMutRF beyond the training datasets, we evaluated the agreement with an independent source, Kin-Driver [58]. The resource present two quantitative adjantages: First, it includes variants that have not been presented to KinMutRF during its training phase. Second, variants are manually classified according to their consequence on protein activity into activating and deactivating, which allows further understanding of the strengths and weakenesses of our model. KinMutRF correctly predicted 65 out of the 159 (40.88 %) pathogenic variants included in Kin-Driver that were not included in the set used for training our predictor. The drop in performance might be explained by the nature of the consequence of the variants. The random forest correctly identified 21 out of 34 (61.76 %) loss-of-function variants whereas only 44 out of the 125 (35.20 %) gain-of-function variants were classified correctly. This analysis is coherent with previous observations [54,57] that advocate for the further development of methods to predict the consequences of activating variants as most of the methodologies focus on the disruption of protein function.  Predicting the pathogenicity of unclassfied variants, recorded in UniProtKB/Swiss-Prot In a previous section we discussed the preparation of a training set from the variation in UniProtKB/Swiss-Prot variant pages. In this process, we excluded 848 variants in 299 kinases for which a classification of "Disease" and/or "Polymorphism" was not available. We propose that KinMutRF can bridge this gap in knowledge and suggest whether these are most likely pathogenic or neutral. KinMutRF predicted 185 (21.81 %) of these variants as pathogenic (Fig. 2,  panel b). The full list of predictions, as well as the prediction features that originated them, can be found with the Supplementary Materials (Additional file 4: Supplementary File S2). One could argue that the prediction features used in this analysis rely excessively on existing knowledge. Should this be the case, predictions for all the variants in a particular kinase group, protein kinase or PFAM domain would follow the same character, being all either neutral or pathogenic. Most of the 53 protein kinases that harbored variants predicted as disease-associated also presented neutral variation (Fig. 2, panel a). The same is also true for kinase groups and PFAM domains (Fig. 2, panels c, d and e). These results support our selection of features, most importantly, the highly informative accumulative log odds ratio of Gene Ontology terms as a proxy for protein function (Fig. 2, panel f). In spite of being distributed satisfactorily, the results from KinMutRF highlight the functional relevance of previously reported domains such as the protein kinase domain or the PI3K/PI4K and certain taxonomical kinase groups characterised by them, namely Tyr, atypical PI3/PI4 kinase, CAMK and TKL.

Conclusions
Here we presented a novel method for prioritization of pathogenic variants in the human protein kinase superfamily. KinMutRF implements a random forest classifier that outperforms our previous implementation (KinMut) and other state-of-the-art methods with a similar purpose. Our choice of features and datasets makes the method especially relevant in the context of kinase variantion and their intrinsic role in cancer biology. The family-specific character of the Kin-MutRF classifier allowed us to introduce features that are unique to the protein kinase family. An analysis of the individual information gain identified these kinasespecific features among the most relevant for a correct classification. Namely, the functional characterization of the kinase according to Gene Ontology terms, the membership to a particular kinase group or the occurrence of the variants at relevant catalytic protein kinase domain arise as important features that are unique to the protein kinase superfamily. This is in full agreement with previous observations and advocates for the urgent development of family-specific classifiers where the abundance of variation data permits.

Availability of supporting data
KinMutRF is publicly implemented as a component of our pipeline for the identification, annotation and interpretation of the consequences of kinase variants, wKinMut-2 [61]. This resource is freely available at http://kinmut2.bioinfo.cnio.es. The source code, documentation and examples for KinMutRF can be downloaded for local installation from https://github.com/ Rbbt-Workflows under a GPV version 3 licence. We are also grateful to the two anonymous reviewers that revised this manuscript for their very relevant comments.

Consent for publication
Not applicable.

Ethics approval and consent to participate
Not applicable.

Availability of data and materials
Training datasets used for 10-fold cross-validation experiment provided as Additional file 2: Supplementary File S1. Predictions for the unclassified variants in Uniprot and the Bruton agammaglobulinemia tyrosine kinase domain are available as Additional file 2: Supplementary Files S1 and Additional file 4: Supplementary Files S2 respectively. The source code of KinMutRF is released under a GPL version 3 license, and can be downloaded from https:// github.com/Rbbt-Workflows/KinMut2 whereas a web implementation of KinMutRF is freely available at http://kinmut2.bioinfo.cnio.es. f Distribution of the accummulated Gene Ontology log odds-ratios (sumGOlor) for neutral and disease-associated variants