Predicting protein function by machine learning on amino acid sequences – a critical evaluation
BMC Genomics volume 8, Article number: 78 (2007)
Predicting the function of newly discovered proteins by simply inspecting their amino acid sequence is one of the major challenges of post-genomic computational biology, especially when done without recourse to experimentation or homology information. Machine learning classifiers are able to discriminate between proteins belonging to different functional classes. Until now, however, it has been unclear if this ability would be transferable to proteins of unknown function, which may show distinct biases compared to experimentally more tractable proteins.
Here we show that proteins with known and unknown function do indeed differ significantly. We then show that proteins from different bacterial species also differ to an even larger and very surprising extent, but that functional classifiers nonetheless generalize successfully across species boundaries. We also show that in the case of highly specialized proteomes classifiers from a different, but more conventional, species may in fact outperform the endogenous species-specific classifier.
We conclude that there is a very good prospect of successfully predicting the function of yet uncharacterized proteins using machine learning classifiers trained on proteins of known function.
Genome sequencing projects continue to produce unprecedented amounts of novel protein sequence information, and large-scale experimental efforts are underway to determine the function of the newly discovered proteins [1–6]. For a majority of proteins it is already possible to predict their approximate function with reasonable accuracy based on their evolutionary relationship or sequence similarity to proteins with known functions [7–9]. For most recently sequenced bacterial genomes about three quarters of open reading frames can be assigned a possible function in this way. However, a significant number of predicted proteins in each newly sequenced genome have turned out to defy this approach. These proteins, which in extreme cases may constitute up to 50% of open reading frames, show no similarity to proteins of known function. This may be because experimental data are missing, because the proteins are evolving too rapidly, or because they are unique to a small clade of species.
It would be very useful if one could obtain at least a general idea of the function of such proteins based on their amino acid sequence alone. Of course this is an extremely challenging task, and one that will only be of limited usefulness without combining it with additional information (e.g. structure models, phylogenetic profiles, or genomic context), but nonetheless several techniques to address this issue have been proposed recently [10–16]. These publications show that using machine learning classifiers it is possible to predict the function of well-characterized proteins based on features of their amino acid sequence, without using homology information. However, it is unclear if and how well such classifiers would transfer to proteins of unknown function. There are many reasons to assume that these 'unknown' proteins are special and differ from well-characterized proteins in significant ways: they may be evolving at a faster pace, they may function in unconventional ways, or they may have unusual physico-chemical properties that have made them less accessible to experimentation. If 'unknown' proteins are not just a random subset of the proteome, but are biased in such a systematic fashion, classifiers trained and tested on proteins of known function may generalize poorly and will be unable to predict the function of the real proteins of interest.
A direct test of the predictive performance on proteins of unknown function is rarely possible, although a recent retrospective study made some first steps in that direction. Thus a critical systematic assessment of the general prospect of successful classifier transfer is of great interest.
Here we show that proteins of known and unknown function do indeed differ significantly. We go on to show that, surprisingly, proteins from different species do also differ, to an even larger extent. We then demonstrate that classifiers do nonetheless generalize across species boundaries and use this to provide the first critical estimate of predictive performance on proteins of unknown function.
Results and discussion
We based our analysis on the completed and annotated proteomes of seven bacterial pathogens which cause sexually transmitted diseases in humans (Table 1). These species cover a wide range of phylogenetic relationships, from closely related species (two mycoplasma species) to very divergent ones (Treponema, Chlamydia). On the other hand, they all share the same general ecological niche, thus minimizing confounding effects of divergent evolutionary adaptation.
Prediction of known protein functions
In this paper we are not interested in optimizing a method of predicting protein functions, but rather in evaluating an aspect of function prediction that has been somewhat neglected previously, namely whether classifiers trained on proteins of known functions can be expected to transfer successfully to proteins of unknown function. Even an optimal classifier would be useless if it could not be applied reliably to the real proteins of interest, i.e. those for which no function is known at present.
However, as a baseline for our study, we first showed that we are able to correctly predict the function of proteins with known function using a Support Vector Machine classifier based on features derived from their amino acid sequence alone (see Methods for details of feature definition and selection and the machine learning technique). Confirming previous results [10, 12, 16] we found that this is indeed possible, although with varying performance for each class and species (Figure 1). Only in three highly specialized bacterial species (Treponema and the two mycoplasmas) was overall performance hardly better than random, and we will show below how the results of the present work indicate a way to overcome this problem. The observed median AUC is 56% averaged across all species and functional classes, and is higher for some important functional classes such as intermediary metabolism, DNA metabolism, and transport and binding proteins. These results are comparable to previously reported accuracies [10, 12, 16]. The generally good performance on such small bacterial genomes is encouraging, especially as it does not rely on the use of posttranslational modification and localization predictions, which are very informative features for eukaryotic proteins [13–15].
Discrimination between 'known' and 'unknown' proteins
Previous studies in general stopped at this point and assumed that predictive performance would be maintained when the classifier was applied to proteins of unknown function. We wanted to know if that is a reasonable assumption. To determine the overall similarity of known and unknown proteins in the feature space used for function prediction, we trained another set of SVM classifiers to try to distinguish between these two sets of proteins. Not unexpectedly we found that they do indeed differ significantly (Figure 2). One possible reason lies in the types of proteins that make up the unknown set. A set of unknown proteins in a species typically contains:
• Unique proteins – Proteins without known homologs
• Hypothetical proteins – Proteins with homologs of unknown function. No experimental evidence exists for the function or existence of the protein product.
• Wrongly predicted proteins – Open reading frames that are not actually expressed (transcribed/translated), but are only the result of genome misannotation.
• Special proteins – Proteins that have a special feature (e.g. an unusual size or extreme amount of charged amino acids) making them different from known proteins, but which do have a biological function.
From the above list, one can see that the set of unknown proteins will contain some members that actually have a function and others that are probably genome annotation artifacts. The ability to distinguish between known and unknown proteins is most likely due to the difference between unusual unknown proteins (categories 3 and 4) and normal known proteins. It is expected that homology-less function prediction will be possible for the 'normal' unknown proteins (categories 1 and 2), while being much more difficult for the special proteins (category 4) and meaningless for wrongly predicted proteins (category 3). Therefore, it would be interesting to use the classification information to estimate the fraction of 'predictable' and 'unpredictable' proteins in the set of unknown proteins. An exact estimate is not possible, because there is no exact definition of a normal protein available, but we can use the performance of the SVM classifier to obtain a rough estimate. The median AUC for the discrimination between proteins of known and unknown function is 63%. If we assume that 'predictable' unknown proteins are indistinguishable from known proteins, we can calculate a lower-bound estimate of the fraction of unpredictable proteins of 26% (= (63% - 50%) / 50%).
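The arithmetic behind this lower bound can be written out as a short sketch. It assumes that the 50% in the denominator is the distance between a random classifier (AUC 50%) and a perfect one (AUC 100%); the function name is hypothetical and chosen only for illustration.

```python
def unpredictable_fraction_lower_bound(auc: float) -> float:
    """Rescale an AUC so that chance level (0.5) maps to 0 and a perfect score (1.0) maps to 1."""
    chance = 0.5  # AUC of a random classifier
    # Assumption: the denominator is the gap between perfect and random performance.
    return (auc - chance) / (1.0 - chance)


# Median AUC of the known-vs-unknown classifiers reported in the text.
print(round(unpredictable_fraction_lower_bound(0.63), 2))
```

Under this reading, an AUC of 50% would imply that no unknown proteins can be told apart from known ones, and an AUC of 100% would imply that all of them can.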
Discrimination between proteins from different bacteria
To determine if the effect of protein set dissimilarity will be as deleterious as one might fear, we reasoned that proteins from different species will also show some level of dissimilarity and, hence, one could use the performance of classifiers across species boundaries to estimate the transferability from known to unknown proteins. Using our taxonomically diverse sample of bacterial species we trained a new set of SVM classifiers to try to distinguish between each pair of species. To our great surprise we found that this task is far easier than function prediction or the discrimination of known and unknown proteins (Figure 3). The median AUC across all species pairs is an astonishing 85%. This means that given any randomly picked pair of proteins from species A and B, we will be able to assign them to their correct species of origin in 85% of cases. This finding is entirely unexpected, given that the bacteria in our dataset all share the same highly stable ecological niche, the human urogenital tract. While they naturally differ widely in their exact pathophysiology, they would still be expected to carry out the same general biological processes using very similar molecular machinery. The fact that the SVM classifiers are nonetheless able to find generally valid species-specific "sequence signatures" is of course of great biological interest.
One possible explanation for this high accuracy in discriminating proteins from different species lies in the varying levels of guanine-cytosine (GC) content in each species (Table 1). Variations in genomic GC content from 25% to 75% have been shown to be common in prokaryotes. Variations in GC content in coding sequences will be reflected in differences of amino acid composition, as GC-rich codons will be depleted in low-GC species and vice versa. Even if this variation is subtle, it will influence classifier performance.
Classifier transfer between species
How then does this high level of dissimilarity between species affect the performance of function prediction classifiers? Figure 4 shows that classifiers transfer across species boundaries with surprisingly little loss of accuracy. Classifiers that perform well on their species of origin do almost as well on each of the other species. High levels of protein set dissimilarity are apparently tolerated without decreasing performance. A case of special interest is represented by the mycoplasma genomes, where classifiers perform poorly if they are trained on the species itself, and functions are predicted with higher accuracy if the classifier comes from one of the non-mycoplasma species. Mycoplasma species are highly derived organisms with extremely reduced minimal genomes, and their proteomes may be specifically adapted; for example, the features used for their SVM classifiers differ entirely from those of the other species (Figure 5). Nevertheless, the paradoxical cross-species performance is still difficult to explain by this fact alone.
In conclusion, we find that proteins with known and unknown function differ significantly, but we also find that classifiers transfer very well between different bacterial species which differ even more. Viewed optimistically, this means that there is a distinct possibility that function prediction classifiers will generalize successfully to predict the function of proteins of unknown function. Figure 6 summarizes the results and can also be used to estimate the performance of classifiers for unknown proteins. In most cases, this performance will be almost as good as that on the known proteins. Our findings also indicate that, especially in the case of "unusual" proteomes, such as the mycoplasmal examples, it may be a promising strategy to train classifiers on related but more conventional species to achieve the highest predictive performance.
Protein dataset and annotation
Protein sequences for seven bacterial pathogens causing sexually transmitted diseases in humans (Table 1) were obtained from the Los Alamos National Laboratory Bioscience Division STD Sequence Databases. For each functionally characterized protein, its classification in one of 13 functional classes based on a modified version of the Riley scheme was obtained from the same source.
Definition of protein sequence features
For every protein we calculated the frequency and total number of each amino acid, as well as of certain sets of amino acids (e.g. hydrophobic, charged, polar). To encode distributional features we also determined the number and size of continuous stretches of each amino acid or amino acid set. We also subdivided every protein into four equally sized fragments and calculated the same feature values for each fragment and combination of fragments. In addition, we predicted the secondary structure using Prof, the position of putative transmembrane helices using TMHMM and of disordered regions using DisEMBL, and treated the obtained predictions in the same way as the amino acids. A small number of global features (e.g. isoelectric point and molecular weight) were also included. The total number of features extracted for every protein is 2579. The full feature set is described in Additional File 1.
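The composition and stretch features can be illustrated with a minimal sketch. It is not the original feature generator: the residue sets, the feature names and the functions `run_lengths`, `composition_features` and `fragment_features` are assumptions made for illustration, and the sketch covers only single fragments, not combinations of fragments or the predicted structural annotations.

```python
from itertools import groupby

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative residue sets (assumption); the sets actually used are listed in Additional File 1.
RESIDUE_SETS = {
    "hydrophobic": set("AVLIMFWC"),
    "charged": set("DEKR"),
    "polar": set("STNQYH"),
}


def run_lengths(seq, members):
    """Lengths of continuous stretches whose residues all belong to `members`."""
    lengths = []
    for is_member, group in groupby(seq, key=lambda residue: residue in members):
        if is_member:
            lengths.append(sum(1 for _ in group))
    return lengths


def composition_features(seq, prefix=""):
    """Count, frequency, number of stretches and longest stretch per amino acid or set."""
    groups = {aa: {aa} for aa in AMINO_ACIDS}
    groups.update(RESIDUE_SETS)
    features = {}
    for name, members in groups.items():
        count = sum(1 for residue in seq if residue in members)
        runs = run_lengths(seq, members)
        features[f"{prefix}{name}_count"] = count
        features[f"{prefix}{name}_freq"] = count / len(seq) if seq else 0.0
        features[f"{prefix}{name}_n_runs"] = len(runs)
        features[f"{prefix}{name}_max_run"] = max(runs, default=0)
    return features


def fragment_features(seq, n_fragments=4):
    """Features for the full sequence and for each of `n_fragments` roughly equal fragments."""
    features = composition_features(seq, prefix="full_")
    size = len(seq) / n_fragments
    for i in range(n_fragments):
        fragment = seq[round(i * size):round((i + 1) * size)]
        features.update(composition_features(fragment, prefix=f"frag{i + 1}_"))
    return features


example = fragment_features("MKKLLAAADDEE")
print(example["full_charged_count"], example["full_charged_max_run"])
```

Treating each single amino acid as a one-member set lets the same code path handle individual residues and residue classes.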
Standardization of feature values
Since the original features generated in this way are very heterogeneously scaled, linear normalization (standardization) was performed to rescale each feature by its mean and standard deviation. After standardization, each of the 2579 features has a mean of 0 and a standard deviation of 1.
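A minimal sketch of this standardization, assuming the usual z-score (subtract the feature mean, divide by the feature standard deviation). The choice of the population standard deviation and the handling of constant features are assumptions, and `standardize_columns` is a hypothetical name.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Rescale every feature column to mean 0 and standard deviation 1.

    `rows` is a list of equally long feature vectors, one per protein.
    """
    scaled_columns = []
    for column in zip(*rows):
        mu = mean(column)
        sigma = pstdev(column)  # population standard deviation (assumption)
        if sigma == 0:
            # A constant feature carries no information; map it to 0 (assumption).
            scaled_columns.append([0.0] * len(column))
        else:
            scaled_columns.append([(value - mu) / sigma for value in column])
    # Transpose back to one row per protein.
    return [list(row) for row in zip(*scaled_columns)]


scaled = standardize_columns([[1.0, 250.0], [2.0, 300.0], [3.0, 350.0]])
print(scaled)
```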
Homology-corrected generation of test and training sets
The entire dataset was subdivided randomly five times into test and training sets (size ratio 1:4). To prevent inflation of the prediction accuracies by predictions on homologous sequences in the test set, we applied a recursive Blast strategy to assign proteins that show significant sequence similarity to each other to the same set (either test or training). For this purpose every protein that was added to the test set was searched in three PSI-Blast iterations against the non-redundant database of protein sequences at NCBI using default settings. The obtained position-specific sequence profile was then run against the bacterial proteins and every protein generating a hit at E < 0.001 was also added to the test set, and the procedure repeated recursively until no new potential homologues were detected. Then the next randomly chosen protein would be added to the test set until the required test set size was exceeded.
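The recursive assignment can be sketched as follows. The PSI-Blast search is replaced here by a precomputed lookup table, `homologs`, which is a hypothetical stand-in mapping each protein to the proteins it hits at E < 0.001; the function name and its parameters are likewise assumptions for illustration.

```python
import random


def homology_aware_split(proteins, homologs, test_fraction=0.2, seed=0):
    """Split proteins so that detected homologues always end up in the same set.

    `homologs` maps a protein id to the ids of its significant hits
    (a stand-in for the PSI-Blast profile search described in the text).
    """
    rng = random.Random(seed)
    order = list(proteins)
    rng.shuffle(order)
    target = test_fraction * len(order)  # 1:4 test-to-training ratio gives 0.2

    test = set()
    for candidate in order:
        if len(test) >= target:
            break  # required test set size reached or exceeded
        if candidate in test:
            continue
        # Add the candidate, then keep adding hits of hits until nothing new appears.
        stack = [candidate]
        while stack:
            protein = stack.pop()
            if protein in test:
                continue
            test.add(protein)
            stack.extend(hit for hit in homologs.get(protein, ()) if hit not in test)

    train = [protein for protein in proteins if protein not in test]
    return train, sorted(test)


ids = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"]
hits = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"]}
train_ids, test_ids = homology_aware_split(ids, hits)
print(train_ids, test_ids)
```

Because a whole family is pulled in at once, the test set can overshoot the target size, which matches the stopping rule described above. The sketch assumes that hits are symmetric; with one-directional hits a family member could still remain in the training set.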
For every training set, species and task, we selected discriminatory features using a simple filter approach which in previous work performed as well as classical wrapper approaches (data not shown). Briefly, for every feature we performed a Wilcoxon signed-rank test for every comparison of functional classes. Features were retained if for at least one comparison of classes they had a Wilcoxon p-value less than 0.02, indicating that they contribute potentially discriminating information. A second step of filtering removed highly redundant features, so that the remaining features had a pairwise absolute correlation coefficient of less than 0.95. For the known-unknown and species-species discrimination tasks the same procedure was applied using the Wilcoxon results for feature values in the various species or in 'known' vs. 'unknown' proteins, respectively.
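The two filtering steps can be sketched in a few lines. The sketch takes the smallest Wilcoxon p-value of each feature as given rather than computing the test itself, and it resolves redundancy greedily by keeping the more significant feature of a correlated pair; that ordering is an assumption, since the text does not state which of two redundant features was retained. `pearson` and `filter_features` are hypothetical helper names.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant feature: treat as uncorrelated (assumption)
    return sxy / sqrt(sxx * syy)


def filter_features(columns, min_p_values, p_cutoff=0.02, r_cutoff=0.95):
    """Keep discriminating, non-redundant features.

    `columns` maps feature name -> training-set values;
    `min_p_values` maps feature name -> smallest p-value over all class comparisons.
    """
    # Step 1: retain features that are significant for at least one comparison.
    candidates = [name for name in columns if min_p_values[name] < p_cutoff]
    # Step 2: visit the most significant features first and drop near-duplicates.
    candidates.sort(key=lambda name: min_p_values[name])
    kept = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[other])) < r_cutoff for other in kept):
            kept.append(name)
    return kept


columns = {"f1": [1, 2, 3, 4], "f2": [2, 4, 6, 8.1], "f3": [4, 1, 3, 2], "f4": [1, 0, 1, 0]}
p_values = {"f1": 0.001, "f2": 0.01, "f3": 0.015, "f4": 0.5}
print(filter_features(columns, p_values))
```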
Classification was done using Support Vector Machine classifiers as implemented in the WEKA machine learning package. As the datasets are highly imbalanced, the negative class was undersampled to equal the positive class. A simple polynomial kernel with order 3 was used, as it had shown good performance in previous related studies. Other parameters were used in default settings (complexity constant = 1, size of the kernel cache = 1 × 10^4, tolerance parameter = 1.03 × 10^-3) to avoid introducing bias by fine tuning to the present data. For the functional class prediction, one-against-all classifiers were generated for each class. For example, for predicting the transport and binding proteins functional class, we labeled all the other 12 functional classes as 'not transport and binding proteins' and performed a binary classification of transport and binding proteins against 'not transport and binding proteins'. We could then assess how well the features discriminate between the transport and binding functional class and all other functional classes.
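The one-against-all labelling and the undersampling of the negative class can be illustrated independently of the SVM itself. The following sketch is not the WEKA workflow used in the study; the function names and the random sampling scheme are assumptions.

```python
import random


def one_vs_all_labels(classes, positive_class):
    """Label proteins of `positive_class` as 1 and all other classes as 0."""
    return [1 if c == positive_class else 0 for c in classes]


def undersample_negatives(labels, seed=0):
    """Return sorted indices of a balanced subset: all positives plus equally many negatives."""
    rng = random.Random(seed)
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    if len(negatives) > len(positives):
        # Randomly discard surplus negatives (sampling without replacement).
        negatives = rng.sample(negatives, len(positives))
    return sorted(positives + negatives)


classes = ["transport"] * 3 + ["metabolism"] * 5 + ["dna"] * 4
labels = one_vs_all_labels(classes, "transport")
balanced = undersample_negatives(labels)
print(len(balanced), sum(labels[i] for i in balanced))
```

Repeating this for each of the 13 functional classes yields one binary training set per class.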
Classifier performance evaluation
Classifier performances were evaluated using the Area Under the Receiver Operating Characteristic curve (AUC) on the test set. The AUC is a non-parametric estimate of the discriminating ability of the classifier: a value of 50% corresponds to a random classifier, and a value of 100% indicates perfect performance [29, 30]. The median over the five splits of the test and training sets is generally reported. Using the AUC as a descriptor of classifier performance has the important advantage that it is independent of the class distribution in the test set. This is very important for our protein function prediction task: it is highly unlikely that the distribution of functions among the 'unknown' proteins is the same as that of the 'known' proteins, and the AUC provides the most unbiased performance estimate in this situation.
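The AUC can be computed directly from its rank interpretation: it is the fraction of (positive, negative) pairs in which the positive example receives the higher classifier score, with ties counted as one half. The sketch below uses this pairwise definition. The scores and the per-split values are made-up illustration data, not results from the study.

```python
from statistics import median


def auc(scores_pos, scores_neg):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


pos = [0.9, 0.8, 0.4]  # hypothetical scores of positive test proteins
neg = [0.7, 0.3]       # hypothetical scores of negative test proteins
print(auc(pos, neg))

# Changing the class ratio by replicating the negatives leaves the AUC unchanged.
print(auc(pos, neg * 3))

# Reported value: median over the five test/training splits (hypothetical AUCs).
print(median([0.60, 0.55, 0.70, 0.50, 0.65]))
```

The second call illustrates the independence from class distribution mentioned above: tripling the number of negatives changes the class ratio but not the proportion of correctly ordered pairs.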
Delneri D, Brancia FL, Oliver SG: Towards a truly integrative biology through the functional genomics of yeast. Curr Opin Biotechnol. 2001, 12: 87-91. 10.1016/S0958-1669(00)00179-8.
Norin M, Sundstrom M: Structural proteomics: developments in structure-to-function predictions. Trends Biotechnol. 2002, 20: 79-84. 10.1016/S0167-7799(01)01884-4.
Baker D, Sali A: Protein structure prediction and structural genomics. Science. 2001, 294: 93-96. 10.1126/science.1065659.
Que QQ, Winzeler EA: Large-scale mutagenesis and functional genomics in yeast. Funct Integr Genomics. 2002, 2: 193-198. 10.1007/s10142-002-0057-3.
Zhang C, Kim SH: Overview of structural genomics: from structure to function. Curr Opin Chem Biol. 2003, 7: 28-32. 10.1016/S1367-5931(02)00015-7.
Sonnichsen B, Koski LB, Walsh A, Marschall P, Neumann B, Brehm M, Alleaume AM, Artelt J, Bettencourt P, Cassin E, Hewitson M, Holz C, Khan M, Lazik S, Martin C, Nitzsche B, Ruer M, Stamford J, Winzi M, Heinkel R, Roder M, Finell J, Hantsch H, Jones SJ, Jones M, Piano F, Gunsalus KC, Oegema K, Gonczy P, Coulson A, Hyman AA, Echeverri CJ: Full-genome RNAi profiling of early embryogenesis in Caenorhabditis elegans. Nature. 2005, 434: 462-469. 10.1038/nature03353.
Whisstock JC, Lesk AM: Prediction of protein function from protein sequence and structure. Q Rev Biophys. 2003, 36: 307-340. 10.1017/S0033583503003901.
Dobson PD, Cai YD, Stapley BJ, Doig AJ: Prediction of protein function in the absence of significant sequence similarity. Curr Med Chem. 2004, 11: 2135-2142.
Doerks T, von Mering C, Bork P: Functional clues for hypothetical proteins based on genomic context analysis in prokaryotes. Nucleic Acids Res. 2004, 32: 6321-6326. 10.1093/nar/gkh973.
King RD, Karwath A, Clare A, Dehaspe L: Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining. Yeast. 2000, 17: 283-293. 10.1002/1097-0061(200012)17:4<283::AID-YEA52>3.0.CO;2-F.
King RD, Karwath A, Clare A, Dehaspe L: The utility of different representations of protein sequence for predicting functional class. Bioinformatics. 2001, 17: 445-454. 10.1093/bioinformatics/17.5.445.
Jensen LJ, Gupta R, Staerfeldt HH, Brunak S: Prediction of novel archaeal enzymes from sequence-derived features. Protein Science. 2002, 11: 2894-2898. 10.1110/ps.0225102.
Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C, Andersen CA, Knudsen S, Krogh A, Valencia A, Brunak S: Prediction of human protein function from post-translational modifications and localization features. J Mol Biol. 2002, 319: 1257-1265. 10.1016/S0022-2836(02)00379-0.
Jensen R, Gupta H, Staerfeldt HH, Brunak S: Prediction of human protein function according to Gene Ontology categories. Bioinformatics. 2003, 19: 635-642. 10.1093/bioinformatics/btg036.
Jensen LJ, Ussery DW, Brunak S: Functionality of system components: Conservation of protein function in protein feature space. Genome Research. 2003, 13: 2444-2449. 10.1101/gr.1190803.
Dobson PD, Doig AJ: Predicting enzyme class from protein structure without alignments. J Mol Biol. 2005, 345: 187-199. 10.1016/j.jmb.2004.10.024.
Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, Chen Y: Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics. 2006, 6: 4023-4037. 10.1002/pmic.200500938.
King RD, Wise PH, Clare A: Confirmation of data mining based predictions of protein function. Bioinformatics. 2004, 20: 1110-1118. 10.1093/bioinformatics/bth047.
Bentley SD, Parkhill J: Comparative Genomic Structure of Prokaryotes. Annual Review of Genetics. 2004, 38: 771-791. 10.1146/annurev.genet.38.072902.094318.
Los Alamos National Laboratory Bioscience Division STD Sequence Databases. [http://www.stdgen.lanl.gov]
Riley M: Functions of the gene products of Escherichia coli. Microbiology Review. 1993, 57: 862-952.
Ouali M, King RD: Cascaded multiple classifiers for secondary structure prediction. Prot Sci. 2000, 9: 1162-1176.
Krogh A, Larsson B, von Heijne G, Sonnhammer ELL: Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes. J Mol Biol. 2001, 305: 567-580. 10.1006/jmbi.2000.4315.
Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB: Protein disorder prediction: implications for structural proteomics. Structure. 2003, 11 (11):
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov]
WEKA machine learning package. [http://www.cs.waikato.ac.nz/ml/weka]
Al-Shahib A, Breitling R, Gilbert D: Feature selection and the class imbalance problem in predicting protein function from sequence. Appl Bioinformatics. 2005, 4 (3): 195-203.
Bamber D: The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph. Journal of Mathematical Psychology. 1975, 12: 387-415. 10.1016/0022-2496(75)90001-2.
Gribskov M, Robinson NL: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers and Chemistry. 1996, 20: 25-33. 10.1016/S0097-8485(96)80004-0.
The authors thank M. Girolami and S. Rogers for helpful discussions on machine learning approaches. AA was funded by the University of Glasgow. RB was supported by a Caledonian Research Foundation Personal Fellowship.
AA collected the protein sequence data, generated the feature database and performed all experiments. RB designed the study, implemented the feature selection algorithm, and drafted the manuscript. DRG supervised the project and provided critical input. All authors read and approved the final manuscript.