Predicting protein function by machine learning on amino acid sequences – a critical evaluation
© Al-Shahib et al; licensee BioMed Central Ltd. 2007
Received: 29 November 2006
Accepted: 20 March 2007
Published: 20 March 2007
Predicting the function of newly discovered proteins by simply inspecting their amino acid sequence is one of the major challenges of post-genomic computational biology, especially when done without recourse to experimentation or homology information. Machine learning classifiers are able to discriminate between proteins belonging to different functional classes. Until now, however, it has been unclear if this ability would be transferable to proteins of unknown function, which may show distinct biases compared to experimentally more tractable proteins.
Here we show that proteins with known and unknown function do indeed differ significantly. We then show that proteins from different bacterial species also differ to an even larger and very surprising extent, but that functional classifiers nonetheless generalize successfully across species boundaries. We also show that in the case of highly specialized proteomes classifiers from a different, but more conventional, species may in fact outperform the endogenous species-specific classifier.
We conclude that there is a very good prospect of successfully predicting the function of yet uncharacterized proteins using machine learning classifiers trained on proteins of known function.
Genome sequencing projects continue to produce unprecedented amounts of novel protein sequence information, and large-scale experimental efforts are underway to determine the function of the newly discovered proteins [1–6]. For a majority of proteins it is already possible to predict their approximate function with reasonable accuracy based on their evolutionary relationship or sequence similarity to proteins with known functions [7–9]. For most recently sequenced bacterial genomes about three quarters of open reading frames can be assigned a possible function in this way. However, a significant number of predicted proteins in each newly sequenced genome have turned out to defy this approach. These proteins, which in extreme cases may constitute up to 50% of open reading frames, show no similarity to proteins of known function. This may be because experimental data are missing, or because the proteins are evolving too rapidly or are even unique to a small clade of species.
It would be very useful if one could obtain at least a general idea of the function of such proteins based on their amino acid sequence alone. Of course this is an extremely challenging task, and one that will only be of limited usefulness without combining it with additional information (e.g. structure models, phylogenetic profiles, or genomic context), but nonetheless several techniques to address this issue have been proposed recently [10–16]. These publications show that using machine learning classifiers it is possible to predict the function of well-characterized proteins based on features of their amino acid sequence, without using homology information. However, it is unclear if and how well such classifiers would transfer to proteins of unknown function. There are many reasons to assume that these 'unknown' proteins are special and differ from well-characterized proteins in significant ways: they may be evolving at a faster pace, they may function in unconventional ways, or they may have unusual physico-chemical properties that have made them less accessible to experimentation. If 'unknown' proteins are not just a random subset of the proteome, but are biased in such a systematic fashion, classifiers trained and tested on proteins of known function may generalize poorly and will be unable to predict the function of the real proteins of interest.
A direct test of the predictive performance on proteins of unknown function is rarely possible, although a recent retrospective study made some first steps in that direction. Thus a critical systematic assessment of the general prospect of successful classifier transfer is of great interest.
Here we show that proteins of known and unknown function do indeed differ significantly. We go on to show that, surprisingly, proteins from different species do also differ, to an even larger extent. We then demonstrate that classifiers do nonetheless generalize across species boundaries and use this to provide the first critical estimate of predictive performance on proteins of unknown function.
Results and discussion
Table 1. Bacterial species used in the analysis. Columns: total number of proteins; number of 'unknowns'; average GC content (%).
Prediction of known protein functions
In this paper we are not interested in optimizing a method of predicting protein functions, but rather in evaluating an aspect of function prediction that has been somewhat neglected previously, namely whether classifiers trained on proteins of known functions can be expected to transfer successfully to proteins of unknown function. Even an optimal classifier would be useless if it could not be applied reliably to the real proteins of interest, i.e. those for which no function is known at present.
Discrimination between 'known' and 'unknown' proteins
• Unique proteins – Proteins without known homologs
• Hypothetical proteins – Proteins with homologs of unknown function. No experimental evidence exists for the function or existence of the protein product.
• Wrongly predicted proteins – Open reading frames that are not actually expressed (transcribed/translated), but are only the result of genome misannotation.
• Special proteins – Proteins that have a special feature (e.g. an unusual size or extreme amount of charged amino acids) making them different from known proteins, but which do have a biological function.
From the above list, one can see that the set of unknown proteins will contain some members that actually will have a function and others that are probably genome annotation artifacts. The ability to distinguish between known and unknown proteins is most likely due to the difference between unusual unknown proteins (categories 3 and 4) and normal known proteins. It is expected that homology-less function prediction will be possible for the 'normal' unknown proteins (categories 1 and 2), while being much more difficult for the special proteins (category 4) and meaningless for wrongly predicted proteins (category 3). Therefore, it would be interesting to use the classification information to estimate the fraction of 'predictable' and 'unpredictable' proteins in the set of unknown proteins. An exact estimate is not possible, because there is no exact definition of a normal protein available, but we can use the performance of the SVM classifier to obtain a rough estimate. The median AUC for the discrimination between proteins of known and unknown function is 63%. If we assume that 'predictable' unknown proteins are indistinguishable from known proteins, we can calculate the lower bound estimate of the fraction of unpredictable proteins to be 26% (= (63% – 50%)/50%).
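The arithmetic behind this lower bound can be sketched as follows. The reasoning assumed here (it is not spelled out in the text) is a simple mixture: if a fraction f of the unknown proteins were perfectly separable from known proteins (AUC of 100%) and the remainder were indistinguishable (AUC of 50%), the overall AUC would be 0.5 + 0.5 f. Any less-than-perfect separation of the unusual proteins would require a larger f, which is why the value is a lower bound. The function name is hypothetical.

```python
def unpredictable_lower_bound(auc: float, chance: float = 0.5) -> float:
    """Lower-bound fraction of 'unpredictable' unknown proteins implied by an AUC.

    Assumes a two-component mixture: 'predictable' unknowns score at chance
    level against known proteins, 'unpredictable' ones are perfectly separable.
    Solving auc = f * 1.0 + (1 - f) * chance for f gives the expression below.
    """
    return (auc - chance) / (1.0 - chance)


median_auc = 0.63  # median known-vs-unknown AUC reported in the text
fraction = unpredictable_lower_bound(median_auc)
print(f"{fraction:.0%}")  # prints 26%
```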
Discrimination between proteins from different bacteria
One possible explanation for this high accuracy in discriminating proteins from different species lies in the varying levels of guanine-cytosine (GC) content in each species (Table 1). Variations in genomic GC content from 25% to 75% have been shown to be common in prokaryotes. Variations in GC content in coding sequences will be reflected in differences of amino acid composition, as GC-rich codons will be depleted in low-GC species and vice versa. Even if this variation is subtle, it will influence classifier performance.
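The link between GC content and amino acid composition can be illustrated with a small sketch. The codon assignments below are taken from the standard genetic code; the selection of amino acids and the helper names are illustrative choices, not part of the original analysis.

```python
# Codons (DNA alphabet) for a few amino acids in the standard genetic code.
CODONS = {
    "Gly": ["GGT", "GGC", "GGA", "GGG"],
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
    "Pro": ["CCT", "CCC", "CCA", "CCG"],
    "Lys": ["AAA", "AAG"],
    "Asn": ["AAT", "AAC"],
    "Ile": ["ATT", "ATC", "ATA"],
    "Phe": ["TTT", "TTC"],
}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


# Mean GC fraction over the synonymous codons of each amino acid.
mean_codon_gc = {
    aa: sum(gc_fraction(codon) for codon in codons) / len(codons)
    for aa, codons in CODONS.items()
}

for aa, gc in sorted(mean_codon_gc.items(), key=lambda item: -item[1]):
    print(f"{aa}: {gc:.2f}")
```

Amino acids such as Gly, Ala and Pro are encoded by GC-rich codons, whereas Lys, Asn, Ile and Phe are encoded by AT-rich codons, so a shift in genomic GC content would be expected to shift their relative frequencies.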
Classifier transfer between species
Protein dataset and annotation
Protein sequences for seven bacterial pathogens causing sexually transmitted diseases in humans (Table 1) were obtained from the Los Alamos National Laboratory Bioscience Division STD Sequence Databases. For each functionally characterized protein its classification in one of 13 functional classes based on a modified version of the Riley scheme was obtained from the same source.
Definition of protein sequence features
For every protein we calculated the frequency and total number of each amino acid, as well as of certain sets of amino acids (e.g. hydrophobic, charged, polar). To encode distributional features we also determined the number and size of continuous stretches of each amino acid or amino acid set. We also subdivided every protein into four equally sized fragments and calculated the same feature values for each fragment and combination of fragments. In addition, we predicted the secondary structure using Prof, the position of putative transmembrane helices using TMHMM and of disordered regions using DisEMBL, and treated the obtained predictions in the same way as the amino acids. A small number of global features (e.g. isoelectric point and molecular weight) were also included. The total number of features extracted for every protein is 2579. The full feature set is described in Additional File 1.
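A minimal sketch of this kind of feature extraction is given below. It covers only counts, frequencies and continuous stretches for the full sequence and for four fragments; combinations of fragments, predicted structural features and global features are omitted. The residue groupings and all function names are assumptions made for illustration and are not the definitions used in Additional File 1.

```python
from itertools import groupby

# Illustrative residue groupings (assumptions, not the paper's definitions).
RESIDUE_SETS = {
    "A": set("A"),
    "hydrophobic": set("AVLIMFWC"),
    "charged": set("DEKR"),
}


def run_lengths(seq, members):
    """Lengths of continuous stretches of residues belonging to `members`."""
    return [
        len(list(run))
        for inside, run in groupby(seq, key=lambda aa: aa in members)
        if inside
    ]


def quarters(seq):
    """Split a sequence into four fragments of (nearly) equal size."""
    n = len(seq)
    cuts = [(i * n) // 4 for i in range(5)]
    return [seq[cuts[i]:cuts[i + 1]] for i in range(4)]


def sequence_features(seq):
    """Count, frequency and stretch features for the full sequence and each quarter."""
    parts = {"full": seq}
    for i, fragment in enumerate(quarters(seq), start=1):
        parts[f"q{i}"] = fragment
    feats = {}
    for part_name, part in parts.items():
        for set_name, members in RESIDUE_SETS.items():
            count = sum(aa in members for aa in part)
            runs = run_lengths(part, members)
            prefix = f"{part_name}.{set_name}"
            feats[prefix + ".count"] = count
            feats[prefix + ".freq"] = count / len(part) if part else 0.0
            feats[prefix + ".n_runs"] = len(runs)
            feats[prefix + ".max_run"] = max(runs, default=0)
    return feats


example = sequence_features("MKKLLAAAVDEE")
print(example["full.hydrophobic.count"], example["full.hydrophobic.max_run"])
```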
Standardization of feature values
Since the original features generated in this way are very heterogeneously scaled, linear normalization (standardization) was performed to rescale each feature using its mean and standard deviation. After standardization, each of the 2579 features has a mean of 0 and a standard deviation of 1.
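Standardization of a single feature column can be sketched as below. The use of the population standard deviation and the handling of constant features are assumptions; the text does not specify either.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a feature column to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


molecular_weights = [10.0, 20.0, 30.0, 40.0]  # made-up values for illustration
z = standardize(molecular_weights)
z_mean = mean(z)
z_sd = pstdev(z)
print(round(z_mean, 6), round(z_sd, 6))
```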
Homology-corrected generation of test and training sets
The entire dataset was subdivided randomly five times into test and training sets (size ratio 1:4). To prevent inflation of the prediction accuracies by predictions on homologous sequences in the test set, we applied a recursive Blast strategy to assign proteins that show significant sequence similarity to each other to the same set (either test or training). For this purpose every protein that was added to the test set was searched in three PSI-Blast iterations against the non-redundant database of protein sequences at NCBI using default settings. The obtained position-specific sequence profile was then run against the bacterial proteins and every protein generating a hit at E < 0.001 was also added to the test set, and the procedure repeated recursively until no new potential homologues were detected. Then the next randomly chosen protein would be added to the test set until the required test set size was exceeded.
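The recursive assignment described above can be sketched as a graph traversal. In this sketch the PSI-Blast searches are replaced by a precomputed, hypothetical `homologs` mapping from each protein to the proteins it hits at E < 0.001; the function name, the stopping rule (stop once the target size is reached or exceeded) and the fixed random seed are assumptions.

```python
import random
from collections import deque


def homology_split(proteins, homologs, test_fraction=0.2, seed=0):
    """Split proteins into training and test sets, keeping homologs together.

    `homologs` stands in for the PSI-Blast step: it maps a protein to the
    set of proteins with a significant hit (hypothetical, precomputed).
    """
    rng = random.Random(seed)
    order = list(proteins)
    rng.shuffle(order)
    target = test_fraction * len(order)  # 1:4 ratio of test to training
    test = set()
    for candidate in order:
        if len(test) >= target:
            break
        if candidate in test:
            continue
        # Add the candidate, then recursively pull in all detected homologs.
        queue = deque([candidate])
        while queue:
            protein = queue.popleft()
            if protein in test:
                continue
            test.add(protein)
            queue.extend(h for h in homologs.get(protein, ()) if h not in test)
    train = [p for p in order if p not in test]
    return train, sorted(test)


proteins = [f"p{i}" for i in range(10)]
homologs = {
    "p0": {"p1"}, "p1": {"p0", "p2"}, "p2": {"p1"},  # one family of three
    "p5": {"p6"}, "p6": {"p5"},                      # one family of two
}
train, test = homology_split(proteins, homologs)
print(len(train), len(test))
```

Because whole families are added at once, the test set may end up larger than the nominal 20%.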
For every training set, species and task, we selected discriminatory features using a simple filter approach which in previous work performed as well as classical wrapper approaches (data not shown). Briefly, for every feature we performed a Wilcoxon signed-rank test for every comparison of functional classes. Features were retained if for at least one comparison of classes they had a Wilcoxon p-value less than 0.02, indicating that they contribute potentially discriminating information. A second step of filtering removed highly redundant features, so that the remaining features had a pairwise absolute correlation coefficient of less than 0.95. For the known-unknown and species-species discrimination tasks the same procedure was applied using the Wilcoxon results for feature values in the various species or in 'known' vs. 'unknown' proteins, respectively.
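The two filtering steps can be sketched as follows, assuming the per-comparison Wilcoxon p-values have already been computed elsewhere. The greedy order in which redundant features are dropped (most significant feature first) is an assumption, as the text does not state how one of two correlated features was chosen; the function names and toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_features(columns, p_values, p_cut=0.02, r_cut=0.95):
    """Keep informative features, then drop highly redundant ones.

    columns:  feature name -> list of values over the training proteins
    p_values: feature name -> p-values, one per comparison of classes
    """
    # Step 1: retain features with p < p_cut for at least one comparison.
    informative = [name for name in columns if min(p_values[name]) < p_cut]
    # Step 2: greedy redundancy filter, most significant feature first.
    informative.sort(key=lambda name: min(p_values[name]))
    kept = []
    for name in informative:
        if all(abs(pearson(columns[name], columns[k])) < r_cut for k in kept):
            kept.append(name)
    return kept


columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.1],  # nearly collinear with "a"
    "c": [5, 1, 4, 2, 3],
    "d": [1, 1, 2, 2, 3],
}
p_values = {"a": [0.001, 0.5], "b": [0.01], "c": [0.015, 0.3], "d": [0.2, 0.4]}
selected = select_features(columns, p_values)
print(selected)  # ['a', 'c']
```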
Classification was done using Support Vector Machine classifiers as implemented in the WEKA machine learning package. As the datasets are highly imbalanced, the negative class was undersampled to equal the positive class. A simple polynomial kernel with order 3 was used, as it had shown good performance in previous related studies. Other parameters were used in default settings (complexity constant = 1, size of the kernel cache = 1 × 10^4, tolerance parameter = 1.03 × 10^-3) to avoid introducing bias by fine tuning to the present data. For the functional class prediction, one-against-all classifiers were generated for each class. For example, for predicting the transport and binding proteins functional class, we labeled all the other 12 functional classes as 'not transport and binding proteins' and performed a binary classification of transport and binding proteins against 'not transport and binding proteins'. We could then assess how well the features discriminate between the transport and binding functional class and all other functional classes.
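The one-against-all labeling and the undersampling of the negative class can be sketched as below. This covers only the preparation of the data, not the SVM itself; the function names, the shortened class labels other than the transport class, and the class sizes are made up for illustration, and the sketch assumes the negative class is the larger one.

```python
import random


def one_vs_all(labels, positive_class):
    """Binary labels: True for the class of interest, False for all others."""
    return [label == positive_class for label in labels]


def undersample(is_positive, seed=0):
    """Indices of all positives plus an equally sized random sample of negatives."""
    rng = random.Random(seed)
    positives = [i for i, flag in enumerate(is_positive) if flag]
    negatives = [i for i, flag in enumerate(is_positive) if not flag]
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    return sorted(positives + sampled)


# Toy class labels; sizes are arbitrary.
labels = ["transport and binding"] * 3 + ["class B"] * 10 + ["class C"] * 7
flags = one_vs_all(labels, "transport and binding")
balanced_idx = undersample(flags)
n_positive = sum(flags[i] for i in balanced_idx)
print(len(balanced_idx), n_positive)  # 6 3
```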
Classifier performance evaluation
Classifier performances were evaluated using the Area Under the Receiver Operating Characteristic curve (AUC) on the test set. The median over the five splits of the test and training sets is generally reported. This value is a non-parametric estimate of the discriminating ability of the classifier. A value of 50% corresponds to a random classifier, whereas a value of 100% indicates perfect performance [29, 30]. Using the AUC as a descriptor of classifier performance has the important advantage that it is independent of the class distribution in the test set. This is very important for our protein function prediction task: it is highly unlikely that the distribution of functions among the 'unknown' proteins is the same as that of the 'known' proteins, and the AUC provides the most unbiased performance estimate in this situation.
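The AUC can be computed directly as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise form with made-up scores and also illustrates the independence from the class distribution; it is not the WEKA implementation.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


scores = [0.9, 0.8, 0.4, 0.3, 0.2]  # made-up classifier outputs
labels = [1, 1, 0, 1, 0]
base = auc(scores, labels)

# Doubling the number of negatives changes the class distribution
# but leaves the AUC unchanged.
shifted = auc(scores + [0.4, 0.2], labels + [0, 0])
print(round(base, 3), round(shifted, 3))
```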
The authors thank M. Girolami and S. Rogers for helpful discussions on machine learning approaches. AA was funded by the University of Glasgow. RB was supported by a Caledonian Research Foundation Personal Fellowship.
- Delneri D, Brancia FL, Oliver SG: Towards a truly integrative biology through the functional genomics of yeast. Curr Opin Biotechnol. 2001, 12: 87-91. 10.1016/S0958-1669(00)00179-8.
- Norin M, Sundstrom M: Structural proteomics: developments in structure-to-function predictions. Trends Biotechnol. 2002, 20: 79-84. 10.1016/S0167-7799(01)01884-4.
- Baker D, Sali A: Protein structure prediction and structural genomics. Science. 2001, 294: 93-96. 10.1126/science.1065659.
- Que QQ, Winzeler EA: Large-scale mutagenesis and functional genomics in yeast. Funct Integr Genomics. 2002, 2: 193-198. 10.1007/s10142-002-0057-3.
- Zhang C, Kim SH: Overview of structural genomics: from structure to function. Curr Opin Chem Biol. 2003, 7: 28-32. 10.1016/S1367-5931(02)00015-7.
- Sonnichsen B, Koski LB, Walsh A, Marschall P, Neumann B, Brehm M, Alleaume AM, Artelt J, Bettencourt P, Cassin E, Hewitson M, Holz C, Khan M, Lazik S, Martin C, Nitzsche B, Ruer M, Stamford J, Winzi M, Heinkel R, Roder M, Finell J, Hantsch H, Jones SJ, Jones M, Piano F, Gunsalus KC, Oegema K, Gonczy P, Coulson A, Hyman AA, Echeverri CJ: Full-genome RNAi profiling of early embryogenesis in Caenorhabditis elegans. Nature. 2005, 434: 462-469. 10.1038/nature03353.
- Whisstock JC, Lesk AM: Prediction of protein function from protein sequence and structure. Q Rev Biophys. 2003, 36: 307-340. 10.1017/S0033583503003901.
- Dobson PD, Cai YD, Stapley BJ, Doig AJ: Prediction of protein function in the absence of significant sequence similarity. Curr Med Chem. 2004, 11: 2135-2142.
- Doerks T, von Mering C, Bork P: Functional clues for hypothetical proteins based on genomic context analysis in prokaryotes. Nucleic Acids Res. 2004, 32: 6321-6326. 10.1093/nar/gkh973.
- King RD, Karwath A, Clare A, Dehaspe L: Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining. Yeast. 2000, 17: 283-293. 10.1002/1097-0061(200012)17:4<283::AID-YEA52>3.0.CO;2-F.
- King RD, Karwath A, Clare A, Dehaspe L: The utility of different representations of protein sequence for predicting functional class. Bioinformatics. 2001, 17: 445-454. 10.1093/bioinformatics/17.5.445.
- Jensen LJ, Gupta R, Staerfeldt HH, Brunak S: Prediction of novel archaeal enzymes from sequence-derived features. Protein Science. 2002, 11: 2894-2898. 10.1110/ps.0225102.
- Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C, Andersen CA, Knudsen S, Krogh A, Valencia A, Brunak S: Prediction of human protein function from post-translational modifications and localization features. J Mol Biol. 2002, 319: 1257-1265. 10.1016/S0022-2836(02)00379-0.
- Jensen R, Gupta H, Staerfeldt HH, Brunak S: Prediction of human protein function according to Gene Ontology categories. Bioinformatics. 2003, 19: 635-642. 10.1093/bioinformatics/btg036.
- Jensen LJ, Ussery DW, Brunak S: Functionality of system components: Conservation of protein function in protein feature space. Genome Research. 2003, 13: 2444-2449. 10.1101/gr.1190803.
- Dobson PD, Doig AJ: Predicting enzyme class from protein structure without alignments. J Mol Biol. 2005, 345: 187-199. 10.1016/j.jmb.2004.10.024.
- Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, Chen Y: Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics. 2006, 6: 4023-4037. 10.1002/pmic.200500938.
- King RD, Wise PH, Clare A: Confirmation of data mining based predictions of protein function. Bioinformatics. 2004, 20: 1110-1118. 10.1093/bioinformatics/bth047.
- Bentley SD, Parkhill J: Comparative Genomic Structure of Prokaryotes. Annual Review of Genetics. 2004, 38: 771-791. 10.1146/annurev.genet.38.072902.094318.
- Los Alamos National Laboratory Bioscience Division STD Sequence Databases. [http://www.stdgen.lanl.gov]
- Riley M: Functions of the gene products of Escherichia coli. Microbiology Review. 1993, 57: 862-952.
- Ouali M, King RD: Cascaded multiple classifiers for secondary structure prediction. Prot Sci. 2000, 9: 1162-1176.
- Krogh A, Larsson B, von Heijne G, Sonnhammer ELL: Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes. J Mol Biol. 2001, 305: 567-580. 10.1006/jmbi.2000.4315.
- Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB: Protein disorder prediction: implications for structural proteomics. Structure. 2003, 11 (11):
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov]
- WEKA machine learning package. [http://www.cs.waikato.ac.nz/ml/weka]
- Al-Shahib A, Breitling R, Gilbert D: Feature selection and the class imbalance problem in predicting protein function from sequence. Appl Bioinformatics. 2005, 4 (3): 195-203.
- Bamber D: The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph. Journal of Mathematical Psychology. 1975, 12: 387-415. 10.1016/0022-2496(75)90001-2.
- Gribskov M, Robinson NL: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers and Chemistry. 1996, 20: 25-33. 10.1016/S0097-8485(96)80004-0.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.