Quantitative assessment of relationship between sequence similarity and function similarity
© Joshi and Xu. 2007
Received: 05 July 2006
Accepted: 09 July 2007
Published: 09 July 2007
Comparative sequence analysis is considered the first step towards annotating new proteins in genome annotation. However, sequence comparison may lead to the creation and propagation of function assignment errors. Thus, it is important to analyze the quality of sequence-based function assignment thoroughly and systematically using large-scale data.
We present an analysis of the relationship between sequence similarity and function similarity for the proteins in four model organisms, i.e., Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster. Using a measure of functional similarity based on the three categories of Gene Ontology (GO) classifications (biological process, molecular function, and cellular component), we quantified the correlation between functional similarity and sequence similarity, measured by sequence identity or by the statistical significance of the alignment, and compared this correlation against that of randomly chosen protein pairs.
Various sequence-function relationships were identified from BLAST versus PSI-BLAST, sequence identity versus Expectation Value, GO indices versus semantic similarity approaches, and within genome versus between genome comparisons, for the three GO categories. Our study provides a benchmark to estimate the confidence in assignment of functions purely based on sequence similarity.
Large-scale genome sequencing projects have discovered many new proteins. Of all the proteins whose sequences are known, functions have been experimentally determined for only a small percentage. Annotation of a genome involves assignment of functions to proteins, in most cases on the basis of sequence similarity. Protein function assignments based on postulated homology, as recognized by sequence identity or a significant expectation value of the alignment, are used routinely in genome analysis. Over the past years, many computational methods [2–11] have been developed to predict function by identifying sequence similarity between a protein of unknown function and one or more proteins with experimentally characterized or computationally predicted functions. However, it is widely recognized that functional annotations should be transferred with caution, as sequence similarity does not guarantee an evolutionary or functional relationship. In addition, if a protein is assigned an incorrect function in a database, the error can carry over to other proteins whose functions are inferred from their sequence relationship to the misannotated protein [12–14].
A number of studies of the sequence-function relationship have been carried out. Shah et al. showed that many EC (Enzyme Commission) classes could not be perfectly discriminated by sequence similarity at any threshold. Pawlowski et al. studied the relation between sequence similarity and functional similarity based on the EC classification for the E. coli genome. However, that study is limited to within-genome comparisons and lacks any analysis based on inter-genome comparisons. Devos et al. studied the complexity of transferring function between similar sequences. Their study shows that binding site, keywords, and functional class annotations are less conserved than EC numbers, and all of them in turn are less conserved than protein structure. Wilson et al. showed that percent identity in sequence alignment is more effective than modern probabilistic scores at quantifying functional conservation under their simple classification of SCOP domains. However, none of these studies used a broad definition of function for a systematic large-scale analysis. In this paper, we build a comprehensive and systematic benchmark for the sequence-function relationship using four model organisms (Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster) and controlled vocabularies of function annotation terms in the Gene Ontology from three different perspectives, i.e., biological process, molecular function, and cellular component.
It has long been recognized that genome annotations using computational methods produce many false function assignments. Many of these methods have been applied to function prediction. They often provide valuable hypotheses, but none are perfect. As a result, it is known that many databases contain incorrect function assignments, and these erroneous assignments propagate from one database to another. Nevertheless, up until now there has been no systematic study of this critical issue. The question of whether two proteins are functionally similar is very complex to answer. Function is a very complex notion involving many different aspects, including chemical, biochemical, cellular, organism-mediated, and developmental processes. Qualitatively, it is expected that the higher the sequence similarity, the more likely two proteins are to have related functions. Quantitatively, however, the relationship between function similarity in the different categories and sequence similarity has not been studied in depth. Such a quantitative study is fundamentally important, as it can provide an assessment of gene function prediction quality and insights into the underlying mechanisms by which new functions evolve through changes in sequence [25,26].
Our study confirms that sequence comparison often provides good suggestions for gene functions or related functions. These suggestions serve as useful hypotheses for further experimental work to confirm, refine or refute the predictions. Such a process can substantially increase the speed of biological knowledge discovery. On the other hand, when assigning function based purely on similarity to proteins of known function (as annotated in databases), it is important to be aware of incomplete or wrong annotations. While computational function annotation is valuable, our study also shows that a significant portion of gene annotations of biological process, molecular function, and cellular component based solely on sequence similarity are unreliable, in particular when the sequence similarity is low. Our study also provides a numerical benchmark for the extent to which one can trust computational annotation. It is possible that a confidence score can be derived from our study for any annotation based on sequence similarity. With this score in the annotation file, the user can gain better insight into the quality of the annotations. Furthermore, our analyses highlight the different sequence-function relationships identified from BLAST versus PSI-BLAST, sequence identity versus Expectation value, GO indices versus semantic similarity approaches, and within genome versus between genome comparisons, for the three GO classification types.
There are some limitations in our current study. Our study can only reflect certain aspects of protein function. Protein function variations may result from factors other than sequence, such as alternative splicing and post-translational modification, and our method does not address these factors. Another limitation is that when we assess gene function prediction, we consider only one hit at a time in a database. In many cases, sequence comparison yields multiple hits for one query protein, and these hits may have different functions. In a future study, we will develop a new method to assess the function prediction for a query protein by combining the functions of multiple hits while considering the dependence among these functions and the E-values of the hits.
Details about the four genomes and number of functional annotations in biological process, molecular function and cellular component assigned based on experimental or sequence similarity evidence
Columns: # of annotations verified by experimental evidence | # of annotations based on computational methods | # of ORFs
Example of GO index and the corresponding GO ID and functional category
Functional category and GO ID
cellular process (GO:0009987)
cell communication (GO:0007154)
signal transduction (GO:0007165)
cell surface receptor linked signal transduction (GO:0007166)
G-protein coupled receptor protein signaling pathway (GO:0030454)
We assume that the functional relationship between two proteins is reflected by the number of index levels that they share. We have demonstrated the usefulness of such an assumption in our earlier studies on gene function prediction [28,29]. We acquired the GO annotations for all the genes in the four genomes and for the three functional categories from the GO website. A gene can (and usually does) belong to multiple indices at various levels in the graph, as proteins may be involved in multiple functions in a cell. Different indices can also correspond to the same GO term.
Gene Ontology annotations are supported by various types of evidence. For quality control, all the plots presented in this paper (except for Figure 6B) are based on annotations with actual experimental evidence, such as IDA (inferred from direct assay), IEP (inferred from expression pattern), IGI (inferred from genetic interaction), IMP (inferred from mutant phenotype), IPI (inferred from physical interaction), RCA (inferred from reviewed computational analysis) and TAS (traceable author statement). We performed some comparisons using annotations assigned purely by computational methods, such as ISS (inferred from sequence similarity) and IEA (inferred from electronic annotation), but the plots are not presented here. We removed the functional annotations that were based purely on evidence codes such as ND (no biological data available) and NAS (non-traceable author statement).
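The evidence-code filtering described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `split_by_evidence`, the tuple layout `(gene, go_id, evidence_code)`, and the gene names in the usage note are assumptions made for the example. Only the grouping of evidence codes is taken from the text.

```python
# Evidence-code groups as listed in the text.
EXPERIMENTAL = frozenset({"IDA", "IEP", "IGI", "IMP", "IPI", "RCA", "TAS"})
COMPUTATIONAL = frozenset({"ISS", "IEA"})
EXCLUDED = frozenset({"ND", "NAS"})


def split_by_evidence(annotations):
    """Split (gene, go_id, evidence_code) records into two lists.

    Returns (experimental, computational). Records whose code is in
    EXCLUDED, or is not listed in any group, are dropped.
    The record layout is a hypothetical one chosen for this sketch.
    """
    experimental, computational = [], []
    for gene, go_id, code in annotations:
        if code in EXPERIMENTAL:
            experimental.append((gene, go_id))
        elif code in COMPUTATIONAL:
            computational.append((gene, go_id))
    return experimental, computational
```

For instance, given the records `("geneA", "GO:0007165", "IDA")`, `("geneB", "GO:0007154", "IEA")` and `("geneC", "GO:0009987", "ND")`, the first lands in the experimental list, the second in the computational list, and the third is discarded.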
Within each family of proteins with similar sequences, functional similarity between proteins is expressed as the number of common roots shared by their functional classification other than the first level, which represents a classification of biological process, molecular function and cellular component. In the case of proteins with multiple functional assignments, the maximum indices of overlap are considered. For example, consider a gene pair ORF1 and ORF2, both annotated proteins. Assume ORF1 has a function represented by GO INDEX 1-1-3-3-4 and ORF2 has a function 1-1-3-2. When compared with each other for the level of matching GO INDEX, they match through INDEX level 1 (1-1) and level 2 (1-1-3) and will have functional similarity equal to 2. The functional similarity defined this way can assume values from 1 to 12.
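The GO index comparison in the worked example above can be sketched in a few lines. This is an illustrative reading of the definition, assuming GO indices are given as dash-separated strings; the function names `shared_levels` and `functional_similarity` are hypothetical and do not come from the original study.

```python
def shared_levels(index_a: str, index_b: str) -> int:
    """Number of GO index levels two indices share, not counting the
    first component (which only names the GO category)."""
    shared = 0
    for a, b in zip(index_a.split("-"), index_b.split("-")):
        if a != b:
            break
        shared += 1
    # Drop the first component: it is the category, not a level.
    return max(shared - 1, 0)


def functional_similarity(indices_a, indices_b) -> int:
    """Maximum overlap over all pairs of indices, for proteins that
    carry multiple functional assignments. Returns 0 if either
    protein has no index."""
    return max(
        (shared_levels(a, b) for a in indices_a for b in indices_b),
        default=0,
    )


# Worked example from the text: ORF1 = 1-1-3-3-4, ORF2 = 1-1-3-2.
print(shared_levels("1-1-3-3-4", "1-1-3-2"))  # 2
```

Taking the maximum over all index pairs mirrors the statement that, for proteins with multiple assignments, the maximum overlap is used.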
$$SS(t_1, t_2) = -\ln p_{ms}(t_1, t_2)$$
where $p_{ms}(t_1, t_2)$ is the probability of the minimum subsumer for terms $t_1$ and $t_2$. The minimum subsumer for terms $t_1$ and $t_2$ is defined as the common parent of the deepest GO Index level shared by $t_1$ and $t_2$.
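The semantic similarity defined above can be illustrated with a small sketch. The text does not state here how the probability of the minimum subsumer is estimated, so this example assumes it is the fraction of annotations in a reference list that fall at or below the subsumer. That estimator, the dash-separated index strings, and the names `common_prefix` and `semantic_similarity` are all assumptions made for illustration.

```python
import math


def common_prefix(index_a: str, index_b: str) -> str:
    """Deepest GO index shared by two indices (the minimum subsumer),
    or an empty string if they share nothing."""
    parts = []
    for a, b in zip(index_a.split("-"), index_b.split("-")):
        if a != b:
            break
        parts.append(a)
    return "-".join(parts)


def semantic_similarity(index_a, index_b, annotated_indices):
    """SS = -ln p_ms, with p_ms estimated (assumption) as the share of
    annotated indices equal to or descending from the subsumer."""
    subsumer = common_prefix(index_a, index_b)
    if not subsumer:
        raise ValueError("indices share no common parent")
    hits = sum(
        1
        for idx in annotated_indices
        if idx == subsumer or idx.startswith(subsumer + "-")
    )
    if hits == 0:
        raise ValueError("subsumer has no annotations in the reference list")
    return -math.log(hits / len(annotated_indices))
```

Under this estimator a rare, deep subsumer yields a high score, while a subsumer that covers every annotation yields a score of zero.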
The subcellular distribution of proteins within a proteome is important to a global understanding of the molecular mechanisms of a cell. The localization of a protein can be seen as an indicator of its function. Localization data can be used as a means of evaluating protein information inferred from other resources. Furthermore, the subcellular localization of a protein often reveals its mechanism of activity. The subcellular localization information was predicted using SubLoc [32,33,41]. The five main subcellular localization categories predicted by SubLoc are Cytoplasmic, Nuclear, Mitochondrial, Transmembrane, and Extracellular. The total numbers of proteins with predicted subcellular localization are 6,323 in Saccharomyces cerevisiae, 27,288 in Arabidopsis thaliana, 21,588 in Caenorhabditis elegans, and 18,498 in Drosophila melanogaster. It is worth mentioning that the subcellular localization predictions were not based on sequence similarity.
The sequence similarity search was done using BLAST, FASTA [34,35] and PSI-BLAST. BLAST is the most widely used sequence comparison tool, particularly for genome annotation. FASTA is more sensitive but slower than BLAST. Both FASTA and BLAST were developed for pairwise local alignment and rely on heuristics. PSI-BLAST is used to identify remote homology through iterative BLAST searches.
We compared sequences both within and between genomes. For within genome comparisons, each protein sequence was compared against the complete set of proteins of the same genome. For between genome comparisons, a pair of similar proteins was identified using the reciprocal search method, i.e., the two proteins in the pair are the best hits in each other's genome in a sequence search. Intra-genome sequence comparison reflects the sequence similarity between paralogs, while inter-genome comparison partially highlights orthologous sequence similarities.
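The reciprocal search method described above can be sketched as follows. The example assumes search results are already parsed into `(query, subject, e_value)` tuples and that the best hit is the one with the lowest E-value. The input layout, the tie-breaking (first hit seen wins), and the names `best_hits` and `reciprocal_best_hits` are assumptions for illustration, not the authors' implementation.

```python
def best_hits(hits):
    """Map each query to its best subject, taking the lowest E-value.

    `hits` is an iterable of (query, subject, e_value) tuples; on ties
    the first hit encountered is kept.
    """
    best = {}
    for query, subject, e_value in hits:
        if query not in best or e_value < best[query][1]:
            best[query] = (subject, e_value)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a in genome B and
    a is the best hit of b in genome A."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )
```

A protein whose best hit points back to a different protein is dropped, which is why this comparison only partially captures orthologous relationships.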
To assess the significance of a sequence comparison, an expectation value or E-value can be calculated. This value represents the number of different alignments with the observed alignment score or better that are expected to occur in the database search simply by chance. The E-value is a widely accepted measure for assessing a potential biological relationship, as it indicates how likely it is that the match was found by chance. Smaller E-values indicate a higher likelihood of an underlying biological relationship. In this study, we use both E-value and sequence identity as parameters to quantify sequence similarity. However, E-values depend on a number of computational factors, such as the length of the query protein and the size of the search database. These issues prevent the E-value from being a fully reliable indicator of homology, as addressed in Fig. 1 and related discussions.
The data and results are publicly available at our website.
This research is supported by USDA/CSREES-2004-25604-14708 and NSF/ITR-IIS-0407204. We would like to thank the anonymous reviewers for their helpful suggestions.
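The dependence of the E-value on query length and database size can be made concrete with the Karlin-Altschul relation E = K·m·n·e^(−λS), the standard statistical model for local alignment scores. The text does not spell this formula out, so the sketch below is background rather than the authors' procedure. The function name `expected_alignments` is hypothetical, the values of `lam` and `k` in the usage note are illustrative placeholders, and edge-effect corrections to the effective lengths are ignored.

```python
import math


def expected_alignments(score, query_len, db_len, lam, k):
    """Karlin-Altschul estimate of the number of chance alignments
    scoring at least `score`: E = K * m * n * exp(-lambda * S).

    `lam` and `k` depend on the scoring matrix and gap penalties and
    must be supplied by the caller. Effective-length corrections are
    not applied in this sketch.
    """
    return k * query_len * db_len * math.exp(-lam * score)
```

With any fixed `lam` and `k` (for example the placeholder values `lam=0.267`, `k=0.041`), doubling `db_len` doubles the E-value for the same raw score, so the same alignment looks less significant when searched against a larger database.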
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.