Dynamics of domain coverage of the protein sequence universe
© Rekapalli et al.; licensee BioMed Central Ltd. 2012
Received: 29 April 2012
Accepted: 11 November 2012
Published: 16 November 2012
The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its “dark matter”.
Here we suggest that the true size of “dark matter” is much larger than stated by current definitions. We propose an approach to reducing the size of “dark matter” by identifying and subtracting regions in protein sequences that are not likely to contain any domain.
Recent improvements in computational domain modeling result in a decrease, albeit slowly, in the relative size of “dark matter”; however, its absolute size increases substantially with the growth of sequence data.
The protein universe is the collection of all proteins of every biological species that lives or has lived on earth. Its basic properties are the subject of rigorous investigation[2, 3], because it is an essential foundation of all biology. The currently known protein space, which is the part of the protein universe that has been revealed by DNA sequencing, consists of more than 16 million protein sequences in the non-redundant (nr) database (as of December 8, 2011), and its size is rapidly increasing due to recent technological advances[4, 5]. Only a small fraction of the current protein space can be analyzed by traditional experimental techniques; therefore, computational classification of protein sequences and their assignment to known biological functions is critical[6, 7].
Proteins are composed of one or more domains, parts that are conserved in sequence and structure and that can evolve and function independently. Several valid and often overlapping definitions of protein domains exist, starting with the original definition by Wetlaufer, as stable units of protein structure that could fold autonomously. In terms of protein sequences, domains are clusters of consecutive residues exhibiting various levels of conservation. Domains vary in length from 40 to nearly 700 residues; however, 90% of surveyed domains are shorter than 200 residues, with an average of approximately 100 residues.
The use of profile hidden Markov models (HMMs) that capture the conserved sequence features of protein domains[7, 13, 14] is arguably the most successful computational approach for identifying protein domains, and the Pfam (Protein Families) database is the premier repository, currently containing 13,672 protein domain models in its high-quality, curated Pfam-A part. Another popular resource, the Conserved Domain Database (CDD) at the National Center for Biotechnology Information, is a larger, partially redundant collection of domain and multi-domain models imported from various sources, including Pfam. ProDom and ADDA are also important resources aimed at developing high-quality domain models. Using Pfam and CDD profiles, recent computational analyses have assigned 72% of all protein sequences in the nr database and nearly 80% of all sequences in the curated UniProtKB database to known protein families. The remaining sequences are uncharacterized and considered to be the “dark matter” of the protein universe. Levitt proposed four potential components comprising “dark matter”: (i) sequences that are erroneous; (ii) low-complexity, non-globular sequences; (iii) known but unrecognized protein domains; and (iv) novel protein domains to be discovered.
In this study, we propose to expand the definition of “dark matter” by including regions in partly covered protein sequences that are not characterized and do not have any domain match. In addition to domain coverage, detecting regions in protein sequences that are unlikely to contain any domain considerably reduces the size of “dark matter”. Finally, we show that despite substantial improvements in computational domain modeling and tools for their identification, the relative size of “dark matter” decreases slowly, while its absolute size increases dramatically with the growth of sequence data.
Results and discussion
Further defining “dark matter” of the protein sequence universe
Many resources for computational domain finding exist. The original “dark matter” analysis by Levitt utilized CDD profiles. However, we argue that while CDD is superior in overall computational coverage, it may not be the best choice for specifically defining protein domains. Many CDD profiles are built from sources such as Clusters of Orthologous Groups of Proteins (COG) and Protein Clusters (PRK) that are not specialized domain databases (e.g., COG focuses on evolutionary relationships and PRK on basic relatedness between protein sequences). Both COG and PRK capture similarity between protein sequences regardless of their domain composition. As a result, many CDD profiles cover full-length proteins including regions for which domain information is unavailable. In contrast, the Pfam models are built primarily for protein domains and are known for excellent specificity. As such, Pfam models are integrated in many other resources including CDD.
Based on the arguments presented above, we determined that Pfam domain models are better suited for the purpose of defining the size of “dark matter” in the protein sequence space. Furthermore, data on Pfam coverage of a large sequence space are available for comparison. The latest Pfam release (Pfam 26) is reported to cover nearly 80% of protein sequences in the UniProtKB database, but only 57% of amino acid (aa) residues in all protein sequences in this database. We ran Pfam 26 on the latest release of the NCBI nr database and found that it covers only 51.39% of amino acid residues in its 16.39 million sequences. Thus, the size of “dark matter”, defined as a lack of domain information, appeared to be nearly half of the currently known protein space. The difference between the Pfam domain coverage of UniProtKB reported by the Pfam team and that of the nr database reported here appeared to be significant. It may reflect the fact that UniProtKB is slightly smaller in size than the nr database, but it could also be due to potential problems in the way calculations are done on such a large data set. Access to original data is limited due to its prohibitive size (a flat file size is cumulatively over 600 MB); thus, it seems important to report numbers obtained in an independent analysis, especially because according to our calculations the size of “dark matter” is larger. To clarify this point, we repeated our analysis using the latest release of the UniProtKB database (September 2012) and obtained 53.8% domain coverage, which is close to the numbers reported by the Pfam team.
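The gap between sequence-level coverage (nearly 80%) and residue-level coverage (57% or less) discussed above can be illustrated with a minimal Python sketch. The function name `coverage_summary`, the toy sequence lengths, and the toy domain coordinates are hypothetical and are not taken from the study; the sketch also assumes that domain hits within a sequence do not overlap.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end) on a protein


def coverage_summary(
    lengths: Dict[str, int],
    domains: Dict[str, List[Interval]],
) -> Tuple[float, float]:
    """Return (fraction of sequences with a domain, fraction of residues in domains).

    Assumes the domain intervals of one sequence do not overlap.
    """
    sequences_with_hit = 0
    covered_residues = 0
    for seq_id in lengths:
        hits = domains.get(seq_id, [])
        if hits:
            sequences_with_hit += 1
        covered_residues += sum(end - start + 1 for start, end in hits)
    total_residues = sum(lengths.values())
    return sequences_with_hit / len(lengths), covered_residues / total_residues


# Toy data (hypothetical): three proteins, two of which have domain hits.
toy_lengths = {"A": 300, "B": 500, "C": 200}
toy_domains = {"A": [(10, 109)], "B": [(1, 150), (300, 399)]}
seq_fraction, residue_fraction = coverage_summary(toy_lengths, toy_domains)
```

In this toy set two of three sequences carry at least one domain, yet only 350 of 1,000 residues fall inside a domain, which is the kind of discrepancy that motivates counting residues rather than sequences.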
Can identification of specific regions other than domains reduce the size of “dark matter”?
Computational coverage of the protein sequence space
Total sequence space
5.64E+09  4.14E+08  3.74E+08  6.78E+07  5.43E+07  9.10E+08
2.90E+09  2.72E+08  1.20E+08  4.65E+07  4.62E+07  4.84E+08
A large section of protein space can be safely subtracted from “dark matter”
As we have shown above, various computationally identifiable regions in protein sequences (e.g., transmembrane helices, low-complexity regions, etc.) cannot be used to reduce the size of “dark matter”. However, a large section of “dark matter” apparently can be effectively predicted not to contain any domain. Once all domains are identified in all protein sequences, we can identify regions that are both (i) too short to contain a domain and (ii) located between pairs of known domains or between a known domain and the protein terminus (N or C). For example, such positions are shown in grey in Figure 1. To calculate the contribution of such regions to the total sequence space, we decided to set their size limit at 50 aa. The reason behind this number is that, whereas some domains are smaller than 50 aa, domains are never located adjacent to each other without at least a small connecting linker. The average size of interdomain linkers was calculated to be 6-8 aa. Thus, a 50 aa cutoff accounts for the smallest domains bordered by average-size linkers. We have calculated that such regions cover approximately 9% of the total protein sequence space (5.09E+08 aa), which is a substantial share. Thus, by subtracting these regions from current “dark matter”, we effectively decrease its size from 48.6% to 39.6%.
Relative size of “dark matter” is shrinking, albeit slowly
The trend shown in Figure 3 suggests that “the dark matter problem” is slowly being solved. The most recent advances in computational domain modeling and identification, such as the latest Pfam 26 release and the underlying tool development, resulted in doubling the rate of improvement in domain coverage. However, the absolute size of “dark matter” is still growing rapidly as genome sequencing progresses.
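The subtraction rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline used in the study: the function name `short_flanking_gaps` is hypothetical, coordinates are taken to be 1-based and inclusive, the 50 aa limit is treated as inclusive, and a protein without any domain hit is assumed to contribute no such regions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def short_flanking_gaps(
    seq_len: int,
    domains: List[Interval],
    max_gap: int = 50,  # size limit from the text; inclusive here by assumption
) -> List[Interval]:
    """Return uncovered stretches of at most max_gap residues that lie between
    two domains or between a domain and a protein terminus."""
    if not domains:
        # No known domain: the rule does not apply, the whole protein stays dark.
        return []
    gaps: List[Interval] = []
    prev_end = 0  # position 0 stands for "before the N-terminus"
    for start, end in sorted(domains):
        if start > prev_end + 1:
            gaps.append((prev_end + 1, start - 1))
        prev_end = max(prev_end, end)
    if prev_end < seq_len:
        gaps.append((prev_end + 1, seq_len))  # stretch up to the C-terminus
    return [(s, e) for s, e in gaps if e - s + 1 <= max_gap]


# Hypothetical 400 aa protein with two domains.
example = short_flanking_gaps(400, [(31, 130), (151, 250)])
subtracted = sum(e - s + 1 for s, e in example)
```

In this hypothetical protein the N-terminal stretch (30 aa) and the interdomain stretch (20 aa) qualify, while the 150 aa C-terminal stretch is long enough to hold an undiscovered domain and stays in “dark matter”. For the reported totals, 5.09E+08 / 5.64E+09 is about 0.09, consistent with the 9% quoted above.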
Computational coverage of the protein sequence space, which is generated by genome sequencing projects, is an important process for our understanding of life. We propose a biologist-centered view on current computational coverage, where not only completely non-covered protein sequences, but also parts of partially covered protein sequences that are not occupied by protein domains are considered “dark matter”. Using high-throughput computing we show that the unexplored space of the protein sequence universe is larger than previously defined and that despite substantial improvements in bioinformatics during the last three years, the relative size of “dark matter” is decreasing very slowly.
The following releases of the NCBI nr (non-redundant) database were used: April 4, 2009 (nrApr09), September 9, 2010 (nrSep10), and December 8, 2011 (nrDec11). The UniProtKB release of September 2012 was used to calculate its domain coverage. Domain models/HMMs were retrieved from three recent versions of the Pfam protein families database (Pfam-A portion only): Pfam 22.0, Pfam 24.0, and Pfam 26.0. Conserved Domain Database version 3.02 was used to obtain its more than 78,000 position-specific scoring matrices (PSSMs).
Software for identification of domains and regions in protein sequences
Protein sequence regions were identified using standard software packages and cutoffs: low-complexity regions, SEG; coiled coils, PairCoil2; transmembrane regions, TMHMM2.0c; and signal peptides, Phobius. Protein sequences were scanned against Pfam domain models (profile HMMs) using hmmscan of the HMMER v.3.0 package with the cut_ga filter and against CDD PSSMs using RPS-BLAST with default parameters. To fully reproduce earlier steps in computational domain coverage with Pfam 22.0, we used hmmpfam of HMMER v.2.3.2 adapted for the Kraken supercomputer, as described earlier. The amino acid coverage was calculated for each protein sequence in the respective database based on the following considerations. For non-overlapping domains and regions, the amino acid coverage is the sum of domain and region lengths. If a domain and a region overlap, priority is given to the domain when computing domain coverage. For overlapping domains with satisfactory E values (above the threshold for domain identification), the length of the longest domain was taken into consideration.
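The three coverage rules above can be read as the following Python sketch. It is one possible interpretation rather than the code used in the study: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and "the longest domain was taken" is implemented as a greedy choice that keeps the longest domain of any overlapping group and discards the ones that overlap it.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def resolve_domains(domains: List[Interval]) -> List[Interval]:
    """Keep the longest domain wherever domains overlap (greedy, longest first)."""
    kept: List[Interval] = []
    by_length = sorted(domains, key=lambda d: (-(d[1] - d[0] + 1), d[0]))
    for dom in by_length:
        if not any(overlaps(dom, k) for k in kept):
            kept.append(dom)
    return sorted(kept)


def residue_coverage(
    domains: List[Interval], regions: List[Interval]
) -> Tuple[int, int]:
    """Return (residues covered by domains, residues covered by regions only).

    Region residues that fall inside a kept domain are credited to the domain.
    """
    domain_positions = set()
    for start, end in resolve_domains(domains):
        domain_positions.update(range(start, end + 1))
    region_positions = set()
    for start, end in regions:
        region_positions.update(range(start, end + 1))
    return len(domain_positions), len(region_positions - domain_positions)


# Hypothetical protein: two overlapping domain hits, one separate hit,
# one region overlapping a domain and one free-standing region.
dom_aa, region_aa = residue_coverage(
    [(1, 100), (80, 140), (200, 260)],
    [(90, 120), (300, 320)],
)
```

Position sets keep the sketch short and are adequate for a single protein; a scan over millions of sequences would more likely use sorted interval arithmetic.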
All computational analyses were performed in a local computing environment. Computationally intensive tasks were carried out using the Intel X86_64 Linux cluster (Newton) with a total of 4,200 processor cores at the University of Tennessee and the Cray XT5 supercomputer (Kraken) with a total of 112,896 processor cores at the Oak Ridge National Laboratory. Tasks were automated using a combination of C, PHP, and MPI scripts.
This work was supported in part by the Laboratory Directed Research and Development program at the Oak Ridge National Laboratory managed by UT-Battelle, LLC, under contract DE-AC05-00OR22725. Allocation of advanced computing resources (Kraken Supercomputer) was provided by the National Science Foundation.
- Levitt M: Nature of the protein universe. Proc Natl Acad Sci USA. 2009, 106: 11079-11084. 10.1073/pnas.0905029106.
- Koonin EV, Wolf Y, Karev GP: The structure of the protein universe and genome evolution. Nature. 2002, 420: 218-223. 10.1038/nature01256.
- Chothia C, Gough J, Vogel C, Teichmann SA: Evolution of the protein repertoire. Science. 2003, 300: 1701-1703. 10.1126/science.1085371.
- Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotech. 2008, 26: 1135-1145. 10.1038/nbt1486.
- Kahn SD: On the future of genomic data. Science. 2011, 331: 728-729. 10.1126/science.1197891.
- Eisenberg D, Marcotte EM, Xenarios I, Yeates TO: Protein function in the post-genomic era. Nature. 2000, 405: 823-826. 10.1038/35015694.
- Sammut SJ, Finn RD, Bateman A: Pfam 10 years on: 10 000 families and still growing. Brief Bioinform. 2008, 9: 210-219. 10.1093/bib/bbn010.
- Chothia C: One thousand families for the molecular biologist. Nature. 1992, 357: 543-544. 10.1038/357543a0.
- Wetlaufer DB: Nucleation, rapid folding, and globular intrachain regions in proteins. Proc Natl Acad Sci USA. 1973, 70: 697-701. 10.1073/pnas.70.3.697.
- Jones S, Stewart M, Michie A, Swindells MB, Orengo C, Thornton JM: Domain assignment for protein structures using a consensus approach: characterization and analysis. Protein Sci. 1998, 7: 233-242.
- Islam SA, Sternberg MJ: Identification and analysis of domains in proteins. Protein Eng. 1995, 8: 513-525. 10.1093/protein/8.6.513.
- Wheelan SJ, Marchler-Bauer A, Bryant SH: Domain size distributions can predict domain boundaries. Bioinformatics. 2000, 16: 613-618. 10.1093/bioinformatics/16.7.613.
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763. 10.1093/bioinformatics/14.9.755.
- Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci USA. 1998, 95: 5857-5864. 10.1073/pnas.95.11.5857.
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer EL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2012, 40: D290-D301. 10.1093/nar/gkr1065.
- Marchler-Bauer A, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Lu F, Marchler GH, Mullokandov M, Omelchenko MV, Robertson CL, Song JS, Thanki N, Yamashita RA, Zhang D, Zhang N, Zheng C, Bryant SH: CDD: a conserved domain database for the functional annotation of proteins. Nucleic Acids Res. 2011, 39: D225-229. 10.1093/nar/gkq1189.
- Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D: The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 2005, 33: D212-215.
- Heger A, Wilton CA, Sivakumar A, Holm L: ADDA: a domain database with global coverage of the protein universe. Nucleic Acids Res. 2005, 33: D188-191.
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinforma. 2003, 4: 41. 10.1186/1471-2105-4-41.
- Klimke W, Agarwala R, Badretdin A, Chetvernin S, Ciufo S, Fedorov B, Kiryutin B, O'Neill K, Resch W, Resenchuk S, Schafer S, Tolstoy I, Tatusova T: The National Center for Biotechnology Information's Protein Clusters database. Nucleic Acids Res. 2009, 37: D216-223. 10.1093/nar/gkn734.
- Huang YH, Ferriers L, Clarke DJ: Comparative functional analysis of the RcsC sensor kinase from different Enterobacteriaceae. FEMS Microbiol Lett. 2009, 293: 248-254. 10.1111/j.1574-6968.2009.01543.x.
- Wong WC, Maurer-Stroh S, Eisenhaber F: More than 1,001 problems with protein domain databases: transmembrane regions, signal peptides and the issue of sequence homology. PLoS Comput Biol. 2010, 6: e1000867. 10.1371/journal.pcbi.1000867.
- Wootton JC, Federhen S: Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem. 1993, 17: 149-163. 10.1016/0097-8485(93)85006-X.
- Lupas A: Predicting coiled-coil regions in proteins. Curr Opin Struct Biol. 1997, 7: 388-393. 10.1016/S0959-440X(97)80056-5.
- Miyazaki S, Kuroda Y, Yokoyama S: Identification of putative domain linkers by a neural network – application to a large sequence database. BMC Bioinforma. 2006, 7: 323. 10.1186/1471-2105-7-323.
- Wong WC, Maurer-Stroh S, Eisenhaber F: Not all transmembrane helices are born equal: Towards the extension of the sequence homology concept to membrane proteins. Biol Direct. 2011, 6: 57. 10.1186/1745-6150-6-57.
- George RA, Heringa J: An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng. 2002, 15: 871-879. 10.1093/protein/15.11.871.
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A: The Pfam protein families database. Nucleic Acids Res. 2008, 36: D281-D288. 10.1093/nar/gkn226.
- Finn RD, Mistry J, Tate J, Coggill PC, Heger A: The Pfam protein families database. Nucleic Acids Res. 2010, 38: D211-D222. 10.1093/nar/gkp985.
- Eddy SR: Accelerated profile HMM searches. PLoS Comput Biol. 2011, 7: e1002195. 10.1371/journal.pcbi.1002195.
- McDonnell AV, Jiang T, Keating AE, Berger B: Paircoil2: improved prediction of coiled coils from sequence. Bioinformatics. 2006, 22: 356-358. 10.1093/bioinformatics/bti797.
- Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305: 567-580. 10.1006/jmbi.2000.4315.
- Kall L, Krogh A, Sonnhammer EL: Advantages of combined transmembrane topology and signal peptide prediction–the Phobius web server. Nucleic Acids Res. 2007, 35: W429-432. 10.1093/nar/gkm256.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Rekapalli B, Halloy C, Zhulin IB: HPS-HMMER: A Tool for Protein Domain Identification on A Large Scale. Proceedings of the 24th ACM Symposium on Applied Computing; 9-12 March 2009. 2009, Honolulu, Hawaii, 766-770.