A kingdom-specific protein domain HMM library for improved annotation of fungal genomes
© Alam et al; licensee BioMed Central Ltd. 2007
Received: 26 February 2007
Accepted: 10 April 2007
Published: 10 April 2007
Pfam is a general-purpose database of protein domain alignments and profile Hidden Markov Models (HMMs), which is very popular for the annotation of sequence data produced by genome sequencing projects. Pfam provides models that are often very general in terms of the taxa that they cover and it has previously been suggested that such general models may lack some of the specificity or selectivity that would be provided by kingdom-specific models.
Here we present a general approach to create domain libraries of HMMs for sub-taxa of a kingdom. Taking fungal species as an example, we construct a domain library of HMMs (called Fungal Pfam or FPfam) using sequences from 30 genomes, consisting of 24 species from the ascomycetes group and two basidiomycetes, Ustilago maydis, a fungal pathogen of maize, and the white rot fungus Phanerochaete chrysosporium. In addition, we include the microsporidian Encephalitozoon cuniculi, an obligate intracellular parasite, and two non-fungal species, the oomycetes Phytophthora sojae and Phytophthora ramorum, both plant pathogens. We evaluate the performance in terms of coverage against the original 30 genomes used in training FPfam and against five more recently sequenced fungal genomes that can be considered as an independent test set. We show that kingdom-specific models such as FPfam can find instances of both novel and well-characterized domains, increase overall coverage and detect more domains per sequence, with typically higher bitscores than Pfam for the same domain families. An evaluation of the effect of changing E-values on the coverage shows that the performance of FPfam is consistent over the range of E-values applied.
Kingdom-specific models are shown to provide improved coverage. However, as the models become more specific, some sequences found by Pfam may be missed by the models in FPfam and some of the families represented in the test set are not present in FPfam. Therefore, we recommend that both general and specific libraries are used together for annotation and we find that a significant improvement in coverage is achieved by using both Pfam and FPfam.
The number of genomes being sequenced now exceeds 2000. Of these, as of February 2007, 510 are completed while 1091, 695 and 62 bacterial, eukaryotic and archaeal genomes (respectively) are still underway. Much of this genomic sequence is relatively poorly annotated and one of the major challenges in bioinformatics is the computational annotation of this massive amount of data in a high-throughput manner. Genome annotation can be classified into three levels: the nucleotide, protein and process levels. Databases such as PROSITE, PRINTS, SMART, TIGRFAMs or Pfam, which keep information in the form of motifs, alignment blocks, or profiles, provide a reference for the annotation at the protein level, where the main aim is to identify conserved regions and domains within the protein sequences predicted at the nucleotide annotation stage. InterPro provides an integrated resource to cross-reference these motif or domain databases.
The Pfam database, in particular, has a wealth of information about approximately 8000 domains and plays a major role in achieving such high-throughput annotation of newly sequenced genomes, due to its specialized profile Hidden Markov Models (HMMs) [11, 12]. TIGRFAMs is another similar database of protein families based on HMMs designed to specifically support large sequencing projects, although this has less coverage with under 2500 models in release 4.1, and is focused more towards complete proteins than domains. Profile HMMs are flexible, probabilistic models that can be used to describe the consensus patterns shared by sets of homologous protein/domain sequences. They summarise the shared statistical features of these homologous sequences in a way that allows efficient searching for matches in translated DNA sequences corresponding to predicted protein-coding genes. HMMs in the Pfam database are constructed from an alignment of a representative set of sequences for each protein domain, called a seed alignment. The seed alignments are tested and improved by manual curation, and by application to large databases like the Universal Protein (UniProt) database. A key issue, though, is the trade-off between sensitivity and specificity of the representative seeds and the corresponding models. If the seeds get larger and increasingly general, then they may lose specificity.
It has previously been reported that more specific HMMs, built from sequences obtained from a less diverged set of species, can lead to improved sensitivity and specificity in the detection of domains and will therefore provide improved coverage when annotating proteins in related species. The HMM library TLFAM-Pro has been developed for use with prokaryotes and some results of using the method have been described. About 3000 ClustalW alignments from NCBI's database of Clusters of Orthologous Groups (COGs), as of 2001, were used to compile HMMs. It was found that, although TLFAM-Pro demonstrated higher scores and longer alignments, a search of the test dataset against Pfam yielded more total hits, suggesting that TLFAM-Pro may provide a useful complementary resource to Pfam. This preliminary study was carried out in 2002, when both the number of domains in Pfam and the number of available genomes were much smaller than now, and it is therefore unclear whether these results remain valid. It was also reported that archaeal- and fungal-specific TLFAM databases had been constructed, or were to be constructed in the near future, but we are not aware of any publications describing them and no implementation is currently available. In other restricted applications, it has been shown that kingdom-specific HMMs improve performance, as shown, for example, in the prediction of N-terminal myristoylation sites in plants. However, as far as we are aware, no large-scale study of the effectiveness of kingdom-specific HMMs for protein domain searching has been carried out. Given the rapidly increasing availability of un-annotated or partially annotated genomes across all kingdoms, it is important to determine whether more specific HMMs are useful for the annotation of these genomes. In this paper, we test this hypothesis specifically, taking the case of fungal genomes as an example.
A large number of complete and partial genome sequences have recently become publicly available for fungal species. We are involved in the development of the e-Fungi data warehouse, which provides tools for the comparative analysis of these genomes and associated functional data. As part of this project we are developing a pipeline for the automated annotation of new genomes as they become available. We are therefore interested in developing methods for identifying protein domains and it is important to obtain the best coverage possible. In this paper we describe a fungal-specific HMM library that was developed to carry out this task. This serves as an example of a kingdom-specific HMM library, and we evaluate its performance in comparison to the more general Pfam database. We compile the fungal-specific HMMs using genomic data from the 30 species represented in the current version of the e-Fungi data warehouse. We evaluate the increase in coverage provided by the fungal-specific models over those 30 species. In order to test the method on previously unseen data, we then evaluate its performance on five more recently sequenced genomes that were not included in the first release of the e-Fungi database used to construct the models. Our results demonstrate that a fungal-specific library does provide a significant increase in coverage and that best performance is achieved by combining results from the kingdom-specific HMM library with results from the standard Pfam library. We investigate how this improved coverage affects the distribution of identified multi-domain proteins and we investigate the functional annotation of families that show the largest difference in performance between the two libraries.
Results and discussion
Comparison of FPfam and Pfam results for sequences from 30 fungal genomes
Proteome sizes of 30 original fungal genomes and five test genomes (shown by asterisks)
FPfam and Pfam results comparison for the test set of five fungal species
In addition to these results, Pfam also picked up some more domains that are not yet included in the FPfam libraries. This suggests that a further improvement could be obtained in the annotation of novel genomes by applying both general and species-specific domain libraries.
Examples of domain instances missed by Pfam
The number of instances for categories A, Bf and Bp
The number of domains for categories A, Bf and Bp
Category-A and B instances for FPfam and Pfam domains in 30 original and five test genomes
Category B (f)
Category B (p)
Protein of unknown function (DUF229)
LICD Protein Family
Copper binding proteins, plastocyanin/az
Plant protein of unknown function (DUF94
Domain of unknown function DUF143
Laminin B (Domain IV)
Ribosomal protein S6
Fungal ornithine decarboxylase antizyme
Protein of unknown function (DUF1279)
GCC2 and GCC3
Trichodiene synthase (TRI5)
Somatotropin hormone family
Uncharacterised protein family (UPF0139)
ATP synthase E chain
Leucine-rich repeat N-terminal domain
Domains per sequence analysis
Comparison of bit-scores from Pfam and FPfam model searches
Effect of E-value cut-offs on sequence coverage
We have constructed a fungal-specific HMM library, FPfam, using sequences from 30 genomes and tested its performance against sequences from five new genomes. Our results show that FPfam provides improved sensitivity and coverage for domains represented in the library. By using FPfam, more sequences can be annotated as containing at least one of these domains and more multi-domain proteins are found at a given E-value cut-off. The best performance is obtained by combining FPfam with the general-purpose Pfam library, which finds some sequences missed by FPfam and allows additional domains to be located that are not represented in the current version of the FPfam library. Use of a kingdom-specific HMM library therefore effectively reduces the "twilight" zone and finds a significant number of difficult cases that might otherwise be missed. Indeed, the method demonstrates the ability to annotate additional examples of otherwise well-characterised, ubiquitous domains that Pfam misses, as well as fungal-specific, rare motifs that are generally not well represented in the standard Pfam HMM library.
Currently we are applying the domainer/mkdom algorithms to all predicted proteins from the 35 fungal species, in order to have a database like Pfam-B providing coverage for all protein sequences in our e-Fungi fungal database. The FPfam libraries will then be used in order to classify all fungal sequences into super-families, families and subfamilies in a hierarchical fashion. The FPfam families will be made available as full alignments of these domains.
The Pfam database
Pfam is a database of multiple alignments of conserved regions or domains in proteins. Current release 18 of Pfam comprises alignments for more than 7973 domains. The Pfam database has two parts: Pfam-A contains models constructed from human-curated multiple alignments covering 75% of UniProt (the largest available collection of protein sequences), while Pfam-B has models constructed from alignments obtained by an automated clustering of the rest of UniProt derived from the ProDom database. A recent development in the Pfam infrastructure is called Pfam clans or Pfam-C; this contains information about Pfam families that arise from a common ancestor. With ever-increasing coverage in protein databases, and based on human-curated alignments, Pfam is a highly suitable and usable database for the large-scale annotation of proteins arriving from newly sequenced genomes. The easiest way to do this is to scan newly predicted Open Reading Frames (ORFs) against the HMMs using hmmpfam, provided in the HMMER package.
A typical Pfam-A entry contains a seed alignment, an alignment of a representative set of sequences, an HMM built using the seed alignment, a full alignment of all (detectable) sequences in the family and a description of the family with additional details such as the threshold parameters used to create the full alignment. Pfam seed alignments are saved and remain stable as long as they are able to detect all the known members of the family; otherwise the missing members are added to the alignment to improve the sensitivity of the HMMs. Seed and full alignments are curated manually and then the Pfam-A entry is annotated and linked to other motif databases.
Identifying Pfam domains in 30 fungal species
Predicted ORFs from 30 fungal genomes, including two oomycetes, were obtained from the Broad Institute. These sequences were filtered for a length of more than 40 amino acids and the resulting proteome sizes for each genome are shown in Table 1. Pfam database release 18 was downloaded and installed locally. Each fungal sequence was scanned against Pfam HMMs using hmmpfam, from the HMMER package, applying an E-value cut-off of 0.1. With this cut-off, 57.15% of the total fungal proteins were found to contain at least one Pfam domain and 5314 different Pfam domains were detected in these 30 fungal species.
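The length filter and the coverage figure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, "more than 40 amino acids" is read as strictly greater than 40, and search hits are assumed to have already been parsed into `(protein_id, domain, evalue)` tuples rather than read from raw hmmpfam output.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first token of the header as the identifier.
            header, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=40):
    """Keep sequences strictly longer than min_length residues (assumed reading)."""
    return [(pid, seq) for pid, seq in records if len(seq) > min_length]


def percent_with_domain(protein_ids, hits, evalue_cutoff=0.1):
    """Percentage of proteins with at least one hit at or below the cut-off.

    hits is an iterable of (protein_id, domain, evalue) tuples; this tuple
    layout is an assumption made for the sketch.
    """
    ids = set(protein_ids)
    if not ids:
        return 0.0
    annotated = {pid for pid, _domain, evalue in hits
                 if evalue <= evalue_cutoff and pid in ids}
    return 100.0 * len(annotated) / len(ids)
```

Applied to a filtered proteome and its parsed search results, `percent_with_domain` gives the kind of "proteins with at least one domain" percentage reported in the text.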
Constructing a fungal-specific HMM library (FPfam)
We adopt the following procedure to construct a fungal-specific HMM library from the 30 original genomes:
a. For each domain, a maximum of two protein sequences per genome below an E-value cut-off of 1e-3 were obtained from the training dataset of fungal genomes. The training set of genomes is shown without asterisks in both Table 1 and the fungal species tree [see Additional File 2]. To avoid any bias towards the more closely related set of five genomes from the Saccharomyces 'sensu stricto' clade, the number of sequences to be included in the seed alignment from this group was reduced to a maximum of six. The E-value of 1e-3 was used to reduce the probability of introducing false positive hits into the seed alignments. A restriction of at least five sequences per model with an E-value less than 1e-3 reduced the number of domains to 2953. Furthermore, to avoid models becoming too specific, a maximum of four sequences were added from representative species of the different domains of life, selecting one homologue from human, mouse, plants and bacteria where available.
b. The set of sequences gathered for each of the 2953 domains was aligned using ClustalW. To be compatible with Pfam, the alignment format was converted to selex.
c. All domain alignments were gathered into a single flatfile, adding the default Pfam-A annotation and parameters.
d. Global and local HMMs were constructed using hmmbuild from HMMER.
e. HMMs were calibrated using hmmcalibrate from HMMER.
f. The resulting fungal specific Pfam-A like database, from now on called FPfam, was indexed for sequence comparison using hmmpfam.
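The seed selection rules in step (a) can be illustrated with a short Python sketch. It is not the pipeline used in this work: the function name, the input tuple layout and the choice to prefer the lowest E-values within each genome are assumptions, and the addition of up to four non-fungal homologues is left out.

```python
from collections import defaultdict


def select_seed_sequences(hits, sensu_stricto=frozenset(), evalue_cutoff=1e-3,
                          max_per_genome=2, max_sensu_stricto=6, min_seed_size=5):
    """Pick seed sequences per domain from parsed search hits.

    hits: iterable of (domain, genome, protein_id, evalue) tuples (assumed layout).
    sensu_stricto: names of the closely related genomes whose combined
    contribution to a seed is capped at max_sensu_stricto.
    Returns {domain: [protein_id, ...]} for domains with enough sequences.
    """
    by_domain = defaultdict(list)
    for domain, genome, protein_id, evalue in hits:
        if evalue < evalue_cutoff:  # "below" the cut-off, read as strictly less
            by_domain[domain].append((evalue, genome, protein_id))

    seeds = {}
    for domain, rows in by_domain.items():
        rows.sort()  # lowest E-value first; this preference is an assumption
        per_genome = defaultdict(int)
        clade_count = 0
        chosen = []
        for _evalue, genome, protein_id in rows:
            if per_genome[genome] >= max_per_genome:
                continue
            if genome in sensu_stricto:
                if clade_count >= max_sensu_stricto:
                    continue
                clade_count += 1
            per_genome[genome] += 1
            chosen.append(protein_id)
        # Domains with fewer than min_seed_size sequences get no model.
        if len(chosen) >= min_seed_size:
            seeds[domain] = chosen
    return seeds
```

The resulting sequence sets would then be aligned and passed to hmmbuild and hmmcalibrate as in steps (b) to (e).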
Protein sequences from 30 fungal genomes were scanned through the fungal version of Pfam (FPfam) database with the E-value cut-off of 0.1. FPfam results were compared with those obtained from searches against Pfam HMMs using the same E-value cut-off.
Testing FPfam on five new genomes
As a test case, ORFs from five more recently sequenced fungal genomes were obtained from the Broad Institute and from the DSM. These are the species marked with asterisks in Table 1 and the phylogenetic tree [see Additional File 2]. These genomes were filtered, removing protein sequences with lengths less than 40 amino acids. The resulting size of the proteome for each of the five new fungal genomes used in this test is shown in Table 1.
To perform the Pfam and FPfam comparison, each sequence from the five new fungal genomes was scanned against the HMMs from both libraries, using hmmpfam. The same E-value cut-off of 0.1 was applied in both cases. The libraries are calibrated in the same way, so we expect that the same E-value will result in a similar number of false positives in each case.
Comparison of bitscores between FPfam and Pfam hits
After the completion of all the hmmpfam searches against the training and test sets of genomes, using both the Pfam and FPfam HMMs, the HMMER normalized alignment scores (known as bitscores) were extracted. We divided the results into two main categories: A, where hits were available from both the Pfam and FPfam libraries, and B, where one of the libraries did not produce any hits. Bitscores were assigned to six bins of bitscore ranges and the frequency of hits was calculated for category A, where the FPfam score is higher than the Pfam score and vice versa (named FPfam>Pfam and Pfam>FPfam, respectively), and for category B, where either the FPfam or the Pfam result was missing (named Pfam-alone and FPfam-alone, respectively). To avoid frequencies being counted twice for category A, where both Pfam and FPfam bitscores were available, only the maximum score of the two was used to determine its respective bin.
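A minimal Python sketch of this categorisation and binning follows. The bin edges are hypothetical, since the text states that six bins were used but not their ranges; the dictionary layout and the treatment of exact ties are also assumptions.

```python
from bisect import bisect_right

# Hypothetical edges giving six bins: <50, 50-100, 100-200, 200-400, 400-800, >=800.
BIN_EDGES = [50, 100, 200, 400, 800]

LABELS = ("FPfam>Pfam", "Pfam>FPfam", "FPfam-alone", "Pfam-alone")


def categorise_bitscores(pfam_scores, fpfam_scores, edges=BIN_EDGES):
    """Count hits per category and bitscore bin.

    pfam_scores / fpfam_scores: dicts mapping (protein_id, domain) to a
    bitscore (assumed layout). Returns {label: [count per bin]}.
    """
    counts = {label: [0] * (len(edges) + 1) for label in LABELS}
    for key in set(pfam_scores) | set(fpfam_scores):
        p = pfam_scores.get(key)
        f = fpfam_scores.get(key)
        if p is not None and f is not None:
            # Category A: both libraries hit; bin by the larger score only.
            # Exact ties are counted as Pfam>FPfam here, which is an assumption.
            label = "FPfam>Pfam" if f > p else "Pfam>FPfam"
            score = max(p, f)
        elif f is not None:
            # Category B: the Pfam result is missing.
            label, score = "FPfam-alone", f
        else:
            # Category B: the FPfam result is missing.
            label, score = "Pfam-alone", p
        counts[label][bisect_right(edges, score)] += 1
    return counts
```

Because each category A pair contributes a single count at its maximum score, the totals across all four labels equal the number of distinct (protein, domain) pairs.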
Effect on the coverage of domains by varying E-value thresholds
The probability of false positives when searching a database of sequences is expressed in terms of E-values. To test the effect of E-value changes, we compared the coverage of sequences with at least one domain detected by either FPfam or Pfam alone to that of domains detected by considering the results from both libraries, applying a range of different E-value cut-offs (0.1, 1e-5, 1e-10).
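The comparison across cut-offs can be sketched as below. This is an illustrative Python outline only: the function name and the `(protein_id, evalue)` input layout are assumptions, and "combined" is taken to mean a protein with a qualifying hit in either library.

```python
def coverage_by_cutoff(n_proteins, pfam_hits, fpfam_hits,
                       cutoffs=(0.1, 1e-5, 1e-10)):
    """Percentage of proteins with at least one domain at each E-value cut-off.

    pfam_hits / fpfam_hits: iterables of (protein_id, evalue) tuples (assumed layout).
    Returns {cutoff: {"Pfam": pct, "FPfam": pct, "combined": pct}}.
    """
    if n_proteins <= 0:
        raise ValueError("n_proteins must be positive")
    pfam_hits = list(pfam_hits)
    fpfam_hits = list(fpfam_hits)
    result = {}
    for cutoff in cutoffs:
        pfam_ids = {pid for pid, evalue in pfam_hits if evalue <= cutoff}
        fpfam_ids = {pid for pid, evalue in fpfam_hits if evalue <= cutoff}
        result[cutoff] = {
            "Pfam": 100.0 * len(pfam_ids) / n_proteins,
            "FPfam": 100.0 * len(fpfam_ids) / n_proteins,
            # Union: covered by at least one of the two libraries.
            "combined": 100.0 * len(pfam_ids | fpfam_ids) / n_proteins,
        }
    return result
```

Tightening the cut-off can only shrink each set, so the three percentages are non-increasing as the E-value threshold decreases; how quickly each one falls is what the comparison examines.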
This work was supported by a BBSRC award "e-Fungi: an e-Science infrastructure for comparative functional genomics in fungal species". We are grateful to the support teams, especially Dr. Sarfraz Nadeem at the North-West Grid (the Manchester portal) and Dr. Nick Gresham from the Faculty of Life Sciences, University of Manchester, for providing computational resources and support.
- Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC: The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res. 2006, 34 (Database issue): D332-334. 10.1093/nar/gkj145.
- Ouzounis CA, Karp PD: The past, present and future of genome-wide re-annotation. Genome Biol. 2002, 3 (2): COMMENT2001. 10.1186/gb-2002-3-2-comment2001.
- Stein L: Genome annotation: from sequence to biology. Nat Rev Genet. 2001, 2 (7): 493-503. 10.1038/35080529.
- Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ: The PROSITE database. Nucleic Acids Res. 2006, 34 (Database issue): D227-230. 10.1093/nar/gkj063.
- Attwood TK, Bradley P, Flower DR, Gaulton A, Maudling N, Mitchell AL, Moulton G, Nordle A, Paine K, Taylor P: PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 2003, 31 (1): 400-402. 10.1093/nar/gkg030.
- Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P: SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 2006, 34 (Database issue): D257-260. 10.1093/nar/gkj079.
- Haft DH, Selengut JD, White O: The TIGRFAMs database of protein families. Nucleic Acids Res. 2003, 31 (1): 371-373. 10.1093/nar/gkg128.
- Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R: Pfam: clans, web tools and services. Nucleic Acids Res. 2006, 34 (Database issue): D247-251. 10.1093/nar/gkj149.
- Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci USA. 1998, 95 (11): 5857-5864. 10.1073/pnas.95.11.5857.
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L: InterPro, progress and status in 2005. Nucleic Acids Res. 2005, 33 (Database issue): D201-205.
- Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF: The genome sequence of Drosophila melanogaster. Science. 2000, 287 (5461): 2185-2195. 10.1126/science.287.5461.2185.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W: Initial sequencing and analysis of the human genome. Nature. 2001, 409 (6822): 860-921. 10.1038/35057062.
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006, 34 (Database issue): D187-191. 10.1093/nar/gkj161.
- Gollery M: Specialized hidden Markov model databases for microbial genomics. Comparative and Functional Genomics. 2003, 4 (2): 250-254. 10.1002/cfg.280.
- Gollery M: TLFAM – A New Set of Protein Family Databases. OMICS A Journal of Integrative Biology. 2002, 6 (1): 35-37. 10.1089/15362310252780825.
- Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001, 29 (1): 22-28. 10.1093/nar/29.1.22.
- Podell S, Gribskov M: Predicting N-terminal myristoylation sites in plant proteins. BMC Genomics. 2004, 5:
- e-Fungi. [http://www.e-fungi.org.uk/]
- Sonnhammer EL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997, 28 (3): 405-420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L.
- Zhang JR, Idanpaan-Heikkila I, Fischer W, Tuomanen EI: Pneumococcal licD2 gene is involved in phosphorylcholine metabolism. Mol Microbiol. 1999, 31 (5): 1477-1488. 10.1046/j.1365-2958.1999.01291.x.
- Pfam online database. [http://www.sanger.ac.uk/Software/Pfam/]
- Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-Lazareva B, Muruganujan A, Rabkin S: PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res. 2003, 31 (1): 334-341. 10.1093/nar/gkg115.
- Gouzy J, Corpet F, Kahn D: Whole genome protein domain analysis using a new method for domain clustering. Comput Chem. 1999, 23 (3–4): 333-340. 10.1016/S0097-8485(99)00011-X.
- Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang HZ, Lopez R, Magrane M: The universal protein resource (UniProt). Nucleic Acids Res. 2005, 33: D154-D159. 10.1093/nar/gki070.
- Bru C, Courcelle E, Carrère S, Beausse Y, Dalmar S, Kahn D: The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 2005, 33: D212-D215. 10.1093/nar/gki034.
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14 (9): 755-763. 10.1093/bioinformatics/14.9.755.
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 2003, 31 (13): 3497-3500. 10.1093/nar/gkg500.
- The-Broad-Institute: Fungal Genome Initiative. [http://www.broad.mit.edu/annotation/fgi/]
- Pel HJ, de Winde JH, Archer DB, Dyer PS, Hofmann G, Schaap PJ, Turner G, de Vries RP, Albang R, Albermann K: Genome sequencing and analysis of the versatile cell factory Aspergillus niger CBS 513.88. Nat Biotechnol. 2007, 25 (2): 221-231. 10.1038/nbt1282.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.