MLTreeMap - accurate Maximum Likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies
© Stark et al; licensee BioMed Central Ltd. 2010
Received: 23 April 2010
Accepted: 5 August 2010
Published: 5 August 2010
Shotgun sequencing of environmental DNA is an essential technique for characterizing uncultivated microbes in situ. However, the taxonomic and functional assignment of the obtained sequence fragments remains a pressing problem.
Existing algorithms are largely optimized for speed and coverage; in contrast, we present here a software framework that focuses on a restricted set of informative gene families, using Maximum Likelihood to assign these with the best possible accuracy. This framework ('MLTreeMap'; http://mltreemap.org/) uses raw nucleotide sequences as input, and includes hand-curated, extensible reference information.
We discuss how we validated our pipeline using complete genomes as well as simulated and actual environmental sequences.
In the field of microbial genomics, successful laboratory cultivation of naturally occurring microbes has become a major bottleneck [1–3]; this limits and biases our understanding of the biochemical capabilities and ecological roles of microbes in their habitats. Since cultivation is a prerequisite for standard genome sequencing approaches, we are still lacking genomic information for many important microbial lineages (including entire phylum-level groups [4, 5]). In addition, there is a sequencing backlog even for those strains that have been cultivated successfully; this however is being addressed now by directed sequencing efforts that are underway [6, 7]. Nevertheless, the severe biases and the large gaps in the worldwide collection of cultivated isolates make it difficult to fully appreciate evolutionary processes and microbial ecology, or to exploit the large repertoire of microbial genes that might be relevant to medicine and biotechnology. While techniques that analyze single cells, such as multiplexed microfluidics PCR  or single-cell genome sequencing [9, 10], can provide unequivocal genomic data in the absence of cultivation, these methods are still limited in terms of throughput and usability. Thus, the approach that presently generates the largest amount of unbiased microbial genome sequence data is 'metagenomics' (; also termed 'environmental sequencing').
More than 200 metagenomics projects are currently registered  at various stages of completion; these address a wide variety of habitats and microbial lifestyles [12–16]. Typically, in such projects, an environmental sample is processed by lysing cells and indiscriminately isolating genomic DNA; the latter is then fragmented and shotgun-sequenced to a desired depth. However, even when employing the latest next-generation, high-throughput DNA sequencing technologies, the large complexity and genomic heterogeneity of natural microbial communities often preclude de novo assembly of complete genomes from the data - instead, a large number of short to medium-sized sequence fragments are obtained. From these, quantitative inferences can already be made regarding genome sizes [17, 18], recombination rates , and functional repertoires [20, 21], among others. However, many of the perhaps more important ecological questions require the assignment of the sequence fragments to the microbial lineage they originate from, a process called 'binning' [12, 22].
An increasing number of algorithms have been devised for this task; these can largely be divided into two groups. The first consists of 'unsupervised' approaches [23–27], in which sequences are binned using signature-based algorithms that focus on nucleotide compositional signals (reflected in the relative frequencies of short nucleotide 'words'). These approaches require no external reference information a priori; instead, they learn to distinguish the major taxonomic groups from the data itself (although subsequent assignment to known taxonomic entities is often done). In contrast, 'supervised' approaches [28–34] require extensive, annotated, external reference information. For the most part, these approaches interpret the results of large-scale homology searches against sequence databases, sometimes followed by phylogeny reconstruction; the external reference information is usually derived from the available fully sequenced microbial genomes. For both types of approaches, the various implementations differ greatly in their speed, accuracy, coverage, ease of installation and use, and in the interpretation and visualization of the results. Owing to the size and nature of the input data, formal phylogenetics algorithms are relatively rarely used in these pipelines, with three exceptions: Maximum Parsimony in , Neighbor Joining in , and an approximate Maximum Likelihood approach in . That the Maximum Likelihood approach has not been applied more frequently is somewhat surprising, since it is arguably among the most accurate and best-described techniques in phylogenetics [35–38]. One reason for this is presumably the high computational cost of this approach, which makes it difficult to execute for very large numbers of sequence fragments.
Here, we describe a software framework ("MLTreeMap") that does employ full Maximum Likelihood, and which is specifically designed for metagenomics sequences. We significantly reduced the computational costs through algorithmic improvements, as well as through a focus on a restricted (but user-extensible) set of informative gene families. The aim of the framework is to cover the high-accuracy end of the tool spectrum, with a particular focus on consistency across different sources of input data. To achieve this, the package, a) starts from raw nucleotide sequences to avoid inconsistencies arising from different gene-calling strategies, b) corrects for frame-shifts and other errors on the fly to optimally extract marker genes, c) includes searches against 'off-target' reference sequences to avoid the detection of undesired deep paralogs, d) concatenates marker genes when several of them are observed in a given sequence fragment, and e) offers intuitive visualization features, both via the command-line as well as via the web-server. The framework contains hand-curated reference phylogenies and alignments; in the first full release that we describe here (MLTreeMap version 2.011), these references encompass a total of 44 distinct gene families that have been selected to address both taxonomic as well as functional aspects of microbial assemblages.
Results and Discussion
Finally, we tested the MLTreeMap pipeline not only with respect to taxonomic assignment, but also with respect to the functional characterization of samples. Currently, the pipeline covers four important enzyme families (RuBisCO, Nitrogenase/NifD, Nitrogenase/NifH, and Methane Monooxygenase). These families are represented by hand-curated alignments, and visualized in the form of annotated protein trees. Future versions of MLTreeMap will extend this set in order to cover a significantly larger number of important diagnostic protein/enzyme families that are indicative of core functions (metabolic and otherwise [55–59]). Figure 4A shows a typical result of MLTreeMap for the functional classification of a set of environmental sequence samples. Three datasets are shown, that each contain representatives of the RuBisCO enzyme family (Ribulose-1,5-bisphosphate carboxylase oxygenase). The mere presence of these genes in the sample could also have been deduced from simple BLAST searches on the data; however, the summary shown in Figure 4A reveals crucial, additional information: first, the mapped sequences show a clear separation into distinct sub-families of RuBisCO. The surface seawater sample is dominated by subfamily #1, the plant surface sample by subfamily #4b, and the distal human gut by subfamily #4a and other unclassified parts of the tree (subfamilies are designated according to ). Second, the functional placements tend to corroborate the taxonomic assignments that MLTreeMaps reports for the same samples (not shown); this enables checks for consistency and/or unexpected horizontal transfers. And third, the placements can be seen to differ dramatically in their distance from the root, that is, in their evolutionary 'depth' with respect to previously known members of the family. For example, in the case of the surface seawater, virtually all sequences were very close to the tips of the tree, in other words closely related to known examples of RuBisCO (mainly from Cyanobacteria and alpha-Proteobacteria). In contrast, instances of RuBisCO-like proteins in the human gut were observed much closer to the root, i.e., at a greater evolutionary distance from previously known sequences and in non-canonical subfamilies. From this, it would be much harder to predict their functions, and it is indeed conceivable that they are not functioning in CO2 fixation, but rather in other, possibly sulfur-related metabolic pathways (methionine salvage or yet other, uncharacterized pathways [60–62]). The standardization and ease of use provided by MLTreeMap allow for consistent, semi-quantitative analysis of the functional coding potential of entire collections of metagenomics samples - as an example, Figure 4B shows combined data for 11 distinct metagenomes. In this case, the coding capacities for nitrogen fixation and CO2 fixation have been compared across samples and sites. Large differences become apparent, including the known paucity of nitrogen fixation genes in some environments , but also surprises such as nitrogenase-like genes in the distal human gut. Here again, the availability of the annotated reference trees in the MLTreeMap output is crucial: the sequences are likely of a non-canonical, archaeal type, related to genes in Methanobrevibacter smithii, and are thought to function in a process other than nitrogen fixation [64, 65].
For both, functional as well as taxonomic assignments, MLTreeMap offers a number of user-definable parameter settings. Users can chose which of two phylogenetic reference trees to use (modified from  or ), and whether to use Maximum Likelihood or Maximum Parsimony (the latter works faster but is somewhat less accurate; see Figures 2 and 3). When choosing Maximum Likelihood, users can also request bootstrap replicates. However, bootstrapping will in most cases not be necessary since the input data is already divided into many independent sequence fragments (these constitute 'bootstraps' in some sense; the fragmentation is due to the lack of assembly in most metagenomics projects). Bootstrapping could of course be turned on for specific cases of interest, but for assessing entire datasets it is probably less advisable. This is because individual RAxML runs using all the columns of a given sequence alignment yield more accurate results than each individual bootstrapping run in which columns have been re-sampled [on average, only 65% of distinct input columns are used in each bootstrap, Berger et al., submitted; this becomes an issue particularly when input sequences are rather short to begin with]. The overall accuracy of MLTreeMap is fairly good already, but it could be further enhanced by improving the coverage and evenness of the reference trees and also by optionally giving deeply assembled contigs (i.e., those with high read coverage) correspondingly more weight in the final aggregation step. Future versions of the pipeline could also likely be optimized further with regards to computational speed - we note that currently much time is still spent outside RAxML, in the pre-processing steps. If further speed-ups can indeed be achieved, then the pipeline should cope well with further advances in sequencing technology - perhaps even to a point in the future when much of the raw data will be discarded immediately after sequencing, and only genes of interest (such as the phylogenetically and functionally informative genes assessed by MLTreeMap) will be kept.
MLTreeMap performs consistent and rapid placements of metagenomics sequence fragments into high-quality, manually curated reference phylogenies - with high accuracy, albeit covering only a restricted fraction of any given sample (around 1%). It focuses on phylogenetically and functionally informative genes, thereby aiming to capture and characterize core aspects of a microbial community. MLTreeMap is one of only a few frameworks that can address microbial eukaryotes on an equal footing with prokaryotes, and it can easily be extended by the user (with any specific gene family of interest). The pipeline will likely be best put to use when analyzing hundreds of samples in comparison: this should ultimately reveal quantitative correlations between certain taxonomic clades and certain functional gene abundance profiles, thus helping to address the classic question of 'who does what' in microbial assemblages.
Materials and methods
Annotated protein-coding genes from fully sequenced genomes were downloaded from STRING  and RefSeq . The phylogenetic 'tree-of-life' references were obtained from  and , but were subsequently modified: we removed genomes for which we were unable to obtain sequences, at the time, and added others. For the tree of , we made the representation of organisms non-redundant at the genus level, with a small number of exceptions for fast-evolving genera, and recomputed the best Maximum Likelihood tree, while keeping fixed the original topology of the published tree ('constraints' in RAxML). This computation was based on concatenated alignments of the exact same 40 reference genes as used by MLTreeMap. Note that the purpose of MLTreeMap is not to generate tree-of-life phylogenies de novo; instead these trees are provided externally [7, 47], we therefore chose to maintain their published topology. For the four functional reference families, gene family information was obtained from KEGG  (nifD: K02586, nifH: K02588, MMO: K08684) and from STRING  (RuBisCO: COG1850). In total, the current release 2.01 of MLTreeMap contains 11,069 genes in the reference data; on average, each gene family of interest is represented by 252 genes.
Implementation and Use
MLTreeMap is provided both online (albeit with input-size limitations) as well as offline in form of a command-line executable. The latter is designed with as few external runtime dependencies as possible: BLAST, GeneWise, HMMER and RAxML. Visualization of the results is optional, and a separate Perl-script (with additional dependencies) is provided for this purpose. When using the pipeline, individual reports are generated for each sequence fragment on which marker genes were detected. Aggregated reports are also generated, but this step may have to be repeated by the user (for example when running the pipeline in parallel on separate machines, or when re-weighting the fragments according to additional, external information such as assembly depth or sample size).
The MLTreeMap pipeline has only a few configurable parameters (including: choice of phylogenetic placement method, number of bootstraps, and choice of taxonomic reference phylogeny); other settings are hardcoded with the following default values: required significance of initial BLASTX hits (e = 0.01; database size fixed at 1'000'000), gap removal parameters for Gblocks (-t = p -s = y -u = n -p = t -b3 = 15 -b4 = 3 -b5 = h -b2 = [0.55 · #alignment_rows]), and required sequence length of the marker genes after alignment and gap removal (50 amino acids). Due to this latter threshold, the pipeline will not yield much useful information for samples with typical read lengths below 300 base pairs (indeed, 500 bp or longer is recommended). The Maximum Likelihood insertion in RAxML is typically done under the following settings: "-f v -m PROTGAMMAWAG" (the WAG substitution model yields the best likelihood scores on the phylogenetic reference trees, compared to all other amino acid substitution models available in RAxML; this was assessed using the RAxML "-f e" option for tree evaluation). For only 7 of the 44 protein families, a substitution model other than WAG is used (RTREV for COG0049, COG0090, COG0092, COG0093 and COG0100; CPREV for COG0201 and BLOSUM62 for Methane Monooxygenase). RAxML works with unrooted trees; however, the MLTreeMap pipeline reports all results in the context of rooted trees, for convenience (the re-rooting is hardcoded for each reference tree). Note that the actual Maximum Likelihood insertion step in MLTreeMap is clearly defined and fairly generic - it could in principle be performed also by software other than RAxML (for example by the PPLACER program; Matsen et al., personal communication; preprint at http://arxiv.org/abs/1003.5943). MLTreeMap can be compiled and executed locally, and previous versions are maintained at our website, for reference (together with the corresponding reference alignments and trees). We plan to update MLTreeMap yearly - each time updating the reference alignments with data from newly sequenced genomes, and extending the repertoire of functional reference families.
For the validation tests based on whole genomes, the query genomes were artificially fragmented into non-overlapping, consecutive stretches of 1'000 base pairs each. Prior to each test, the respective genome was removed from the reference phylogeny to avoid circularity, and MLTreeMap placements were made using either Maximum Parsimony or Maximum Likelihood (all other settings were identical; bootstrapping was not used). The resulting placements were then compared to the known positions of the query genomes in the reference tree, either by assessing the node distance or the taxonomic assignment. For the latter, the newly placed fragment was assigned to the highest taxonomic rank for which all genomes in the clade below the placement branch were in agreement. For the tests based on simulated metagenomes, we chose the Phrap assembly of the 'medium complexity' simulated dataset, available at http://fames.jgi-psf.org/. The expected target composition of this set is not simply defined by the list of constituent genomes ; instead, since the relative genome representation depends on the read coverage of each genome in the simulated set, we weighted all genomes accordingly.
Additional data files
All reference information contained in MLTreeMap (sequences, phylogenies) is available from the associated website http://mltreemap.org/.
List of Abbreviations
polymerase chain reaction
Ribulose-1,5-bisphosphate carboxylase oxygenase
This work was supported by the Swiss National Science Foundation, by the University of Zurich through its Research Priority Program "Systems Biology and Functional Genomics", and by the Emmy-Noether Program of the German Science Foundation (DFG). We are indebted to Dongying Wu for sharing detailed information regarding his global tree-of-life phylogeny, to Peer Bork for essential scientific and technical support during the early phases of the project, and to Jakob Pernthaler, Thomas Wicker and Michael Baudis for suggestions and critical discussions.
- Alain K, Querellou J: Cultivating the uncultured: limits, advances and future challenges. Extremophiles. 2009, 13 (4): 583-594. 10.1007/s00792-009-0261-3.PubMedView Article
- Ferrari BC, Winsley T, Gillings M, Binnerup S: Cultivating previously uncultured soil bacteria using a soil substrate membrane system. Nat Protoc. 2008, 3 (8): 1261-1269. 10.1038/nprot.2008.102.PubMedView Article
- Zengler K: Central role of the cell in microbial ecology. Microbiol Mol Biol Rev. 2009, 73 (4): 712-729. 10.1128/MMBR.00027-09.PubMed CentralPubMedView Article
- Hugenholtz P: Exploring prokaryotic diversity in the genomic era. Genome Biol. 2002, 3 (2): REVIEWS0003-10.1186/gb-2002-3-2-reviews0003.PubMed CentralPubMedView Article
- Liolios K, Chen IM, Mavromatis K, Tavernarakis N, Hugenholtz P, Markowitz VM, Kyrpides NC: The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2010, D346-354. 10.1093/nar/gkp848. 38 Database
- Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V, McEwen JE, Wetterstrand KA, Deal C: The NIH Human Microbiome Project. Genome Res. 2009, 19 (12): 2317-2323. 10.1101/gr.096651.109.PubMed CentralPubMedView Article
- Wu D, Hugenholtz P, Mavromatis K, Pukall R, Dalin E, Ivanova NN, Kunin V, Goodwin L, Wu M, Tindall BJ: A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature. 2009, 462 (7276): 1056-1060. 10.1038/nature08656.PubMed CentralPubMedView Article
- Ottesen EA, Hong JW, Quake SR, Leadbetter JR: Microfluidic digital PCR enables multigene analysis of individual environmental bacteria. Science. 2006, 314 (5804): 1464-1467. 10.1126/science.1131370.PubMedView Article
- Zhang K, Martiny AC, Reppas NB, Barry KW, Malek J, Chisholm SW, Church GM: Sequencing genomes from single cells by polymerase cloning. Nat Biotechnol. 2006, 24 (6): 680-686. 10.1038/nbt1214.PubMedView Article
- Ishoey T, Woyke T, Stepanauskas R, Novotny M, Lasken RS: Genomic sequencing of single microbial cells from environmental samples. Curr Opin Microbiol. 2008, 11 (3): 198-204. 10.1016/j.mib.2008.05.006.PubMed CentralPubMedView Article
- Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM: Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998, 5 (10): R245-249. 10.1016/S1074-5521(98)90108-9.PubMedView Article
- Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P: A bioinformatician's guide to metagenomics. Microbiol Mol Biol Rev. 2008, 72 (4): 557-578. 10.1128/MMBR.00009-08. Table of ContentsPubMed CentralPubMedView Article
- Handelsman J: Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004, 68 (4): 669-685. 10.1128/MMBR.68.4.669-685.2004.PubMed CentralPubMedView Article
- Eisen JA: Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol. 2007, 5 (3): e82-10.1371/journal.pbio.0050082.PubMed CentralPubMedView Article
- Raes J, Foerstner KU, Bork P: Get the most out of your metagenome: computational analysis of environmental sequence data. Curr Opin Microbiol. 2007, 10 (5): 490-498. 10.1016/j.mib.2007.09.001.PubMedView Article
- Tringe SG, Rubin EM: Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet. 2005, 6 (11): 805-814. 10.1038/nrg1709.PubMedView Article
- Raes J, Korbel JO, Lercher MJ, von Mering C, Bork P: Prediction of effective genome size in metagenomic samples. Genome Biol. 2007, 8 (1): R10-10.1186/gb-2007-8-1-r10.PubMed CentralPubMedView Article
- Angly FE, Willner D, Prieto-Davo A, Edwards RA, Schmieder R, Vega-Thurber R, Antonopoulos DA, Barott K, Cottrell MT, Desnues C: The GAAS metagenomic tool and its estimations of viral and microbial average genome size in four major biomes. PLoS Comput Biol. 2009, 5 (12): e1000593-10.1371/journal.pcbi.1000593.PubMed CentralPubMedView Article
- Johnson PL, Slatkin M: Inference of microbial recombination rates from metagenomic data. PLoS Genet. 2009, 5 (10): e1000674-10.1371/journal.pgen.1000674.PubMed CentralPubMedView Article
- Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC: Comparative metagenomics of microbial communities. Science. 2005, 308 (5721): 554-557. 10.1126/science.1107851.PubMedView Article
- Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP: A core gut microbiome in obese and lean twins. Nature. 2009, 457 (7228): 480-484. 10.1038/nature07540.PubMed CentralPubMedView Article
- McHardy AC, Rigoutsos I: What's in the mix: phylogenetic classification of metagenome sequence samples. Curr Opin Microbiol. 2007, 10 (5): 499-503. 10.1016/j.mib.2007.08.004.PubMedView Article
- Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner FO: TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. 2004, 5: 163-10.1186/1471-2105-5-163.PubMed CentralPubMedView Article
- McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007, 4 (1): 63-72. 10.1038/nmeth976.PubMedView Article
- Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T: Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Res. 2005, 12 (5): 281-290. 10.1093/dnares/dsi015.PubMedView Article
- Brady A, Salzberg SL: Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009, 6 (9): 673-676. 10.1038/nmeth.1358.PubMed CentralPubMedView Article
- Dick GJ, Andersson AF, Baker BJ, Simmons SL, Thomas BC, Yelton AP, Banfield JF: Community-wide analysis of microbial genome sequence signatures. Genome Biol. 2009, 10 (8): R85-10.1186/gb-2009-10-8-r85.PubMed CentralPubMedView Article
- Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res. 2007, 17 (3): 377-386. 10.1101/gr.5969107.PubMed CentralPubMedView Article
- Krause L, Diaz NN, Goesmann A, Kelley S, Nattkemper TW, Rohwer F, Edwards RA, Stoye J: Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res. 2008, 36 (7): 2230-2239. 10.1093/nar/gkn038.PubMed CentralPubMedView Article
- Monzoorul Haque M, Ghosh TS, Komanduri D, Mande SS: SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics. 2009, 25 (14): 1722-1730. 10.1093/bioinformatics/btp317.PubMedView Article
- von Mering C, Hugenholtz P, Raes J, Tringe SG, Doerks T, Jensen LJ, Ward N, Bork P: Quantitative phylogenetic assessment of microbial communities in diverse environments. Science. 2007, 315 (5815): 1126-1130. 10.1126/science.1133420.PubMedView Article
- Dutilh BE, Snel B, Ettema TJ, Huynen MA: Signature genes as a phylogenomic tool. Mol Biol Evol. 2008, 25 (8): 1659-1667. 10.1093/molbev/msn115.PubMed CentralPubMedView Article
- Wu M, Eisen JA: A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008, 9 (10): R151-10.1186/gb-2008-9-10-r151.PubMed CentralPubMedView Article
- Schreiber F, Gumrich P, Daniel R, Meinicke P: Treephyler: fast taxonomic profiling of metagenomes. Bioinformatics. 2010, 26 (7): 960-961. 10.1093/bioinformatics/btq070.PubMedView Article
- Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981, 17 (6): 368-376. 10.1007/BF01734359.PubMedView Article
- Felsenstein J: Inferring phylogenies. 2004, Sunderland, Mass.: Sinauer Assoc
- Whelan S, Lio P, Goldman N: Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet. 2001, 17 (5): 262-272. 10.1016/S0168-9525(01)02272-7.PubMedView Article
- Holder M, Lewis PO: Phylogeny estimation: traditional and Bayesian approaches. Nat Rev Genet. 2003, 4 (4): 275-284. 10.1038/nrg1044.PubMedView Article
- Delmotte N, Knief C, Chaffron S, Innerebner G, Roschitzki B, Schlapbach R, von Mering C, Vorholt JA: Community proteogenomics reveals insights into the physiology of phyllosphere bacteria. Proc Natl Acad Sci USA. 2009, 106 (38): 16428-16433. 10.1073/pnas.0905240106.PubMed CentralPubMedView Article
- Kunin V, Raes J, Harris JK, Spear JR, Walker JJ, Ivanova N, von Mering C, Bebout BM, Pace NR, Bork P: Millimeter-scale genetic gradients and community-level molecular convergence in a hypersaline microbial mat. Mol Syst Biol. 2008, 4: 198-10.1038/msb.2008.35.PubMed CentralPubMedView Article
- Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics. 2002, 18 (3): 502-504. 10.1093/bioinformatics/18.3.502.PubMedView Article
- Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22 (21): 2688-2690. 10.1093/bioinformatics/btl446.PubMedView Article
- Stamatakis A, Hoover P, Rougemont J: A rapid bootstrap algorithm for the RAxML Web servers. Syst Biol. 2008, 57 (5): 758-771. 10.1080/10635150802429642.PubMedView Article
- Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res. 2004, 14 (5): 988-995. 10.1101/gr.1865504.PubMed CentralPubMedView Article
- Durbin R: Biological sequence analysis: probabilistic models of proteins and nucleic acids. 1998, Cambridge [u.a.]: Cambridge Univ. PressView Article
- Talavera G, Castresana J: Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol. 2007, 56 (4): 564-577. 10.1080/10635150701472164.PubMedView Article
- Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward automatic reconstruction of a highly resolved tree of life. Science. 2006, 311 (5765): 1283-1287. 10.1126/science.1123061.PubMedView Article
- Kalyuzhnaya MG, Lapidus A, Ivanova N, Copeland AC, McHardy AC, Szeto E, Salamov A, Grigoriev IV, Suciu D, Levine SR: High-resolution metagenomics targets specific functional types in complex microbial communities. Nat Biotechnol. 2008, 26 (9): 1029-1034. 10.1038/nbt.1488.PubMedView Article
- Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M: Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007, 4 (6): 495-500. 10.1038/nmeth1043.PubMedView Article
- Tringe SG, Zhang T, Liu X, Yu Y, Lee WH, Yap J, Yao F, Suan ST, Ing SK, Haynes M: The airborne metagenome in an indoor urban environment. PLoS One. 2008, 3 (4): e1862-10.1371/journal.pone.0001862.PubMed CentralPubMedView Article
- Baker GC, Smith JJ, Cowan DA: Review and re-analysis of domain-specific 16 S primers. J Microbiol Methods. 2003, 55 (3): 541-555. 10.1016/j.mimet.2003.08.009.PubMedView Article
- Polz MF, Cavanaugh CM: Bias in template-to-product ratios in multitemplate PCR. Appl Environ Microbiol. 1998, 64 (10): 3724-3730.PubMed CentralPubMed
- Sipos R, Szekely AJ, Palatinszky M, Revesz S, Marialigeti K, Nikolausz M: Effect of primer mismatch, annealing temperature and PCR cycle number on 16 S rRNA gene-targetting bacterial community analysis. FEMS Microbiol Ecol. 2007, 60 (2): 341-350. 10.1111/j.1574-6941.2007.00283.x.PubMedView Article
- DeSantis TZ, Brodie EL, Moberg JP, Zubieta IX, Piceno YM, Andersen GL: High-density universal 16 S rRNA microarray analysis reveals broader diversity than typical clone library when sampling the environment. Microb Ecol. 2007, 53 (3): 371-383. 10.1007/s00248-006-9134-9.PubMedView Article
- Wagner M, Loy A, Klein M, Lee N, Ramsing NB, Stahl DA, Friedrich MW: Functional marker genes for identification of sulfate-reducing prokaryotes. Methods Enzymol. 2005, 397: 469-489. 10.1016/S0076-6879(05)97029-8.PubMedView Article
- Junier P, Molina V, Dorador C, Hadas O, Kim OS, Junier T, Witzel JP, Imhoff JF: Phylogenetic and functional marker genes to study ammonia-oxidizing microorganisms (AOM) in the environment. Appl Microbiol Biotechnol. 2010, 85 (3): 425-440. 10.1007/s00253-009-2228-9.PubMed CentralPubMedView Article
- Braker G, Zhou J, Wu L, Devol AH, Tiedje JM: Nitrite reductase genes (nirK and nirS) as functional markers to investigate diversity of denitrifying bacteria in pacific northwest marine sediment communities. Appl Environ Microbiol. 2000, 66 (5): 2096-2104. 10.1128/AEM.66.5.2096-2104.2000.PubMed CentralPubMedView Article
- Auguet JC, Borrego CM, Baneras L, Casamayor EO: Fingerprinting the genetic diversity of the biotin carboxylase gene (accC) in aquatic ecosystems as a potential marker for studies of carbon dioxide assimilation in the dark. Environ Microbiol. 2008, 10 (10): 2527-2536. 10.1111/j.1462-2920.2008.01677.x.PubMedView Article
- Foerstner KU, Doerks T, Creevey CJ, Doerks A, Bork P: A computational screen for type I polyketide synthases in metagenomics shotgun data. PLoS One. 2008, 3 (10): e3515-10.1371/journal.pone.0003515.PubMed CentralPubMedView Article
- Ashida H, Danchin A, Yokota A: Was photosynthetic RuBisCO recruited by acquisitive evolution from RuBisCO-like proteins involved in sulfur metabolism?. Res Microbiol. 2005, 156 (5-6): 611-618. 10.1016/j.resmic.2005.01.014.PubMedView Article
- Ashida H, Saito Y, Kojima C, Kobayashi K, Ogasawara N, Yokota A: A functional link between RuBisCO-like protein of Bacillus and photosynthetic RuBisCO. Science. 2003, 302 (5643): 286-290. 10.1126/science.1086997.PubMedView Article
- Imker HJ, Singh J, Warlick BP, Tabita FR, Gerlt JA: Mechanistic diversity in the RuBisCO superfamily: a novel isomerization reaction catalyzed by the RuBisCO-like protein from Rhodospirillum rubrum. Biochemistry. 2008, 47 (43): 11171-11173. 10.1021/bi801685f.PubMed CentralPubMedView Article
- Johnston AW, Li Y, Ogilvie L: Metagenomic marine nitrogen fixation--feast or famine?. Trends Microbiol. 2005, 13 (9): 416-420. 10.1016/j.tim.2005.07.002.PubMedView Article
- Raymond J, Siefert JL, Staples CR, Blankenship RE: The natural history of nitrogen fixation. Mol Biol Evol. 2004, 21 (3): 541-554. 10.1093/molbev/msh047.PubMedView Article
- Ohkuma M, Noda S, Kudo T: Phylogenetic diversity of nitrogen fixation genes in the symbiotic microbial community in the gut of diverse termites. Appl Environ Microbiol. 1999, 65 (11): 4926-4934.PubMed CentralPubMed
- Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M: STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 2009, D412-416. 10.1093/nar/gkn760. 37 Database
- Pruitt KD, Tatusova T, Klimke W, Maglott DR: NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 2009, D32-36. 10.1093/nar/gkn721. 37 Database
- Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T: KEGG for linking genomes to life and the environment. Nucleic Acids Res. 2008, D480-484. 36 Database
- Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D, Chen IM, Grechkin Y, Dubchak I, Anderson I: IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008, D534-538. 36 Database
- Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M: CAMERA: a community resource for metagenomics. PLoS Biol. 2007, 5 (3): e75-10.1371/journal.pbio.0050075.PubMed CentralPubMedView Article
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.