An improved probability mapping approach to assess genome mosaicism
© Zhaxybayeva and Gogarten; licensee BioMed Central Ltd. 2003
Received: 19 June 2003
Accepted: 15 September 2003
Published: 15 September 2003
Maximum likelihood and posterior probability mapping are useful visualization techniques for ascertaining the mosaic nature of prokaryotic genomes. However, posterior probabilities, especially when calculated for four-taxon cases, tend to overestimate the support for tree topologies. Furthermore, because of poor taxon sampling, four-taxon analyses are sensitive to the long branch attraction artifact. Here we extend the probability mapping approach by improving taxon sampling of the analyzed datasets, and by using bootstrap support values, a more conservative tool to assess reliability.
Quartets of orthologous proteins were complemented with homologs from selected reference genomes. The mapping of bootstrap support values from these extended datasets gives results similar to the original maximum likelihood and posterior probability mapping. The more conservative nature of the plotted support values makes it possible to focus further analyses on those protein families that strongly disagree with the majority or plurality of genes present in the analyzed genomes.
Posterior probability is a non-conservative measure for support, and posterior probability mapping only provides a quick estimation of phylogenetic information content of four genomes. This approach can be utilized as a pre-screen to select genes that might have been horizontally transferred. Better taxon sampling combined with subtree analyses prevents the inconsistencies associated with four-taxon analyses, but retains the power of visual representation. Nevertheless, a case-by-case inspection of individual multi-taxon phylogenies remains necessary to differentiate unrecognized paralogy and shared phylogenetic reconstruction artifacts from horizontal gene transfer events.
Keywords: maximum likelihood mapping; long-branch attraction; horizontal gene transfer; taxon sampling; bootstrap support values mapping
The analysis of four-taxon trees promises to provide valuable insight and visual documentation of genome mosaicism [1–5]. However, like other four-taxon analyses, our probability mapping approach for comparative genome analyses is vulnerable to the long branch attraction (LBA) artifact because it analyzes datasets consisting of only four sequences. LBA is a well-known phylogenetic artifact. It is especially well studied for the case of four-taxon trees (e.g., see [7–11]). In short, regardless of the reconstruction method and model used, if the branches are long enough, the reconstructed tree might be affected by LBA, although to different degrees. Furthermore, four-taxon analyses were shown to be unstable and misleading under some circumstances [12, 13]. The addition of more taxa can break up the long branches and increase reliability. Simulation studies have shown that increasing the size of a dataset by introducing additional homologous sequences improves the accuracy of the reconstruction  (see  and  for the recent discussion). An increase in the sequence lengths of the analyzed data can also improve the reliability of phylogenetic reconstruction, but lumping different putative orthologs into a single dataset would defeat the purpose of the probability mapping approach, i.e., the detection of genes that have incompatible evolutionary histories. Merging proteins with different histories into concatenated datasets would not help to resolve their phylogenies.
Here we report an extension of probability mapping that increases the number of homologous sequences per dataset, referred to throughout the rest of the article as Operational Taxonomic Unit (OTU) sampling, but retains the power of the original approach to visualize genomic mosaicism. A quartet of orthologous proteins (QuartOP) is defined as four homologs from four genomes that pick each other as top-scoring reciprocal hits in BLAST searches of the respective genomes (for more details see ). For each QuartOP detected in a genome quartet we add homologous sequences and evaluate the branching order of the QuartOP in 100 bootstrap samples. The bootstrap support values are then mapped into a barycentric coordinate system. We compare the mapping results with previously reported ones, and give examples that illustrate the utility of this approach in detecting horizontally transferred genes.
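To make the reciprocal top-hit criterion concrete, the following Python sketch tests whether four proteins form a QuartOP. It is an illustration only and not the implementation used in this study: the function name `is_quartop` and the lookup table `best_hit`, which maps a protein and a target genome to the top-scoring BLAST hit in that genome, are hypothetical.

```python
from itertools import permutations


def is_quartop(genes, best_hit):
    """Check the reciprocal top-hit criterion for four proteins.

    genes:    dict mapping genome name -> protein id (one protein per genome).
    best_hit: hypothetical lookup, dict mapping (protein id, target genome)
              -> id of the top-scoring BLAST hit in that target genome.
    """
    if len(genes) != 4:
        return False
    # Every protein must pick every other member as its top hit
    # in that member's genome (all 12 ordered pairs are checked).
    for (_, query), (target_genome, expected) in permutations(genes.items(), 2):
        if best_hit.get((query, target_genome)) != expected:
            return False
    return True
```

A single non-reciprocal pair is enough to reject the candidate, which is one reason the criterion is restrictive when lineage-specific duplications are present.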
Results and Discussion
Interdomain Genome Quartets
Interdomain transfer, interphylum transfer, or shared artifact?
The relation between the different bacterial phyla and the placement of the bacterial root remain uncertain (e.g., see ). Therefore, it is not clear which of the three unrooted trees for the genome quartet in question represents the true organismal phylogeny of the four genomes. The lack of phylogenetic signal alone should result in QuartOPs that map to the center of the triangle. However, we observe many genes that prefer one topology to the other two. The genes in figures 1A and 3A that group the ortholog from Halobacterium with its putative ortholog from Synechocystis might do so for a variety of different reasons:
A) Horizontal gene transfer
A1) between a mesophilic bacterium and a mesophilic archaeon
A2) between the extremely thermophilic bacteria Aquifex and Thermotoga
B) Phylogenetic reconstruction artifacts
B1) due to long branch attraction
B2) due to compositional bias
B3) due to the lack of phylogenetic signal
C) Unrecognized paralogy
D) This grouping reflects organismal evolutionary history.
Table 1. Analyses of the tree topologies for 18 candidates for horizontal gene transfer. Support for putative transfers between Bacteria and Archaea is shown in the B&A column, and between Aquifex and Thermotoga is shown in the A&T column. The compositional bias is listed as "strong" if both the halobacterial sequence and its nearest phylogenetic neighbor failed the test for homogeneous composition. See Materials and Methods for details on the analyses performed.
excision nuclease chain A
chromosome segregation SMC protein
Cyanobacteria, Rickettsia and Aquifex group within Archaea
Halobacterium groups within Bacteria
Archaeal type homologs are found in some bacteria
Uninterpretable: no resemblance with assumed organismal phylogeny
DNA gyrase, subunit B
Archaea do not form a group
DNA gyrase, subunit A
Both Thermotoga and Aquifex group with Archaea
Both Thermotoga and Aquifex group within Archaea
Both Thermotoga and Aquifex group within Archaea
50S ribosomal protein L2
30S ribosomal protein S19
The analyses depicted in figure 3 and table 1 demonstrate that bootstrap support value mapping in general, and support value mapping using extended datasets in particular, are useful in screening for genes that were transferred between divergent organisms. Replacing the genome from a mesophilic archaeon with that from an extremely thermophilic one changes the topology of the subtree that has plurality support. This observation is in agreement with the hypothesis that genes are more frequently shared between organisms that live in similar environments. However, given that the Halobacterium genome is renowned for its large number of genes with bacterial character [22, 23], the total number of genes identified in this study as putatively transferred between the mesophilic bacteria and the halobacteria is very small. There are several reasons for this observation. Useful phylogenetic information retained in molecular sequences is constantly overwritten by more recent substitution events. The more divergent the analyzed genomes are, the more QuartOPs will be undecided about the most supported topology. Furthermore, support value mapping can only identify gene transfers that resulted in orthologous replacement. Last but not least, the approach applied to assemble QuartOPs is overly restrictive. Lineage-specific duplications result in two orthologs being present in a single genome. These genes are paralogs of one another, but both are orthologs to the gene present in the genomes that branch off before the lineage-specific duplication. Despite these shortcomings, support value mapping, especially when using extended datasets, provides a quick method to appraise the extent of genomic mosaicism, to delineate preliminarily the major flows of genes in microbial evolution (plurality or majority consensus), and to find subsets of potentially transferred genes.
Screen for more recent interphylum transfers
The higher frequency of unrecognized paralogs among the putatively horizontally transferred genes is due to the much larger number of QuartOPs analyzed. The detected number of unrecognized paralogs corresponds to less than 1% of the QuartOPs that contain sufficient phylogenetic information to support a topology with more than 90% bootstrap support (figure 4C). Every instance of unrecognized paralogy will result in a QuartOP deviating from the majority consensus revealed in this analysis.
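The screening logic discussed in this section, namely retaining only QuartOPs that support one topology with more than 90% bootstrap support and then flagging those that disagree with the plurality, can be sketched in Python as follows. The function name `screen_quartops` and the input layout are assumptions made for illustration; flagged QuartOPs are candidates only and still require inspection of the full phylogeny, for example to rule out unrecognized paralogy.

```python
def screen_quartops(support, threshold=0.90):
    """Flag well-supported QuartOPs that conflict with the plurality topology.

    support:   dict mapping QuartOP id -> (s1, s2, s3), the fraction of
               bootstrap samples supporting each of the three topologies.
    threshold: minimum support (strictly exceeded) for a QuartOP to count.
    Returns (index of plurality topology or None, sorted conflicting ids).
    """
    resolved = {}
    for quartop_id, vector in support.items():
        best = max(range(3), key=lambda i: vector[i])
        if vector[best] > threshold:
            resolved[quartop_id] = best
    if not resolved:
        return None, []

    # Count how many strongly resolved QuartOPs favor each topology.
    counts = [0, 0, 0]
    for topology in resolved.values():
        counts[topology] += 1
    plurality = max(range(3), key=lambda i: counts[i])

    conflicting = sorted(q for q, t in resolved.items() if t != plurality)
    return plurality, conflicting
```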
Loss of support strength: due to conservative measure or taxon sampling?
It is difficult to compare posterior probabilities of QuartOPs directly with bootstrap support values of much larger datasets. Empirical studies  as well as simulation studies [25, 26] indicate that bootstrap measures are much more conservative than Bayesian posterior probabilities. In the four-taxon cases analyzed in , a posterior probability of 0.99 calculated according to Strimmer and von Haeseler  was found to correspond to only 70% bootstrap support calculated from non-extended datasets.
Comparison of confidence levels for different types of mappings. Table entries give the numbers of QuartOPs in the indicated genome quartets that prefer one of the three tree topologies with the specified level of support.
99% posterior probability
90% bootstrap support from non-extended datasets
90% bootstrap support from extended datasets
Interphylum quartet (see figure 4)
The original posterior probability mapping methods reported in  return results similar to those obtained from the analyses of extended datasets. ML mapping is much faster than the bootstrap support value mapping of extended datasets reported here. In interpreting results, however, one needs to be aware of the non-conservative nature of the posterior probability mapping approach, and of the greater susceptibility of four-taxon analyses to the long branch attraction artifact. The faster ML mapping approach has utility as a quick estimation of the phylogenetic information content of four genomes. Even though ML mapping greatly overestimates reliability, our results illustrate the utility of ML mapping as a pre-screen for putative horizontal gene transfer events. The use of extended datasets combined with subtree analyses prevents the inconsistencies associated with four-taxon analyses, but retains the power of visual representation. However, even an increase in OTU sampling and the simultaneous use of a more conservative probability measure do not obviate the need to inspect the phylogenies of candidate genes to detect instances of unrecognized paralogy. Given the public availability of over 100 prokaryotic genomes, appropriate reference genomes can be selected in most instances to distinguish differential loss of paralogs from horizontal gene transfer events.
The methodology of obtaining QuartOPs for four genomes is described in . For each sequence in a QuartOP we detect the top-scoring BLAST  hit with an E-value below 10^-8 in each of 60 completely sequenced archaeal and bacterial reference genomes (Aeropyrum pernix, Archaeoglobus fulgidus, Anabaena sp., Aquifex aeolicus, Agrobacterium tumefaciens, Borrelia burgdorferi, Bradyrhizobium japonicum, Bifidobacterium longum, Bacillus subtilis, Brucella suis, Buchnera sp., Clostridium acetobutylicum, Caulobacter crescentus, Corynebacterium glutamicum, Campylobacter jejuni, Chlamydophila pneumoniae, Deinococcus radiodurans, Escherichia coli K12, Fusobacterium nucleatum, Halobacterium sp., Haemophilus influenzae, Helicobacter pylori, Leptospira interrogans, Lactococcus lactis, Listeria monocytogenes, Lactobacillus plantarum, Mycoplasma genitalium, Methanococcus jannaschii, Methanopyrus kandleri, Mesorhizobium loti, Methanosarcina mazei, Methanobacterium thermoautotrophicum, Mycobacterium tuberculosis, Neisseria meningitidis, Oceanobacillus iheyensis, Pseudomonas aeruginosa, Pyrobaculum aerophilum, Pyrococcus horikoshii, Pasteurella multocida, Rickettsia conorii, Ralstonia solanacearum, Staphylococcus aureus, Streptomyces coelicolor, Sinorhizobium meliloti, Shewanella oneidensis, Sulfolobus solfataricus, Salmonella typhi, Synechocystis sp., Thermoplasma acidophilum, Thermosynechococcus elongatus, Thermotoga maritima, Treponema pallidum, Thermoanaerobacter tengcongensis, Tropheryma whipplei, Ureaplasma urealyticum, Vibrio cholerae, Wigglesworthia brevipalpis, Xanthomonas campestris, Xylella fastidiosa, Yersinia pestis). These genomes were downloaded from the NCBI web page . The resulting sequences are added to the QuartOP dataset and duplicated sequences are eliminated. The datasets are aligned with ClustalW , and 100 bootstrap samples are generated using the SEQBOOT program from the PHYLIP package version 3.6a2.1 .
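The dataset extension and resampling steps can be sketched in Python as follows. This is a simplified stand-in and not the Perl/SEALS scripts or the SEQBOOT program used in the study: the function names, the tuple layout of a BLAST hit, and the treatment of the E-value cutoff as a strict upper bound are assumptions made for illustration.

```python
import random

E_VALUE_CUTOFF = 1e-8  # assumed to act as an upper bound on acceptable E-values


def top_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep the top-scoring hit per reference genome that passes the cutoff.

    hits: iterable of (genome, subject_id, bit_score, e_value) tuples
          for a single query sequence (hypothetical layout).
    """
    best = {}
    for genome, subject, score, e_value in hits:
        if e_value >= cutoff:
            continue  # not significant enough, discard
        if genome not in best or score > best[genome][1]:
            best[genome] = (subject, score)
    return {genome: subject for genome, (subject, _) in best.items()}


def extend_dataset(quartop_ids, hits_per_query):
    """Add top hits to the QuartOP members and drop duplicated sequence ids."""
    seen, dataset = set(), []
    for seq_id in quartop_ids:
        found = top_hits(hits_per_query.get(seq_id, [])).values()
        for candidate in [seq_id, *found]:
            if candidate not in seen:
                seen.add(candidate)
                dataset.append(candidate)
    return dataset


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap sample).

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    rng:       a random.Random instance, so that runs can be reproduced.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}
```

Calling `bootstrap_alignment` 100 times with the same `random.Random` instance would yield 100 pseudo-replicates of the same length as the original alignment.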
The distances are generated using TREE-PUZZLE version 5.1  under the auto-detected substitution model. Neighbor-joining trees are calculated from these distances using NEIGHBOR from the PHYLIP package version 3.6a2.1 . The resulting trees are parsed with respect to which of the three four-taxon subtrees they contain (see figure 2) using an in-house Java program that utilizes PAL library classes . The resulting bootstrap support vectors are plotted into barycentric coordinates using GNUPLOT version 3.7 . Scripts for data manipulation were written in Perl and used many of the SEALS package subroutines .
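The parsing and plotting steps can be illustrated with a short Python sketch. It is not the in-house Java program: here each bootstrap tree is assumed to be given as a list of bipartitions (the set of leaves on one side of each internal branch), the support vector is normalized by the number of trees in which the quartet is resolved, and the triangle vertices are placed at (0, 0), (1, 0) and (0.5, √3/2). All function names are hypothetical.

```python
import math


def quartet_topology(splits, quartet):
    """Return 0, 1 or 2 for the subtrees ab|cd, ac|bd, ad|bc, else None.

    splits:  iterable of sets; each set holds the leaves on one side of an
             internal branch of an unrooted tree.
    quartet: tuple (a, b, c, d) of leaf names.
    """
    a, b, c, d = quartet
    pairings = [({a, b}, {c, d}), ({a, c}, {b, d}), ({a, d}, {b, c})]
    for side in splits:
        for index, (left, right) in enumerate(pairings):
            # A branch displays a pairing if it separates one pair from the other.
            if (left <= side and not (right & side)) or \
               (right <= side and not (left & side)):
                return index
    return None  # quartet unresolved in this tree


def support_vector(trees, quartet):
    """Fraction of bootstrap trees containing each of the three subtrees."""
    counts = [0, 0, 0]
    for splits in trees:
        topology = quartet_topology(splits, quartet)
        if topology is not None:
            counts[topology] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("quartet is unresolved in every tree")
    return tuple(count / total for count in counts)


def barycentric_to_cartesian(vector):
    """Map (p1, p2, p3) with p1 + p2 + p3 = 1 into an equilateral triangle."""
    _, p2, p3 = vector
    return (p2 + 0.5 * p3, (math.sqrt(3) / 2) * p3)
```

With this layout, a QuartOP without any preference among the three topologies maps to the center of the triangle, and one with 100% support for a single topology maps to the corresponding corner.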
The Rhodobacter capsulatus genome data were obtained from Integrated Genomics . Genome sequence for Chlorobium tepidum was downloaded from TIGR . The Rhodopseudomonas palustris genome was downloaded from JGI . Other genomes for the genome quartets were downloaded from NCBI .
The trees depicted in Figures 5 through 8 are neighbor-joining trees calculated using the NEIGHBOR program from PHYLIP version 3.6a2.1 . The distances used in NEIGHBOR were calculated in TREE-PUZZLE version 5.1  with the option to correct for Among Site Rate Variation using a discrete approximation of a Gamma distribution with eight rate categories and estimating the shape parameter. The three indicated support values are bootstrap support values calculated from 100 bootstrap samples analyzed with NEIGHBOR from the distance calculated in TREE-PUZZLE, bootstrap support values calculated from 100 bootstrap samples analyzed with the PROTPARS program from PHYLIP version 3.6a2.1 , and posterior probabilities as calculated with MrBayes version 3.0B4  (The analyses were performed independently three times, 200,000 generations each; the lowest posterior probability for the bipartition from the three runs is shown).
For eighteen potential candidates for horizontal gene transfer between Halobacterium sp. and Synechocystis sp., or between Aquifex aeolicus and Thermotoga maritima, phylogenetic trees were calculated and inspected manually. The neighbor-joining trees were calculated using the NEIGHBOR program from PHYLIP version 3.6a2.1 . The distances used in NEIGHBOR were calculated in TREE-PUZZLE version 5.1 . The trees were evaluated for potential transfers between Bacteria and Archaea, and between Thermotoga and Aquifex. 100 bootstrap samples were analyzed to assess the reliability of the branches on the tree. The support for a transfer was considered "strong" if the bootstrap support was above 70%, "weak" if the bootstrap support was lower, and "none" if no indication of a transfer could be inferred from the phylogenetic tree. Compositional bias for Halobacterium sp. and its closest phylogenetic neighbor was evaluated using a chi-square test at a 5% significance level as implemented in TREE-PUZZLE version 5.1. If both sequences failed the test, this is indicated as "strong" in the table. The results of these analyses are summarized in Table 1. The phylogenetic trees are available as additional data (see 1).
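The two classification rules above can be written down as a brief Python sketch. It only approximates the procedure: the composition test in TREE-PUZZLE was used as shipped, whereas here the expected amino acid frequencies are an input supplied by the caller, the critical value 30.144 is the 5% point of a chi-square distribution with 19 degrees of freedom, and a bootstrap value of exactly 70% is treated as "weak". All names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHI2_CRITICAL_DF19_5PCT = 30.144  # 5% critical value, 19 degrees of freedom


def transfer_support(bootstrap_percent):
    """Label a candidate transfer; None means the tree shows no indication."""
    if bootstrap_percent is None:
        return "none"
    return "strong" if bootstrap_percent > 70 else "weak"


def composition_chi2(sequence, expected_freqs):
    """Chi-square statistic of a sequence's composition against expectations.

    expected_freqs: dict amino acid -> expected frequency (assumed input).
    Gaps and ambiguous characters are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    statistic = 0.0
    for amino_acid in AMINO_ACIDS:
        expected = total * expected_freqs.get(amino_acid, 0.0)
        if expected > 0:
            statistic += (counts[amino_acid] - expected) ** 2 / expected
    return statistic


def strong_compositional_bias(sequence, neighbor, expected_freqs):
    """True if both sequences fail the homogeneity test at the 5% level."""
    return all(
        composition_chi2(seq, expected_freqs) > CHI2_CRITICAL_DF19_5PCT
        for seq in (sequence, neighbor)
    )
```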
- LBA: Long Branch Attraction
- HGT: Horizontal Gene Transfer
- ML: Maximum Likelihood
- QuartOP: Quartet of Orthologous Proteins
- OTU: Operational Taxonomic Unit
This work was supported through the NASA Astrobiology Institute at Arizona State University, the NASA Exobiology Program, and in part through the NSF Microbial Genetics Program.
- Ribeiro S, Golding GB: The mosaic nature of the eukaryotic nucleus. Mol Biol Evol. 1998, 15: 779-788.
- Raymond J, Zhaxybayeva O, Gogarten JP, Gerdes SY, Blankenship RE: Whole-genome analysis of photosynthetic prokaryotes. Science. 2002, 298: 1616-1620. 10.1126/science.1075558.
- Raymond J, Zhaxybayeva O, Gogarten JP, Blankenship RE: Evolution of photosynthetic prokaryotes: a maximum-likelihood mapping approach. Philos Trans R Soc Lond B Biol Sci. 2003, 358: 223-230. 10.1098/rstb.2002.1181.
- Zhaxybayeva O, Gogarten JP: Bootstrap, Bayesian probability and maximum likelihood mapping: Exploring new tools for comparative genome analyses. BMC Genomics. 2002, 3: 4. 10.1186/1471-2164-3-4.
- Nesbo CL, Boucher Y, Doolittle WF: Defining the core of nontransferable prokaryotic genes: the euryarchaeal core. J Mol Evol. 2001, 53: 340-350. 10.1007/s002390010224.
- Felsenstein J: Cases in which parsimony and compatibility methods will be positively misleading. Syst Zool. 1978, 27: 401-410.
- Farris JS: Likelihood and Inconsistency. Cladistics. 1999, 15: 199-204. 10.1006/clad.1999.0104.
- Siddall ME: Success of Parsimony in the Four-Taxon Case: Long-Branch Repulsion by Likelihood in the Farris Zone. Cladistics. 1998, 14: 209-220. 10.1006/clad.1998.0063.
- Felsenstein J: Phylogenies from molecular sequences: inference and reliability. Annu Rev Genet. 1988, 22: 521-565. 10.1146/annurev.ge.22.120188.002513.
- Huelsenbeck JP: The robustness of two phylogenetic methods: four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining. Mol Biol Evol. 1995, 12: 843-849.
- Huelsenbeck JP, Hillis DM: Success of Phylogenetic Methods in the Four-Taxon Case. Systematic Biology. 1993, 42: 247-264.
- Philippe H, Douzery E: The pitfalls of molecular phylogeny based on four species as illustrated by the Cetacea/Artiodactyla relationships. Journal of Mammalian Evolution. 1994, 2: 133-152.
- Adachi J, Hasegawa M: Instability of quartet analyses of molecular sequence data by the maximum likelihood method: the Cetacea/Artiodactyla relationships. Mol Phylogenet Evol. 1996, 6: 72-76. 10.1006/mpev.1996.0059.
- Graybeal A: Is it better to add taxa or characters to a difficult phylogenetic problem?. Syst Biol. 1998, 47: 9-17. 10.1080/106351598260996.
- Hillis DM, Pollock DD, McGuire JA, Zwickl DJ: Is sparse taxon sampling a problem for phylogenetic inference?. Syst Biol. 2003, 52: 124-126. 10.1080/10635150309356.
- Rosenberg MS, Kumar S: Taxon sampling, bioinformatics, and phylogenomics. Syst Biol. 2003, 52: 119-124. 10.1080/10635150309344.
- Strimmer K, von Haeseler A: Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proc Natl Acad Sci U S A. 1997, 94: 6815-6819. 10.1073/pnas.94.13.6815.
- Eric Weisstein's World of Mathematics. [http://mathworld.wolfram.com/]
- Gribaldo S, Philippe H: Ancient Phylogenetic Relationships. Theoretical Population Biology. 2002, 61: 391-408. 10.1006/tpbi.2002.1593.
- Wainright PO, Hinkle G, Sogin ML, Stickel SK: Monophyletic origins of the metazoa: an evolutionary link with fungi. Science. 1993, 260: 340-342.
- Jain R, Rivera MC, Moore JE, Lake JA: Horizontal Gene Transfer Accelerates Genome Innovation and Evolution. Mol Biol Evol. 2003.
- Koonin EV, Makarova KS, Aravind L: Horizontal gene transfer in prokaryotes: quantification and classification. Annu Rev Microbiol. 2001, 55: 709-742. 10.1146/annurev.micro.55.1.709.
- Ng WV, Kennedy SP, Mahairas GG, Berquist B, Pan M, Shukla HD, Lasky SR, Baliga NS, Thorsson V, Sbrogna J, Swartzell S, Weir D, Hall J, Dahl TA, Welti R, Goo YA, Leithauser B, Keller K, Cruz R, Danson MJ, Hough DW, Maddocks DG, Jablonski PE, Krebs MP, Angevine CM, Dale H, Isenbarger TA, Peck RF, Pohlschroder M, Spudich JL, Jung KW, Alam M, Freitas T, Hou S, Daniels CJ, Dennis PP, Omer AD, Ebhardt H, Lowe TM, Liang P, Riley M, Hood L, DasSarma S: Genome sequence of Halobacterium species NRC-1. Proc Natl Acad Sci U S A. 2000, 97: 12176-12181. 10.1073/pnas.190337797.
- Fitch WM: Homology: a personal view on some of the problems. Trends Genet. 2000, 16: 227-231. 10.1016/S0168-9525(00)02005-9.
- Douady CJ, Delsuc F, Boucher Y, Doolittle WF, Douzery EJ: Comparison of bayesian and maximum likelihood bootstrap measures of phylogenetic reliability. Mol Biol Evol. 2003, 20: 248-254. 10.1093/molbev/msg042.
- Alfaro ME, Zoller S, Lutzoni F: Bayes or bootstrap? A simulation study comparing the performance of bayesian markov chain monte carlo sampling and bootstrapping in assessing phylogenetic confidence. Mol Biol Evol. 2003, 20: 255-266. 10.1093/molbev/msg028.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- National Center for Biotechnology Information. [http://www.ncbi.nlm.nih.gov/]
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.
- Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genetics, University of Washington, Seattle. 1993.
- Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics. 2002, 18: 502-504. 10.1093/bioinformatics/18.3.502.
- Drummond A, Strimmer K: PAL: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics. 2001, 17: 662-663. 10.1093/bioinformatics/17.7.662.
- GNUPLOT Central. [http://www.gnuplot.info]
- Walker DR, Koonin EV: SEALS: a system for easy analysis of lots of sequences. ISMB. 1997, 5: 333-339.
- Integrated Genomics. [http://www.integratedgenomics.com/]
- The Institute for Genomic Research. [http://www.tigr.org]
- DOE Joint Genome Institute. [http://www.jgi.doe.gov/JGI_microbial/html/index.html]
- Huelsenbeck JP, Ronquist F: MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001, 17: 754-755. 10.1093/bioinformatics/17.8.754.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.