Genome-level homology and phylogeny of Shewanella (Gammaproteobacteria: lteromonadales: Shewanellaceae)
© Dikow; licensee BioMed Central Ltd. 2011
Received: 19 August 2010
Accepted: 12 May 2011
Published: 12 May 2011
Skip to main content
© Dikow; licensee BioMed Central Ltd. 2011
Received: 19 August 2010
Accepted: 12 May 2011
Published: 12 May 2011
The explosion in availability of whole genome data provides the opportunity to build phylogenetic hypotheses based on these data as well as the ability to learn more about the genomes themselves. The biological history of genes and genomes can be investigated based on the taxomonic history provided by the phylogeny. A phylogenetic hypothesis based on complete genome data is presented for the genus Shewanella (Gammaproteobacteria: Alteromonadales: Shewanellaceae). Nineteen taxa from Shewanella (16 species and 3 additional strains of one species) as well as three outgroup species representing the genera Aeromonas (Gammaproteobacteria: Aeromonadales: Aeromonadaceae), Alteromonas (Gammaproteobacteria: Alteromonadales: Alteromonadaceae) and Colwellia (Gammaproteobacteria: Alteromonadales: Colwelliaceae) are included for a total of 22 taxa.
Putatively homologous regions were found across unannotated genomes and tested with a phylogenetic analysis. Two genome-wide data-sets are considered, one including only those genomic regions for which all taxa are represented, which included 3,361,015 aligned nucleotide base-pairs (bp) and a second that additionally includes those regions present in only subsets of taxa, which totaled 12,456,624 aligned bp. Alignment columns in these large data-sets were then randomly sampled to create smaller data-sets. After the phylogenetic hypothesis was generated, genome annotations were projected onto the DNA sequence alignment to compare the historical hypothesis generated by the phylogeny with the functional hypothesis posited by annotation.
Individual phylogenetic analyses of the 243 locally co-linear genome regions all failed to recover the genome topology, but the smaller data-sets that were random samplings of the large concatenated alignments all produced the genome topology. It is shown that there is not a single orthologous copy of 16S rRNA across the taxon sampling included in this study and that the relationships among the multiple copies are consistent with 16S rRNA undergoing concerted evolution. Unannotated whole genome data can provide excellent raw material for generating hypotheses of historical homology, which can be tested with phylogenetic analysis and compared with hypotheses of gene function.
Taxon table and Mauve results
LCBs present in all taxa
LCBs present in subsets of taxa
% of genome
# of genes
% of genome
Aeromonas hydrophila ATCC 7966
Fresh and marine 
Alteromonas macleodii deep ecotype
Mediterranean Sea 
Colwellia psychrerythraea 34H
Arctic sediments 
Shewanella amazonensis SB2B
Baltic Sea 120 m 
Shewanella baltica OS223
Shewanella baltica OS155
Shewanella baltica OS185
Shewanella baltica OS195
Shewanella denitrificans OS217
Baltic Sea deep 
Shewanella frigidimarina NCIMB 400
North Sea 
Shewanella halifaxensis HAW-EB4
Nova Scotia 
Shewanella loihica PV-4
Hawaii, USA 
Shewanella oneidensis MR-1
Oneida Lake, USA 
Shewanella pealeana ATCC 700345
Atlantic Ocean 
Shewanella piezotolerans WP3
Pacific Ocean 
Shewanella putrefaciens CN-32
Shewanella sediminis HAW-EB3
Nova Scotia 
Shewanella sp. ANA-3
Woods Hole, USA 
Shewanella sp. MR-7
Black Sea 
Shewanella sp. MR-4
Black Sea 
Shewanella sp. W3-18-1
Coastal Pacific 
Shewanella woodyi ATCC 51908
Mediterranean Sea 
# of LCBs
The goal of the study presented here is to investigate how we can use whole genome data, not only to build a tree but to inform us of gene and genome history by comparing the hypothesis of historical homology supported by the phylogenetic hypothesis to what is known about gene function. There is a computational interest in the ability to build large trees, both in number of taxa and number of characters, e.g. [24, 25]. The biological history of genes and genomes can be investigated based on the taxomonic history of the bearers of these characters. This goes further than just the prediction of function of uncharacterized genes, but also includes the potential to track changing function over gene history and finding up- or down-stream segments of co-evolving DNA. Eisen and Fraser highlighted many of these goals when they introduced the term "phylogenomics" . While these goals are broad and ambitious, it is the hope that the present study represents a step in this direction.
The presented approach also represents a shift for phylogenetic systematics, in which historically one has generally known all the characters of interest very well and perhaps had a well-formed opinion about their history based on a lifetime of knowledge about their distribution and subtle variations. Even with molecular characters in the form of one or a few genes, even with many taxa, one gets to know the 'reliable' parts of an alignment and often memorizes the DNA sequence after having sequenced and edited the same marker for several years. The approach presented here proposes a new perspective which is obligated by the new kinds of data being gathered, particularly those from next-generation and shotgun sequencing, which generate millions of nucleotide base-pairs (bp) as opposed to thousands. Primary homology (sensu dePinna, ) must be determined in an automated fashion given the vast amount of data and the few character states of nucleotide data. The phylogenetic tree becomes an intermediate point - it is built based on hypotheses of primary homology, which it tests, and then is used as a framework for optimizing the character states and looking back to functional gene annotations to begin to answer questions about gene and genome history. Polymerase chain reaction (PCR) primers can provide hypotheses of primary homology, as amplifications using primers target conserved flanking regions, which provide a sufficient level of confidence that the 'same' regions are being sequenced. With next-generation sequencing, we have no such sense of location (particularly with bacteria), as we expect rearrangement of genes or other genomic segments over evolutionary history [28–31]. Annotations can provide information about the function of genes and the location of open reading frames, but these may not lead us to historical homology and will miss much of the homology present in the genome, for example among genes of altered function or non-coding DNA.
A second goal of this study is to present an example of comparative genomics for a closely related taxon that is densely sampled. This helps to avoid some of the downsides of comparative genomics when few, disparate taxa are compared: leaps of faith with character homology and trivial phylogenetic topologies. Finally, as an attempt to begin the investigation of gene history, 16S rRNA is presented as a test case. 16S rRNA was chosen because it is the marker that has defined prokaryote taxonomy for the past 30 years . While it is not used without some skepticism, e.g. [33, 34], the way that primary homology is generated here provides an opportunity to test our assumptions about 16S rRNA. This manuscript will address the outlined points in the following sections: Mauve primary homology , genome trees, subset trees, and 16S rRNA investigation.
For all taxa of Shewanella for which the genome sequence has been completed and deposited on GenBank, 16 species and 3 additional strains of one species and for three outgroup taxa, unannotated genome sequences were downloaded (strain and accession numbers in Table 1). Seven gene loci were also downloaded individually for all 22 taxa from the same genome accession numbers. The genes are gyrB (DNA gyrase subunit B), rpoA (DNA directed RNA polymerase subunit alpha), recA (recombinase A), topA (DNA topoisomerase I), mreB (rod shape-determining protein), gapA (glyceraldehyde-3-phosphate dehydrogenase), and atpA (ATP synthase F1 subunit alpha). Those species for which only one strain is included will hereafter be referred to by only their species name. For Shewanella baltica, for which multiple strains have been sampled, I will refer to the strain numbers throughout. Annotations were also downloaded from RefSeq (GenBank's Reference Sequence Collection, 2009). The sampled genomes range in size from 4,306,142 nucleotide base-pairs (bp) and 3,785 genes (Shewanella amazonensis) to 5,935,403 bp and 5,096 genes (Shewanella woodyi). All of the below analyses were performed on a 2.8 GHz Quad-Core MacPro with 20 GB RAM.
Sequences were submitted to a genome alignment using Progressive Mauve . This program has been used previously on fairly closely related members of Gammaproteobacteria . Traditional multiple sequence alignment cannot be used on complete genome sequences because significant rearrangement of genes or fragments has been shown to occur over evolutionary history [28–31]. Mauve addresses this issue by finding locally collinear blocks (LCBs), or contiguous segments of sequence within which there has not been rearrangement, but within a longer sequence that may have been subject to rearrangement events. The default parameters in Mauve were used. Mauve finds anchor points of similarity and then extends these matches outward. A single LCB becomes two when a sequence segment is found somewhere else in the genome for one of the taxa, meaning that a rearrangement has occurred. Mauve does not allow one sequence fragment to be homologous to more than one fragment in another species. CGView  has been used to plot the genomic location of LCBs found common to all species and to project gene annotations onto LCB alignments. Resulting LCBs were submitted to standard multiple sequence alignment with MAFFT . Mauve output XMFA files include the base-pair ranges (e.g. bp 1-10,000) of each LCB for each taxon. The base-pair locations were used along with annotations to assess which genes were present in each LCB. At the suggestion of a reviewer, Mauve was also run with a single outgroup taxon, C. psychrerythraea, to compare the influence of one vs. more than one outgroup on the percent of the genome covered by the Mauve alignment.
Aligned LCBs were concatenated and submitted for phylogenetic analysis with parsimony in TNT , maximum likelihood in RaxML v. 7.0.4 , and Neighbor Joining  in Geneious Pro 4.8.3 . Long alignments are challenging for phylogenetic tree building programs as many of these programs have a limit on the number of input characters or cannot access enough RAM to read the alignment. Those programs mentioned above were able to read the data analyzed here. The TNT commands consisted of 1000 builds with SPR and TBR followed by 1500 replicates of ratchet and tree fusing [43, 44]. RaxML was run under the GTRGAMMA model of nucleotide substitution. Only a single bootstrap replicate was attempted at a time, as this seemed to be the computational limit, at least for the used workstation, as any additional searching caused an error. Neighbor Joining was run with Jukes-Cantor, Tamura-Nei, and HKY genetic distance models in three separate analyses. Bootstrap resampling was attempted in TNT, with 5,000 pseudoreplicates. As an additional measure of support, trees were built from individual LCBs and were used to score whether nodes in the genome tree were also present in the individual LCB trees. Gaps were initially treated as a fifth state in TNT but all genome-wide data-sets were also reanalyzed with gaps treated as missing data. When not specified, the TNT analyses refer to those in which gaps were treated as a fifth state.
In order to compare the phylogenetic signal from the entire genome to signals from subsets of the genome, three courses of action were taken. The first asks the question whether localized signal is congruent with whole genome signal. To this end, phylogenetic hypotheses were built for individual LCBs, as above, in TNT. Second, in order to address whether the number of characters has a direct result on the topology, scripts using BioPerl were used to randomly sub-sample the nucleotide alignment and produce data-sets of varying size, 20,000 bp, 100,000 bp, 500,000 bp and 1,000,000 bp, which were submitted to the same phylogenetic treatment as above. Finally, a seven-gene analysis was attempted in TNT, as above, to compare a traditional phylogenetic analysis to the genome-level analysis that was the main goal of the study.
16S rRNA copies were tracked throughout the Mauve LCBs in order to establish for which of the 16S rRNA copies Mauve made an hypothesis of homology. All copies of 16S rRNA for all taxa (taxa have 8-11 copies each, according to annotations, for a total of 200 terminal branches for the 22 taxa), whether or not present in an LCB, were extracted from their respective annotated genome files, aligned with MAFFT, and submitted for phylogenetic analysis in TNT and RaxML, as above.
On the trees built from the 243 individual LCBs in TNT, for which there were 286 most parsimonious trees (MPTs), none had the topology of the genome tree with gaps as fifth state. There are 285 unique topologies represented in these 286 trees. None of these 286 trees has the topology of the gaps as missing TNT analysis either. While the individual LCB trees cannot be referred to as gene trees per se, as the vast majority of LCBs contain more than one gene as well as non-coding DNA, the phenomenon of parts of a data-set not producing the same phylogenetic signal, is comparable. To address whether it was an issue of number of base-pairs versus different localized signals concentrated in different contiguous parts of the genome (i.e. each LCB), I used the data-sets generated with BioPerl scripts, that randomly selected nucleotide columns from the 3 Mbp alignment and separately for the 12 Mbp alignment. These data-sets consisted of 20,000 bp, 100,000 bp, 500,000 bp and 1,000,000 bp for each of the original alignments. When I performed phylogenetic analyses on these data-sets, all produced the same topology as the phylogenetic (TNT and RaxML) genome tree; the same topology that was not found for any of the 243 single-LCB data-sets. For the seven-gene dataset, genes were aligned separately in MAFFT and then concatenated, resulting in an alignment of 13,009 bp. The topology resulting from this analysis is shown in Figure 3B with bootstrap resampling frequencies shown below the branches.
The next step was to associate the annotation, or list of open reading frames (ORFs) and their locations, with the list of LCB boundaries for each species. Of initial interest was tracking the multiple copies of 16S rRNA within these taxa and assessing their presence within the LCBs. There were 13 LCBs that contained a copy of 16S rRNA for at least one species. These are shown in the grid next to the tree in Figure 2. The LCBs are 4 (S. sp. MR-7, S. sp. MR-4, S. sp. ANA-3), 7 (S. halifaxensis, S. pealeana), 19 (S. putrefaciens), 27 (A. hydrophila), 37 (S. baltica OS195), 58 (C. psychrerythraea), 60 (A. hydrophila), 79 (S. amazonensis, S. loihica, S. baltica OS155, S. baltica OS223, S. baltica OS 185, S. baltica OS195, S. sp. W3-18-1, S. sp. MR-7, S. sp. MR-4, S. sp. ANA-3, S. oneidensis, S. sediminis, S. woodyi), 142 (C. psychrerythraea), 197 (A. hydrophila), 219 (S. baltica OS185, S. baltica OS195), 222 (S. frigidimarina, S. denitrificans, S. loihica, S. baltica OS223, S. baltica OS185, S. putrefaciens, S. sp. W3-18-1, S. sp. MR-7, S. halifaxensis, S. piezotolerans, S. sediminis), 231 (S. denitrificans, S. baltica OS223, S. baltica OS185, S. baltica OS195, S. putrefaciens, S. sp. W3-18-1, S. sp. MR-7, S. sp. MR-4, S. sp. ANA-3). The first obvious pattern is that, while all copies of 16S rRNA are homologous (derived from a single 16S rRNA ancestral copy), there was no single copy of 16S rRNA aligned by Mauve across all taxa, which would indicate the possibility of orthology. When the LCBs present in subsets of taxa were considered, and the subset of interest is all Shewanella species, there was also no single 16S rRNA copy aligned across all Shewanella taxa. It must also be noted that it is possible that there is a single orthologous copy that Mauve is unable to recover.
The allowance for variation within an LCB for base-pair length and number of genes among different taxa is how the estimates of primary homology considered by Mauve differ from previous studies: they do not require that a gene, which might look functionally the same, be homologous (aligned) across all taxa simply because it is present in these taxa. This is key in our beginning to understand gene histories and how these histories interact with the functional roles of genes. The homologies among fragments of non-coding DNA are also important, because if these kinds of analyses are extended to eukaryotes, which have vastly more non-coding DNA, we will be looking for ways to include these data and not be limited to evidence of history only from genes, which might have conflicting histories.
As can be seen in Table 1 in the all-taxa LCB data-set, the coverage by LCBs among included taxa ranges between 23-34%. In the subset taxa LCB data-set, one can track the coverage by phylogenetic position, with the innermost nodes having the most complete coverage. This coverage (Subset taxa LCBs) ranges from 36-97%. Outgroup taxa have a lower percentage of their genome covered by LCBs than do ingroup taxa. This pattern would be expected if the outgroups are indeed more distant relatives than any of the ingroup taxa. The inclusion of outgroups in a Mauve analysis decreases the LCB coverage, because they are more distantly related, but is essential to polarizing the characters obtaining the optimal phylogenetic topology. The reason three outgroup taxa were included was because there had not been a previous study showing that Alteromonas and Colwellia were indeed basal to Shewanella. One could not assume that Shewanella was monophyletic a priori; the seven-gene analysis (Figure 3B) finds a non-monophyletic Shewanella. To address a reviewer comment that using a single outgroup would increase the amount of data considered in the phylogenetic analysis, however, another Mauve run was attempted with a single outgroup (C. psychrerythraea). This species was chosen because it was found to be the most closely related to Shewanella, and therefore would cause the greatest increase in data if the other outgroups were not included. If just the most basal outgroup was included, one would not expect much of a change in genome coverage. The initial analysis, in which all three outgroup taxa are included, allows one to make the most a posteriori claims about homology for the broadest range of taxa. The topological differences are discussed below.
The number of parsimony informative characters in the 3 Mbp alignment (26.48%) was within the range of many molecular studies based on one or a few genes, e.g. [45, 46]. This fact provides a sense of confidence that Mauve was finding real, or at least likely, homologies. This percentage ranged from 1.4% to 53.6% when LCBs were compared to each other. If the percentage of parsimony informative characters had been significantly higher, I might be concerned that non-homologous and non-similar sequences were being aligned. In the RaxML analysis, gaps were treated as missing. Gaps were treated as a fifth state as the default for the parsimony (TNT) analyses because gaps do represent evolutionary events. The nature of the Mauve analysis, in my opinion, lends itself to the treatment of gaps as fifth-state characters because Mauve either breaks up a single LCB into two or more if homologous fragments are found in different positions in one or more genomes. If Mauve has postulated one LCB instead of two, it is then assumed that gaps represent the lack of homologous DNA sequence, not missing data. As for statistical support, values of 100% on all nodes are not particularly telling. By scoring the individual LCB trees for nodes present in the genome tree, it is clearer which nodes are the most and least robust. The relationships among strains of S. baltica have low support (60/286 trees), as might be expected if these have either recently diverged, or are exchanging genes. Confidence in the topology is also supported by the results of the random data-set analyses, which amount to a jackknife technique (without replacement).
When gaps are treated as missing data, as well as when C. psychrerythraea is the only outgroup taxon considered, there are topological differences in the placement of S. amazonensis, S. loihica, S. denitrificans, and S. frigidimarina, although S. denitrificans and S. frigidimarina are always sister. The remaining taxa retain their relationships. The node that separates S. denitrificans, S. frigidimarina, S. amazonensis, and S. loihica from the remaining species of Shewanella is the least well supported node in the tree, with 34 out of 286 individual LCB trees containing that node, indicating that it is not surprising that these four taxa are the ones that are placed differently in the various permutations of genome-wide data-sets. It is interesting that the RaxML tree and the gaps as fifth-state TNT tree have the same topology. There are multiple possible explanations for this finding. A reviewer mentioned that if genes are lost or gained in a single step, that considering gaps individually dramatically over-emphasizes the number of evolutionary events. It is also important to note, however, that by treating gaps as missing, one is ignoring those evolutionary events completely. It could be that a significant proportion of the evidence is lost in the gaps as missing analyses, but that maximum likelihood is better at compensating for multiple hits. To address the possibility that gaps, representing insertion/deletion (indel) events are overwhelming the phylogenetic signal, the number of parsimony tree steps resulting from indel events at internal nodes has been calculated to be 814,304. This number is 23.6% of all indels (the rest being at the terminal branches, autapomorphies that provide no information on branching pattern in parsimony) and 10.3% of total tree length. These calculations show that information from gaps does not make up a dominant proportion of the phylogenetic signal. It is entirely possible that the lack of indel data in the parsimony-missing data analyses leads to a different answer, but the presence of indel data in the gaps as fifth-state analysis and the ability of maximum likelihood to account for saturation give the same tree. This just argues for evidence that the genome tree (RaxML, parsimony gaps as fifth state) is the optimal tree.
For 22 taxa, there are 1.31 × 10^25 possible rooted bifurcating trees . It is surprising, then, that there was an immediate convergence on the optimal topology for the 12 Mbp, 3 Mbp, and random data-sets when run in TNT. Wagner tree build plus TBR branch swapping found the optimal tree very rapidly, within a minute. No amount of ratcheting or tree fusing altered the topology or tree length. The decisiveness of these data  marks the difference between analyses consisting of one or a few genes and the present analysis. Just as the idea of support needs to be adjusted to accommodate whole genome analyses, so do our expectations of tree search.
A recent paper considering the systems biology of Shewanella based on the whole genome sequences of 10 taxa (a subset of those included here) produced a phylogeny for those 10 taxa based on 1507 single-copy orthologs identified through a combination of, "(i) protein-protein pairwise reciprocal BLAST (blastp); (ii) reciprocal protein genomic sequence best match (tblastn); and (iii) Darwin pairwise best hit" (, p. 15914). Their tree had the same topological relationships for those 10 taxa as the genome tree presented here. It is also interesting that the number of genes found to be single-copy orthologs in  is comparable to the number of genes present in the LCBs for the ingroup taxa presented here (Table 1). Future work might ask whether the sets of orthologs are similar for these two data-sets and how these two different approaches (ortholog identification vs. unannotated homology detection) might complement each other to best utilize whole genome sequence data.
It is interesting that none of the trees that resulted from analysis of each LCB separately produced the same topology as the genome tree, even though there are relatively few taxa, and that the randomly sampled data-sets all produced the genome-tree topology, even though some of the LCBs were longer than some of the randomly sampled data-sets. Sometimes Shewanella was not monophyletic in individual LCB trees. What the results highlight, is that with a genome approach, which does not only focus on particular genes of interest, one is able to discern a unique phylogenetic signal. The random data-set tree results demonstrate that localized LCB signal might be a factor, even when an LCB contains 25 or more genes. One might expect that sampling nucleotides across all LCBs produces a signal closer to the optimum than does focusing on parts of the genome with putatively different histories. The surprising insight here was that 20,000 bp was enough to see this effect, for the present taxon sampling. It should also be noted that Mauve provides a kind of filter, in that the LCBs consist of those parts of the genome that have passed tests of similarity and co-linearity. The 3 Mbp data-set, based on LCBs common to all taxa, represents a very complete, ideal data-set; approximately one-third of the genome (spread over the whole genome; Figure 1) for which all taxa have data. This is perhaps part of the reason that 20,000 bp is enough to recover the genome tree topology, in this case. As suggested by a reviewer, the possibility exists that 20,000 bp is simply enough for the model mis-specification artifacts caused by mixed-tree signals or incorrect substitution models to consistently yield an incorrect topology. This explanation would require, however, that all LCBs, or all LCBs except one also had this problem because they all produce different topologies that are not congruent with the genome tree.
For the seven-gene tree (Figure 3B), there is generally high bootstrap support, but there are only two ingroup taxa that retain the same sister-group relationships that are also present in the genome tree (Figure 2). The genes chosen for the analysis were the same (minus 16S rRNA) as those chosen for a phylogenetic analysis for Vibrionaceae, the most closely related family to Shewanellaceae . Even the four S. baltica strains appear scattered across the tree. The two large clades, which are consistent across all other trees, containing (S. woodyi, S. sediminis, S. piezotolerans, S. halifaxensis, and S. pealeana) and (S. baltica, S. putrefaciens, S. sp. W3-18-1, S. sp. MR-4, S. sp. MR-7, S. sp. ANA-3, and S. oneidensis) are not present in the seven-gene tree. C. psychrerythraea, a putative outgroup taxon is nested deeply within the ingroup in the seven-gene tree.
The fact that Mauve does not generate a hypothesis of positional homology for any single copy of 16S rRNA even though all copies are very similar speaks to the challenges that occur when dealing with multiple gene copies. The pattern of gene arrangement flanking instances of 16S rRNA is not enough to assign a hypothesis of positional homology. There is simply too much rearrangement to have confidence in such a hypothesis. The taxonomic relationships suggested by the all-copy 16S rRNA tree from TNT (Figure 4) are different than those suggested by the genome tree (Figure 2). Nodes of congruence are highlighted in Figure 4. There are many similarities between the all-copy 16S rRNA tree and the genome tree, however: Shewanella is monophyletic, S. sp. MR-7, S. sp. MR-4, S. oneidensis, S. sp. ANA-3 are monophyletic, S. halifaxensis and S. pealeana are sister, S. denitrificans and S. frigidimarina are sister, S. sp. W3-18-1 and S. putrefaciens form a clade, S. piezotolerans, S. halifaxensis and S. pealeana form a clade, S. sediminis and S. woodyi are sister, Al. macleodii and C. psychrerythraea are sister. The 16S rRNA copy tree from RaxML shared many nodes in common with the 16S rRNA tree from TNT, but found the Al. macleodii and C. psychrerythraea clade within Shewanella and did not find S. denitrificans and S. frigidimarina as sister. The fact that the TNT and RaxML trees are not completely congruent is not particularly surprising given that there are few parsimony informative characters in 16S rRNA and that there is no expectation that a tree-like branching pattern exist for these gene copies. For the TNT run, the gaps were treated as a fifth state, so that a gap can be informative. RaxML treats gaps as missing data, so that might also account for the differences.
Haggerty et al.,  constructed a similar 16S rRNA copy tree, but did so for 17 species across four genera: Escherichia, Shigella, Yersinia, and Salmonella (also Gammaproteobacteria). They found the backbone of this tree, i.e. the separation of genera, conforming to their taxonomic expectations. They did not find widespread monophyly of copies at the species level, however. The tree shown in Figure 4, based on the present analysis, does show monophyly for many species. The following taxa formed monophyletic groups of 16S rRNA copies: C. psychrerythraea, Al. macleodii, S. loihica, S. woodyi, S. sediminis, S. pealeana, S. amazonensis, S. putrefaciens, S. oneidensis, S. sp. ANA-3, S. denitrificans, S. frigidimarina. These taxa formed paraphyletic groups: S. halifaxensis (S. pealeana nested inside), S. sp. W3-18-1 (S. putrefaciens nested inside). The following taxa had gene copies scattered among those of other taxa, S. sp. MR-7, S. sp. MR-4, S. baltica OS155, S. baltica OS185, S. baltica OS195, S. baltica OS223.
There is no expectation that this is the actual evolutionary history of these gene copies, or that it should provide the correct relationships among taxa, because multiple gene copies cannot all be orthologous to one another, and are homologous only at the level that there was one initial copy of 16S rRNA, but that is not informative at this phylogenetic level of inquiry. It was merely an exercise to see if the copies from one species group more closely with the other copies for that species or with copies from another species. The former is most often the case. The notable exceptions are the S. baltica strains, which are classified as strains of the same species and the S. sp. MR-4 and S. sp. MR-7. S. sp. MR-4 and MR-7 were isolated from different depths of the Black Sea . It follows that these may have been separate lineages for a much shorter period of time than the other taxa included in the analysis, or that there continues to be gene exchange. The major pattern, that of monophyly of within-species copies, might be explained in multiple ways. Haggerty et al., 2009 suggest that such a pattern might reflect "homogenization" of 16S rRNA copies such that each species has its own unique suite of copies of the gene. This is also known as concerted evolution . The question remains as to why Haggerty et al., 2009 do not recover a concerted evolution type pattern, even though they consider taxa not so distantly related to Shewanella and with similar 16S rRNA copy numbers. It is possible that because they included only 17 species covering all 4 genera, their taxon sampling was too sparse to recover the true pattern. They also removed parts of the 16S rRNA alignment that were "ambiguous", further lowering the number of characters, probably the informative characters, in their analysis.
Putting the issue of multiple copies aside, it is not surprising that any of the copies individually is able to group with its species, given that 16S rRNA sequences are a significant part of how prokaryotic species are defined. What is more important for evolutionary studies, however, is that because there are multiple copies, and because in the analysis presented here no one single copy among the sampled taxa is found to be positionally homologous, as well as the whole genome topology and the all-copy 16S rRNA topology do not agree, 16S rRNA should not be considered a reliable marker for positing evolutionary relationships. Whether it performs well diagnostically is unrelated to this question.
Recent phylogenetic studies have been published that rely on many more gene loci than has been customary in the past [e.g. 52-54]. These studies also begin to ask and answer questions regarding the number of genes sufficient to obtain the 'true' tree. Gene-tree concordance methods have also been adopted by many to this end [55, 56]. The study presented here, which takes a different approach, can also begin to add to this discussion. Here, it is shown that unannotated whole genome data can provide excellent raw material for generating hypotheses of historical homology, which can be tested with phylogenetic analysis and compared with hypotheses of gene function. The future possibilities include the ability to quantify lateral gene transfer and gene tree effect, track changing function over gene history and find segments of co-evolving DNA. It is through the combination of methods of phylogenetic systematics and comparative genomics that we can best use the whole genome data to reconstruct the histories of genes, genomes, and taxa.
RBD thanks the following people at the Field Museum of Natural History: Karen Cranston for scripting, Torsten Dikow for reading earlier drafts, and Leo Smith for analysis advice, reading earlier drafts, and hardware support. RBD also thanks Aaron Darling and an anonymous reviewer for their constructive comments. RBD is supported by the Emerging Pathogen Project (to The Field Museum of Natural History and The University of Chicago Institute for Genomics and Systems Biology). Emerging Pathogens Project is funded by The Davee Foundation and The Dr. Ralph and Marian Falk Medical Research Trust. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.