Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses
© Pride et al. 2006
Received: 27 September 2005
Accepted: 18 January 2006
Published: 18 January 2006
Virus taxonomy is based on morphologic characteristics, as there are no widely used non-phenotypic measures for comparison among virus families. We examined whether there is phylogenetic signal in virus nucleotide usage patterns that can be used to determine ancestral relationships. The well-studied model of tail morphology in bacteriophage classification was used for comparison with nucleotide usage patterns. Tetranucleotide usage deviation (TUD) patterns were chosen since they have previously been shown to contain phylogenetic signal similar to that of 16S rRNA.
We found that bacteriophages have unique TUD patterns, representing genomic signatures that are relatively conserved among those with similar host range. Analysis of TUD-based phylogeny indicates that host influences are important in bacteriophage evolution, and phylogenies containing both phages and their hosts support their co-evolution. TUD-based phylogeny of eukaryotic viruses indicates that they cluster largely based on nucleic acid type and genome size. Similarities between eukaryotic virus phylogenies based on TUD and gene content substantiate the TUD methodology.
Differences between phenotypic and TUD analysis may provide clues to virus ancestry not previously inferred. As such, TUD analysis provides a complementary approach to morphology-based systems in analysis of virus evolution.
Eukaryotic viruses and bacteriophages exist in numerous forms and are capable of infecting disparate hosts. The taxonomy of viruses is based upon morphological features, including capsid and tail structures, specific type of genetic material, and mechanism of replication and assembly [1, 2]. Genetic comparison across virus species has been complicated by generally different rates of gene evolution; thus, their overall classification rests on phenotypic and morphologic characteristics. Horizontal gene transfer has been substantial in virus evolution [4–7], complicating reconstruction of ancestry based on the current presence of particular loci. Analysis of phylogenies based on phenotypic systems is limited by convergent evolution, in which like characteristics are evolved by unrelated organisms to suit particular niches or evolutionary requirements. Taxonomy of bacteriophages also has been based on morphologic characteristics [9, 10]. Tail morphology forms the basis for bacteriophage classification into 3 separate families: Myoviridae (contractile tails), Podoviridae (short tail stubs), and Siphoviridae (long tails). Studies examining phage tail assemblies have not ascertained whether the source of tail characteristics has phylogenetic significance, indicating the likely existence of polyphylogeny within these phage groups.
Unlike viruses, prokaryotes most commonly have been taxonomically classified according to a single locus, 16S rRNA [13–15]. Because of its relatively conservative rate of evolution, and presumed rarity of horizontal transfer due to its functional constraints, the 16S rRNA locus is believed to serve as an accurate marker of recent common ancestry.
Evaluation of prokaryotic ancestry based on shared gene content also has been proposed, guided by the principle that prokaryotes have cores of essential genes, whose presence or absence is evolutionarily significant [16–18]. Alternatively, phylogenetic signal has been analyzed in whole-genome nucleotide usage patterns and is consistent with the predicted phylogenetic structure of prokaryotes based on 16S rRNA. Whole-genome approaches are less biased by any single locus, with horizontal transfer being an intrinsic part of the signal, reflecting the current. Using Zero-Order Markov algorithms, we have previously demonstrated that tetranucleotide usage patterns retain phylogenetic signal among many related prokaryotes. Differences between tetranucleotide usage and 16S rRNA in ancestral reconstruction likely are due to horizontal influences, such as extensive recombination, and/or presence of restriction/modification systems.
We sought to better understand the evolution of viruses by comparing methods for reconstructing ancestry, including tetranucleotide usage deviation and shared gene content, to construct a framework independent of phenotypic analysis. Our goals were to understand whether: 1) phylogenetic signal is retained in nucleotide usage patterns of viruses; 2) phylogenetic structures based on nucleotide usage patterns and tail morphology are similar; 3) nucleotide usage patterns in viruses are primarily determined by gene content; and 4) prokaryotic hosts exert a substantial influence on bacteriophage nucleotide usage patterns.
Nucleotide usage patterns are unique among different prokaryotes, providing distinct signatures [19, 21, 22] that are well-conserved across each genome, except for DNA hypothesized to be acquired through lateral gene transfer [23, 24]. To determine whether bacteriophages have unique genomic signatures, we employed a method based on tetranucleotide usage deviations (TUD) from expected. TUD patterns have substantially more phylogenetic signal than codon usage biases when compared to 16S rRNA. We generated TUD based on Zero-Order Markov algorithms, which determine how patterns of tetranucleotide usage in each genome deviate from those expected based on overall nucleotide content, representing the genomic signature [19, 25]. Zero-order Markov algorithms were chosen, as removal of constituent oligonucleotide biases (e.g. dinucleotide and trinucleotide biases) through Markov chain analysis results in a substantial loss of TUD phylogenetic signal.
We next sought to determine whether the TUD distribution across each bacteriophage genome is unique, representing a genomic signature. Highly similar yet distinct TUD profiles were shown for Enterobacteria phage P2 (Myoviridae), Enterobacteria phage HK022 (Siphoviridae), and Shigella phage V (Podoviridae) (Figure 2b). However, within a bacteriophage family, TUD patterns are not strictly conserved, as demonstrated by the lack of similarity between Mycobacterium phage Bxz1 and Enterobacteria phage HK022 (both Siphoviridae) (Figure 2c). These data are an indication that each phage genome has unique patterns of nucleotide usage that are not strictly determined by tail characteristics.
As greater numbers of prokaryote and virus genomes are sequenced, genomic signatures have become better defined. Prokaryotes have genomic signatures that contain phylogenetic signal at both the dinucleotide and tetranucleotide levels. Our data indicate the existence of genomic signature in viruses parasitizing prokaryotic and eukaryotic hosts. Because genomic signature analysis is based on whole genomes and is independent of multiple alignments, it provides a robust methodology for comparison across and between prokaryotes and viruses.
We show that bacteriophage genomic signatures are associated with their host organisms, with few notable exceptions (Figure 6). This is most likely the effect of co-evolution between host and parasite, in which the host influences phage nucleotide usage patterns and possibly vice versa. Indeed, both phage and host avoid use of certain tetranucleotides recognized by host restriction/modification systems. The close approximation, but not complete identity, of phage TUD patterns to those of host organisms supports a co-evolution model (Figure 6). We hypothesize that the differences reflect limitations in the ability of phages to adapt to host TUD patterns, and/or that phages need to retain particular TUD patterns to maintain host range.
An alternative explanation for bacteriophage TUD patterns is that their approximation to their hosts results from amelioration, and does not reflect recent common ancestry. Obligate parasitism likely necessitates that phages ameliorate to host nucleotide usage patterns, which is why host range may be critical. Bacteriophage phylogenetics based on shared gene content is dissimilar to that based on TUD, and is also not consistent with taxonomy based on morphologic features. The differences between the methods likely reflect biases, with TUD phylogeny biased by amelioration, and gene content phylogeny biased by lateral gene transfer.
Horizontal gene transfer from host to phage produces apparent similarity to host nucleotide usage patterns that mechanistically does not represent recent common ancestry. If horizontal transfer from host to phage served as the predominant mechanism of phage evolution, Mycobacterium phage TUD patterns would closely reflect those of their hosts. In the Mycobacterium phages, where horizontal acquisition is common, phage TUD patterns more closely approximate each other than those of their prokaryote hosts (Figure 6), which supports host-phage evolution in parallel, but not their evolution through horizontal gene transfer. That nucleotide usage patterns are relatively homogeneous across each phage genome (Figure 1) suggests that the proportion of horizontally transferred genes in phage genomes may be relatively small or subject to rapid amelioration, and substantiates a parallel model of host-phage evolution. In phages with known transposons such as Enterobacteria phage Mu, these elements are not identified as anomalous in the phage genome (Figure 1c), suggesting that transposons are not responsible for the observed similarity in nucleotide usage patterns between host and phage.
Our data on bacteriophage phylogenies (Figure 5) argue directly against relying on morphological characteristics to understand phage ancestry. Based on a TUD-based model of evolution, tail characteristics would have been horizontally transferred, subject to variable rates of evolution, or continuously altered with little phylogenetic significance. Conversely, for a model of ancestry based on tail morphology to be correct, patterns of nucleotide usage would have had to shift continuously throughout phage evolution independent of host. The substantial relationships between phage and host TUD (Figure 6) make independent evolution highly unlikely. Exceptions to the model of host-phage co-evolution presented by the Podoviruses Bacillus phages PZA and GA-1, Pseudomonas putida phage GH-1, and a few Enterobacteria phages could be explained by broad host range, accelerated rates of change with loss of TUD phylogenetic signal, or alternate replication strategies. Each of these Enterobacteria phages is T7-like, with many known similarities at the genetic level, further suggesting their phylogenetic distinctness. The Bacillus phages are Phi29-like phages, which in addition to lack of shared ancestry with other Bacillus phages (Figure 4), have T7-like polymerases, supporting their recent ancestry with T7-like phages. Each of these phage groups has substantial strand bias, indicating their unique differences compared to other phage groups, and likely evolutionary distance.
TUD phylogenetics support co-evolution of host and bacteriophage; however, eukaryotic viruses cluster as expected based on recognized genetic and morphological features [1, 2]. Bovine viruses (e.g. bovine RSV, papillomavirus, coronavirus, parvovirus, and polyomavirus) cluster independent of host, indicating that factors other than host influences determine their phylogenetic position (Figure 7). That coronaviruses do not belong to the major RNA virus cluster (Figures 7 and 8), contrary to previous studies, suggests that TUD may not be robust for certain RNA viruses. Most of the negative-sense RNA viruses cluster together, with the exception of RSV and the segmented RNA viruses. The clustering of RSV with rhinoviruses may represent convergent evolution or allelic exchange, as they occupy a similar ecological niche. The phylogenetic position of the segmented RNA viruses including orthomyxoviruses, arenaviruses, and reoviruses supports the concept of a common progenitor (Figure 7). The TUD analysis shows separate grouping of the large and small double stranded DNA viruses, which may be a limitation of the technique or reflect the occurrence of double stranded DNA more than once in the ancestral history of viruses. Bacteriophages cluster with the eukaryotic double stranded DNA viruses, further suggesting a single progenitor for the large double stranded DNA viruses. Whether the polyphylogeny observed among the large and small DNA viruses, the positive-sense RNA viruses, and the single stranded DNA viruses reflects methodologic limitations or evolutionarily significant phenomena remains to be determined. Phylogeny based on gene content, in which polyphylogeny among each group of viruses is diminished, supports the former hypothesis (Figure 8).
Morphological features form one basis for virus taxonomy; however, we provide data suggesting that bacteriophage tail characteristics may not sufficiently reflect their evolution. Based on TUD patterns, phages are co-evolving with their hosts in a manner defined by their ability to achieve broad host range. That there are only a few exceptions to the co-evolution model among the many phages analyzed substantiates that phylogenetic signal exists in phage TUD patterns. The TUD methodology is easily reproducible, alignment-independent, affected by lateral gene transfer in proportion to the extent of transfer, and can be used for directly comparing both prokaryotes and viruses. Despite differences between TUD and morphology-based classification, TUD phylogeny retains utility in understanding host-phage co-evolution and deviation in patterns of nucleotide usage in certain viruses since their divergence from recent common ancestors. As such, TUD phylogeny should be considered complementary to other systems for analysis of virus evolution.
Tetranucleotides were selected for study because analysis of higher-order oligonucleotides was not possible given the limited size of virus genomes. We based our minimum genome sequence length on the assumption that 95% of tetranucleotide combinations should occur at least 10 times. Our calculated minimum length was 5 kb, based on analysis of concatenated genome strands designed to eliminate strand bias. The minimum genome length analyzed in this study was 4.7 kb (9.4 kb when analyzing both strands), while analysis of pentanucleotides would have required a minimum genome length of 20 kb. To determine the tetranucleotide usage deviations from expected (TUD) among prokaryotic genomes, a Zero-Order Markov algorithm was used, which determines the expected number of tetranucleotides by removing biases in mononucleotide frequencies, according to the equation $E(W) = (A^{a} \cdot C^{c} \cdot G^{g} \cdot T^{t}) \cdot N$, where A, C, G, and T represent the frequencies of the four nucleotides within the window being evaluated, a, c, g, and t represent the number of occurrences of nucleotides A, C, G, and T in each tetranucleotide, and N represents the length of the genome being evaluated. The frequency of divergence for each tetranucleotide is expressed as the ratio of observed to expected. TUD profiles for all tetranucleotides were determined for each organism studied using Swaap 1.0.1, and their relative abundance between genomes was compared by linear regression analysis.
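The zero-order calculation described above can be illustrated with a minimal Python sketch. This is not the Swaap implementation; the function names `with_reverse_complement` and `zero_order_tud` are hypothetical, the input is assumed to contain only A, C, G, and T, and mononucleotide frequencies are taken over the whole sequence supplied rather than over a sliding window.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def with_reverse_complement(seq):
    """Concatenate a strand with its reverse complement (hypothetical helper).

    Assumption: this mimics the "concatenated genome strands" idea; the
    junction adds a few words that span both strands.
    """
    seq = seq.upper()
    return seq + seq.translate(_COMPLEMENT)[::-1]


def zero_order_tud(seq, k=4):
    """Return {word: observed / expected} under a zero-order Markov model."""
    seq = seq.upper()
    n = len(seq)
    # Mononucleotide frequencies A, C, G, T over the sequence being evaluated.
    base_counts = Counter(seq)
    freq = {base: base_counts[base] / n for base in "ACGT"}
    # Observed counts of overlapping words of length k.
    observed = Counter(seq[i:i + k] for i in range(n - k + 1))
    tud = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        # E(W) = A^a * C^c * G^g * T^t * N, computed as a product over positions.
        expected = float(n)
        for base in word:
            expected *= freq[base]
        tud[word] = observed[word] / expected if expected > 0 else float("nan")
    return tud
```

Calling `zero_order_tud(with_reverse_complement(genome))` would give a 256-entry profile for a genome string; values above 1 indicate over-represented tetranucleotides and values below 1 indicate under-represented ones.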
Tetranucleotide differences in each genome were determined using Markov chain analysis, by determining the expected oligonucleotide word frequency after removing the biases existing in component oligonucleotides. $W = (w_1 w_2 \ldots w_m)$ denotes the word formed by the concatenation of m nucleotides, and N(W) is its observed count in a sequence of length n, as described. The expected count E(W) of W is:
$E(W) = \dfrac{N(w_1 w_2 \ldots w_{m-1}) \, N(w_2 w_3 \ldots w_m)}{N(w_2 w_3 \ldots w_{m-1})}$.
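As an illustration of this expected count, the following Python sketch computes E(W) for a single word from overlapping sub-word counts. The helper names `count_words` and `markov_expected` are hypothetical, the word is assumed to be at least three nucleotides long, and returning 0.0 when the inner sub-word is absent is a choice made here, not something stated in the text.

```python
from collections import Counter


def count_words(seq, size):
    """Count overlapping words of the given size in seq."""
    return Counter(seq[i:i + size] for i in range(len(seq) - size + 1))


def markov_expected(seq, word):
    """Expected count of word from its (m-1)-mer and (m-2)-mer components.

    E(W) = N(w1..w_{m-1}) * N(w2..w_m) / N(w2..w_{m-1})
    """
    m = len(word)
    if m < 3:
        raise ValueError("word must have at least 3 nucleotides")
    shorter = count_words(seq, m - 1)
    core = count_words(seq, m - 2)
    n_prefix = shorter[word[:-1]]   # N(w1 ... w_{m-1})
    n_suffix = shorter[word[1:]]    # N(w2 ... w_m)
    n_core = core[word[1:-1]]       # N(w2 ... w_{m-1})
    if n_core == 0:
        return 0.0  # assumption: no expectation when the inner word is absent
    return n_prefix * n_suffix / n_core
```

For a tetranucleotide, this divides the product of the two overlapping trinucleotide counts by the count of the shared central dinucleotide.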
The frequency of the word F(W) is expressed as the ratio of the observed O(W) to the expected E(W). F(W) can be determined for all windows $F_w(W)$ of specified size within each genome. Tetranucleotide differences for each window are measured by the expression: , where i - j represent all tetranucleotide combinations. The tetranucleotide difference index represents the difference between each window and the mean difference for all windows. Z-scores were determined for each window using the equation $Z = (x - \mu)/\sigma$, where x represents the tetranucleotide difference for the window being evaluated, μ represents the mean tetranucleotide difference for all windows, and σ represents the standard deviation of all windows. Those windows differing from the mean by ± 3 Z-scores were defined as significant, and were determined using Swaap 1.0.1.
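The Z-score step can be sketched in Python as follows. Because the expression for the per-window tetranucleotide difference is not reproduced here, the sketch takes those difference values as given input. The function name `flag_windows` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def flag_windows(differences, threshold=3.0):
    """Return (z_scores, flagged_indices) for per-window difference values.

    Z = (x - mu) / sigma, with mu and sigma taken over all windows.
    Assumption: sigma is the population standard deviation.
    """
    mu = mean(differences)
    sigma = pstdev(differences)
    if sigma == 0:
        # All windows identical: no window deviates from the mean.
        return [0.0] * len(differences), []
    z_scores = [(x - mu) / sigma for x in differences]
    flagged = [i for i, z in enumerate(z_scores) if abs(z) >= threshold]
    return z_scores, flagged
```

Windows returned in `flagged` are those differing from the mean by at least three standard deviations in either direction.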
Distances based on TUD were determined as $D_t = \frac{1}{4^4} \sum |F_1(W) - F_2(W)|$, where $F_1(W)$ and $F_2(W)$ represent F(W) for each of the 256 tetranucleotides for any organisms 1 and 2 [19, 38], which represents the Euclidean distance between 2 vectors in 256-dimensional space. Bootstrapping was performed by sampling with replacement of each of the 256 tetranucleotide frequencies using Swaap PH 1.0.1, and phylograms were created based on distance matrices using Phylip 3.5, reviewed via Treeview, and displayed using midpoint rooting with Paup 4.0b10.
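A minimal Python sketch of the distance and the bootstrap resampling is given below. The names `tud_distance` and `bootstrap_distances` are hypothetical and this is not the Swaap PH implementation. The code follows the formula as written, a sum of absolute differences scaled by the number of words (1/256 for complete tetranucleotide profiles), which is numerically a mean absolute difference. The resampling scheme, drawing words with replacement and recomputing the distance, is an assumption about what "sampling with replacement" entails.

```python
import random


def tud_distance(f1, f2):
    """Mean absolute difference between two {word: F(W)} profiles."""
    words = sorted(f1)
    return sum(abs(f1[w] - f2[w]) for w in words) / len(words)


def bootstrap_distances(f1, f2, replicates=100, seed=0):
    """Recompute the distance on words resampled with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    words = sorted(f1)
    distances = []
    for _ in range(replicates):
        sample = rng.choices(words, k=len(words))
        total = sum(abs(f1[w] - f2[w]) for w in sample)
        distances.append(total / len(sample))
    return distances
```

Pairwise values from `tud_distance` can be arranged into a distance matrix, and each bootstrap replicate would yield one resampled matrix for tree building.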
Phylogeny based on gene content was determined using methodology similar to that described. Shared genes between genomes were determined using an operational definition of orthology. Briefly, all genes within each genome examined were added to a BLAST database, and each genome then was compared at the amino acid level against the database using a BLAST threshold value, E = 0.01. The resulting number of orthologues was tabulated for each pair of viruses compared, and distances were expressed as one minus the percent of shared genes between each genome. Data tabulation and distance matrices were generated using Swaap PH 1.0.1. Phylogenies were constructed based on distance matrices using Paup 4.0b10.
Analysis of congruence among the gene phylograms was performed on consensus trees, and 1000 trees created with random topology. Differences in log likelihood (Δ-ln L) were computed between phylograms based on TUD phylogenies and 1000 random trees. Differences in Δ-ln L for random phylograms can be considered as the null distribution, obtained when there is no more similarity in topology than expected by chance. If the Δ-ln L values for comparisons among the phylograms are within the 99th percentile of the null distribution, then the topologies are significantly different, and thus incongruent.
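The gene content distance can be sketched in Python once the orthologue counts are available. The sketch starts from pre-tabulated counts and does not run BLAST. The function names are hypothetical, the shared fraction is expressed as a proportion between 0 and 1, and normalizing by the gene count of the smaller genome is an assumption, since the text does not state the denominator.

```python
def gene_content_distance(shared, genes_a, genes_b):
    """One minus the fraction of shared genes between two genomes.

    Assumption: the fraction is taken relative to the smaller genome.
    """
    smaller = min(genes_a, genes_b)
    if smaller == 0:
        raise ValueError("both genomes must contain at least one gene")
    return 1.0 - shared / smaller


def distance_matrix(gene_counts, shared_counts):
    """Build a symmetric distance matrix as nested dicts.

    gene_counts: {genome: number of genes}
    shared_counts: {(genome_a, genome_b): number of shared orthologues}
    Pairs missing from shared_counts are treated as sharing no genes.
    """
    names = sorted(gene_counts)
    matrix = {a: {b: (0.0 if a == b else 1.0) for b in names} for a in names}
    for (a, b), shared in shared_counts.items():
        d = gene_content_distance(shared, gene_counts[a], gene_counts[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix
```

The resulting matrix has the same shape as a TUD distance matrix, so both can be passed to the same distance-based tree-building step.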
TUD: tetranucleotide usage deviations from expected
Supported in part by the Foundation for Bacteriology, and the National Institutes of Health (RO1GM63270). We thank Dr. Douglas Brutlag for his critical review of this manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.