Horizontal gene transfer and nucleotide compositional anomaly in large DNA viruses
© Monier et al. 2007
Received: 14 June 2007
Accepted: 10 December 2007
Published: 10 December 2007
Skip to main content
© Monier et al. 2007
Received: 14 June 2007
Accepted: 10 December 2007
Published: 10 December 2007
DNA viruses have a wide range of genome sizes (5 kb up to 1.2 Mb, compared to 0.16 Mb to 1.5 Mb for obligate parasitic bacteria) that do not correlate with their virulence or the taxonomic distribution of their hosts. The reasons for such large variation are unclear. According to the traditional view of viruses as gifted "gene pickpockets", large viral genome sizes could originate from numerous gene acquisitions from their hosts. We investigated this hypothesis by studying 67 large DNA viruses with genome sizes larger than 150 kb, including the recently characterized giant mimivirus. Given that horizontally transferred DNA often have anomalous nucleotide compositions differing from the rest of the genome, we conducted a detailed analysis of the inter- and intra-genome compositional properties of these viruses. We then interpreted their compositional heterogeneity in terms of possible causes, including strand asymmetry, gene function/expression, and horizontal transfer.
We first show that the global nucleotide composition and nucleotide word usage of viral genomes are species-specific and distinct from those of their hosts. Next, we identified compositionally anomalous (cA) genes in viral genomes, using a method based on Bayesian inference. The proportion of cA genes is highly variable across viruses and does not exhibit a significant correlation with genome size. The vast majority of the cA genes were of unknown function, lacking homologs in the databases. For genes with known homologs, we found a substantial enrichment of cA genes in specific functional classes for some of the viruses. No significant association was found between cA genes and compositional strand asymmetry. A possible exogenous origin for a small fraction of the cA genes could be confirmed by phylogenetic reconstruction.
At odds with the traditional dogma, our results argue against frequent genetic transfers to large DNA viruses from their modern hosts. The large genome sizes of these viruses are not simply explained by an increased propensity to acquire foreign genes. This study also confirms that the anomalous nucleotide compositions of the cA genes is sometimes linked to particular biological functions or expression patterns, possibly leading to an overestimation of recent horizontal gene transfers.
During the last decade the study of virus evolution has been neglected to the point where 'virus evolution' most often refers to studies more akin to population genetics, such as the worldwide scrutiny of new polymorphisms appearing daily in the H5N1 avian flu virus , than to the fundamental question of where viruses come from [2–4]. Phylogenetic studies on viruses have long been considered unfeasible for two main reasons: 1) their reputed propensity to randomly acquire genetic material from their host or the environment and 2) their reputed very high sequence divergence rate. The generality of this vision, probably inherited from the study of RNA viruses (in particular retroviruses), now deserves to be revisited for DNA viruses in light of the increasing amount of available genomic sequence data, and the recent characterization of some giant viruses [5–8].
When analyzing DNA virus genome sequences on a global scale  one is immediately struck by their tremendous variation in size. Even if viral DNA genomes are expected to be larger than viral RNA genomes due to the improved accuracy of the replication system, it is not as easy to understand how DNA viruses with apparently similar "fitness" (as judged from their virulence and burst sizes) may have come to exhibit sizes ranging from a few kilobases up to more than a megabase [5, 6].
Even more intriguing is the fact that such a genome size variation is commonly found among viruses infecting the same or similar hosts, for prokaryotic viruses (e.g. bacteriophage ranging from 30 kb up to nearly 670 kb for bacteriophage G ), as well as animal viruses [from less than 5 kb (polyomaviruses) to 360 kb (poxviruses)]. Currently the largest known eukaryotic DNA viruses are plankton parasite Emiliania huxleyi virus 86 (407 kb)  and the amoeba-infecting mimivirus (1.2 Mb) .
Finally, unlike the situation for eukaryotic cellular organisms, the increase in viral genome size is not correlated with either accumulation of "junk" DNA (e.g. low complexity sequences or non-coding regions), invasion of mobile elements, gene duplication or repeat expansion [5, 7, 11].
In the present work, we now examine to which extent horizontal gene transfer (HGT) from host might account for the exceptional genome size of several families of large double stranded DNA viruses (LDVs) with genomes exceeding 150 kb in size.
Proportions of cA genes in the 67 large DNA viruses.
Genome size (bp)
Genomic G+C (%)
Number of analyzed CDS
Number of cA genes
Proportions of cA genes (%)
Enterobacteria phage RB43
Enterobacteria phage T4
Pseudomonas phage phiEL
Pseudomonas phage phiKZ
Mycobacterium phage Bxz1
Enterobacteria phage RB69
Enterobacteria phage RB49
Vibrio phage KVP40
Aeromonas phage 44RR2.8t
Aeromonas phage Aeh1
Aeromonas phage 31
Cercopithecine herpesvirus 16
Equid herpesvirus 1
Cercopithecine herpesvirus 1
Human herpesvirus 2
Human herpesvirus 1
Psittacid herpesvirus 1
Gallid herpesvirus 2
Gallid herpesvirus 3
Cercopithecine herpesvirus 2
Meleagrid herpesvirus 1
Pongine herpesvirus 4
Human herpesvirus 6B
Murid herpesvirus 1
Human herpesvirus 5 strain AD169
Human herpesvirus 6
Human herpesvirus 7
Cercopithecine herpesvirus 8
Murid herpesvirus 2
Human herpesvirus 5 strain Merlin
Tupaiid herpesvirus 1
Human herpesvirus 4
Equid herpesvirus 2
Cercopithecine herpesvirus 15
Ostreid herpesvirus 1
Rabbit fibroma virus
Molluscum contagiosum virus
Lumpy skin disease virus
M. sanguinipes entomopoxvirus
A. moorei entomopoxvirus 'L'
Mule deer poxvirus
African swine fever virus
M. configurata NPV-A
M. configurata NPV-B
L. dispar MNPV
Xestia c-nigrum granulovirus
Lymphocystis disease virus
Invertebrate iridescent virus 6
A. polyphaga mimivirus
E. huxleyi virus 86
P. bursaria Chlorella virus 1
E. siliculosus virus 1
Shrimp white spot syndrome virus
H. zea virus 1
Nucleotide composition is one of the specific properties of viral and cellular genomes [14–16]. The mimivirus genome is A+T rich (72%), and exhibits rather homogenous nucleotide compositions along the chromosome . In contrast, some betaherpesviruses exhibit a clear bimodal heterogeneity in G+C composition along their genomes . Different factors can shape compositional heterogeneity within and across genomes , including mutational biases, physical constraints on DNA molecules, functional requirements at the level of transcription  and translation [20–22], and genetic exchanges with other genomes by horizontal transfer [23–26]. Many of the initial surveys of LDV genomes characterized nucleotide compositional properties of individual genomes. However, there are few studies systematically addressing their compositional properties in a comparative way .
Here we analyzed the nucleotide compositional properties of 67 LDV genomes. We first compared global nucleotide compositions across these viruses and their hosts. We next identified compositionally anomalous (cA) genes in the viral genomes, examined their correlation with strand asymmetry, a possible cause of compositional biases, and described their functional and physical (i.e. chromosomal co-localization) properties. Finally, we investigated potential exogenous origins of the cA genes through phylogenetic tree reconstruction.
G+C content is a simple measure of genomic nucleotide composition, and it has been shown to be species-specific for prokaryotes . LDVs also present a large variation in global G+C content across viral families (27%–76%; Additional file 1). Large variations in G+C content are also observed within a viral family or subfamily (Additional file 2), for instance, the lumpy skin disease virus (26%) and the molluscum contagiosum virus (64%) belonging to the same chordopoxvirus subfamily. Large variations in G+C content were previously noted for herpesviruses .
We next analyzed compositional heterogeneities across different genes encoded in individual LDV genomes. Genes with nucleotide compositions substantially deviating from the average in a genome tend to be under distinct selection pressures or have particular evolutionary histories. We denote such genes as compositionally anomalous (cA) genes. To identify cA genes with a robust statistical support, we used a method  based on Markov modeling and Bayesian inference originally developed for the identification of horizontally transferred genes within prokaryotic genomes. Composition-based approaches have a clear limitation in detecting HGTs; they detect only a subset of HGTs (i.e. recent horizontal gene acquisitions by a recipient genome from a donor genome with nucleotide compositions that are different from the recipient genome). Our aim here is to examine the nature of cA genes in LDVs in the light of previous observations of cA genes for prokaryotic genomes.
We identified cA genes in all the analyzed LDV genomes. In many LDV genomes, the cA genes were more A+T rich than the remaining genes (for 43 out of 67 LDVs; binomial test, p < 0.05). The proportion of cA genes per genome was highly variable across the 67 LDVs, ranging from 1% to 53% (Table 1, Fig. 4). In contrast, the proportion obtained by the same method is less variable for prokaryotes (0% to 25%) . Herpesviruses exhibited the widest range of cA gene proportion (6% to 53%). All but two of the analyzed NCLDVs, including mimivirus, had a relatively low level of cA proportion (1%–17%). Two exceptions were Paramecium bursaria chlorella virus 1 (28%) and molluscum contagiosum virus (31%). Phages showed a relatively low proportion (4% to 21%). The cA gene proportions for other viruses were 5% to 26% for four baculoviruses, 39% for nimavirus and 24% for Heliothis zea virus 1. The very high cA proportions in some betaherpesviruses appear to be linked with genomic islands harboring betaherpesvirus-specific genes, which we will discuss below (in the "Co-localization of cA genes" section).
We observed a weak but significant positive correlation (R2 = 0.53, p < 0.005) between the proportion of cA genes and genomic G+C content (Additional file 4). This correlation was not due to a bias induced by a given viral family or subfamily. For example, when we excluded herpesviruses, many of which exhibit both a high cA gene proportion and a high G+C content, the correlation was reduced but still remained significant (R2 = 0.24, p ≤ 0.005). Such a significant correlation was observed also within a single subfamily (i.e. alphaherpesviruses, R2 = 0.66, p ≤ 0.005). A potential variation in gene prediction quality (which could be dependent on genomic G+C compositions) does not appear to explain these correlations (Monier et al., unpublished data). We also examined a possible relationship between the cA gene proportion and chromosome topology. There was no difference in the cA gene proportions between linear LDV genomes (15.5% on average) and circular LDV genomes (17.3% on average; t-test, p = 0.63).
Our global genomic signature analysis revealed remarkably different nucleotide compositions between LDVs and their hosts. To investigate further the discrepancy of nucleotide compositions, we examined if the identified cA genes are enriched in "host-like" sequences in terms of their nucleotide compositions. We determined the genomic signature distances (di-nucleotides for word length) from individual cA genes to the viral genome where they originate as well as to the host genomic sequences. As shown in the Additional file 5, we observed a marked enrichment of genes with host-like signatures among the identified cA genes (19%) relative to non-cA genes (3.3%). The host-like signatures in these cA genes could be due to a large variation in their nucleotide compositions, being unrelated to possible host origins for some of the cA genes. To test this hypothesis, we performed the same analysis by randomizing the pairings between viruses and their hosts. A comparable fraction (23% versus 19%) of the cA genes indeed exhibited smaller distances to the genomic signatures of randomly chosen hosts than to the viral genomes where they originate. Thus the enrichment of genes with host-like signatures in cA genes may not suggest their horizontal acquisitions from hosts.
Genome sequences of many prokaryotes and vertebrates [33, 34] show strong strand asymmetry in nucleotide composition between the leading and the lagging strands. For most bacteria, compositional strand asymmetry is characterized by an excess of G and T bases in the leading strands relative to the lagging strands [33, 35, 36]. A substantial part of compositional strand asymmetry is independent of gene distribution in the two distinct DNA strands, and is probably due to mutational biases linked to asymmetric mechanisms of DNA replication [37, 38]. Compositional strand asymmetry spanning large genomic segments has been described also for some large DNA viruses, including mimivirus [5, 39] and several herpesviruses [40, 41]. Such replication-associated strand bias can potentially result in two classes of genes with distinct nucleotide compositions, depending on which strand they are located. We thus examined if there is a correlation between the distribution of cA genes and compositional strand asymmetry.
We first generated cumulative GT-excess plots  for all of the 67 LDV genomes to assess their compositional strand asymmetry. LDVs presented various GT-excess patterns. Several LDVs, mostly phages, showed strong global strand asymmetry with a monotonous increase (or decrease) of the cumulative GT-excess curve along their entire genome. Some LDVs including mimivirus exhibited a few peaks in their GT-excess profiles. For other LDVs, the GT-excess curves locally fluctuated with no long genomic segments exhibiting a consistent compositional asymmetry. We selected fifteen LDVs presenting long (>10 kb) genomic segments with uniform compositional asymmetry (Additional file 6). After identifying the genomic coordinates where the sign of the nucleotide-skew (G+T versus C+A) changes, we split the genomic sequences into sub-strands. Those sub-strands were classified into two classes: class I consists of sub-strands having a positive nucleotide-skew (i.e. G+T% > C+A%), and class II with a negative skew (i.e. G+T% < C+A%). Genes were then classified into either class I or II according to the sub-strand they originated from. For instance, the plus strand (according to the GenBank entry) of the Pseudomonas phage phiKZ genome is G and T rich along its entire length, thus being classified as class I. This class I strand contains 229 genes (≥ 300 nt), of which 22 were detected as cA genes. The complementary strand is classified as a class II and contains 49 genes, of which 5 were cA genes. No significant correlation was found between the distribution of cA genes and the compositional strand asymmetry (Fisher's exact test, p-value = 0.99). The mimivirus genome shows a clear switch of nucleotide-skew (G+T versus C+A) at position 380,000 nt as previously noted . Thus a part of the plus strand (from 0 to 380,000 nt) and a part of the complementary strand (from 380,000 to 1,118,404 nt) constitute the class I strands. Class II is represented by the remaining strands (Additional file 6). Again, no significant correlation was found between the distribution of cA genes and the compositional strand asymmetry (p-value = 0.63). Of the fifteen LDVs that we analyzed, only three showed a significant correlation between cA gene distribution and strand asymmetries (p-values < 0.05). The three viruses are the fowlpox virus and two strains of the human herpesvirus 5 (the wild type strain merlin and the laboratory strain AD169). These results suggest that replication-associated compositional strand asymmetry accounts for only a small part of the nucleotide compositional biases of the cA genes observed in LDVs.
Functional constraint on specific genomic regions can be a cause of atypical nucleotide compositions of genes. We selected five genomic regions to examine if there were correlations between cA genes and certain of their functional attributes. For the first three cases (gamma- and betaherpseviruses), the correlation between the nucleotide compositional biases and gene functions has already been described, and could serve as a positive control for our method. For the last two cases (Emiliania huxleyi virus-86 and mimivirus), no such correlation has been previously reported.
Human herpesvirus 4 (Epstein-Barr virus) of the gammaherpesvirus subfamily has two different life cycle modes: latent and productive. Nine genes are known to be expressed during latency . Those genes are hardly expressed during the productive mode. They are highly A+T rich compared to the remaining genes. Karlin et al. hypothesized that the unique codon usage of those genes helps to minimize the competition with host genes for various host resources during latency . We found that eight of the nine latency genes were indeed detected as cA genes (89%). In contrast, the cA gene proportion was significantly lower for the remaining genes (29/81, 36%, p = 0.003).
The immediate-early transcription region (≈10-kb) of the murid herpesvirus 1 (murine cytomegalovirus, betaherpesvirus subfamily) genome is known to be CpG suppressed . Experimental data suggest that the methylation of the cytosines in CpG dinucleotides in the enhancer/promoter of these regions has a regulatory role in gene expression . The CpG suppressed region of the murid herpesvirus 1 genome encodes 10 genes. We found that the cA gene proportion in this region (6/10 genes, 60%) was twice as important as in the remaining regions (46/146, 32%) of the genome, though their difference might be due to chance (p = 0.085).
The murid herpesvirus 1 genome possesses three (19-kb, 10-kb and 17-kb) regions that are A+T rich relative to other parts of the genome . Those regions are enriched in genes encoding membrane glycoprotein (e.g. m02 and m145 families). Some of those proteins were suggested as responsible for the evasion from natural killer cell-mediated immune surveillance through their interaction with inhibitory natural killer cell receptors [49, 50]. Of 37 genes within the A+T rich regions, 67% (24 genes) were detected as cA genes, which is significantly higher than the cA gene proportion for the remaining genes (23.5%, 28/119 genes, p < 10-5). It is uncertain if the increased A+T levels of these regions have functional roles. However, it should be noted that these A+T rich regions contain five of the seven genes exhibiting significant sequence polymorphisms between strains of murid herpesvirus 1 .
Emiliania huxleyi virus-86 (EhV-86) of the Phycodnaviridae family exhibits two distinct transcription phases during its lytic infection to the host alga, E. huxleyi . The primary phase is characterized by the transcriptions of a group of genes by 1 hour post-infection. The secondary phase involves the transcriptions of other genes between 2 and 4 hours post-infection. Respectively thirty-eight and 253 genes in our data set (ORF length ≥ 100 codons) correspond to the primary and the secondary transcription phases. We found a significantly greater fraction of cA genes in the primary phase (8/38, 21%) than in the secondary phase genes (19/253, 7.5%; p = 0.014). Genes expressed during the primary phase map in the 104-kb central genomic region (bases 200,000 to 304,000), which shows a similar G+C content as the rest of the genome . It has been suggested that promoter-like elements (family A repeats) uniquely found in this region control this early expression pattern of EhV-86 . The functions of the early expression are unknown as most of the transcribed genes lack detectable homologs in the databases. Allen et al. postulated that an ancestral EhV acquired the 104-kb region by horizontal transfer .
Mimivirus has unusually well conserved promoter-like AAAATTGA motifs in the upstream regions of half of its predicted genes in the genome . Based on the putative associated gene functions, Suhre et al. predicted that these promoter-like elements regulate gene expression during the early stages of the viral infection, whilst most of the genes contributing to the virus particle are devoid of this motif . Of 402 genes with the promoter-like motifs in our data set, 43 (10.6%) were detected as cA genes. Of 508 genes lacking the motifs, 40 (7.9%) were detected as cA genes. The difference is not statistically significant (p = 0.16).
In summary, we found a significant correlation between cA genes and previously described functional categories of genes for three of the five cases. It should be noted that in the case of EhV, no relationship between expression timing and gene nucleotide composition was previously reported. This suggests the possible use of nucleotide composition analysis to predict expression patterns of viral genes. These results indicate that anomalous nucleotide compositions of some of the cA genes can be due to functional constraints, although the distinction between functional constrains and horizontal transfer events is generally difficult given the known bias in functions of horizontally transferred genes in prokaryotes [24, 26].
We found that the protein product from a cA gene in mimivirus (MIMI_L211 previously annotated as unknown function) exhibits significant sequence similarities to the C-terminal half of the etoposide-induced 24 (EI24) proteins (pfam07264, E-value < 10-7). EI24 (also known as PIG8) is directly induced by p53, a critical tumor suppressor coordinating DNA repair, cell-cycle arrest and apoptosis in response to cellular stresses. It has been suggested that EI24 acts as a pro-apoptotic factor and prevents tumor spreading in mammals . The N-terminal part of the EI24, which is missing in MIMI_L211, binds to Bcl-2, while the function of the C-terminus of EI24 is unknown. This is the first identification of an EI24-like domain in a viral genome. The presence of such a domain in mimivirus is puzzling since, to the best of our knowledge, no apoptotic phenomenon has been described for its unicellular host Acanthamoeba polyphaga. It is notable that another amoeba Dictyostelium discoideum, which carries out apoptotic processes, possesses a hypothetical ORF (Q54PW9_DICDI) matching the EI24 domain (E-value < 10-19).
Horizontally acquired genes may or may not exhibit anomalous nucleotide compositions depending on their origin. Phylogenetic reconstruction is a more powerful approach to examine horizontal gene transfers. However, the approach could be used only for a limited part of our data set, as most of the cA genes have no detectable homologs in the databases. Wherever possible, we systematically conducted phylogenetic reconstructions for the cA genes. Out of a total of 1633 phylogenetic reconstruction, only eight candidate horizontal gene transfer events were supported by the reconstructed trees (using conservative criteria, see Materials and Methods). All these trees revealed striking phylogenetic incongruities (Additional file 10): glutathione peroxidase of molluscum contagiosum virus (Additional file 10) (A), interleukin 10 of Cercopithecine herpesvirus 15 (B), hypothetical protein (R1R) of monkeypox virus (C), G protein-coupled receptor of fowlpox virus (D), photosystem II D2 protein (E) and site-specific DNA methylase (F) of bacteriophage S-PM2, thymidylate synthase of cyanophage P-SSM2 (G) and cytosine methyl-transferase of cyanophage P-SSM4 (H).
We demonstrated that there is no significant correlation between the G+C content of LDV genomes and that of their modern hosts, and that the genomic signatures of LDVs are also significantly different from those of the hosts. Recently, Mrázek and Karlin reported a similar result for large DNA viruses based on a smaller data set than the one used in this study . This simple observation does not favor the "gene pickpockets" depiction of viruses which supposedly frequently acquire genetic material from their hosts [57, 58]. Various factors can potentially account for the existence of species-specific global nucleotide compositions [18, 22]. Most importantly, LDVs carry their own genes for the major components of replication machinery. Several authors have suggested that viruses are ancient and that their evolutionary trajectory could be largely dissociated from those of their present hosts [2–4]. Accordingly, specific genomic signatures of LDV genomes may originate from intrinsic properties of their replication machinery. As an alternative, we speculate that viruses may take advantage of the compositional differences to re-orient host machineries towards viral DNA/RNA molecules.
We found no significant correlation between the cA gene proportion and the size of the LDV genomes. This observation suggests that extremely large sizes of the genomes of some LDVs such as mimivirus are not due to recent accretion of foreign genes. By extrapolation, the capacity to capture foreign genes is unlikely to be the major factor that determines the tremendous variation in genome size for DNA viruses .
We showed that the degree of intra-genome heterogeneity was highly variable across LDVs (i.e. the proportion of cA genes ranging from 1% to 53%). In prokaryotes, anomalous gene nucleotide composition is usually attributed to horizontal gene acquisition events. It has been argued that traces of horizontal gene transfer events as old as 100 million years can be detected through the identification of anomalies in nucleotide composition for bacterial genes . The composition-based method used in this study identified horizontally transferred gene candidates for prokaryotes, which constitute 0% to 25% of their gene complements. Surprisingly the same method identified only a comparable level of cA gene proportions for half of the analyzed LDVs. Furthermore, several LDVs exhibited a cA proportion as low as that of small parasitic intracellular bacteria, rarely exchanging genes with other bacteria. Thus, the nucleotide compositions of LDVs could indicate no general differences between LDVs and prokaryotes in the frequency of horizontal gene acquisitions.
Most of the identified cA genes were of unknown functions and did not exhibit cellular homologs in the databases. This also argues against massive HGTs from hosts to LDVs. One may speculate that cellular homologs could not be detected due to a faster viral sequence evolution. However, a recent study showed that evolutionary rates are similar between LDV genes with database homologs and those lacking detectable homologs (i.e. ORFans) .
Furthermore, functional and structural constraints can also cause intra-genome compositional heterogeneity. In this study, we explored several mechanisms that could potentially cause compositional anomaly in LDV genomes. These mechanisms include specific gene functions and replication associated strand asymmetry. We validated a significant enrichment of cA genes in specific functional categories of genes for some of the LDVs including EhV-86. We evidenced the presence of a nucleotide composition bias in the EhV-86 genes expressed in the primary phase that was previously overlooked. In contrast, we showed that replication associated compositional asymmetry is not a major cause of cA gene compositional bias for most of the LDVs.
It should be noted, that highly expressed genes exhibiting biased codon usages are usually pre-excluded in quantifying the number of horizontally transferred genes in prokaryotic genomes . Due to the lack of such a general knowledge on viral codon usage bias with functional consequences, we did not apply such a filter in our computational identification of cA genes. In this regard, possible functional constraints on the nucleotide compositions will tend to contribute to an overestimation of recent horizontal gene transfers in viruses relative to prokaryotes.
The lack of database homologs or phylogenetic evidences does not exclude the possible HGT-origins of cA genes. The hugely diverse world of viruses is probably the most underrepresented in the current database . Thus, we speculate that a significant part of these cA genes might have been transferred from other viruses that are not yet sequenced. Such inter-virus gene exchange can be achieved by illegitimate or homologous recombination [60–62] occasionally leading to the fusion of two viral genomes , or with the aid of mobile elements  upon co-infection of a cell by multiple virus species. Li et al. described a clear case of HGT from Xestia c-nigrum granulovirus to Mamestra configurata nucleopolyhedrovirus B , though this HGT event was not detected by our study due to the similar nucleotide compositions of these two genomes.
Goldenfeld and Woese pointed out the important role of viruses in the gene flux among microbial communities . They conceptualized the viral gene pool as a large dynamic repository of genetic information accessible to microorganisms. Inter-virus gene transfer may be central to the dynamics of the viral gene pool. Given the known diversity of genetic material in the virus world and their under-representation in the current sequence databases , this hypothesis appears as plausible as the functional/structural scenario in explaining the existence of compositionally anomalous genes in large DNA viruses.
The genomes of large DNA viruses exhibit nucleotide compositions largely differing from their host genomes. Our results based on nucleotide composition analyses, database searches and phylogenetic tree reconstructions suggest that horizontal gene transfer to these viruses from their current hosts is infrequent and does not account for the large variation in their genome size. However, the viral genomes still show a variable proportion of compositionally anomalous genes. Such compositional biases potentially arise from particular biological functions at the nucleic acid level and/or inter-virus gene transfers.
Viral genome data were downloaded from the viral section of the NCBI Reference Sequence (RefSeq) database . We have selected 67 non-redundant double-strand DNA viral genomes larger than 150 kb from the dataset (Table 1).
We prepared 32 sets of protein coding sequence (CDS) data for viral hosts (Table S6). For viral hosts species for which complete genome sequences are available, we downloaded sequence data from KEGG . For other viral hosts, we retrieved CDS data from GenBank . We used only CDS longer than or equal to 300 bp. For some of the CDS data sets built from GenBank, we included sequences from organisms that are closely related to the viral host species (Additional file 11). We removed the sequence redundancy in each CDS data set from GenBank by keeping only one representative from a cluster of homologs. Clustering of homologous protein sequences was performed using BLASTCLUST ; we used 40% for the minimal sequence identity (-S option) and 75% for the minimal coverage by alignment (-L option) for at least one sequence (-b option). Of the 32 host sequence data sets, 26 sets were composed of more than 20 sequences, and used in this study. These represent the host sequence data for 61 LDV genomes.
Based on the RefSeq annotation, we prepared a dataset for CDSs and non-coding sequences (NCDSs) for viral genomes. We retained only CDSs longer than or equal to 300 bp, and discarded shorter ones.
where n = 256 for k = 4.
To detect cA genes in LDV genomes, we used the computer program developed by Nakamura et al. . We used a Markov order of 3 with a 96 bp window sliding with a step size of 12 bp along the genome. A Monte-Carlo simulation was used to compute the statistical significance of the nucleotide composition bias in a gene as in Nakamura et al.. The threshold for the statistical was set to 1% significance (unilateral statistical test). In addition to the dataset derived from RefSeq, a control dataset was generated from the viral genomic sequences by conserving only open reading frames longer than or equal to 300 bp that exhibit significant sequence similarity hits to database sequences from cellular organisms or distantly related virus, using BLASTP  against SwissProt/TrEMBL  and the viral RefSeq databases. The proportions of the cA genes for the control dataset varied from 0% to 54% across different viruses, and correlated with the cA gene proportions for the dataset derived from RefSeq (R2 = 0.73, p < 0.001, not shown). In the manuscript we show only the results obtained from the RefSeq dataset.
We conducted two distinct statistical tests for physical proximity of cA genes in a viral genome (i.e. clustering of genes along the chromosome) using a Monte-Carlo simulation. First, we counted the number of pairs (n) of cA genes that were consecutively encoded in the genome. We shuffled the order of genes in the genome 1000 times and obtained the distribution of n in the randomized data. The distribution was used to determine p-value for n from the real data. Second, we recorded the size (number of genes, N) of each cA gene cluster. From the Monte-Carlo simulation, we obtained the p-value for the observation of at least one cA gene cluster with the size equal to or greater than N genes.
All the cA genes were searched against the SwissProt/TrEMBL database using BLASTP (E-value < 10-5) to identify homologous sequences. For each gene, we retrieved 50 best hits at the maximum from the database, and aligned them to the candidate sequence using MUSCLE . We improved these initial alignments by removing highly divergent or fragmented sequences with visual inspection of the alignments. The alignments were used to generate neighbor-joining (NJ) trees using CLUSTALW  with Kimura's correction for distance. Some of the NJ trees were selected for further analyses by the neighbor-joining method with Jones-Taylor-Thornton (JTT) substitution model  using MEGA  and the maximum likelihood method by PHYML . The results of the BLAST searches were also served to assign putative functions to the cA genes where the annotations in the RefSeq database were inadequate for our analysis.
Horizontal gene transfer (HGT) is usually invoked, when a phylogenetic tree based on gene sequences is incongruent with a species tree. In this study, we assumed deep phylogenetic origins for large DNA virus lineages (i.e. before the divergence of major phyla of cellular organisms), which have been indicated by the analyses conserved genes (e.g. DNA polymerase) [5, 78]. To list up candidates of HGT, we first identified cases where the grouping of viral sequences, V, and their cellular homologs, I, is supported with a high bootstrap value (i.e. ≥ 80% for (V, I)]). As the reconstructed tree is unrooted, V could represent an outgroup of I and other cellular homologs. In this case, HGT between viruses, V, and cellular organisms, I, is not required to account for the gene tree topology. To eliminate this possibility, we demanded one of the following two additional criteria for potential HGT events.
- From independent evidences, one can chose a group of sequences, O1, representing an outgroup of I.
- From independent evidences, one can chose a group of sequences, O2, representing an outgroup of I and O1. [Consequently, the tree topology will become (O2, (O1, (V, I))), when rooted by O2.]
- There are only a few species and/or genera in I.
- Other sequences outside the (V, I) group are from a wide variety of genera.
- The branches from the root of (V, I) to V and I are relatively shorter than the branches from the root of (V, I) to other peripheral nodes.
We thank Prof. Y. Nakamura for providing the computer program for the identification of cA genes. We also thank Dr. Pascal Hingamp for his detailed reading of the manuscript. We thank three anonymous reviewers for their comments. AM is partially supported by the EuroPathoGenomics European network of excellence. This work was partially supported by Marseille-Nice Genopole and the French National Network (RNG)
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.