Sequence analysis of an Archaeal virus isolated from a hypersaline lake in Inner Mongolia, China

Background We are profoundly ignorant about the diversity of viruses that infect the domain Archaea. Less than 100 have been identified and described and very few of these have had their genomic sequences determined. Here we report the genomic sequence of a previously undescribed archaeal virus. Results Haloarchaeal strains with 16S rRNA gene sequences 98% identical to Halorubrum saccharovorum were isolated from a hypersaline lake in Inner Mongolia. Two lytic viruses infecting these were isolated from the lake water. The BJ1 virus is described in this paper. It has an icosahedral head and tail morphology and most likely a linear double stranded DNA genome exhibiting terminal redundancy. Its genome sequence has 42,271 base pairs with a GC content of ~65 mol%. The genome of BJ1 is predicted to encode 70 ORFs, including one for a tRNA. Fifty of the seventy ORFs had no identity to data base entries; twenty showed sequence identity matches to archaeal viruses and to haloarchaea. ORFs possibly coding for an origin of replication complex, integrase, helicase and structural capsid proteins were identified. Evidence for viral integration was obtained. Conclusion The virus described here has a very low sequence identity to any previously described virus. Fifty of the seventy ORFs could not be annotated in any way based on amino acid identities with sequences already present in the databases. Determining functions for ORFs such as these is probably easier using a simple virus as a model system.


Background
The three domain description of cellular life on earth, Eukarya, Bacteria and Archaea is a firmly established biological tenet [1]. Each domain has an associated, probably vastly diverse, virus population [2][3][4][5][6]. Thousands of viruses infecting representatives of the domain Eukarya have been described and many of their DNA/RNA genomic sequences determined [7]. Something like 5-6000 viruses (bacteriophages) infecting representatives of the domain Bacteria have been described, at least morphologically, although rather fewer DNA/RNA genomic sequences have been determined [8]. In contrast we are largely ignorant about viruses infecting representatives of the domain Archaea. Just 40 or so have been described and the genomic sequences of only a few have been determined, sixteen being listed in Genbank. All archaeal viruses so far discovered have dsDNA genomes, both linear and circular [8,9]. Archaeal viruses having an RNA genome have not yet been identified and perhaps do not exist [9].
The domain Archaea is divided into four established kingdoms, the Crenarchaeota, the Euryarchaeota, the uncultivated Korarchaeota and the very recently identified Nanoarchaeota [10,11]. Virus particles associated with the first two phyla have been identified, recently reviewed in [9]. About 24 viruses of crenarchaeotes have been identified, often with unusual shapes, e.g. droplets and bottle shapes never observed elsewhere; these viruses have no obvious relationship to phage infecting members of the domain Bacteria [8,9]. Similarly about 20 viruses infecting members of the Euryarchaeota have been identified of which 15 infect haloarchaea, recently reviewed in [12]. These are mostly head/tail viruses of the order Caudovirales, including myoviruses and siphoviruses that may be distantly related to those infecting the domain Bacteria [8,9]; although other morphotypes have also been observed [12]. Only six viruses of the haloarchaea have been sequenced. All were isolated by the Dyall-Smith laboratory in Melbourne, from hypersaline sources in Australia, except for φCh1. φCh1, a temperate myovirus with a 58.5 kb linear genome, the host of which is the haloalkaliphile Natrialba magadii [13] was isolated from a laboratory strain and presumably originates, like the host, from Africa. Lytic viruses HF1 and the closely related HF2, having linear genomes of 75.9 kb and 77.7 kb, infect the haloarchaea Haloferax lucentense and Halorubrum coriense respectively [14,15]. His1 and the distantly related His2 spindle shaped viruses with linear genomes of 14.5 and 16 kb respectively, both have lytic and carrier status in Haloarcula hispanica [16]. Finally a lytic icosahedral virus SH1, having a linear genome of 31 kb infects Har. hispanica [17,18].
We have been studying both archaeal and bacterial prokaryotic diversity in Chinese salt lakes in Inner Mongolia; as part of this study we looked for virus particles associated with haloarchaea. In this report we describe the complete genomic sequence of a ~43 kb virus BJ1.

Description of site and lake water parameters
Lake Bagaejinnor is a hypersaline lake in Inner Mongolia, China [coordinates N45 08 527 E116 36 167]. The lake was sampled in September 2003. It had substantially evaporated over the summer, exposing expanses of [pink salt -encrusted] mud flats and had been reduced to small pools and lagoons of salt -saturated colourless water, pH 8.5. The pink colouration of the salt crystals indicated the presence of haloarchaea. The chemical composition of lake water was determined using laser inductively coupled plasma optical emission spectrometry by the Department of Geology Obviously this is a seasonal chemical analysis of the lake water, the composition of which continually varies, more dilute in spring following the winter thaw and then gradually becoming concentrated by the hot summer winds. We used trial and error techniques to find an appropriate medium where we could pour both top and bottom agars. Medium composition was influenced by very high salt concentrations interfering with agar solidification and causing "salting out" of some of the components. The eventual salt composition of this medium was identical to that determined for the lake above with the following exceptions; Na was at 2.85 M, Cl was at 2.6 M, S was at 0.642 M, Ca and Li were omitted completely.

Identification of a haloarchaeal host
Virus BJ1 was isolated from the water column of Lake Bagaejinnor and propagated using strain BJ1 B11. The host was characterised by 16S rRNA gene sequence using both forward and reverse primers, giving 1305 bp of sequence [EMBL: AM412370]. Strain BJ1 B11is most closely related, at 98% identity, to Halorubrum saccharovorum with 1289 identical nucleotides. It is also closely related to Hrr. lacusprofundi (1283 identical nucleotides) and the recently described Hrr. aidingense (1286 identical nucleotides). All three species were originally isolated from hypersaline environments, a salt pan in San Francisco, Deep lake Antarctica and Xin-Jiang in China respectively [19,20].

Plaque morphology
Plaques for BJ1 required one to two weeks to appear on plates because the host is slow growing. Plaque size for BJ1 was variable between experiments ranging from 1-5 mm in diameter, probably due to slight changes in growth conditions; they were also irregularly shaped and turbid. No attempt was made to optimise plaque formation by modifying temperature, salt concentrations or host strain.

Electron microscopy
Virus BJ1 has an icosahedral head, collar and tail, (Fig. 2). The icosahedral head usually has an electron dense shadowing in the centre. The sizes of these features are shown in the schematic diagram Fig 2. The length of a single vertex is 28 nm. The average length of an entire virus particle is about 127 nm. The virus appears to be non-contractile and can be tentatively assigned to the Siphoviridae family, (see the Discussion).

Characterisation of virus genome
The genomic nucleic acid was tested for susceptibility to various nucleases (Fig 3) Control experiments showed that no virus -associated nucleases were responsible for the degradation observed in these experiments. Fig. 3, lanes 1 and 4 show undigested genome controls, lane 2 shows that the genome was sensitive to DNase I digestion and lane 3 shows that the genome was insensitive to RNAse A. Susceptibility to a wide range of double strand -specific endonucleases i.e BamHI, HaeIII, SstI and XhoI, confirmed that the DNA was double stranded e.g. (Fig 3, panel c). Exonuclease III, specific for linear or nicked circular dsDNA, failed to cut circular double stranded DNA plasmid DNA controls (not shown) but substantially degraded virus genomic DNA (Fig 3 panel a, lane 5). Thus BJ1 probably has a linear dsDNA genome, although the possibility that it is a nicked circular genome cannot be completely ruled out.
Genomic nucleic acid ran on 1.2% TAE agarose gels as a discrete single band larger than a 23 kb DNA marker band. (data not shown). PFGE also suggested a genomic size greater than 23 kb but less than 48 kb (Fig 3, panel b). BamH1 digestion of the genomic DNA gave 21 discrete bands ranging in size from 6.5 kb to ~500 bp (Fig 3,    563, and 90 bps, with sizes in close agreement to those we observed. Thus the genomic DNA is not subject to methylation at BamHI sites.

Genome sequence of BJ1
See Figure 4 and Table 1. The double stranded genomic DNA isolated from virus particles is shown as a circular sequence 42, 271 bp long with a G+C content of 64.8 mol% [EMBL: AM419438]. Exonuclease III susceptibility showed that the DNA is linear but sequence assembly indicated it to be circular. This indicates that the genome is terminally redundant (and may be circularly permuted). It is unclear if the BJ1 genome ever forms a circular molecule but if it does then cos sites are unlikely to be involved as digests with three infrequent cutting restriction enzymes (HindIII, EcoRV and EcoRI) followed by melting at 80°C failed to show any change in the number of bands compared to un-melted digests (data not shown).
In the absence of an obvious end for the genome from our sequencing experiments we analysed the cumulative GC skew of the sequence (Fig. 4). Skew minima and maxima often represent initiation and termination points of DNA replication in prokaryotes and viruses with a cumulative increase in skew related to the direction of replication and transcription [21]. A clear maximum was observed at about 43000 followed by a sharp change with the minima from 1-8000. This in conjunction with the ORF map and pattern of operons was used to designate a +1 start of the genome (Fig. 4). The cumulative GC skew is consistent with the reading direction of most ORFs and a rolling circle pattern of DNA replication. A single tRNA for phenylalanine (GAA anticodon recognising a UUC codon) was identified using the tRNAscan-SE program. Potential ORFs were assigned using the programs FGENESB and GeneMark.hmm v2.5a (set for prokaryotes); these analyses predicted 63 and 66 ORFs, respectively, encoding polypeptides larger than 30 amino acids. We further analysed the regions upstream and downstream of these predicted ORFs for putative ribosome binding sites and overlapping start and stop codons, and found several additional ORFs. BLAST searches using the amino acid sequences of all predicted ORFs were used to differentiate between possible genes e.g. ORFs 5 and 6 have matches (see below), so putative ORFs in the opposite strand with no BLAST matches have been discounted. By combining all of the data we conclude that BJ1 probably contains 70 ORFs (Fig 4 and Table 1). [If we only count ORFs greater than 60 aa in size then the number of ORFs drops to 55]. Taking the upper estimate of 70 gives an ORF density of 1.65/kb. This is fairly close to the figure of 1.7 ORFs/kb observed for other archaeal virus genomes (17). The majority of the ORFs have initiation codons of ATG (62) and the rest are GTG (8).
The Shine/Dalgarno sequence from Halobacterium (Halorubrum) saccharovorum 16S rRNA gene sequence (Accession HSU17364), which is the closest phylogenetic match to the phage host was complemented (AGGAGGUGA) and used to search 5-15 bp upstream of each putative start site for the presence of putative ribosome binding sites (RBS). 51 of the 70 ORFs had sequences suggestive of a RBS, (Table 1). One particular stretch of 6 predicted ORFs (ORF43-ORF48) showed no obvious RBSs at all. A lack of a RBS for some genes is not surprising as archaeal transcription/translation is a mosaic of prokaryotic and eukaryotic mechanisms and the first gene of an operon, or a singly transcribed gene often lacks a RBS [22][23][24].
The majority of the ORFs (59/70) had a low calculated isoelectric point (pI < 5), which is similar to the acidic proteins of halophilic organisms [15,25]. Just three small ORFs (less than 74 aa) were predicted to be extremely basic (pI > 10). No ORF larger than 100 aa had a pI above 6.3. 63 ORFs and the tRNA are coded on one strand (designated forward) and 7 are on the reverse strand. One ORF, 30 (13255-14700 bp) overlaps entirely with another, ORF31 (13270-14487 bp), running in the opposite direction. It seems probable that both ORFs are coding, ORF30 because it overlaps with the start and stop codons of the ORFs before and after it i.e. 29 and 32, with a good consensus RBS; ORF31 because it shows significant homology to integrases, (see below).

BJ1 ORF analysis
BlastN analysis of the whole virus genome showed significant matches (E 10 -9 to 10 -4 ) to small segments of several haloarchaeal sequences i.e. Natronomonas pharaonis, Halobacterium sp. NRC-1 and Har. marismortui. BlastX analysis identified four regions of the genome having significant matches to data-base proteins either from haloviruses or haloarchaea, discussed below. The putative ORFs were individually analysed using BlastX and BlastP. InterPro was also used to search for functional domains. Using these approaches we were unable to ascribe any match or function to 50 of the 70 ORFs i.e. E values were greater than 0.05. Of the 20 we could match i.e. E value less than 0.05, most were to haloarchaeal virus entries or to haloar-chaea. These results are summarised in Table 2. Of these 20, 4 were matches to data-base entries with no identifiable function, i.e.: ORF9, ORF10, ORF17, ORF55 and ORF 24.
The remaining 15 ORFs could have functions tentatively ascribed to them on the basis of amino acid similarity, ( Table 2). We place them into three groups.

Nucleotide features
Nine direct repeats were observed greater than 13 nucleotides; the largest was 17 nucleotides, i.e. GGCGGCATC-CAACTCGG repeated at positions 34076 and 34120. All of the repeats were located in putative ORFs and we can infer nothing of significance for them. A number of per-    42048 ACTATCCGACtggGTCGGATAGT 42070 ; again both are present in putative ORFs and their significance is unclear although the last palindrome is located 209 nucleotides from the 3' end of the genome. The BJ1 genome has a low incidence of CTAG and GATC sequences, just three of each of these palindromes being present. This incidence is low, both compared to the statistically expected incidence, (every 256 base pairs) and compared to the related tetramers CGAG and GCTC which were both found 36 times. CTAG and GATC sequences appear to be selected against by many haloviruses e.g. these palindromes are absent from the genomes of HF1, HF2, His2 and SH1 [6]. This selection pressure is thought to be due to the avoidance of restriction-modification systems in the host cells [26], and there is evidence that CTAG and GATC palindromes are used by haloarchaeal systems [27,28].

Sequence heterogeneity
BamHI digests of virion DNA gave rise to a fragment of about 3.5 kb, as judged by agarose gel electrophoresis, present in sub-stoichiometric amounts relative to the other bands, indicated by the black arrow in Fig 3c. This was fully sequenced and found not to fit into our genome assembly. Primers derived from this sequence were used with virus sequence primers and virus genomic DNA as a template. Products were observed with primers derived from the 3' end of ORF 32, suggesting that a minor subfraction of virion DNA did contain this BamHI fragment. Sequencing showed that the site of insertion was at nucleotide 14790 in ORF 32 and showed that this part of ORF 32 was rich in CGX repeats, (Table 3). We have not yet been unable to derive PCR products defining either the location or 5' end of the insertion/substitution. Instead we have primer walked out from the defined 3' end of the insertion. ~8.7 kb of sequence has been determined [EMBL: AM491333] having a G+C content of 72.6 mol%, notably higher than the 64 mol% determined for the rest of the virus genome and close to that reported for Hrr. saccharovorum (71 mol%). Predicted ORFs have much higher homologies to known haloarchaeal proteins than the other viral ORFs, (Table 3). We think it most likely this sequence is derived from the host genomic DNA due to an integration/excision event.

Discussion
Morphological criteria used for virus classification is outlined by the International Committee for Taxonomy of Viruses [7]. Virus BJ1 is an icosahedral head/tailed virus and as such is assigned to the order Caudovirales with examples infecting members of both the domains Bacteria and Archaea. BJ1 can also be assigned to the Bradley classification group B and might tentatively be assigned to the family Siphoviridae due to the apparent absence of a contractile tail, base plate and tail fibres and the presence of striations in the tail fibre. If we assume that this classification is phylogenetically justified then it could indicate that the Caudovirales originated before the divergence of the Bacteria and Archaea [29]. An alternative explanation is that the Caudovirales originally infected members of the domain Bacteria but that horizontal gene exchange from mesophilic Bacteria to the Archaea and the subsequent stabilisation of these genes in the Archaea allowed the Caudovirales to spread into the domain Archaea [Certainly we have detected diverse bacterial populations in the water of Lake Bagaejinnor, SH unpublished] [9].
As described in the Introduction, very few viruses infecting the domain Archaea have been described and as yet we have little idea as to the extent of virus diversity in this domain. The virus we describe here may not be a common or dominant member of the virus community infecting haloarchaea in saline waters. We screened for lytic virus Orfs are in the forward direction unless indicated by a -ve sign. v indicates a valine start. aa indicates the number of amino acids. Mr is the molecular mass × 10 -3 , rounded to the nearest 100. pI is the isoelectric point rounded to one decimal place. rbs/distance is the ribosome binding site sequence and its distance from the start codon. particles forming plaques on archaeal lawns. These requirements for host culturability, good lawn formation and plaque formation are probably extremely restrictive.
As pointed out by others, there is a genuine need to develop other isolation and culture techniques to study both the dominant virus populations and the true extent of archaeal virus variation in samples such as these -perhaps using a combination of electron microscopy and metagenomic sequence studies.
The GC content of BJ1 at 65 mol% is quite close to that reported for Hrr. spp aidingense, lacusprofundi and saccharovorum, varying from about 63-71 mol% [19,20]. The host strain for BJ1 clearly belongs to the genus Halorubrum having 98% 16SrRNA gene sequence identity to these Halorubrum species. Its precise taxonomic relationship to these species, in particular if it belongs to a new Halorubrum species is the subject of current studies.
Of the ORFs identified in BJ1 described in the results, all of the statistically significant matches are recorded, ( Table  2). Six of the ORFs (9, 20, 50, 52, 53, 55) are most closely related to the haloarchaeal temperate, isometric head/ contractile tail viruses φCh1 [13] and the intensively studied, φH [30]. These two viruses are closely related to each other, the completed genome of φCh1 shows 97% homology to the genome of φH, which is about 60% complete. Although BJ1 stocks are clonal in origin, the genomic DNA preparation is obviously and necessarily derived from a virus pool. Genome sequence projects often therefore give rise to heterogeneous sequences. We found one substantial region of heterogeneity in ORF 32 at nucleotide 14790 involving either a large insertion or more probably a substitution event (since terminally redundant virus genomes usually package genomes in a 'head full' mechanism). To distinguish between these possibilities requires more sequencing. The variant sequence probably involves the acquisition of host derived DNA since the GC content is higher (72.6%) than that of the virus (64.8%) and close to that reported for Hrr. saccharovorum (71%). Obviously this insertion/substitution has taken place about 300 nucleotides away from the putative integrase gene. The integrase gene in viruses is often the site of insertion as well. We speculate that this variant sequence in the virus population is the result of an integration/excision event (possibly aberrant) during the virus infection to prepare genomic DNA. This may indicate that BJ1 is a lysogenic virus; plaques were certainly turbid consistent with this suggestion but further experiments will be required to prove it. Whether the virus population with this variant sequence is viable will also require further studies. Certainly virus populations with insertions and or substantial genomic deletions can be viable or at least rescued by functional virus genomes.
Many interesting features remain to be discovered about the BJ1 virus. Optimal growth conditions for this virus need to be established and its host range determined. This will facilitate studies on its environmental stability, patterns of transcription, protein functions, lysogenic potential and the viability of the variant virus. Assignment of protein functions to ORFs which cannot be assigned any function based on sequence identity is probably easier using a virus as a model than any other genome. A systematic effort on this front will reduce the number of unclassified ORFs that metagenomic and archaeal sequencing projects so often throw up.

Cultivation of prokaryotes from environmental samples
Isolates were grown on a modified Classic Halophile Medium (mCHM) broth, [31]. This was made in two components; component 1 contains 1% (w/v) yeast extract, 0.75% (w/v) casamino acids, 0.248% (w/v) KCl and 0.3% (w/v) trisodium citrate; component 2 contains 0.162% (w/v) Na 2 B 4 O 7 , 0.084% (w/v) NaBr, 7.116% (w/ v) MgCl 2 .7H 2 O, 13% (w/v) NaCl, 4.56% (w/v) Na 2 SO 4 , 0.062% (w/v) NaHCO 3 and 0.036% (w/v) Na 2 CO 3 , pH 8.0. Both components were autoclaved separately and mixed once cooled to 60°C, then stored at room temperature. 2% (w/v) agar was added to component 1 if required to make mCHM agar plates, while 0.7% (w/v) agar was added to component 1 to make soft top agar. Prokaryotes were cultivated from brine, salt or sediment samples. Brine was filtered on site through sterile 0.45 µm membrane filters in a 250 ml capacity polycarbonate filter unit (Sartorius) using a Nalgene hand pump until flow stopped. Membrane filters were immediately placed in cold sterile stabilisation buffer (10 mM Tris-HCl, pH 8.0, 1 mM EDTA, 2 M NaCl) and agitated to resuspend the cells. Filtered waters were placed in sterile falcon tubes. Samples were placed immediately on ice until they could be stored at -20°C, usually within 6 hours of collection. Either, cell suspensions from agitated filters were serially diluted and plated onto mCHM agar plates, or about 0.5 g sediment and salt crust was resuspended in 0.5 ml of mCHM and serial dilutions plated onto the mCHM agar plates. These were incubated for two months at 37°C and were periodically checked for the appearance of new colonies which were picked and grown on fresh plates. Subculturing was continued on the same medium until purity was achieved. Isolated colonies were then grown in  [32] and rP1 5'-ACG GHT ACC TTG TTA CGA CTT-3', [33] were used. Reaction conditions were: 95°C for 2 min, followed by 30 cycles of 95°C for 30 s, 50°C for 40 s and 72°C for 2 min, followed by 10 min extension time at 72°C. PCR products were purified using the QIAquick PCR Purification Kit (Qiagen) and stored at -20°C until required. DNA sequencing, also see below, was done by Lark Technologies, Cambridge UK using 27Fa and rP1 primers described above (corresponding to nucleotides 27-1492 with E. coli as the reference sequence). The DNA sequences were analysed using the BLASTN homology search program [34], which is available at the National Centre for Biotechnology Information to identify close matches.
Strains were placed on a phylogenetic tree using Molecular Evolutionary Genetics Analysis (MEGA) version 3.1 [35], using the Jukes and Cantor nucleotide substitution model for sequence alignment and the Neighbour-Joining method of tree inference. The support for each node was determined by assembling a consensus tree of 500 bootstrap replicates. plates typically yielded 1-2 µg nucleic acid.

Genome characterisation and sequencing
1 µg virus nucleic acid was treated with either excess DNase I (NEB), RNase A (Sigma) or Exonuclease III (NEB) in the manufacturers reaction buffer and incubated at 37°C for 10 min, 60 min or 30 min respectively. Reactions were electrophoresed on Tris-Acetate-EDTA (TAE) agarose gels and stained with SYBR green. Viral nucleic acids were ran on a 1% agarose pulse field gel (BioRad) in 0.5× TBE buffer at 14°C in a CHEF DR-II apparatus (Bio-Rad). The run time was 22 h with a voltage gradient of 6 V/cm and a linearly ramped pulse time of 50 to 90 s at an angle of 120°.
BJ1 genomic DNA was digested with BamHI (giving approximately 20 fragments ranging in size from 100 bp to 5 kbp, and cloned into BamHI-digested pUC18NotI vector [37]. Resulting clones were sequenced using vectorspecific oligonucleotide primers pUCF, 5'-GTTTTC-CCAGTCACGACGTTG-3' and pUCR, 5'-CACAG-GAAACAG CTATGACC-3'; these sequences were used to design further primers to primer walk across the clones. The high G+C content (~65 mol%) of the initial sequences was used to identify restriction enzymes that would likely cut the phage genome to give smaller (on average 500-1000 bp) fragments. Secondary libraries of SstI and XhoI fragments were created in pUC18NotI and representative clones of these libraries were sequenced using pUCF and pUCR and subsequent primer walking. Finally the remaining gaps were filled by designing primers to the ends of the larger contigs, orientating these contigs by PCR using phage genome as template, and then primer walking out from the contigs using the PCR amplified products as sequencing template. The genomic sequence was assembled using the Lasergene SeqMan 7.0 program (DNAStar). Final coverage of the genome was 4fold with the majority sequenced on both of the strands or, where bidirectional sequencing was impractical, with multiple sequence runs on the same strand.

Bioinformatics
Potential ORFs were assigned using the programs FGENESB [38] and GeneMark.hmm v2.5a [39]. tRNA sequences were identified using the tRNAscan-SE program in [40]. Translations of potential ORF sequences to amino acids were made with the SeqBuilder program (DNAStar). Statistics for each of the ORFs were calculated using the program ProtParam [41].
GC skew was calculated using the online base composition tools at [42]. BLAST (blastp and tblastn) and PSI-BLAST [43] were used to search for possible homologies to known proteins, or proteins predicted by translation of the unannotated DNA sequence in GenBank. Inverted repeats in the DNA sequence were identified using Einverted [44] and PALINDROME [45]; direct repeats were located using Palim [46].