Low-pass sequencing for microbial comparative genomics

Background We studied four extremely halophilic archaea by low-pass shotgun sequencing: (1) the metabolically versatile Haloarcula marismortui; (2) the non-pigmented Natrialba asiatica; (3) the psychrophile Halorubrum lacusprofundi and (4) the Dead Sea isolate Halobaculum gomorrense. Approximately one thousand single pass genomic sequences per genome were obtained. The data were analyzed by comparative genomic analyses using the completed Halobacterium sp. NRC-1 genome as a reference. Low-pass shotgun sequencing is a simple, inexpensive, and rapid approach that can readily be performed on any cultured microbe. Results As expected, the four archaeal halophiles analyzed exhibit both bacterial and eukaryotic characteristics as well as uniquely archaeal traits. All five halophiles exhibit greater than sixty percent GC content and low isoelectric points (pI) for their predicted proteins. Multiple insertion sequence (IS) elements, often involved in genome rearrangements, were identified in H. lacusprofundi and H. marismortui. The core biological functions that govern cellular and genetic mechanisms of H. sp. NRC-1 appear to be conserved in these four other halophiles. Multiple TATA box binding protein (TBP) and transcription factor IIB (TFB) homologs were identified from most of the four shotgunned halophiles. The reconstructed molecular tree of all five halophiles shows a large divergence between these species, but with the closest relationship being between H. sp. NRC-1 and H. lacusprofundi. Conclusion Despite the diverse habitats of these species, all five halophiles share (1) high GC content and (2) low protein isoelectric points, which are characteristics associated with environmental exposure to UV radiation and hypersalinity, respectively. Identification of multiple IS elements in the genome of H. lacusprofundi and H. 
marismortui suggests that genome structure and dynamic genome reorganization might be similar to those previously observed in the IS-element-rich genome of H. sp. NRC-1. Identification of multiple TBP and TFB homologs in these four halophiles is consistent with the hypothesis that different types of complex transcriptional regulation may occur through multiple TBP-TFB combinations in response to rapidly changing environmental conditions. Low-pass shotgun sequence analyses of genomes permit extensive and diverse analyses, and should be generally useful for comparative microbial genomics.


Background
Extremely halophilic archaea inhabit hypersaline environments containing three to five molar salts. These environments are widely distributed, and include solar saltern facilities, the Dead Sea coast, and brine inclusions. The environments are diverse with respect to salinity, pH, temperature, pressure, light, and oxygen. This wide range of conditions contributes to the diversity of halophile barophilic, alkaliphilic, and psychrophilic characteristics in addition to their obligate halophilicity. Extreme halophiles are classified as members of the Halobacteriaceae family and further organized into nineteen genera [1]. Halophiles are a relatively poorly studied branch of the archaea with only one complete genome sequence, that of Halobacterium sp. NRC-1 [2].
The genome of NRC-1 consists of a large chromosome of 2,014,239 bp and two additional minichromosomal replicons of 365,425 bp (pNRC200) and 191,346 bp (pNRC100). The two minichromosomes are less GC-rich than the large chromosome (57.9% for pNRC100 and 59.2% for pNRC200 vs. 67.9%). Averaged across all three replicons, H. sp. NRC-1 has an overall GC content of 65.8%. Identification of ninety-one insertion sequence (IS) elements representing twelve different IS families is consistent with the dynamic genome rearrangements observed, mediated by the multiple IS elements [3]. Genome sequence analysis identified 2,682 putative protein-coding genes. Among these, counting each identical multigene family as one gene, 2,413 genes are unique. Analysis of the genome sequence and the predicted proteome reveals that H. sp. NRC-1 is more similar to other archaeal species than to non-archaeal species. H. sp. NRC-1 possesses similarities to bacteria in genes coding for aerobic respiration and in overall genomic structure, and to eukaryotes in genes coding for DNA replication, transcription, and translation [2,4]. H. sp. NRC-1 predicted proteins are extremely acidic; protein acidity is associated with enhanced solubility and activity in hypersaline cytoplasm [4].
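The overall GC figure quoted above is a length-weighted average over the three replicons. The following minimal sketch reproduces that arithmetic from the sizes and percentages given in the text; the variable names are illustrative, and assigning 57.9% to pNRC100 and 59.2% to pNRC200 is an assumption based on the order in which the plasmids are listed in the GC composition section. Because the inputs are rounded, the result only approximates the published value of about 66%.

```python
# Replicon sizes (bp) and GC fractions as quoted in the text.
# Assumption: 57.9% belongs to pNRC100 and 59.2% to pNRC200.
REPLICONS = {
    "chromosome": (2_014_239, 0.679),
    "pNRC200": (365_425, 0.592),
    "pNRC100": (191_346, 0.579),
}

def weighted_gc(replicons):
    """Length-weighted mean GC fraction over all replicons."""
    total_bp = sum(size for size, _ in replicons.values())
    gc_bp = sum(size * gc for size, gc in replicons.values())
    return gc_bp / total_bp

overall_gc = weighted_gc(REPLICONS)
print(f"Overall GC content: {overall_gc:.1%}")
```

An unweighted mean of the three percentages would understate the overall value, because the GC-rich chromosome contributes most of the bases.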
Some recently isolated species exhibit additional interesting features. Halorubrum lacusprofundi, isolated from Deep Lake, Antarctica, is psychrophilic (cold loving) and grows at temperatures as low as four degrees Celsius [5]. Some halobacterial species are alkaliphilic. Some are acid-tolerant. Natronobacterium pharaonis from Wadi Natrun, Egypt and Natronococcus occultus from Lake Magadi, Africa have pH optima ranging from 9.5 to 10 and do not grow below pH 8.5. Slight acidophiles, such as Haloferax volcanii and Haloferax mediterranei, grow at pH values as low as 4.5. Unlike other Halobacterium species, Natrialba asiatica and Natrialba magadii are not pigmented; they contain less than 0.1% of bacterioruberins, which give most halophiles their red-orange color [6].
Membership in a large and diverse family, adaptation to unique environments, ease of culture, and the availability of genetic tools make the halophiles one of the best model systems to study microbial diversity and adaptation. To advance this model system, we previously determined the first complete halobacterial genome sequence, that of H. sp. NRC-1 [2]. In the present study, we analyzed the halophile family more broadly through a comparative genomic approach based on low-pass sequencing. We scanned the genome sequences of four additional species: (1) Haloarcula marismortui, a metabolically versatile halophile [7], (2) Natrialba asiatica, a non-pigmented alkaliphile [8], (3) Halorubrum lacusprofundi, a psychrophile [5], and (4) Halobaculum gomorrense, an isolate from a depth of 4 meters in the Dead Sea [9]. These organisms were studied by random shotgun sequence analysis. This analysis consisted of generating approximately one thousand genomic sequence reads, which provided partial genomic scans and permitted comparative genomic analyses with the completely sequenced H. sp. NRC-1 genome as a reference.
Low-pass genomic sequencing is twenty-fold less expensive than complete genome sequencing, yet provides data for useful comparative genomic analyses. It was a natural choice for the study of the intriguing Halobacteriaceae family.

Validation of low-pass sequencing
The genome of H. sp. NRC-1 is very compact (approximately 87% coding region) and the sequences of most random genomic fragments contain genes [2]. To evaluate the utility of low-pass sequencing for the comparative analysis of Halobacteriaceae genomes, we hypothesized that 1) the partial genome sequences generated by shotgun sequencing would be sufficient to identify the genes contained in most fragments and 2) a large number of different unique genes would be identified from ~1,000 sequence reads.
In order to test our hypotheses, 1,085 sequences of 450 bp average length were extracted from the actual H. sp. NRC-1 genome [10], with each sequence having a uniform probability of starting at any base of any of the three chromosomes, which is a reasonable approximation of the sonication process by which fragments are generated for shotgun sequencing. These 1,085 sequences represent approximately 0.2-fold coverage of the genome. The sequences were then used to search the protein nonredundant (nr) database on the NCBI server (http://www.ncbi.nlm.nih.gov/BLAST) with BLAST, using default parameters. Using expert curation of the graphically visualized alignments, a total of 1,094 genes were identified from 1,057 of the sequences. Some reads spanned more than one gene. Twenty-eight of the 1,085 sequence reads did not have database gene matches; they were from the intergenic regions of the H. sp. NRC-1 genome. Several of the genes were overlapped by more than one random sequence; of the 1,094 genes identified, 820 were unique.
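The sampling scheme and the coverage estimate described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: every read is given the same 450 bp length, replicon circularity is ignored, and the function names, the seed, and the Poisson approximation for the fraction of bases sampled are all additions for the example.

```python
import math
import random

# Replicon sizes (bp) taken from the text.
REPLICONS = {"chromosome": 2_014_239, "pNRC200": 365_425, "pNRC100": 191_346}

def sample_read_starts(n_reads, read_len, replicons, seed=0):
    """Draw read start positions uniformly over all bases of all replicons."""
    rng = random.Random(seed)
    names = list(replicons)
    # Number of valid start positions per replicon (read must fit; circularity ignored).
    weights = [replicons[name] - read_len + 1 for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        start = rng.randrange(replicons[name] - read_len + 1)
        reads.append((name, start))
    return reads

def fold_coverage(n_reads, read_len, genome_size):
    """Total sequenced bases divided by genome size."""
    return n_reads * read_len / genome_size

genome_size = sum(REPLICONS.values())
reads = sample_read_starts(1085, 450, REPLICONS)
coverage = fold_coverage(1085, 450, genome_size)
# Poisson approximation of the fraction of bases hit by at least one read.
fraction_sampled = 1 - math.exp(-coverage)
print(f"{coverage:.2f}-fold coverage, about {fraction_sampled:.0%} of bases sampled")
```

Weighting each replicon by its number of valid start positions is what makes the start position uniform over the whole genome rather than uniform per replicon.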
The validation study supports our hypothesis that many genes can be successfully identified using partial genome sequences, and also validates the gene identification methods which were further applied to analyze database search results with the actual partial genome sequence reads of the four halophiles, described in the next section.
The non-redundant database contained H. sp. NRC-1 proteins at the time this search was done. Therefore, we ran the risk of overestimating the number of genes that we might identify in the four halophiles to be shotgunned. However, excluding H. sp. NRC-1 proteins would create an underestimate, as there were few other sequences from Halobacteriaceae in GenBank. For example, only 329 TREMBL genes are identified with BLASTX using the BLOSUM80 matrix in our initial 1,085 random sequences when all Halobacteriaceae sequences are excluded (in this case, counting no more than one gene per sequence, and approximating by equating a gene with any hit with an E-value less than or equal to 0.01). We assayed the variability of this statistic by repeating the search on an independently generated set of another 1,085 random sequences; 304 TREMBL genes were identified. Including the Halobacteriaceae TREMBL proteins, 881 and 892 genes were identified, respectively, with 95% of the additional identifications due to alignments with H. sp. NRC-1 proteins. For our experimental planning purposes, we felt that retaining the H. sp. NRC-1 proteins would result in less error than excluding them. The resulting approximation was sufficient for us to proceed with actual sequencing. However, our results would not necessarily be precisely predictive of gene prediction power in other organisms. Such predictive power would depend on the number of closely related sequences in GenBank.

Overview of sequencing
Sequencing was conducted with the M13 forward primer on shotgun libraries prepared using sonicated genomic DNA from each of the four halophiles. The traces were processed with PHREDPHRAP (http://www.phrap.org). This procedure produced 1,097 high-quality sequences for H. marismortui, 1,085 for H. lacusprofundi, 1,104 for N. asiatica, and 1,170 for H. gomorrense. The initial 1,085 random H. sp. NRC-1 sequences described above in the validation study were used in conjunction with the newly sequenced halophile sequences in most of the subsequent analyses described here. The number of high-quality bases per read ranged from 40 to 750 for all genomes. The average lengths of the high-quality sequences were 524, 441, 469, and 415 bases, respectively (Figure 1). These represent approximately 15%, 18%, 17%, and 18% coverage of each genome, respectively.

Gene prediction
The DNA sequences generated by the shotgun method are usually not long enough to contain complete genes including start and stop codons. For this study, "best match" open reading frames were determined as described in Methods. When multiple sequence reads with different reading frames matched to the same gene, the match results of the sequence reads were compared by the percent identity, length of the alignments, and p value. The best match was then selected and the reading frame of the sequence was considered the "best match" ORF representing that gene.
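The selection of a "best match" ORF per gene can be sketched as below. The text names the three criteria (percent identity, alignment length, and p value) but does not say how they were weighed against each other, so the ordering used here (smallest p value first, then highest identity, then longest alignment) is an assumption. The `Hit` record, its field names, and the toy hits are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    read_id: str         # sequence read that produced the match
    gene: str            # database gene the read matched
    frame: int           # reading frame of the match on the read
    pct_identity: float  # percent identity of the alignment
    aln_len: int         # alignment length in residues
    p_value: float       # significance reported by the search

def rank_key(hit):
    # Assumed ordering: lower p value, then higher identity, then longer alignment.
    return (hit.p_value, -hit.pct_identity, -hit.aln_len)

def best_match_per_gene(hits):
    """Keep one hit per gene: the one that ranks first under rank_key."""
    best = {}
    for hit in hits:
        current = best.get(hit.gene)
        if current is None or rank_key(hit) < rank_key(current):
            best[hit.gene] = hit
    return best

# Invented toy hits, for illustration only.
hits = [
    Hit("read_001", "geneA", 1, 82.0, 140, 1e-40),
    Hit("read_017", "geneA", -2, 90.0, 95, 1e-25),
    Hit("read_042", "geneB", 3, 65.0, 120, 1e-12),
]
best = best_match_per_gene(hits)
```

The reading frame stored on the winning hit is then the frame used to translate that read into its partial ORF.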
Of the 1,097 H. marismortui sequences, 810 sequences (74%) had matches to the nr protein database. Of these, 703 were unique. H. gomorrense had a total of 1,170 sequences, of which 1,006 sequences (86%) had matches to the nr protein database. Of these, 742 were unique. Of the 1,085 H. lacusprofundi sequences, 802 sequences (74%) matched to the nr protein database. Of these, 623 were unique. Of the 1,104 N. asiatica sequences, 862 (78%) were matched to the nr protein database. Of these, 678 were unique. The length range of partial ORFs was 30 to 256 residues with average lengths of 121, 130, 137, and 130 aa, respectively (Figure 2). The non-archaeal best matches were distributed between bacterial and eukaryotic sequences. Thus, like H. sp. NRC-1 and other archaea, these four halophiles have both bacterial and eukaryotic features [11,12].
Sequences with no database matches were not further analyzed. It is possible that these sequences represent genes that are new or unique to each halophile, for which no match would be found in the databases. Alternatively, these sequences may be non-coding, intergenic sequences. About 36% of predicted ORFs in the H. sp. NRC-1 genome did not have any significant match to the nr protein database.
Of the 1,085 randomly selected H. sp. NRC-1 sequences, 1,057 (97.4%) had matches to the nr protein database, while 28 (2.6%) had no matches to the database. Further analysis revealed that all of the 28 sequences were from intergenic regions, consistent with no significant matches with the nr protein database. The 1,057 sequences with database matches identified 820 unique proteins.

Insertion sequence elements
IS elements are repetitive sequences that can broker genome reorganization and mediate evolution by creating insertion mutations, recombinations, and gene conversions. In Halobacterium these elements have expanded to multiple families and subfamilies with multiple copy numbers [13]. Of the ninety-one IS elements representing twelve families in the H. sp. NRC-1 genome, sixty-nine were found in the minichromosomes, including twenty-nine IS elements on pNRC100 and forty on pNRC200. In H. sp. NRC-1, due to genetic instability caused by transposition of insertion sequences, typically about 1% of colonies in a plate are gas-vesicle deficient and 0.1% are carotenoid deficient mutants [13].
In order to determine whether the presence of multiple IS elements is a general property of halophiles, sequences of each halophile were searched for IS elements. H. sp. NRC-1 IS elements were searched against each halophile sequence to identify homologous IS elements. Halophile sequences were also searched against the nr protein database to identify any new IS elements. In the H. lacusprofundi genome, we identified 38 sequences homologous to six different IS families of NRC-1. Ten H. marismortui genome sequences that were homologous to six different IS families of H. sp. NRC-1 were also identified, while only one sequence each from H. gomorrense and N. asiatica was found. These matched to ISH4 and ISH9 of NRC-1, respectively (Table 1). All IS elements identified were homologous to one of the H. sp. NRC-1 IS elements, and no new IS elements were identified. In this analysis using partial genome sequences, multiple sequence reads often matched to a single IS family. This may represent multiple copies of the repeat or multiple sequence reads of the same repeat. Considering that the sequences were all generated from a random shotgun library and sequenced to relatively low redundancy, it is unlikely that several sequences would be derived from the same location. The results indicate that the H. lacusprofundi genome contains the highest number of IS elements.
Figure 1. Distribution of lengths of low-pass sequence reads for the four archaeal genomes. High-quality sequences were obtained by processing the shotgun sequences with PHREDPHRAP.

GC composition
Previous studies of several Halobacterium and Halococcus species found that their genomes contain both relatively GC-rich (66-68%) components and relatively GC-poor (57-60%) components [14]. The genome analysis of H. sp. NRC-1 revealed that the plasmids pNRC100 and pNRC200 contain lower GC content than the main chromosome (57.9% and 59.2% vs. 67.9%). To evaluate whether the presence of DNA components varying in their GC content is a general genome property of the halophiles, sequence reads from each of the halophiles were analyzed for their GC composition (Figure 3). GC analysis of random NRC-1 sequences shows two small peaks at about 56% and 59%, which likely reflect the DNA components of pNRC100 and pNRC200, and a large peak at about 70%, likely reflecting the DNA component of the chromosome. In H. sp. NRC-1, of the sequence reads in the peak of 56% GC content, 62% were from pNRC100; of the sequence reads in the peak of 59% GC content, 73% were from pNRC200.
The environments in which halophiles are found are often areas exposed to extreme levels of solar UV radiation. Active DNA repair mechanisms for damage caused by UV irradiation, such as the formation of thymine dimers, have been previously observed in halophiles [4]. The four newly sequenced halophiles have 60% or higher overall GC content. The GC content of Halobacteriaceae thus may have evolved to minimize DNA damage caused by solar UV irradiation.
Figure 2. Distribution of lengths of partial ORFs derived from low-pass sequences.
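The per-read GC analysis in this section amounts to computing the GC fraction of each read and binning the values, so that replicons of different composition show up as separate peaks. A minimal sketch follows; the 2% bin width, the handling of ambiguous bases, and the toy reads are assumptions made for illustration.

```python
from collections import Counter

def gc_fraction(seq):
    """GC fraction of a read, ignoring N and other ambiguity codes."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called

def gc_histogram(reads, bin_width=2):
    """Count reads per GC-percent bin; keys are the lower edges of the bins."""
    counts = Counter()
    for seq in reads:
        pct = 100 * gc_fraction(seq)
        counts[int(pct // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Invented toy reads; real reads would be several hundred bases long.
toy_reads = ["GGCCGGCCAT", "GCGCGCGCAA", "ATGCATGCGG", "GGGCCCGGGN"]
hist = gc_histogram(toy_reads)
for lower_edge, n in hist.items():
    print(f"{lower_edge:3d}% {'#' * n}")
```

With real reads, a plasmid-derived component would be expected to appear as a smaller peak at lower GC than the chromosomal peak, as described for NRC-1 above.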

Isoelectric point of predicted proteins
Halophiles maintain osmotic balance between their cytoplasms and the hypersaline environment by maintaining ionic distributions of intracellular potassium of about four molar, equal to the external sodium concentration [2,15]. One of the most interesting properties of halophiles is the ability of their proteins to function in such a hypersaline cytoplasm, where mesophilic proteins would become denatured. Early findings showed that the surfaces of halophilic proteins are highly acidic, with negatively charged amino acid residues [16]. This was shown to enable proteins to stay soluble and active through the binding of hydrated salt ions in the solution [16-18]. Accordingly, recent investigations of predicted proteins from H. sp. NRC-1 demonstrate extremely acidic proteins with an overall median isoelectric point (pI) of 4.9 and a strong peak (mode) around 4.2 [4].
For our novel shotgunned sequences, the "best match" ORFs were used to estimate pI of their corresponding proteins. This approach was first validated using partial ORFs of 820 proteins from random H. sp. NRC-1 sequences. The predicted median pI of 4.4, calculated from partial ORFs, was somewhat lower than the median pI of 4.9 calculated [...] (Figure 4).
Figure 3. GC distribution for five partially sequenced genomes.
The prevalence of highly acidic proteins across the four newly sequenced halophiles is consistent with those halophiles employing the acidic protein stabilizing mechanisms for hypersaline environments. Highly acidic proteins are thus a general feature of extreme halophiles, providing the basis for enhanced solubility in the presence of high salts.
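To make the pI estimate concrete, the sketch below computes a theoretical isoelectric point for a peptide by bisection on its net charge and then takes the median over a set of partial ORFs. It is not the tool used in the study: the pKa values are one commonly quoted set and are an assumption, the toy sequences are invented, and treating the ends of a partial ORF as free termini is a simplification.

```python
import statistics

# Assumed pKa values for ionizable groups (one commonly quoted set).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6

def net_charge(seq, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1 / (1 + 10 ** (ph - PKA_N_TERM))
    charge -= 1 / (1 + 10 ** (PKA_C_TERM - ph))
    for aa, pka in PKA_POSITIVE.items():
        charge += seq.count(aa) / (1 + 10 ** (ph - pka))
    for aa, pka in PKA_NEGATIVE.items():
        charge -= seq.count(aa) / (1 + 10 ** (pka - ph))
    return charge

def isoelectric_point(seq, tol=1e-4):
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged: the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2

# Invented toy partial ORFs: two acidic, one basic.
toy_orfs = ["MDEEDLDDAVEEL", "MKRKLAAGHKR", "MSDETLEEAADDG"]
pis = [isoelectric_point(s.upper()) for s in toy_orfs]
median_pi = statistics.median(pis)
```

Bisection works here because the net charge decreases monotonically with pH, so there is exactly one zero crossing between pH 0 and pH 14.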

Conserved core biological functions
Of the unique genes identified through their "best match" ORFs in the newly sequenced halophiles, 56% or more were homologous to H. sp. NRC-1 genes. This finding suggests conservation of genes and their functions between H. sp. NRC-1 and each halophile. In order to identify conserved core biological functions between H. sp. NRC-1 and the newly sequenced halophiles, genes homologous to H. sp. NRC-1 were classified into twelve functional groups: amino acid metabolism, cell envelope components, cellular processes, cofactor metabolism, DNA replication repair and recombination, energy metabolism, nucleotide metabolism, regulation, transcription, translation, transport, and miscellaneous genes. These are the same classifications described for the H. sp. NRC-1 genome project [2]. Also, an identical comparison to a subset of only 820 random H. sp. NRC-1 genes was made to enable quantitative comparison of these results with those of the other interspecies results. Core biological functions that are essential for H. sp. NRC-1 to function also seem to be conserved in the four halophiles (Figure 5). Genes for the citric acid cycle for uptake and utilization of amino acids, sodium-proton antiporters, potassium uptake systems, and DNA replication, transcription, and translation systems resembling those of eukaryotes were observed in all species. Some of the aminoacyl-tRNA synthetases that were not found in some archaea (e.g.

Conservation of gene order between halophiles
Gene clusters encoding proteins of related functions are often co-localized in prokaryotic genomes; this organization could assist assigning functions to unknown ORFs when gene clusters of two organisms are compared [24]. The prediction of functional linkage between the genes involved in several metabolic pathways across the multiple genomes has been demonstrated to correlate functional with physical linkages of gene clusters [25]. Co-localization of functionally related genes was also identified in the H. sp. NRC-1 genome. The H. sp. NRC-1 analysis revealed that nine of the ten genes involved in arginine metabolism cluster at three genomic loci. Coinduction of the argS with the arcABC gene cluster on repressing phototrophy has been recently discussed [26].
Co-localization of the gene clusters involved in isoprenoid biosynthesis, purple membrane biogenesis, and nucleotide metabolism is also observed at multiple loci across the H. sp. NRC-1 genome. We found many instances of two or more consecutive genes in H. sp. NRC-1 that may be similarly linked in the four newly sequenced halophiles (Table 2). Detection of the conserved gene orders among the genes associated with important biological functions such as amino acid metabolism, cofactor metabolism, energy metabolism, translation, and transport suggests selective pressure to maintain such gene arrangements.

Transcription factor family
Archaea possess a simplified version of the eukaryotic RNA polymerase II-like transcription system including highly conserved TBP and TFB proteins, but no eukaryal-like TBP-associated factors (TAFs) [27,28]. Recently, multiple TBP and TFB proteins have been reported in several archaea including H. sp. NRC-1 [12] and another halophilic archaeon isolated from the Dead Sea, Haloferax volcanii [29]. A novel mechanism of transcriptional regulation by forty-two different combinations of multiple TBP and TFB interactions has been recently suggested for NRC-1 [30]. [...] involved in TBP interaction [30], were conserved in the newly identified TFB proteins.
In order to study TFB family evolution, orthology relationships between the newly identified TFBs and H. sp. NRC-1 TFBs were analyzed by constructing a molecular tree from protein sequence distances. Of the five newly [...]
Sequence alignments for TFB proteins
Our findings of multiple TFB homologs in H. lacusprofundi and H. gomorrense suggest the presence of multiple transcription factors in other halophiles as well. In order to evaluate such a possibility, enzyme-digested genomic DNA of five halophiles, including H. sp. NRC-1, was subjected to Southern hybridization to estimate the number of sequences related or identical to tfb genes. By counting only highly distinguishable bands in the Southern blot (data not shown), at least four tfb-related sequences were identified in all newly sequenced halophiles but N. asiatica (three). H. sp. NRC-1 genomic DNA, used as a positive control, hybridized to seven bands, presumably corresponding to the seven known tfbs.
In our low-pass sequence analysis, the presence of a TBP homolog was not observed in the newly sequenced halophiles. Using Southern hybridization, we estimated the numbers of sequences related to tbp genes. The results suggest the possible presence of two tbp-related sequences from H. gomorrense and H. lacusprofundi and one tbp-related sequence from H. marismortui and N. asiatica. These four halophiles contain fewer copies of tbp genes than H. sp. NRC-1, which showed six tbp-related sequences in Southern analysis, corresponding to the six known tbps.
Identification of the presence of multiple transcription factor genes suggests transcriptional regulation through a variety of TBP-TFB combinations in the four newly sequenced halophiles.
Unrooted molecular tree of TFB protein family

Global similarity estimation to NRC-1
The four newly sequenced halophiles, as well as the previously sequenced H. sp. NRC-1, belong to the Halobacteriaceae family. More recent isolates H. marismortui and H. gomorrense are from the Dead Sea, H. lacusprofundi from Antarctica, and N. asiatica from Japan [6]. Sequence similarity is expected between closely related species. However, adaptation to their unique environments may have contributed to sequence divergence in these organisms. Similarity between organisms may be estimated among orthologous genes of all analyzed species, but this limits analysis to only a small number of genes, which may not be representative of global similarity.

Estimation of evolutionary distance between five halophiles
16S rRNA is useful for studying evolution because it is ubiquitous in all prokaryotes and contains highly conserved as well as variable regions, which can be used for determination of divergences at different taxonomic levels. For instance, the highly conserved region can be used to compare distantly related organisms whereas the variable region can be used to look at the divergence between closely related species. Although analysis based on 16S rRNA is useful, it does not yield unambiguous conclusions. Phylogenetic analysis of individual genes (such as 16S rRNA) is based upon only a small portion of an entire genome. Since a massive amount of sequence data (including the whole genome sequences) is now available, alternative methods of phylogenetic analysis that make use of these data are possible.
One method developed in this study for phylogeny reconstruction utilizes the protein distances calculated between the species. The method was first tested on completed genomes including Escherichia coli K-12 Strain MG1655, Haemophilus influenzae Rd KW20, Mycoplasma genitalium G-37, and Bacillus subtilis 168. E. coli and H. influenzae belong to the same bacterial phylum, gamma proteobacteria, while M. genitalium and B. subtilis belong together in the low-GC gram-positive phylum. These genomes were selected to determine if the newly developed method is able to identify distance between closely related species (in the same phylum), and also to test its ability to discriminate between different phyla. These two phyla are closely related based on 16S rRNA phylogeny. Unrooted molecular trees were computed; the tree computed from our novel protein distance matrix was compared to the tree computed from 16S rRNA (Figure 8). The reconstructed molecular trees are in good agreement. Our simple methodology thus demonstrates an alternative way of calculating protein distances using more sequence data, and validates the currently proposed relationship of the four genomes.
Knowing that evolutionary relationships can be closely estimated using protein distance matrix methods, we then employed the method to calculate distances between the halophiles. The molecular tree obtained was compared to the 16S rRNA tree (Figure 9).

Orthology and divergence between the genomes of NRC-1 and other halophiles
We have studied four extremely halophilic archaea by low-pass genome sequencing and compared the results to a reference genome (H. sp. NRC-1) to better understand their orthologous characteristics. Despite the geographic remoteness of the habitats from which each of these species was isolated, and the evolutionary pressure to adapt to their unique environments, some unique core characteristics of extremely halophilic archaea seem conserved among the five species. Strikingly, the four newly sequenced halophiles also exhibit low pI of their proteins. The result suggests the presence of acidic proteins is an essential mechanism to function in hypersaline environments and that these four halophiles also maintain high salt concentration in their cytoplasm to maintain ionic balance in hypersaline environments.
Approximately 60% or higher overall GC content was observed in the halophile species analyzed in our study. [...] [30]. Different TBP-TFB combinations are likely to serve in response to different cellular and environmental stimuli.
Figure 9. Molecular trees of five halophiles. The molecular tree was reconstructed by the protein distance matrix (A) and the 16S rRNA (B) analysis. The comparison between molecular trees obtained by the two different methods showed a lack of congruence among halophiles. Both trees resemble a "star" phylogeny among the five species, suggesting significant molecular divergence between species. The bootstrap scores are shown on the 16S rRNA analysis.

Analysis of IS elements showed that more sequence reads for H. lacusprofundi (38 reads) had similarity to NRC-1 IS elements, followed by H. marismortui (10 reads). Considering that the sequences are generated from a random shotgun library and that similar coverage of each genome sequence was used for analysis, these results suggest that the H. lacusprofundi genome contains the highest number of IS sequences homologous to H. sp. NRC-1. This conservation may be due to these transposon elements entering the two genomes by multiple transposition processes, or simply due to a more recent divergence from a common ancestor.

Molecular tree reconstruction
The evolutionary relationships of the five halophiles were assessed by two methods: a protein distance matrix and a 16S rRNA analysis. The protein distance matrix method we have developed for phylogeny reconstruction utilizes all available sequence information and calculates protein distances from the results of alignments of protein sequences between species. The method is very simple, as it relies on mean or median similarity of the entire complement of predicted proteins, but appears to offer sufficient resolution to provide useful insight across relevant evolutionary time frames. The molecular trees obtained by the protein distance matrix and the 16S rRNA lacked congruence. The protein distance matrix analysis indicates that H. sp. NRC-1 is more closely related to H. lacusprofundi, while the 16S rRNA analysis suggests that N. asiatica is more closely related to H. sp. NRC-1. Although 16S rRNA is the sequence most commonly used to study evolutionary divergence, some discrepancies between the rRNA trees and gene trees in the investigation of individual genes and proteins have been previously discussed [32,33]. In contrast, the result of the protein distance matrix is supported by our findings of DNA and protein sequence similarity analysis, which indicated the highest sequence similarity between H. lacusprofundi and NRC-1. This finding was further supported by the presence of the highest number of IS elements that are homologous to IS elements found in H. sp. NRC-1. In addition, 63% of putative genes identified from H. lacusprofundi sequences were homologous to NRC-1 genes, the highest percentage among the four halophiles in this study. Although the observed "star" phylogeny suggests significant molecular divergence between species, considered together, our findings suggest that the closest of the relationships is between NRC-1 and H. lacusprofundi.
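The core of the protein distance matrix idea, summarizing many per-protein alignment scores between two species by their median and collecting the medians into a symmetric matrix, can be sketched as follows. The species labels and scores are invented placeholders, the function names are hypothetical, and treating the per-protein score directly as a distance (lower means closer) is an assumption. Building a tree from the matrix is a separate step not shown here.

```python
import statistics
from itertools import combinations

def median_distance_matrix(pair_scores):
    """Symmetric matrix of median per-protein scores for each species pair."""
    species = sorted({name for pair in pair_scores for name in pair})
    matrix = {a: {b: 0.0 for b in species} for a in species}
    for (a, b), scores in pair_scores.items():
        distance = statistics.median(scores)
        matrix[a][b] = matrix[b][a] = distance
    return species, matrix

def closest_pair(species, matrix):
    """Species pair with the smallest median distance."""
    return min(combinations(species, 2), key=lambda p: matrix[p[0]][p[1]])

# Invented per-protein scores for three placeholder species.
toy_scores = {
    ("sp1", "sp2"): [12.0, 15.0, 11.0, 14.0],
    ("sp1", "sp3"): [25.0, 22.0, 27.0],
    ("sp2", "sp3"): [24.0, 26.0, 23.0, 28.0, 25.0],
}
species, dist = median_distance_matrix(toy_scores)
nearest = closest_pair(species, dist)
```

Using the median rather than the mean keeps a few unusually divergent or unusually conserved proteins from dominating the distance for a pair of species.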

Utility of low-pass sequencing for comparative genomics
Our analysis of low-pass sequencing demonstrated possible ways of identifying coding regions from sequence reads and allowed comparison between related species. Low-pass sequencing becomes more useful when a closely related reference genome is available. Some of our findings include: 1) identification of 60% or higher GC content and low protein pI among five halophiles; 2) identification of IS elements in all halophiles; 3) detection of gene conservation among halophiles, including approximately 60% of genes shared with H. sp. NRC-1, spanning many core biological functions; 4) identification of genes belonging to TFB families; and 5) detection of conserved gene orders between H. sp. NRC-1 and other halophiles.
Low-pass sequencing, however, exhibited limited utility for some genome analyses. Partial ORFs of predicted proteins limit detailed analysis of such proteins. Large-scale genomic topology cannot be inferred from low-pass sequencing. The evidence of common genetic exchange within a species and among closely and distantly related microbes by lateral gene transfer is hard to elucidate. A complete list of genes in a gene family or elements involved in a pathway could not be obtained. The basis of additional extremophilic characters, such as psychrophily, alkaliphily, barophily, and acidophily, is difficult to determine. Low-pass sequencing has limited ability to resolve multigene families, pseudogenes, and allelic variations. These last are seldom issues in microbial genetics, but could be problematic in diploid organisms.
Nevertheless, to obtain a general comparative view on genomes of related species, low-pass sequencing can be an effective way of providing information in a time- and effort-efficient manner [34].

Conclusions
Low-pass sequencing is useful for comparative genomics, and will be increasingly so as sequencing technologies improve. It could be both cost- and time-effective for comparative genomics. Low-pass sequence data enable diverse analyses, especially identifying orthologous characteristics between a reference genome, such as that of H. sp. NRC-1, and related species. Comparative analyses show shared mechanisms governing cellular and genetic functions among these five halophiles. High GC contents, low pI of predicted proteins, multiple IS elements, biological pathways, and multiple transcription factors are conserved among the species despite geographic diversity of habitats.
SCORE 250, MINMATCH 15, and BANDWIDTH 1. The output file of CROSS_MATCH contains the substitution score which represents the distance between two species calculated from the differences in a given protein sequence alignment. The median values of the substitution scores were then calculated and stored as a pairwise distance matrix.
For rRNA tree construction, 16S rRNA sequences of halophiles and four complete genomes used in the study were retrieved from NCBI (H.

Authors' contributions
YG: carried out sequencing, annotation, analyses, study design and drafted the manuscript. JR: performed statistical analysis of low-pass sequencing, developed the protein matrix phylogeny estimation method, and drafted the manuscript. GG: carried out identification of conserved gene orders. NB: participated in the design of the study. KD: wrote Perl scripts. MP: participated in sequencing. SK: provided feedback for the manuscript. SD: conceived of the study, and participated in the study design. WN: participated in the design of the study and coordination. LH: conceived of the study, and participated in its design and coordination.
All authors have read and approved the final manuscript.