Genome analysis and comparative genomics of a Giardia intestinalis assemblage E isolate

Background Giardia intestinalis is a protozoan parasite that causes diarrhea in a wide range of mammalian species. To further understand the genetic diversity between the Giardia intestinalis species, we have performed genome sequencing and analysis of a wild-type Giardia intestinalis sample from the assemblage E group, isolated from a pig. Results We identified 5012 protein coding genes, the majority of which are conserved compared to the previously sequenced genomes of the WB and GS strains in terms of microsynteny and sequence identity. Despite this, there is an unexpectedly large number of chromosomal rearrangements and several smaller structural changes that are present in all chromosomes. Novel members of the VSP, NEK Kinase and HCMP gene families were identified, which may reveal possible mechanisms for host specificity and new avenues for antigenic variation. We used comparative genomics of the three diverse Giardia intestinalis isolates P15, GS and WB to define a core proteome for this species complex and to identify lineage-specific genes. Extensive analyses of polymorphisms in the core proteome of Giardia revealed differential rates of divergence among cellular processes. Conclusions Our results indicate that despite a well conserved core of genes there is significant genome variation between Giardia isolates, both in terms of gene content, gene polymorphisms, structural chromosomal variations and surface molecule repertoires. This study improves the annotation of the Giardia genomes and enables the identification of functionally important variation.


Background
Giardia intestinalis (syn G. lamblia and G. duodenalis) is a flagellated protozoan parasite that belongs to the diplomonad group, which includes both parasitic and free living species. During the course of evolution this organism has adapted to a parasitic life-style, which involves passing through an infectious cyst stage and subsequently into the vegetative trophozoite stage. G. intestinalis infects humans and a broad range of other mammals including wild and domestic animals, and causes diarrhea. The parasite spreads through the fecal-oral route and constitutes a serious health problem in developing countries, where it contributes to malnutrition and growth inhibition in children [1,2]. In industrialized countries, waterborne outbreaks of G.
intestinalis continue to be a health concern in public water supplies and recreational waters. Highly resistant G. intestinalis cysts are able to survive from weeks to months in cold water, and the infectious dose can be as low as 10 cysts [3,4]. Many G. intestinalis infections do not give rise to symptoms and chronic carriers of G. intestinalis have been documented and these are sources of new infective cysts [5,6]. In addition to being an important mammalian pathogen G. intestinalis is an interesting model organism since it has been suggested to be a primitive, early-branching eukaryote. However, recently a number of eukaryotic features have been described in G. intestinalis [7]: a putative nucleolus has been identified [8]; a vestigial mitochondrial organelle called the mitosome has been confirmed [9]; and the prokaryote-like metabolic pathways that have been identified are most probably a result of horizontal gene transfer [10]. Even so, G. intestinalis is unusual by having two identical and transcriptionally active nuclei, no peroxisomes as well as a lack of a recognizable stacked Golgi-apparatus in vegetative cells [2]. The genome of G. intestinalis is small (the haploid size is~11.7 Mb) compared to most sequenced eukaryotic genomes and was first described by Morrison et al [11]. Each nucleus of the best studied G. intestinalis isolate WB contains a diploid set of 5 major chromosomes that are inherited individually [12,13]. Substantial karyotype variability has been reported among G. intestinalis isolates, most likely caused by the hypervariable subtelomeric regions [14]. Compared to some other unicellular species, eg. Trichomonas vaginalis, Trypanosoma cruzi, and even the diplomonad Spironucleus vortens, the G. intestinalis genome has not undergone a massive repeat expansion, possibly due to evolutionary constraints that act upon the genome or because a large scale proliferation of mobile elements has not occurred. The genome of G. intestinalis is compact in terms of intergenic space and non-coding regions. Promoters and untranslated regions are minimal and only four introns have been identified [11], which is consistent with the features of a compact genome.
G. intestinalis has previously been considered to be a single species, mainly due to similarity in morphology between different isolates. However, based on the genetic variation between strains, G. intestinalis can be considered to be a species complex [15,16]. Seven assemblages (genotypes) have been confirmed as distinct evolutionary lineages, by enzyme electrophoretic pattern studies and phylogenetics [17,18]. Moreover, sequence analysis has been used to stratify isolates into these groups, mainly based on the genes glutamate dehydrogenase, beta-giardin, elongation factor-1 alpha, triose phosphate isomerase and small subunit rRNA. Furthermore, a growing number of studies have shown phenotypic differences regarding metabolism and biochemistry, growth rates (both in vivo and in vitro), drug sensitivity, pH preference, infectivity, susceptibility to the G. intestinalis dsRNA virus and clinical features [19,18]. However, no genetic factors guiding host specificity or adaptability have been identified so far, possibly due to the limited genetic tools available for G. intestinalis.
Assemblages A and B are the only assemblages known to infect humans, C and D infect dogs, group E infects hoofed animals, including cattle, sheep, goats, pigs, water buffaloes and mufflons, F infects cats and G is restricted to rodents. One recent study analyzed the genetic variation from a large number of assemblage A and B isolates [20]. It has been proposed that assemblages A and B are distinct species, with the species names G. duodenalis for assemblage A and G. enterica for assemblage B [16]. However, the degree of genetic variation between other assemblages than A and B is largely unknown.
Whole genome data is important to establish phylogenetic relationships, to identify genotype specific markers for epidemiological studies and offers a novel approach to understand the genetic diversity on the subspecies level. Furthermore, comparative genomics is a powerful tool to reveal insights into the evolutionary history of microbial pathogens and can provide information as to how host specificity and virulence is determined. The genome-wide genetic variation in the G. intestinalis species complex remains largely unexplored, so far only two isolates from two assemblages have been subject to whole genome sequencing. The first G. intestinalis genome to be sequenced was the WB isolate (assemblage A). The WB genome was found to be 11.7 Mb in size with 4889 annotated coding sequences, extremely low levels of allelic sequence heterozygosity (<0.01%) and no known identifiable virulence factors [11]. More than 50% of the annotated coding sequences showed no similarity to any protein sequence in public databases, but of the genes with a functional annotation many are likely to represent lateral transfers from bacteria or archaea [21]. We previously reported genome sequencing and analysis of the assemblage B isolate GS and comparative genomics to the WB genome revealed a well conserved core of genes, but also extensive genomic rearrangements [22]. The GS isolate was found to have an abundance of allelic sequence heterozygosity as detected by high throughput sequencing. Moreover, a comparison of three genetically divergent G. intestinalis isolates allows improved annotation of protein coding genes and noncoding RNAs.
To further understand the biological differences between human-and non-human infective isolates of G. intestinalis we here present a draft genome sequence and analysis of the assemblage E isolate P15, originally isolated from a piglet in the Czech Republic. This is the first whole genome analysis of a field isolate of G. intestinalis with minimal cultivation in the laboratory. We therefore expect it to closely reflect a wild-type G. intestinalis isolate.

Genome assembly and allelic sequence heterozygosity
Clustering of the shotgun reads (Table 1, Additional file 1) and manual finishing of the assembly resulted in 820 contiguous sequences (contigs) with an N50 of 71,261 bp and an average genome coverage of 47×. The relatively large number of contigs is likely to be caused by the presence of genomic repeats rather than areas of insufficient sampling. The combined contig length was 11,522,052 bp. The haploid genome size is likely to be somewhat larger, due to the collapse of certain repeat regions.
The G + C content of the overall assembly is 47%, which is identical to GS but differs slightly from WB (49%). The lower G + C content in P15 and GS is most likely explained by non-perfect assembly of G + C rich repeat regions, such as VSPs or High Cysteine Membrane Proteins in the whole genome shotgun data sets. To circumvent the problem of different degree of assembly completion, we examined the G + C content of conserved syntenic genes (1:1:1 orthologs) and found the G + C content to be 46% (P15), 47% (GS) and 49% (WB). Orthologs in P15 have the lowest average G + C content while WB have the highest.
Only 255 positions of 10,704,988 analyzed positions passed the criteria for allelic sequence heterozygosity. Most heterozygous positions were located in short contigs and may therefore represent artifacts, for example chimeric contigs, rather than a real heterozygosity. None of the heterozygous positions were located in genes that have previously been used for genotyping.
The assembly contains 61 frame shifted genes, 48 of which were found to be interrupted by single base insertion or deletion errors which is consistent with the homopolymer sequencing errors related to the 454 technology. Six genes were present in regions where the assembly is of low quality. Five frame shifted genes could not be attributed to homopolymer sequencing errors or low coverage in the assembly and were designated as putative pseudogenes, these are described further below (Additional file 2).

Genome annotation and prediction
The vast majority of the P15 annotations were assigned using the information from the WB and GS genomes ([GenBank:AACB00000000] and [GenBank: ACGJ00000000]). The P15 draft genome sequence was found to contain 5012 annotated protein coding sequences, of which 2470 were conserved hypothetical genes. About 91% of the annotated genes had orthologs in WB. Genes without assigned orthologs were located in genomic regions without conserved synteny compared to the previously published Giardia genomes. The average protein identity between P15 and WB was 90% (sd. 7) and for the corresponding nucleotide sequences the identity was 87% (sd. 4). Only 15% of the ortholog pairs showed a nucleotide identity higher than 90%. The average protein identity between P15 and GS was 81% (sd. 10) and the nucleotide identity 77% (sd. 5). The higher average protein identity confirmed earlier phylogenetic studies [23] that showed assemblage E to be more closely related to assemblage A than assemblage B to A. Furthermore, 82% of the P15 genome assembly was identified as coding. The divergence between P15 and WB is close to what is observed between Leishmania major and L. infantum and the divergence between GS and WB is similar to that observed between Theileria parva and T. annulata ( Figure 1).
The Giardia core proteome and polymorphisms The Giardia gene content can be subdivided into a core set of housekeeping genes and structural genes (approximately 91% of the Giardia genes), unique assemblagespecific genes and a more variable part of the genome that contains members of large gene families (approximately 9% of the Giardia genes; Additional file 3). The core genes contain conserved positional orthologs between P15, WB and GS. The ratio of non-synonymous (dN) and synonymous (dS) nucleotide changes was used to estimate the selective pressure on genes. In two comparisons (GS-WB and P15-GS) synonymous changes were found to be saturated and the ratio is not reliable. P15 and WB orthologs (see Methods) were used to calculate dN/dS in order to evaluate if certain genes have a differential evolutionary rate in Giardia.
The distribution of dN/dS values displayed a left skew with a peak around 0.15 (Table 2, Figure 2A), implying that most genes are under purifying selection. Some 104 genes were found to be perfectly conserved in pairwise comparisons between the three genomes when comparing on the amino acid level (Additional file 4). Among the genes with the highest dN/dS appeared uncharacterized genes, possibly due to a relaxation of the selective pressure or because of their importance in parasite adaptability. Using a slimmed gene ontology we mapped 981 Giardia genes to 12 broad categories ( Figure 2B), which represent about 20% of the core gene content. The dN/dS distributions were grouped according to biological process GO category and the categories were found to be different from each other using a Kruskal-Wallis test. Five GO categories were significantly different using a Mann-Whitney test when compared to all other genes: protein metabolic process; nucleic acid metabolic process; primary metabolic process; translation; transport.
Publicly available Serial Analysis of Gene Expression (SAGE) data was used to group Giardia genes into 6 developmental categories and one category for constitutively expressed genes ( Figure 2C). Hierarchical clustering of data from 10 SAGE libraries have been used to construct the following developmental categories: excystation; differentiation; encystation; cyst/excystation; trophozoite/encystation; cyst/excystation genes that are downregulated; constitutively expressed genes [24]. The distribution of dN/dS ratios was examined and the groups were found to be different from each other using a Kruskal-Wallis test. Four categories were found to be significantly different when compared to all other groups using a Mann-Whitney test: constitutively expressed genes; trophozoite/encystation; cyst/excystation; differentiation. The differentiation group contain the smallest number of genes (n = 3).
Small insertion-deletion (indel) events were found to be abundant between orthologs. We produced threeway alignments of orthologous genes in GS, P15 and WB in order to identify lineage specific adaptions.  hominis. The vertical axis shows the cumulative frequency of the ortholog count and the horizontal axis shows the corresponding amino acid identity. A steeper curve to the right indicate that the species pair is more closely related compared to a species pair to the left. In this comparison, C. parvum and C. hominis are the closest related species, whereas T. cruzi and T. brucei are the most distantly related based on protein similarity. G. intestinalis P15 and WB have a divergence close to what can be observed between L. major and L. infantum. G. intestinalis WB and GS have a divergence close to what is observed between T. parva and T. annulata. The ortholog relationships were established using reciprocal BLASTP and the protein identities were extracted from the BLAST outputs. Sequence data was downloaded from EuPathDB.
Alignment gaps were interpreted as indel events along the branches of the phylogenetic tree ((P15, WB), GS). In order to find lineage specific indels, we defined an indel event as an alignment gap in one lineage that did not occur in the other two lineages, or correspondingly that one lineage lacked an alignment gap but a gap of the same length and start position was present in the two other lineages. The requirement of the same gap length and start position was implemented since gaps of different lengths would be caused by more than one indel event. Furthermore, a quality filter was implemented where 15 bp downstream and upstream of the genes was examined for mismatches and if more than 7 mismatches were identified the indel was rejected. Indels located in non-coding sequences were not considered because of the risk of inaccurate alignments. Some 515 indels were found to be specific for the P15 lineage (196 insertions and 319 deletions). The median deletion length was 3 bp and the median insertion length 6 bp. The most common length of an indel was 3 bp and the longest indel was 75 bp. Some 738 indels (303 insertions and 435 deletions) were found for the WB lineage.

Differential distribution of genes
Comparisons between the three Giardia isolates revealed the presence of genes specific for each isolate ( Figure 3, Additional file 2). In terms of specific genes, the P15 isolate has a few more (38) than GS, while the WB isolate has the smallest subset (5). The identified genes are not members of the large Giardia gene families and most of them have no sequence similarity to known sequences and their functions are therefore unknown. One P15 specific gene (GLP15_874) encodes an acetyl transferase. BLAST searches indicated high similarities to bacterial homologs of the gene. A phylogenetic analysis showed that this P15 gene clusters with bacterial sequences, mostly from a low G+C gram positive group (Firmicutes) suggestive of a recent gene acquisition (Additional file 5). The donor lineage cannot be determined due to poor bootstrap support values and likely frequent transfer of the gene among bacteria. Nevertheless, many of the putative donor lineages (i.e. Lactobacillus, Cloststridium, Anerotruncus and Enterococcus) are common inhabitants of the gastrointestinal tract of mammals (Additional file 5). This is consistent with a recent uptake of the gene from intestinal bacteria. In this comparison, 31 GS specific genes were identified, and several conserved bacterial domains were identified in these genes as reported previously [22]. P15 and WB share only 7 genes that are not present in GS ( Figure 3). In comparison, WB and GS share only one specific protein whereas P15 and GS share 20. Thirteen of these genes are located in a 20 kb contig in P15. Homologous genes were identified in GS on several small contigs. Synteny analysis of this chromosomal region in the WB genome indicates that a rearrangement has occurred when compared to P15. Preliminary data indicates that this region may also be absent from other Giardia assemblage A isolates. One of the genes only present in P15 and GS (GLP15_2168) contains domains similar to a prokaryotic DNA segregation protein and a second contains a possible nucleoside triphosphate hydrolase domain. However, the function of these proteins remains unknown. We detected the presence of 16 candidate pseudogenes in the Giardia genomes, 5 that were detected in WB and 6 in GS (Additional file 2). All of these are members of common gene families and all but one have positional orthologs in the GS-P15 and P15-WB genome comparisons, respectively. The remaining 5 pseudogenes were identified in P15. Candidate pseudogenes in WB and P15 were confirmed to have disrupted coding sequences by sequencing of PCR products (data not shown). SAGEtags were not detected for WB pseudogenes, but could be detected for some WB orthologs to the confirmed pseudogenes in P15 and putative pseudogenes in GS (Additional file 2). One pseudogene in P15 involves a lateral transfer candidate (Additional file 5). Intact orthologs are present in the human associated G. intestinalis WB and GS stains, and only four bacterial strains from three different groups. Thus, the gene has a very patchy phylogenetic distribution in the tree of life. Some of the bacteria that do encode the gene are associated with mammals; Collinsella aerofaciens is part of the normal gut flora in humans, and Ruminooccus inhabit the rumen of cattle.

Small RNAs and regulation of gene expression
The prediction of transfer RNAs (tRNA) was performed using tRNAscan-SE [25]. The WB genome contains 63 tRNA genes, two of which were predicted to be pseudogenes [26]. In P15, 61 tRNA genes were predicted and 5 of them contained an intron. In contrast, 58 tRNAs were identified in the GS genome and 7 of these contained an intron. The three genomes contain the same complement of tRNA genes but the copy number of each gene can be variable.
The same four introns were identified in the Giardia WB and GS genomes [22] and introns were found in the same positions in the P15 genome. Regular expressions constructed from consensus sequences of all known Giardia introns failed to locate additional candidates. The encystation inducible promoters from P15 resemble those of WB. Binding sites for the transcription factor Myb2 can be identified in the equivalent positions in the promoters of the cyst wall protein-1, cyst wall protein-2 and Myb2 itself. P15 harbors Myb2 binding sites in the sugar synthesis genes glucosamine-6phosphate isomerase and UDP-glucose 4-epimerase as in WB, a distinguishing feature that is missing in the GS isolate. Non-coding RNAs are generally well conserved, with P15-WB pairs showing the greatest similarity (Additional file 6).  parasite antigen in Giardia and these proteins have a conserved membrane spanning domain with an invariant cytoplasmic tail and a cysteine rich variable domain that is exposed to the extracellular environment. The NEK Kinases belong to the NIMA-related (Never In Mitosis Gene A) clade of serine/threonine kinases. Surprisingly, 76% of the NEK kinases in the Giardia WB genome are predicted to be catalytically inert.
We identified 112 complete VSP genes in the P15 genome assembly. However, there is also a significant number of VSP sequences that are present in small contigs in spite of the high coverage and relatively long read lengths. The average identity of VSPs between P15 and GS is 61% ± 10 which is slightly higher than for WB and GS. A search for the conserved terminal CRGKA motif in all the reads gave 4757 hits. With an average sequence coverage of 47× we estimated the number of VSP genes to be approximately 100, which is slightly lower than the number of annotated VSP genes. The reason could be incorrectly annotated VSP genes in shorter redundant contigs or possibly local variation in sequence depth. When we corrected for this, the VSP content is still lower in P15 compared to both WB and GS. VSP genes are rarely located in syntenic position in the genomes. This shows the plasticity of VSP genes both from the sequence perspective and from genomic location. The VSPs in P15 appear to be scattered throughout the genome as in WB and GS. Regions where VSPs and other dynamic protein families are found are characterized by significant differences in codon usage and higher G + C content. These regions are also frequently associated with breaks in synteny and rearrangements. The only regulatory motif that is uniquely associated with VSP genes in WB and P15 is the extended polyadenylation signal. This motif is remarkably well conserved over a stretch of 16 nucleotides (ACTTAGGTAGT[AG]AA[CT]GC) and is invariably located 0-15 bp downstream from the stop codon of VSP genes. Subtle differences in the consensus sequences can be distinguished between the isolates (Additional file 7). This sequence is also found in the small subset of VSP genes available from the GS isolate. Motif searches conducted on sequences upstream of the VSP genes revealed very few conserved sequences.
Phylogenetic analysis of VSP genes from P15 and WB was conducted using protein sequence alignment and maximum likelihood. The analysis indicated that VSP genes are distributed in two major subgroups in this tree which we designate as A and B, based on their distance in the phylogenetic tree (Additional file 7). The B subgroup is considerably smaller than the A subgroup but both subgroups contain both P15 and WB sequences. Certain branches of each subgroup show enrichment of either P15 or WB VSPs, which could reflect recent VSP duplication events. It is not clear whether the enrichment of WB genes is due a possible lack of assembled sequences for P15 or if this actually represents lineage specific diversification.

Structural variation in Giardia genomes
Structural variation analyzed across the three Giardia genomes revealed frequent loss of genome synteny. In total, 141 cases of insertions, deletions and rearrangements could be detected. In most cases, the exact type of rearrangement could not be determined due to the termination of contigs in P15 and GS and occasionally in WB. Only 28 synteny breaks were recorded between WB and GS in contrast to 113 instances in the phylogenetically more related WB and P15. This discrepancy can be explained by the more fragmented GS draft assembly in which very few genes from repetitive gene families could be assembled. The deep coverage and overall quality of the P15 assembly increases the amount of structural variants that can be detected. We investigated the quality of our assembly by PCR verification of 10 structural variants (Additional file 8). This set included predicted intra-and inter-chromosomal rearrangements involving all chromosomes except chromosome 2. Seven out of 10 synteny breaks yielded the expected PCR product, including potential inter-chromosomal rearrangements between chromosomes 1 and 5 as well as 3 and 5. Intra-chromosomal events were verified on chromosomes 3, 4 and 5. The percentage of verified synteny breaks is higher in the P15 assembly (70%) than what was found experimentally in the GS draft genome (59%) [22].
Genes found in non-syntenic regions of the genomes were categorized according to annotation and the number of genes in each category were compared for WB and P15 (Figure 4). Hypothetical proteins was the most common category (WB: 86, P15: 82) followed by VSPs (WB: 47, P15: 57). Protein 21.1, HCMPs, NEK Kinases and putative transposable elements were also found to be associated with synteny breaks in these Giardia genomes. Large fractions of genes belonging to these gene families have diverged significantly, indicating that these gene families evolve more rapidly than other genes. Genes for GTP-binding protein and reverse transcriptase were more commonly associated with synteny breaks in P15 whereas Kinases were found more often in WB. Retro-transpososon associated sequences such as reverse transcriptases or endonucleases are found 17 times in 13 distinct non-syntenic positions in P15 (10 instances/8 distinct loci) and WB (7 instances/5 distinct loci).
Indications of the highly variable genome structure in Giardia have been noted previously in several reports [14,[27][28][29]. Subtelomeric regions containing ribosomal DNA repeats are prone to rearrangements and copy number variation whereas the central region of chromosomes appear to be comparatively stable. The Giardia rearrangement rate of rDNA gene clusters has been estimated to be 1% per cell division in vitro [30]. Differences in chromosome number of individual Giardia nuclei have been reported for in vitro cultivated isolates. Five distinct chromosomal linkage groups have been demonstrated in spite of the karyotype diversity seen in Giardia. The karyotype heterogeneity could result from recombination associated with the presence of mobile elements. The WB genome contains three families of retrotransposons, two of which are located at the telomeres and could still be active [31].
VSPs were frequently found in non-syntenic positions in WB and P15. Genome rearrangements have been identified as crucial components in the generation of antigenic variation in other parasites. The mobile nature of VSPs observed here is likely to be involved in generating VSP diversity through gene duplication and recombination. The large-scale architecture of Giardia chromosomes seem to consist of gene rich regions displaying stable codon usage and conserved gene order interspersed by chromosomal locations with atypical codon usage ( Figure 5). VSPs and HCMP genes contribute to create such chromosomal islands with higher than average G + C content and atypical nucleotide signatures.
Several smaller rearrangements were found in areas with well-conserved genes. These include insertion or deletion of individual or multiple genes, inversions and translocations (Additional file 9).

Discussion
We have previously reported a draft genome sequence and analysis of the Giardia assemblage B isolate GS [22]. We here describe the first genome analysis of a non-human Giardia isolate, that of assemblage E isolate P15. We have chosen to use a high throughput strategy to sequence the genome of P15, taking advantage of the low cost of 454 sequencing and the rapid data generation. We have examined the genetic diversity of a mostly uncharacterized assemblage of Giardia, and identified intraspecific variation that can be used for population studies or genotyping in a clinical setting. Comparative genome analysis of the assemblage A isolate WB and the B isolate GS previously revealed evidence of chromosomal shuffling, an abundance of allelic heterozygosity (ASH) and approximately 900 conserved open reading frames that were left unannotated in the first Giardia genome project.
We have estimated the core gene content of Giardia to be approximately 91% of the total gene content, with the remaining 9% involved in antigenic variation and functions related to host specificity. A few members of the VSP and HCMP families were found to be positionally conserved between all three genomes, but the function of these syntenic surface proteins is not known. The Giardia genomes are relatively similar in terms of core gene content, but tend to differ with respect to synteny and in genes responsible for antigenicity. We have identified several small scale genomic changes, including insertions and deletions of coding and noncoding sequences. A small number of isolate specific genes were identified and in most cases these are located in nonsyntenic regions of the genome. As expected, the Giardia isolates have diverged repertoires of surface proteins (VSPs and HCMPs) primarily coded in nonsyntenic regions of the genomes. A similar genomic arrangement is present in the trypanosomes where syntenic regions are interspersed with large regions of surface molecule genes, noncoding sequences and repeats. The roles of such regions are unknown but could constitute regions of increased genomic plasticity Figure 5 A circular overview of five Giardia P15 contigs. A circular overview of the five largest contigs in the P15 assembly which in total represent 909 kb of the genome (about 8%). The most outer graph displays G + C content plotted in overlapping 1.5 kb windows along the contigs. Areas where the G + C content is elevated (>50%) is displayed in red. The middle graph displays atypical nucleotide composition (as defined in Methods), plotted in overlapping 1.5 kb windows. Radial lines represent genes, with different colors for different annotations (blue = Hypothetical protein, red = Variant-specific surface protein (VSP), Reverse transcriptase (RT) or High cysteine membrane protein (HCMP), yellow = NEK Kinase, purple = Protein 21.1, green = other). The most inner graph displays the level of sequence coverage. VSP, RT and HCMP genes are characterized by a higher than average G+C content and atypical nucleotide signatures. These genes are dispersed and create chromosomal islands with higher than average G + C. and they could play a role in the generation of antigenic variation through recombination events. A chromosome-wide analysis of gene order have been hindered in our assembly by the lack of long range continuity, a result of the sequencing read length. Future efforts to characterize the Giardia spp. genomes would benefit from using paired-end libraries or other strategies to join repeated regions and to avoid the typical homopolymer errors associated with the 454 technology.
A very low level of allelic sequence heterozygosity was detected in the P15 genome, which contradicts findings in the GS genome, but is consistent with observations from the WB genome. The striking differences in ASH levels could imply different biology regarding sexual reproduction, GS could be a recent mixture of two isolates, or the presence of active meiotic components in either isolate. It has been suggested that the assemblage A isolates may have a mechanism to maintain low levels of ASH, and that such a mechanism could also be operating in some assemblage E isolates. However, low ASH levels in P15 does conflict with genotyping data from assemblage E isolates where double peaked chromatograms have been observed [32]. Low ASH may be a feature of this particular assemblage E isolate but the high abundance of double peaks from other isolates could also reflect the fact that mixed infections with different assemblage E subgenotypes are more frequent in animals. Further studies of additional assemblage E isolates are needed to resolve this issue. Several genes identified as involved in DNA repair and meiosis in Giardia, were previously shown to display larger than average sequence divergence as well as insertions and deletions in the GS isolate, compared to WB [22]. An analogous comparison with the P15 isolate reveals substantial divergence in the genes Rad50, Rad52 and Mre11, and GS specific insertions in Smc5, Mlh1, Rad50 and deletions in Rad50, Msh6, Dmc1a and Smc1 that are specific for the ASH-rich GS lineage. Only one insertion in Rad50 was specific for the WB lineage whereas none were detected in P15. However, it should also be noted that genes displaying a high degree of divergence in the GS lineage are also the most diverged when comparing WB and P15. Future studies of these differences correlated to the level of ASH in disparate isolates could reveal a mechanism for dealing with ASH accumulation. These differences in gene synteny and ASH between these three Giardia isolates suggest that there is ubiquitous inter-genotype genetic diversity.
Indels have previously been noted as a prominent feature of many Giardia proteins when compared to eukaryotic homologs. Small indels are an abundant feature in pairwise comparisons between all three genomes and they are dispersed over a major fraction of the genes. In contrast to single nucleotide polymorphisms (SNPs), indels always alter the primary amino acid sequence, and are therefore more likely to impact protein function. We found that a number of genes involved in core processes of the cell have indels, which could possibly alter the functional specificity or efficiency of these proteins. Indels have been described in other species and have been attributed to DNA polymerase slippage [33].
Several groups encompassing metabolic genes were found to have elevated dN/dS ratios ( Figure 2B). Such changes could cause substrate specificity to change in enzymes or alter metabolic pathways. It is possible that these enzymes allow variation in the amino acid sequence without affecting protein function. Another possibility is that these enzymes evolve more rapidly due to adaptations to the intestinal microenvironment in the host organism. A similar differential pattern of dN/dS rate was observed for genes grouped according to developmental stage. We here observed that genes expressed in certain developmental stages have slightly different distributions of dN/dS values ( Figure 2C). The reason for this could be that certain stages need to be more tightly regulated and cannot easily accommodate variation. Another possibility would be that some stages expose proteins to the external environment, which could cause selective constraints by the host immune system.
We found evidence for a highly conserved set of core genes in Giardia which we propose is essentially common to all Giardia intestinalis isolates. This set of genes lies in genomic regions with mostly conserved synteny. Despite this, polymorphisms in terms of synonymous and nonsynonymous SNPs and indels are common in this set of genes, which could change the function or specificity of the expressed proteins. Also, cases of disrupted synteny do occur in these regions, which indicate that chromosomal recombination takes place in Giardia or in certain lineages of the species. Experimental evidence for recombination between assemblages is missing, but has been reported to occur between A2 isolates [34].
A large fraction of genes in the diplomonad genome have been proposed to have been acquired via horizontal gene transfer, due to their clustering with bacterial sequences in phylogenetic trees [10,35,26]. Here we found a number of genes only present in certain isolates ( Figure 3). Some of these genes could be involved in shaping phenotypic differences, whereas others could be non-functional remnants from recent horizontal gene transfers, or represent recent losses in one or two of the lineages. An example of a recently acquired gene is the acetyl transferase gene with an apparent bacterial origin which was identified in P15, but which is absent in WB and GS (Additional file 5). However, it is presently unknown if the gene is expressed and what the precise function of its putative product might be. Our gene content comparison (Figure 3) showed that gene acquisition and loss is much less frequent than in intestinal bacteria, such as Escherichia coli [36]. Nevertheless, it indicates that gene acquisition is an extant evolutionary process also in a eukaryotic parasite, which likely contributes to diversification of Giardia isolates over evolutionary timescales.
The dynamics of the G. intestinalis genomes are further exemplified by a putative pseudogene found in a conserved core region of assemblage E. Evolutionary analyses revealed that the gene most likely was recently acquired from bacteria, probably of the mammalian gut flora (Additional file 5). The gene then turned into a pseudogene in P15, whereas it is under purifying selection in WB and GS. The presence of this gene exclusively in the human infecting isolates is intriguing; it is tempting to speculate that it might be related to host specificity.
We could not find any clear evidence for chromosomal duplications in our genome data, but we detected the presence of a genomic region in P15 that harbors 13 genes that are absent in WB while it appears that they are present in GS. A preliminary analysis suggests that this region may be absent from other assemblage A isolates as well, and the protein products from these genes could constitute good targets for the development of assemblage discriminating antibodies for serological purposes. Host specificity or phenotypic variation could also manifest at the level of expression. Differential expression of genes between isolates could provide means for creating host specificity and phenotypic differences. The regulation of gene expression is likely to be mostly posttranscriptional because of the short relaxed promoters in Giardia, and it is possible that short RNAs could be involved. Future efforts to characterize the Giardia transcriptome would benefit from using high throughput sequencing to efficiently enhance the resolution and confirm the expression of proposed gene models.
The repeat content of the Giardia genome is approximately 9% depending on which gene families are taken into consideration. This can be compared to the 50% repeat content of another protozoan parasite, T. cruzi [37]. The relatively low complexity of the Giardia genome makes it ideal for rapid comparative genomics using shotgun sequencing.
Repeated areas in the Giardia genome contain pseudogenes, retro-transposons and low complexity repeats and they could constitute areas of increased genomic plasticity or sites where recombination occurs, which is indicated by their associations with discontinued synteny. Proteins that are encoded in these regions have no positional orthologs in other genomes but tend to be classified into one of the larger gene families in Giardia.
In other genomes with short intergenic distances, gene order has mostly been conserved. Also, the bidirectional promoters of Giardia could act to prevent reshuffling by inactivating genes upon rearrangement. Whether genomic rearrangements occur randomly, or in a controlled manner by a certain mechanism or genomic location, warrant further investigation. It is known that co-regulated genes act as a hindrance to genomic rearrangements. It could therefore be proposed that the lack of gene regulation at the transcriptional level allows for larger genomic rearrangements in Giardia without affecting gene function. The presence of retro-transposon derived sequences in non-syntenic genomic positions in WB and P15 could indicate retro-transposons have contributed to shaping the Giardia genomes by increasing the genomic plasticity.
The Giardia genomes contain a very large fraction of hypothetical genes that code for proteins of unknown function and with limited expression or proteomic data to support their expression or function. Most of these genes appear to be randomly distributed across the genome. The lack of significant hits to other protozoan genomes likely reflects the strong adaptation of the parasite to its host environment in combination with the evolutionary divergence of these organisms. As the number of sequences from diplomonads increases, due to a number of ongoing diplomonad sequencing projects, several Giardia proteins will likely be assigned orthologs, but experimental efforts will be required to elucidate the function of these proteins. The generation of the GS and P15 draft genomes have provided information about the strong conservation of these genes which indicates that they have a functional role in Giardia biology.
It is clear that the genomic data alone cannot resolve the question of host specificity and mechanisms of pathogenicity, but the genome sequence data provide an evolutionary insight into how parasite genomes have been shaped over the course of evolution. Moreover, the identification of lineage specific genes and gene variants will provide candidates for future functional studies. We hope that the draft genome sequence of this assemblage E isolate will provide useful information for future studies of the differences between Giardia strains and the diplomonads in general.

Conclusions
Our results indicate that there is extensive genetic variation between Giardia intestinalis lineages. This includes chromosomal recombination, single nucleotide polymorphisms and insertion-deletion events. The assemblages contain diverged repertoires of Giardia surface proteins and other gene families, including VSPs, HCMPs, Protein 21.1 and NEK Kinases, including hundreds of distinct proteins for each of the three sequenced Giardia genomes. The analysis of orthologs indicate a stable set of core genes, present in all of the examined assemblages, and only a small number of isolate specific genes have been identified. Certain categories of genes were found to be under a differential evolutionary rate. However, for a complete understanding of the genomic landscape of Giardia it would be useful to combine data from more than one sequencing platform to understand the chromosome-wide differences.

Reagents and cell culture
The P15 strain was originally isolated in the Czech Republic from the upper jejunum of a necropsied piglet that was naturally infected with Giardia [38]. Unless otherwise indicated, the reagents used for cell culture were obtained from Sigma Chemical Co, USA. Giardia intestinalis strain P15 trophozoites were grown in standard conditions. Trophozoites from in vitro passage 10 were used for DNA extraction. Genomic DNA was extracted using the Easy-DNA kit (Invitrogen, Carlsbad, CA, US, Cat. no. K1800-01) followed by purification on a G100 Genomic tip (Qiagen).

Sequencing, assembly and annotation
The genomic DNA was sequenced using a Genome Sequencer FLX/Titanium instrument (454 Life Sciences). Preparation and sequencing of the sample was performed according to the manufacturer's instructions. Base-calling was performed using the bundled 454 software. Three runs with the Genome Sequencer (454 Life Science) generated 1,639,140 reads (Table 1, Additional file 1). Genome assembly was performed de novo using the sequence assembly software MIRA V2.9.43 with the parameters -job = denovo,genome,normal,454 -SK:mnr = yes [39]. Manual inspection and editing of the assembly was carried out using consed and custom Perl scripts. The assembly process was complicated by the presence of repeats, which were too long to be spanned by the relatively short 454 reads.
A contamination screen was performed on the assembly using the megablast program [40] against GenBank human and fungi genomic databases and a few suspicious short, low coverage contigs were removed. Possible mis-assemblies were identified using coverage analysis and split at these points. Contigs less than 500 bp were removed from the assembly together with contigs with an average coverage less than 10.
The quality of our draft sequence is determined by several factors: the read length; the frequency of sequencing errors; the repeat content; the presence of allelic sequence heterozygosity. Given the high coverage (47×) of the genome sequence, we expect the quality of the sequence to be high and substitution errors extremely rare, however, a few (< 100) indel errors were identified due to their presence in open reading frames. The quality of the final assembly was assessed by evaluation of the consensus quality scores, only 1.69% of all positions in the assembly have quality scores less than 30. According to Phred quality scores this equals a base calling accuracy of 99.9% or 1 error in 1000 bases [41]. This would equal to 20,974 incorrectly called bases or 0.16% of the combined length.
The annotation and subsequent analysis of the contigs have primarily been focused on the non-repetitive parts of the genome. Gene prediction was performed using GLIMMER 3 and CRITICA [42,43]. Open reading frames (ORFs) with invalid start codons were removed together with short, overlapping ORFs. An automatic annotation was performed by pooling the ORFs from WB and GS and using the best reciprocal BLAST hit as a preliminary annotation. All annotations were manually curated by aligning the contigs against the scaffolds of WB and GS using TBLASTX.
Alignments of putative frame shifted genes (n = 61) were manually inspected using the assembly viewer Tablet (version 1.10.02.08) [44]. The open reading frame of 48 genes could be restored by manual inspection of the assembly alignments. Genes not fulfilling these criteria were either classified as being present in a problematic region (n = 6) or as being putative pseudogenes (n = 5) (Additional file 2).
Orthologs were assigned to homologous genes in syntenic position by using reciprocal BLAST searches and manual curation of the results. Ortholog alignments were created using the transAlign program which increases the accuracy of a nucleotide alignment by using the corresponding amino acid alignment as a template [45]. Whole genome alignments along with genomic features were visualized using the Artemis Comparison Tool (ACT) [46]. ORFs located in contig breaks that span two or more contigs were annotated as partial and designated herein as truncated. In order to improve the Giardia genome annotation, all hypothetical proteins were queried against the Gen-Bank non-redundant protein database and the Gen-Bank Conserved Domain Database (curated protein families from Pfam and other sources). Statistics were done using the R software platform (version 2.9.2). A Kruskal-Wallis test and a Mann-Whitney test were conducted to test significance of Gene Ontology categories. 5% was used as a significance level. SAGE data was used as present in Gene Expression Omnibus and Stekel et al.'s R-statistic was used to select differentially expressed genes (R > 8) and clustered according to expressional profile [24]. This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession [Gen-Bank:ACVC00000000]. The version described herein is the first version, ACVC01000000. The data is also available in GiardiaDB [47].

Isolate specific genes and synteny
Isolate specific genes were identified by reciprocal BLAST searches, using BLASTP and TBLASTN (cutoff E = 1e-20). The final gene lists were compiled by manual curation of BLAST search results and investigation of the associated genomic regions. While the current assembly limited us from studying chromosome wide synteny, we analyzed synteny blocks and microsynteny. Genes associated with synteny breaks were recorded and categorized according to available annotation.

Evolutionary analysis and allelic sequence heterozygosity
Calculation of the dN/dS ratio was done using the yn00 program of the PAML package [48] (version 4.2b). The trinucleotide composition was examined using χ 2 analysis of codon frequencies. This was performed by counting the frequencies of 3-mers in all 6 reading frames in overlapping windows and using the genome wide 3mere frequencies as the expected value when calculating the χ 2 statistics. By plotting the data using Circos [49], distinct peaks appear in areas where the trinucleotide (atypical) composition deviate from the average. Phylogenetic trees were constructed from clustalw2 protein alignments and clustered using maximum likelihood, as implemented in the software RAxML (version 7.0.4) [50]. The gamma model of rate heterogeneity was used together with the WAG substitution model for the optimal tree. Bootstrap analyses were performed on 100 replicates using the same settings. The VSP tree was rendered using the ape package for the R statistical platform [51].
We used strict criteria to define a single position as a possible allelic sequence heterozygosity (ASH). Only positions with higher coverage than 20× were analyzed without gaps or ambiguous nucleotides ('n'). Alternative nucleotides located in reads with identical start positions were ignored because of the likelihood of sequencing errors. Further, alternative nucleotides would be required to be present in at least 15% of the counted positions.

Introns, non coding sequences and motif discovery
Introns in the P15 genome were identified by alignment of syntenic regions in the three Giardia genomes. Introns and surrounding sequences from the three Giardia genomes were extracted and aligned using ClustalW. The aligned sequences were manually edited and used to build a regular expression capable of extracting all known Giardia introns from the WB, GS and P15 Giardia genomes. We analyzed the upstream and downstream regions of VSP genes in P15 and WB for common motifs using the MEME program [52]. Non-coding RNAs in P15 were identified by BLASTN searches using gene models from WB.

Confirmation of pseudogenes and synteny breaks
Primers were designed manually according to recommendations in the Phusion HotStart II polymerase instruction manual (Tm 60°C and 22-25 bp in length) and synthesized by Sigma-Genosys. The targets were amplified in a mixture containing 1 × Phusion HF buffer with 1.5 mM MgCl2, 200 μM dNTPs, 0.5 μM of the forward and reverse primers, 40 ng template genomic DNA (either P15 or WB) and 0.8 U Phusion HS II DNA polymerase (Finnzymes) in a total volume of 40 μl. The reactions were incubated for 3 min at 98°C followed by (98°C for 15 sec, 60°C for 15 sec, 72°C for 30 sec/1 kb of expected amplicon) × 35 cycles and were subsequently held at 4°C. Amplicons were separated by 0.7% 1 × TAE agarose gel electrophoresis and visualized by ethidium bromide staining. The PCR products were purified using the QIAquick PCR purification kit according to the manufacturer's recommendations and eluted in 30 μl Buffer AE. The purified PCR products were sequenced with their respective forward and/or reverse primers at the Uppsala Genome Center using the Big-Dye® Terminator v3.1 (Applied Biosystems) chemistry followed by capillary electrophoresis on an ABI3730XL sequencer (Applied Biosystems).