Insights into the genome sequence of a free-living Kinetoplastid: Bodo saltans (Kinetoplastida: Euglenozoa)

Background: Bodo saltans is a free-living kinetoplastid and among the closest relatives of the trypanosomatid parasites, which cause such human diseases as African sleeping sickness, leishmaniasis and Chagas disease. A B. saltans genome sequence will provide a free-living comparison with parasitic genomes necessary for comparative analyses of existing and future trypanosomatid genomic resources. Various coding regions were sequenced to provide a preliminary insight into the bodonid genome sequence, relative to trypanosomatid sequences.
Results: 0.4 Mbp of the B. saltans genome was sequenced from 12 distinct regions and contained 178 coding sequences. As in trypanosomatids, introns were absent and %GC was elevated in coding regions, greatly assisting in gene finding. In the regions studied, roughly 60% of all genes had homologs in trypanosomatids, while 28% were Bodo-specific. Intergenic sequences were typically short, resulting in higher gene density than in trypanosomatids. Although synteny was typically conserved for those genes with trypanosomatid homologs, strict colinearity was rarely observed because gene order was regularly disrupted by Bodo-specific genes.
Conclusion: The B. saltans genome contains both sequences homologous to trypanosomatids and sequences never seen before. Structural similarities suggest that its assembly should be solvable, and, although de novo assembly will be necessary, existing trypanosomatid projects will provide some guide to annotation. A complete genome sequence will provide an effective ancestral model for understanding the shared and derived features of known trypanosomatid genomes, but it will also identify those kinetoplastid genome features lost during the evolution of parasitism.


Background
The Kinetoplastida (Euglenozoa) are unicellular flagellates that include the trypanosomatid parasites, most notably Trypanosoma brucei, T. cruzi and Leishmania spp. These organisms cause substantial mortality and morbidity in humans and their livestock worldwide as the causative agents of African sleeping sickness, Chagas disease and leishmaniasis respectively. Bodo saltans is a free-living heterotroph found worldwide in freshwater and marine habitats. It possesses the diagnostic kinetoplastid features, such as flagella sited within a specialised flagellar pocket, glycolytic processes confined to a dedicated organelle (the 'glycosome'), and the characteristic concentration of mitochondrial DNA at the base of the flagellum (the 'kinetoplast') [1,2]. When comparing trypanosomatid parasites with each other, or collectively with other eukaryotes, the value of B. saltans is as a non-parasitic near relative, (i.e., an 'outgroup'), that can illuminate their key evolutionary transitions. Five draft genome sequences exist for Trypanosoma spp. and four for Leishmania spp. [3][4][5][6][7]; these will be augmented with further strains and other non-human parasites in the coming years [8]. With such excellent comparative resources in place or in development, there is a critical need for a non-trypanosomatid outgroup. In effect, it will provide a model of the ancestral trypanosomatid to distinguish those derived parts of the parasite genomes (i.e., unique trypanosomatid adaptations) from those which are a legacy of the free-living ancestor. For instance, such a model will help to resolve whether trypanosomatids previously possessed an algal plastid from which 'plant-like' genes in trypanosomatid genomes are derived [9][10][11]. As a prelude to a complete B. saltans genome sequencing effort, this study sought to establish an initial understanding of the bodonid genome, its structure and content relative to the trypanosomatids.
The most recent kinetoplastid phylogeny has shown that trypanosomatid parasites are just one of many independent acquisitions of parasitism and, indeed, a relatively minor component of total diversity [12][13][14][15]. Nonetheless, they are, naturally, the most important aspect of kinetoplastid diversity. Many features of their completed genome sequences emphasised the common ancestry of T. brucei, T. cruzi and Leishmania spp., especially with respect to gene repertoire and order [16], but their critical pathological differences were also evident at the genomic level. The three human parasites cause distinct diseases; their genomes contain enigmatic adaptations related to pathogenesis and immune evasion, for instance the bloodstream expression site in T. brucei from which its variant surface glycoproteins (VSG) are expressed [17,18], and surface antigen families in general [16]. Without a historical dimension, these features can be neither compared nor understood in an evolutionary context. As it is among the closest bodonid relatives of the trypanosomatids [19], Bodo saltans is a suitable outgroup to address three principal comparative issues: i) understanding how human trypanosomatid parasites acquired their distinct pathological strategies; ii) understanding how the ancestral trypanosomatid became parasitic in terms of derived innovations (e.g., cell surfaces) and loss of genomic repertoire; iii) understanding how typical kinetoplastid features (e.g., glycosomes) evolved and how these might have been modified for parasitism.
Quite what to expect from a bodonid genome sequence is an open question. Beyond the basic kinetoplastid features named above, the biological differences between bodonids and trypanosomatids are striking. While B. saltans is a bacteriovore, especially prevalent in polluted waters or other environments with high bacterial densities [1], trypanosomatids are obligate parasites inhabiting a nutrient-rich, but ultimately hostile, host environment, and adept at exploiting their eutrophic environment to maximise proliferation and transmission. By contrast, B. saltans preys on bacterial cells [1,2] and is probably adapted for resource acquisition within its relatively oligotrophic environment. Although bodonids and trypanosomatids are all flagellates, trypanosomatids attach their single flagellum to the cell surface to generate motile force, whereas the anterior flagellum in B. saltans is modified with hairlike mastigonemes, which may assist prey location during feeding [2][20][21][22]. There are wider cytoskeletal differences also; the subpellicular microtubular cortex is instrumental in maintaining the numerous cell forms adopted by trypanosomatids [23], but is reduced in bodonids (which lack complex developmental stages) to the region around the cytostome [2,24]. Perhaps most importantly for understanding the evolution of parasitism, we can expect substantial differences between trypanosomatid cell surfaces that function primarily to manipulate and frustrate the host immune response and bodonid membranes that are perhaps largely concerned with cellular homeostasis.
Rather than providing definitive answers to these questions, the preliminary sequence data presented here provide an initial insight into a few comprehensively resolved locations in the B. saltans genome, indicating what to expect from gene content and arrangement, and testing the feasibility of a complete sequence project. The sequence contigs were compared with corresponding regions in trypanosomatids (based on conserved gene order, where this existed) to examine gene content and the conservation of gene order (i.e., colinearity) and, therefore, the potential for using trypanosomatid genome sequences as scaffolds to assist assembly and annotation of the B. saltans sequence.

Gene structure
Clones were selected from the B. saltans fosmid library according to random end-sequences and positive results for specific PCR probes. Inserts from 12 fosmid clones were shotgun sequenced, comprising 0.403 Mbp in total with an average insert size of 33.6 Kbp. Table 1 describes the composition of the 12 contigs in terms of the affinity shown by each putative coding sequence to sequence databases. 178 putative coding sequences are specified; genes could be predicted by eye because of a definite elevation in GC content in coding regions. Subsequent matches to sequence databases confirmed these predictions. The boundaries between coding and flanking regions are marked by a transition from GC-rich to AT-rich signatures; the sequences shown in Figure 1 clearly demonstrate the GC troughs that appear between coding sequences. This pattern is repeated in other contigs, as shown in subsequent figures. Gene density is high relative to corresponding regions in the L. major and T. brucei genome sequences, reflected by the consistently short intercoding sequences across all contigs (average = 377.2 bp). Figure 2 compares the gene order of one region (average intercoding sequence length = 439.7 bp) with positionally orthologous regions in L. major (average = 1480.6 bp) and T. brucei (average = 1129.4 bp); this, like most fosmid inserts, contains more genes in Bodo than in trypanosomatids.
Table 1 shows that 106/178 coding sequences (59.6%) are homologs of known trypanosomatid genes. The percentage nucleotide identity between bodonid and trypanosomatid proteins varies greatly; genes of known conservatism display high identity (α-tubulin, 98%; β-tubulin, 99%; HSP70, 95%; GAPDH, 81%), but on average coding sequences are 44.38% identical and the most abundant identity class is 30-39%. Hence, most orthologs in these two classes have diverged by two-thirds or more. Of those coding sequences without trypanosomatid homologs, 20 show homology with other eukaryotes, 2 are of bacterial affinity, and the remainder (28.1%) are without matches to any database, i.e., Bodo-specific. Despite the bacterial contamination inevitable in DNA preparations (see methods), we can be certain that these bacterial-type coding sequences are not artefacts because they are present in fosmid inserts otherwise composed of eukaryotic sequences, and individual sequence clones span both the bacterial-type gene and surrounding eukaryotic-type sequence.
Although present in B. saltans, some of the familiar genes intensively studied in trypanosomatids are found in novel contexts. Figure 1 describes tandem gene arrays of HSP70 and tubulin, which are found in locations unlike those in trypanosomatids. An alternating tandem array containing α- and β-tubulin is found in two distinct inserts (Figure 1b and 1c); α- and β-tubulin isoforms contain no amino acid differences but had dissimilar (unalignable) 3' untranscribed regions.
Figure 1. Schematic representation of three regions of the B. saltans genome sequence, as shown in the Artemis genome browser. Six reading frames are shown as parallel grey bars; scale in base-pairs. Base composition is plotted above. Putative coding sequences are shown as coloured boxes: red (homolog of trypanosomatid gene with known function), orange (homolog of hypothetical trypanosomatid gene), green (hypothetical gene with no trypanosomatid homolog but a positive functional match to a sequence database), blue (hypothetical gene with no matches to sequence databases). Labels attending these coding sequences contain the GeneDB identification numbers of homologous trypanosomatid genes where possible, or the description of homologous genes detected by BLAST comparisons (with % identity). Predicted transmembrane helices (blue) and signal peptides (purple) are shown on the DNA strands below the coding sequence. a. Clone '16k02' containing a tandem gene array of heat-shock protein 70. b. Clone '14l17' containing a tandem gene array of α- and β-tubulin. An asterisk * denotes a β-tubulin gene disrupted by a single base deletion at position 589. c. Clone '5m18' containing a second tandem gene array of α- and β-tubulin.
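The GC elevation in coding regions described in this section can be illustrated with a sliding-window scan. The following is a minimal sketch and not the procedure used in this study, where genes were predicted by eye; the function names, the window and step sizes, and the GC threshold are all illustrative assumptions.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_windows(seq, window=120, step=30):
    """Yield (start, gc_fraction) for successive windows along the sequence."""
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, gc_fraction(seq[start:start + window])


def gc_rich_segments(seq, window=120, step=30, threshold=0.55):
    """Merge consecutive windows with GC >= threshold into (start, end) segments.

    The threshold is a hypothetical value; a real analysis would calibrate it
    against the base composition of the genome in question.
    """
    segments = []
    current = None
    for start, gc in gc_windows(seq, window, step):
        end = min(start + window, len(seq))
        if gc >= threshold:
            if current is None:
                current = [start, end]   # open a new GC-rich segment
            else:
                current[1] = end         # extend the open segment
        elif current is not None:
            segments.append(tuple(current))  # a GC trough closes the segment
            current = None
    if current is not None:
        segments.append(tuple(current))
    return segments


if __name__ == "__main__":
    # Toy sequence: AT-rich flanks around a GC-rich block standing in for a gene.
    toy = "AT" * 100 + "GC" * 150 + "AT" * 100
    print(gc_rich_segments(toy, window=100, step=50, threshold=0.6))
```

Segments reported this way are only candidate coding regions; as in the text, they would still need to be checked against open reading frames and database matches.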

Gene content
Coding sequences without trypanosomatid homologs were compared to sequence and structural databases (see Table 2). Many of the gene products are homologous to proteins beyond the Kinetoplastida, suggesting that they are core eukaryotic proteins subsequently lost from trypanosomatids; for example, contiguous genes in Figure 1a are homologous to a GPI-anchored protein in plants and fungi (16273-17686 bp) and a hypothetical gene in Metazoa (18275-19486 bp). Gene products in other regions (not shown) contain protein domains known elsewhere, for example an ABC transporter protein (clone '5e15', 21065 bp) and a nucleotide-sugar transporter protein (clone '45a12', 41227 bp), strongly indicating that these are Bodo-specific members of ubiquitous gene families. Some otherwise uncharacterised hypothetical proteins are predicted to be expressed on the cell surface. The region shown in Figure 3a is notable not only for the base composition of coding regions and admixture of trypanosomatid and Bodo-specific genes mentioned above, but also for a hypothetical protein (16737-17978 bp) with 7 predicted transmembrane helices and a signal peptide. Table 2 contains other examples of Bodo-specific hypothetical genes predicted to be surface expressed, including those shown in Figure 1a (27014-28063 bp) and b (16869-18011 bp).

Colinearity
The extent of conserved gene order, or colinearity, between bodonid and trypanosomatid genome sequences was assessed using the Artemis Comparison Tool (ACT, see methods). One region of excellent colinearity is shown in Figure 2 and, despite disruption by some eukaryotic genes not seen in trypanosomatids, this contig corresponds unmistakably with chromosome 18 in L. major and chromosome 10 in T. brucei. Conversely, the presence of so many non-trypanosomatid genes meant that colinearity disappears entirely in some locations, as shown in Figure 1. Across the 12 genomic regions, however, both patterns were atypical; most regions show brief patches of colinearity, perhaps 2 or 3 genes with conserved synteny, set among larger regions of Bodo-specific genes or homologs to trypanosomatid genes from elsewhere in the genome. In this sense, the sequence presented in Figure 3 is representative because several coding sequences are homologs of trypanosomatid genes on chromosomes 13 (L. major) and 11 (T. brucei); these are roughly colinear but the order is disrupted by genes present on other chromosomes or by Bodo-specific genes.
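The notion of "brief patches of colinearity" can be made concrete with a small sketch. It is not the ACT comparison used in this study: it assumes each B. saltans gene has already been assigned either a trypanosomatid ortholog position (chromosome label plus gene index along that chromosome) or no ortholog, and it applies a deliberately strict definition in which any Bodo-specific gene or change of chromosome ends a run. The function name, input format and chromosome labels are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (chromosome label, gene index along that chromosome), or None when the
# B. saltans gene has no trypanosomatid homolog.
Ortholog = Optional[Tuple[str, int]]


def colinear_runs(orthologs: Sequence[Ortholog], min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) ranges, end exclusive, of strictly colinear runs.

    A run is a stretch of consecutive B. saltans genes whose orthologs lie on
    the same chromosome at adjacent indices, all in one direction. Runs are
    found greedily from left to right and do not overlap.
    """
    runs = []
    start = 0
    direction = 0  # 0 until a run has two genes, then +1 or -1
    n = len(orthologs)
    for i in range(1, n + 1):
        step = 0
        if i < n:
            prev, cur = orthologs[i - 1], orthologs[i]
            if prev is not None and cur is not None and prev[0] == cur[0]:
                step = cur[1] - prev[1]
        extends = step in (1, -1) and direction in (0, step)
        if extends:
            direction = step
        else:
            # Close the current run and start a new one at gene i.
            if i - start >= min_len and orthologs[start] is not None:
                runs.append((start, i))
            start = i
            direction = 0
    return runs


if __name__ == "__main__":
    # Hypothetical contig: three colinear genes, a Bodo-specific gene, a gene
    # from another chromosome, then two genes colinear in reverse orientation.
    contig = [("chrA", 10), ("chrA", 11), ("chrA", 12), None,
              ("chrB", 3), ("chrA", 14), ("chrA", 13)]
    print(colinear_runs(contig))
```

Under this strict definition a region such as the one in Figure 2, where colinearity persists across interposed non-trypanosomatid genes, would be split into several short runs; tolerating gaps would require an additional parameter.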

Discussion
In this study, various locations in the B. saltans genome, amounting to ~0.4 Mbp, were sequenced. Assuming that the bodonid genome is approximately the same size as a trypanosomatid haploid genome, i.e., 35-55 Mbp [16], these sequences comprise ~1% of the complete genome sequence, which will therefore contain roughly 14,000 genes. The success and utility of a B. saltans genome project will depend on its relationship with existing trypanosomatid genome sequences. This study shows that coding regions of the B. saltans genome share several structural features with trypanosomatids, indicating that the project is both feasible and likely to provide a useful comparative resource. Putative B. saltans genes lack introns, as in most trypanosomatid genes [3][4][5]. They display a conspicuous elevation in GC content, which will greatly assist gene finding. No evidence of strand-switching was observed in B. saltans, corroborating the view that it operates polycistronic transcription [25], i.e., transcription of many contiguous loci within a single nascent transcript [26][27][28], which is subsequently trans-spliced and polyadenylated to produce mature mRNA, as in trypanosomatids [29][30][31][32][33].
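The arithmetic behind the sampled fraction can be checked with a few lines. This is a sketch under stated assumptions: it uses only the figures given in the text (0.403 Mbp sequenced, 178 coding sequences, an assumed genome size of 35-55 Mbp), and the linear, density-based extrapolation is an illustrative simplification, not the authors' stated method.

```python
SAMPLED_MBP = 0.403           # total sequence from the 12 fosmid inserts
N_CODING_SEQUENCES = 178      # putative coding sequences found in them
GENOME_SIZES_MBP = (35, 55)   # assumed range, by analogy with trypanosomatids


def genome_fraction(sampled_mbp, genome_mbp):
    """Fraction of the genome covered by the sampled sequence."""
    return sampled_mbp / genome_mbp


def extrapolated_gene_count(n_genes, sampled_mbp, genome_mbp):
    """Naive gene count, assuming the sampled gene density holds genome-wide."""
    return n_genes / sampled_mbp * genome_mbp


if __name__ == "__main__":
    for size in GENOME_SIZES_MBP:
        frac = genome_fraction(SAMPLED_MBP, size)
        genes = extrapolated_gene_count(N_CODING_SEQUENCES, SAMPLED_MBP, size)
        print(f"{size} Mbp genome: {frac:.2%} sampled, ~{genes:,.0f} genes (naive)")
```

Both ends of the assumed size range give a sampled fraction close to 1%. The naive gene count comes out above the rough figure quoted in the text; it depends strongly on the assumed genome size and on how representative the sampled inserts are (some were chosen with gene-specific probes and contain tandem arrays), so it serves only as an order-of-magnitude check.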
Although the arrangement of coding regions along the bodonid chromosome may be conserved with trypanosomatids, it is clear that gene order was not. The extent of conserved synteny, or rather colinear gene order, between bodonid and trypanosomatid genomes is of particular importance to the assembly of any B. saltans genome sequence. The coding regions presented here indicate that trypanosomatid genome sequences will be of limited value in the global assembly of a B. saltans genome sequence. Strict colinearity was not normally observed, if only because of the large number of Bodo-specific genes interposed between trypanosomatid homologs. Colinearity tended to persist over a distance of 3-5 genes, although some regions displayed conspicuous conservation (e.g., Figure 2), while others showed none at all (e.g., Figure 1). Therefore, this initial exploration of the B. saltans genome demonstrates that it should be possible to resolve a complete genome sequence, but, while the existing trypanosomatid resources will provide some useful guides for annotation, they could not be used as scaffolds for assembly, which should proceed de novo.
The purpose of a completed B. saltans genome sequence would be to understand the evolution of trypanosomatid genome sequences. The mixture of familiar and novel features in the regions sequenced here indicates the value of a bodonid genome sequence in distinguishing trypanosomatid characters inherited from free-living ancestors (and still shared with them) from characters evolved since the origin of trypanosomatids. Hence, the first application would be in determining which parts of the trypanosomatid genome reflect the genomic legacy inherited from free-living ancestors, and in showing how they have been co-opted and modified for parasitism. Bodonid and trypanosomatid cells share various structural features, principally those that characterise kinetoplastid cells. Bodonids arrange their mitochondrial DNA in kinetoplasts, although their position within the cell differs from trypanosomatids [1], and conduct their glycolytic pathways within a dedicated organelle (the glycosome) [2]. Bodonids construct their flagella in a similar manner to trypanosomatids, but deploy them very differently [1]. While B. saltans uses one flagellum for movement and another for feeding, trypanosomatid flagella perform their motility function within the context of their sophisticated cell forms.
One might expect these structural similarities to be reflected at the genomic level. α- and β-tubulin, the proteins that facilitate the development of flagella in trypanosomatids, are known to be arranged in tandem gene arrays, with an alternating, heterotypic α-β array in Trypanosoma spp. and distinct, monotypic α and β arrays in Leishmania spp. [34][35][36][37]. Bodonids were shown to share the alternating conformation, suggesting that Leishmania spp. and their relatives had abolished the ancestral locus and evolved novel genomic repertoires [38]. However, two B. saltans regions containing tubulin in this study show that modification of tubulin repertoire has also occurred in Trypanosoma, since neither of the α-β arrays in B. saltans was found at the genomic position occupied in trypanosomes. This demonstrates the utility of the B. saltans genome in resolving the evolutionary causes of structural or compositional differences between trypanosomatid genomes.
The second application of a B. saltans genome sequence would be to identify which components of the free-living legacy have been lost from trypanosomatids, and therefore, how reductive genome evolution has contributed to the parasite genomes. Table 2 describes many predicted proteins identified in B. saltans that have no trypanosomatid homologs. Among these mostly Bodo-specific genes are membrane transporters, various protein kinases, and other proteins containing domains commonly associated with cell surfaces. These and other Bodo-specific proteins must include components of the metabolic pathways, intracellular transport, cellular signalling and subcellular structures that exist in free-living kinetoplastids but have been lost during the evolution of parasitism. Many of these proteins will be widespread among eukaryotic lineages, as is evident in Table 2; yet we should also expect to encounter a considerable genetic repertoire unique to the Kinetoplastida and so entirely new.
Having identified those features of trypanosomatid genomes that reflect their free-living ancestry, a B. saltans genome sequence would also reveal the additions to each parasite genome; structures derived from existing genes and co-opted for novel uses, and genuinely novel genes involved in parasite-specific adaptations. These enigmatic genes include the numerous and diverse families of surface glycoprotein that form the protective coats around trypanosomatid parasites. T. brucei, T. cruzi and L. major each display highly derived and complex surface coats to frustrate host immunity, yet they differ in structure and substance and it is not known how each acquired its distinct solution to their common problem. Understanding the origins of these surface architectures will only be achieved with an historical perspective; one principal objective of a B. saltans genome project would be to identify the precursors of proteins such as VSG in T. brucei, mucins and trans-sialidase in T. cruzi, and proteophosphoglycans in Leishmania spp. (amongst others). A glimpse of this potential is seen in Figure 3, which includes a predicted protein with a complex 24 amino acid repeat (13440-16655 bp). The protein had a high affinity (42% amino acids identical) with a gene family on chromosome 12 in Leishmania spp. (currently annotated as 'surface antigens'), and a more distant affinity with proteophosphoglycans. Figure 3b shows a sequence alignment of the repeat domain from the B. saltans protein and its leishmanial homologs, where the level of amino acid identity rises to 50%.
Note: † Clones for which no probe is stated were selected after end-sequencing. * Genes were classified according to their affinity: kinetoplastid (K), other eukaryotic (E), prokaryotic (P) or unknown (U).

Conclusion
Thorough sequencing of a few locations in the B. saltans genome has revealed clear similarities with trypanosomatids, but has also shown that trypanosomatid genome sequences will not be effective guides for any complete bodonid project, due to significant differences in content and gene order. This mixture of familiar and novel features suggests that B. saltans will indeed provide an effective outgroup for comparisons of trypanosomatid parasites, and, as with the evolution of tubulin repertoire, the historical perspective to understand which aspects of trypanosomatid biology have been retained from their common ancestry, which have been lost, and what has been uniquely derived since.
a. Schematic representation of a 32.9 Kb fragment of B. saltans genome sequence (clone 96g09) in the Artemis genome browser.