Insights into a dinoflagellate genome through expressed sequence tag analysis

Background Dinoflagellates are important marine primary producers and grazers and cause toxic "red tides". These taxa are characterized by many unique features such as immense genomes, the absence of nucleosomes, and photosynthetic organelles (plastids) that have been gained and lost multiple times. We generated EST sequences from non-normalized and normalized cDNA libraries from a culture of the toxic species Alexandrium tamarense to elucidate dinoflagellate evolution. Previous analyses of these data have clarified plastid origin and here we study the gene content, annotate the ESTs, and analyze the genes that are putatively involved in DNA packaging. Results Approximately 20% of the 6,723 unique (11,171 total 3'-reads) ESTs data could be annotated using Blast searches against GenBank. Several putative dinoflagellate-specific mRNAs were identified, including one novel plastid protein. Dinoflagellate genes, similar to other eukaryotes, have a high GC-content that is reflected in the amino acid codon usage. Highly represented transcripts include histone-like (HLP) and luciferin binding proteins and several genes occur in families that encode nearly identical proteins. We also identified rare transcripts encoding a predicted protein highly similar to histone H2A.X. We speculate this histone may be retained for its role in DNA double-strand break repair. Conclusion This is the most extensive collection to date of ESTs from a toxic dinoflagellate. These data will be instrumental to future research to understand the unique and complex cell biology of these organisms and for potentially identifying the genes involved in toxin production.

toxins and form harmful algal blooms, or "red tides", resulting from populations of more than 20 million cells per liter of seawater. The toxins cause a variety of poisonings that affect humans and marine wildlife [1] and have a significant impact on coastal ecosystems throughout the world [3]. Yet, other dinoflagellates, like Symbodinium, are central contributors to the health of reef ecosystems as the symbionts of corals [4]. Loss of the dinoflagellate symbiont results in coral bleaching. In addition to their ecological role, dinoflagellates display some fascinating and unique aspects of cell biology. One intriguing character is nuclear biology. The nucleus of dinoflagellates is unlike that of any other eukaryote because the chromosomes are condensed throughout the cell cycle except during DNA replication [5]. The morphologically similar chromosomes are attached to the nuclear envelope and can number in the hundreds [6]. Dinoflagellates also lack nucleosomes [7], instead the nuclear DNA is associated with basic proteins that are moderately similar to bacterial histone-like proteins (HLPs [8,9]). Dinoflagellates were thought to lack histones [10], but in a recent gene expression study, a putative histone H3 was annotated in Pyrocystis lunula, although the sequence was not analyzed further [11]. The general lack of nucleosomes raises many questions about transcription and gene regulation in these organisms. Dinoflagellate nuclei also contain vast amounts of DNA compared to other eukaryotes. Estimates range from 3 -250 pg·cell -1 , or approximately 3,000 -215,000 megabases (MB) [12]. In comparison, human nuclei contain 3.2 pg·cell -1 (3,180 MB). The dinoflagellate nucleus contains such a high concentration of DNA that it exists in a liquid crystal state, which is responsible for the unique morphology [13,14]. The DNA to basic protein ratio of dinoflagellate chromosomes has been estimated to be 10:1, which is dramatically higher than the 1:1 ratio observed in most eukaryotes. This indicates that very little basic protein is associated with dinoflagellate chromosomes and that the crystal structure is the primary cause of the unusual morphology. Dinoflagellates are also the only eukaryotes to contain hydroxymethyluracil, a deaminated nucleotide that can be produced by oxidative damage of DNA, which replaces 12 -70% of the thymidine [15]. The role of polyploidy or potentially, genome amplification within particular life history stages remains to be clarified for dinoflagellates. It is highly unlikely, however, given their relatively simple morphology that the immense DNA content is explained solely by gene content.
The most widespread plastid in dinoflagellates contains the unique photopigment peridinin. The "peridinin plastid" is remarkably different from this organelle in other eukaryotes because it lacks a typical genome. Plastids normally contain a circular genome of about 150 kb that encodes 100 -200 genes that are necessary for plastid function. In peridinin-containing dinoflagellates, the plastid genome has been broken into minicircles that encode a single, or a few genes per circle. However, only 16 genes have been identified thus far on minicircles [16,17]. Recent studies show that most of the plastid genes have been transferred to the nucleus [18,19] with 15 of these genes found exclusively on the plastid genome in all other photosynthetic eukaryotes [18]. The peridinin dinoflagellates encode therefore the smallest number of plastid genes of any photosynthetic eukaryote, making them a model for understanding organellar gene transfer. Nuclear-encoded plastid proteins are targeted to the plastid using a tripartite N-terminal targeting signal [20]. As in Euglena, nuclear-encoded plastid proteins are co-translationally inserted into the endoplasmic reticulum and embedded in this membrane using a stop-transfer sequence in the N-terminus. Through algal endosymbioses, the dinoflagellates have been able to acquire four other types of plastids from distantly related evolutionary lineages including the haptophytes, cryptophytes, diatoms, and prasinophytes [1,21]. This aspect of their evolutionary history highlights the unmatched ability of dinoflagellates to capture and retain foreign plastids.
Alexandrium tamarense is one of the best-studied dinoflagellates. This species forms toxic blooms and causes paralytic shellfish poisoning through saxitoxin production. It has a peridinin-containing plastid and in North America, A. tamarense blooms from Alaska to Southern California in the Pacific and along the Canadian and New England coasts in the Atlantic. There has been a recent increase in blooms of A. tamarense and other Alexandrium species in other parts of the world making this genus of high importance to the world's fisheries. We undertook a gene discovery project with this organism using expressed sequence tag (EST) data to investigate dinoflagellate evolution and to create a genomic resource for scientists working on different aspects of A. tamarense and dinoflagellate biology. The EST method was the most reasonable approach in this case because haploid A. tamarense cells contain approximately 143 chromosomes and have a genome size of 200 pg/cell (ca. 200,000 Mb [Erdner and Anderson unpublished data]). Our EST results comprise the first extensive high-throughput, genome-wide data set for a dinoflagellate.

Clustering and sequence analyses
The collection of 11,171 ESTs comprised of single-pass 3'reads (483 from the start library and 10,688 from the normalized library) from A. tamarense was assembled into 6,723 clusters. The normalized library showed a high degree of complexity, with a novelty rate of 60.18% and about 52% of the sequenced clones contained inserts that were longer than the single sequence read (ca. 750 bp).
Clustering of the total EST set showed that most of the reads were singletons (4,618 sequences) and the largest cluster was comprised of 46 ESTs that are closely related to HLPs (Table 1). Other highly represented transcripts were those encoding luciferin-binding protein (a protein involved in the regulation of bioluminescence) and several photosynthetic proteins (e.g., Rubisco, ATP synthase C chain, light harvesting proteins). Several large clusters were transcripts that lacked a similarity (e-value < e -5 ) to known proteins. One of these ESTs has an open reading frame that encodes a protein with a potential plastid-targeting signal ( Figure 1A). Interestingly, database searches against NCBI's nr and dbEST returned hits only to other dinoflagellate ESTs. Another of the largest clusters only had hits to ESTs from other dinoflagellates ( Figure 1B). These two proteins are therefore candidates for dinoflagellate-specific proteins.
Each cluster was searched against the SwissProt protein database using blastx. A total of 515 hits with an e-value less than 1e-20 were identified that terminated within 10 amino acids of the end of the SwissProt entry. From these hits, we estimated that the 3'-UTRs ranged in length from 25 -620 nt with a mean length of 155 nt. This is shorter than the average length observed for fungi (~200 nt) and metazoans (300-600 nt) [22]. However, this analysis is likely to be an underestimate of the average 3'-UTR length because only ESTs that were sequenced into the coding region were included in the analysis. The 3'-UTRs of A. tamarense cDNAs are also interesting because of their apparent lack of a polyA signal. Both simple n-mer searches (e.g. hexamer, pentamer) and the Gibb's sampler were used to assay the canonical region from -11 to -30 preceding the polyadenylation site in search of a polyadenylation signal. We were unable to find a single or a related set of hexamers or pentamers that are enriched in the 3'-UTRs (data not shown). Clearly, polyadenylation of transcripts occurs in A. tamarense, however, the mechanism by which this process takes place apparently does not involve a typical polyA signal. These ESTs were also analyzed for GC-content and codon usage. Coding region GC-content was 60.8%, whereas GC-content in the 3'-UTR was slightly less at 57.6%. The GC-content is reflected in the codon usage (Table 2), whereby 3 rd positions are strongly biased towards Gs or Cs. The stop codon TGA is also significantly favoured over TAG and TAA (frequencies of 411, 71, and 25 occurrences, respectively). The accession numbers of SwissProt hits with an e-value of 9e -10 and below (1,292 sequences) were submitted to the ProToGo server for GO category assignment [23]. A total of 1,203 of the SwissProt accession numbers could be assigned to GO categories. The results are summarized in Figure 2. The functional distribution of the A. tamarense ESTs that could be placed among GO categories is typical of other eukaryotes. However, the overall small number (i.e., 20%) of significant hits to GenBank is surprising, suggesting that many A. tamarense proteins may be either highly diverged and/or encode novel dinoflagellate-specific functions (e.g., Figure 1), or the sequence does not extend into the coding region of the transcript.

Dinoflagellate gene content and gene families
Of species with sequenced genomes, the apicomplexan Plasmodium falciparum is the most closely related organism to A. tamarense. Both of these species are members of the alveolate lineage with dinoflagellates and apicomplexans forming a monophyletic clade that is sister to the ciliates (e.g., [23]). Sequence comparisons using BLAST revealed that 609 of the 6723 A. tamarense ESTs had a significant hit (e-value less than 1e -10 ) to P. falciparum proteins. The top 20 most significant hits are shown in Table  3. The most highly conserved proteins between these organisms include many "housekeeping" proteins such as α-tubulin and heat shock protein 70. Despite their close evolutionary relationship, there are however likely to be substantial differences between A. tamarense and P. falciparum with respect to gene content. Due to the apicomplexan intracellular lifestyle, P. falciparum has lost most of the genes related to plastid function as well as other metabolic genes. Many of these same proteins appear in the list of the top BLAST hits against the nr database of GenBank (Table 4). There were 1,349 hits to the nr database that were better than 1e -10 .
As previously mentioned, our bioinformatic analyses identified 6,723 clusters of unique genes. However, this is likely to be a conservative estimate of the number of unique transcripts that were sequenced. A combination of short 3'-UTRs and highly conserved coding regions caused many related transcripts to be assembled together, even though their 3'-UTRs contained sequence differences. For example, two large clusters comprise ESTs that correspond to the plastid atpH gene that encodes the ATP synthase C chain. This gene is normally plastid encoded in other photosynthetic eukaryotes. These two clusters form closely Putative dinoflagellate-specific proteins Figure 1 Putative dinoflagellate-specific proteins. Amino acid sequence alignments of putative dinoflagellate-specific proteins. A) putative plastid protein that was highly represented in the A. tamarense cDNA library (cluster size = 22). A. tamarense sequences 1, 2, and 3 correspond to clones GC1-aba-e-13, GC1-abh-e14, and GC1-abd-o-22, respectively, and are aligned with highly similar ESTs from the dinoflagellates L. polyedrum (CD809498) and P. lunula (BU582532). The boxed region indicated a possible plastid targeting sequence. B) Putative dinoflagellate specific protein with significant blast hits only to other dinoflagellate ESTs. The Alexandrium sequence corresponds to clone UI-D-GC1-abh-f-23-0-UI.   related, but clearly distinct sets of transcripts. An additional atpH-encoding transcript was identified by a single EST. Together, the three clusters contain 43 ESTs, 16 of which are unique. The N-terminal extensions, which encode the tripartite plastid-targeting signals, share an average 74.3% nucleotide and 68.6% amino acid identity, respectively. Similar to many other species, the dinoflagellate transit peptides appear to be under selection to maintain hydrophobicity rather than a conserved amino acid sequence. This may explain why the nucleotide conservation is greater than that of the encoded amino acids. Five hydrophobic amino acids (phenylalanine, leucine, isoleucine, methionine, and valine) are, for example, encoded by codons with a T in the second position. This combined with the high GC-content at third positions results in higher conservation at second and third positions than at first positions. In addition, the high proportion of alanine (28.6%), leucine (10.2%), and valine (11.8%) rather than phenylalanine (2.4%), isoleucine (3.6%), methionine (4.3%, excluding starting methionine), and tyrosine (0.3%) in the N-terminal extensions may reflect the underlying GC-richness, because alanine, leucine, and valine are encoded by GC-rich codons. It is unclear if these amino acids are evolutionarily selected for specifically, or if they are selected for the combination of their hydrophobic character and the GC-content of their codons. In con-GO category assignment of A. tamarense ESTs

Molecular Function
trast, the conserved core of the protein shared an average 88.4% nucleotide and 98% amino acid identity, respectively, which corresponds to the more typical pattern of third position variation resulting from selection. The 3'-UTRs of the atpH genes show substantial variation and were difficult to align. There are several groups of more closely related 3'-UTRs that may be the result of recently duplicated genes. In all, there are five alignable groups of UTRs (and one singleton) that may have originated from more closely related genes.

Histone and histone-like proteins in dinoflagellates
A significant finding of this study is the identification of two rare (2/11,171) ESTs that encode a partial histone H2A.X. The longest cDNA isolated from the library using PCR was predicted to encoded a protein of 169 amino acids that shares high sequence identity to eukaryotic histone H2A.X ( Figure 3A). This clone putatively lacked only the start codon at the N-terminus. The divergent N-terminus of A. tamarense H2A.X is somewhat longer than in other homologs but the remainder of the sequence is conserved (in particular the α-helices of the histone fold). Several functional residues from the known crystal structure are also present in A. tamarense H2A.X including the lysine at the trypsin cleavage site, the arginines in the loops that interact with the DNA α-helix, and the lysine ubiquitination site [24]. The sites of interaction with histone H2B are also present.
H2A.X proteins are closely related to the canonical H2A except for the C-terminus which contains the distinctive SQ(E/D)Φ motif (where Φ is a hydrophobic residue).
H2A.X plays an important role in the recognition and repair of double-strand DNA breaks by non-homologous end-joining. At the site of double-strand breaks, the serine of the SQ(E/D)Φ motif is rapidly phosphorylated [25]. The phosphorylation signal spreads a large distance down the chromosome around the breaks, signalling the recruitment of the DNA repair proteins Rad50, Rad51, and BRCA1 [26,27]. We also identified histone H2A and H2A.X from the haptophyte Emilania huxleyi through high-throughput EST sequencing of this alga (J. D. H. and D. B. unpublished data). Phylogenetic analysis places A. tamarense H2A.X in its predicted position (with moderate bootstrap support) as sister to the E. huxleyi homolog within a group of chromalveolates that includes haptophytes, stramenopiles, and apicomplexans ( Figure 3B). H2A.X from A. tamarense, E. huxleyi, and Toxoplasma gondii do not, however, form a monophyletic group suggesting multiple origins within chromalveolates. This is not surprising because H2A.X appears to have arisen independently many times during eukaryotic evolution [28,29].
We tested the strength of these results using the Approximately Unbiased (AU-) statistical test. A 16-taxon ML backbone tree was generated without A. tamarense H2A.X and then we made a set of 17-taxon trees by placing this sequence on every possible branch (29 in total). This anal-
rense H2A.X likely reflects the short length and high conservation of these histones.
Dinoflagellate chromosomes do not contain nucleosomes, instead the DNA is associated with HLPs [10,30,31]. The similarity between dinoflagellate HLPs and bacterial HU and HLPs has only recently been noted and these proteins have not yet been subjected to phylogenetic analysis with a broad taxon sampling [32]. In our A. tamarense EST data, HLPs were the most highly repre-     tamarense and other dinoflagellates with HLPs and HU proteins from bacteria and eukaryotes showed moderate sequence similarity (a representative alignment is shown in Figure 4A). This alignment was constructed using information from secondary structure predictions (discussed below).

V TKGALAEAVATACEMKKKDVLKALASIAEIATAEVKKSA--VFTLPGLCRIKTRKKPATKA--GKKEIFGKMCVVKAK--PARTIVKAFPVAALKKEF--------------------------------------MAPKAMKGAVMGKAALADALATGCELKRRMSPRSWRVWPRSALGELKKSA--VFTLPGICRIKTRRKPATKA--GKREVFGKVVMVKAK--PAKTVVKAFPVAALKKQF-Lingulodinium ------------------------------------MAPMKAAAKKAMTKGAIADALATQFELKKTVCGKLINSLAEIATKEVKSSG--VFTIPGVCKIKTRTKPATKA--GKREIFGKTVVVKAK--PARKVVKAFPVAALKKSI-Crypthecodinium -----------------MAPKMKAAMKAAMKAPMKAMKATAMKGGKAMTKTGLAEALANATEVSKKDCASIIESLATICATEVKKTG--KLTIPGLVMVKTRKKPATKA--GKREMFGKVVLVKAQ--PAKTVVKAYPVKALKTDF-Prorocentrum --------------------------------------MAPMMKSTVMTKSALTAALAEACEVKRGVIGTALSTLASIASTEAKKNG--KFVIPGVVMIKTRQKNATKA--GKREIFGKVVMVKAK--PAKTIVKAFPAKALKDEF-
One group of proteins (referred to here as bacterial HLPs) is more closely related to dinoflagellate HLPs and includes Bph2 from Bordetella pertussis. Bph2 has a role in virulence gene expression and shares limited (likely convergent) sequence similarity with histone H1 [33]. The dinoflagellate and bacterial HLPs also contain an N-terminal extension in comparison to HU proteins. This extension is rich in alanine, lysine, and proline, which is reminiscent of the C-terminus of histone H1. The dinoflagellate HLP N-termini are however, also enriched in methionines. Compared to the bacterial HLPs, this N-terminal region is generally shorter in the dinoflagellates, although there is variability among species in both groups ( Figure 4A). In contrast to the primary sequence, secondary structure predictions for these three classes of proteins are remarkably similar. The crystal structure of E. coli HU has been determined (PBD ID: 1MUL) and the known secondary structure was compared to the predicted secondary structures of B. pertussis Bph2 and an A. tamarense HLP ( Figure 4B). Both types of HLPs are predicted to have two α-helices that are identical in size and spacing to the N-terminal helices in E. coli HU, followed by two βstrands that are similar in size and position. We conclude from this analysis that dinoflagellate HLPs show structural similarity to HU proteins from bacteria, however, it is unclear if these proteins are functional homologs. It is also apparent that dinoflagellate HLPs are distantly related to bacterial HU proteins. The dinoflagellates have one putatively homologous functional residue corresponding to Lys3 (arrow in Figure 4A) of HU proteins, which interacts with the DNA and is involved in wrapping the DNA around the protein [34]. A proline residue (asterisk in Figure 4A), which intercalates into the DNA during HU binding, appears to be conserved among HU proteins and bacterial HLPs, but is not present in the dinoflagellate HLPs [35]. However, there are several prolines conserved among dinoflagellates in the C-terminal end of the protein. The C-terminal arms of HU are critical for the interactions that bend the DNA. Given the low level of sequence similarity and the absence of a homologous proline in this region, it is unclear if the dinoflagellate HLPs are able to interact with DNA in the same manner as HU proteins.
In our phylogenetic analyses, the proteobacterial HLPs form a well-supported monophyletic group with the dinoflagellates ( Figure 4C) suggesting an origin of the dinoflagellate gene through lateral transfer (followed by several rounds of gene duplication). It is also noteworthy that dinoflagellates are the only eukaryotes to possess a proteobacterial form II rubisco [36]. The position of the dinoflagellate HLPs is distinct from that of other eukaryotic HU proteins. These latter proteins group with the canonical HU proteins from bacteria and have likely originated through intracellular transfer from the mitochondrial or plastid endosymbiont. Statistical support for the monophyly of the dinoflagellate and proteobacterial HLPs was tested using the AU-test. In these analyses (details not shown), a sister group relationship between the HLPs was the most highly favored topology (P = 0.659) and all other positions for the dinoflagellates (except branching inside the bacterial HLP clade) had significantly lower probabilities (P < 0.05).
Dinoflagellates no longer use the nucleosome as the major DNA packaging protein complex. Chromosomal DNA strands in these taxa are smooth, in contrast to the "beads on a string" conformation in other eukaryotes [12]. The chromosome structure is also unique in that they are uniform in size and morphology, remain condensed throughout the cell cycle, and are birefringent, indicating a liquid crystal state [5,14,37]. Transcription is thought to take place in DNA loops that protrude from the condensed chromosome [38]. It appears that dinoflagellates have acquired DNA binding proteins from a proteobacterium possibly to facilitate the compaction of their immense genomes. HU and related proteins from bacteria induce sharp bends in DNA strands and some models suggest that HLPs are responsible for creating DNA bends at the periphery of the chromosomes [39,40]. Immunolocalization shows dinoflagellate HLP to be associated with the periphery of chromosomes [41].
However, the HLP concentration is very low relative to DNA content. Dinoflagellate chromosomes have a 1:10 protein:DNA ratio (in contrast to the 1:1 ratio in other eukaryotes). The HLP concentration may therefore be too low to function in DNA compaction, rather they may act as transcriptional regulators [41,42].
In summary, our discovery of H2A.X in A. tamarense shows that, whereas dinoflagellates appear to no longer use nucleosomes for DNA packaging, at least one histone has been retained and is weakly expressed. Interestingly, in a recent paper, histone H3 appears in a table of redoxregulated genes in the dinoflagellate Pyrocystis lunula [11]. Until now, only these two histones have been identified in dinoflagellates and it is unclear if all dinoflagellates possess either of these two genes, or others that have not yet been found. If other histones are present (which is likely), they may however also be expressed at a low level (as is the case for A. tamarense H2A.X). This would render difficult their identification using the EST-based approach unless comprehensive sequencing of normalized and subtracted cDNA libraries is used. In metazoans, replicationdependant canonical histone (H2A, H2B, H3 and H4) mRNAs are not polyadenylated, raising the possibility that they have been excluded from this poly-A primed cDNA library [43]. However, these histone mRNAs are polyadenylated in plants, apicomplexans, and ciliates, suggesting that if they are present, they may be in dinoflagellates as well [44][45][46]. Given the important role that H2A.X plays in DNA repair, we speculate that this gene may have been maintained specifically to perform this function. Consistent with this idea, the core region of A. tamarense H2A.X is highly conserved, indicating that it may still be able to interact with DNA in a manner similar to H2A in other species.

Conclusion
This collection of ESTs is the most extensive genomic resource for a toxic dinoflagellate species to date and provides a useful glimpse into its nuclear genome. These data will be instrumental to future research to understand the unique and complex cell biology of these organisms and for understanding the method of toxin production in these species. We have likely not yet exhausted the gene discovery potential using the EST approach (i.e., note the high discovery rate of our normalized library). In the future, we will use serial subtraction of cDNA libraries to improve/maintain the novelty rate of our cDNA library and create cDNA libraries from A. tamarense under various growth conditions and life history stages to get generate a more complete catalog of the gene content of this important organism.

Library construction
Total RNA from a culture of the toxic dinoflagellate Alexandrium tamarense (CCMP 1598) was extracted using Trizol (GibcoBRL) and mRNA purified using the Oligotex mRNA Midi Kit (Qiagen). This culture strain was produced by isolating a single cyst, a diploid resting stage that produces haploid vegetative cells by meiosis. However, it is unknown if a single or multiple vegetative cells were isolated after antibiotic treatment to make the culture axenic. If a single vegetative cell was isolated, the culture would be clonal. The culture was grown at 20°C on a 13:11 hour light:dark cycle (80 µEinsteins of light) in L1 media. Start and normalized directionally cloned (3' NotI-5'EcoR1) cDNA libraries were constructed as previously described [47]. ESTs were sequenced from the 3' end to maximize clustering accuracy using the 3' untranslated region (UTR). All ESTs were processed as previously described [48]. Identification of a total of a non-redundant "unigene" set of 6,723 unique clusters from 11,171 sequences was accomplished using using UIcluster v3.0.5 [49].

Phylogenetic analyses
Data was gathered from GenBank (including the recently released Karenia brevis EST data, Frances Van Dolah, unpublished data) using blast searches. Maximum likelihood (ML) analyses were done with PHYLIP using the JTT model of protein evolution with gamma corrected rates (JTT + Γ) with 5 random additions [50]. ML bootstrap analyses (100 replications) were done as described with either 5 (histone H2A) or 1 (HLPs) rounds of random taxon addition. Bayesian analyses were done using MrBayes V3.0b4 [51]. Four chains (1 cold, 3 heated) were run for 1 million generations, sampled every 1000 generations, using the JTT + Γ model. The first 500 trees were discarded as burn-in. Neighbor joining (NJ) bootstrap (500 replicates) analyses were done with PHYLIP using the JTT + Γ model. Minimum evolution (ME) analyses done with PHYLIP using the JTT + Γ model with global rearrangements and 10 rounds of random taxon addition (1 round was used in the bootstrap analysis).
The Approximately Unbiased test was done using CON-SEL [52]. ML trees without the groups of interest were generated as described above. A pool of trees was then generated by adding the group of interest (A. tamarense H2A.X or dinoflagellate HLPs) to every possible branch in the ML tree. For the HLP analyses, a reduced taxon set was used that included Bordetella, Ralstonia, Xylella, Pasteurella, Nostoc, Synechocystis, Agrobacterium, Rikettsia, Escherichia, Guillardia, Cyanidioschyzon, Sorghum, Toxoplasma, Xenopus, and Homo. A. tamarense 1 and C. cohnii HCC2 were added as a monophyletic group to every branch in this reduced ML tree. Secondary structure prediction was done using Jpred [53,54]. The consensus secondary structures were used in the comparison to the know structure of E. coli HU (PDB ID: 1MUL).