Oxyrrhis - an emerging genomic model
Recent interest in the genetic and genomic architecture of O. marina has informed the evolutionary history of a range of conspicuous dinoflagellate traits. For example, it is now clear that the RNA trans-splicing mechanism, seemingly ubiquitous within the dinoflagellates, also occurs within O. marina and the more distantly related Perkinsus marinus, suggesting that trans-splicing was established early in the ancestral lineage leading to the dinoflagellates. In contrast, studies of the mitochondrial genome indicate that its large but highly fragmented structure, again a feature of this taxon, is a more recent trait, as it is common to the dinoflagellates and O. marina but probably not to Perkinsus. Perhaps the most conspicuous dinoflagellate feature, the seemingly massive genome sizes harboured by some species (up to 215 Gbp), also predates O. marina (a recent estimate places the genome size at ~50 Gbp) but arose after the divergence of Oxyrrhis/dinoflagellates from Perkinsus, in which the genome is of more typical proportions (~86 Mbp). Thus it is clear that O. marina is of increasing significance in the study of alveolate evolution.
Here we further indicate that tandem gene arrangements and abundant expressed gene variants are common in O. marina. EST surveys of several dinoflagellates have highlighted the occurrence of multiple transcripts coding for the same gene product [14, 38], and detailed studies of specific genes have revealed complex gene arrangements and expressed gene variants in several species (e.g. [39, 40]). In O. marina, previous studies have shown abundant gene transcript variants for actin, HSP70, and rhodopsin (e.g. [3, 16]). We indicate the same phenomenon here, with nominally 30 identified expressed genes (and ~130 anonymous truncated transcripts) present as up to 28 variants, for which the majority of nucleotide variation was synonymous. A comparison with existing ESTs indicates even more extensive gene variant clusters in O. marina CCMP1788. In both cases, a large number of variant clusters were not identifiable by BLAST searches against GenBank databases, and no obvious functional class of genes appeared to dominate the most abundant variant clusters. For 44_PLY01 the largest gene cluster occurred for phosphoribosylaminoimidazole-succinocarboxamide synthase, a gene associated with purine metabolism, and the largest variant cluster in CCMP1788 coded for a type II rhodopsin gene. Notably, Slamovits et al. have described ~50 variants encoding rhodopsin in strain CCMP1788; here we detected far fewer variants (2-4 rhodopsin contigs). This discrepancy may simply be a result of different methodologies. Oxyrrhis marina cultures in this study were grown in the dark and, given the likely role of rhodopsin in phototaxis, it is possible that this treatment reduced rhodopsin expression. Alternatively, differences in gene variant abundance may occur between strains.
Whether structural or transcriptional differences exist at this level has yet to be examined, though global comparisons of the 44_PLY01 and CCMP1788 datasets using BLAST and CAP3 (Figures 2 and 5) both highlighted limited similarity (~15% of ESTs/contigs were common to both strains). We have previously documented extensive genetic diversity within O. marina, and strains CCMP1788 and 44_PLY01 occur within different O. marina clades (44_PLY01 and CCMP1788 occur within clades 1 and 2, respectively) based on sequence variation at 2 gene loci [1, 24]. Whilst it is beyond the scope of the current study, it is likely that comparative assessments of gene/genome complement, arrangement, and structure at a range of phylogenetic distances between basal dinoflagellates will be highly informative. In particular, such comparative strategies will be useful to assess the rate of change of, for example, gene copy number at a key evolutionary juncture within the alveolates.
In addition to the occurrence of extensive expressed gene variants, we have also shown that genes encoding transcribed variants occur as tandemly repeated arrays in O. marina, an arrangement that has been demonstrated for a number of other dinoflagellate taxa. The 5 genes examined here (Table 3) were each arrayed in tandem, separated by short intergenic regions. HSP90 occurs in several contexts, with 2 major variants of the intergenic region; notably, however, based on 3' UTR sequence, the variants detected in a genomic context did not tally with those present in the RNAseq dataset. Given that mRNA sequences for HSP90 were fragmented and incomplete at the 3' end, it is most likely that the corresponding portions of the transcripts were simply missing from the RNAseq data, though it is also possible that we have under-sampled the existing variation for this gene. In contrast, only a single intergenic sequence was recovered for alpha tubulin, and this did match the mRNA sequence. In both cases cDNAs were trans-spliced, a universal feature of dinoflagellate transcription, and the trans-splicing acceptor site corresponded to an 'AG' signal, as noted in other dinoflagellates. Of course, as a result of potential amplification biases associated with PCR detection, the actual diversity of intergenic regions is difficult to assess; nevertheless, the occurrence of tandem gene repeats separated by different intergenic spacers suggests a number of potential genomic arrangements. Different intergenic spacers potentially indicate the occurrence of multiple tandem arrays at different genomic loci. Alternatively, individual arrays may be a complex arrangement of gene copies and heterogeneous intergenic spacers. Notably, an in situ hybridisation-based study of several genes in O. marina indicated 3, 4, and 5 genomic locations for actin, alpha tubulin, and HSP90, respectively. The precise structure and extent of these tandem gene arrays remain to be investigated in O. marina; regardless, it is now increasingly clear that gene duplication is extensive in dinoflagellates more generally, and results in complex gene arrangements (e.g. [13, 39]). Understanding the mechanisms promoting such expansions is an important focus for dinoflagellate genome biologists. A systematic survey of the arrangement of such duplicated genes will be informative and, given the basal position of Oxyrrhis, will almost certainly prove valuable for establishing the likely origin of extensive duplication in the dinoflagellate lineage.
The gene complement of O. marina
Analysis of the existing CCMP1788 EST dataset identified a range of O. marina genes indicative of significant evolutionary processes. Oxyrrhis marina possesses genes such as proteorhodopsins that appear to have been laterally transferred from a bacterial origin, and a number of plastid genes, including ketol-acid reductoisomerase, carbonic anhydrase, and cysteine synthase, which suggest an evolutionary ancestry that included a chloroplast-bearing cell. In this study, we highlight the occurrence of a broad range of genes associated with amino acid synthetic and metabolic pathways, including genes which indicate the ability to synthesise 'essential' amino acids, a capacity not typical of heterotrophic protists. Molecular evidence for extensive biosynthetic capacities certainly supports previous studies on the nutritional biochemistry of O. marina. A series of comprehensive studies of nutritional physiology by MR Droop and co-workers (e.g. [25, 44]) highlighted that, in addition to phagotrophy, O. marina displayed a "plant-like" biochemistry, including the ability to synthesise the full complement of amino acids from ammonium or other simple nitrogen sources. While amino acid biosynthesis capability in heterotrophic protists is exceptionally diverse, an absolute requirement for several amino acids is typical. A broad range of transcripts identified in this study were associated with amino acid metabolism and biosynthesis; based on the KEGG databases, 18 of the 22 amino acid biosynthesis pathways were represented by 100 contigs from the 454 assembly. The ability to undertake population growth on a fully synthetic medium with relatively simple absolute requirements (acetic acid or ethanol; valine, alanine; biotin; thiamine; vitamin B12; ubiquinone; and a sterol) and an exceptionally broad phagotrophic capacity (35-40 different prey items are documented as supporting O. marina population growth in vitro) make O. marina exceptional. One mechanism by which O. marina may have gained its biosynthetic capacity is via an ancestral plastid or ancestral cyanobacterial endosymbiont. The occurrence of plastid-targeting signal peptides and genes that are almost certainly plastid or cyanobacterial in origin (e.g. those coding for 1-deoxy-D-xylulose-5-phosphate reductoisomerase, haem, carbonic anhydrase, ketol-acid reductoisomerase, and dihydrodipicolinate reductase; this study and previous work) is certainly strong support for such a mechanism.
More generally, based on GO and BLAST annotations, a broad range of gene families and metabolic processes are nominally represented in the O. marina RNAseq library presented here. However, estimating transcriptomic diversity, the comprehensiveness of the sequencing, and thus the likely gene complement of O. marina is difficult in the absence of a reference genome from this or a closely related species. Estimates of gene content based on genome size are possible; recent work by Hou and Lin shows a strong non-linear correlation between genome size and protein-coding gene number across a broad range of eukaryotes. Hou and Lin estimate the total gene content of the largest dinoflagellate genomes to be on the order of 80,000-90,000 genes, comprising ~1% of the total genome. An estimated DNA content for O. marina of ~55.8 pg per cell places its genome within the dinoflagellate range (~50 Gbp) and suggests some ~70,000 genes (assuming an average eukaryotic gene size of 1.3 kbp). Gene-content predictions of this magnitude are exceptionally high in comparison to other eukaryotes; however, as noted above, many genes in dinoflagellates occur in high copy numbers (up to 5,000 gene copies in some cases); thus, it is possible that much of the 'gene space' in dinoflagellates is occupied by multi-copy genes and that the total proteomic diversity is closer to that displayed by eukaryotes more generally.
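The arithmetic behind these figures can be sketched as follows. This is an illustrative calculation only: the conversion factor of roughly 0.978 × 10^9 bp per picogram of DNA is a commonly used value that is not stated in the text, the function names are hypothetical, and the gene number itself is taken from the text rather than recomputed from the Hou and Lin regression.

```python
# Assumption (not given in the text): 1 pg of DNA corresponds to ~0.978e9 bp.
BP_PER_PG = 0.978e9


def picograms_to_gbp(dna_pg: float) -> float:
    """Convert a per-cell DNA content in picograms to gigabase pairs."""
    return dna_pg * BP_PER_PG / 1e9


def gene_space_mbp(n_genes: int, mean_gene_bp: int) -> float:
    """Total sequence (in Mbp) occupied by n_genes of a given mean length."""
    return n_genes * mean_gene_bp / 1e6


# Values quoted in the text: ~55.8 pg per cell, ~70,000 genes, 1.3 kbp per gene.
genome_gbp = picograms_to_gbp(55.8)
coding_mbp = gene_space_mbp(70_000, 1_300)

print(f"Genome size: ~{genome_gbp:.1f} Gbp")
print(f"Implied gene space: ~{coding_mbp:.0f} Mbp")
```

Under these assumptions the DNA content converts to a little under 55 Gbp, consistent with the rounded ~50 Gbp quoted above, and the predicted genes would together span on the order of 90 Mbp.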
The representation of conserved gene classes also provides an approximate indication of transcriptome coverage. In this study we detected 61 ribosomal protein-coding transcripts of the 75-80 that are typical of most eukaryotes; while contigs did not represent full transcripts and such a comparison can only give a crude estimate, these figures suggest a representation in the region of 75% of the transcriptome. It should be noted, however, that comparison of the RNAseq and EST datasets for O. marina potentially conflicts with this estimate. Assuming strains are relatively similar (sequence divergence based on mitochondrial cytochrome oxidase I is ~2%), the degree of overlap in transcriptome sequence datasets between strains was relatively small (~15%), potentially indicating a high degree of under-sequencing in both cases. Of course, strains might differ more than suspected, or biases in sequencing (e.g. truncation or fragmentation of transcripts) might reduce overlap between the datasets. In either case, it seems clear that comprehensive sampling of the O. marina transcriptome is likely to require a further substantial sequencing effort.
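The ribosomal protein comparison amounts to a simple proportion, sketched below. The function name is hypothetical, and the calculation assumes that detecting a contig for a gene counts as representation of that gene, which, as noted above, is only a crude proxy.

```python
def coverage_bounds(detected: int, expected_min: int, expected_max: int):
    """Return (lower, upper) bounds on the fraction of a conserved gene
    class that was detected, given a range for the expected class size."""
    if not 0 < expected_min <= expected_max:
        raise ValueError("expected_min must be positive and <= expected_max")
    # Dividing by the larger expected count gives the more conservative bound.
    return detected / expected_max, detected / expected_min


# Figures from the text: 61 transcripts detected, 75-80 expected.
lower, upper = coverage_bounds(61, 75, 80)
print(f"Estimated coverage: {lower:.0%} to {upper:.0%}")
```

With the figures quoted in the text this gives roughly 76% to 81%, in line with the "region of 75%" estimate.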
Transcriptomic novelty and the problem of identification by identity
We have identified a number of interesting features of the O. marina transcriptome, adding to previous descriptions of an unusual gene content in this organism. However, the majority of the sequences generated in this study were not identified by identity searches. This limited identification success, whilst partially accounted for by a 3' bias in this dataset (and thus a high representation of UTR sequence), is nevertheless diagnostic of a broader difficulty for genomic studies of dinoflagellates. While the dinoflagellates are increasingly regarded as important targets for the study of genome evolution, large-scale sequence resources have only relatively recently begun to accumulate [21, 22, 48, 49]. This poor sequence representation has an impact on the current use of such databases for sequence identification. For example, within the NCBI databases, EST datasets (totalling 155,474 sequences) exist for only 21 dinoflagellate species, and the majority of ESTs (122,235) are derived from just 5 species. Similarly, in a genomic context, only a handful of plastid genomes and genome sequence surveys exist for dinoflagellates, and the majority of nucleotide sequences are environmental rDNAs. Consequently, identification of new sequences via database searches presents a significant challenge for dinoflagellate taxa.
In context, the relatively low annotation rate achieved in this study is, therefore, not surprising. EST projects on metazoa, with relatively close ancestry to many genomic model organisms, can yield high proportions of ESTs (e.g. > 95%) that are identified by reference to existing sequence databases. By contrast, only 1,890 (16%) contigs were identified for O. marina, and less than 2% of transcripts matched to a single relatively closely related species, such as Perkinsus marinus. Comparably low rates of annotation have been reported for other dinoflagellate EST projects, with only 9% of the (~1,400) ESTs isolated from Alexandrium ostenfeldii homologous to known proteins and ~20% (of 6,723) of ESTs from Alexandrium tamarense identified. While ESTs from a number of other eukaryotic protist taxa, for example diatoms, do not appear to be so different from the protein and transcript data available in public databases, a typical annotation rate of ~50% of transcripts again highlights gaps in genomic information [48, 51]. Most notably, a recent EST project on Perkinsus marinus generated ~31,000 EST sequences, clustered into ~8,000 unique sequences, of which 55% were identified; possibly the higher annotation rate in this case is a result of the (relatively) closer phylogenetic affinity between Perkinsus and the Apicomplexa (a group that is well characterised by virtue of containing numerous parasites of humans and livestock). It is notable that only 145 O. marina transcripts produced significant identity to P. marinus sequences, and only 161 matches occurred between O. marina contigs and those from other dinoflagellate taxa. Whether this is a genuine result of a high degree of novelty of the O. marina genome or simply a result of limited genomic data can only be confirmed by further genome-scale sequencing, although inferences from phylogenetic analysis do suggest that Oxyrrhis represents a highly divergent and novel lineage.
Identification of salinity tolerance mechanisms by differential gene expression
The application of next-generation sequencing technology to directly characterise transcript abundance is an increasingly used strategy for gene expression profiling [52–54]. The most precise strategies quantify either 5' or 3' (or both) cDNA fragments and thus overcome potential biases associated with sequence read length and incomplete reverse transcription; but for species that lack genome references (for fragment mapping) this approach precludes the generation of full or near full-length coding sequences, which are typically a valuable output of transcriptome sequencing projects in the case of poorly characterised organisms. Our aim here was to determine whether a de novo transcript assembly can be used concurrently with an experiment to obtain an informative gene expression profile.
Comparisons of transcript abundance profiles for cells grown under 2 salinity treatments nominally identified differing gene expression patterns and, in combination with growth rate estimates, seemed to provide evidence for specific physiological responses and a tangible molecular mechanism. A higher maximum growth rate at 30 PSU was concurrent with a relatively strong induction of ~20 transcripts at this salinity. Likewise, a reduced growth rate and modest induction of a different set of 8 genes occurred at 50 PSU. However, agreement between transcript abundance and qPCR gene expression estimates was relatively poor, both in terms of direction and magnitude, and in only 6 out of 14 assays were expression estimates similar. In a broader context, gene expression patterns derived via different methodologies (e.g. qPCR vs. microarray platforms) often do not strongly correlate, although there appears to be more concordance between qPCR and next-generation sequencing platforms than with microarrays (cf. [53, 54]), which may relate to overall transcript abundance. It is clear from a range of studies that some features of next-generation sequencing protocols not specifically designed to quantify transcript abundance potentially generate significant biases in representation. In the study presented here, for example, the representation of a number of gene transcripts in the O. marina RNAseq dataset by numerous non-overlapping fragments (with differing read abundances) is clearly problematic and is likely a result of incomplete cDNA synthesis and/or a proportion of read assembly errors. Likewise, the occurrence of extensive expressed gene variants, seemingly common in most dinoflagellates, has the potential to result in extensive discrepancy between sequence-based and qPCR-based approaches, particularly if qPCR assays co-amplify extensive gene variant families.
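One way to make the comparison between sequence-based and qPCR estimates explicit is to score each assay for agreement in both direction and magnitude. The sketch below is not the procedure used in this study: the function names, the 1.0 log2-unit tolerance, and the example fold changes are all hypothetical and serve only to illustrate the idea.

```python
def is_concordant(seq_lfc: float, qpcr_lfc: float, max_diff: float = 1.0) -> bool:
    """True if two log2 fold-change estimates agree in direction and differ
    by no more than max_diff log2 units (an arbitrary, assumed tolerance)."""
    same_direction = (seq_lfc > 0) == (qpcr_lfc > 0)
    return same_direction and abs(seq_lfc - qpcr_lfc) <= max_diff


def concordance_rate(pairs, max_diff: float = 1.0) -> float:
    """Fraction of (sequencing, qPCR) estimate pairs that are concordant."""
    agreeing = sum(is_concordant(s, q, max_diff) for s, q in pairs)
    return agreeing / len(pairs)


# Invented log2 fold changes for four imaginary assays (not data from this study).
example_pairs = [(2.1, 1.8), (1.5, -0.4), (-0.9, -1.2), (3.0, 0.5)]
example_rate = concordance_rate(example_pairs)
print(f"Concordant in the invented example: {example_rate:.0%}")

# The proportion reported in the text: 6 similar estimates out of 14 assays.
reported_rate = 6 / 14
print(f"Reported agreement: {reported_rate:.0%}")
```

The reported 6 of 14 assays corresponds to roughly 43% agreement; the tolerance chosen for "similar" will clearly affect any such figure.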
Accepting the above issues, those genes whose expression profiles were confirmed by qPCR did tentatively suggest a potential underlying salinity response. In 3 cases, qPCR and 454 expression estimates identified genes as up-regulated at 50 PSU; most notable was the up-regulation of phosphoethanolamine N-methyltransferase, an enzyme that is a component of a common pathway in plants that generates the osmoprotectant glycine betaine. Thus, increased salinity appears to elicit a decrease in specific growth rate and, tentatively, a concurrent osmoregulatory response. Clearly such an inference is speculative, and confirmation of the occurrence of this metabolic pathway in O. marina is required.