Characterisation of full-length cDNA sequences provides insights into the Eimeria tenellatranscriptome

Background Eimeria tenella is an apicomplexan parasite that causes coccidiosis in the domestic fowl. Infection with this parasite is diagnosed frequently in intensively reared poultry and its control is usually accorded a high priority, especially in chickens raised for meat. Prophylactic chemotherapy has been the primary method used for the control of coccidiosis. However, drug efficacy can be compromised by drug-resistant parasites and the lack of new drugs highlights demands for alternative control strategies including vaccination. In the long term, sustainable control of coccidiosis will most likely be achieved through integrated drug and vaccination programmes. Characterisation of the E. tenella transcriptome may provide a better understanding of the biology of the parasite and aid in the development of a more effective control for coccidiosis. Results More than 15,000 partial sequences were generated from the 5' and 3' ends of clones randomly selected from an E. tenella second generation merozoite full-length cDNA library. Clustering of these sequences produced 1,529 unique transcripts (UTs). Based on the transcript assembly and subsequently primer walking, 433 full-length cDNA sequences were successfully generated. These sequences varied in length, ranging from 441 bp to 3,083 bp, with an average size of 1,647 bp. Simple sequence repeat (SSR) analysis identified CAG as the most abundant trinucleotide motif, while codon usage analysis revealed that the ten most infrequently used codons in E. tenella are UAU, UGU, GUA, CAU, AUA, CGA, UUA, CUA, CGU and AGU. Subsequent analysis of the E. tenella complete coding sequences identified 25 putative secretory and 60 putative surface proteins, all of which are now rational candidates for development as recombinant vaccines or drug targets in the effort to control avian coccidiosis. Conclusions This paper describes the generation and characterisation of full-length cDNA sequences from E. tenella second generation merozoites and provides new insights into the E. tenella transcriptome. The data generated will be useful for the development and validation of diagnostic and control strategies for coccidiosis and will be of value in annotation of the E. tenella genome sequence.


Background
Coccidiosis is an economically important intestinal disease of poultry caused by parasitic Eimeria species. The annual cost of coccidiosis to the poultry industry worldwide has been estimated to exceed £2 billion [1]. Control of this disease in intensively reared poultry is accomplished principally by prophylactic chemotherapy with specific anticoccidial drugs, although drug-resistance is a serious problem that has to be constantly managed. No new drugs have been introduced in recent years and alternative methods of control are now required. Vaccination using live vaccines is a viable option, though it is hampered by the complexity and production constraints of live parasites. Thus, new approaches for control continue to be sought.
Eimeria tenella is widely considered to be the most economically relevant and well known of the seven Eimeria species that cause coccidiosis in chickens [2].
The second generation merozoite of Eimeria is an interesting target for transcriptomic studies as it is the progeny derived from the most pathogenic endogenous stage of the E. tenella life cycle [3] and may contribute to the stimulation of the protective immune response in the host for at least some Eimeria species [4]. In addition, it is among the most readily isolated stages of the life cycle [5]. Detailed study of the merozoite stage will support the identification of proteins important to key biological processes in the parasite including host invasion, replication, pathogenicity and the stimulation of host immunity.
The availability of segments of sequences from randomly selected cDNA clones, known as expressed sequence tags (ESTs), has provided valuable resources for the identification and study of genes in E. tenella [6][7][8]. Sequencing of full-length cDNAs provides additional advantages including data derived from a single clone rather than an assembly of multiple ESTs, which can generate ambiguous contigs, and complete transcripts, which include open reading frames (ORFs) and untranslated regions (UTRs). Thus, a large collection of full-length cDNA sequences provides a set of protein coding sequences that facilitate the prediction of gene identity and function by comparison with other known protein coding genes [9].
In this study, partial sequences were generated from the 5' and 3' ends of randomly selected clones of an E. tenella second generation merozoite full-length cDNA library. These partial sequences were pre-processed and subsequent sequence clustering and primer walking generated full-length cDNA sequences. Characterisation of these full-length cDNA sequences included determination and analysis of ORFs and UTRs, Kozak sequence consensus, simple sequence repeats (SSRs) and codon usage. Analysis of the full-length cDNA sequences generated also identified candidate secretory and membrane proteins that may prove relevant in developing disease control strategies against avian coccidiosis.

Results and discussion
Generation of full-length cDNA sequences A total of 9,024 clones were randomly selected for plasmid extraction and subsequent single-pass sequencing from the 5' and 3' ends. After eliminating low quality and vector contaminated sequences, 8,433 and 6,981 good quality sequences were obtained from the 5' and 3' ends respectively [dbEST: JK017416-JK032828, JK032875]. These partial sequences were clustered and resulted in the identification of 1,529 unique transcripts (UTs). Using the clustered sequences 81 full-length cDNA sequences were generated by aligning overlapping 5' and 3' end partial sequences. In addition, clones representing 586 consensus sequences with both 5' and 3' end partial sequences were randomly selected and subjected to complete sequencing by single-pass primer walking, generating a further 363 full-length cDNA sequences. Primary sequence analysis revealed the absence of in-frame start or stop codons in one and 10 clones respectively. Such sequences might represent non-coding RNAs, although they could also have been derived from contaminants or cloning artefacts and have been excluded from our subsequent analyses. Thus, a total of 433 full-length cDNA sequences were generated and analysed in this study [GenBank: JN987230-JN987662].

Functional annotation
BLASTX similarity search of the 1,529 UTs against the GenBank non-redundant database revealed that 54.2% (829/1,529) of the transcripts had significant matches (E-value < 1e-6) to publicly available gene sequences, with most of these (71.4%) matches to gene sequences from apicomplexan parasites [Additional file 1]. A total of 2,053 gene ontology (GO) terms, distributed within the categories Biological Process, Molecular Function and Cellular Component, were assigned to the 1,529 UTs (831, 603 and 619 respectively) [ Figure 1]. The most highly represented subcategories within Biological Process were cellular and metabolic processes, accounting for 32.4% (269/831) and 30.7% (255/831) of the transcripts respectively, in line with previous proteomic characterisation of the second generation merozoite [10]. Binding (42.5%; 256/603) and catalytic activity (40.3%; 243/603) were the most highly represented subcategories within Molecular Function. Combined, these results are consistent with the biological role of the second generation merozoite, a life cycle stage characterised by metabolically active processes including motility within the caeca, host cell attachment and cellular invasion. The Cellular Component category was dominated by the cell (42.5%; 263/619) and organelle (30.2%; 187/619) subcategories, consistent with the relatively high abundance of cell cycle-associated proteins reported in the second generation merozoite proteome compared to the sporozoite [10]. GO annotation of the E. tenella full-length cDNA sequences revealed similar functional patterns [ Figure 1; Additional file 2].

cDNA, ORF and UTR length distribution
Analysis of the 433 E. tenella full-length cDNA sequences revealed an average size of 1,647 bp [ Figure  2]. Most of the sequences were in the length range of 1,401 bp to 2,100 bp, while 24.2% (105/433) of the sequences were within 1,000 bp to 1,400 bp and 16.6% (72/433) were within 2,101 bp to 3,000 bp. The analysis also showed that 27 sequences generated were less than 1,000 bp in length and two sequences were more than 3,000 bp in length. Our analysis of the previously reported 732 full-length cDNA sequences from Toxoplasma gondii [11] and 644 full-length cDNA sequences from Cryptosporidium parvum [12] revealed average sizes of 1,539 bp and 1,399 bp respectively. Comparison of the three data sets revealed longer average full-length cDNA sequences within the E. tenella transcriptome. While the difference was not statistically significant, variation in transcriptome-wide SSR prevalence between the three data sets is clearly a contributory factor (described in detail below).
The sequences were also analysed to predict ORF and UTR components following start and stop codon identification [Additional file 3]. Length distribution analysis of the 433 predicted ORFs showed that the majority were between 501 to 900 bp in length, with the average size being 867 bp [ Figure 2]. Approximately 18.2% (79/ 433) of the ORFs were less than 500 bp, while only 1.4% (6/433) of the ORFs were more than 2,000 bp. The length distribution of the 5'UTRs showed that few exceeded 500 bp (20.3%; 88/433), with the average size of the 5'UTR being 342 bp. The length distribution of the 3'UTRs showed that 9.9% (43/433) were more than 800 bp long, with the average size of the 3'UTR being 438 bp.
Although the 5'UTR does not contribute directly to the encoded protein, the characterisation of 5'UTR features is important as this region is believed to be involved in the control of translation and transcription processes that subsequently reflect gene expression [13,14]. Thus, data generated on these regions may reveal control elements and regulatory mechanisms of gene expression patterns in the parasite. In a previous study of apicomplexan full-length cDNA sequences, Wakaguri et al [11] reported that the average size of 5'UTRs was consistent amongst Plasmodium species, namely P. falciparum (303 bp), P. vivax (304 bp), P. yoelii (345 bp) and P. berghei (299 bp), but varied between genera with C. parvum and T. gondii presenting average 5'UTR lengths of 137 bp and 288 bp respectively. The average 5'UTR length was shortest for C. parvum, which may reflect the fact that both the genome size and the average gene size are the smallest in this species. This comparison has revealed longer 5'UTRs in the E. tenella genome than reported for most other apicomplexan parasites. While the significance of this finding is not yet clear the detection of numerous SSRs may once again be a contributory factor.
Genomic cDNA transcript mapping-E. tenella chromosome 1 Eimeria tenella is the first of the Eimeria species parasites to have been subjected to genome sequencing, although the draft assembly remains fragmented [15]. In order to demonstrate the utility of the data generated here for gene prediction the 1,529 UTs and 433 fulllength cDNA sequences were mapped onto the first sequenced E. tenella chromosome (chromosome 1), representing~1.8% of the genome [16]. Based on an overlap of at least 70% of the original transcript length, a total of 13 UTs were successfully mapped-seven to genes in the feature-poor 'P'-regions and the remaining six to genes in the feature-rich 'R'-regions of the chromosome [Additional file 4; Additional file 5]. Further analysis revealed that mapping of the UT sequences identified and resolved several inconsistencies with the previously predicted coding regions, indicating the usefulness of the transcript sequences in improving gene structure prediction on the E. tenella genome sequence. Two full-length cDNA transcripts were mapped to E. tenella chromosome 1 [Additional file 6] where the alignment of ln23_Etm023C06 showed consistency with the previously characterised 15-exon structure of the glucose-6-phosphate isomerase gene [16,17].

SSR motif analysis
SSRs can be found in the genome of both prokaryotic and eukaryotic organisms [18,19]. These repeats represent a rich source of hypervariable markers due to the constant allelic changes of array length caused by their high mutation rate [20,21]. As a result, they have been widely used in the fields of linkage mapping [22,23], population genetics [24] and phylogenetic or comparative genomic research [25,26]. In addition, SSRs are believed to be important in genome evolution, stimulating the development of genetic variability [27] and influencing gene expression [28,29].
A notable feature of E. tenella chromosome 1 was the abundance of SSRs, not only in the introns and intergenic regions, but also in the predicted coding regions [16]. In order to further characterise the type and location of these repeated motifs in E. tenella genes, SSR motif analysis was carried out on the full-length cDNA sequences generated. Results showed that the SSRs present were composed of various types of mono-, di-, tri-, tetra-, hexa-, hepta-, nona-and decanucleotides [ Table  1; Additional file 7]. The location of the SSRs were subsequently categorised to three different locations, i.e. 5'UTR, ORF and 3'UTR. Based on the distribution of the SSRs trinucleotide motifs were found to be the most abundant (88.4%; 455/515), with mono-and tetranucleotide motifs also found to be common. The trinucleotide CAG was the most abundant motif and constituted 71.4% (325/455) of the entire trinucleotide motifs identified, while the most abundant tetranucleotide motif was AGCT, which comprised of 71.4% (15/21) of all tetranucleotide motifs. A total of eight hexanucleotide, two heptanucleotide, two nonanucleotide and one decanucleotide motifs were also identified in the E. tenella fulllength cDNA sequences. The abundance of CAG trinucleotide motifs within the E. tenella full-length cDNA sequences was consistent with the published findings from E. tenella chromosome 1 [16]. Comparison with publicly available full-length cDNA sequences from C. parvum [12] and T. gondii [11] revealed the dominance of mononucleotide repeats for the former, but dinucleotide repeats for the latter [Additional file 8]. Surprisingly, T. gondii full-length cDNA sequences were found to contain only mono-, di-and trinucleotide repeats, while penta-, hepta-, octa-, nona-and decanucleotide motifs were also found to be absent in C. parvum. As expected, the highest SSR content was observed in E. tenella.

Codon usage analysis
Codon usage often varies between organisms and may reflect the cellular location of gene products [30] and aid in coding region determination [31]. Codon usage in both eukaryotes and prokaryotes is known to be affected by directional mutation of nucleotides present in the genome [32] and may be influenced by the composition of a genome's transfer RNA (tRNA) portfolio. Thus, genomic evolution frequently demonstrates genome-specific over-or under-representation of some dinucleotides, and dinucleotide frequency is believed to influence codon usage [33,34]. Dinucleotides that are under-represented in coding regions thus appear as codons that are present at low frequency. In this study, the ORFs identified were subjected to codon usage analysis using CodonW and a codon usage table for full-length E. tenella ORFs was subsequently generated consisting of 125,231 codons [ Table 2]. A previous study of codon usage by parasites including Babesia bovis, Theileria parva, T. gondii and E. tenella showed that codons CGA, CGG and UGU are infrequently used by all of these organisms [35]. For E. tenella, it was revealed that based on the frequency of usage of less than 10 per 1000 codons, 17 codons (GUA, AGA, AGU, AUA, ACG, UGU, UAU, UUA, UUU, UCG, UCA, CGG, CGA, CGU, CAU, CAC and CUA) are infrequently used. Six of these codons contain either the UA or AU dinucleotide, while another four contain the CG dinucleotide. In this study, analysis of the E. tenella ORF sequences identified ten codons (UAU, UGU, GUA, CAU, AUA, CGA, UUA, CUA, CGU and AGU) that are infrequently used based on the same criterion. Comparison between these two studies revealed that all ten of the codons identified in this study are the same as those identified by Ellis et al [35]. Furthermore, six of them also contained either the UA or AU dinucleotide while two of them contained the CG dinucleotide, supporting the finding of the previous study that codons with low usage frequency contain under-represented UA, AU or CG dinucleotides. Five codons were over-represented  [36], a total of 26 T. gondii gene sequences from position -20 relative to the ATG start codon up to position +4 were compared. A consensus sequence was apparent with A dominating at positions -3, -2 and -1, plus C at position -4 and G at position +4. Thus, the Kozak sequence CAAAATGG was assigned for T. gondii. Comparison of the Kozak sequences for E. tenella and T. gondii shows a high similarity where A at positions -3, -2 and -1, and G at position +4 for both organisms are similar. However, the Kozak sequence for both of these organisms appears to differ from that of the higher eukaryotes [37] [ Table 3].

Secretory and membrane protein prediction
Parasite secreted proteins commonly interact with host cells at the molecular level and are exposed to the host immune system. Parasite growth and invasion processes may be prevented once an essential secretory protein is inhibited. Therefore, many secretory proteins can be considered to be vaccine candidates or potential drug targets [38][39][40][41]. Prediction analysis using SignalP suggested that 19.6% (85/433) of the peptide sequences contain a signal peptide. Out of these 85 peptide sequences, 60 were predicted to contain one or more transmembrane domains and/or a GPI-anchor, leaving 25 as predicted unbound secretory proteins. Similarity searches based upon homology showed that a large proportion of these predicted secretory proteins could not be assigned a putative function as 24.0% (6/25) had matches with hypothetical proteins or proteins with unknown function, while 56.0% (14/25) had no significant similarity to any publicly available protein sequence [Additional file 10]. Intriguingly, although most of the putative secretory proteins identified were apparent homologues of apicomplexan genes no recognised apical organellar proteins were found. Many apicomplexan surface proteins have been shown to play an important role in the pathogenicity of these parasites and a number of them are potential vaccine candidates or drug targets. Proteins that are attached via a GPI-anchor to the surface of protozoan parasites can induce a variety of host immunological responses [42,43]. In this study, membrane proteins were predicted by identifying the presence of signal peptides, transmembrane domains and GPI-anchors. The prediction of transmembrane helices carried out using TMHMM revealed a total of 92 peptide sequences likely to contain at least a single transmembrane domain. GPI-anchor prediction analysis carried out using GPI-SOM, which detects both the N-terminal signal peptide and C-terminal GPI-anchor signal, suggested a total of 26 peptides with a GPI-anchor. Protein sequences that contain a signal peptide and a transmembrane domain or a GPIanchor were predicted to be membrane proteins. Based on these criteria, 60 membrane proteins were predicted in this study. Database similarity searches showed that putative functions could not be assigned to most of the predicted membrane proteins as 5.0% (3/60) were most similar to hypothetical proteins, while 48.3% (29/60) had no significant similarity with sequences in the GenBank database [Additional file 11]. In total 31.7% (19/60) of the predicted membrane proteins had matches with E. tenella surface antigens (EtSAGs). Two proteins had a perfect match with members from the previously described A family (i.e. EtSAG4 and EtSAG6) [44]. Interestingly, seven other predicted surface proteins showed between 45.8% and 95.5% similarity to the entire coding region length of the EtSAGs. Using multiple sequence alignment these sequences can be divided into two groups, representing the A and B families [Additional file 12]. The alignments show the presence of the six conserved cysteine residues in both families. Family A revealed a mosaic pattern with conserved and variable regions distributed throughout the alignment while family B exhibited a more biased pattern with variation predominantly in the N-terminal half of the alignment, consistent with the analysis described by Tabares et al. [44]. This analysis strongly suggests that the surface antigens discovered in this study represent new members of the EtSAG families. Both of the previously annotated EtSAGs identified in this study had been reported to be expressed in second generation merozoites [44]. Using GO many of the other putative membrane proteins were classified as involved in cellular and metabolic processes; for example identification of a putative longevity-assurance (LAG1) domain-containing protein.
As described elsewhere such molecules can present opportunities to disrupt parasite infection and thus have the potential to become good targets for novel intervention strategies [45].

Conclusions
In this study, we generated and analysed 433 full-length cDNA sequences with complete coding regions derived from the E. tenella second generation merozoite transcriptome. These sequences provide access to a relatively large resource of nucleotide and amino acids sequences for E. tenella that will support a better understanding of the transcriptome of this economically relevant parasite. Moreover, in combination with other genomic resources including whole genome sequences and genome maps [46], these full-length cDNA sequences will offer new insights into the structure, composition and function of the E. tenella genome. We have also identified panels of 25 and 60 predicted secretory and membrane proteins, with potential for development as novel diagnostic and/or control strategies for E. tenella via molecular techniques.

Parasite passage and purification
The reference E. tenella Houghton strain was used throughout this study [2]. The parasite was routinely propagated as described elsewhere [5] using specific pathogen free Light Sussex chickens produced and maintained at the Institute for Animal Health. Second generation merozoites were purified following the method of Prof. N. Smith as described elsewhere using several serial five minute incubation steps, each in fresh incubation medium [5]. Only incubation medium washes lacking microscopically detectable red blood cells were processed for RNA extraction to limit host cell contamination.
Full-length cDNA library construction RNA was extracted from E. tenella second generation merozoites using the TRIzol reagent as described by the manufacturer (Invitrogen, USA) and used in the construction of a full-length cDNA library by the oligo-capping method [47]. In brief, RNAs were sequentially treated with bacterial alkaline phosphatase (BAP) and tobacco acid pyprophosphatase (TAP). The BAP-TAP treated RNAs were then ligated with 5' oligo-cap linker using RNA ligase. First strand cDNAs were synthesised with the oligo-capped mRNA as a template, followed by PCR using the oligo-cap linker sequence and oligo-dTadapter as primers. The full-length cDNAs produced were then cloned into the pME18S-FL3 plasmid vector and subsequently transformed into Escherichia coli Elec-troMAX DH10B cells (Invitrogen, USA).

Plasmid extraction and cDNA sequencing
Colonies were picked randomly and inoculated into individual wells of 96-deep well plates containing LB media, and subsequently grown overnight. Plasmid DNAs were extracted using the Montage™ Plasmid MiniPrep 96 Kit (Milipore, USA) according to the manufacturer's instructions. The cDNA inserts were sequenced once from the 5' and 3' ends using the forward (5' GGA TGT TGC CTT TAC TTC TA 3') and reverse (5' TGT GGG AGG TTT TTT CTC TA 3') primers respectively, and the Big Dye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystem Inc., USA) on an ABI PRISM 3730×l DNA Analyzer (Applied Biosystem Inc., USA).

Generation of full-length cDNA sequences
The generated 5' and 3' end sequences were pre-processed using a Phred [48,49] cut-off quality value of 20. The sequences were subsequently screened against the GenBank non-redundant nucleotide database, and specifically against chicken genome sequences. No sequences with more than 90% similarity to a known chicken genome sequence were identified. Clustering was then carried out using StackPACK version 2.2 [50,51]. Consensus sequences with overlapping 5' and 3' end sequences were identified as representing full-length cDNA sequences, while those containing both the 5' and 3' end sequences that did not overlap were selected for single-pass primer walking to generate full-length cDNA sequences. Internal primers for primer walking were designed using Primer3 [52]. The sequence reads generated were manually assembled to produce a consensus sequence with a coverage of at least one strand.

Functional annotation and mapping of transcript sequences
The consensus and full-length cDNA sequences were compared against the GenBank non-redundant database using BLASTX [53], and the assignment of GO terms was carried out using the BLAST2GO pipeline [54].
Mapping of UTs and the full-length cDNA sequences to E. tenella chromosome 1 [16] was carried out separately using ssahaEST [55] with the following parameters: kmer = 10, seeds = 3, skip = 10, cutp = 80, score = 40, depth = 50, memory = 40, array = 0, edge = 200, identity = 95. Each transcript aligned to the chromosome 1 sequence was required to include at least 70% of the original transcript sequence and mapped in a single contiguous sequence without non-intron/exon gaps. Singleexon alignments were required to include at least 50 bp, while in multi-exon alignments, each aligned exon was required to be longer than 10 bp, with introns between 5 bp to 5000 bp. The transcript mapping results were inspected manually using the Artemis genome browser [56].

Characterisation of ORFs and UTRs
The coding region in each full-length cDNA sequence was individually predicted using ORF Finder [57]. Whenever possible, BLAST matches were used to confirm the reading frame, and in-frame start and stop codon positions. The determined ORFs and UTRs were analysed with MISA [58] to identify and localise SSRs. The coding regions were also submitted to CodonW [59] to generate a codon usage table. Kozak sequence consensus analysis was carried out by generating sequence logos using WebLogo [60].

Secretory and membrane protein prediction
Secretory and membrane proteins were predicted using SignalP 4.0 [61] and TMHMM 2.0 [62]. GPI-anchored proteins were predicted using GPI-SOM [63], which predicts both the N-terminal signal peptide and C-terminal GPI-anchor signal. Protein localisation analysis using WoLF PSORT [64] and BLAST matches were used to support each prediction. The bioinformatic tools used in this study are summarised in Additional file 13.
sequences, and together with X-WL, Y-LT and L-SL analysed the data. DPB, YS, JW, CS and K-LW participated in data collection monitoring and data interpretation. NA and X-WL drafted the manuscript. DPB, FMT and K-LW critically revised the manuscript. K-LW supervised and coordinated the study. All authors read and approved the final manuscript.