Insights into the genome structure and copy-number variation of Eimeria tenella
© Lim et al.; licensee BioMed Central Ltd. 2012
Received: 17 February 2012
Accepted: 1 August 2012
Published: 13 August 2012
Eimeria is a genus of parasites in the same phylum (Apicomplexa) as human parasites such as Toxoplasma, Cryptosporidium and the malaria parasite Plasmodium. As an apicomplexan whose life-cycle involves a single host, Eimeria is a convenient model for understanding this group of organisms. Although the genomes of the Apicomplexa are diverse, that of Eimeria is unique in being composed of large alternating blocks of sequence with very different characteristics - an arrangement seen in no other organism. This arrangement has impeded efforts to fully sequence the genome of Eimeria, which remains the last of the major apicomplexans to be fully analyzed. In order to increase the value of the genome sequence data and aid in the effort to gain a better understanding of the Eimeria tenella genome, we constructed a whole genome map for the parasite.
A total of 1245 contigs representing 70.0% of the whole genome assembly sequences (Wellcome Trust Sanger Institute) were selected and subjected to marker selection. Subsequently, 2482 HAPPY markers were developed and typed. Of these, 795 were considered as usable markers, and utilized in the construction of a HAPPY map. Markers developed from chromosomally-assigned genes were then integrated into the HAPPY map and this aided the assignment of a number of linkage groups to their respective chromosomes. BAC-end sequences and contigs from whole genome sequencing were also integrated to improve and validate the HAPPY map. This resulted in an integrated HAPPY map consisting of 60 linkage groups that covers approximately half of the estimated 60 Mb genome. Further analysis suggests that the segmental organization first seen in Chromosome 1 is present throughout the genome, with repeat-poor (P) regions alternating with repeat-rich (R) regions. Evidence of copy-number variation between strains was also uncovered.
This paper describes the application of a whole genome mapping method to improve the assembly of the genome of E. tenella from shotgun data, and to help reveal its overall structure. A preliminary assessment of copy-number variation (extra or missing copies of genomic segments) between strains of E. tenella was also carried out. The emerging picture is of a very unusual genome architecture displaying inter-strain copy-number variation. We suggest that these features may be related to the known ability of this parasite to rapidly develop drug resistance.
The phylum Apicomplexa contains a diverse range of parasites including Plasmodium, Cryptosporidium, Babesia, Toxoplasma and others that cause disease in both humans and animals. The genomes of several apicomplexans have been extensively studied [1–5] in an attempt to understand these organisms and to gain insights into potential new methods of control and treatment. Although a few common features have emerged, the apicomplexan genomes studied to date have presented a remarkable diversity of genomic organization and gene content, with each genome having its own unique, and often unusual, characteristics. The genus Eimeria, belonging to the family Eimeriidae within the order Eucoccidiorida (commonly known as the coccidians) represents the last major group of apicomplexans to be analyzed in detail. Because it is homoxenous, Eimeria can be propagated to high numbers in a single life-cycle with relative ease, making it a potential model for aspects of apicomplexan biology that are difficult to study in other genera.
Eimeria is also significant in its own right, as the causative agent of the intestinal disease coccidiosis in poultry. Eimeria species are both site- and host-specific, and are transmitted through the ingestion of sporulated oocysts - resistant, hardy, thick-walled spores that contain infective sporozoites. Seven species infect chickens, and Eimeria tenella is among the most pathogenic, causing weight loss, reduced feed efficiency, reduced egg production and death. The total loss including the costs of control and prevention worldwide is estimated at around USD 2.4 billion per annum, making this one of the most economically important diseases of domestic livestock.
The desire for a better understanding of the parasite and its interaction with the chicken, and a need for better disease control, have driven the E. tenella genome sequencing project. A draft genome assembly (released by the Wellcome Trust Sanger Institute in May 2007) of ~8.3-fold sequence coverage contains 4707 contigs, ranging from thousands to half a million bases in length. The total size of these contigs is 47 Mb, corresponding to ~78% of the estimated 60 Mb genome. A further assembly, incorporating second-generation sequencing data, was produced in 2010, but much of the genome remains unrepresented by large contigs.
The completion of eukaryotic genome sequencing projects depends heavily on having good genome maps to position contigs, give information on large-scale genome structure, and reveal errors in sequence assembly [11–13]. Such a map has hitherto been lacking in the case of E. tenella. The available genetic linkage map, which relies on polymorphic loci, is low in resolution and does not reflect the genome physically. Furthermore, large insert clones are difficult to produce for Eimeria species as they are often unstable due to repetitive sequences, eliminating an effective method for physical mapping.
HAPPY mapping is an in vitro physical mapping technique that analyses the co-segregation of markers amongst a panel of sub-genomic samples (each containing an approximately haploid amount of randomly sheared DNA). Markers that are close together tend to co-segregate strongly amongst the aliquots. Each marker is typed by PCR to detect its presence in each aliquot, and the strength of the evidence for linkage between two markers is expressed as a logarithm of odds (LOD) score. Based on the LOD scores, a genome map can be constructed. This technique was successful in assisting the assembly of E. tenella chromosomes 1 and 2, with individual maps constructed by Ling et al. and Paul H. Dear (unpublished data), respectively.
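The idea of scoring co-segregation as a LOD can be illustrated with a minimal sketch. It is not the scoring used by the authors' software, which is not described here: it simply compares the likelihood of the observed joint presence/absence pattern of two markers against the likelihood expected if they segregated independently. The function name `lod_score` and the toy typing vectors are assumptions made for illustration.

```python
import math


def lod_score(typing_a, typing_b):
    """Generic association LOD for two markers typed on the same aliquots.

    typing_a / typing_b are equal-length lists of 0/1 (absent/present).
    Returns log10 of (likelihood under the observed joint frequencies /
    likelihood under independence). Illustrative only; a real HAPPY
    analysis would use a model of DNA breakage between the markers.
    """
    n = len(typing_a)
    if n == 0 or n != len(typing_b):
        raise ValueError("typing vectors must be non-empty and equal in length")

    # Count the four joint outcomes (0,0), (0,1), (1,0), (1,1).
    counts = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    for x, y in zip(typing_a, typing_b):
        counts[(x, y)] += 1

    p_a = sum(typing_a) / n  # fraction of aliquots positive for marker A
    p_b = sum(typing_b) / n  # fraction of aliquots positive for marker B

    lod = 0.0
    for (x, y), c in counts.items():
        if c == 0:
            continue  # empty cells contribute nothing to the likelihood ratio
        expected = (p_a if x else 1 - p_a) * (p_b if y else 1 - p_b)
        lod += c * math.log10((c / n) / expected)
    return lod


# Toy panel of 96 aliquots: perfectly co-segregating vs. independent markers.
linked = [1] * 48 + [0] * 48
independent_a = [1, 1, 0, 0] * 24
independent_b = [1, 0, 1, 0] * 24
print(lod_score(linked, linked))
print(lod_score(independent_a, independent_b))
```

Note that this statistic is symmetric, so a strong negative association would also score highly; a practical implementation would only accept pairs that co-occur more often than expected.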
Mapping also provides an opportunity to examine genomic variation. Comparisons between different strains of E. tenella have found only limited sequence variation [18, 19]. Recent studies in several species (including human) have shown that structural variation (duplication or rearrangement) contributes more diversity than sequence variation, but there have been no efforts to look into structural variations in Eimeria or other Apicomplexa. HAPPY mapping can be extended to allow molecular copy-number counting (MCC), determining copy-numbers by counting the molecules present. MCC has been used to screen variations in copy-number due to its speed, flexibility and sensitivity [22–24].
As HAPPY maps for E. tenella chromosomes 1 and 2 have previously been constructed, we demonstrate in this study the construction of an integrated map for the remainder of the genome in order to give an overview of genome organization and to aid sequence assembly. We managed to improve and validate the map by integrating BAC-end and contig sequences. We find that the striking segmentation of the genome into feature-poor (P) and feature-rich (R) regions (previously noted on Chromosome 1) is present throughout the genome, and that the P- and R-segments correspond to unique and non-unique regions, respectively. Investigation of copy-number variation using MCC has highlighted further structural perspectives on the E. tenella genome.
Markers were generated from the draft genome assembly (released in May 2007) obtained from the Wellcome Trust Sanger Institute (WTSI). Contigs similar to Chromosomes 1 and 2 (unpublished data) were filtered out prior to marker generation. A total of 1245 of the largest contigs, representing 70% of the assembled sequence, were selected. In total, 2482 markers were developed and typed. Of these, 576 (23.2%) markers failed to show amplification and were therefore discarded. A total of 914 (36.8%) multi-copy markers (recognized by an excess number of aliquots scoring positive for the marker) were also set aside, while 231 (9.3%) low-copy markers were further analyzed to screen for significant linkage. Overall, a total of 761 (30.7%) good markers were obtained (details in Additional file 1). This low proportion of usable markers was due largely to the presence of sequences which, though unique in the assembly, were present in multiple copies in the genome. There are 301 contigs that carry only non-multicopy markers (unique contigs) while 279 contigs contain only multi-copy markers (multi-copy contigs). In these latter cases, up to nine unsuccessful attempts were made to identify unique sequences elsewhere in the contigs. The multi-copy contigs are up to 196 kb in length.
Together with 34 low-copy markers which showed significant linkages, a total of 795 markers were used for the construction of the HAPPY map. Of these, 664 fell into 67 linkage groups of three or more markers, while 18 formed pairs of linked markers, at LOD ≥ 6 (odds of better than one million to one in favor of linkage), leaving 113 markers as singletons (details in Additional file 2). The relative physical size of the map was calculated to be 27.3 Mb, or almost half of the estimated 60 Mb genome. The HAPPY map linked 454 contigs with a total size of 18,224,408 bp, meaning that approximately 9 Mb of gaps between contigs were bridged by HAPPY linkages alone.
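As an illustration of how pairwise LOD scores can be turned into linkage groups, pairs and singletons, the following sketch joins any two markers whose LOD meets a threshold (single-linkage clustering with a union-find structure). This is an assumed simplification and not the authors' map-building software; the function name `linkage_groups`, the marker names and the LOD values are hypothetical.

```python
def linkage_groups(markers, pair_lod, threshold=6.0):
    """Group markers connected by at least one pairwise LOD >= threshold.

    markers:  iterable of marker names
    pair_lod: dict mapping (marker_a, marker_b) -> LOD score
    Returns a list of sorted groups, largest first.
    """
    parent = {m: m for m in markers}

    def find(m):
        # Follow parent links to the group representative, with path halving.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), lod in pair_lod.items():
        if lod >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for m in markers:
        groups.setdefault(find(m), []).append(m)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical data: LOD 6 corresponds to odds of 10**6 to 1.
markers = ["m1", "m2", "m3", "m4", "m5", "m6"]
pair_lod = {("m1", "m2"): 8.2, ("m2", "m3"): 6.0,
            ("m4", "m5"): 7.1, ("m3", "m4"): 2.0}
groups = linkage_groups(markers, pair_lod)

large = [g for g in groups if len(g) >= 3]       # linkage groups of 3+
pairs = [g for g in groups if len(g) == 2]       # linked pairs
singletons = [g for g in groups if len(g) == 1]  # unlinked markers
print(large, pairs, singletons)
```

Ordering the markers within each group requires the distance estimates as well, which this sketch does not attempt.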
To produce a more comprehensive map, we integrated BAC-end sequences. The BAC clones of E. tenella have been end-sequenced in the WGS project, but were not used in the 2007 sequence assembly. We have also noticed that linkages by BAC clones alone can be unreliable, since many BAC-end sequences do not map uniquely to the genome, and some BACs show possible size discrepancies when both ends map to a single contig. However, the combination of BAC data with HAPPY map data is more robust.
Analysis of 9567 BAC-end sequences found a total of 2514 clones that aligned to different contigs of the assembly, including 277 linkages supported by at least two BACs (details in Additional file 3). Five new linkages (that is, linkages which had not already been made by HAPPY data alone) were identified by these clones, between the HAPPY linkage groups and large contigs that do not have any good markers. Most of the HAPPY linkages were supported by at least one BAC clone. Furthermore, 14 groups of multi-copy contigs were linked by BAC clones. The number of BAC clones that fall within these multi-copy contig groups often exceeds the number of clones within the HAPPY linkage group. The total size of all the contigs within the group ranged from about 100 kb to 400 kb. Two groups were found to have one of their end-markers originating from a contig that contains a good marker in one of the HAPPY linkage groups. The good markers are also situated at the ends of HAPPY linkage groups. Small contigs (unmapped in this study) are believed to lie in the gaps between contigs in the map, based on what was observed in the BAC-clone integration.
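The counting of contig-to-contig linkages supported by paired BAC ends can be sketched as follows. The input layout (one record per clone listing the contigs hit by each end), the function name `contig_links` and the example clones are assumptions for illustration; the sketch keeps only clones whose two ends each map to exactly one contig and to different contigs, then reports contig pairs supported by a minimum number of clones.

```python
from collections import Counter


def contig_links(clones, min_support=2):
    """Count contig pairs bridged by BAC clones.

    clones: list of (clone_id, hits_end1, hits_end2), where each hits_*
            is the list of contigs that end sequence aligns to.
    Returns {(contig_a, contig_b): number_of_supporting_clones} for pairs
    with at least min_support clones.
    """
    support = Counter()
    for _clone_id, hits1, hits2 in clones:
        # Discard unpaired ends and ends that do not map uniquely.
        if len(hits1) != 1 or len(hits2) != 1:
            continue
        a, b = hits1[0], hits2[0]
        if a == b:
            continue  # both ends in one contig: no new linkage
        support[tuple(sorted((a, b)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


# Hypothetical clones.
clones = [
    ("b1", ["c1"], ["c2"]),
    ("b2", ["c2"], ["c1"]),
    ("b3", ["c1"], ["c3"]),
    ("b4", ["c1", "c9"], ["c2"]),  # end 1 is not unique
    ("b5", ["c4"], []),            # unpaired
    ("b6", ["c5"], ["c5"]),        # both ends in the same contig
]
print(contig_links(clones))
print(contig_links(clones, min_support=1))
```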
To monitor the quality and reliability of the HAPPY map, markers were designed at 20 kb intervals along the two largest contigs of the draft assembly. All the good markers from these two contigs were used in the subsequent map-making process and were successfully arranged into two linkage groups that correspond to their respective contigs (details in Additional file 5), and the HAPPY map arrangement of these markers corresponds well with their locations in the contigs. Similarly encouraging results come from HAPPY markers designed at opposite ends of contigs smaller than 100 kb (the maximum range of the HAPPY map): in all cases, these pairs of markers are linked, and the distance between them as estimated from the HAPPY data corresponds well with the contig length.
The BAC-end sequences also suggested that HAPPY mapping has correctly ordered the contigs, as most of the HAPPY linkages are supported by at least one BAC clone. All these results strongly suggest that both the HAPPY data and the 2007 assembly are of good quality.
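[Table: Details of chromosomally-assigned genes. Columns include gene size (bp), GenBank accession no. and gene HAPPY marker. Surviving entries include the 5S rRNA and 18S-5.8S-28S rDNA genes and the chromosome assignments "6 or 7" and "9 or 10"; the remaining table body was not recoverable.]
[Table: Location of chromosomally-assigned genes or markers. Columns include chromosome size* (Mb) and linkage group size (Mb). Surviving entries include the chromosome assignments "6 or 7" and "9 or 10"; the remaining table body was not recoverable.]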
As a result of incorporating BAC-end and contig sequences, an integrated HAPPY map was constructed [Figure 1; Additional file 4]. The map consists of 59 linkage groups that range from 100 kb to 1.7 Mb in size, covering ~31.0 Mb. Eight of these groups are chromosomally assigned.
From earlier analysis of Chromosome 1 and from analysis of markers on the largest two contigs of the 2007 assembly, it is obvious that P-regions have a higher density of good markers than R-regions [Figure 1; Figure 2]. This is due largely to the non-unique nature of much of the R-region sequence (leading to multi-copy markers), and probably also to abundant simple-sequence repeats, which we find interfere with amplification of adjoining non-repetitive sequence.
The 661 HAPPY markers which fall into linkage groups cover ~31.0 Mb of the genome, giving an average spacing on mapped regions of ~46.9 kb. There are 105 singletons (unlinked HAPPY markers) in the integrated map. Assuming that these are distributed evenly across the unmapped areas, their average spacing is approximately 276 kb. This is more than double the range of the mapping panel (~100 kb), explaining why they do not link to other markers.
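The spacing figures quoted above follow from simple arithmetic, reproduced here as a short check. All values are taken from the text; the variable names are illustrative.

```python
genome_mb = 60.0       # estimated genome size
mapped_mb = 31.0       # genome covered by linkage groups
linked_markers = 661   # HAPPY markers within linkage groups
singletons = 105       # unlinked HAPPY markers
panel_range_kb = 100   # approximate range of the mapping panel

# Average marker spacing within mapped regions.
spacing_mapped_kb = mapped_mb * 1000 / linked_markers

# Average singleton spacing, assuming an even spread over unmapped regions.
spacing_unmapped_kb = (genome_mb - mapped_mb) * 1000 / singletons

print(f"mapped regions:   {spacing_mapped_kb:.1f} kb per marker")
print(f"unmapped regions: {spacing_unmapped_kb:.0f} kb per singleton")
print("singleton spacing exceeds twice the panel range:",
      spacing_unmapped_kb > 2 * panel_range_kb)
```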
Although HAPPY mapping reveals an abundance of multi-copy sequences (which were represented only once in the assembly), the exact copy-numbers of these sequences in the genome are still unknown. In humans, multi-copy regions tend to vary in copy-number between individuals. Furthermore, structural variations have never been studied in the E. tenella genome. Hence, we set out to investigate the HAPPY markers for variation in copy-number between strains of E. tenella using the molecular copy-number counting (MCC) method. We randomly selected 48 of the HAPPY markers for this analysis.
The analysis carried out on the Weybridge (Wey) strain also revealed eleven of the markers to be multi-copy while analysis on the Wisconsin (Wis) strain showed only nine multi-copy markers. Although the Wey strain had almost the same number of multi-copy markers as the H strain, most of them were lower in copy-number. The Wis strain has the lowest number of multi-copy markers but one of them (marker 44) is present in triple the normal copy-number. One marker, marker 27, did not show amplification in the Wis strain.
Overall, four markers were multi-copy in all three strains while sixteen markers were found to have differences in copy-number between the strains. We therefore estimate that, within any one strain, about 20% of sequences (other than short tandem repeats) are present in multiple copies (11, 12 and 9 out of the 47 markers, in the three strains analyzed), if our selection of markers is representative. Moreover, about a third of sequences (again, assuming our markers to be representative) differ in copy-number between strains. Further work is needed to clarify the extent of the duplicated sequence blocks, and their possible relation to genes.
The analysis of the Chromosome 1 sequence showed a striking segmentation, with feature-poor (P) regions alternating with feature-rich (R) regions. The P-regions are slightly longer, have a higher and more uniform AT content than the R-regions and are almost free of simple repeats. In contrast, R-regions contain an abundance of simple repeats including tandemly repeated trinucleotides, the telomere-like AGGGTTT heptamer, the TGCATGCA palindromic octamer (which seems to be peculiar to a subset of apicomplexans), and also LINE transposons. It was speculated that the rest of the genome of E. tenella was organized in a similar segmented fashion.
The integrated map described here has revealed the structure for about half of the E. tenella genome based on the relative physical size of the map in comparison with the estimated genome size (60 Mb). Our analysis suggests that this half consists mainly of P-type sequence, largely because markers in R-regions are usually multicopy and therefore unmappable. If most R-regions are larger than 140 kb (the smallest R-region on Chromosome 1), then this would explain the inability of the map to connect across most R regions by means of HAPPY linkage or BAC clones.
Regarding the accuracy of the integrated map, we do see some conflicts in marker order when compared with the sequence assembly (see, for example, Additional file 4). For the most part, these are local inversions in line with the expected resolution of the HAPPY map; others, particularly where they involve small isolated sequence contigs, are likely to represent errors in the sequence assembly. As with all complex genomes, mapping and sequencing are iterative and interdependent processes, and there is no universally agreed metric for measuring the goodness of a map, nor agreement on how to weight common local errors against rarer large-scale errors. However, it is generally agreed that maps which integrate several datasets (in this case, HAPPY data, BAC-end data and contig data) are more robust than those which depend on a single method.
If we are correct in assuming that most of the unmappable multicopy markers originate from R-regions, this suggests that the R-regions are rich in repeated sequences (in addition to the simple-sequence repeats which are avoided in marker selection), and that a segment from an R-region on one chromosome may be duplicated at an R-region of another chromosome. About 20% of the genome may be repeated in this way, in addition to the 14% made up of simple-sequence repeats (based on Chromosomes 1 and 2). This may account for the fact that the draft sequence assembly represents only about 78% of the genome.
Comparisons between E. tenella strains are scarce but have generally shown moderate variation at the sequence level. For instance, the recent comparison of the ~9 kb glucose-6-phosphate isomerase genomic locus between the H, Wis and Wey strains revealed 33 SNPs and 14 indels, or a variation of around half of one percent of nucleotides. In contrast, our preliminary study based on CNV analysis showed the possibility of much more widespread structural variation between these three strains.
The integrated HAPPY map has revealed the probable structure of the E. tenella genome, and explains why the ongoing sequencing program has encountered difficulties. It suggests that the genome is organized segmentally, alternating between P- and R-regions, with an average of about four or five P-regions per Mb. R-regions are likely to contain tracts of repeated sequence amounting to >20% of the genome, as well as a further 14% of simple-sequence repeats. This segmental structure and the repetitive nature of the R-segments probably explain both the gaps in the integrated map and the incompleteness of the draft sequence assembly.
There are also indications that much of the genome displays copy-number variation between E. tenella strains. Given that parasites must constantly adapt to oppose emerging resistance in the host, it is tempting to speculate that the segmental architecture contributes to a structurally dynamic genome, lubricated by the repeat-rich R-regions. This, in turn, may play an important role in the rapid emergence of drug resistance which is known to be a feature of Eimeria.
The HAPPY mapping was performed essentially as described previously [26–28] on a mapping panel containing fragments selected at approximately 100 kb, with customization in the primer design and marker selection process. The draft genome assembly contigs were divided into fragments of ~2 kb in length. All of these 2 kb fragments were subjected to primer design using UniversalPrimerDesigner (Paul H. Dear, unpublished). This software was set to find the optimum set of hemi-nested primers (forward external, forward internal and reverse) for each 2 kb fragment. The designed primers had a predicted melting temperature of 55–62°C, with two G or C nucleotides at the 3’ end and one G or C nucleotide at the 5’ end, and length 18–22 bases. Each set of primers was designed to produce an internal amplicon of 100–200 bp (external amplicon length 150–350 bp) with an A + T content not exceeding 80%. Contigs similar to the Chromosome 1 and 2 sequences were filtered. The uniqueness of each primer set was then checked against the genome assembly. Markers were picked at contig ends for contigs larger than 17 kb, and one marker for contigs smaller than 17 kb. Where markers failed (for example, were found to be multi-copy), further markers were chosen from nearby sequences in an attempt to find successful markers.
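The primer and amplicon constraints listed above can be expressed as a simple filter. This is a sketch only: UniversalPrimerDesigner is unpublished and its melting-temperature model is not stated, so the Wallace rule (2 °C per A/T, 4 °C per G/C) is used here purely as a stand-in assumption. "Two G or C nucleotides at the 3' end" is interpreted as the last two bases both being G or C. The function names and example sequences are hypothetical.

```python
def wallace_tm(primer):
    """Rough melting temperature (deg C) by the Wallace rule (assumed model)."""
    gc = primer.count("G") + primer.count("C")
    at = primer.count("A") + primer.count("T")
    return 2 * at + 4 * gc


def passes_primer_rules(primer, tm_range=(55, 62), len_range=(18, 22)):
    """Check length, 5'/3' G/C clamps and predicted Tm for one primer."""
    p = primer.upper()
    if not len_range[0] <= len(p) <= len_range[1]:
        return False
    if p[0] not in "GC":                       # one G or C at the 5' end
        return False
    if p[-1] not in "GC" or p[-2] not in "GC":  # two G or C at the 3' end
        return False
    return tm_range[0] <= wallace_tm(p) <= tm_range[1]


def amplicon_ok(amplicon, len_range=(100, 200), max_at=0.80):
    """Check internal amplicon length and A+T content (not exceeding 80%)."""
    a = amplicon.upper()
    if not len_range[0] <= len(a) <= len_range[1]:
        return False
    at_fraction = (a.count("A") + a.count("T")) / len(a)
    return at_fraction <= max_at


print(passes_primer_rules("GCATTAGCATTAGCATTACC"))  # 20-mer, G/C clamps present
print(passes_primer_rules("ACATTAGCATTAGCATTACC"))  # fails the 5' rule
print(amplicon_ok("AT" * 40 + "GC" * 20))           # 120 bp, about 67% A+T
```

A uniqueness check against the assembly, which the authors also applied, would need a sequence search step and is omitted here.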
Contigs larger than 5 kb were divided into P- and R-segments using custom software (Paul H. Dear, unpublished). Briefly, all of the simple-sequence repeats (of [CAG]n or [AGGGTTT]n, where n ≥ 3) which pepper the R-regions were identified, and both they and any intervening stretches of sequence shorter than 4 kb were marked as putative R-segments, with the remainder (stretches of >4 kb lacking simple-sequence repeats) being marked as putative P-segments. Based on analysis of Chromosomes 1 and 2, we assumed that genuine P- or R-segments would each be larger than 30 kb. Therefore, the shortest putative P- or R-segment was eliminated (so that, for example, a 5 kb putative R-region flanked by larger P-regions would be merged to become a single contiguous P-region); this was done iteratively until no segment smaller than 30 kb remained, apart from the first and last segments of the contig.
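The segmentation procedure can be sketched in two steps: clustering simple-sequence repeats into putative R-segments, then iteratively absorbing the shortest interior segment below 30 kb into its neighbours. The original custom software is unpublished, so this is an assumed reconstruction from the description above. In particular, repeat-free stretches at the contig ends are treated as P regardless of length, and repeat coordinates are supplied by the caller rather than searched for. The function names and example coordinates are hypothetical.

```python
def putative_segments(contig_len, repeats, max_gap=4_000):
    """Build alternating ('P'|'R', length) segments from repeat intervals.

    repeats: sorted, non-overlapping half-open (start, end) intervals of
             simple-sequence repeats. Repeats closer together than max_gap
             are joined, with the sequence between them, into one R-segment.
    """
    if not repeats:
        return [("P", contig_len)]

    clusters = [list(repeats[0])]
    for start, end in repeats[1:]:
        if start - clusters[-1][1] < max_gap:
            clusters[-1][1] = end          # extend the current R cluster
        else:
            clusters.append([start, end])  # start a new R cluster

    segments, pos = [], 0
    for start, end in clusters:
        if start > pos:
            segments.append(("P", start - pos))
        segments.append(("R", end - start))
        pos = end
    if pos < contig_len:
        segments.append(("P", contig_len - pos))
    return segments


def merge_short_segments(segments, min_len=30_000):
    """Iteratively remove the shortest interior segment below min_len.

    Because segments alternate P/R, the two neighbours of a removed segment
    are of the same type and fuse into one. First and last segments are
    exempt, as in the description.
    """
    segs = list(segments)
    while True:
        short = [(segs[i][1], i) for i in range(1, len(segs) - 1)
                 if segs[i][1] < min_len]
        if not short:
            return segs
        _, i = min(short)  # shortest interior segment
        kind = segs[i - 1][0]
        fused = (kind, segs[i - 1][1] + segs[i][1] + segs[i + 1][1])
        segs[i - 1:i + 2] = [fused]


example = [("P", 50_000), ("R", 5_000), ("P", 40_000),
           ("R", 60_000), ("P", 35_000)]
print(merge_short_segments(example))
print(putative_segments(100_000, [(10_000, 10_030), (12_000, 12_050),
                                  (60_000, 60_021)]))
```

In the first example the 5 kb R-region flanked by larger P-regions is merged into a single contiguous P-region, matching the case described in the text.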
Contigs smaller than 5 kb were not divided into segments, but were instead classified as entirely P-type, R-type or unclassifiable based on their overall content of simple-sequence motifs (less than 0.5 motifs, more than 1 motif, or between 0.5 and 1 motifs per kilobase, respectively).
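The density thresholds for small contigs translate directly into a three-way classification. The function name `classify_small_contig` and the example counts are illustrative assumptions.

```python
def classify_small_contig(motif_count, length_bp):
    """Classify a contig under 5 kb by simple-sequence motifs per kilobase."""
    density = motif_count / (length_bp / 1000)
    if density < 0.5:
        return "P"
    if density > 1.0:
        return "R"
    return "unclassifiable"  # between 0.5 and 1 motifs per kb


print(classify_small_contig(1, 4_000))  # 0.25 motifs/kb
print(classify_small_contig(6, 4_000))  # 1.5 motifs/kb
print(classify_small_contig(3, 4_000))  # 0.75 motifs/kb
```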
The parameters for this analysis are somewhat arbitrary, but were chosen such that they accurately identified the P/R segmentation that had been previously noted for Chromosomes 1 and 2. Moreover, slight variation in these parameters, or the use of analyses based on sequence information content and base composition (which also distinguish P- from R-segments) gave essentially similar results (not shown).
BAC-end sequences were mapped to the draft genome using ssahaEST, then processed to remove unpaired sequences and sequences that did not map uniquely to a single region. For the chromosomal assignment of map segments, HAPPY markers were designed for fifteen chromosomally-assigned genes and mapped as described.
Forty-eight randomly selected markers (as described in primer design) were typed on panels of sub-genomic aliquots of sheared DNA from three different E. tenella strains (Houghton, Weybridge and Wisconsin) using the MCC approach. Panels of 96 aliquots, each containing approximately 0.3 genomes of sheared DNA, were constructed for each strain to detect up to a 3-fold change in copy-number. The selected markers were then typed by hemi-nested PCR following the same protocol as for the HAPPY mapping. The proportion of aliquots positive for any marker allows one to calculate its abundance in the panel (assuming a Poisson distribution of molecules among aliquots), and hence its copy-number relative to other markers. Further details are given in the MCC reference.
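The Poisson step can be made explicit with a short sketch. If molecules are distributed among aliquots at random, the fraction of negative aliquots is exp(-λ), where λ is the mean number of target molecules per aliquot, so λ = -ln(1 - p) for a positive fraction p. Dividing λ for a test marker by λ for a reference marker gives a relative copy-number. The function names and the example counts are assumptions for illustration, not data from the study.

```python
import math


def molecules_per_aliquot(positive, total):
    """Mean target molecules per aliquot from the count of positive aliquots."""
    if not 0 <= positive < total:
        # With every aliquot positive the estimate is unbounded.
        raise ValueError("need 0 <= positive < total")
    return -math.log(1 - positive / total)


def relative_copy_number(pos_marker, pos_reference, total):
    """Copy-number of a marker relative to a reference marker on one panel."""
    reference = molecules_per_aliquot(pos_reference, total)
    if reference == 0:
        raise ValueError("reference marker has no positive aliquots")
    return molecules_per_aliquot(pos_marker, total) / reference


# At about 0.3 genomes per aliquot, a single-copy marker is expected to be
# positive in a fraction 1 - exp(-0.3) of aliquots, i.e. roughly a quarter.
expected_fraction = 1 - math.exp(-0.3)
print(f"expected positive fraction: {expected_fraction:.3f}")

# Hypothetical counts on a 96-aliquot panel.
print(f"relative copy-number: {relative_copy_number(57, 25, 96):.2f}")
```

The low DNA concentration matters: at 0.3 genomes per aliquot even a three-fold amplified marker leaves a substantial fraction of aliquots negative, so the estimate remains well-defined.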
We wish to thank Martin Shirley and Damer Blake for parasite materials, Enrique Tabares and Fiona Tomley for unpublished data, and Arnab Pain of the Eimeria tenella genome project at the Wellcome Trust Sanger Institute for making the latest genome assembly and BAC-end sequences available for use in this study. This work was supported by the Genomics and Molecular Biology Initiatives Programme of the Malaysia Genome Institute, Ministry of Science, Technology and Innovation Malaysia (grant number 07-05-16-MGI-GMB10).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.