Transposon fingerprinting using low coverage whole genome shotgun sequencing in Cacao (Theobroma cacao L.) and related species
© Sveinsson et al.; licensee BioMed Central Ltd. 2013
Received: 15 April 2013
Accepted: 19 July 2013
Published: 24 July 2013
Transposable elements (TEs) and other repetitive elements are a large and dynamically evolving part of eukaryotic genomes, especially in plants where they can account for a significant proportion of genome size. Their dynamic nature gives them the potential for use in identifying and characterizing crop germplasm. However, their repetitive nature makes them challenging to study using conventional methods of molecular biology. Next generation sequencing and new computational tools have greatly facilitated the investigation of TE variation within species and among closely related species.
(i) We generated low-coverage Illumina whole genome shotgun sequencing reads for multiple individuals of cacao (Theobroma cacao) and related species. These reads were analysed using both an alignment/mapping approach and a de novo (graph based clustering) approach. (ii) A standard set of ultra-conserved orthologous sequences (UCOS) standardized TE data between samples and provided phylogenetic information on the relatedness of samples. (iii) The mapping approach proved highly effective within the reference species but underestimated TE abundance in interspecific comparisons relative to the de novo methods. (iv) Individual T. cacao accessions have unique patterns of TE abundance, indicating that the TE composition of the genome is evolving actively within this species. (v) LTR/Gypsy elements are the most abundant, comprising c. 10% of the genome. (vi) Within T. cacao the retroelement families show an order of magnitude greater sequence variability than the DNA transposon families. (vii) Theobroma grandiflorum has a similar TE composition to T. cacao, but the related genus Herrania is rather different, with LTRs making up a lower proportion of the genome, perhaps because of a massive presence (c. 20%) of distinctive low complexity satellite-like repeats in this genome.
(i) Short read alignment/mapping to reference TE contigs provides a simple and effective method of investigating intraspecific differences in TE composition. It is not appropriate for comparing repetitive elements across the species boundaries, for which de novo methods are more appropriate. (ii) Individual T. cacao accessions have unique spectra of TE composition indicating active evolution of TE abundance within this species. TE patterns could potentially be used as a “fingerprint” to identify and characterize cacao accessions.
Transposable elements (TEs) are a large and dynamically evolving part of plant genomes [1, 2]. They occupy between 15% and 84% of plant genomes, and TE expansion is known to cause a significant increase in genome size in many cases. Transposable elements are a major force in plant evolution, not only by causing genome expansions but also by altering gene function, either through disruption or by acting as raw material for new genes and novel functions [6, 7].
Transposable elements are usually classified into two major classes based on their transposition mechanisms. Class I retrotransposons move about in a ‘copy-and-paste’ fashion, through an RNA intermediate, which is reverse-transcribed into DNA by an endogenous reverse transcriptase (RT) enzyme. The two largest super-families of retrotransposons in plants, LTR/Copia and LTR/Gypsy, have several other open reading frames that play a role in transposition, located between two long terminal repeat (LTR) regions. Class II DNA elements move about in genomes through a DNA intermediate. The most extensively studied group of class II elements transpose by a ‘cut-and-paste’ mechanism and are classified into several super-families based on sequence similarity. Cut-and-paste DNA transposons are characterized by a transposase gene and a pair of flanking terminal inverted repeats (TIRs).
Transposable elements are known to vary extensively in copy-number and nucleotide sequence among closely related species [4, 10] and even within the same species. Plant LTR retrotransposons are well known to have intraspecific variation in copy-number [12, 13]. This, in combination with the easily amplifiable LTR domain, has been used in the development of molecular markers for several crop species [14–16]. In addition to the extensive presence/absence variability of the LTR elements, sequence heterogeneity is also known to be quite extensive. The reverse transcriptase domain is the most extensively studied retrotransposon gene and it is known to show levels of heterogeneity from about 5% to 75% at the amino acid level. Heterogeneity and sequence evolution of class II DNA transposons is relatively less studied, but a recent study shows that they can be quite heterogeneous.
Cacao (Theobroma cacao L.) is an economically important tree in the mallow family (Malvaceae). It is widely grown in tropical regions as the source of cocoa beans for the manufacture of chocolate. Cacao has long been known to be genetically diverse, and traditionally three major lineages of cacao varieties have been recognized: Trinitario, Criollo, and Forastero. Recent work based on a variety of markers, including microsatellites and whole chloroplast genome sequences of several cacao varieties, has confirmed that the Criollo and Forastero groups are two distinct genetic lineages while the Trinitario group is of hybrid origin [21, 22]. Cacao has a relatively small genome, estimated to be around 430 Mb, and it has a published genome assembly of about 75% of its estimated genome size. This small genome size can be partly explained by the relatively low abundance of transposable elements compared to other angiosperms. TEs comprise only approximately a quarter of the cacao genome.
In this study we use low-coverage Illumina whole genome shotgun sequencing to investigate the evolutionary dynamics and comparative analysis of 3,500 TE families in nine T. cacao varieties and two related species, Theobroma grandiflorum and Herrania balaensis.
Sequence coverage estimates and phylogenetic analysis using the UCOS contigs
Table 1. Sequence summary statistics
Sample                       Read length (bp)   No. reads   No. reads after trimming
EET-64 (T. cacao)            -                  6.6E+07     6.5E+07
Criollo-22 (T. cacao)        -                  4.6E+07     4.2E+07
Stahel (T. cacao)            -                  6.0E+07     5.7E+07
Pentagonum (T. cacao)        -                  5.8E+07     5.5E+07
ICS39 (T. cacao)             -                  6.4E+07     6.0E+07
Amelonado (T. cacao)         -                  7.0E+07     6.8E+07
ICS06 (T. cacao)             -                  6.4E+07     6.1E+07
ICS01 (T. cacao)             -                  5.0E+07     4.9E+07
Scavina-6 (T. cacao)         -                  3.8E+07     2.9E+07
T. grandiflorum (Cupuaçu)    -                  7.2E+07     6.8E+07
(name missing)               -                  7.7E+07     7.4E+07
Variation in TE abundance using short read mapping
Table 2. LTR retrotransposon frequencies in the three species estimated with two different methods (reference based mapping and graph based clustering)
Variation of TE copy number using de novo approaches
Intraspecific variation of TE abundance in T. cacao using short read mapping and PCA
Sequence conservation of transposable elements in T. cacao
The study of transposable elements (TEs) has been revolutionized by the increased availability and lowered costs of next generation sequencing (NGS) technologies. NGS methods have been applied not only in TE studies of plants with high quality whole genome sequences available, such as Zea luxurians and rice (Oryza sativa), but also in organisms with limited genomic resources, such as barley (Hordeum vulgare), pea (Pisum sativum) and banana (Musa acuminata) [27–29]. These studies demonstrate a strong correlation between copy-number estimation of TEs by traditional molecular methods and methods that count short reads from NGS experiments [25, 27]. It was therefore not surprising that the copy-number estimates of TEs in this study fitted very well with previously published estimates in T. cacao, both in regard to the overall TE abundance in the genome, around 23%, and in the copy-number of the most abundant class I retroelement. Our study therefore confirms the utility and reliability of studying genomic repeats using short reads directly.
Different levels of nucleotide conservation in class I and class II TEs
The two major classes of TE, class I retroelements and class II DNA transposons, have long been recognized as two fundamentally different groups of mobile elements, probably present in all eukaryotic genomes. The results presented in this study illustrate a considerable difference in the apparent conservation of the TEs in the genome of T. cacao, where the class I retrotransposons show significantly higher levels of heterogeneity, represented by an order of magnitude higher level of nucleotide diversity (Figure 5). This may be simply because DNA transposons are more narrowly defined at the superfamily level. However, one possible biological explanation for the high levels of heterogeneity in class I retroelements lies in their transposition mechanism, which has been described in detail elsewhere. Class I retrotransposons move about as an RNA intermediate, which is reverse-transcribed into DNA before re-entry into the host genome by their endogenous reverse transcriptase enzyme, which is known to be low-fidelity, causing a high mutation rate [7, 31].
Inter- and intraspecific differences in TE abundance in H. balaensis, T. grandiflorum and T. cacao
Transposable elements are known to cause large inter- and intraspecific differences in the size and composition of plant genomes, as demonstrated in barley (Hordeum vulgare) [11, 32] and rice (Oryza sativa). However, our study found only relatively subtle intraspecific differences in the overall TE abundance in T. cacao. Nevertheless, this slight intraspecific variation in TE copy number does potentially contribute to the variable genome sizes of different cacao accessions reported in the supplementary material of the T. cacao genome paper and other sources [23, 33, 34]. Furthermore, when a PCA approach is used to differentiate accessions based on TE abundance, wide separations do occur (Figure 4). Cacao accessions can be separated according to TE composition despite the fact that they are all closely related, some being of recent hybrid origin. As massively parallel sequencing (MPS) costs fall, there is interest in using MPS to identify accessions, and such use has been called “ultra-barcoding”. This paper shows that data generated for ultra-barcoding could also be used for “transposon composition fingerprinting” of cacao accessions (i.e. identification based on a unique spectrum of transposon composition).
Mapping vs. de novo approaches to studying TEs from short reads
Our results (Table 2) suggest that the mapping approach, while reliable within the reference species (T. cacao), is unreliable in interspecific comparisons, at least for some TE families. The mapping approach reports considerable differences in the composition of repetitive elements in the three species studied (Figure 2). Taken at face value, the genomes of T. grandiflorum and H. balaensis appear significantly deficient in many LTR retrotransposon families that are very abundant in T. cacao (Figure 3). However, this difference may be at least partly caused by low interspecific mapping quality of the short reads, since our reference contigs originate from the genome of T. cacao. The LTR retrotransposon families in particular have high nucleotide diversity (Figure 5), which is likely to cause problems in the mapping of the short reads.
The evidence for the failure of the mapping approach in interspecific comparisons comes from the de novo approach of graph based clustering using RepeatExplorer. This demonstrates that in both T. grandiflorum and H. balaensis the LTR TE families are more abundant than the mapping approach suggested (Table 2, Figure 3 and Additional file 2). More importantly the graph based clustering showed that the composition of H. balaensis and T. grandiflorum is quite different from T. cacao. Therefore we conclude that mapping based approaches are well suited to look at TE evolution in an intraspecific manner whereas de novo methods, such as graph based clustering, are much more useful in the exploration of differences in repetitive elements across species boundaries.
The present study demonstrates considerable differences in transposable element composition among and even within species, highlighting their dynamic role in plant genome evolution. Variation of transposable elements in plants is important especially given the great abundance of transposable elements in plant genomes and their potential impact on the genespace. We used two different methods of looking at transposable element variation from Illumina short read data: reference-based mapping and graph-based clustering. Both are effective at capturing variation, although each is appropriate at different levels of taxonomic comparison. Reference based mapping works well within a species while graph-based clustering is preferred for between species comparisons.
Plant material and Illumina sequencing
Total genomic DNA was extracted from leaf tissue from 11 individuals belonging to three species in the Malvaceae: one Herrania balaensis, one Theobroma grandiflorum and nine T. cacao. Each T. cacao individual represented a different cultivated variety (see Table 1). DNA extraction was performed using the DNeasy Plant Mini Kit (Qiagen, Valencia, California, USA) according to the manufacturer’s protocol. Sequencing libraries were constructed using standard protocols and chemistry for the Illumina platform. Each library was sequenced on a single lane and generated either 60- or 80-bp paired-end sequences (see Table 1) on the Illumina GAII platform by Cofactor Genomics of St. Louis, MO (http://www.cofactorgenomics.com/). The raw reads are available on NCBI’s Short Read Archive [SRA048198].
Mapping of reads, coverage estimates and SNP calling
The reference sequences of the transposable element (TE) families used in this study were extracted and characterized by the authors of the publication describing the T. cacao genome, who graciously made their data available for this study (Additional file 4). Briefly, they identified class I retrotransposons using LTR_finder, LTRharvest and in-house software that looked for signatures of class I retroelements, such as the long terminal repeat (LTR) and reverse transcriptase (RT). Class II elements were discovered using a blastX search of the transposase gene against the Repbase database proteins. In all they identified 650 class I and 2,860 class II families. For more details see the supplementary methods in .
In order to estimate copy-number and sequence evolution of the TE families using our sequenced libraries of three species and nine T. cacao varieties, we mapped reads from each sequenced library to the TE reference contigs. First, the reads were trimmed for quality, with bases of quality below 20 trimmed from the ends of each read. Quality-trimmed reads were treated as single-end sequences and mapped to the TE reference contigs using BWA v0.6.1 with the program’s default settings. The rationale for treating the paired-end sequences as single-end was that TE copy-number estimation from single-end coverage was believed to be more accurate, as paired-end information often links the repeat to different single-copy portions of the genome, preventing pairs from mapping near the boundaries of the repeated segment. Coverage estimates for each nucleotide position in the reference contigs were extracted from the sorted BAM file output of BWA using the genomeCoverageBed tool in the bedTools package v2.15.0 (genomeCoverageBed flags: -d -ibam). Relative copy-number of each TE family was estimated by summing the number of reads covering each position of the reference contig and dividing by the length of the contig, i.e. as the mean per-base coverage. Proportional abundance was calculated for each species by dividing the abundance of each TE super-family by the abundance of all TEs. Information on nucleotide variants detected in the reads, compared to the TE reference contigs, was extracted using samtools v0.1.7a. Nucleotide diversity was estimated for each TE reference contig by counting the number of variable sites with read-depth higher than 6 and base qualities higher than 20 (column 6 from samtools pileup -vcf output), and dividing by the length of the contig. To control for the effect of different read depths between libraries, subsampling was used to ensure equality of total reads.
Due to the repetitive nature of TEs, a variable site could represent a single nucleotide polymorphism in a homologous copy, i.e. a heterozygote, or could stem from sequence divergence between different copies of a transposable element.
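The coverage arithmetic described above can be illustrated with a minimal Python sketch. It assumes the per-position depth table has three tab-separated columns (contig, position, depth), as written by genomeCoverageBed with the -d flag; the function names are hypothetical and this is not the authors' pipeline.

```python
from collections import defaultdict


def mean_depth_per_contig(depth_lines):
    """Mean per-base read depth for each reference contig.

    depth_lines: iterable of 'contig<TAB>position<TAB>depth' strings
    (assumed layout of the per-position coverage table).
    """
    total = defaultdict(int)   # summed depth over all positions of a contig
    length = defaultdict(int)  # number of positions seen, i.e. contig length
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, _position, depth = line.split("\t")
        total[contig] += int(depth)
        length[contig] += 1
    return {contig: total[contig] / length[contig] for contig in total}


def proportional_abundance(abundance_by_superfamily):
    """Abundance of each super-family divided by the abundance of all TEs."""
    grand_total = sum(abundance_by_superfamily.values())
    if grand_total <= 0:
        raise ValueError("total TE abundance must be positive")
    return {name: value / grand_total
            for name, value in abundance_by_superfamily.items()}


def nucleotide_diversity(n_variable_sites, contig_length):
    """Variable sites (already filtered on depth and base quality) per bp."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return n_variable_sites / contig_length
```

For example, a contig of 100 bp with 5 variable sites passing the depth and quality filters would have a nucleotide diversity of 0.05 under this definition.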
To account for differences in sequencing depth and read length of different libraries, reduced equalized data sets were used for some of the analyses presented here (Figures 4 and 5). The reduced data sets were generated by trimming the read length of all libraries to 60-bp and randomly extracting reads from all but the smallest sequenced library (Scavina-6). The purpose of this step was to make sure that variable read lengths and sequencing depths were not the cause of observed differences in TE coverage and nucleotide diversity. However any observed differences in UCOS coverage could be due to differences in genome sizes among the three species and the T. cacao varieties. Furthermore 49 sampling replicates were generated in order to test the effect of data sub-sampling on TE coverage estimates.
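A minimal Python sketch of how such equalized data sets might be drawn is given below. It is an illustration under stated assumptions, not the authors' script: reads are held in memory as plain strings, reads shorter than the target length are dropped (the text does not say how these were handled), and the function names and seeding scheme are hypothetical.

```python
import random


def equalize_library(reads, n_reads, read_length=60, seed=0):
    """Trim reads to read_length and draw n_reads without replacement.

    Assumption: reads shorter than read_length are discarded so that
    every read in the reduced data set has the same length.
    """
    trimmed = [read[:read_length] for read in reads if len(read) >= read_length]
    if n_reads > len(trimmed):
        raise ValueError("library has fewer usable reads than requested")
    rng = random.Random(seed)  # seeded so each replicate is reproducible
    return rng.sample(trimmed, n_reads)


def sampling_replicates(reads, n_reads, n_replicates=49, read_length=60):
    """Independent random sub-samples of one library, one seed per replicate."""
    return [equalize_library(reads, n_reads, read_length, seed=i)
            for i in range(n_replicates)]
```

Here n_reads would be set to the size of the smallest library, so that every library contributes the same number of 60-bp reads.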
The class I LTR retrotransposon reference contigs were annotated using LTRHarvest and LTRDigest. These programs use similarity searches of conserved regions of LTR elements, such as the long terminal repeat and protein coding genes, to estimate the coordinates of the various features of the elements. That information was then used to estimate the variability of each LTR element feature, by combining the feature file output of LTRDigest with the nucleotide variant output of samtools.
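Combining feature coordinates with variant positions amounts to counting variable sites inside each annotated interval. The sketch below shows one way this could be done in Python; it assumes features are given as (name, start, end) tuples with 1-based inclusive coordinates and that variant positions have already been filtered, and the function name is hypothetical.

```python
def feature_variability(features, variant_positions):
    """Fraction of variable sites within each annotated feature.

    features: iterable of (name, start, end), 1-based inclusive coordinates
              on a single reference contig (assumed layout).
    variant_positions: iterable of 1-based positions of variable sites.
    """
    variants = set(variant_positions)
    result = {}
    for name, start, end in features:
        feature_length = end - start + 1
        n_variable = sum(1 for pos in variants if start <= pos <= end)
        result[name] = n_variable / feature_length
    return result
```

Variant positions that fall outside every feature are simply ignored, so the per-feature values are comparable across elements of different total length.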
In order to test whether we could better account for sequence divergence in the class I TEs among the three species, we tried mapping the reads exclusively to conserved regions of the LTR retrotransposons and with relaxed BWA alignment stringency. Protein coding regions and the LTR were extracted from the reference contigs, based on the annotations from LTRHarvest and LTRDigest, and the BWA alignment step was performed with more relaxed settings (bwa flags: -l 1024 -i 0 -o 3). These settings allowed for more gaps and a longer seed length in BWA's short read alignments. These relaxed settings and the conserved regions of the LTR elements were only used to generate Additional file 1.
Identification of, and mapping to UCOS contigs
A set of 357 ultra-conserved orthologous sequences (UCOS, http://compgenomics.ucdavis.edu/compositae_reference.php) was used to estimate the sequencing coverage of individual libraries as well as to estimate phylogenetic relationships between the three species and among the nine T. cacao varieties. These sequences represent single copy genes in Arabidopsis thaliana and tend to be conserved as single copy genes across Eukaryotes. Since these genes are highly conserved and often present in a single copy in the genome, they are useful for estimating the sequencing coverage of each library and estimating copy number of the TE families. The 357 putative UCOS homologs in T. cacao were identified using blastx with an e-value threshold of 1E-34. The single copy status of the UCOS was verified by removing all contigs that had multiple hits to the T. cacao genome with an e-value lower than 1E-06. This left 245 UCOS, to which the reads were mapped using BWA. Coverage of each UCOS contig was estimated using bedTools, and single nucleotide polymorphisms (SNPs) were called using samtools (see Additional file 5). Finally, an average coverage was calculated for each library as the mean coverage of the 245 UCOS contigs. Coverage of each TE reference contig was divided by the mean UCOS coverage, in order to estimate the copy-number of TEs relative to single copy nuclear genes.
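The normalization step can be written out as a short Python sketch. It assumes per-contig mean coverages are already available as dictionaries; the function name and the contig identifiers in the example are hypothetical.

```python
from statistics import mean


def relative_copy_number(te_coverage, ucos_coverage):
    """Divide each TE contig's coverage by the mean single-copy coverage.

    te_coverage:   dict of TE contig name -> mean per-base coverage
    ucos_coverage: dict of UCOS contig name -> mean per-base coverage
    """
    if not ucos_coverage:
        raise ValueError("at least one UCOS contig is required")
    ucos_mean = mean(ucos_coverage.values())
    if ucos_mean <= 0:
        raise ValueError("mean UCOS coverage must be positive")
    return {name: cov / ucos_mean for name, cov in te_coverage.items()}


# Illustrative values only: a TE contig covered 400x in a library whose
# single-copy genes average 20x is present at roughly 20 copies.
example = relative_copy_number(
    {"gypsy_family_1": 400.0, "hAT_family_1": 50.0},
    {"ucos_a": 10, "ucos_b": 20, "ucos_c": 30},
)
```

Because both numerator and denominator come from the same library, the ratio is insensitive to differences in sequencing depth between libraries.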
Phylogenetic analysis using the UCOS contigs
A phylogenetic matrix was constructed by using the 245 UCOS contigs identified above as reference for short read mapping and by calling SNPs using previously described methods. Theobroma cacao cv. Scavina-6 was excluded from the phylogenetic analysis due to low sequencing coverage. For the construction of the matrix, only positions covered by 6 or more high quality reads in a given sample, with base quality equal to or greater than 20 (column 6 in samtools pileup -vcf output), were used. Positions containing any ambiguous nucleotides, i.e. heterozygotes, were converted to Ns, as were all other positions that did not meet the previously mentioned criteria. Finally, Ns were converted to gaps, and trimAl v.1.2 was used to remove all gaps and to convert the alignments to nexus format (trimAl flags: -nogap -nexus). All alignments shorter than 50 nucleotides were excluded, leaving 97 UCOS for further analysis. A matrix with positional information for each of the UCOS contigs was constructed using phyutility v.2.2.4 (phyutility flags: -concat) (Additional file 6), for a combined analysis that includes separate analyses of each contig using a coalescence-based program (see below). Gene trees of individual UCOS alignments longer than 50 nucleotides were estimated with RAxML v7.2.6, using 10 independent runs and the GTRGAMMA sequence substitution model (raxml flags: -m GTRGAMMA -N 10). In order to estimate a single phylogeny of the three species and the eight remaining T. cacao varieties (Scavina-6 excluded), a STAR (species trees based on average ranks of coalescences) phylogeny was constructed using the phybase R package (v.1.3). STAR uses the mean ranks of coalescent occurrences in a set of gene trees to construct a species tree topology.
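The site-filtering logic used to build the matrix can be sketched in Python as follows. This mimics the effect of the masking and gap-removal steps on a small in-memory alignment; it is not trimAl's implementation, and the function names are hypothetical.

```python
VALID_BASES = {"A", "C", "G", "T"}


def mask_base(base, depth, min_depth=6):
    """Keep an unambiguous base call with sufficient depth, otherwise 'N'."""
    base = base.upper()
    if depth < min_depth or base not in VALID_BASES:
        return "N"
    return base


def drop_incomplete_columns(alignment):
    """Remove every column in which any sample has an N or a gap.

    alignment: dict of sample name -> aligned sequence (equal lengths).
    """
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if all(b in VALID_BASES for b in col)]
    return {name: "".join(col[i] for col in kept)
            for i, name in enumerate(names)}


def passes_length_filter(alignment, min_length=50):
    """True if the filtered alignment is at least min_length columns long."""
    return all(len(seq) >= min_length for seq in alignment.values())
```

An ambiguity code such as R (a heterozygous A/G call) is masked by mask_base regardless of depth, so heterozygous positions never reach the final matrix.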
In order to estimate branch lengths on the STAR tree, model parameters of the entire matrix were estimated using jModelTest v2.0.2 [49, 50], and GARLI v2.0 was used to optimize model parameters and to add branch lengths to the STAR tree. Support values for the STAR phylogeny were estimated using a multilocus bootstrap method implemented in the phybase package. One thousand multilocus bootstrap replicates were analyzed using PhyML v3.0, STAR trees were estimated for each set of bootstrap replicates, and a consensus tree was constructed from all the STAR trees using the consense program in the phylip package v3.69.
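As an illustration of the resampling idea, the Python sketch below draws one multilocus bootstrap replicate in two stages: loci are resampled with replacement, and then sites are resampled with replacement within each drawn locus. This two-stage scheme is an assumption about the general form of a multilocus bootstrap; it is not taken from the phybase source code, and the function name is hypothetical.

```python
import random


def multilocus_bootstrap(loci, rng):
    """Draw one two-stage bootstrap replicate from a list of alignments.

    loci: list of alignments, each a dict of sample name -> aligned sequence
          (sequences within a locus have equal length).
    rng:  a random.Random instance, passed in for reproducibility.
    """
    replicate = []
    for _ in range(len(loci)):
        locus = rng.choice(loci)                     # stage 1: resample loci
        n_sites = len(next(iter(locus.values())))
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]  # stage 2: sites
        replicate.append({sample: "".join(seq[c] for c in cols)
                          for sample, seq in locus.items()})
    return replicate
```

Each replicate would then be passed to the gene-tree and species-tree steps, and support values read from the consensus of the resulting trees.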
Graph based clustering of the Illumina reads
The repetitive elements of the three species studied here were also investigated in a de novo fashion using RepeatExplorer, a graph based clustering method for characterizing repetitive elements described in , with the program’s default settings. The Criollo-22 individual was chosen as the T. cacao representative. Briefly, RepeatExplorer uses information from sequence similarity among the reads and their partial overlap to construct graphs. Graphs are constructed using a Louvain method, where sequence reads are represented by vertices, edges connect overlapping reads and edge weights correspond to the similarity score among reads. The graph layouts are then examined in order to find separate clusters of reads that are often connected and correspond to distinct families of genomic repeats. These clusters are analyzed with regard to their size, determined by the number of reads comprising each cluster, as well as their graph topology, which gives information about their structure and variability. RepeatExplorer also performs a sequence similarity search of each cluster against RepBase in order to identify the type of the repetitive elements present in the cluster. If the predicted number of nodes exceeds the capacity of the available RAM, RepeatExplorer randomly subsamples the reads. RepeatExplorer outputs a comma separated value (csv) file containing relevant information on the clusters it identified that consist of 0.01% or more of the reads used in the analysis (default cut-off). The program calculates the genome percentage, which is the number of reads in each cluster divided by all the reads used in the graph based clustering (11,243,224 reads in total). An in-house python script (available on request) was used to parse the csv output file and combine parts of it with the figures of graph layouts.
The output of that script is a panel of graph layouts, with each cluster’s most abundant class of element, in addition to the genome percentage and number of paired-end reads belonging to the cluster.
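The genome percentage and the reporting cut-off described above reduce to simple arithmetic, sketched here in Python. The cluster names and read counts in the example are invented for illustration, the function name is hypothetical, and this is not the in-house script mentioned in the text.

```python
def genome_percentages(cluster_sizes, total_reads, cutoff_percent=0.01):
    """Percentage of analysed reads in each cluster, above a cut-off.

    cluster_sizes: dict of cluster name -> number of reads in the cluster
    total_reads:   number of reads used in the clustering run
    cutoff_percent: clusters below this percentage are not reported
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    reported = {}
    for cluster, n_reads in cluster_sizes.items():
        percent = 100.0 * n_reads / total_reads
        if percent >= cutoff_percent:
            reported[cluster] = percent
    return reported


# Invented counts: with 1,000,000 reads, a cluster of 50 reads (0.005%)
# falls below the 0.01% cut-off and is dropped.
example = genome_percentages(
    {"CL1": 100_000, "CL2": 100, "CL3": 50}, total_reads=1_000_000
)
```

With the 11,243,224 reads used in this study, the 0.01% cut-off corresponds to clusters of roughly 1,124 reads or more.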
Similarity of TE composition among sequenced individuals was investigated with a principal component analysis (PCA) using coverage of each TE super-family in the genomes of H. balaensis, T. grandiflorum and the nine T. cacao varieties. The PCA was performed using the prcomp function in R v2.14.1, using the abundance of each super-family as explanatory variables with a natural logarithmic transformation and scale = TRUE. An in-house R script was used to run the PCA analysis on all sub-sampled data sets. The reduced data set ensured that differences in sequencing depth and read length did not affect the results.
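The preprocessing applied before the PCA (natural log transformation, then centering and scaling each variable to unit variance, as prcomp does when scale = TRUE) can be sketched in standard-library Python. This shows only the transformation step, not the decomposition itself; the function name is hypothetical, and it assumes all abundances are positive and no super-family has zero variance across samples.

```python
import math
from statistics import mean, stdev


def log_and_scale(matrix):
    """Log-transform, center and scale a samples x super-families matrix.

    matrix: list of rows (one per sample), each a list of positive
            abundances (one per TE super-family).
    Scaling uses the sample standard deviation (n - 1 denominator).
    """
    logged = [[math.log(value) for value in row] for row in matrix]
    columns = list(zip(*logged))
    centers = [mean(col) for col in columns]
    scales = [stdev(col) for col in columns]
    return [[(value - center) / scale
             for value, center, scale in zip(row, centers, scales)]
            for row in logged]
```

The principal components would then be obtained from a singular value decomposition of the returned matrix, which gives every super-family equal weight regardless of its absolute abundance.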
The authors would like to thank Armando Geraldes and Charles Hefer for useful discussion. We are grateful to both Hannes Dempewolf and Jan Engels, who were instrumental in setting up the cacao sequencing project. We would also like to thank Jeannette Whitton for useful comments and proofreading of the manuscript, and the staff of Cofactor Genomics for their helpfulness during sequencing. We are grateful for funding for this project received from the World Bank (Development Marketplace competition), NSERC (discovery grant to Q.C.) and UBC Department of Botany scholarship funding (to S.S.).
- Kumar A, Bennetzen JL: Plant retrotransposons. Annu Rev Genet. 1999, 33: 479-532. 10.1146/annurev.genet.33.1.479.View ArticlePubMedGoogle Scholar
- Feschotte C, Pritham EJ: DNA transposons and the evolution of eukaryotic genomes. Annu Rev Genet. 2007, 41: 331-368. 10.1146/annurev.genet.40.110405.090448.PubMed CentralView ArticlePubMedGoogle Scholar
- Kelly LJ, Leitch IJ: Exploring giant plant genomes with next-generation sequencing technology. Chromosome Res. 2011, 19: 1-15.View ArticleGoogle Scholar
- Sun C, Shepard DB, Chong RA, Arriaza JL, Hall K, Castoe TA, Feschotte C, Pollock DD, Mueller RL: LTR retrotransposons contribute to genomic gigantism in plethodontid salamanders. Genome Biol Evol. 2012, 4: 168-183. 10.1093/gbe/evr139.PubMed CentralView ArticlePubMedGoogle Scholar
- Martin A, Troadec C, Boualem A, Rajab M, Fernandez R, Morin H, Pitrat M, Dogimont C, Bendahmane A: A transposon-induced epigenetic change leads to sex determination in melon. Nature. 2009, 461: 1135-1138. 10.1038/nature08498.View ArticlePubMedGoogle Scholar
- Zhou L, Mitra R, Atkinson PW, Hickman AB, Dyda F, Craig NL: Transposition of hAT elements links transposable elements and V (D) J recombination. Nature. 2004, 432: 995-1001. 10.1038/nature03157.View ArticlePubMedGoogle Scholar
- Craig NL, Craigie R, Gellert M, Lambowitz AM: Mobile DNA II. 2002, Washington, DC: Amer Society for MicrobiologyView ArticleGoogle Scholar
- Boeke JD, Corces VG: Transcription and reverse transcription of retrotransposons. Annu Rev Microbiol. 1989, 43: 403-434. 10.1146/annurev.mi.43.100189.002155.View ArticlePubMedGoogle Scholar
- Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leroy P, Morgante M, Panaud O: A unified classification system for eukaryotic transposable elements. Nat Rev Genet. 2007, 8: 973-982. 10.1038/nrg2165.View ArticlePubMedGoogle Scholar
- Novick PA, Smith JD, Floumanhaft M, Ray DA, Boissinot S: The evolution and diversity of DNA transposons in the genome of the lizard Anolis carolinensis. Genome Biol Evol. 2011, 3: 1-14. 10.1093/gbe/evq080.PubMed CentralView ArticlePubMedGoogle Scholar
- Vicient CM, Suoniemi A, Anamthawat-Jónsson K, Tanskanen J, Beharav A, Nevo E, Schulman AH: Retrotransposon BARE-1 and its role in genome evolution in the genus Hordeum. Plant Cell Online. 1999, 11: 1769-1784.View ArticleGoogle Scholar
- Pearce SR, Knox M, Ellis TH, Flavell AJ, Kumar A: Pea Ty1-copia group retrotransposons: transpositional activity and use as markers to study genetic diversity in Pisum. Mol Gen Genet. 2000, 263: 898-907. 10.1007/s004380000257.View ArticlePubMedGoogle Scholar
- Huang X, Lu G, Zhao Q, Liu X, Han B: Genome-wide analysis of transposon insertion polymorphisms reveals intraspecific variation in cultivated rice. Plant Physiol. 2008, 148: 25-40. 10.1104/pp.108.121491.PubMed CentralView ArticlePubMedGoogle Scholar
- Kumar A, Hirochika H: Applications of retrotransposons as genetic tools in plant biology. Trends Plant Sci. 2001, 6: 127-134. 10.1016/S1360-1385(00)01860-4.View ArticlePubMedGoogle Scholar
- Syed N, Sureshsundar S, Wilkinson M, Bhau B, Cavalcanti J, Flavell A: Ty1-copia retrotransposon-based SSAP marker development in cashew (Anacardium occidentale L.). Theor Appl Genet. 2005, 110: 1195-1202. 10.1007/s00122-005-1948-1.View ArticlePubMedGoogle Scholar
- Schulman AH, Flavell AJ, Paux E, Ellis T: The application of LTR retrotransposons as molecular markers in plants. Methods Mol Biol. 2012, 859: 115-153. 10.1007/978-1-61779-603-6_7.View ArticlePubMedGoogle Scholar
- Flavell AJ, Smith DB, Kumar A: Extreme heterogeneity of Ty1-copia group retrotransposons in plants. Mol Gen Genet. 1992, 231: 233-242.PubMedGoogle Scholar
- Wood GAR, Lass RA: Cocoa. 2001, Blackwell, UK: Longman Group, 4View ArticleGoogle Scholar
- Motamayor JC, Lachenaud P, e Mota JWS, Loor R, Kuhn DN, Brown JS, Schnell RJ: Geographic and genetic population differentiation of the Amazonian chocolate tree (Theobroma cacao L). PLoS One. 2008, 3: e3311-10.1371/journal.pone.0003311.PubMed CentralView ArticlePubMedGoogle Scholar
- Cheesman EE: Notes on the nomenclature, classification and possible relationships of cocoa populations. Trop Agric. 1944, 21: 144-159.Google Scholar
- Motamayor JC, Risterucci AM, Heath M, Lanaud C: Cacao domestication II: progenitor germplasm of the Trinitario cacao cultivar. Heredity. 2003, 91: 322-330. 10.1038/sj.hdy.6800298.View ArticlePubMedGoogle Scholar
- Kane N, Sveinsson S, Dempewolf H, Yang JY, Zhang D, Engels JMM, Cronk Q: Ultra-barcoding in cacao (Theobroma spp.; Malvaceae) using whole chloroplast genomes and nuclear ribosomal DNA. Am J Bot. 2012, 99: 320-329. 10.3732/ajb.1100570.View ArticlePubMedGoogle Scholar
- Argout X, Salse J, Aury JM, Guiltinan MJ, Droc G, Gouzy J, Allegre M, Chaparro C, Legavre T, Maximova SN: The genome of Theobroma cacao. Nat Genet. 2010, 43: 101-108.View ArticlePubMedGoogle Scholar
- Chaparro C, Sabot F: Methods and software in NGS for TE analysis. Methods Mol Biol. 2012, 859: 105-114. 10.1007/978-1-61779-603-6_6.View ArticlePubMedGoogle Scholar
- Tenaillon MI, Hufford MB, Gaut BS, Ross-Ibarra J: Genome size and transposable element content as determined by high-throughput sequencing in maize and Zea luxurians. Genome Biol Evol. 2011, 3: 219-10.1093/gbe/evr008.PubMed CentralView ArticlePubMedGoogle Scholar
- Sabot F, Picault N, El-Baidouri M, Llauro C, Chaparro C, Piegu B, Roulin A, Guiderdoni E, Delabastide M, McCombie R: Transpositional landscape of the rice genome revealed by paired-end mapping of high-throughput re-sequencing data. Plant J. 2011, 66: 241-246. 10.1111/j.1365-313X.2011.04492.x.View ArticlePubMedGoogle Scholar
- Macas J, Neumann P, Navratilova A: Repetitive DNA in the pea (Pisum sativum L.) genome: comprehensive characterization using 454 sequencing and comparison to soybean and Medicago truncatula. BMC Genomics. 2007, 8: 427. 10.1186/1471-2164-8-427.
- Wicker T, Narechania A, Sabot F, Stein J, Vu G, Graner A, Ware D, Stein N: Low-pass shotgun sequencing of the barley genome facilitates rapid identification of genes, conserved non-coding sequences and novel repeats. BMC Genomics. 2008, 9: 518. 10.1186/1471-2164-9-518.
- Hribova E, Neumann P, Matsumoto T, Roux N, Macas J, Doležel J: Repetitive part of the banana (Musa acuminata) genome investigated by low-depth 454 sequencing. BMC Plant Biol. 2010, 10: 204. 10.1186/1471-2229-10-204.
- Wessler SR: Transposable elements and the evolution of eukaryotic genomes. Proc Natl Acad Sci. 2006, 103: 17600-17601. 10.1073/pnas.0607612103.
- Gabriel A, Willems M, Mules EH, Boeke JD: Replication infidelity during a single cycle of Ty1 retrotransposition. Proc Natl Acad Sci. 1996, 93: 7767-7771. 10.1073/pnas.93.15.7767.
- Kalendar R, Tanskanen J, Immonen S, Nevo E, Schulman AH: Genome evolution of wild barley (Hordeum spontaneum) by BARE-1 retrotransposon dynamics in response to sharp microclimatic divergence. Proc Natl Acad Sci. 2000, 97: 6603-6607. 10.1073/pnas.110587497.
- Figueira A, Janick J, Goldsbrough P: Genome size and DNA polymorphism in Theobroma cacao. J Am Soc Hortic Sci. 1992, 117: 673-677.
- Marie D, Brown SC: A cytometric exercise in plant DNA histograms, with 2C values for 70 species. Biological Cell. 1993, 78: 41-51. 10.1016/0248-4900(93)90113-S.
- Xu Z, Wang H: LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007, 35: W265-W268. 10.1093/nar/gkm286.
- Ellinghaus D, Kurtz S, Willhoeft U: LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinforma. 2008, 9: 18. 10.1186/1471-2105-9-18.
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005, 110: 462-467. 10.1159/000084979.
- Li H, Durbin R: Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26: 841-842. 10.1093/bioinformatics/btq033.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.
- Steinbiss S, Willhoeft U, Gremme G, Kurtz S: Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Res. 2009, 37: 7002-7013. 10.1093/nar/gkp759.
- Kozik A, Matvienko M, Kozik I, Van Leeuwen H, Van Deynze A, Michelmore R: Eukaryotic ultra conserved orthologs and estimation of gene capture in EST libraries [abstract]. Plant and Animal Genomes Conference. 2008, 16: P6.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T: trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009, 25: 1972-1973. 10.1093/bioinformatics/btp348.
- Smith SA, Dunn CW: Phyutility: a phyloinformatics tool for trees, alignments and molecular data. Bioinformatics. 2008, 24: 715-716. 10.1093/bioinformatics/btm619.
- Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22: 2688-2690. 10.1093/bioinformatics/btl446.
- Liu L, Yu L, Pearl DK, Edwards SV: Estimating species phylogenies using coalescence times among sequences. Syst Biol. 2009, 58: 468-477. 10.1093/sysbio/syp031.
- Liu L, Yu L: PHYBASE: an R package for phylogenetic analysis. Bioinformatics. 2010, 26: 962-963. 10.1093/bioinformatics/btq062.
- Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52: 696-704. 10.1080/10635150390235520.
- Posada D: jModelTest: phylogenetic model averaging. Mol Biol Evol. 2008, 25: 1253-1256. 10.1093/molbev/msn083.
- Zwickl DJ: Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. PhD thesis. 2006, University of Texas.
- Seo TK: Calculating bootstrap probabilities of phylogeny using multilocus sequence data. Mol Biol Evol. 2008, 25: 960-971. 10.1093/molbev/msn043.
- Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010, 59: 307-321. 10.1093/sysbio/syq010.
- Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. 1993, Seattle: Department of Genetics, University of Washington.
- Novak P, Neumann P, Pech J, Steinhaisl J, Macas J: RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next generation sequence reads. Bioinformatics. 2013, 29: 792-793. 10.1093/bioinformatics/btt054.
- Novák P, Neumann P, Macas J: Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinforma. 2010, 11: 378. 10.1186/1471-2105-11-378.
- Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E: Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment. 2008, 2008 (10): P10008.
- Ihaka R, Gentleman R: R: a language for data analysis and graphics. J Comp Graph Stat. 1996, 5: 299-314.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.