Tandem repeats derived from centromeric retrotransposons
© Sharma et al.; licensee BioMed Central Ltd. 2013
Received: 22 November 2012
Accepted: 23 February 2013
Published: 4 March 2013
Tandem repeats are ubiquitous and abundant in higher eukaryotic genomes and constitute, along with transposable elements, much of the DNA underlying centromeres and other heterochromatic domains. In maize, the centromeric satellite repeat (CentC) and centromeric retrotransposons (CR), a class of Ty3/gypsy retrotransposons, are enriched at centromeres. Some satellite repeats have homology to retrotransposons, and several mechanisms have been proposed to explain the expansion, contraction, and homogenization of tandem repeats. However, the origin and evolution of tandem repeat loci remain largely unknown.
CRM1TR and CRM4TR are novel tandem repeats that we show to be entirely derived from CR elements belonging to two different subfamilies, CRM1 and CRM4. Although these tandem repeats clearly originated in at least two separate events, they are derived from similar regions of their respective parent element, namely the long terminal repeat (LTR) and untranslated region (UTR). The 5′ ends of the monomer repeat units of CRM1TR and CRM4TR map to different locations within their respective LTRs, while their 3′ ends map to the same relative position within a conserved region of their UTRs. Based on the insertion times of heterologous retrotransposons that have inserted into these tandem repeats, amplification of the repeats is estimated to have begun at least ~4 (CRM1TR) and ~1 (CRM4TR) million years ago. Distinct CRM1TR sequence variants occupy the two CRM1TR loci, indicating that there is little or no movement of repeats between loci, even though they are separated by only ~1.4 Mb.
The discovery of two novel retrotransposon derived tandem repeats supports the conclusions from earlier studies that retrotransposons can give rise to tandem repeats in eukaryotic genomes. Analysis of monomers from two different CRM1TR loci shows that gene conversion is the major cause of sequence variation. We propose that successive intrastrand deletions generated the initial repeat structure, and gene conversions increased the size of each tandem repeat locus.
Maize centromeres are enriched in the tandem centromeric satellite repeat CentC and centromeric retrotransposons (CR). The centromeric retrotransposons of maize (CRM) are characterized by an integrase that contains a CR motif [1, 2], and elements belonging to subfamilies CRM1, CRM2 and CRM3 integrate predominantly at the centromeres of corn. Recombinant elements have been identified within the CRM1 and CRM4 subfamilies. The exact role of the CentC and CRM elements in corn centromeres remains unknown.
Tandem repeats are major constituents of higher eukaryotic genomes and are typically localized to specialized chromosomal regions such as centromeres, telomeres, and heterochromatic knobs. Satellite repeats are thought to play a role in organizing and stabilizing these specialized chromosomal features, which are important for chromosome behavior during cell division. In addition to noncoding satellite repeats, many genes, such as histones and ribosomal RNA genes, are amplified and arrayed in tandem on chromosome arms. These tandemly arrayed genes (TAG) are thought to provide the large quantities of protein or RNA products required for important physiological and biological functions [7, 8].
While some satellite repeats are chromosome-specific, others are more broadly distributed. In maize, the 741 nt Cent4 satellite repeat is localized near the centromere of chromosome 4, while the 156 nt centromere-specific satellite repeat CentC is present at all centromeres. The 180 bp and 360 bp TR1 tandem knob repeats were detected on all eight maize chromosomes analyzed, although their copy number on different chromosomes varies greatly.
Many satellite repeat arrays and TAGs seem to undergo concerted evolution by sequence homogenization. Random homology-dependent unequal crossovers, replication slippage followed by unequal crossing over, rolling circle replication of extrachromosomal circular DNAs [16–18], segmental duplication, and gene conversion have all been proposed to account for the sequence homogeneity, higher order structure, and rapid expansion and contraction of tandem arrays.
The origin of novel satellite repeats, however, remains elusive. Tandem repeats with homology to parts of retrotransposons have been identified in several plants, e.g., wheat, rye, and potato [22, 23]. The Cent4 tandem repeat of maize has homology to the telomeric repeat, the knob repeat, and the B chromosome centromere. Tandem repeats with homology to the intergenic spacer of ribosomal DNA have been described in several plants, including potato, common bean, tobacco, and tomato.
Here we report the discovery and characterization of two CR-derived tandem repeats, describe the sequence features shared by them, and propose molecular mechanisms responsible for their generation, amplification and homogenization.
CRM1TR repeats are organized into 44 uninterrupted tandem arrays, or islands: 28 islands at locus I and 16 islands at locus II. These islands are separated from each other either by gaps in the physical assembly or by intervening sequences, e.g. LTR retrotransposons. We identified at least 5 full length LTR retrotransposon insertions, dated between 0.54 and 4.09 My (κ = 0.00699 and 0.05319, respectively), within CRM1TR arrays at locus I. This indicates that the locus I CRM1TR arrays originated around the time of the allotetraploidization event that resulted in modern-day corn. No full-length retrotransposon insertions were detected within locus II CRM1TR arrays, suggesting either that locus I is older than locus II, or that retrotransposons are more efficiently removed from, or less likely to insert into, locus II.
Locus I CRM1TR monomers have the general structure ‘S-IR-EB’ while those from locus II have the general structure ‘S-IR*-EA/EB’, where IR* indicates 1 to 4 tandem copies of IR and EA/EB indicates either EA or EB. Most (66.7%, 60/90) locus II CRM1TR monomers have three tandem copies of the IR sequence. Other locus II CRM1TR monomers have either four (6/90), two (22/90), or one (2/90) IR sequence. Consequently, locus II monomers are longer (819 nt average) and more variable in length than locus I monomers (average 696 nt) (Additional file 1). Figure 2a illustrates the structure of CRM1TR monomers as well as their arrangement at locus I and locus II.
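The relationship between the LTR divergence κ and the insertion ages quoted above can be sketched as follows. This is a minimal illustration, not the authors' script: it uses the standard relation T = κ / (2r), and the substitution rate r = 6.5e-9 per site per year is an assumption (the text does not state the rate) chosen because it reproduces the reported ages. The function name `insertion_time_my` is hypothetical.

```python
# Assumed substitution rate (substitutions per site per year).
# The text does not state the rate; this value is an assumption that
# happens to reproduce the reported ages.
ASSUMED_RATE = 6.5e-9


def insertion_time_my(kappa: float, rate: float = ASSUMED_RATE) -> float:
    """Convert LTR divergence (kappa) into an insertion age in million years.

    Both LTRs are identical at insertion and diverge independently,
    hence the factor of 2 in T = kappa / (2 * rate).
    """
    if kappa < 0 or rate <= 0:
        raise ValueError("kappa must be >= 0 and rate must be > 0")
    return kappa / (2.0 * rate) / 1e6


# Divergence values reported for the youngest and oldest locus I insertions.
for kappa in (0.00699, 0.05319):
    print(f"kappa={kappa:.5f} -> {insertion_time_my(kappa):.2f} My")
```

Under this assumed rate, the two κ values correspond to 0.54 and 4.09 My, matching the range given in the text.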
To investigate the sequence homogenization within and across CRM1TR arrays, we analyzed diagnostic variant nucleotides (SNPs) within ‘S’ and ‘IR’ sequences from fl-CRM1TR monomers. A set of seven SNPs within ‘S’ distinguishes monomers from locus I and II (Additional file 2). Based on these SNPs, the S sequences from 97 locus I CRM1TR monomers were grouped into 14 haplotypes and those from 90 locus II CRM1TR monomers were grouped into 18 haplotypes. Most S sequences from locus I CRM1TR monomers (76%) are of the haplotype ‘ACGGAGT’ while most of those at locus II are of the haplotype ‘GTCACTC’ (53.9%) or ‘GCCGCTC’ (17.9%) (Additional file 3 and Additional file 4).
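The haplotype grouping described above can be sketched in a few lines. This is an illustrative example under stated assumptions: the sequences are already aligned, the diagnostic SNP columns are known, and the toy alignment and column indices below are hypothetical rather than taken from the Additional files. The function names are also hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def haplotype(aligned_seq: str, snp_columns: Sequence[int]) -> str:
    """Concatenate the bases found at the diagnostic SNP columns (0-based)."""
    return "".join(aligned_seq[i] for i in snp_columns)


def haplotype_frequencies(
    aligned: List[str], snp_columns: Sequence[int]
) -> Dict[str, Tuple[int, float]]:
    """Map each haplotype to (count, percent of monomers), most common first."""
    counts = Counter(haplotype(seq, snp_columns) for seq in aligned)
    total = sum(counts.values())
    return {h: (n, 100.0 * n / total) for h, n in counts.most_common()}


# Hypothetical toy alignment with two diagnostic columns (0 and 3).
toy_alignment = ["AACGT", "AACGT", "GACTT", "AACGT"]
toy_columns = [0, 3]
for hap, (n, pct) in haplotype_frequencies(toy_alignment, toy_columns).items():
    print(hap, n, f"{pct:.1f}%")
```

With the seven diagnostic columns of the real S alignment, the same tally would yield seven-letter haplotypes such as ‘ACGGAGT’ together with their frequencies per locus.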
No haplotypes of ‘S’ were shared between locus I and II, but a search of the htgs database of GenBank using the consensus S sequences of locus I and locus II identified the respective parent BAC(s) as the best match, followed by the BAC corresponding to the other CRM1TR locus. This suggests that the S haplotypes at the two CRM1TR loci, though distinct, are more closely related to each other than to any full-length CRM element in the sequenced maize genome and may even have originated from the same ancestral sequence.
For IR haplotype analysis, we used all locus I CRM1TR monomers but only those locus II monomers with three IR copies. The IR sequences contain a terminal A-rich region that is approximately 38 nt long in locus I monomers and ~25 nt long in locus II monomers (Additional file 2). This A-rich region was not included in SNP analysis because it is highly polymorphic. A set of 7 SNPs distinguished IR sequences from locus I and II. Based on these SNPs, the 97 ‘IR’ sequences from 97 locus I monomers were grouped into 10 haplotypes, whereas the 180 ‘IR’ sequences from 60 locus II monomers with three IR copies were grouped into 28 haplotypes. Most IR sequences from locus I monomers are of the haplotype CACCCTA (55.5%) or CACCCTC (33%) (Additional file 3 and Additional file 4). Position-specific IR haplotypes were detected in locus II monomers with three IR repeats, i.e. haplotype CACCTCT dominates at the first IR, IR.c1 (76.7%, 46/60), the related haplotypes TGACCTA (51.7%, 31/60) or TGACCTC (28.3%, 17/60) dominate at the second IR, IR.c2, and the related haplotypes TGATCTA (63.3%, 38/60) or TGATCTC (18.3%, 11/60) dominate at the third IR, IR.c3 (Additional file 3 and Additional file 4). This position specificity was absent in monomers with one, two or four IRs. For instance, the IR.c3 haplotype TGATCTA is located at the second IR in 12 of 22 monomers with two IRs and at the fourth IR in 3 of 6 monomers with four IRs (data not shown). These data suggest that CRM1TR monomers with one, two, or four IRs likely originated from CRM1TR monomers with three IRs via recombination.
The two IR haplotypes present in the majority of locus I monomers are also represented at locus II, albeit only in three locus II monomers, i.e. locus I haplotype CACCCTC also matches the first IR of a one-IR and a two-IR containing monomer at locus II, and locus I haplotype CACCCTC matches the first IR of a two-IR containing monomer at locus II. Nonetheless, all three IR copies from locus II that share a haplotype with locus I monomers can be distinguished from locus I monomers based on their shorter A-rich region at the termini, which is similar to that of other locus II monomers. These data may be indicative of a common origin of the two CRM1TR loci or a low level of intrachromosomal gene conversion between the two CRM1TR loci.
Distinct S and IR haplotypes at the two CRM1TR loci suggest that CRM1TR monomers are locally homogenized, and that the two loci followed separate evolutionary trajectories. This is reminiscent of centromeric satellites in Arabidopsis, where repeats derived from the same genomic locus were discovered to be more similar to each other than those from disparate genomic regions. Similarly, analysis of centromeric repeats in human X chromosomes revealed that repeats that are near each other (within ~15 kb) were significantly more similar than those at random loci either on the same or different X chromosomes.
To assess whether the three IRs present in most locus II CRM1TR monomers formed de novo or originated from existing CRM1 elements, we searched the htgs database of GenBank for CRM1 elements with multiple copies of the IR sequence. Most full length CRM1B retrotransposons contain a single IR copy that is similar in sequence to the IR sequence in locus I CRM1TR monomers, but we discovered at least two full length CRM1 copies (refGen_v2 chr4 coordinates 108,400,064-108,393,127 and 107,845,660-107,838,735) that contain three IRs similar in sequence to the three IRs in most locus II CRM1TR monomers. SNPs shared between the IR regions of the two CRM1B elements and the consensus sequence of locus II CRM1TR monomers (Additional file 2 and Additional file 4) suggest that the three IRs in most locus II monomers originated from existing CRM1B elements rather than by de novo triplication of IR sequence within locus II CRM1TR monomers.
Table: Sequence similarity between CRM1TR subsequences and CRM1A or CRM1B consensus sequences (columns: maximum bitscore with CRM1_A, maximum bitscore with CRM1_B; rows include CRM1TR_II_EB and CRM1TR_II_EA).
CRM4TR is derived from a member of the CRM4 subfamily. Full length CRM4TR monomers share 95-98% sequence identity with a ~1370 nt LTR-UTR segment (nucleotide positions 138 to 1507) of the full length CRM4B retrotransposon consensus sequence (Figures 1 and 2b). CRM4TR repeat arrays are located at a single genomic locus spanning coordinates 47,638,635 to 47,807,586 on chr 6 of RefGen_v2. This locus lies in the pericentromeric region, ~1.8 Mb from the functional centromere, and is spanned by two overlapping BACs (c0466I13 and c0290C08) that lack CentC as well as the other centromere-enriched retrotransposons CRM1/2/3 (data not shown).
CRM4TR arrays are organized into 6 islands separated by either gaps in the physical assembly or intervening sequences, including LTR retrotransposons. We identified two nested insertions of retrotransposon A188 (GenBank accession ZMU11059) in opposite orientation between the first two islands and estimated their insertion times at 0.94 My (LTR edit distance κ = 0.0122) and 0.79 My (κ = 0.0103), suggesting that this CRM4TR array has existed in the maize genome for at least ~1 My and may have formed more recently than CRM1TR locus I.
The 39 fl-CRM4TR monomers from the CRM4TR locus range in length from 1369 to 1391 nt (average size 1386 nt). The CRM4TR consensus sequence contains roughly 13 repetitions of an ~18 nt sequence near its 3′ end (Additional file 5). Based on shared SNPs, repeats 2, 4, 6, 10, and 12 form group 1, repeats 3, 5, and 7 form group 2, and repeats 8, 10, and 12 form group 3. These repeats thus form at least two sets of ~36 nt composite repeats, where set I includes composite pairs 2–3, 4–5, and 6–7 and set II includes composite pairs 8–9, 10–11, and 12–13 (Additional file 5).
CRM1TR and CRM4TR repeats include part of their parent elements’ LTR and UTR, but the relative start sites in the CRM1 and CRM4 LTRs differ. The 5′ end of CRM4TR monomers lies ~395 nt upstream of the TATA box, and therefore in the U3 region, and includes several motifs conserved between the CRM1 and CRM4 LTRs, i.e. the TATA box, the recombination breakpoint (RB), and the C-rich and T-rich motifs (Additional file 6). In contrast, the 5′ end of CRM1TR monomers is located ~180 nt downstream of the TATA box in the U5 region and thus lacks all of the sequence motifs listed above.
Remarkably, CRM1TR and CRM4TR monomers terminate at similar locations within their respective UTRs. A 251 nt region of the CRM1B UTR is homologous (73% sequence identity) to a 263 nt region of the CRM4 UTR. CRM1TR and CRM4TR monomers terminate within 12 nucleotides of each other, i.e. at positions 100 and 113, respectively, within this homologous region. The first 99 nucleotides and the terminal 143 nt of the homologous region are 72% and 83% identical, respectively, between CRM1 and CRM4. The terminal 143 nt of the homologous region also contains a polypurine-rich region (Additional file 7). The conserved ~250 nt UTR region was not detected in CRM2.
A GenBank search revealed one CRM1TR- and one CRM4TR-derived EST spanning the junction between two monomers (Figure 1). In addition, 3 full length cDNAs (Fl-cDNA) originating from CRM4TR repeats were detected in two different cDNA libraries (Additional file 8). These ESTs and cDNA clones were prepared from polyadenylated RNA. All three CRM4TR-derived cDNAs had spliced out most of the UTR region and one cDNA had an additional splice in the LTR (Figure 1). The 5′ ends of the three CRM4TR-derived Fl-cDNAs are located 18, 40, and 43 nt downstream of the TATA box in the CRM4TR monomers. No CRM1TR-derived full length cDNAs were detected, which may reflect the lack of a TATA box and other regulatory regions required for transcription by RNA polymerase II. The impact or role, if any, of the inclusion or exclusion of transcription regulatory signals in CRM4TR and CRM1TR on local chromatin organization and/or function is unclear. Although the role, if any, of non-coding polyadenylated transcripts in plant heterochromatin formation remains largely unknown, a recent study in Arabidopsis showed that RNA polymerase II transcription recruits Pol IV and Pol V at heterochromatic loci to promote siRNA biogenesis and siRNA-mediated transcriptional gene silencing.
The maize genome contains hundreds of full-length centromeric retrotransposons belonging to six different subfamilies (CRM1-CRM6) (Sharma et al. unpublished). Three of these (CRM1-CRM3) have the ability to target their insertion to the functional centromeres as defined by CENH3. Maize centromeres also contain the tandem repeat CentC, which shares high sequence homology with the CentO repeat found at rice centromeres, indicating that this centromeric repeat likely has resided at the centromeres of these two species since they diverged about 50 My ago. It is not clear how these tandem centromeric repeats arose, but the discovery of tandem repeats derived from CRM1 and CRM4 raises the possibility that CentC and CentO may have been derived from ancient centromeric retrotransposons.
CRM1TR and CRM4TR were created in at least two independent events. The two CRM1TR loci, on the other hand, seem to have a common origin at least 4 My ago. The localization of CRM1TR at only two neighboring loci separated by ~1.4 Mb on chr9 suggests that the second CRM1TR locus originated either by retrotransposon insertions and/or other genomic rearrangements that separated the original cluster into two, or by intrachromosomal gene conversion events that transferred some CRM1TR repeats to the second locus, possibly by recombination with a CRM1 sequence. Several studies in yeast indicate that intrachromosomal gene conversion events are frequent and may result in long conversion tracts [37–40].
Transposable elements are a major source of tandem repeats, although a few cases of acquisition of satellite repeat monomers by transposable elements have also been reported. Roughly one quarter of all minisatellites/satellites in the human genome are derived from transposable elements. The creation of tandem repeats from retrotransposons, including CR elements, has been documented a number of times. For example, at least four of the centromeric repeats in potato are amplified from retrotransposon-related sequences, and the 4.7 kb monomer of the sobo satellite repeat of wild potato, which spans ~360 kb in the chromosome 7 pericentromere, shares sequence similarity with the LTRs of the Sore1 gypsy retrotransposon, satellite repeat, and genomic DNA. In rodents, the 348 bp monomer of the satellite repeat RPCS shares sequence identity with the U3 region of the LTR of the Rous sarcoma virus, and contains several sequence elements that are characteristic of retroviral LTRs such as a polypurine tract, CCAAT boxes, a TATA box and putative polyadenylation signals, as well as binding sites for the CCAAT/enhancer-binding protein (C/EBP) and CCAAT proteins related to NF-1 [42, 43]. In rye, the satellite repeat family E3900 has sequence derived from a Ty3-gypsy retrotransposon while the D1100 family contains a rearranged MITE element. In wheat and rye, a centromeric repeat with a 250 bp repeat unit has 53% amino acid sequence similarity to the Cereba (a CR family retrotransposon) gag gene containing a CAA microsatellite, and the satellite 1 family of Xenopus laevis has homology to a SINE. Satellite sequences with homology to the 3′ UTR regions of plant retrotransposons belonging to the Tat-lineage, which frequently contain variable tandem repeats, have been identified in several plant species.
In some cases, partial homology between satellite repeats and retrotransposons has been attributed to acquisition and dispersal of satellite repeat sequence by transposable elements. For example, in Drosophila, direct terminal repeats of the functional pDv elements might have been derived from the pvB370 satellite DNA family through insertion of a tandemly repeated 36-bp transcription unit. Similarly, acquisition and subsequent dispersal of part of a TCAST satellite DNA sequence by a retrotransposon has been proposed to explain the distribution of TCAST elements in the vicinity of genes within euchromatin.
We have discovered several parallels between CRM1TR and CRM4TR, which may begin to provide some clues about the transition from autonomous retrotransposon to tandem repeat. First, it is noteworthy that the only CRM subfamilies that gave rise to tandem repeats are those for which recombinant elements have been documented. The CRM1 and CRM4 recombinants were postulated to have arisen from nested insertions of related elements, which suggests that the CR tandem repeats may also have arisen from nested insertions. Second, similar regions of the parental CRM element (i.e. several hundred nucleotides upstream and downstream of the LTR-UTR junction) gave rise to the tandem repeats in each case, indicating perhaps that similar mechanisms were involved in the creation and maintenance of these tandem repeats. Third, all three CRM-derived tandem repeat loci lie in the centromere or pericentromere, regions that are likely subject to large physical forces and possibly frequent chromosome breakage.
The fact that CRM-derived repeats are located in or near the centromeres, chromosomal loci where meiotic crossing over is suppressed but gene conversions are frequent [49, 50], suggests that CRM-derived repeats were amplified and homogenized primarily by non-allelic/interlocus gene conversion (Figure 3). The amenability of these tandem repeat loci to gene conversion in general is further evidenced by our discovery of recombinant CRM1TR monomers, such as those that have replaced their CRM1B-type ends with a CRM1A-derived sequence. However, we cannot exclude that other mechanisms, such as unequal exchange between sister chromatids (during mitosis) [50–52] and insertion of the products of rolling circle replication of an extrachromosomal circular DNA, may also have contributed to the amplification and homogenization of CRM-derived repeats.
The presence of megabase-spanning satellite repeats is a hallmark of centromeres, although there is little detectable sequence conservation. For example, in Oryza brachyantha the centromeric satellite CentO has been replaced with CentO-F, which has no sequence similarity to CentO, since divergence from O. sativa about 10–15 million years ago [54, 55]. Centromere-specific satellite repeats that span nearly the entire core of six potato centromeres are composed of long monomers (979 bp to 5.4 kb). Four of these six centromeric repeats are derived from retrotransposon-related sequences. Array sizes of centromere and knob repeats vary substantially in different maize inbreds, and maize centromeres have been described as dynamic loci that can shift their position over time away from centromeric satellite repeats. Although the function of centromeric repeats is still unknown, it may be to preferentially bind nucleosomes containing the centromeric histone variant CENH3 [58, 59]. CENH3 has been characterized as a protein that evolves rapidly, possibly in concert with centromeric DNA. In vitro nucleosome reconstitution experiments suggest that CentC and the LTR/UTR region of CRM sequences preferentially bind CENH3 nucleosomes (Xie et al. unpublished). Taken together with these facts, the discovery of CRM-derived tandem repeats suggests a mechanism by which the centromeric satellite repeats can be renewed or replaced using rapidly evolving retrotransposon sequences.
To our knowledge, CRM1TR and CRM4TR represent the first tandem repeats derived entirely from single parental retrotransposons. The fact that these repeats arose independently from similar regions of two centromeric retrotransposon subfamilies and exhibit high sequence homology (97%-98%) to their extant parental retrotransposons suggests that the primary DNA sequence of these LTR/UTR regions has characteristics that favor illegitimate recombination and gene conversion. The discovery of tandem repeats originating from centromeric retrotransposons, some of which are enriched in CENH3 domains, raises the intriguing possibility that centromeric satellite repeats can be renewed or replaced by novel satellite repeats derived from retrotransposons belonging to the CR family.
A BLASTable database was formatted from the maize genome assembly RefGen_v2. CRM1TR and CRM4TR repeats were identified in RefGen_v2 based on their homology to the LTR-UTR segment of fl-CRM1 and CRM4 elements using blastn. CRM1TR and CRM4TR monomers were extracted from RefGen_v2, using a custom Perl script, based on coordinates in the BLAST output file.
BLAST2seq and rpsblast were used to identify direct repeats and polyprotein domains of retrotransposons inserted between CRM-derived tandem repeats. Pairwise alignments of the 5′ and 3′ LTR of each full length retrotransposon with TSDs were generated using MUSCLE and manually corrected using BioEdit. The evolutionary distance between the 5′ and 3′ LTR pair of each retrotransposon (κ = estimated number of nucleotide substitutions per site) was estimated using the K2P model in MEGA version 5.
Multiple sequence alignments of CRM1TR S and IR sequences were generated using MUSCLE and visualized as well as manually edited using BioEdit. SNPs were discovered by visual inspection of the multiple sequence alignment and scored as haplotypes in an Excel sheet. Haplotype and monomer length graphs were generated using Excel.
CRM1TR and CRM4TR schematics were generated using Fancygene based on coordinates determined from multiple sequence alignments. Graphical visualizations of CRM-derived tandem repeat loci in RefGen_v2 were created using JunctionViewer.
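The coordinate-based extraction step can be sketched as follows. This is not the authors' Perl script, which is not shown in the text; it is a minimal Python illustration that assumes the BLAST results are in the standard 12-column tabular format (blastn `-outfmt 6`), where a subject start greater than the subject end indicates a minus-strand hit. The function names and the toy genome are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, int, int, str]  # (subject id, start, end, strand), 1-based inclusive

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_tabular_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column tabular BLAST lines into subject-coordinate hits."""
    hits: List[Hit] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        sseqid, sstart, send = fields[1], int(fields[8]), int(fields[9])
        strand = "+" if sstart <= send else "-"
        hits.append((sseqid, min(sstart, send), max(sstart, send), strand))
    return hits


def extract_monomer(genome: Dict[str, str], hit: Hit) -> str:
    """Cut the hit region out of the genome, oriented like the query."""
    chrom, start, end, strand = hit
    seq = genome[chrom][start - 1:end]  # convert 1-based inclusive to a slice
    return reverse_complement(seq) if strand == "-" else seq


# Hypothetical toy genome and two hits to the same region on opposite strands.
toy_genome = {"chrT": "AAACCCGGGTTT"}
toy_lines = [
    "q1\tchrT\t100.0\t6\t0\t0\t1\t6\t1\t6\t1e-5\t12.0",
    "q1\tchrT\t100.0\t6\t0\t0\t1\t6\t6\t1\t1e-5\t12.0",
]
monomers = [extract_monomer(toy_genome, h) for h in parse_tabular_hits(toy_lines)]
print(monomers)
```

Normalizing every hit to (start, end, strand) before slicing keeps the strand handling in one place, so plus- and minus-strand monomers come out in the same orientation for alignment.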
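The K2P (Kimura two-parameter) distance between an aligned LTR pair can be sketched as below. This is a simplified stand-in for the MEGA calculation, not a reimplementation of it: positions with a gap or an ambiguous base in either sequence are simply skipped, which is an assumption about gap handling, and the function name `k2p_distance` is hypothetical. With P and Q the proportions of transitions and transversions, the distance is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)].

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(aligned_a: str, aligned_b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences.

    Columns containing anything other than A, C, G or T in either
    sequence (gaps, N, etc.) are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites in the alignment")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument <= 0).
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))


# One transition (A->G) in ten comparable sites.
print(round(k2p_distance("ACGTACGTAC", "GCGTACGTAC"), 4))
```

The resulting value corresponds to κ in the text, the estimated number of substitutions per site between the 5′ and 3′ LTRs.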
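In this study SNPs were found by visual inspection. As a complementary illustration, the candidate variable columns of an alignment could be listed programmatically as sketched below. This is a swapped-in automated approach, not the authors' procedure; the `min_minor_count` threshold (how many sequences must carry the second most common base) and the toy alignment are assumptions made for the example.

```python
from collections import Counter
from typing import List


def variable_columns(alignment: List[str], min_minor_count: int = 2) -> List[int]:
    """Return 0-based alignment columns where at least two bases occur.

    A column is reported only if the second most common base is seen in
    at least `min_minor_count` sequences, which filters out singletons.
    Gaps and ambiguous characters are ignored.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    columns: List[int] = []
    for i in range(length):
        counts = Counter(
            seq[i].upper() for seq in alignment if seq[i].upper() in "ACGT"
        )
        if len(counts) < 2:
            continue  # invariant column
        second_most_common = counts.most_common(2)[1][1]
        if second_most_common >= min_minor_count:
            columns.append(i)
    return columns


# Hypothetical toy alignment: column 1 is shared by two sequences each,
# column 3 differs in a single sequence only.
toy = ["ACGTA", "ACGTA", "ATGTA", "ATGCA"]
print(variable_columns(toy))                      # columns with a shared variant
print(variable_columns(toy, min_minor_count=1))   # also report singletons
```

Columns flagged this way would still need manual review, for example to exclude the highly polymorphic A-rich region mentioned above.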
The dot plot was generated using the online server at http://www.vivo.colostate.edu/molkit/dnadot/.
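The idea behind a dot plot can be sketched with a simple exact word-match comparison. This is a generic illustration and not the algorithm or the parameters of the online server, which are not described in the text; the word size and the toy sequence are assumptions. When a tandem repeat is compared with itself, matches fall on diagonals spaced one monomer length apart.

```python
from typing import Dict, List, Set, Tuple


def dot_plot(seq_x: str, seq_y: str, word: int = 4) -> Set[Tuple[int, int]]:
    """Return all (i, j) where seq_x[i:i+word] equals seq_y[j:j+word]."""
    index: Dict[str, List[int]] = {}
    for j in range(len(seq_y) - word + 1):
        index.setdefault(seq_y[j:j + word], []).append(j)
    return {
        (i, j)
        for i in range(len(seq_x) - word + 1)
        for j in index.get(seq_x[i:i + word], [])
    }


def render(points: Set[Tuple[int, int]], width: int, height: int) -> str:
    """Draw the matches as text: columns are seq_x positions, rows are seq_y."""
    return "\n".join(
        "".join("*" if (i, j) in points else "." for i in range(width))
        for j in range(height)
    )


# Hypothetical tandem array of a 4 nt unit compared with itself.
toy = "ACGT" * 3
points = dot_plot(toy, toy, word=4)
print(render(points, len(toy) - 3, len(toy) - 3))
```

In the rendered output the main diagonal is accompanied by parallel diagonals offset by 4 and 8 positions, the signature of a tandemly repeated 4 nt unit.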
BAC: Bacterial artificial chromosome
CENH3: Centromeric histone H3 variant
CentC: Centromeric tandem repeat of maize
CRM: Centromeric retrotransposon of maize
LTR: Long terminal repeat of a retrotransposon
UTR: Untranslated region of a retrotransposon.
This work was funded by the National Science Foundation grant DBI 0922703.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.