Structural characterization of helitrons and their stepwise capturing of gene fragments in the maize genome

Background: As a newly identified category of DNA transposon, helitrons have been found in a large number of eukaryotic genomes. Helitrons have contributed significantly to intra-specific genome diversity in maize. Although many characteristics of helitrons in the maize genome have been well documented, the sequence of an intact autonomous helitron has not been identified in maize. In addition, the process of gene fragment capture during helitron transposition has not been characterized. Results: The whole genome sequence of maize inbred line B73 was analyzed, and 1,649 helitron-like transposons, including 1,515 helAs and 134 helBs, were identified. ZmhelA1, ZmhelB1 and ZmhelB2 each encode an open reading frame (ORF) with an intact replication initiator (Rep) motif and a DNA helicase (Hel) domain, similar to previously reported autonomous helitrons in other organisms. The putative autonomous ZmhelB1 and ZmhelB2 contain an additional replication factor A protein 1 (RPA1) transposase (RPA-TPase), including three single-strand DNA-binding domains (DBD-A, -B and -C), in the ORF. Over ninety percent of the maize helitrons identified have captured gene fragments. HelAs and helBs carry 4,645 and 249 gene fragments, which derive from 2,507 and 187 different genes, respectively. Many helitrons contain multiple terminal sequences, but only one 3'-terminal sequence has an intact "CTAG" motif. There were no significant differences in the 5'-terminal sequence between the true terminal sequence and the pseudo sequence. Helitrons not only can capture fragments, but were also shown to lose internal sequences during transposition. Conclusions: Three putative autonomous elements were identified, each encoding an intact Rep motif and a DNA helicase domain, suggesting that autonomous helitrons may exist in modern maize. The results indicate that gene fragment capture during the transposition of many helitrons happens in a stepwise way, with multiple gene fragments within one helitron resulting from several sequential transpositions. In addition, we propose a potential mechanism for how helitrons with multiple termini are generated.

Helitrons constitute over 2% of the maize genome. It was estimated that there might be tens of thousands of elements in maize inbred line B73 [13,14]. They can capture gene fragments and move around the genome, which leads to gene diversity among maize inbred lines [15]. Helitrons have contributed to the remarkable haplotype variation at the Bz (bronze) genomic locus among different maize inbred lines [16,17]. Two helitrons present in hundreds of copies in maize inbred line B73 have been identified [13].
More helitrons and captured gene fragments have been detected in maize than in A. thaliana and O. sativa [3,13,14,18,19]. Yang et al. [14] found that over half of the helitrons in the B73 genome contain gene fragments. The captured fragments range from 28 bp to 7.6 kb in length and may even include an entire gene sequence [20,21]. According to the results of Du et al. [13] and Yang et al. [14], helitrons can possess zero to nine gene fragments, which came from 376 and 840 different genes, respectively. The gene fragments carried by these elements can also form chimeric genes [13,20]. ESTs of helitron sequences have been detected in certain maize tissues [15]. It is possible that some functional genes are produced by the shuffling of captured gene fragments.
The mechanisms by which helitrons capture gene fragments and transpose remain unknown. The replication initiator (Rep) protein motif and the DNA helicase (Hel) domain are considered the key protein features of rolling circle (RC) processes in bacteria [3,4,10,22]. It was postulated that helitrons mobilize in eukaryotes by RC replication, following a "copy-and-paste" model [4]. Choi et al. [5] found a predicted autonomous element carrying Rep/Hel-TPase and RPA-TPase in I. tricolor; however, it contained a frameshift and a nonsense mutation. Morgante et al. [15] identified two sequences that contained the conserved RC-Rep motif and DNA helicase domain in two maize inbred lines. However, both are interrupted by other transposons. Du et al. [13] and Yang et al. [14] proposed that helitrons had amplified within the last 6 million years and could still be active in modern maize. So far, no intact autonomous element has been discovered in maize [13,14,19].
The full genome sequence of inbred line B73 was recently obtained using a BAC-by-BAC sequencing strategy [23]. Du et al. [13,24] and Yang et al. [14] developed methods for identifying helitrons and mined 2,791 and 1,930 elements, respectively. They analyzed the extensive distribution, variability and diversity of helitrons in the maize genome. From these studies, certain hallmarks of helitrons in maize have emerged, such as their preference to insert near other helitrons and their less frequent insertion into genes. Some elements had more than one 5'-terminus or 3'-terminus. Many helitrons have been shown to carry phosphatase 2C-like gene fragments.
To further understand the characteristics of helitrons and the features of their transposition, we developed a new set of PERL scripts to search for additional helitrons in the maize genome. A total of 1,649 helitrons were identified, including three putative autonomous elements and two helitrons with high copy number. Our study not only provides a detailed characterization of putative intact autonomous helitrons, but also presents evidence that gene fragment capture during helitron transposition happened in a stepwise way, with multiple gene fragments within one helitron being the products of several sequential transpositions. We also propose, and provide evidence to support, a mechanism for how elements with multiple termini are generated.

Identification of additional helitrons
To obtain additional helitrons with high confidence, the sequences of 23 published helitrons [7,15,17,25-27], including twenty helAs and three helBs, were used as queries to search the maize genome sequence by BLASTN. A total of 248 candidate helitrons were initially identified. Two strategies were used to verify these candidates. First, helitrons located in repeated regions could be verified by BLASTN (Additional file 1, Figure S1A) [17]. Second, helitrons with multiple highly similar copies could be verified against each other by aligning their sequences to determine their exact 5' and 3' boundaries (Additional file 1, Figure S1B). Altogether, we obtained 96 validated helitrons by these two methods, including eighty helAs and sixteen helBs. To further confirm these helitrons, we conducted PCR experiments on selected elements. All fourteen that were successfully amplified showed PCR products of variable size (Additional file 2, Figure S2), indicating vacant and occupied sites and thereby providing final confirmation of our 96 seed helitrons.
Based on the terminal sequence characteristics of the 96 validated helitrons, a PERL script was designed to identify additional elements in the maize genome. As a result, a total of 1,649 intact elements were obtained. According to a previously reported standard [17], we divided these new elements into two families, comprising 1,515 helAs and 134 helBs (Additional files 3 and 4, Tables S1 and S2). The size of these elements ranged from 128 bp to 20,874 bp; the average length was 6,357 bp for helAs and 4,629 bp for helBs. Overall, 82.7% (1,253/1,515) of helA sequences were less than 10 kb in length. Similarly, 94.8% (127/134) of helBs were less than 10 kb. HelAs longer than 10 kb (22.5%; 59/262) and all 7 helBs longer than 10 kb were classified as putative "autonomous" helitrons, provided that they did not contain other long transposons such as retrotransposons.
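The proportions quoted above can be re-derived from the reported counts. The short Python sketch below is only an arithmetic check; the helper name `percent` is illustrative and is not part of the PERL scripts used in the study.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Counts taken from the text above.
HELA_TOTAL = 1515
HELA_UNDER_10KB = 1253
HELB_TOTAL = 134
HELB_UNDER_10KB = 127

# Elements of 10 kb or more are the remainder of each family.
hela_over_10kb = HELA_TOTAL - HELA_UNDER_10KB  # 262
helb_over_10kb = HELB_TOTAL - HELB_UNDER_10KB  # 7

print(percent(HELA_UNDER_10KB, HELA_TOTAL))  # 82.7
print(percent(HELB_UNDER_10KB, HELB_TOTAL))  # 94.8
print(percent(59, hela_over_10kb))           # 22.5
```

The 262 in the denominator of the last figure is therefore the number of helAs longer than 10 kb, not the total number of helAs.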
HelAs had conserved sequences of 24 bp at the 5'-terminus and 28 bp at the 3'-terminus, including palindromic structures. HelBs had conserved sequences of 28 bp and 32 bp at the 5'-terminus and 3'-terminus, respectively (Additional file 5, Figure S3). The 5'-terminus of helBs was significantly different from that of helAs.

Putative autonomous helitrons
In general, helitrons in plants that encode a replication initiator (Rep) motif, a DNA helicase domain and possibly a replication protein A 1 (RPA1)-like motif are considered putative autonomous elements [4]. To find potential autonomous helitrons, all helA sequences longer than 10 kb and helB sequences longer than 5 kb were carefully annotated. Two sequences, named ZmhelA1 (AC208648.2, 14,632 bp) and ZmhelB2 (AC212020.2, 12,217 bp), qualified as putative autonomous elements. ZmhelA1 and ZmhelB2 both contained a conserved Rep motif and DNA helicase domain without frameshifts (Figure 1A, B, C; Additional file 6, Table S3). These conserved domains were reported to be essential for DNA replication and for unwinding double-stranded DNA in other prokaryotic and eukaryotic species [3,5,10]. ZmhelA1 also contained a putative RPA remnant before the Rep motif (Figure 1A), although this RPA sequence had very low sequence similarity to those of A. thaliana and O. sativa [3]. In addition, ZmhelA1 carried eight predicted gene fragments. ZmhelB2 possessed three putative single-strand DNA-binding domains (DBD-A, -B and -C) of RPA1 following the helicase domain in the ORF (Figure 1A), in the same orientation as the Rep/Helicase gene. ZmhelB2 also carried two predicted gene fragments. Based on these structural characteristics, autonomous helitrons in maize can be divided into at least two types, a result consistent with the neighbor-joining phylogenetic analysis (Figure 2). To obtain additional putative "autonomous" elements, the RPA-like and DNA helicase sequences of A. thaliana and O. sativa [3,5] were each used to search the maize genome by TBLASTN. The matching sequences were then extended by 10 kb at both the 5'-terminus and the 3'-terminus. Finally, the resulting putative autonomous helitrons were annotated with FGENESH (http://linux1.softberry.com/berry.phtml).
As a result, five putative autonomous helBs were identified by this homology search approach. One of them, ZmhelB1 (AC200867.3; 12,992 bp), encoded an intact ORF like that of ZmhelB2, with a potentially functional Rep motif, a DNA helicase domain and an RPA1 motif without frameshifts (Figure 1A, B, C; Additional file 6, Table S3). These two putative autonomous helBs have structural characteristics similar to those reported by Morgante et al. [15].
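The flank-extension step of the homology search (each TBLASTN hit extended by 10 kb at both ends before annotation) can be sketched as follows. This is a minimal Python illustration, not the authors' script: the function name `extend_hit`, the 1-based inclusive coordinate convention and the clamping to the sequence ends are all assumptions made here.

```python
FLANK_BP = 10_000  # from the text: hits were extended 10 kb on each side

def extend_hit(start, end, seq_length, flank=FLANK_BP):
    """Extend a 1-based inclusive hit by `flank` bp on both sides.

    Assumption: coordinates are clamped so the window never runs off
    either end of the genomic sequence.
    """
    if start > end:
        # Minus-strand hits are commonly reported with start > end.
        start, end = end, start
    return max(1, start - flank), min(seq_length, end + flank)

# Hypothetical hit coordinates on a hypothetical 200 kb sequence.
window = extend_hit(15_000, 18_000, seq_length=200_000)
print(window)  # (5000, 28000)
```

The extended window, rather than the hit itself, is what would be passed to a gene predictor, so that termini lying outside the protein-coding match are included.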

Helitrons of multiple terminal sequences and of high copy number
Our results showed that 28.7% of helAs contained multiple terminal structures. We refer to the internal terminal sequences as pseudo termini (Figure 3A, B, C). Through multiple sequence alignment, we found that the real 3'-terminus of helitrons contained a highly conserved "CTAG" motif, whereas the pseudo 3'-termini of elements with multiple 3'-termini did not (Figure 3D). One hundred helAs with multiple 3'-termini were randomly sampled to analyze the structure of their pseudo 3'-termini; 99% (99/100) of the internal 3'-end sequences had a pseudo 3'-terminus with no intact "CTAG" motif. However, we did not find any multiple terminal sequences in the 134 helBs.
Based on the sequence characteristics of the pseudo 3'-termini that we obtained, the following consensus sequence model was defined: {6, 8}CTAT". By searching the maize genome sequence with this model, 662 pseudo 3'-terminal sequences were obtained. Ten sequences were randomly selected from these newly identified pseudo 3'-termini, and intact 3'-terminal structures were found within 10 kb downstream of each. The pseudo 3'-termini that we identified in maize universally lacked an intact "CTAG" motif. Using the same methods, we found that 17.6% of helAs also had multiple 5'-termini. However, there were no distinct differences between the pseudo 5'-terminal sequences and the true ones.
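The distinction drawn above between true and pseudo 3'-termini can be illustrated with a small Python sketch. It assumes, as a simplification, that an intact "CTAG" motif means the terminal sequence ends in exactly those four bases; the function names and the example sequences are hypothetical and do not come from the maize data.

```python
def classify_3prime_terminus(terminus_seq):
    """Label a 3'-terminal sequence 'true' if it ends in an intact CTAG
    motif, otherwise 'pseudo'. Assumption: the motif occupies the last
    four bases of the terminus."""
    return "true" if terminus_seq.upper().endswith("CTAG") else "pseudo"

def summarize_termini(termini):
    """Count true and pseudo 3'-termini found within one element."""
    labels = [classify_3prime_terminus(seq) for seq in termini]
    return {"true": labels.count("true"), "pseudo": labels.count("pseudo")}

# Hypothetical element with two internal termini and one real terminus.
example_termini = ["GCGTGCCGCACGCTAT", "gcgtgccgcacgCTAA", "GCGTGCCGCACGCTAG"]
print(summarize_termini(example_termini))  # {'true': 1, 'pseudo': 2}
```

Under this rule an element with multiple 3'-termini is expected to yield exactly one "true" terminus, which matches the observation that only one 3'-terminal sequence per element carries the intact motif.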

Gene fragments captured by helitrons
In order to analyze the gene fragments carried by helitrons, all detected elements were searched against the non-redundant protein (nr) database using BLAST. Most helitrons shorter than 1 kb (64.7%) did not contain any gene fragment. Most elements between 1 kb and 2 kb in length (90%) had captured only one gene fragment. The number of gene fragments captured by helAs ranged from 0 to 12, with a mean of 3. Most helAs (82.1%) carried between 1 and 5 gene fragments. All helBs held no more than five gene fragments, with an average of 1.8. The majority of helBs (82%) had acquired 1 to 3 gene fragments (Figure 5).
A total of 4,645 gene fragments, corresponding to 2,507 proteins, were carried by the helAs (Additional file 9, Table S6). There were 229 helAs that had captured a nearly identical fragment of a phosphatase (type) 2C-like protein (ACG41393.1) [13,14], the same gene fragment found in helitron_mc1 and helitron_mc2. Different members of the phosphatase (type) 2C family were also captured by other helAs, such as ACF84978.1 (48 hits), AAQ06294.1 (29 hits) and ACF83293.1 (19 hits). It is possible that the phosphatase (type) 2C-like gene fragment carried by the helAs had been amplified previously [13]. A total of 249 gene fragments from 187 proteins were captured by helBs (Additional file 10, Table S7). Six helBs contained the same gene fragment (ACG47094.1). Our results suggest that helitrons do not have a bias in capturing gene fragments.

[Figure panel A: Putative autonomous helitrons.]

[Figure: Distribution of the number of gene fragments captured per helitron. Axis label: number of helitrons.]

Step by step capturing of gene fragments
Many helitrons have captured several gene fragments. Some of the gene fragments are apparently even from [...] and ZmhelA2 have over 95% identity from the 5' to the 3' end, except for one insertion in the middle of ZmhelA2. Therefore, ZmhelA2 can be explained as having captured a 1,366 bp gene fragment that inserted into the 25 bp of the 5'-terminus of its ancestral element (ZmhelA3). In the same way, ZmhelA4 and helitron_mc1 showed high sequence similarity (more than 85%) with the 193 bp of the 3'-terminus of ZmhelA3. Detailed analysis indicated that, starting from an ancestral element lacking only one internal gene fragment of ZmhelA3 (shown in blue in Figure 6A), ZmhelA4 and helitron_mc1 could both have been generated by capturing different gene fragments over several steps of transposition.
Our results strongly suggest that gene fragment capture by helitrons happened in a sequential fashion, with each step of transposition likely capturing one gene fragment. Such a stepwise gene-capturing capacity provides ample opportunity to shuffle gene fragments originating from all over the genome. ZmhelB7 (AC186647.3, AC212020.4) might have evolved from ZmhelB2. ZmhelB7 and ZmhelB2 have the same terminal sequences, but the former lacks the DNA helicase domain and the replication protein A (RPA)-like fragments found in ZmhelB2 (Figure 6B). This indicates that helitrons can lose internal sequences during transposition in maize.

Discussion
Helitrons are particularly complex in the maize genome [13,14,28]. In this study, a total of 1,649 elements were obtained based on their terminal sequence characteristics. Du et al. [13] and Yang et al. [14] identified 2,791 and 1,930 intact elements in the maize genome, which overlapped 52.46% and 34.45% with our results, respectively (Additional files 11 and 12, Tables S8 and S9). The differences among these three search programs are mainly due to the parameters used in the respective PERL scripts. For example, the script used by Du et al. [13] aimed to identify only helAs, whereas the script used in this study is intended to cover both helAs and helBs. Additionally, Du's script and that of the current study differ in a number of search criteria, which led to a number of helitrons being identified specifically by each script. Based on a previous estimate [15], a large number of helitrons in the maize B73 genome have still not been identified. Due to the unique structure of helitrons, it is still very difficult to identify all these elements unambiguously. With more seed helitrons available, a more accurate script could be generated, which would substantially increase the number of elements identified in the B73 genome.

Putative autonomous helitrons
All helitrons identified so far in the maize genome are nonautonomous [13,14]. In fact, truly autonomous elements have not been found in eukaryotic species to date. In a spontaneous pearly-s mutant of I. tricolor, Choi et al. [5] found a putative autonomous helitron containing Rep/Hel-TPase and RPA-TPase, but it had a frameshift and a nonsense mutation. Morgante et al. [15] identified two sequences containing the Rep motif and DNA helicase domain; however, both are interrupted by other transposons. The three putative autonomous helitrons found in this study contain an intact Rep motif and DNA helicase domain, like those found in A. thaliana and O. sativa [3]. We also detected four other helBs with the conserved Rep motif and DNA helicase domain; however, their ORFs either had frameshifts or were incomplete (Additional file 6, Table S3). Although we cannot at present confirm that these three putative autonomous helitrons actually function as autonomous elements, the presence of these three sequences with intact ORFs in the B73 genome strongly suggests that true autonomous helitrons could exist in modern maize.
ZmhelA1 had a putative RPA remnant before the Rep motif. ZmhelB1 and ZmhelB2 each possessed an intact RPA1-like domain following the helicase domain in the same ORF. Choi et al. [5] speculated that Rep/Helicase is ubiquitous in eukaryotes and could play a more important role in helitron transposition than RPA1. The structural characteristics of putative autonomous elements in A. thaliana, C. elegans, I. tricolor, M. lucifugus, O. sativa and Z. mays were carefully analyzed [3,5,10,15] (Additional file 13, Table S10).

Generation of helitron with multiple termini from nested helitrons
Most helitrons in the maize genome were found to be small. In this study, 82.7% (1,253/1,515) of helAs were between 100 bp and 10 kb in length, and 94.8% (127/134) of helBs ranged from 600 bp to 10 kb. Yang et al. [14] identified 1,930 elements, of which 95.4% (1,841/1,930) were less than 10 kb in length. The finding of helitrons with multiple copies suggests that they do not always capture gene fragments in the process of transposition.
Multiple terminal structures were found in 28.7% of helAs, as shown by Du et al. [13]. The pseudo 3'-termini had a damaged "CTAG" motif compared with the real 3'-termini. We found that pseudo 3'-terminal structures were ubiquitous in maize inbred line B73. HelAs preferentially insert near or inside other helitrons [14], which could have led to the formation of multiple terminal sequences inside them. Genomic evolution or transposition could have caused an intact terminal structure to turn into a pseudo 3'-terminus (Figure 4B, C, D). Yang et al. [14] reported that helitrons could recognize a new 3'- or 5'-terminus site to form a new element in A. thaliana. Du et al. [13] found that the 3'-terminal sequences were more variable than the 5'-terminal ones.
The evolutionary pathway of helitrons with shared captured gene fragments can be deduced from the different combinations of their captured gene fragments (Figure 6A, B). We detected two elements with multiple copies, helitron_mc1 and helitron_mc2; the latter possessed two pseudo 3'-termini structures (Figure 4). There was high similarity between the 5'-terminal sequences of helitron_mc2 and ZmhelA5 (AC215227.3). Helitron_mc1, helitron_mc2 and ZmhelA5 had one, three and one fragment, respectively, that are highly homologous to the 193 bp of the 3'-terminus of ZmhelA3 (Figure 4A). According to these observations, helitron_mc2 might have evolved from helitron_mc1 and ZmhelA5 [29,30]. The details of the hypothesized evolutionary path for helitron_mc2 are shown in Figure 4B, C, D. ZmhelA5 inserted into helitron_hypo2 (a hypothesized intermediate). Then helitron_hypo2 carrying ZmhelA5 inserted into helitron_mc1 to form nested helitrons. Eventually, helitron_mc2 was generated by further transposition starting from the 5'-end of ZmhelA5 and including the remaining three 3' ends. The intact 3'-end "CTAG" motif could have been mutated either before or after the generation of helitron_mc2. As a large number of nested retrotransposons exist [31], there may also be many nested helitrons in the maize genome. The latter would then serve as intermediates that give rise to the many helitrons with multiple termini seen in the B73 genome.

Conclusions
Helitrons in the maize genome are variable in size. When the elements transposed, they could sometimes capture gene fragments or lose their internal sequence. Gene capture by helitrons can happen in a stepwise mode through sequential transpositions. Three putative autonomous helitrons were discovered in maize with an intact replication initiator (Rep) motif and a DNA helicase (Hel) domain, similar to those identified in other species. Therefore, it is possible that active autonomous elements exist in modern maize. Our study also indicated that helitrons with multiple termini can be generated from nested helitrons.
Two candidate elements with less than 20 kb between them were regarded as a single helitron. We initially obtained 248 candidate helitrons. First, a single element that had inserted into a highly duplicated region could be verified by BLASTN (Additional file 1, Figure S1A) [17]. Second, helitrons with multiple highly similar copies were verified by aligning their sequences to determine their exact 5' and 3' boundaries (Additional file 1, Figure S1B). Through these two methods, we validated 96 helitrons. Primers for fourteen of the 96 validated elements were then designed in the flanking regions upstream and downstream of the inserted element to verify the putative helitrons, with vacant and occupied sites revealed by different PCR bands in a set of 12 inbred lines (Additional file 2, Figure S2).
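The rule that two candidate elements separated by less than 20 kb are treated as a single helitron amounts to an interval merge. The Python sketch below shows one possible reading of that rule; the tuple layout `(sequence_id, start, end)`, the function name and the example coordinates are assumptions for illustration, and the original PERL implementation may differ in detail.

```python
MAX_GAP_BP = 20_000  # from the text: candidates closer than 20 kb are merged

def merge_candidates(hits, max_gap=MAX_GAP_BP):
    """Merge candidate hits on the same sequence whose gap is < max_gap.

    hits: iterable of (sequence_id, start, end) with start <= end.
    Returns a sorted list of merged (sequence_id, start, end) tuples.
    """
    merged = []
    for seq_id, start, end in sorted(hits):
        if merged and merged[-1][0] == seq_id and start - merged[-1][2] < max_gap:
            # Close enough to the previous candidate: extend it.
            prev_id, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_id, prev_start, max(prev_end, end))
        else:
            merged.append((seq_id, start, end))
    return merged

# Hypothetical BLASTN hit coordinates.
example_hits = [
    ("chr1", 100, 500),
    ("chr1", 15_000, 15_400),   # 14.5 kb after the first hit: merged
    ("chr1", 60_000, 60_300),   # 44.6 kb after the merged hit: kept apart
    ("chr2", 200, 700),
]
print(merge_candidates(example_hits))
```

Sorting first means each hit only has to be compared with the most recently merged candidate, so the pass is linear after the sort.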
A PERL script was then written, based on the terminal characteristics of the 96 validated elements, to search the sequence database of inbred line B73. We applied two steps to identify helitrons more reliably, first using the following search criteria: According to the search results and the validation criteria mentioned above, we then searched the genomic sequences again using the stricter criteria as follows:

Sequence analysis and annotation
Local BLAST software (blast-2.2.16) was used to align the sequences. A neighbor-joining phylogeny (1,000 bootstrap replications) was built for the helicases of different species using the Molecular Evolutionary Genetics Analysis (MEGA) 4.0 software [32]. CLUSTALX 2.0 software was used to align sequences. Identified helitrons were annotated by FGENESH (http://linux1.softberry.com/berry.phtml).
The sequences of the 1,649 newly identified helitrons were used as BLAST queries against the nr protein sequence database at NCBI (http://www.ncbi.nlm.nih.gov/). Information about the number, location and annotation of captured gene fragments was obtained from the BLAST results.
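As an illustration of how per-helitron counts might be derived from such BLAST results, the Python sketch below tallies distinct protein hits per query from 12-column tab-delimited BLAST output. Several things here are assumptions rather than details given in the text: that tabular output was used, the E-value cutoff of 1e-5, the simplification of counting each distinct protein as one captured fragment, and the identifiers in the example.

```python
from collections import defaultdict

def count_gene_fragments(tabular_lines, max_evalue=1e-5):
    """Count distinct protein subjects hit by each helitron query.

    Expects 12-column tabular BLAST lines: query id, subject id, then
    identity, length, mismatches, gaps, query/subject coordinates,
    E-value (column 11) and bit score. The cutoff is an assumed value.
    """
    proteins_by_helitron = defaultdict(set)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        if float(fields[10]) <= max_evalue:
            proteins_by_helitron[query_id].add(subject_id)
    return {hel: len(prots) for hel, prots in proteins_by_helitron.items()}

# Hypothetical hits; identifiers and scores are made up for illustration.
example_hits = [
    "hel_001\tprotA\t95.0\t120\t6\t0\t100\t459\t10\t129\t2e-40\t160",
    "hel_001\tprotB\t88.0\t80\t9\t1\t900\t1139\t5\t84\t1e-20\t95",
    "hel_001\tprotC\t40.0\t30\t18\t0\t50\t139\t1\t30\t0.5\t25",
    "hel_002\tprotA\t97.0\t110\t3\t0\t10\t339\t20\t129\t5e-50\t180",
]
print(count_gene_fragments(example_hits))  # {'hel_001': 2, 'hel_002': 1}
```

A fuller analysis would also use the query coordinates to separate non-overlapping fragments of the same protein, which this sketch deliberately leaves out.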