Genome-wide characterization of non-reference transposons in crops suggests non-random insertion

Background Transposons (transposable elements or TEs) are DNA sequences that can change their position within the genome. A large number of TEs have been identified in reference genome of each crop(named accumulated TEs), which are the important part of genome. However, whether there existed TEs with different insert positions in resequenced crop accession genomes from those of reference genome (named non-reference transposable elements, non-ref TEs), and what the characteristics (such as the number, type and distribution) are. To identify and characterize crop non-ref TEs, we analyzed non-ref TEs in more than 125 accessions from rice (Oryza sativa), maize (Zea mays) and sorghum (Sorghum bicolor) using resequenced data with paired-end mapping methods. Results We identified 13,066, 23,866 and 35,679 non-ref TEs in rice, maize and sorghum, respectively. Genome-wide characterization analysis shows that most of non-ref TEs were unique and non-ref TE classes shows different among rice, maize and sorghum. We found that non-ref TEs have a strong positive correlation with gene number and have a bias toward insertion near genes, but with a preference for avoiding coding regions in maize and sorghum. The genes affected by non-ref TE insertion were functionally enriched for stress response mechanisms in all three crops. Conclusions These observations suggest that transposon insertion is not a random event and it makes genomic diversity, which may affect the intraspecific adaption and evolution of crops. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2847-3) contains supplementary material, which is available to authorized users.


Background
Transposons are DNA sequences that can change their positions within the genome. Transposons were first discovered in maize by McClintock in the 1940s [1], and over the next several decades, transposons have been found in almost every plant and animal genome. Moreover, transposons are important components of crop genomes. For example, at least 35 % of the rice genome [2], 62 % of the sorghum genome [3], and nearly 85 % of the maize genome [4] is made up of transposable elements (TEs).
A scheme for the classification of transposons is based on transposition mechanisms, sequence similarities and structural relationships [5]. Transposons are divided into two classes: DNA transposons and RNA transposons (retrotransposons) [6]. Retrotransposons include the following three groups: Long terminal repeats (LTRs), which are flanked by long terminal repeats and encode reverse transcriptase; long interspersed elements (LINEs), which lack LTRs and are transcribed by RNA polymerase II; and short interspersed elements (SINEs), which also lack LTRs and are transcribed by RNA polymerase III. In addition, there are the helitrons, which are replicated by the 'rolling-circle' mechanism, and are therefore also called rolling-circle (RC) transposons. Transposons of theses classes are widely distributed and constitute major components of plant genomes. Additionally, TE superfamilies may be subdivided depending on their replication strategies in crops, such as LTR/Copia, LTR/Gypsy, DNA/ CMC-EnSpm, DNA/MULE-MuDR, LINE/L1 and RC/ Helitron.
In recent years, we have gradually realized the importance of transposons in genome structure, function and evolution. As a fundamental function elements constituting the genomes, transposons are playing important roles in the formation and evolution of the DNA "jigsaw puzzle" structure. They are distributed nonrandomly in large genome and have a correlative relation between other function elements [7,8]. Transposons not only affect plant genome structure but also play important roles in gene expression regulation [9]. Their activity can inactivate genes. Some transposons prefer insertion into genes or near gene flanking regions, leading to a mutation that affects gene function. This transposon activity can be engineered using appropriate vectors to produce artificial mutations in genes. For example, wrinkled peas result from a 0.8-KB transposon insertion in the SBE1 gene, the mechanism of which is similar to the mechanism for the corn Ac/Ds transposon family [10]. Transposon insertion can also positively or negatively alter gene expression levels. A classic example is the transposon insertion into intron 1 of the maize knotted1 gene, causing the expression in the leaves [11]. In additional, transposon insertions can also cause gene rearrangement and epigenetic silencing.
With the advance of high-throughput sequencing and data analysis technologies, researchers have been able to identify new transposon insertions in various species. A comparative genome analysis showed that 14 % of the genomic differences between Nipponbare and 9311 are the result of transposon insertion [12]. Naito et al. detected 1664 mPing transposon insertions by analyzing the genome resequencing data of 24 rice accessions [13]. Ewing and Kazazian analyzed data from the 1000 Genomes Project and presented their analysis of LINE-1 insertions in genomes that are not represented in the reference genome assembly [14]. Tian et al. analyzed sequencing data of 31 wild and cultivated soybeans and detected 34,154 new transposon insertions, which revealed the evolutionary trends of transposons in soybean [15]. The above studies demonstrate that transposons between accessions of the same species are markedly different, and these differences may play important roles in the evolution of species.
Rice, maize and sorghum are important cereal crops; all of their reference genomes are available. Many landrace accessions of these crops and improved and wild varieties have been resequenced using second-generation sequencing technology. Lai et al. resequenced six maize inbred lines, and 103 maize lines of teosintes, landraces and improved varieties were resequenced in the maize hapmap2 project (HapMapV2) [16]. Genome resequencing of 40 cultivated lines and ten wild lines of rice were completed, with an average depth of >15X [17]. Mace et al. resequenced 45 sorghum varieties with an average sequencing depth of 16-45X [18]. At the same time, many methods and tools have been used to identify new transposon insertions in resequenced accessions, which are inserted in different genomic locations from those of reference genome, and termed non-reference transposable elements (non-ref TEs). That is, non-ref TEs are not in the reference genome but in other resequenced accession genomes. RetroSeq introduced a method using pair-end reads mapping to reference genome and accumulated transposon database to do this. First, one end of the pair-end short reads are mapped to the reference genome, while the other paired reads are mapped to the transposons library; paired short reads will therefore overlap with potential transposon insertion sites. Second, transposons that pass aggregation analyses of all possible positions and filtering for depth coverage are designated as non-reference transposons [19]. Although transposons are major components of the genome, their exact functions and relevance in plant genomes have not been revealed. Genome resequencing of crop accessions can be used efficiently to identify and characterize Non-ref TEs. Comparing to the reference genome, Non-ref TEs have different insert positions in accessions.
In this study, resequencing data of 125 accessions for rice, maize and sorghum were collected, including wild, landrace and improved groups. Non-ref TEs were identified using pair-end read alignment to the reference genome and transposon databases separately. To characterize genome-wide non-ref TEs, we compared classes of non-ref TEs between both species and groups and analyzed the insertion location and affected genes. We found that the number, classification and distribution of non-ref TEs were different for each crop group and each accessions of the same species. In addition, non-ref TEs had an insertion preference for intergenic regions, avoiding coding regions. These observations suggest that transposon insertion is not a random event. Furthermore, the functional analysis of affected genes suggested that transposon insertion plays an important role in the adaptive evolution of crops.

Identification of non-ref TE insertions
We used the RepeatMasker (Version: 3.3.0) [20] (Table 1 and Additional file 1: Table S4). According to their different evolutionary and domestication history, we divided them into three groups of improved, landrace and wild. The NPSPD (Average number of non-ref TEs per sample per depth) in the wild group was highest, followed by landraces. The NPSPD of the improved group was lowest in rice and sorghum (because the wild group of maize had only one accession, no comparison could be made). The results were consistent with the genetic differences between groups, which suggest the reliability of our approach for identifying non-ref TEs.
The sequencing depth of the accessions we studied ranged from 6X to 45X, of which the average depths were 18X, 6X and 20X for rice, maize and sorghum, respectively. For our method of mapping reads to iden- The results showed a Pearson correlation coefficient of 0.3, suggesting no obvious correlation between the two indices ( Fig. 1a) and making our method reasonable.
We used PCR-based validation to examine TE insertion events in the maize inbred line MO17, while B73 served as the reference. The results for the predicted TE insertion positions show different fragment lengths between these two lines, and the sequence results support our prediction ( Fig. 1b and c and Additional file 2: Figure S2). Wild#SR1000339T, and 62 % were between Improved #SR1000318T and Improved#SR1000334T. These results suggest a strong phylogenetic relationship between these accession pairs (Additional file 2: Figure S3).

Non-ref TE sharing in the accessions and groups
The non-ref TEs shared by the three groups are shown in Fig. 2b. The number of shared non-ref TEs was highest between improved and landrace groups for rice,  Figure S4). We also compared different accessions and found the LTR and DNA of each accession with the highest number of non-ref TEs had a similar distribution. We classified TEs by superfamilies and showed that the TEs of LTR/Gypsy comprised 18, 48 and 40 % of the   (Fig. 3a). These results suggest that differences between TE classes can be observed between the superfamilies.
To further explore differences in the non-ref TEs, we compared superfamilies between accession groups. We used Student's t-test to identify significantly different superfamilies of non-ref TEs from each group in the three species. The wild group of maize was excluded from this analysis because that group had only one accession. In rice, LINE/L1 and RC/Helitron were significantly different between the improved group and the landrace group (p < 0.01). In maize, DNA/DNA, DNA/   Table S1). The numbers of non-ref TE classes and superfamilies in rice, maize and sorghum are in Additional file 3: Table S5.
To discover TE differences between accessions, in cases of random sampling, the longer TE may have higher probability. We calculated the reads per kilobase per million mapped reads (RPKM) [22] for each transposon in all accessions of the three species and then calculated the Pearson correlation coefficients in pairwise comparisons. For example, the RPKM value of  (Fig. 3b). See all other results in Additional file 2: Figure S5. After that, the distribution of Pearson values is shown in Fig. 3c. The average Pearson correlation coefficient (PCC) of the RPKMs between accessions was 0.70, with a minimum of 0.17 and a maximum of 0.99 in rice. In maize, the average PCC was 0.77, with a minimum of 0.40 and a maximum of 0.97. In sorghum, each pairwise comparison had a PCC >0.6, with an average of 0.98, a minimum of 0.88 and a maximum of 1. Therefore, the differences in all non-ref TEs between sorghum accessions were smaller than those of rice and maize, which suggested different evolutionary histories of rice, maize and sorghum, and there have smaller genetic differences between the various accessions in sorghum.

Chromosome distribution of non-ref TEs
To explore the distribution of non-ref TEs, we counted the number of genes, accumulated TEs, single-nucleotide polymorphisms (SNPs) and non-ref TEs in each chromosome. We further calculated the Pearson correlation coefficient between non-ref TEs and the other three indices. Figure 4 shows the distributions of non-ref TEs and genes in chromosome 1 for rice, maize and sorghum, and the PCC are 0.61, 0.67 and 0,85, respectively. Additional file 2: Figure S6 shows the distribution of other chromosomes. Additional file 2: Table S2 (Fig. 5a). Overall, the results indicated that the proportion of non-ref TE insertion into genic regions was highest in rice, followed by maize; the proportion for sorghum was lowest.
For non-ref TE insertion into intergenic regions, we calculated the distance between non-ref TEs and nearby genes. In rice, the average distance between two nearby genes was 9200 bp, and the average distance between non-ref TEs and nearby genes was 4491 bp. The two indices were 18,436 and 4667 bp for maize and 16,542 and 3533 bp for sorghum. The density of distance from non-  Table S3 shows the structures of these genes compared to the reference genome annotation. Overall these results show that genes with nonref TEs have a longer average transcript length and average CDS length and a higher average number of exons per gene compared to all of the genes in the genome.
To further investigate the effects of non-ref TE on gene function, we identified and annotated all genes with non-ref TEs in the coding region using InterProScan [23]. The results of gene annotation analysis were similar in rice, maize and sorghum. Most of these genes encoded protein kinases, including protein kinase, catalytic domain, serine/threonine-/dual-specificity protein kinase, catalytic domain, tyrosine-protein kinase, catalytic domain, serine/threonine-protein kinase, active site, protein kinase, ATP binding site, and serine-threonine/ tyrosine-protein kinase catalytic domain. In addition to protein kinase, there are also some others were listed (Additional file 4: Table S6). For example, NB-ARC: a motif shared by plant resistance gene products and regulators of cell death in animals [24]; Cytochrome P450: Key players in plant development and defense [25].
Gene Ontology (GO) [26] analysis showed that function of proteins annotated in the envelope, extracellular region and membrane-enclosed lumen in maize and sorghum. Molecular function ontology analysis identified enzyme regulator and molecular transducer in maize and sorghum. Nutrient reservoir proteins were only found in sorghum. The biological process ontology analysis found proteins of multi-organism process, pigmentation and reproduction mainly in maize and sorghum, depth only in sorghum, and developmental process only in rice ( Fig. 5d and Additional file 5: Table S7).
Biological Networks Gene Ontology (BiNGO) [27] was used to perform the enrichment analysis of GO items, such as ATP binding, protein amino acid phosphorylation, protein kinase activity and apoptosis in rice, maize and sorghum. In rice and maize, many proteins involved in defense response were also enriched. In addition, GO analysis in rice found cellular component enrichments for proteolysis, RNA-dependent DNA replication and DNA integration and molecular function enrichments for calcium-transporting ATPase activity, ribonuclease H activity, peptidase activity and RNA-directed DNA polymerase activity. RNA glycosylase activity, isomerase activity and terpenoid metabolic process were enriched in sorghum only. Iron ion binding was enriched in maize (Additional file 2: Figure S7). The results suggested that the genes affected by non-ref TEs were involved in multiple biological functions, and the results of the functional annotations were similar in rice, maize and sorghum.

Identification non-ref TEs using resequencing data
Transposons as an important part of the plant genome, not only can regulate gene expression, gene function, but also provide important information for study of the evolution history of plants. In recent years, with the development of high-throughput sequencing technology, genome-resequencing data have been on an explosive growth trend, which includes growth in the discovery of non-ref TEs using resequencing data. Multiple studies have demonstrated the feasibility of this approach [12,14,17]. Our study used a modified RetroSeq workflow, adjusting some alignment methods and parameters for suitable use in genome-wide analysis of non-ref TEs in crops. A total 125 accessions of rice, maize and sorghum was used to identify novel TE insertions compared to a reference genome. The depth coverage was 6-45×, and the average numbers of non-ref TEs identified were 261, 796 and 793 for rice, maize, and sorghum, respectively. We did not find a significant correlation between the number of non-ref TEs and the depth coverage of the sequencing data. This results support the use of resequencing data to identify non-ref TEs. We found that non-ref TEs were different between accessions. We assume these differences are consistent with polymorphic variations, such as SNPs, InDels and SVs, as these DNA level changes affect polymorphisms between accessions. The investigation of non-ref TEs increases our understanding of genetic polymorphism and evolution.

Variation of non-ref TEs among crops
The non-ref TEs identified in rice, maize and sorghum were different. First, we identified averages of 261, 796 and 793 non-ref TEs for each accession in rice, maize and sorghum, and the NPSPDs were 0.28, 4.35 and 0.89, respectively. So the non-ref TEs number is obviously different among species, which of rice is far less than that of maize and sorghum. Second, our analysis shows an inverse relationship for TE classes between non-ref TEs and accumulated TEs. In rice, most accumulated TEs belongs to DNA class, but LTRs were the most common identified in non-ref TEs. By contrast, for maize and sorghum, the LTR proportion was highest in accumulated TEs and lower in non-ref TEs. We also analyzed the divergence of accumulated TEs. The results in rice show that the average divergence rate was 17 %, and the divergence rates in maize and sorghum were both 15 %. Moreover, DNA class has a greater divergence rate than LTR in rice (Fig. 6). We speculate that the higher divergence in rice influences the alignment process, resulting in more false-negative results and fewer DNA transposon identifications. This possibility may also explain our findings that LTR transposons are more active in maize and sorghum and DNA transposons in rice are more active in maize and sorghum. At last, non-ref TEs difference among species is related to genome stability. Rice genome is smaller and more conservative than maize and sorghum, which may be related to their growth environment and evolution history.
Differences in transposons in the genome occur not only between species but also between groups. We divided the accessions of rice, maize and sorghum separately into three groups: wild, improved and landrace. First, we analyzed the numbers of non-ref TEs between different groups. The average numbers of non-ref TEs in improved, landrace and wild groups of rice were 249, 307 and 518, respectively, 1641, 679 and 1798 in maize, respectively, and 999, 858 and 2563 in sorghum, respectively. These results indicate that there are more non-ref TEs in the wild group than in the improved and landrace groups in rice, maize and sorghum. The results of the NPSPD analysis were similar (Table 1). Because nonref TEs are defined as TEs that are not in the reference genome but in other accession genomes, we note that accessions that are closely related to the reference genome may be identified with fewer non-ref TEs. By contrast, increased genetic distance would result in more non-ref TEs. The cultivar sequencing of rice (Japonica), maize (B73) and sorghum (BTx623) provide reference genomes, so these reference genomes are more distantly related to the wild group and more closely related to their domestication and improvement processes. Second, we compared the superfamilies of non-ref TEs. Significant superfamily differences were observed among the groups. Identifying the source of these differences requires further analysis; however, we speculate that these differences are also related to evolutionary history, genetic relationships between accessions and the distance from accessions to the reference genome.

Non-ref TE insertions are not random events
The four following lines of evidence suggest that non-ref TE insertions are not random events: (1)Positive correlation between non-ref TEs and gene density. Other researchers are also concerned about the relationship between genes and transposons. In Arabidopsis, distribution analysis of accumulated TEs suggests a negative correlation between gene and TE density [28,29]. This association is also found in rice, where investigation of non-LTR-RTs (Non-long terminal repeat retrotransposons) and DNA transposons revealed a negative correlation between gene densities [30,31]. We analyzed the chromosomal distributions of non-ref TEs and genes in sorghum and maize, and found that they were strongly correlative, and the respective mean PCC were 0.88 and 0.67. Our discovery of this relationship between non-ref TEs and gene number is novel. The results suggest that the TEs in the region near a gene have high activity, whereas accumulated TEs are more stable. Moreover, the position of non-ref TE insertion in sorghum was positively related to SNP loci; this relationship is also Fig. 6 Divergence of accumulate TEs in rice, maize and sorghum clearly shown for accumulated TEs in the human genome [32,33]. Presumably, these non-ref TEs are an important source of SNPs, and in rice and maize but not sorghum, non-ref TEs have a smaller contribution to SNPs. The number and distribution of non-ref TEs in rice is different from those of maize and sorghum, meanwhile, the correlation coefficient between non-ref TEs and gene number in rice is far less than those of maize and sorghum. The possible reasons are as follows. 1) the total gene number in rice, maize and sorghum genomes is similar to each other. However, the genome size of rice is far less than sorghum and maize. So the rice genome included fewer TEs. 2) Previous reports showed that rice genome is more conservative [34,35]. It was speculated that the TE activity is lower than other grasses, such as maize and sorghum, which causes small TE difference among rice accessions.  [40]. Two approaches can be used to test this hypothesis. The first involves stress exposure to genetically controlled organisms [41][42][43][44]. The second approach involves analysis of natural populations of the same species living in different conditions [45]. Here, we analyzed the functions of genes that are affected by non-ref TE insertion. Although the identified non-ref TEs number in rice is far less than maize and sorghum, the results of gene function annotation and classification are consistent. Interpro results showed that most affected genes encoded proteins annotated as protein kinases which involved in many aspects of cellular regulation and metabolism [46]. Additional, some affected genes were annotated as NB-ARC and Cytochrome P450 which involved in plant resistance gene and defense. GO analysis showed that affected genes are functionally different. The GO enrichment analysis identified affected genes encoding proteins that have ATP-binding sites, amino acid phosphorylation sites, and protein kinase activity, along with biological processes related to cell apoptosis in rice, maize and sorghum. In addition, affected genes in rice and maize included functional enrichments for defense response processes.

Datasets
Maize resequencing data were acquired from a project deposited in the NCBI Short Read Archive with accession number SRA010130 [47]. This project generated resequenced data from a group of six elite maize inbred lines. Another maize resequencing dataset was acquired from the NCBI Short Reads Archive under the accession number SRA051245 [16]. The data from this maize HapMapV2 study had a depth coverage that ranged from 4X to 30X. We only used landrace lines sequenced at the same facility. The rice resequencing data acquired from the NCBI Short Read Archive under accession number SRA023116 [17], had 50 accessions in total, which included 40 cultivated accessions and ten accessions of wild progenitors with >15X raw data coverage.
The acquired sorghum resequencing data had 16-45X raw data coverage of 45 sorghum lines [18]. The transposon data from the three species studied were extracted from the Repbase Update database at www.girinst.org/repbase/ [21]. The Repbase Update (RU) database contains prototypic sequences representing repetitive DNA from different eukaryotic species.
The B73 sequences of the maize reference genome were obtained from the Maize Genome Sequencing Project AGPv2, which was the first draft assembly of the maize genome released in 2010. We used the maize gene annotation from the 5b.60 release of the maize Genome Sequencing Project based on the AGPv2 assembly [4]. The International Rice Genome Sequencing Project (IRGSP) genome sequence (build 5) was used as the rice reference genome. Accordingly, rice annotation information used the 2009-01-MSU gene set [48]. The sorghum reference genome used was the DOE-JGI (sbi1), and the Sbi1.4 gene set was used for gene annotation [3].

Identifying non-ref TEs
First, we used BWA (0.6.2-r126) [49] to map the nextgeneration sequencing reads against the reference genome, and then we used the 'bwa sampe' to generate paired reads alignment in SAM format. Second, we used SAMTOOLS (Version: 0.1.18) [50] to sort the alignments and then used the command 'samtools merge' to merge the alignments from different sequence lanes.
To identify candidate non-ref TEs with read pair support, we used a modified RetroSeq [19] workflow, adjusting some alignment methods and parameters for suitability in genome-wide analysis of non-ref TEs in crops. First, we checked the bam file, and the alignments with duplicate reads or a mapping quality <30 were discarded. We kept mapped reads and unmapped mate reads for further analysis. Second, we used the unmapped mate reads in a BLAST query search [51] against the TE library. Matches with >80 % identity and alignment length >36 in the 500 bp beginnings or 500 bp ends of sequences in the TE library were accepted as candidate non-ref TEs.
We clustered the reads into fwd and rev clusters in 4000-bp regions. Only an average length of a region <200 bp and a minimum number of reads >5 was considered for non-ref TE identification. Finally, we compared candidate read positions against the accumulated TEs and filtered the results that overlapped with the reference (Additional file 2: Figure S1).
We used PCR to validate a sample of non-ref TE insertions in the MO17 maize inbred line. Primers were designed to the predicted non-ref TE insertion site. Comparisons of amplicon sizes between references B73 and MO17 were used to determine insertions of predicted non-ref TEs. The PCR products were also sequenced for comparison to further establish the