Exonization of active mouse L1s: a driver of transcriptome evolution?
© Zemojtel et al. 2007
Received: 18 June 2007
Accepted: 26 October 2007
Published: 26 October 2007
Long interspersed nuclear elements (LINE-1s, L1s) have been recently implicated in the regulation of mammalian transcriptomes.
Here, we show that members of the three active mouse L1 subfamilies (A, GF and TF) contain, in addition to those on their sense strands, conserved functional splice sites on their antisense strands, which trigger multiple exonization events. The latter is particularly intriguing in the light of the strong antisense orientation bias of intronic L1s, implying that the toleration of antisense insertions results in an increased potential for exonization.
In a genome-wide analysis, we have uncovered evidence suggesting that the mobility of the large number of retrotransposition-competent mouse L1s (~2400 potentially active L1s in NCBIm35) has significant potential to shape the mouse transcriptome by continuously generating insertions into transcriptional units.
LINE-1 elements (L1s) are by far the most abundant class of active autonomous transposons in mammalian genomes. It has been established that active, i.e. retrotransposition-competent, L1 elements in the mouse genome [2, 3] outnumber those found in the human genome many-fold [3, 4]. This is reflected in a difference of more than an order of magnitude in the percentage of spontaneous mutations due to L1 activity in mice (~2.5%) compared to that in humans (~0.07%). As a result, based on recent experimental and bioinformatic data, one might speculate that the high insertional activity of mouse L1s could play a significant part in shaping the structure and expression of the mouse transcriptome [6, 7]. Importantly, intronic L1 insertions have been shown to influence the expression of their host genes in a wide variety of ways, including retardation of transcriptional elongation, transcriptional control [8–10], premature polyadenylation, and exon skipping.
The process by which L1 sequences inserted within introns are recruited into an mRNA, termed exonization, has been studied primarily by analysis of the human transcriptome [13–16], but to date little evidence has been collected for the mouse [15, 17]. In this study, we have assessed the exonization potential of currently active mouse L1 elements. Through detailed analysis of members of active L1 families and cDNA-supported L1 exonization events, we show this potential to be much more significant than previously appreciated. This finding, coupled with the much greater activity of L1s in mouse, suggests that these elements have not only dynamically modified the mouse transcriptome in the past, but continue to do so.
Transcription of these open reading frames is driven by the mouse L1 internal promoter, which is built of a variable number of ~200 nt long repeats called monomers. The sequences of active mouse L1s contain related GF and TF monomers (F-type) as well as unrelated A-type monomers. Consequently, mouse L1s are classified into subfamilies based on the type of monomer they harbor. Notably, we have established by bioinformatic data mining that, whereas the number of potentially active L1s belonging to the GF subfamily agrees with earlier estimates, the number of potentially active members of the TF and A subfamilies is ~1.6-fold higher than previously estimated (Fig. 1A). As a result, based on our analysis of genomic sequence data, we conclude that perhaps as many as ~4800 (2*2382) potentially active L1s reside in the diploid mouse genome.
Since L1 activity is expected to correlate with L1 expression level, and the latter has been shown to correlate with the length of the internal promoter, we aimed to characterize the promoters of the 2382 potentially active L1 elements which we had discovered (Fig. 1B). We found that the GF subfamily has the longest average promoter size (~5.5 monomers), followed by the A (~4.4 monomers) and TF (~4.1 monomers) subfamilies. The annotation of promoter regions is available at .
Of the 52 discovered events, 43 involved the exonization of antisense L1 sequences. Similarly, the authors of a very recent study of transposable element exonization reported a greater number of L1 exonization events in the antisense orientation. The greater number of antisense L1 exonizations that we observed may result from the antisense orientation bias of intronic L1s (see Antisense splice sites vs. antisense L1 insertional bias and Conclusion).
The most frequently used acceptor and donor splice sites we have identified are SA-154 (located in the antisense strand of 5' UTR) and SD+52 (in the antisense strand of ORF1), supported by 13 and 16 different cDNA transcripts, respectively (Fig. 3A).
A classic example of sense-orientation L1 exonization was reported previously, when the insertion of a ~1100 bp 3' fragment of an L1 TF element within an intron of the beige gene caused a disease-specific mutation in mouse. Usage of the two SD sites identified in that study, BG_SD+4694 and BG_SD+4903, was also evident in 6 and 3 different cDNAs, respectively, in our data set (see Fig. 3A and online annotation at ).
Clearly, truncated and rearranged L1s provide different splice sites. The shortest exonized L1 insertion we have annotated is only 164 bp long (see online annotation of AK032656) and provides the antisense SA+5614 site located at the polyadenylation signal (notably, the polypyrimidine tract here is derived from the polyadenylation signal of L1). As shown in this study (see ), the diverse range of lengths of exonized L1 insertions opens up numerous possibilities for the combinatorial usage of the splice sites. Conversely, as evidenced by cDNA AK034994, two separate antisense (237 bp) and sense (1117 bp) L1 insertions cooperatively provide SD and SA sites to create an L1-derived exon.
Provocatively, it has been proposed that, in general, repeat exonization via alternative splicing may constitute a vehicle for the exaptation of repeat sequences into novel functions [20, 26, 27]. In line with this, a recent experimental study demonstrated that arbitrary sequences can evolve towards functionality when fused with other pre-existing protein modules.
Our analyses of cDNAs have revealed that exonized antisense L1 sequences have the potential to code for parts of ORFs. For example, the alternative transcripts of the GBP-5 gene (gi: 24266664, 26326418) contain an L1-derived exon which contains sequences from three GF monomers. The antisense sequences of each of the three monomers can be translated into peptides which are ~60 a.a. in length yielding a novel 174 a.a. long C-terminus of GBP-5 protein (see Fig. 3B and also ). Although, clearly, the resulting protein variant is mouse-specific, it was noted that the alternative C-terminus variant of GBP-5 exists in humans (AF328727) and that both mouse and human variants lack the C-terminal CaaX isoprenylation motif [29, 30], which might be of physiological importance.
We rendered sequence logos of the sequences corresponding to the identified acceptor and donor motifs in L1s (Additional File 2). In analyzing logos of the acceptor sites we observed polypyrimidine tracts, which are typical of consensus splice sites in mouse genes (see Fig. 5A). The presence of these motifs supports the putative functionality of annotated acceptor sites in active L1s.
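The acceptor-site criterion described above (an AG dinucleotide preceded by a polypyrimidine tract) can be illustrated with a minimal sketch. The function names, the 20 nt upstream window and the minimum run length of 8 are hypothetical choices made for illustration only; they are not parameters reported in this study, and the toy sequence is not an L1 sequence.

```python
def longest_pyrimidine_run(seq: str) -> int:
    """Return the length of the longest uninterrupted run of C/T in seq."""
    best = current = 0
    for base in seq.upper():
        if base in "CT":
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


def looks_like_acceptor(seq: str, ag_index: int,
                        window: int = 20, min_run: int = 8) -> bool:
    """Heuristic check: AG at ag_index preceded by a pyrimidine-rich stretch.

    window and min_run are illustrative assumptions, not values from the paper.
    """
    seq = seq.upper()
    if seq[ag_index:ag_index + 2] != "AG":
        return False
    upstream = seq[max(0, ag_index - window):ag_index]
    return longest_pyrimidine_run(upstream) >= min_run


# Toy sequence (invented): a pyrimidine tract followed by AG.
toy = "GAGATCTTTCCTTCTCAGGTAC"
print(looks_like_acceptor(toy, toy.rindex("AG")))
```

A real analysis would score candidate sites against position-specific models such as the sequence logos mentioned above; a fixed run-length threshold is only a rough stand-in.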
As illustrated for the SA-154 site in Fig. 5A, family-specific patterns of splice motif conservation are observed in potentially active L1 sequences. Further cases, such as the SA+106 and SD+288 sites, which are AG/GT intact only in F-type mouse L1s, are highlighted in Fig. 5B.
We also found three splice sites, SD+5094, SD+2036 and SA+1930, that are not intact in any subfamily of the potentially active L1 sequences and have been generated by single nucleotide mutations leading to functional "GT" and "AG" motifs (GG->GT, CT->GT and AT->AG, respectively; see Additional File 2). The SA+1930 acceptor splice site contains an 11 bp-long polypyrimidine tract which is present in all sequences of potentially active L1s (Additional File 2). Thus, a single nucleotide T->G mutation could activate this cryptic acceptor splice site in potentially active L1 sequences.
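The single-nucleotide activations listed above can be checked mechanically. The sketch below is illustrative only: the function name is hypothetical, and it considers nothing beyond the canonical GT donor and AG acceptor dinucleotides, ignoring the surrounding sequence context that determines whether a site is actually used.

```python
def single_substitution_activations(dinucleotide: str) -> list:
    """List (motif, change) pairs where exactly one base substitution
    converts the dinucleotide into a canonical GT (donor) or AG (acceptor)."""
    results = []
    for motif in ("GT", "AG"):
        diffs = [(old, new)
                 for old, new in zip(dinucleotide.upper(), motif)
                 if old != new]
        if len(diffs) == 1:
            old, new = diffs[0]
            results.append((motif, f"{old}->{new}"))
    return results


# The three precursor dinucleotides mentioned in the text.
for precursor in ("GG", "CT", "AT"):
    print(precursor, single_substitution_activations(precursor))
```

For the AT precursor the sketch reports both a possible donor (A->G) and a possible acceptor (T->G); only the latter corresponds to the SA+1930 case described in the text.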
The amount of L1 sequence residing in mouse introns is ~25% of the total genomic L1 content, and as much as 8% of all intronic sequence is L1-derived (based on our cumulative analysis of annotated transcriptional units found in Refseq, Known Genes and Ensembl from the UCSC mm7 Dataset). This sequence consists mostly of truncated L1 sequences: according to RepeatMasker annotation of NCBIm35 (UCSC mm7), more than 92% of intronic L1 sequences are less than 1000 nt long. As shown above, even these short sequences can be subject to exonization. It is expected, however, that intronic full-length (FL) L1 insertions have a much higher exonization potential since they contain multiple splice sites and, for example, only FL L1s will include 5' promoter regions containing splice sites.
We identified 1739 antisense and 1014 sense intronic FL (greater than 5 kbp in length) L1 insertions. As revealed by our L1Xplorer annotation, these belong in large part (75% in sense and 78% in antisense) to the active A, GF and TF families but some (17% in sense and 16% in antisense) belong to the inactivated F subfamily (Additional File 3). Similar to potentially active L1s, splice sites are largely intact in intronic FL L1 insertions (Fig. 5C, Additional File 1, Additional File 2). Multiple cDNAs confirm exonization of antisense sequences of FL elements (see ). In particular, antisense sequences of two intronic potentially active L1 TF elements are exonized via the SD+52 site to create first exons in cDNAs AK132928 and AK007235.
By analysing three groups of gene annotations (cDNA and corresponding DNA), we identified as many as 1259 Ensembl Genes, 1436 UCSC Known Genes and 858 RefSeq Genes with at least one FL antisense intronic L1 insertion and 718 Ensembl Genes, 797 UCSC Known Genes and 464 RefSeq Genes that contained at least one intronic FL sense L1 insertion. Hence, the prediction of potential splice sites within FL intronic L1s may be an important consideration for researchers studying transcripts of particular genes. We have added the annotation of intronic FL L1 elements to L1Base, which is available at .
The existence of antisense splice sites is highly intriguing, particularly in the light of the ~2-fold antisense orientation bias of L1s located in introns of transcriptional units. This orientation bias is especially evident when comparing regions immediately flanking transcriptional start sites (TSS) and transcriptional end sites (TES) (Additional File 4).
However, this global picture, which is based on all L1 insertions, does not provide information on whether the bias results from processes acting on long time scales or is already reflected in young L1 insertions. To gain insight into this issue we specifically looked at young intronic FL L1 insertions.
The ratio of antisense to sense FL intronic insertions is ~1.7, and chi-square testing established that this ratio is significantly different from a random insertion orientation model (chi2: p < 0.0001), where either orientation is equally likely.
We set out to investigate whether this insertional bias is still evident among younger intronic insertions. For this analysis we utilized the set of potentially active L1s (i.e. full-length elements with intact ORFs) that were inserted within introns. This is more stringent than analysing FL insertions with disrupted ORFs, since elements with intact ORFs are likely to be younger due to the L1 proteins' cis-preference towards their encoding RNA. Approximately 28% (657) of putatively active sequences reside in introns (393 in antisense, 270 in sense, 6 both in sense and antisense, ratio ~1.46). The chi-square test indicated that this is again highly significantly different from a random insertion model (chi2: p < 0.0001). This result suggests that if insertion orientation is random for de novo insertions, and the observed bias towards antisense insertion occurs due to selection against sense insertions, this selection process is rapid.
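The orientation tests reported above can be reproduced approximately with a one-degree-of-freedom chi-square goodness-of-fit test against an equal-orientation model. This is a sketch under stated assumptions: the exact test variant used in the study is not specified (no continuity correction is applied here), the function name is hypothetical, and the counts 393 and 270 are used as reported, without special handling of the 6 elements that lie in both orientations.

```python
import math


def chi_square_equal_orientation(antisense: int, sense: int):
    """Goodness-of-fit test against a 1:1 antisense:sense expectation.

    Returns (chi2, p_value) for 1 degree of freedom. For df = 1 the
    chi-square survival function equals erfc(sqrt(chi2 / 2)).
    """
    expected = (antisense + sense) / 2
    chi2 = ((antisense - expected) ** 2 + (sense - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts taken from the text: FL intronic insertions and
# potentially active intronic L1s, respectively.
for label, a, s in [("FL intronic", 1739, 1014),
                    ("potentially active intronic", 393, 270)]:
    chi2, p = chi_square_equal_orientation(a, s)
    print(f"{label}: ratio={a / s:.2f} chi2={chi2:.1f} p={p:.2e}")
```

Under these assumptions both data sets give p-values well below 0.0001, consistent with the significance levels stated in the text.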
The family distribution for the antisense intronic L1 insertions (Additional File 3A) is very similar to that of the FL L1s in intergenic regions (Additional File 3B). This is what one would expect to see under the assumption that the sequences of all L1 families equally impact the genes they insert into (i.e. there is no negative selection against any particular subfamily). However, we did observe a difference in the distribution of TF and A subfamilies between intronic sense FL insertions (Additional File 3B) and intergenic FL insertions (chi2: p < 0.0001). One might argue here that negative selection appears to have acted specifically on the intronic sense L1 insertions belonging to the A subfamily, but it is not clear why this might be.
We have shown, using evidence from exonization events in mouse cDNA libraries, that active mouse L1 elements contain functional splice sites within their antisense sequences. This is especially interesting in the light of the antisense insertional bias of de novo L1 insertions. A recent experimental study addressed the molecular nature of this phenomenon by showing that sense insertions of a mouse L1 TF element reduce transcript levels and impair transcript structure. By contrast, antisense intronic insertions have little or no effect on transcript elongation and abundance. These data suggest that the apparently benign nature of antisense intronic insertions and the presence of functional antisense splice sites can lead to the frequent exonization of L1 sequences, as we have observed in our dataset. Further, the conservation of these splice sites in L1 families known to be currently active in the mouse genome strongly implies that generation of L1-exonized transcripts is ongoing, and thus represents a driver of transcriptome evolution. This picture is, however, complicated by many factors: the intronic environment, together with the size and exact structure of the L1 insertion, can affect its exonization potential, resulting in gene- and insertion-specific patterns of exonization. With this caveat, the evidence for inclusion of L1 sequences in transcripts and the high activity of the many LINE-1s in the mouse genome suggest that their integration into introns has significant ongoing potential to shape the structure of the transcriptome and, ultimately, the proteome in the course of evolution.
We screened mouse cDNA databases (FANTOM3 and NCBI) with RepeatMasker to identify cDNAs containing L1 sequences. The splice sites within L1 sequences were identified using a combination of the following tools. We used BLAT to identify the genomic localization of the cDNAs on the mouse genome (NCBIm35). The genomic regions were extracted either from ENSEMBL or the NCBI Nucleotide Database. SPLIGN was used to split the cDNAs into exons. RepeatMasker was used to identify the repeats in genomic regions corresponding to the mapped cDNAs. Family classifications of L1s were carried out with RepeatMasker and a customized version of the monomer search module of L1Xplorer, which uses Matcher from the EMBOSS package and template sequences for A- and F-type monomers [2, 40]. This allowed us to specifically identify the candidate splice sites that occur in sequences belonging to active L1 families. Because of their sequence similarity to the active GF and TF subfamilies, we also included in our analysis cases of exonization of sequences from the inactivated F subfamily. To expedite the splice site annotation process we developed, using PHP and MySQL, a web interface which allowed for manual data curation. Furthermore, a set of Perl scripts was developed to interact with the data and compute statistics. The online database containing annotations of cDNA transcripts with exonized L1 sequences is a read-only version of the annotation system.
We used a set of potentially active L1s, as identified in NCBIm35, to examine the potential location of splice sites. In order to compile a set of FL intronic L1 insertions we extracted L1s residing in introns and spanning more than 5000 nt using the RepeatMasker annotation present in Ensembl (Mus musculus v38.35). The data were then split into sense and antisense L1s. To automatically annotate the presence or absence of the splice sites, two L1Xplorer modules were developed: the first module utilizes the alignment-based search as determined by Matcher for splice sites within monomers and the second utilizes a HMMer-based search for splice sites within remaining parts of L1s.
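The length filter and the sense/antisense split described above can be sketched as follows. This is not the L1Xplorer code: the record type, field names and function name are hypothetical, coordinates are assumed to be 1-based and inclusive, and orientation is assumed to be defined by comparing the strand of the L1 hit with the strand of the host gene.

```python
from dataclasses import dataclass


@dataclass
class IntronicL1:
    """Hypothetical record for one intronic RepeatMasker L1 hit."""
    start: int        # 1-based, inclusive (assumed convention)
    end: int          # 1-based, inclusive (assumed convention)
    l1_strand: str    # "+" or "-"
    gene_strand: str  # strand of the host transcriptional unit

    @property
    def length(self) -> int:
        return self.end - self.start + 1


MIN_FL_LENGTH = 5000  # "spanning more than 5000 nt", as stated in the text


def split_full_length(hits):
    """Keep hits longer than MIN_FL_LENGTH and split them by orientation
    relative to the host gene. Returns (sense, antisense) lists."""
    sense, antisense = [], []
    for hit in hits:
        if hit.length <= MIN_FL_LENGTH:
            continue
        if hit.l1_strand == hit.gene_strand:
            sense.append(hit)
        else:
            antisense.append(hit)
    return sense, antisense


# Invented example records, for illustration only.
example_hits = [
    IntronicL1(100, 6500, "+", "+"),   # full length, sense
    IntronicL1(100, 6500, "-", "+"),   # full length, antisense
    IntronicL1(100, 900, "-", "+"),    # truncated, dropped
    IntronicL1(1, 5000, "+", "-"),     # exactly 5000 nt, dropped
]
sense_hits, antisense_hits = split_full_length(example_hits)
print(len(sense_hits), len(antisense_hits))
```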
Funding. TZ received funding from the European Commission within its FP6 Programme "Biosapiens", under the thematic area 'Life sciences, genomics and biotechnology for health', contract number LHSG-CT-2003-503265.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.