Effect of 5'UTR introns on gene expression in Arabidopsis thaliana
© Chung et al. 2006
Received: 18 April 2006
Accepted: 19 May 2006
Published: 19 May 2006
The majority of introns in gene transcripts are found within the coding sequences (CDSs). A small but significant fraction of introns are also found to reside within the untranslated regions (5'UTRs and 3'UTRs) of expressed sequences. Alignment of the whole genome and expressed sequence tags (ESTs) of the model plant Arabidopsis thaliana has identified introns residing in both coding and non-coding regions of the genome.
A bioinformatic analysis revealed four notable features: (1) the density of introns in 5'UTRs is similar to that in CDSs but much higher than that in 3'UTRs; (2) the 5'UTR introns are preferentially located close to the initiating ATG codon; (3) introns in the 5'UTRs are, on average, longer than introns in the CDSs and 3'UTRs; and (4) 5'UTR introns have a different nucleotide composition to that of CDS and 3'UTR introns. Furthermore, we show that the 5'UTR intron of the A. thaliana EF1α-A3 gene affects gene expression and that the size of the 5'UTR intron influences the level of gene expression.
Introns within the 5'UTR show specific features that distinguish them from introns that reside within the coding sequence and the 3'UTR. In the EF1α-A3 gene, the presence of a long intron in the 5'UTR is sufficient to enhance gene expression in plants in a size-dependent manner.
Introns, first discovered in 1977 , are genomic sequences that are removed from the corresponding RNA transcripts of genes. The most abundant class are spliceosomal introns, which are found in the nuclear genomes of all characterized eukaryotes, and rely on spliceosomes - a complex that comprises five RNAs and hundreds of proteins - for successful splicing from RNA transcripts [2, 3]. There are two types of spliceosomal introns: (1) U2 introns, which are the most abundant and are spliced by the U2-type spliceosome, and (2) the rarer U12 introns (< 0.4%), which are spliced by the less abundant U12-type spliceosome . In this paper we consider only plant U2 spliceosomal introns.
A growing number of plant expression studies on chimeric RNA have demonstrated that such intron sequences can enhance the level of protein expression, a phenomenon termed Intron-Mediated Enhancement (IME) [4–10]. Inclusion of an intron in the 5' region of a gene, either in the 5'UTR or fused to the 5' portion of the coding sequence, leads to enhanced RNA levels [11–15]. While the degree of expression enhancement varies for each intron, up to a 1000-fold increase in protein accumulation has been reported . The alteration in RNA and protein accumulation is known to act post-transcriptionally . Nonetheless, the intrinsic determinants of 5'UTR IME in plants, especially those within the intron itself, remain poorly defined.
The plant Arabidopsis thaliana has a compact genome and generally small introns , consistent with the proposed correlation between intron size and genome size [19, 20]. On the other hand, intron length contributes to the energetic cost of transcription, which is proportional to the length of the transcript produced . Therefore, the fact that a significant number of 5'UTRs contain introns suggests that these, like coding sequence introns, may be functionally important. Mechanistically it is possible that the 5'UTR introns are involved in IME and act in the nucleus , and it has been proposed that IME results from synergistic interactions between the factors involved in the various steps of gene expression from transcription to translation . The elevated translational efficiency is most likely due to an increase in the affinity of mRNA for ribosomes via their interactions with exon junction complexes (EJCs), which are deposited on the mRNA 20–24 nucleotides upstream of introns during splicing [23–26].
Studies on plant introns have revealed a strong nucleotide bias toward T proximal to the AG intron acceptor site, and throughout the intron there is an A/T bias relative to the adjacent exon . While these nucleotide biases are believed to be required for efficient intron recognition and splicing in coding region introns , for introns that reside within the non-coding regions, there is no nucleotide bias that distinguishes intron from exon sequence. To date there are no studies on the statistical properties of 5'UTR introns on the genomic scale in multicellular eukaryotes. Here we present a comprehensive bioinformatic analysis of nucleotide composition, intron position, and intron-length distribution of all the annotated A. thaliana 5'UTR U2 introns supported by EST and cDNA data. Our results show that, firstly, the density of introns in the 5'UTRs is similar to that in the CDSs but much higher than that in the 3'UTRs; secondly, introns within the 5'UTR are not randomly distributed along the UTR but are more likely to be located closer to the ATG; thirdly, the introns that reside within the 5'UTR are, on average, significantly larger than the average intron found in both the CDS and 3'UTR; and finally, the sequences around the splicing junctions show distinct nucleotide biases that distinguish them from CDS and 3'UTR introns. Our findings indicate that 5'UTR introns may be subject to different selective forces from the introns in CDSs and 3'UTRs, possibly due to a specific regulatory role in gene expression. These observations are apparent in the well-annotated and relatively compact Arabidopsis genome.
To complement the bioinformatic analysis, an experimental analysis of the A. thaliana gene EF1α-A3 - which has an intron-containing 5'UTR - was undertaken in order to investigate what influence 5'UTR introns have on gene expression, and how this is affected by intron length. We confirm that the presence of the 5'UTR intron in EF1α-A3 increases gene expression levels [13–15], by 3-fold in transient assays and by over 10-fold in stable transgenic plants. In addition, a deletion series based on intron length showed that the expression level depends either on intron length or on motifs distributed throughout the 5' region of the intron.
The presence, frequency, length distributions, and structure of introns and exons have been extensively studied [29–35]. While it is known that the presence of a 5'UTR intron can enhance gene expression , not much is known about the underlying mechanism for this phenomenon. In this study, an extensive bioinformatic analysis of A. thaliana introns was undertaken, using the TAIR (The Arabidopsis Information Resource) database . This study focuses particularly on the length, position and nucleotide composition of CDS and UTR introns, in order to characterize the differences between 5'UTR, CDS and 3'UTR introns.
Table showing statistics of 5'UTR, CDS and 3'UTR.
Statistic                              5'UTR         CDS           3'UTR
Number of sequences
Sequences with introns
Total bases (genomic)                  2.6 × 10^6    6.8 × 10^7    5.0 × 10^6
Number of introns/nucleotide (mRNA)    1.6 × 10^-3   2.7 × 10^-3   2.9 × 10^-4
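The density comparison drawn from this table can be checked with simple arithmetic. The Python sketch below assumes the intron-per-nucleotide values are listed in the order 5'UTR, CDS, 3'UTR (as the caption suggests); the dictionary and function names are illustrative only.

```python
# Intron densities (introns per mRNA nucleotide) taken from the table.
# Assumption: values are listed in the order 5'UTR, CDS, 3'UTR.
density = {
    "5'UTR": 1.6e-3,
    "CDS": 2.7e-3,
    "3'UTR": 2.9e-4,
}

def fold_difference(region_a, region_b):
    """Return how many times denser introns are in region_a than in region_b."""
    return density[region_a] / density[region_b]

utr5_vs_utr3 = fold_difference("5'UTR", "3'UTR")
cds_vs_utr5 = fold_difference("CDS", "5'UTR")

print("5'UTR relative to 3'UTR:", round(utr5_vs_utr3, 1))
print("CDS relative to 5'UTR:", round(cds_vs_utr5, 1))
```

Under this reading, the 5'UTR and CDS densities differ by less than two-fold, whereas the 5'UTR density exceeds the 3'UTR density by more than five-fold.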
Post-transcriptional gene silencing (PTGS) or RNAi is a mechanism that degrades RNA transcripts after they have been transcribed . PTGS is activated by dsRNA to produce siRNA that can act as signalling molecules, promoting a cascade of mRNA degradation [49, 50]. As IME is a form of post-transcriptional regulation, it is possible that the inclusion of certain introns within the 5' region of a gene reduces the RNA's susceptibility to siRNA, through some unknown mechanism. In order to assess the involvement of PTGS in IME, a similar transient assay using plasmids pGEF1I and pGEF1Idel (Figure 5) was performed using a modified version of the pSoup helper plasmid that is able to suppress gene silencing . To suppress gene silencing, the p19 expression cassette from pBIN 61-P19 was cloned into pSoup and was therefore resident within the same Agrobacterium cell as the dual-luciferase reporter cassette. The P19 protein is a viral silencing suppressor from tomato bushy stunt virus (TBSV), which prevents activation of PTGS . Expression of both the LUC and REN reporter genes was enhanced when co-expressed with P19, both with and without the intron. However, the enhancement conferred by the 5'UTR intron in the presence of P19 (0.2185 ± 0.0183 with intron and 0.0692 ± 0.0073 without intron, approximately 3.2-fold) was consistent with that obtained without P19 (0.1089 ± 0.0081 with intron and 0.0315 ± 0.0034 without intron, approximately 3.5-fold). As the IME levels in the presence and absence of P19 are similar, we conclude that no component of this IME can be attributed to PTGS.
These observations suggest that IME in EF1α-A3 is mediated not by one but by three or more elements distributed over the 5' 350 nucleotides of the 5'UTR intron, with relatively little effect from the 3' 250 nucleotides. Indeed, multiple AT-rich stimulatory elements have been previously described in plants [10, 30–33] and, consistent with this, the EF1α-A3 5'UTR intron is AT-rich. It is interesting that the three inferred IME elements - distributed over 300 nucleotides - require a very significant 'increase' in intron length when compared with the median lengths (CDS introns 98 nucleotides; 5'UTR introns 253 nucleotides). Potentially, this offers an explanation as to why the median 5'UTR intron length is so much greater than the median CDS intron length, although experiments on more genes would be required to confirm this.
A growing number of plant expression studies have revealed that the presence of an intron within the 5'UTR induces enhanced RNA and protein accumulation. However, the intrinsic determinants of 5'UTR IME in plants, especially the role of any sequence motifs within the intron, remain poorly defined. In this paper, we have presented extensive statistical analyses of all the annotated A. thaliana 5'UTR introns in the TAIR database and shown that 5'UTR introns are noteworthy in terms of their nucleotide composition around the splicing donor and acceptor sites, the distribution of intron sizes, and their position within the UTR and proximity to the ATG start codon. In addition, we have shown that, not only can the presence of an intron in the 5'UTR significantly enhance gene expression in at least one gene, but intron length also influences the level of gene expression. These results should be beneficial in determining the mechanism of IME in plants, as well as determining the origin and role of 5'UTR introns. As these introns are not embedded within coding sequence, the flanking nucleotides can be modified without interfering with the open reading frame (cf. CDS introns). We believe that this makes the introns that reside within non-coding sequences a powerful resource to assist in unravelling the role of introns within the genome.
Files extracted from TAIR FTP site . Because a small amount of mis-annotation may interfere with the bioinformatic analysis, UTRs less than 4 nt in length and introns less than 6 nt in length were excluded from the analysis.
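The comparison of IME with and without P19 reduces to a ratio of the reported relative luciferase activities. The Python sketch below computes the fold enhancement in each condition from the values quoted above; the uncertainty on each ratio uses first-order error propagation and assumes the two measurements are independent, which is our assumption and not a method stated in the text.

```python
import math

def fold_enhancement(with_intron, err_with, without_intron, err_without):
    """Ratio of activities (with intron / without intron) and its propagated error.

    Assumes independent errors and first-order (relative-error) propagation.
    """
    ratio = with_intron / without_intron
    rel_err = math.sqrt((err_with / with_intron) ** 2
                        + (err_without / without_intron) ** 2)
    return ratio, ratio * rel_err

# Values quoted in the text (relative luciferase activity, mean ± error).
ime_p19, ime_p19_err = fold_enhancement(0.2185, 0.0183, 0.0692, 0.0073)
ime_no_p19, ime_no_p19_err = fold_enhancement(0.1089, 0.0081, 0.0315, 0.0034)

print(f"IME with P19:    {ime_p19:.2f} ± {ime_p19_err:.2f}")
print(f"IME without P19: {ime_no_p19:.2f} ± {ime_no_p19_err:.2f}")
```

Under these assumptions the two ratios differ by less than their combined uncertainty, in line with the conclusion that P19 does not change the relative enhancement.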
Type of data
Coordinates and sequences of 3'UTRs
Coordinates and sequences of 5'UTRs
Coordinates and sequences of introns
Coordinates of CDSs, ORFs, exons and genes
Chromosome 1 - complete sequence
Chromosome 2 - complete sequence
Chromosome 3 - complete sequence
Chromosome 4 - complete sequence
Chromosome 5 - complete sequence
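The length thresholds used to exclude likely mis-annotations can be expressed as a simple filter. The following Python sketch is illustrative only: the function name and the toy length lists are hypothetical, and only the two cut-offs (4 nt for UTRs, 6 nt for introns) come from the text.

```python
MIN_UTR_NT = 4     # UTRs shorter than this were excluded
MIN_INTRON_NT = 6  # introns shorter than this were excluded

def filter_by_min_length(lengths, minimum):
    """Keep only the features whose length is at least `minimum` nucleotides."""
    return [n for n in lengths if n >= minimum]

# Hypothetical feature lengths, for illustration only.
toy_utr_lengths = [2, 3, 4, 87, 240]
toy_intron_lengths = [1, 5, 6, 98, 253]

kept_utrs = filter_by_min_length(toy_utr_lengths, MIN_UTR_NT)
kept_introns = filter_by_min_length(toy_intron_lengths, MIN_INTRON_NT)
print(kept_utrs, kept_introns)
```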
Data were processed using C-shell scripts, C++ programs, and the statistics package R version 2.0.1 . Basic statistics (means, standard deviations, regression, etc.) and statistical tests (t-tests, Kolmogorov-Smirnov tests etc.) were calculated using R. Also, Figures 1, 2, 6, 7 and 9 were drawn using R.
The Monte-Carlo simulation program - written in C++ - was used to calculate the expected position distribution of 5'UTR introns if introns were distributed uniformly (i.e. constant insertion probability after any given nucleotide) throughout 5'UTRs. To do this, all of the original introns were extracted from all of the original 5'UTRs, and then they were randomly re-inserted into the processed 5'UTR sequences. This was repeated 10,000 times, and the average intron position distributions were calculated. Sequence logos were drawn using WebLogo version 2.8 . The input sequences were extracted from the chromosome files [Table 2] using a C++ program, and the nucleotide frequency matrices around splice junctions were calculated using C-shell scripts.
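The uniform-insertion null model described above can be sketched in a few lines. The Python code below is not the C++ program used in the study; it is a minimal illustration that assumes each UTR is summarised by its spliced length and its number of introns, that an intron may be inserted after any nucleotide except the last, and that positions are binned by relative distance along the UTR. The function name, input format and binning are all our assumptions.

```python
import random

def simulate_uniform_positions(utrs, n_trials=10_000, n_bins=10, seed=0):
    """Average histogram of relative intron positions under uniform insertion.

    utrs: list of (spliced_utr_length, number_of_introns) pairs (hypothetical input).
    Returns the mean number of introns per relative-position bin per trial;
    bin 0 is the 5' end of the UTR and the last bin is adjacent to the ATG.
    """
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_trials):
        for length, n_introns in utrs:
            for _ in range(n_introns):
                # Insert after nucleotide 1..length-1, each equally likely.
                pos = rng.randint(1, length - 1)
                bin_index = min(pos * n_bins // length, n_bins - 1)
                counts[bin_index] += 1
    return [c / n_trials for c in counts]

# Toy input for illustration: three UTRs carrying four introns in total.
toy_utrs = [(120, 1), (250, 2), (60, 1)]
expected = simulate_uniform_positions(toy_utrs, n_trials=2000)
print([round(x, 2) for x in expected])
```

Comparing such an expected distribution with the observed one is what allows a preference for positions near the ATG to be detected.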
The 1.87 kb promoter of the Arabidopsis EF1α-A3 (AT1G07940) gene was isolated from genomic DNA of 'Columbia' using primers RPH-130 (TCTAGAATGGTACCTAATTACTTCAC) and RPH-131 (CTCTTTACCCATGGTTAGAGACTG). These primers altered the sequence at the ends of the promoter fragment, introducing a KpnI site at the 5' end (RPH-130) and an NcoI site at the ATG of this gene (RPH-131). The PCR product was cloned into pGEM-T Easy (Promega) and sequenced to ensure accurate amplification. Similarly, four other promoters were isolated: AT1G10670 (0.651 kb) CAS-001 (GGTACCCACAAATGGAATGGTTGAAG) and CAS-002 (CTTCCTCGCCATGGCAAAACGAAAACTGG); AT1G13980 (2.1 kb) CAS-013 (GGTACCTAGAGGTGTGTATGATAATG) and CAS-014 (CCATGGAATCTGCTCAAATCTTCAGCCAG); AT1G17470 (0.92 kb) CAS-017 (GGTACCTGTAGCGTTTCTACTCTCGT) and CAS-018 (CCATGGTGCTTCACTTGTTTTTGC); AT1G72050 (2.7 kb) CAS-021 (GGTACCATTCGGTCACTGAAGACAC) and CAS-022 (CCATGGTGCGTGATCGAGGCTTACTTGC). In all cases, a KpnI site was introduced at the 5' end of the promoter and an NcoI site at the ATG.
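The restriction sites carried by each cloning primer can be confirmed by a simple string search. The Python sketch below uses the primer sequences listed above together with the standard recognition sequences for KpnI (GGTACC) and NcoI (CCATGG); the helper name is illustrative. Both sites are palindromic, so searching one strand is sufficient.

```python
# Standard recognition sequences for the two enzymes.
SITES = {"KpnI": "GGTACC", "NcoI": "CCATGG"}

# Primer sequences as listed in the text (forward, reverse per promoter).
PRIMERS = {
    "RPH-130": "TCTAGAATGGTACCTAATTACTTCAC",
    "RPH-131": "CTCTTTACCCATGGTTAGAGACTG",
    "CAS-001": "GGTACCCACAAATGGAATGGTTGAAG",
    "CAS-002": "CTTCCTCGCCATGGCAAAACGAAAACTGG",
    "CAS-013": "GGTACCTAGAGGTGTGTATGATAATG",
    "CAS-014": "CCATGGAATCTGCTCAAATCTTCAGCCAG",
    "CAS-017": "GGTACCTGTAGCGTTTCTACTCTCGT",
    "CAS-018": "CCATGGTGCTTCACTTGTTTTTGC",
    "CAS-021": "GGTACCATTCGGTCACTGAAGACAC",
    "CAS-022": "CCATGGTGCGTGATCGAGGCTTACTTGC",
}

def sites_in(sequence):
    """Return the names of the enzymes whose recognition site occurs in sequence."""
    return [name for name, site in SITES.items() if site in sequence]

site_map = {name: sites_in(seq) for name, seq in PRIMERS.items()}
for name, found in site_map.items():
    print(name, found)
```

Each forward primer carries only the KpnI site and each reverse primer only the NcoI site, matching the cloning strategy described.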
Intron-containing promoter-UTR clones were modified to become intronless by a version of inverted PCR. In the case of AtEF1α-A3 (AT1G07940), two primers were utilized. A forward primer, RPH-133 (CTCAGAGATATCGCAAGAGAG), corresponded to the region of the sequence at the 3' end of exon I, the sequence that precedes the 5'UTR intron; this primer is homologous to the complementary strand and primes towards the upstream promoter region. A reverse primer, RPH-134 (ATTTGTTTGACAGTCTCTAAC), corresponded to the 5' region of exon II, the sequence that directly follows the 5'UTR intron. These two primers were used in a PCR amplification with Pwo polymerase (Roche), using the pGEM-T Easy clone of the intron-containing promoter as template. PCR products were then treated with polynucleotide kinase (NEB) and re-circularized with T4 DNA ligase (NEB). Because of the blunt termini created by the Pwo polymerase, the ligated product excludes the intervening sequence between the divergent primers and, in our case, precisely removed the intron sequence as it would be in the processed mRNA. In the same way, intronless versions for 4 other Arabidopsis promoter clones were generated: AT1G10670, CAS-003 (CGTTGAGAGAGAATGGGGGTAG) and CAS-004 (TTTTCGTTTTGCCATGGCGAGG); AT1G13980, CAS-015 (TGTTTCTCCAGCGATTCAGAG) and CAS-016 (TAATGATTGAGTTTGGCCTCTATC); AT1G17470, CAS-019 (AGTGGAATTGGTGAAGGGCG) and CAS-020 (ATAGTAGCAAAAACAAGTGAAGC); AT1G72050, CAS-023 (CTGATGATGAGATTGGATGTG) and CAS-024 (ATGGAACTTGAAGAGGAAAGAG) for intron 1, CAS-025 (CTCGAGCGAATGACTCTGCA) and CAS-026 (GAAATAAATAGCCTTTTGTTT) for intron 2. As AT1G72050 has two 5'UTR introns, two rounds of intron-out-PCR were needed to generate the intronless version of this promoter.
Promoter fragments were subcloned into the binary vector pGreenII 0800-LUC . This vector includes the Renilla luciferase (Promega) reporter gene (REN) under the transcriptional regulation of 35S promoter and CaMV terminator , and a promoter-less firefly luciferase LUC (Promega) with a CaMV terminator. The 5' end of the firefly luciferase contains multiple cloning sites suitable for the insertion of promoter fragments forming translational fusions. Binary vectors were electroporated into Agrobacterium GV3101 (MP90) according to . As the initiating ATG of firefly luciferase has an NcoI site (CCATGG), the fusion between promoter-5'UTR and reporter gene contains no intervening sequence. The intron-containing and intronless constructs containing the AtEF1α-A3 promoter were converted to stable plant transformation vectors by inserting a nos-kan selection cassette  downstream of the LUC reporter gene.
Deletions within the EF1α-A3 intron were achieved by performing the same PCR method used to remove the whole 5'UTR intron described above. A combination of diverging primers to the 5' (primer A, B or C) and 3' (primer 1, 2 or 3) of the deletion point were used in generating nine intron deletions. Primer A (RPH-258, GATCAACAGAAGAGAAAGAAGCA), Primer B (RPH-259, CACCACAGATCAGAAATTCCAAA), Primer C (RPH-260, GAACCAGATCGATCATATAGTTTA), Primer 1 (RPH-261, AAGTCTACTGTTTTTCTTGATTC), Primer 2 (RPH-262, AGGGTCGCTTAGCTCAGTTGATA), Primer 3 (RPH-263, AGCATAAACAATCAATTGATTCA).
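The nine deletion constructs follow from pairing each of the three 5' primers with each of the three 3' primers. The Python sketch below enumerates these combinations using the primer labels given above; the data layout is illustrative and does not model the deletion end points themselves.

```python
from itertools import product

# Primer labels and names as given in the text.
FIVE_PRIME = {"A": "RPH-258", "B": "RPH-259", "C": "RPH-260"}
THREE_PRIME = {"1": "RPH-261", "2": "RPH-262", "3": "RPH-263"}

# Every 5' primer paired with every 3' primer gives one deletion construct.
deletions = [
    (five + three, FIVE_PRIME[five], THREE_PRIME[three])
    for five, three in product(FIVE_PRIME, THREE_PRIME)
]

for label, primer_5, primer_3 in deletions:
    print(label, primer_5, primer_3)
```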
Reporter gene plasmids were electroporated into Agrobacterium GV3101 (MP90) . Agrobacterium were cultured in Lennox agar (Invitrogen) with 50 mg/ml kanamycin (Sigma) at 30°C for 3 days. Cells were re-suspended in infiltration media (10 mM MgCl2, 10 μM acetosyringone) until OD600 = 0.2 and allowed to incubate at room temperature for 2 hours prior to infiltration. Re-suspended Agrobacterium were infiltrated into the leaves of 3–4-week-old Nicotiana benthamiana (16 h day length, 22°C) and the plants were allowed to grow for a further 3 days. Infiltrated patches were ground in 500 μL Passive Lysis Buffer (PLB) (Promega), diluted (5:500 in PLB), and 5 μL was used for the dual-luciferase assay using the Dual-Luciferase Reporter (DLR™) Assay System (Promega). Relative light units (RLU) were measured over 15 seconds following 5 seconds of delay with a Turner 20/20 luminometer. Relative luciferase activity was calculated by performing a regression analysis from 6 independent measurements using the statistics package R version 2.0.1 .
Transgenic A. thaliana 'Columbia' plants were created via the floral dip method  with the two stable plant transformation vectors containing the AtEF1α-A3 promoter and a nos-kan selection cassette.
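One way to obtain a relative luciferase activity from paired readings is to regress LUC on REN and take the slope. The text does not specify the regression model, so the Python sketch below is only an assumption: it fits a least-squares line through the origin. The six pairs of readings are invented for illustration and are not data from the study.

```python
def slope_through_origin(x, y):
    """Least-squares slope of y on x for a line constrained through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Invented RLU readings for six infiltrated patches (illustration only).
ren_rlu = [1200.0, 1850.0, 2400.0, 3100.0, 3900.0, 4600.0]
luc_rlu = [125.0, 180.0, 245.0, 305.0, 395.0, 455.0]

relative_activity = slope_through_origin(ren_rlu, luc_rlu)
print(f"Relative LUC/REN activity: {relative_activity:.4f}")
```

Using a slope across all measurements, instead of averaging individual LUC/REN ratios, gives more weight to patches with stronger signal; whether that matches the original analysis is not stated.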
The authors wish to thank Karen Bolitho and Karryn Grafton for assistance with the production of transgenic Arabidopsis, Julie Nicholls for maintaining plants in the glasshouse, and William Laing and Erika Varkonyi-Gasic for useful comments on the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.