Sequence determinants in human polyadenylation site selection
© Legendre and Gautheret; licensee BioMed Central Ltd. 2003
Received: 10 December 2002
Accepted: 25 February 2003
Published: 25 February 2003
Differential polyadenylation is a widespread mechanism in higher eukaryotes producing mRNAs with different 3' ends in different contexts. This involves several alternative polyadenylation sites in the 3' UTR, each with its specific strength. Here, we analyze the vicinity of human polyadenylation signals in search of patterns that would help discriminate strong and weak polyadenylation sites, or true sites from randomly occurring signals.
We used human genomic sequences to retrieve the region downstream of polyadenylation signals, usually absent from cDNA or mRNA databases. Analyzing 4956 EST-validated polyadenylation sites and their -300/+300 nt flanking regions, we clearly visualized the upstream (USE) and downstream (DSE) sequence elements, both characterized by U-rich (not GU-rich) segments. The presence of a USE and a DSE is the main feature distinguishing true polyadenylation sites from randomly occurring A(A/U)UAAA hexamers. While USEs are indifferently associated with strong and weak poly(A) sites, DSEs are more conspicuous near strong poly(A) sites. We then used the region encompassing the hexamer and DSE as a training set for poly(A) site identification by the ERPIN program and achieved a prediction specificity of 69 to 85% for a sensitivity of 56%.
The availability of complete genomes and large EST sequence databases now permit large-scale observation of polyadenylation sites. Both U-rich sequences flanking both sides of poly(A) signals contribute to the definition of "true" sites. However, the downstream U-rich sequences may also play an enhancing role. Based on this information, poly(A) site prediction accuracy was moderately but consistently improved compared to the best previously available algorithm.
Alternative splicing, alternative transcription initiation and alternative polyadenylation are three mechanisms through which a variety of transcripts can be synthesized from a single eukaryotic gene. Beside the sheer number of genes, the combination of these mechanisms largely contributes to transcript diversity in complex genomes such as those of mammalians. Alternative polyadenylation occurs when two or more polyadenylation sites are present in the 3' untranslated region (UTR) of an mRNA, producing transcript isoforms of variable lengths. At least 22% of mRNAs  undergo alternative polyadenylation, often in a tissue- or time-specific manner . Since 3' UTRs may host important regulatory elements – for instance affecting stability, localization or translation – alternative polyadenylation may strongly affect the fate of mRNA and therefore gene function. Although critical, the mechanisms of polyadenylation site selection are not yet fully understood. In particular, it is not clear how the polyadenylation machinery is able to distinguish bona fide polyadenylation signals from the multiple look-alikes interspersed along the 3' UTRs.
Polyadenylation sites are primarily defined by a hexameric polyadenylation signal (PAS) of sequence AAUAAA or a one-base variant, located about 15 bases upstream of the cleavage site. This central motif can be flanked by optional upstream and downstream sequence elements (USE and DSE). The DSE is described as a U-rich , or GU-rich element located 20–40 bases 3' to the cleavage site ([4, 5] for reviews). It is present in a large proportion of genes, and its deletion has been shown to suppress polyadenylation . The presence of a USE has been described in fewer cases: in viruses  and in four human genes [7–9]. The position and sequence of the USE are poorly defined, although U-richness is also suspected. EST counts suggest that, when several polyadenylation sites are present in the same UTR, the distal site is generally the most efficient . However, polyadenylation efficiency does not strictly depend on the hexameric signal: "strong" sites may contain variant AGUAAA signals while "weak" sites may contain canonical AAUAAA hexamers. It is therefore suspected that sequence determinants affecting polyadenylation efficiency may lie in the flanking USE and DSE.
We used bioinformatics analysis of EST and genomic sequences to characterize biases in the regions encompassing 600 nucleotides around the cleavage site. Correlations between poly(A) site efficiency and sequence biases in flanking regions were identified. In addition, we observed that sequences in a downstream region broader than the DSE discriminate actual poly(A) sites from randomly occurring AAUAAA hexamers. We exploited this information in a computer program for polyadenylation site detection that presents a better specificity/sensitivity ratio than previous algorithms.
Results & Discussion
For genes with multiple polyadenylation sites, each site was classified into one of two categories according to its putative polyadenylation efficiency, as measured by relative EST counts (see Methods). "Strong" sites were those supported by more than 70% of the ESTs for this gene (645 sites), while "weak" sites were those supported by less than 30% of the ESTs (1200 sites). Genes with less than 10 supporting ESTs were ignored for this classification. Let us emphasize here that EST counts do not accurately reflect polyadenylation efficiency for any specific gene, since the observed poly(A) variants may come from different EST libraries with different sizes and amplification protocols. Here we used only non-normalized libraries and, since we made observations over a large number of genes, the biases observed for specific genes should neutralize each other and allow for general tendencies to emerge.
Two further groups of poly(A) sites were built: "unique" sites from UTR with a single poly(A) site (3776 sites), and "control" sites from AAUAAA signals for which no EST was ever observed (1249 sites).
A first consequence of Figure 2 is that the DSE is generally not a GU-rich region as often stated, but rather a U-rich region. A second concerns the importance of the USE element: although less conspicuous than the 3' element, upstream U-rich elements are certainly present in a large number of mRNAs, even if few USEs have been documented to date.
In the DSE, U-richness is significantly higher for "strong" or "unique" sites, than for "weak" sites. Student's test P value for U frequencies in the +20/+60 region is 2.8 10-10 for "strong" vs. "weak" and 2.2 10-16 for "weak" vs. "control". This suggests that the U-rich element in the +20/+60 region may act as an enhancer and help distinguish the most efficient sites in case of alternative polyadenylation. There is no such obvious correlation in the USE, which does not vary as strongly between "strong" and "weak" sites. Therefore, the USE is not as strongly linked to processing efficiency as the DSE.
We tried to further characterize the sequence bias in the USE and DSE regions using word count and Gibbs sampling algorithms. Although these methods are able to identify any k-letter words that is significantly enriched in a specific sequence set, they failed to identify any recurrent motif in either the USE or DSE region, whatever subset of sequence was considered: strong, weak, distal, proximal, etc. (data not shown). Therefore, there is no specific sequence motif associated with the general USE or DSE element. For the DSE, an absence of determined sequence motif is consistent with previous experiments showing that point mutations in this region had no effect on polyadenylation efficiency . However, our analysis does not exclude the presence of gene-specific motifs in certain USEs or DSEs.
Polyadenylation site detection
Measure of True Positives (TP), False Negatives (FN) and Sensitivity (SN) in the prediction of polyadenylation signals by the POLYADQ and ERPIN programs, based on a dataset of 982 annotated UTR sequences from the EMBL database. See Methods for information on database construction. ERPIN parameters were adjusted to match the sensitivity of POLYADQ.
Negative predictions and accuracy of the ERPIN and POLYADQ program, evaluated for different control sequences not containing polyadenylation sites: coding sequences (CDS), introns, and two types of randomized UTR sequences: simple shuffling or first order Markov simulation.
AWUAAA per 100 kb
FP per 100 kb
UTR Markov 1 st order
Having a larger collection of EST-validated poly(A) sites and a better knowledge of upstream and downstream regions discriminating true poly(A) sites from random AWUAAA hexamers, we asked whether this could be exploited to further improve poly(A) site detection. We used the 2327 "strong" and "unique" terminal sequences as a training set for the ERPIN program. ERPIN  is an extension of the classical weight matrix representation of sequence alignments, adapted to RNA sequences containing base-paired regions. From an RNA alignment, the program creates weight matrices corresponding to different parts of the molecule, as defined by users, and finds all sequences matching these matrices above a certain score threshold. Matrices can be defined for each helix, gapped single strand or ungapped single strand. Here we used only ungapped single strands, which are modelled by ERPIN using dinucleotide frequencies (see Methods). Dinucleotides are usually better than mononucleotides at capturing sequence biases in non-coding DNA or RNA sequences. We then tested different search strategies, varying the size and position of the regions used to define the weight matrices. In spite of the significant sequence bias in the USE region, no upstream region was beneficial to search accuracy. The highest accuracy was obtained using both the hexameric signal and the 0 to +46 nt downstream region. Table 1 and 2 show results obtained using this region, compared to POLYADQ results. To facilitate comparisons, ERPIN was calibrated to achieve the same sensitivity as POLYADQ on a test set of 982 annotated UTRs (Table 1). Then, the programs were compared for their ability to filter out false positives (Table 2).
Our control sets were sequences known not to contain poly(A) signals: coding sequences, introns and randomized UTRs. Each AWUAAA hexamer in these sequences was considered as a negative site. This AWUAAA background varies significantly between CDS (31 sites / 100 kb), randomized UTRs (about 100 sites / 100 kb) and introns (156 sites / 100 kb). We thus expected more false positives per kilobase in introns or UTRs than in CDS, and this is indeed what we observed, for both POLYADQ and ERPIN. However, both programs were more efficient in filtering out false positives in CDS or shuffled UTRs, than in 1st order Markov model UTRs or introns. For instance, there were only 3.7 false positives per 100 kb in CDS, instead of about 40 / 100 kb in introns (Table 2). The good performance on shuffled UTRs relative to 1st order Markov models may be due to dinucleotide biases in the downstream region that are lost after shuffling, although such biases do not appear very significant in dinucleotide counts (data not shown). Poly(A) site prediction is worst in intronic and 1st order Markov model UTRs, where about one out of four AWUAAA sites is predicted as true (specificity around 70%). However, we note a moderate but consistent gain in overall accuracy from POLYADQ to ERPIN. The smaller gain is for CDS sequences (0.483 instead of 0.459) and the largest in shuffled UTR sequences (0.494 instead of 0.415). This gain is surprising if one considers the simplicity of the ERPIN algorithm for ungapped single strands (based on a simple dinucleotide weight matrix) compared to POLYADQ.
Compared accuracy of the ERPIN and POLYADQ programs for the prediction of EMBL annotated poly(A) sites, and of EST-derived weak poly(A) sites identified in this study. The negative set for SP and CC calculation is "UTR shuffled" from Table 2.
EMBL annotated sites
A large fraction of human polyadenylation sites are flanked by U-rich elements, both upstream (USE) and downstream (DSE) of the cleavage site, located around positions 0 to -50 and +20 to +60, relative to the poly(A) signal. USE and DSE clearly distinguish true polyadenylation sites from randomly occurring AAUAAA hexamers. While the USE is not specifically associated with "efficient" poly(A) sites, a "U-rich" DSE may act as an enhancer specifying the most efficient sites in case of alternative polyadenylation. For the purpose of predicting poly(A) signals in genomic sequences, the most discriminating region includes the poly(A) signal and the 46 nt downstream sequence. Using the ERPIN program and a simple dinucleotide weight matrix description, we were able to improve the accuracy of poly(A) site predictions by 5% to 19% relative to POLYADQ, depending on sequence context. Such a gain can be helpful in annotation projects aiming at a better characterization of 3' non coding sequences.
Terminal sequence extraction
Polyadenylation sites were identified using the EST-Parser program (Beaudoing et al. 2000 ), based on the mapping of ESTs to UTR sequences. When at least two ESTs finished at the same position and less than 36 bases from a potential PAS (AAUAAA or one of 11 variants), a hypothetic cleavage site was considered as valid. Moreover, in the case of multiple polyadenylation sites, the numbers of ESTs observed at each site were taken as a measure of relative polyadenylation efficiency. For this measure, only non-normalized and non-subtracted EST libraries were considered. The EST database was dbEST (October 2001 release) and the UTR database was UTRdb release 13 . The procedure identified 6563 distinct Polyadenylation sites from 4956 3' UTRs.
The next task was to obtain the +/-300 nt region around each PAS. This region usually goes past the boundaries of the UTR, either because the PAS is too close to the 3' extremity of the UTR (most frequent case) or because it is too close from the Stop codon. In order to retrieve the 5' or 3' genomic regions, UTRs were aligned on the human genome working draft (NCBI Oct-2001 version) with the BLAST program. We retained alignments meeting the following criteria: length > 60 nt; E-value < 0.001; identity > 98%; dangling ends of less than 10 nt in both directions. The complete UTR and 300 nt flanking regions was then extracted from the genomic sequence, producing 5110 "extended UTR" sequences. Then, for each polyadenylation site in a UTR, the +/-300 nt region around the PAS was extracted. This region is referred to as a "terminal sequence" hereafter.
Polyadenylation site classification
For UTRs with alternative polyadenylation, sites were classified as follows (any site may belong to more than one category):
• Strong sites (645 terminal sequences): sites in UTR with at least 10 matching ESTs, more than 70% of which are associated to this site.
• Weak sites (1200 terminal sequences): sites in UTR with at least 10 matching ESTs, less than 30% of which are associated to this site.
Strong and weak sites were further subdivided into "distal" or "proximal", according to their position in the UTR. In each UTR, the 3'-most site was said "distal" and the 5'-most site was said "proximal". Numbers of terminal sequences were as follows. Strong proximal sites: 129; Strong distal sites: 499; Weak proximal sites: 655; Weak distal sites: 210.
We then defined as "Unique" sites (3776 terminal sequences) those sites from UTRs with a single EST-supported poly(A) site. Unique sites were further distinguished according to their proximity to the Stop codon: sites within 300 nt of Stop codon (1328 terminal sequences) and sites at distance 300 nt or more from Stop codon (2448 terminal sequences).
Finally, we defined "Control" sites (1249 terminal sequences) as regions containing an AATAAA hexamer and no matching EST within 60 nt around the hexamer. These controls did not include any AAUAAA hexamer located within 50 nt of the 3' end of the UTR, to avoid real sites not covered by ESTs.
Nucleotide composition analysis
Position-dependent compositions in the terminal sequences were measured in an 11 nucleotide sliding window, advancing by one nucleotide steps over the 600 nt region.
We used Erpin version 3.1 http://tagc.univ-mrs.fr/pub/erpin/. The training set was made of 2,327,600 nt terminal sequences (1632 "unique" sites + 695 "strong" sites) and is available at the same location. Optimal search parameters were determined empirically. The best results were obtained when searching first for the hexameric PAS with a score cutoff of 70%, and then searching for the 46 nt region immediately downstream of the PAS with a score cutoff of 74%. Score cutoffs are expressed in percentage of training set sequences retained. A 70% cutoff for the hexamer region amounts to retaining either AAUAAA or AUUAAA and rejecting any other variant. Searched regions were defined as ungapped single strands. Therefore, their weight matrices were computed by ERPIN using a lod-score for each pair of consecutive bases at positions i and i+1 :
Where Oi,i+1 is the observed frequency for the pair of consecutive bases at position i and i+1, and Ei, Ei+1 the expected frequencies of individual bases. All searches were conducted using the "-unifstat" option, which sets all expected base frequencies to 0.25, thus reducing the number of false positives in GC-rich regions.
Measure of site detection accuracy
True Positives (TP) and False Negative (FN)
Computational detection of "real" poly(A) sites was evaluated based on 982 EMBL sequence fragments containing 2000 nt upstream and 200 nt downstream of an annotated PAS (total: 2.16 Mb). For genes with multiple annotated PAS, only the most distal one was retained. We ensured that none of the training set sequences above was present in this database. TP was then measured as the number of annotated signals detected by the program and FN as the number of annotated signals undetected (Table 1). ERPIN parameters were adjusted so that TP and FN were nearly the same as with POLYADQ, which explains the nearly identical sensitivities in Table 1.
True Negative (TN) and False Positives (FP)
Mispredictions were evaluated on four distinct databases not expected to contain polyadenylation sites (Table 2), the size of which was adjusted so that the number of false sites (TN+FP) was the same as the number of true sites (TP+FN). The following databases were used: CDS sequences (4448 seq, 3.15 Mb) extracted from Genbank release 127; intronic sequences (154 seq, 628 kb) of the first intron from Genbank release 127; randomized UTR sequences (551 seq, 896 kb) of same mononucleotide composition as human 3' UTRs; and randomized UTR sequence (699 seq, 1.04 Mb) of same 1st order Markov model as human 3' UTRs. FP is the number of signals detected in those sequences and TN the number of AAUAAA or AUUAAA signals not detected in these sequences.
Predictive accuracy was then measured as follows:
For Table 3, we compared polyadenylation site prediction in the 982 annotated EMBL sequences above to that obtained in 848 complete UTR sequences (1.8 Mb) containing at least one "weak" poly(A) site, defined as above based on EST counts.
ML developed the computer scripts and conducted the statistical analyzes. DG directed the study and drafted the manuscript.
We thank Dr. Philippe Benech and Dr. Pascal Hingamp for their detailed reading of the manuscript and useful suggestions.
- Beaudoing E, Freier S, Wyatt JR, Claverie JM, Gautheret D: Patterns of variant polyadenylation signal usage in human genes. Genome Res. 2000, 10: 1001-1010. 10.1101/gr.10.7.1001.PubMed CentralView ArticlePubMedGoogle Scholar
- Edwalds-Gilbert G, Veraldi KL, Milcarek C: Alternative poly(A) site selection in complex transcription units: means to an end?. Nucleic Acids Res. 1997, 25: 2547-2561. 10.1093/nar/25.13.2547.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen F, MacDonald CC, Wilusz J: Cleavage site determinants in the mammalian polyadenylation signal. Nucleic Acids Res. 1995, 23: 2614-2620.PubMed CentralView ArticlePubMedGoogle Scholar
- Proudfoot N: Poly(A) signals. Cell. 1991, 64: 671-674.View ArticlePubMedGoogle Scholar
- Colgan DF, Manley JL: Mechanism and regulation of mRNA polyadenylation. Genes Dev. 1997, 11: 2755-2766.View ArticlePubMedGoogle Scholar
- Zhao J, Hyman L, Moore C: Formation of mRNA 3' ends in eukaryotes: mechanism, regulation, and interrelationships with other steps in mRNA synthesis. Microbiol Mol Biol Rev. 1999, 63: 405-445.PubMed CentralPubMedGoogle Scholar
- Moreira A, Takagaki Y, Brackenridge S, Wollerton M, Manley JL, Proudfoot NJ: The upstream sequence element of the C2 complement poly(A) signal activates mRNA 3' end formation by two distinct mechanisms. Genes Dev. 1998, 12: 2522-2534.PubMed CentralView ArticlePubMedGoogle Scholar
- Brackenridge S, Proudfoot NJ: Recruitment of a basal polyadenylation factor by the upstream sequence element of the human lamin B2 polyadenylation signal. Mol Cell Biol. 2000, 20: 2660-2669. 10.1128/MCB.20.8.2660-2669.2000.PubMed CentralView ArticlePubMedGoogle Scholar
- Aissouni Y, Perez C, Calmels B, Benech PD: The cleavage/polyadenylation activity triggered by a U-rich motif sequence is differently required depending on the poly(A) site location at either the first or last 3'-terminal exon of the 2'-5' oligo(A) synthetase gene. J Biol Chem. 2002, 277: 35808-35814. 10.1074/jbc.M200540200.View ArticlePubMedGoogle Scholar
- Chou ZF, Chen F, Wilusz J: Sequence and position requirements for uridylate-rich downstream elements of polyadenylation signals. Nucleic Acids Res. 1994, 22: 2525-2531.PubMed CentralView ArticlePubMedGoogle Scholar
- Zarkower D, Wickens M: A functionally redundant downstream sequence in SV40 late pre-mRNA is required for mRNA 3'-end formation and for assembly of a precleavage complex in vitro. J Biol Chem. 1988, 263: 5780-5788.PubMedGoogle Scholar
- Davuluri RV, Grosse I, Zhang MQ: Computational identification of promoters and first exons in the human genome. Nat Genet. 2001, 29: 412-417. 10.1038/ng780.View ArticlePubMedGoogle Scholar
- Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268: 78-94. 10.1006/jmbi.1997.0951.View ArticlePubMedGoogle Scholar
- Tabaska JE, Zhang MQ: Detection of polyadenylation signals in human DNA sequences. Gene. 1999, 231: 77-86. 10.1016/S0378-1119(99)00104-3.View ArticlePubMedGoogle Scholar
- Gautheret D, Lambert A: Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. J Mol Biol. 2001, 313: 1003-1011. 10.1006/jmbi.2001.5102.View ArticlePubMedGoogle Scholar
- Beaudoing E, Gautheret D: Identification of alternate polyadenylation sites and analysis of their tissue distribution using EST data. Genome Res. 2001, 11: 1520-1526. 10.1101/gr.190501.PubMed CentralView ArticlePubMedGoogle Scholar
- Pesole G, Liuni S, Grillo G, Licciulli F, Larizza A, Makalowski W, Saccone C: UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. 2000, 28: 193-196. 10.1093/nar/28.1.193.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.