Similarities and differences of polyadenylation signals in human and fly
© Retelska et al. 2006
Received: 22 December 2005
Accepted: 12 July 2006
Published: 12 July 2006
Skip to main content
© Retelska et al. 2006
Received: 22 December 2005
Accepted: 12 July 2006
Published: 12 July 2006
Cleavage of messenger RNA (mRNA) precursors is an essential step in mRNA maturation. The signal recognized by the cleavage enzyme complex has been characterized as an A rich region upstream of the cleavage site containing a motif with consensus AAUAAA, followed by a U or UG rich region downstream of the cleavage site.
We studied these signals using exhaustive databases of cleavage sites obtained from aligning raw expressed sequence tags (EST) sequences to genomic sequences inHomo sapiensandDrosophila melanogaster. These data show that the polyadenylation signal is highly conserved in human and fly. In addition,de novomotif searches generated a refined description of the U-rich downstream sequence (DSE) element, which shows more divergence between the two species. These refined motifs are applied, within a Hidden Markov Model (HMM) framework, to predict mRNA cleavage sites.
We demonstrate that the DSE is a specific motif in both human andDrosophila. These findings shed light on the sequence correlates of a highly conserved biological process, and improvein silicoprediction of 3' mRNA cleavage and polyadenylation sites.
The process of mRNA 3' end formation requires pre-mRNA cleavage followed by polyadenylation. Recent reports support the hypothesis that cleavage occurs during transcription and directly influences transcription termination in yeast and human cells [1,2]. mRNA cleavage in mammals is thought to be controlled by two dominant sequence signals, the well defined polyadenylation signal (PAS) located 10–30 bases upstream of mRNA cleavage site (CS) and a less conserved U-rich sequence, called downstream sequence element (DSE) and located within the first 30 nucleotides downstream of the CS. Protein complexes involved in cleavage and polyadenylation have been identified: the PAS is bound by the cleavage and polyadenylation specifiCity factor (CSPF), while the DSE recruits the cleavage stimulation factor (CstF) (reviewed in ).
In mammals, the PAS has been identified early on from its high degree of conservation  and studies have shown that point mutations of the consensus sequence AAUAAA decrease or abolish polyadenylation efficiency . However, genome-wide analysis of PAS clearly shows that some variants which are less efficientin vitroare functionalin vivo[6,7]. The second important region, the DSE, is less conserved in mammalian genes and no clear consensus signal has been defined . Tabaska and Zhang suggested a consensus motif based on motif searches in ~100 test sequences . Counting overrepresented words identified U-rich hexamers in genome-aligned fly ESTs , but failed to identify a conserved DSE motif in human. Selex experiments reported binding of CstF to a small number of related sequences (reviewed in ), while NMR studies suggested that a stretch of two adjacent uridines is crucial for CstF binding .
The position of the CS was shown to be primarily defined by the distance between the PAS and the DSE, and to occur preferentially 5' of an adenine . Survey of Unigene clusters show that the positions in the CSs of individual ESTs in a cluster tend to vary over a distance of 30–40 nucleotides . This micro-variability in the cleavage position further suggests that the fixed anchors governing the cleavage process are the PAS and DSE and not the CS itself.
Accurate characterization of 3' termination is particularly relevant for gene prediction programs. To define the 3'-ends of genes, the HMMgene program scans for the occurrence of the AAUAAA hexanucleotide (A. Krogh, personal communication) while Genscan  applies a simplified model of the mRNA cleavage site that uses an approximate scoring for the PAS irrespective of the presence of a DSE. Thus, these programs are prone to false positives and ignore mRNA CSs that do not match this motif. Attempts to take into account other sequence features of the CS region in human sequences used different computational approaches. In a program called poladq, Tabaska and Zhang  included the DSE motif using a quadratic discriminant function. Legendre and Gautheret  designed a weight matrix based on the nucleotide composition in the 46 bases following the PAS. Both programs search the genomic sequence for the occurrence of a PAS, and then scan the putative positives for further signals. Recent reports have successfully demonstrated the applicability of Hidden Markov Models (HMMs) to the problem inS. cerevisiae orC. elegans.
Here, we take advantage of the systematic mapping of the raw data from >1'500'000 human 3'ESTs  and of >30'000Drosophila3'ESTs to their respective genomes to model and compare PAS and DSE signals in both species using a HMM framework. This study establishes that the main features of mRNA cleavage regions are clearly conserved between both species, with a highly conserved PAS signal and the presence of a DSE in both species. Model assessment showed that including the DSE in HMMs significantly improves the accuracy of CS predictions.
Frequency (%) of PAS motifs reported by Beaudoinget al. in the 50 bp upstream of documented mRNA cleavage sites for 7000 human genes (2048 fly genes).
Other noticeable features include an A-rich region 15–25 bases upstream of the CS (positions 20–35 in Figure2A, encompassing the PAS), and a U-rich region 10–25 bases downstream of the CS (positions 60–75). The A-rich region upstream of the CS is followed by a narrower (~5 nt) U-rich peak, then a small A-rich region, and possibly another U-rich peak immediately before the mRNA CS. An increase in Gs is visible between 5 and 10 nucleotides after the cleavage site, partly overlapping with the broad U-rich peak.
Interestingly, all features described above are conserved inDrosophila(Figure2B). The A frequency at position 51 reaches 90%, which is noticeably higher than in human regions. We computed the average A frequency in the 10 bases encompassing mRNA CS. In human the A frequency in this region is 32,3%, which is close to the average sequence composition of the polyadenylation regions.Drosophilasequences show a local increase to 45,4% A's near the CS compared to a 34,25% average A content. The coarse compositional bias in the mRNA cleavage region is clearly conserved between both species, pointing to a functional conservation of mRNA cleavage and possibly a participation of several observed regions in the cleavage process.
Nucleotide frequency of PAS and DSE profiles.
Model parameters were optimized and performances were evaluated using two different approaches, one in which we compared optimal scores (Forward decoding) for real polyadenylation sites-containing sequences with scores obtained from randomized sequences, and the other in which we evaluated the performance of the model in predicting the positions of documented polyadenylation sites.
Furthermore, to assess the specifiCity of the DSE weight matrices, we substituted the DSE matrix with U-rich profiles. The DSE matrix was replaced with a profile of the whole U-rich region (bases 55 to 80, see figure2Aand2B). The use of this 25-nt weight matrix decreases the specifiCity of polyadenylation site prediction in both human and Drosophila 3'UTRs (Figure5Aand5B, green curve). Both DSE profiles include specific positions that have strong preferences for non-U (C or G) nucleotides (figure3Cand3E). To assess whether these significantly contributed to the DSE specifiCity, positions 4 and 6 in the human DSE matrix and position 2 inDrosophilaDSE were replaced with background nucleotide frequency. Each of these small changes of the DSE profiles resulted in a systematic decrease in specifiCity (Figure5Aand5B, blue curves).
Since mRNA cleavage is thought to occur at the 5'-most site having a sufficiently strong cleavage signal , we designed a prediction paradigm that mimics the biological cleavage process. After model optimization (cf. below), we computed the posterior label probability (PLP) of the PAS for each nucleotide in each human test sequence. To measure the sensitivity and specifiCity of the predictions, a range of thresholds in PLP was chosen. For each fixed threshold and each sequence, we checked if the first position after the stop codon with a PLP above the threshold matches a known polyadenylation site. More precisely, the prediction was considered a true positive (TP) if its position occurred at most 40 bases upstream of the 5' end of a documented site (shown in Figure S1). Predictions falling outside were considered false positives (FP) and sequences having no prediction exceeding the threshold were counted as false negatives (FN). As our specifiCity and sensitivity assessment relies on the assumption that all existing mRNA cleavage sites have been documented, we applied this procedure only to human 3'UTRs. For Drosophila, the coverage of 3P tags is unlikely to be complete.
To investigate polyadenylation signals we used an exhaustive and unbiased database of CSs based on systematic mapping of all expressed sequences (cf. Methods) to genomes. Applying the same protocol to human and fly allowed pointing out strong similarities in the organization of their mRNA cleavage site regions. Based on these datasets, we identified PAS and DSE in both species, and confirmed the specifiCity and relevance of these signals for mRNA cleavage site prediction.
We defined as trustable CSs all genome-aligned EST reads containing a polyA sequence not present on the genome. Thus, our data differed from those assembled by  or  in two main aspects. First, we did not require the CS ESTs to overlap with a RefSeq, which allows us to take into account distant polyadenylation sites that might not be covered by mRNA sequences in RefSeq, which are very often incomplete at their 3' ends. Similarly, in , the candidate EST tags were derived from Unigene clusters, which might also counterselect distant downstream polyadenylation sites. Secondly CS were not selected for the presence of well-described PAS variants. In fact, scanning our CSs for the 13 common PAS variants  confirmed most of them and revealed very similar distributions in both species. The relative frequency of PAS motifs might reflect the efficacy of their recognition by the CSPF, and suggests that the specifiCity of this enzyme is well conserved between the two species. Interestingly, 15% of human mRNA cleavage sites (22% inDrosophila) do not have any of the previously described PAS motifs. Absence of over-represented signals in these suggests that cleavage might happen via an alternative mechanism. Compared to the numbers reported by Beaudoing at al , and Tian et al , we observe a lower frequency of all known polyadenylation variants, and a higher proportion of sequences without any known variant. We performed additional tests that ruled out a widespread contamination of our dataset with false priming sequences. Rather, detailed analysis suggests that the proportion of the main PAS variants increases with the number of tags documenting a PAS, so that the lower numbers that we observe are due to the more complete coverage of our dataset. Consistently with what was found in , the fine structure of the features observed in the nucleotide profiles in human mRNA CS region is largely conserved inDrosophila, while studies of CSs inC. elegans andS. cerevisiae(reviewed in ) showed different profiles. Recent phylogenetic studies confirm that arthropods are closer to chordates than to nematodes . Our observation is thus consistent with this finding, and may indicate that CS organization in human andDrosophila melanogastersequences evolved before separation of arthropods and chordates.
The similarities between human andDrosophilamRNA CS regions led us to search for finer sequence motifs. We consistently identified the well-known PAS motif AAUAAA and a weaker U-rich motif that we interpreted as the DSE. Our consensus sequence for the human DSE (Figure3C) is U [U/G]UCU [G/U]U. Tabaska and Zhang identified a similar DSE using a much smaller dataset and a motif-search tool (gibbs-seq) searching for two copies per fragment, while we required only one occurrence per sequence. A recent detailed study focusing on motifs involved in alternative polyadenylation  reports 4 conserved UG-rich motifs in the DSE regions, one of them (CDE.1) is very similar to the motif we found in all polyadenylation site tags. DSE weight matrices derived from human 3UTRs with one of the four main PAS variants [see Additional file 1, table1] are extremely similar, although they slightly more polarized in the sequences lacking the AAUAAA variant. Our refined DSE motif, a U-rich motif with conserved G residues, confirms the presence of a DSE in Drosophila, as suggested earlier in word-searches using a smaller set of ESTs . Its similarity with the human DSE is consistent with observations that crucial residues for RNA binding in C-terminus of the RNA-binding domain of CstF-64 are conservatively substituted inDrosophila. It is possible that the lower information content of the DSE as compared to the PAS reflects true biological variability of DSE, important for the control of alternative polyadenylation. Variability in DSE motifs might influence this process as it was postulated that the presence of sub-optimal DSE sequences causes a differential choice of polyadenylation sites depending on the expression level of CstF protein . We investigated the relationship between the PAS and DSE motifs, but we do not see significant difference between the DSE WMs derived from sequences with different PAS variants [see Additional file 1 table2]. However, the DSE motifs could be tissue-specific. Different consensus sequences might be bound by tissue-specific isoforms of CstF-RNA binding subunits  or competitor proteins .
We used the derived PAS and DSE motifs to build a probabilistic model of human andDrosophilamRNA CSs. Previously constructed tools for human CS prediction used a quadratic discriminant function  or were based on RNA secondary structure . Both tools require the presence of the main AAUAAA or AUUAAA variants of the polyadenylation motif. We chose the HMM framework, which makes our approach suitable for easy incorporation in gene prediction programs. Similar approaches have been used for prediction of mRNA cleavage in yeast  andC. elegans, with more complex models of the polyadenylation region including 4 to 6 distinct subsequent signal regions interspersed with higher order background models. Our model consists of the PAS and DSE weight matrices chained together by a Gaussian zero-th order background spacer reflecting the distance between the 2 motifs (Figure4A). We have verified that higher order background models did not change the predictions significantly. In both human andDrosophila, inclusion of DSE elements in HMMs increased the accuracy of the model predictions. However, our model does not include a signal for the CS itself. These putative signals cover only one or two bases, and have variable locations with respect to both PAS and DSE (microvariability in the CS ). Using CS signals did not improve our predictions. We assessed the performance of the model in human sequences using documented mRNA cleavage sites and our model performed better in mRNA CS recognition than previously reported tools (cf. Figure5). In test sequences containing 3'000 bp downstream of the translation stop codon, we achieved a specifiCity of 70% for a sensitivity of 80%. Notice that we did not require the presence of the main variants of the PAS. However, as this matrix is more informative than the DSE, PLP scores are dominated by the PAS and sites strongly differing from the consensus will tend to be missed (false negatives, [see Additional file 1, Figure1]). So even though our prediction specifiCity is high, differentially weighting the PAS and DSE might lead to further improvements. In fact, the reported specifiCity might be underestimated, since our evaluation procedure assumed that all existing polyadenylation sites are documented with 3' tags. This probably results in our overestimation of false positives. One extension to improve our current model would be to exploit comparative genomics to define more precise DSE weight matrices, which should be possible for both mammals and insects.
We analyzed the sequence region encompassing mRNA cleavage site in human and Drosophila ESTs, derived PAS and DSE motifs for both species and proved their specifiCity. We integrated these motifs into a HMM which predicts mRNA cleavage and polyadenylation sites with higher specifiCity than previously published tools.
Moreover, we show that the sequence regions involved in polyadenylation are highly conserved between these two species. Our study underlines the value of using the primary sequence data derived from EST projects, as well as genome sequences, for the large-scale documentation and analysis of polyadenylation sites. With the constantly increasing number of available ESTs, future studies might uncover sequence signals that control tissue specific regulation of alternative polyadenylation.
We mapped all publicly available human (described in ; data available at our ftp site ) andDrosophilaEST reads (246'248 EST sequences from BGDP EST collection, raw sequencing data, obtained from Dr. Mark Stapleton, BGDP) to the corresponding genomes. The sequences that matched the genome with at most 2 mismatches and ended with at least 10 A's not present in the genome were taken to be derived frombona fidepolyadenylation events. The genome sequences spanning from -50 to + 50 bp relative to the CS ("3' tags") were collected for all polyadenylation events. Micro-variability of a few bases in the position of the CS is frequently observed [Additional file 1, Figure2A] consistent with . All unique CS sites so defined were used for the statistics in Fig.1,2,3and for building the PAS and DSE models. A polyadenylation site can be defined as a cluster (our 3P clusters) of closely spaced CS, for which the cleavage is driven by the same PAS. Therefore CS that differ only by a few nt (cf. Fig S2A) are clustered for the evaluation of polyadenylation sites (Fig6) as in ref .
Figure1Ashows the distribution of the number of 3' tags per gene. In both human andDrosophila, a large fraction of genes are represented by a small number of tags, while some human genes have up to 5000 tags. Due to large quantity of data (590'008 tags), the distribution of human 3' tags is likely to reflect biological variation in the expression levels of the corresponding genes. ForDrosophila, we were able to collect only 11'385 3' tags, because most of the sequences produced by the BGDP project were 5' ESTs. Although the average coverage is about 1/20thof that in human, the similarity in these distributions in both species suggests that theDrosophilatags reflect similar expression variability. The clustered 3' tags documented a total of 53'469 polyadenylation regions in human, and 2'659 inDrosophila. All datasets used in the various sections are described and available at .
We searched for motifs in the 100-bp polyadenylation regions documented by clusters of 3' tags using thememesoftware . To incorporate all sequences for defining the motifs, we applied preliminarymemeruns to identify the two strongest signals (AAUAAA for PAS, UUUGUUU for human DSE, and UUUCUGU for fly DSE) which were then used as seeds to exhaustively search all sequences in groups of 250.
In fly the DSE seed is related although not identical with the hexamers founds in . Hits in these runs were then used to define the final matrices if they met positional criteria. For the PAS hits we retained all occurrences with first nucleotide at positions between 20 and 38 (Figure3). For the second motif, we retained all occurrences between nucleotide 51 and 73. The extension of the matrices was defined by the information content ( ) at each position. Positions with IC > 0.2 bits, which provided a clear separation from surrounding background, were retained.
HMMs were trained and decoded using the implementation written by A. Krogh (unpublished). Training is based on the Baum-Welch expectation minimization procedure and models were decoded using either the Forward or posterior label probability (PLP)  algorithms. The PLP of the first nucleotide in the PAS motif at each position in the sequence is computed for polyadenylation site prediction. The full model architecture is shown in figure4.
When assessing the specifiCity of the DSE matrices, the only parameters that were optimized were the background emission probabilities (b1 = b2 = b3 for model 4A and 4A and b1 = b3 for 4B) and transitions (p1 and p3 in both models), while the weight matrices and Gaussian spacer parameters were kept fixed. After optimization, the model was decoded (Forward) on two sets of sequences: one set of positives independent from the training set, and a shuffled version thereof serving as negative controls. A model consisting of a single zero order background state was optimized on the training sequence and used as background score.
For prediction, the model was optimized using an iterative procedure rather than simultaneous optimization of all parameters using Baum-Welch. First we noticed that optimizing the background transitions p1 and p3 tended to place the cleavage site too far downstream, leading to inferior performance. Thus, p1 and p3 were scanned and fixed at values of optimal prediction. In the final model, these corresponded to an average length in the first background state of 500 in human. Spacer length  and SD  were always fixed.
Since our purpose was to predict 3' mRNA end formation knowing the position of the stop codon, we restricted our set to genes with unique stop codons having at least one entry in RefSeq and one polyadenylation site. To avoid ambuiguites in the evaluation of predictions, we further restricted the sets to sequences having an unique mRNA cleavage site, documented spread on less 40 bp. These sets consisted of 7743 genes in humans and 1680 inDrosophila. The sets were randomly split in disjoint halves for model training and testing.
We thank Mark Stapleton and Joe Carlton from BGDP for providing us rawDrosophilaEST reads and in particular Anders Krogh for providing us the HMM software. DR was supported by an NIH administrative supplement to parent grant GM54339. FN is supported by the NCCR program in Molecular Oncology of the Swiss National Science Foundation.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.