POLYAR, a new computer program for prediction of poly(A) sites in human sequences
© Akhtar et al; licensee BioMed Central Ltd. 2010
Received: 24 March 2010
Accepted: 19 November 2010
Published: 19 November 2010
Skip to main content
© Akhtar et al; licensee BioMed Central Ltd. 2010
Received: 24 March 2010
Accepted: 19 November 2010
Published: 19 November 2010
mRNA polyadenylation is an essential step of pre-mRNA processing in eukaryotes. Accurate prediction of the pre-mRNA 3'-end cleavage/polyadenylation sites is important for defining the gene boundaries and understanding gene expression mechanisms.
28761 human mapped poly(A) sites have been classified into three classes containing different known forms of polyadenylation signal (PAS) or none of them (PAS-strong, PAS-weak and PAS-less, respectively) and a new computer program POLYAR for the prediction of poly(A) sites of each class was developed. In comparison with polya_svm (till date the most accurate computer program for prediction of poly(A) sites) while searching for PAS-strong poly(A) sites in human sequences, POLYAR had a significantly higher prediction sensitivity (80.8% versus 65.7%) and specificity (66.4% versus 51.7%) However, when a similar sort of search was conducted for PAS-weak and PAS-less poly(A) sites, both programs had a very low prediction accuracy, which indicates that our knowledge about factors involved in the determination of the poly(A) sites is not sufficient to identify such polyadenylation regions.
We present a new classification of polyadenylation sites into three classes and a novel computer program POLYAR for prediction of poly(A) sites/regions of each of the class. In tests, POLYAR shows high accuracy of prediction of the PAS-strong poly(A) sites, though this program's efficiency in searching for PAS-weak and PAS-less poly(A) sites is not very high but is comparable to other available programs. These findings suggest that additional characteristics of such poly(A) sites remain to be elucidated. POLYAR program with a stand-alone version for downloading is available at http://cub.comsats.edu.pk/polyapredict.htm.
Polyadenylation including the cleavage of pre-mRNA and adding a stretch of adenosines, poly(A), to the 3'-end is an essential stage of pre-mRNA processing, which results in the generation of the mature mRNA in eukaryotes. This is an important step for the stability, nucleus-to-cytoplasm export and translation initiation of mRNA [1, 2]. Polyadenylation is also required for the proper and effective transcription termination, splicing of mRNA, translation termination, as well as being involved in gene silencing [3–8] and genomic imprinting . Although polyadenylation is a common modification of pre-mRNA, it is achieved by different mechanisms in different organisms [10–12]. Moreover, another type of RNA polyadenylation processing with a different set of proteins has been identified recently in eukaryotes, which has been implicated in RNA degradation .
mRNA polyadenylation involves (a) a specific endonucleolytic cleavage at the poly(A) site and (b) subsequent addition of a poly(A) tail. In mammals the poly(A) region contains various cis acting elements that interact with the corresponding proteins of the mRNA polyadenylation machine, as the cleavage and polyadenylation specificity factor (CPSF), the cleavage stimulation factor (CstF), the cleavage factors I and II (CF I and CF II), and the enzyme poly(A) polymerase (PAP). All components of this machine appear to act cooperatively. In particular, the CPSF and CstF interact with each other and bind to the AAUAAA hexamer (polyadenylation signal, PAS) and its downstream counterpart, U/GU-rich element, respectively [11, 13, 14].
The PAS is one of the well-studied key elements, which have been shown to be involved in the regulation of mRNA-polyadenylation [15, 16]. The optimal (canonical) PAS consists of the AAUAAA motif and most base substitutions in this sequence, except the AUUAAA variant, have been shown to have a significantly reduced cleavage and polyadenylation efficiency (to almost 10% of those sequences with the canonical PAS). However, the AAUAAA element is not as universal a signal as it has previously been considered to be. For example, in human, only ~70% of the 3' expressed sequence tags (ESTs) contain one of the two optimal PAS sequences, AAUAAA or AUUAAA [16, 17]. These findings suggest that no consensus sequences of the main cleavage and polyadenylation signals exist, but rather the cooperative action of these sequences and their binding factors results in the pre-mRNA maturation. Indeed, in addition to PAS and the U-/GU-rich elements, a number of auxiliary elements appear to play an important role in the regulation of this process. Comparative studies of the human and Drosophila melanogaster polyadenylation regions show that the PAS is highly conserved in these two species. However, as opposed to this, the U-rich downstream sequence element (DSE) shows a higher divergence between the two species. Such a variation of the maturation process of the pre-mRNA elements seems to be related to alternative polyadenylation of the same transcription unit [10, 12, 14, 16, 18].
A growing line of evidence indicates that most pre-mRNAs undergo alternative polyadenylation and this mechanism, altogether with alternative transcription initiation and splicing, are used by eukaryotic organisms to produce a diverse number of mature mRNA transcripts from the same transcription unit. This is further confirmed by the fact that splicing and polyadenylation appear to be coupled co-transcriptional events of pre-mRNA maturation. It has been shown that more than half of the human genes have multiple poly(A) sites, which can potentially produce transcripts with variable 3'-untranslated regions (3'-UTRs) potentially encoding in many cases different proteins [7, 12, 19]. Besides, polyadenylation in the intronic regions can result in conversion of an internal exon to a 3'-terminal exon (termed "composite terminal exon") or usage of a 3'-terminal exon that is otherwise skipped ("skipped terminal exon"; 7,20). It has been shown that about 20% of human genes have, at least, one intron with polyadenylation site(s) which can potentially produce alternative mRNAs encoding different proteins. The conservation of poly(A) signals in introns of mouse and rat genes is lower than that of poly(A) sites in gene ends. Moreover, the intronic polyadenylation activity appears to be dependent on cellular conditions, 5'-splice sites and intron size: larger introns with weaker donor splice sites prevail among exons used as composite terminal exons, whereas skipped terminal exons are associated mostly with strong polyadenylation signals [7, 21]. Both bioinformatics and experimental studies indicate that utilizing intronic alternative poly(A) site in human and mouse CstF-77 gene, which encodes one of the three subunits of the CstF, as well as in its Drosophila homologue, produces short CstF-77 transcripts lacking sequences encoding some of the functional domains of CstF-77 . Computational analysis of 3'-ends of ESTs suggested the existence of four classes of alternative polyadenylation in human, mouse, and rat: tandem poly(A) sites, composite exons, hidden exons, and truncated exons. It was estimated that about 49% of the human, 31% of the mouse, and 28% of the rat polyadenylated transcription units have alternative polyadenylation sites, which result in the generation of new protein isoforms . Recently, Muro et al.  reported that about 60% of human and murine genes utilize multiple (on the average, 3-4) polyadenylation sites. All available data, briefly reviewed above, indicate the importance of accurate identification of polyadenylation sites in genes. Of course, the most accurate approach for the solution of this problem is mapping of the full-length cDNAs onto the genomic sequences. However, till date a complete set of full-length cDNAs is not available, clustering and DNA mapping studies of human and murine EST databases reveals that of the currently known genes, 15% have no EST-supported 3'end. In other words, the 3'-termini of a significant portion of known genes from these species remains to be identified . Similar results have been obtained by Lopez et al. . Therefore, the computational prediction of poly(A) sites becomes very important. However, it is also a challenge for the following reasons: (a) there are gaps in our knowledge on regulation and performance of mRNA polyadenylation; (b) DNA signals and regulatory proteins, involved in the transcription termination, are still poorly described; (c) polyadenylation appears to be tissue- and organism-specific, and also many genes utilize alternative points of polyadenylation; (d) available data of poly(A) sites are very limited: the real transcriptional status of a genome, dependence of cell/tissue/organ, developmental stage and environmental conditions, is unknown; (e) poly(A) sites may exist in coding sequences (CDS) and introns. The last two points create additional problems in getting true positive and negative data sets necessary for learning and testing of any computational tool.
To date, using both experimental and in silico mapping of 3'-end ESTs, thousands of poly(A) sites have been identified and analyzed [24, 25]. Based on these data, during the last 15 years, a number of attempts have been made to define the consensus sequences involved in the pre-mRNA polyadenylation and to develop computer tools for prediction of poly(A) sites. By analyzing known human poly(A) sites, Yada et al.  suggested CAAUAAA(U/C) as a consensus of the poly(A) signal. Kondrakhin et al.  computed a general consensus matrix for 63 poly(A) sites in vertebrate pre-mRNAs and implemented it into the computational method for the recognition of polyadenylation points in mRNA. However, all these methods had a very high false positive rate. Later, Salamov and Solovyev , applying a linear discriminant function (LDF)-based algorithm, trained on 131 real poly (A) signal regions and 1466 other regions of human genes with seemingly nonfunctional AAUAAA motif, developed the computer program POLYAH. In comparison with the previous analogous methods, the accuracy of this latter approach, which has been tested on a larger data set, was better (sensitivity and specificity of 86% and 51%, respectively). Tabaska and Zhang  developed the polyadq program, which used two quadratic discriminant functions for the AAUAAA/AUUAAA motifs and a position weight matrix for the downstream elements. In tests on whole genes and the last two exons of genes, polyadq predicted poly(A) signals with a correlation coefficient of 0.413 and 0.512, respectively. Legendre and Gautheret  developed the ERPIN program based on a probabilistic hidden Markov model, which achieved a prediction specificity of 69 to 85% for a sensitivity of 56%. By analyzing about 4956 EST-validated poly(A) sites from human genes, some sequence determinants of human poly(A) regions were suggested: U-rich upstream and downstream sequence elements (USE and DSE, respectively) appear to be the main characteristics distinguishing true poly(A) sites from randomly occurring A(A/U)UAAA hexamers; while USEs were found to be indiscriminately associated with strong and weak poly(A) sites, DSEs are mostly found near strong poly(A) sites. To recognize poly(A) region of Saccharomyces cerevisiae  and Caenorhabditis elegans , the weight-matrix-only approaches have been developed. To identify PASs, Bajic et al.  have developed another program, Dragon PolyAtt, the accuracy of which was substantially better than that obtained by the polyadq program. It predicted the two most frequent poly(A) sites, AAUAAA and AUUAAA, with a sensitivity of 48.4% and 13.6%, and specificity of 74% and 79.1%, respectively.
Using a hexamer enrichment method, PROBE, Hu et al.  revealed the prevalence of some U-rich and G-rich elements in human poly(A) regions: AUEs (auxiliary USE; -100:-41 region, where +1 is poly(A) site), CUEs (core USE; -40:-1), CDEs (core DSE; +1:+40) and ADEs (auxiliary DSE; +41:+100). Further studies of poly(A) regions by applying position-specific scoring matrixes (PSSM) for these motifs revealed the presence of 15 elements upstream or downstream of the poly(A) site that were suggested to play a role in determining polyadenylation sites.
The latest computational method for the prediction of poly(A) sites is polya_svm, which uses support vector machine (SVM) and previously identified 15 cis-motifs for the prediction of the poly(A) sites. Compared with polyadq, this method achieves higher sensitivity though with similar specificity .
However, there is still room for improvement in the accuracy of these tools, especially for genome-wide searches. In the current work we report the development of a new tool, POLYAR, for the recognition of poly(A) sites in human and closely related mammals. The prediction accuracy of this computer program is significantly higher than that of previously developed tools.
Using positions of the human mapped poly(A) sites from polya_DB http://polya.umdnj.edu/PolyA_DB1/ and GenBank annotation of human genome (Build 34.2), 29281 sequences of 600 bp length surrounding the mapped poly(A) site (or Clevage Site, CS) at position 301 (+1) were extracted. To select non-redundant poly(A) sequences, the pairwise comparison of [-50:+50] regions by the BLAST program  was performed and the non-redundant positive learning and training dataset of 28761 poly(A) regions showing less than 90% pairwise similarity were selected. These poly(A) regions, here after referred to as "polya DB Set", represent 13693 genes. Applying the same approach, the non-redundant negative set was created. For comparative purposes the polya_svm program  was trained and tested on all the 29281 mapped sites.
Studies of PAS motifs upstream of the mapped poly(a) sites revealed 12 different forms of this hexamer; whereas the remaining variations of the PAS-motif have been suggested to be non-functional [17, 21]. Out of these 12 forms two PAS-motifs, AAUAAA or AUUAAA, are found in about 70% of the human mapped poly(A) sites. To explain the variations in the PAS-motifs, it was suggested that a site of pre-mRNA cleavage and polyadenylation should be determined by the combination of PAS-motif and some other DNA elements, together with their binding factors . Taking into account these observations and suggestions, and applying Expectation Maximization (EM) approach [36, 37], poly(A) sites were differentiated into 3 classes: (1) PAS-strong poly(A) sites with the AAUAAA or AUUAAA motifs; (2) PAS-weak poly(A) sites with the remaining 10 forms of PAS-motif: AGUAAA, UAUAAA, CAUAAA, GAUAAA, AAUAUA, AAUACA, AAUAGA, ACUAAA, AAGAAA and AAUGAA; (3) PAS-less poly(A) sites which lack any of these 12 forms. Such a grouping of poly(A) sites into 3 classes differs from the classification criterion used by Cheng et al. : poly(A) sites are divided into different groups based (a) on their usage (the number of cDNA/ESTs supporting them; "strong", "weak" and "medium" sites) and location ("first", "middle" and "last" upstream sites); as opposed to the classification criterion used in the current study is based only on sequence variation in PAS motifs.
To get a set of negative false poly(A), sequences, the coding sequences (CDS) of only "head-to-head" (H2H) genes annotated (Build 36.3) were extracted, because there is a less chance of the existence of polyadenylation sites in the intergenic spacer between these pairs (in comparison with "tail-to-head" and "tail-to-tail" gene pairs). The middle parts of these CDSs (without the first and last 200 bp) were then divided into non-overlapping fragments of 150, 120 and 160 bp (for PAS-strong, PAS-weak and PAS-less poly(A) sites, respectively) and taken as the negative datasets. Using this method 21,671 sequences ("H2H" negative dataset) were obtained. In addition, using pairwise BLAST comparison non-redundant subset of the H2 H set of 21,350 sequences were obtained. To test the specificity of the current approach, the polya_svm and polyadq programs, we extracted CDSs of 17600 human genes, 19600 human intronic sequences and 3748 5'-UTR regions, and generated 8261 randomized poly (A) site regions (simple randomization by maintaining the same nucleotide frequency as in original poly(A) site sequences).
where and are the sample mean vectors of characteristics for Class 1 and Class 2, respectively; s is the pooled covariance matrix of characteristics for Class 1 and Class 2; s i is covariation matrix, and n i is the sample size of Class i.
Characteristics of polyadenylation regions used for recognition of PAS-strong, PAS-weak and PAS-less sites, and Mahalonobis distance (D2; 38) showing the power of recognition of each characteristic
D 2 for PAS-strong sites
D 2 for PAS-weak sites
D 2 for PAS-less sites
PAS-motif, [-40: -1]
GU/U-motif, [+1: +50]
Upstream Pentamer Composition
Downstream Pentamer Composition c
Total D 2 :
Only sequences with PAS-motif having a score higher than some threshold (in the current case, the minimum score of authentic PAS-motifs) were considered as candidate poly(A) sites.
where w(b, i) is the frequency of base b at position i of the k-mer candidate CS-motif; k = 18 for PAS-strong and PAS-weak sites, and k = 12 for PAS-less sites (for CS-motif consensuses see: Additional file 3).
where w(b, i) is the frequency of base b at position i of the 10-mer candidate GU/U-motif.
where P(S k, i ) is P(S k ) for hexamer S k starting in the position i (for pentamers available in 20% or more of the positive set of different classes of poly(A) sequences see Additional files 4, 5 and 6, respectively).
where f(D PAS-Cs ) is the frequency of a distance betweem PAS-motif and CS observed in sequences from the corresponding positive training dataset.
where f(D Cs-GU/U ) is the frequency of a distance betweem CS and GU/U motif observed in sequences from the corresponding positive training dataset.
Most of these features were kept the same for all 3 classes of poly(A) sequences, though there was also some difference, and the discriminating abilities of these features were different (Table 1). In particular, for PAS-strong and PAS-weak poly(A) sequences PAS-motif is the most significant feature.
where TP - True Positives, TN - True Negatives, FP - False Positives, FN - False Negatives, Sn p and Sn n - Sensitivity of predictions in sets of Positive and Negative samples, respectively.
for any pair of predicted PAS-strong and PAS-weak CSs, or PAS-strong and PAS-weak CSs, within 100 bp of each other, only PAS-strong site is retained;
for any pair of predicted PAS-weak and PAS-less CSs, within 100 bp of each other, only PAS-weak site is retained;
for any pair of predicted CSs of the same class, within 100 bp of each other, only the one with the highest score is retained.
Statistics of initial testing of POLYAR program on three classes of CS/poly(A) sites
Striking differences in the accuracy of predictions in the 3 sets were observed: quite high prediction sensitivity and correlation coefficient for the PAS-strong sequences contradict with very low sensitivity and correlation coefficient for the PAS-weak and PAS-less sequences. Such a variation in accuracy of predictions indicates that we are still far from understanding the regulatory architecture of pre-mRNA 3'-end processing regions lacking the upstream strong PAS site. This gap in our knowledge seems to be a general problem rather than a challenge inherent only in the current approach.
Moreover, the real task of poly(A) site prediction is quite different from just discriminating between CS and non-CS regions: the most probable poly(A) site in a long genomic sequence should be identified. Therefore, we tested the POLYAR recognition algorithm on genomic sequences and compared its performance with polya_svm, till date the most accurate tool for prediction of human poly(A) sites , and polyadq program . All three programs were tested on five sets of sequences: (1) [-300:+300] regions relative to the mapped CSs (+1) from 5225 PAS-strong, 2475 PAS-weak and 561 PAS-less test sequences; (2) coding sequences (CDS) of 17600 human genes; (3) 19600 human intronic sequences; (4) 5'-UTR sequences of 3748 human genes; (5) 8261 randomized poly (A) site regions generated by simple randomization of original poly(A) site sequences, with the same nucleotide frequency. Moreover, we tested POLYAR and polya_svm also on [-1000:+1000] gene end (+1) regions of 11916 sequences with the mapped poly(A) site around the [-100:+100] gene end (for comparison of the CPU times used by POLYAR and polya_svm see: Additional file 7).
Comparative testing results of POLYAR, polya_svm and polyadq rograms in [-300:+300] regions around the mapped PAS-strong, PAS-weak and PAS-less CS/poly (A) sites
Predicted CSs, total
For the PAS-strong sites, POLYAR was found to be more sensitive than polya_svm and polyadq (80.8% versus 65.7% and 53.7%, respectively), as well as it being significantly higher in specificity (66.4% versus 51.7%) than polya_svm. At the same time, by comparison with POLYAR, polyadq showed slightly higher specificity (70.4%). The last finding, probably, results from the totally low predictive power of polyadq program. For PAS-weak sites, both POLYAR and polya_svm show very low sensitivity and specificity at almost the same level (25.2% versus 26.1%, 28.4% versus 29.5%, respectively). For PAS-less sites the lowest sensitivity and specificity of prediction by both programs was observed, though the sensitivity of POLYAR was higher (13.5% versus 7.5%).
Comparative testing results of POLYAR, polya_svm and polyadq programs on coding sequences of 17600 human genes
POLYAR, All 1
Comparative testing results of POLYAR, polya_svm and polyadq programs on 19600 intronic sequences
POLYAR, All 1
Searching for PAS-strong CS/poly(A) sites in human sequences, the POLYAR program, as compared to the polya_svm and polyadq programs, had a significantly higher prediction sensitivity (80.8% versus 65.7% and 53.7%, respectively). POLYAR has also higher specificity than polya_svm (66.4% versus 51.7%). However, polyadq showed slightly higher specificity (70.4%) than POLYAR, which may be due to the totally low predictive power of polyadq program. At the same time, in search for PAS-weak and PAS-less CSs both POLYAR and polya_svm programs show very low prediction accuracy. Moreover, POLYAR program shows almost the same accuracy in analysis of mouse and rat gene end sequences (data not shown). These observations indicate that the current knowledge about factors (protein and DNA/RNA sequence requirements) involved in the selection of poly(A) sites is not sufficient to identify polyadenylation regions containing neither AAUAAA nor AUUAAA signals. Therefore, it is important to identify the additional characteristics involved in determining the"PAS-weak" and "PAS-less" CS/poly(A) sites. In particular, we can not exclude that an alternative criterion is required for classification of poly(A) sites with degenerated or without PAS-by signal: there may be different signal(s) which initiate the polyadenylation of pre-mRNA. If this assumption is true, then regrouping of such poly(A) sites into new sub-classes might allow to reveal new conservative motifs around polyadenylation points. Our studies along these lines are continuing.
This work was supported by the Higher Education Commission of Pakistan through a research grant entitled "Comparative Computer Analysis of Promoter Architecture and Expression Patterns of Plant Genes", which is collaborative research project jointly conducted by the Department of Biosciences (COMSATS Institute of Information Technology, Islamabad, Pakistan), the Department of Molecular-genetic bases of Production Processes (Institute of Botany, Azerbaijan National Academy of Sciences, Baku, Azerbaijan) and Department of Computer Science (Royal Holloway, University of London, UK).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.