POLYAR, a new computer program for prediction of poly(A) sites in human sequences
© Akhtar et al; licensee BioMed Central Ltd. 2010
Received: 24 March 2010
Accepted: 19 November 2010
Published: 19 November 2010
mRNA polyadenylation is an essential step of pre-mRNA processing in eukaryotes. Accurate prediction of the pre-mRNA 3'-end cleavage/polyadenylation sites is important for defining the gene boundaries and understanding gene expression mechanisms.
28761 human mapped poly(A) sites have been classified into three classes containing different known forms of polyadenylation signal (PAS) or none of them (PAS-strong, PAS-weak and PAS-less, respectively) and a new computer program POLYAR for the prediction of poly(A) sites of each class was developed. In comparison with polya_svm (till date the most accurate computer program for prediction of poly(A) sites) while searching for PAS-strong poly(A) sites in human sequences, POLYAR had a significantly higher prediction sensitivity (80.8% versus 65.7%) and specificity (66.4% versus 51.7%) However, when a similar sort of search was conducted for PAS-weak and PAS-less poly(A) sites, both programs had a very low prediction accuracy, which indicates that our knowledge about factors involved in the determination of the poly(A) sites is not sufficient to identify such polyadenylation regions.
We present a new classification of polyadenylation sites into three classes and a novel computer program POLYAR for prediction of poly(A) sites/regions of each of the class. In tests, POLYAR shows high accuracy of prediction of the PAS-strong poly(A) sites, though this program's efficiency in searching for PAS-weak and PAS-less poly(A) sites is not very high but is comparable to other available programs. These findings suggest that additional characteristics of such poly(A) sites remain to be elucidated. POLYAR program with a stand-alone version for downloading is available at http://cub.comsats.edu.pk/polyapredict.htm.
Polyadenylation including the cleavage of pre-mRNA and adding a stretch of adenosines, poly(A), to the 3'-end is an essential stage of pre-mRNA processing, which results in the generation of the mature mRNA in eukaryotes. This is an important step for the stability, nucleus-to-cytoplasm export and translation initiation of mRNA [1, 2]. Polyadenylation is also required for the proper and effective transcription termination, splicing of mRNA, translation termination, as well as being involved in gene silencing [3–8] and genomic imprinting . Although polyadenylation is a common modification of pre-mRNA, it is achieved by different mechanisms in different organisms [10–12]. Moreover, another type of RNA polyadenylation processing with a different set of proteins has been identified recently in eukaryotes, which has been implicated in RNA degradation .
mRNA polyadenylation involves (a) a specific endonucleolytic cleavage at the poly(A) site and (b) subsequent addition of a poly(A) tail. In mammals the poly(A) region contains various cis acting elements that interact with the corresponding proteins of the mRNA polyadenylation machine, as the cleavage and polyadenylation specificity factor (CPSF), the cleavage stimulation factor (CstF), the cleavage factors I and II (CF I and CF II), and the enzyme poly(A) polymerase (PAP). All components of this machine appear to act cooperatively. In particular, the CPSF and CstF interact with each other and bind to the AAUAAA hexamer (polyadenylation signal, PAS) and its downstream counterpart, U/GU-rich element, respectively [11, 13, 14].
The PAS is one of the well-studied key elements, which have been shown to be involved in the regulation of mRNA-polyadenylation [15, 16]. The optimal (canonical) PAS consists of the AAUAAA motif and most base substitutions in this sequence, except the AUUAAA variant, have been shown to have a significantly reduced cleavage and polyadenylation efficiency (to almost 10% of those sequences with the canonical PAS). However, the AAUAAA element is not as universal a signal as it has previously been considered to be. For example, in human, only ~70% of the 3' expressed sequence tags (ESTs) contain one of the two optimal PAS sequences, AAUAAA or AUUAAA [16, 17]. These findings suggest that no consensus sequences of the main cleavage and polyadenylation signals exist, but rather the cooperative action of these sequences and their binding factors results in the pre-mRNA maturation. Indeed, in addition to PAS and the U-/GU-rich elements, a number of auxiliary elements appear to play an important role in the regulation of this process. Comparative studies of the human and Drosophila melanogaster polyadenylation regions show that the PAS is highly conserved in these two species. However, as opposed to this, the U-rich downstream sequence element (DSE) shows a higher divergence between the two species. Such a variation of the maturation process of the pre-mRNA elements seems to be related to alternative polyadenylation of the same transcription unit [10, 12, 14, 16, 18].
A growing line of evidence indicates that most pre-mRNAs undergo alternative polyadenylation and this mechanism, altogether with alternative transcription initiation and splicing, are used by eukaryotic organisms to produce a diverse number of mature mRNA transcripts from the same transcription unit. This is further confirmed by the fact that splicing and polyadenylation appear to be coupled co-transcriptional events of pre-mRNA maturation. It has been shown that more than half of the human genes have multiple poly(A) sites, which can potentially produce transcripts with variable 3'-untranslated regions (3'-UTRs) potentially encoding in many cases different proteins [7, 12, 19]. Besides, polyadenylation in the intronic regions can result in conversion of an internal exon to a 3'-terminal exon (termed "composite terminal exon") or usage of a 3'-terminal exon that is otherwise skipped ("skipped terminal exon"; 7,20). It has been shown that about 20% of human genes have, at least, one intron with polyadenylation site(s) which can potentially produce alternative mRNAs encoding different proteins. The conservation of poly(A) signals in introns of mouse and rat genes is lower than that of poly(A) sites in gene ends. Moreover, the intronic polyadenylation activity appears to be dependent on cellular conditions, 5'-splice sites and intron size: larger introns with weaker donor splice sites prevail among exons used as composite terminal exons, whereas skipped terminal exons are associated mostly with strong polyadenylation signals [7, 21]. Both bioinformatics and experimental studies indicate that utilizing intronic alternative poly(A) site in human and mouse CstF-77 gene, which encodes one of the three subunits of the CstF, as well as in its Drosophila homologue, produces short CstF-77 transcripts lacking sequences encoding some of the functional domains of CstF-77 . Computational analysis of 3'-ends of ESTs suggested the existence of four classes of alternative polyadenylation in human, mouse, and rat: tandem poly(A) sites, composite exons, hidden exons, and truncated exons. It was estimated that about 49% of the human, 31% of the mouse, and 28% of the rat polyadenylated transcription units have alternative polyadenylation sites, which result in the generation of new protein isoforms . Recently, Muro et al.  reported that about 60% of human and murine genes utilize multiple (on the average, 3-4) polyadenylation sites. All available data, briefly reviewed above, indicate the importance of accurate identification of polyadenylation sites in genes. Of course, the most accurate approach for the solution of this problem is mapping of the full-length cDNAs onto the genomic sequences. However, till date a complete set of full-length cDNAs is not available, clustering and DNA mapping studies of human and murine EST databases reveals that of the currently known genes, 15% have no EST-supported 3'end. In other words, the 3'-termini of a significant portion of known genes from these species remains to be identified . Similar results have been obtained by Lopez et al. . Therefore, the computational prediction of poly(A) sites becomes very important. However, it is also a challenge for the following reasons: (a) there are gaps in our knowledge on regulation and performance of mRNA polyadenylation; (b) DNA signals and regulatory proteins, involved in the transcription termination, are still poorly described; (c) polyadenylation appears to be tissue- and organism-specific, and also many genes utilize alternative points of polyadenylation; (d) available data of poly(A) sites are very limited: the real transcriptional status of a genome, dependence of cell/tissue/organ, developmental stage and environmental conditions, is unknown; (e) poly(A) sites may exist in coding sequences (CDS) and introns. The last two points create additional problems in getting true positive and negative data sets necessary for learning and testing of any computational tool.
To date, using both experimental and in silico mapping of 3'-end ESTs, thousands of poly(A) sites have been identified and analyzed [24, 25]. Based on these data, during the last 15 years, a number of attempts have been made to define the consensus sequences involved in the pre-mRNA polyadenylation and to develop computer tools for prediction of poly(A) sites. By analyzing known human poly(A) sites, Yada et al.  suggested CAAUAAA(U/C) as a consensus of the poly(A) signal. Kondrakhin et al.  computed a general consensus matrix for 63 poly(A) sites in vertebrate pre-mRNAs and implemented it into the computational method for the recognition of polyadenylation points in mRNA. However, all these methods had a very high false positive rate. Later, Salamov and Solovyev , applying a linear discriminant function (LDF)-based algorithm, trained on 131 real poly (A) signal regions and 1466 other regions of human genes with seemingly nonfunctional AAUAAA motif, developed the computer program POLYAH. In comparison with the previous analogous methods, the accuracy of this latter approach, which has been tested on a larger data set, was better (sensitivity and specificity of 86% and 51%, respectively). Tabaska and Zhang  developed the polyadq program, which used two quadratic discriminant functions for the AAUAAA/AUUAAA motifs and a position weight matrix for the downstream elements. In tests on whole genes and the last two exons of genes, polyadq predicted poly(A) signals with a correlation coefficient of 0.413 and 0.512, respectively. Legendre and Gautheret  developed the ERPIN program based on a probabilistic hidden Markov model, which achieved a prediction specificity of 69 to 85% for a sensitivity of 56%. By analyzing about 4956 EST-validated poly(A) sites from human genes, some sequence determinants of human poly(A) regions were suggested: U-rich upstream and downstream sequence elements (USE and DSE, respectively) appear to be the main characteristics distinguishing true poly(A) sites from randomly occurring A(A/U)UAAA hexamers; while USEs were found to be indiscriminately associated with strong and weak poly(A) sites, DSEs are mostly found near strong poly(A) sites. To recognize poly(A) region of Saccharomyces cerevisiae  and Caenorhabditis elegans , the weight-matrix-only approaches have been developed. To identify PASs, Bajic et al.  have developed another program, Dragon PolyAtt, the accuracy of which was substantially better than that obtained by the polyadq program. It predicted the two most frequent poly(A) sites, AAUAAA and AUUAAA, with a sensitivity of 48.4% and 13.6%, and specificity of 74% and 79.1%, respectively.
Using a hexamer enrichment method, PROBE, Hu et al.  revealed the prevalence of some U-rich and G-rich elements in human poly(A) regions: AUEs (auxiliary USE; -100:-41 region, where +1 is poly(A) site), CUEs (core USE; -40:-1), CDEs (core DSE; +1:+40) and ADEs (auxiliary DSE; +41:+100). Further studies of poly(A) regions by applying position-specific scoring matrixes (PSSM) for these motifs revealed the presence of 15 elements upstream or downstream of the poly(A) site that were suggested to play a role in determining polyadenylation sites.
The latest computational method for the prediction of poly(A) sites is polya_svm, which uses support vector machine (SVM) and previously identified 15 cis-motifs for the prediction of the poly(A) sites. Compared with polyadq, this method achieves higher sensitivity though with similar specificity .
However, there is still room for improvement in the accuracy of these tools, especially for genome-wide searches. In the current work we report the development of a new tool, POLYAR, for the recognition of poly(A) sites in human and closely related mammals. The prediction accuracy of this computer program is significantly higher than that of previously developed tools.
Training and testing datasets
Using positions of the human mapped poly(A) sites from polya_DB http://polya.umdnj.edu/PolyA_DB1/ and GenBank annotation of human genome (Build 34.2), 29281 sequences of 600 bp length surrounding the mapped poly(A) site (or Clevage Site, CS) at position 301 (+1) were extracted. To select non-redundant poly(A) sequences, the pairwise comparison of [-50:+50] regions by the BLAST program  was performed and the non-redundant positive learning and training dataset of 28761 poly(A) regions showing less than 90% pairwise similarity were selected. These poly(A) regions, here after referred to as "polya DB Set", represent 13693 genes. Applying the same approach, the non-redundant negative set was created. For comparative purposes the polya_svm program  was trained and tested on all the 29281 mapped sites.
Studies of PAS motifs upstream of the mapped poly(a) sites revealed 12 different forms of this hexamer; whereas the remaining variations of the PAS-motif have been suggested to be non-functional [17, 21]. Out of these 12 forms two PAS-motifs, AAUAAA or AUUAAA, are found in about 70% of the human mapped poly(A) sites. To explain the variations in the PAS-motifs, it was suggested that a site of pre-mRNA cleavage and polyadenylation should be determined by the combination of PAS-motif and some other DNA elements, together with their binding factors . Taking into account these observations and suggestions, and applying Expectation Maximization (EM) approach [36, 37], poly(A) sites were differentiated into 3 classes: (1) PAS-strong poly(A) sites with the AAUAAA or AUUAAA motifs; (2) PAS-weak poly(A) sites with the remaining 10 forms of PAS-motif: AGUAAA, UAUAAA, CAUAAA, GAUAAA, AAUAUA, AAUACA, AAUAGA, ACUAAA, AAGAAA and AAUGAA; (3) PAS-less poly(A) sites which lack any of these 12 forms. Such a grouping of poly(A) sites into 3 classes differs from the classification criterion used by Cheng et al. : poly(A) sites are divided into different groups based (a) on their usage (the number of cDNA/ESTs supporting them; "strong", "weak" and "medium" sites) and location ("first", "middle" and "last" upstream sites); as opposed to the classification criterion used in the current study is based only on sequence variation in PAS motifs.
To get a set of negative false poly(A), sequences, the coding sequences (CDS) of only "head-to-head" (H2H) genes annotated (Build 36.3) were extracted, because there is a less chance of the existence of polyadenylation sites in the intergenic spacer between these pairs (in comparison with "tail-to-head" and "tail-to-tail" gene pairs). The middle parts of these CDSs (without the first and last 200 bp) were then divided into non-overlapping fragments of 150, 120 and 160 bp (for PAS-strong, PAS-weak and PAS-less poly(A) sites, respectively) and taken as the negative datasets. Using this method 21,671 sequences ("H2H" negative dataset) were obtained. In addition, using pairwise BLAST comparison non-redundant subset of the H2 H set of 21,350 sequences were obtained. To test the specificity of the current approach, the polya_svm and polyadq programs, we extracted CDSs of 17600 human genes, 19600 human intronic sequences and 3748 5'-UTR regions, and generated 8261 randomized poly (A) site regions (simple randomization by maintaining the same nucleotide frequency as in original poly(A) site sequences).
Description of the method
where and are the sample mean vectors of characteristics for Class 1 and Class 2, respectively; s is the pooled covariance matrix of characteristics for Class 1 and Class 2; s i is covariation matrix, and n i is the sample size of Class i.
Characteristics of polyadenylation regions used for recognition of PAS-strong, PAS-weak and PAS-less sites, and Mahalonobis distance (D2; 38) showing the power of recognition of each characteristic
D 2 for PAS-strong sites
D 2 for PAS-weak sites
D 2 for PAS-less sites
PAS-motif, [-40: -1]
GU/U-motif, [+1: +50]
Upstream Pentamer Composition
Downstream Pentamer Composition c
Total D 2 :
Only sequences with PAS-motif having a score higher than some threshold (in the current case, the minimum score of authentic PAS-motifs) were considered as candidate poly(A) sites.
where w(b, i) is the frequency of base b at position i of the k-mer candidate CS-motif; k = 18 for PAS-strong and PAS-weak sites, and k = 12 for PAS-less sites (for CS-motif consensuses see: Additional file 3).
where w(b, i) is the frequency of base b at position i of the 10-mer candidate GU/U-motif.
Pentamer composition of Upstream and Downstream Regions
where P(S k, i ) is P(S k ) for hexamer S k starting in the position i (for pentamers available in 20% or more of the positive set of different classes of poly(A) sequences see Additional files 4, 5 and 6, respectively).
Distance between PAS- and CS-motifs
where f(D PAS-Cs ) is the frequency of a distance betweem PAS-motif and CS observed in sequences from the corresponding positive training dataset.
Distance between CS and GU/U-motifs
where f(D Cs-GU/U ) is the frequency of a distance betweem CS and GU/U motif observed in sequences from the corresponding positive training dataset.
Most of these features were kept the same for all 3 classes of poly(A) sequences, though there was also some difference, and the discriminating abilities of these features were different (Table 1). In particular, for PAS-strong and PAS-weak poly(A) sequences PAS-motif is the most significant feature.
Estimation of prediction accuracy
where TP - True Positives, TN - True Negatives, FP - False Positives, FN - False Negatives, Sn p and Sn n - Sensitivity of predictions in sets of Positive and Negative samples, respectively.
Results and Discussion
The algorithm described above was realized in the POLYAR [poly(A) region] program. In the first step, the POLYAR program classifies each position (+1) on a given sequence as a potential CS or non-CS based on three LDF classifiers for PAS-strong, PAS-weak and PAS-less poly(A) sites, with characteristics calculated in the [-100:+50], [-60:+60] and [-60:+100] regions around the current position, respectively (see: Table 1). In the case of PAS-strong and PAS-weak poly(A) sites, only positions with PAS-motif in the region (-40,-1) having a score higher than a preliminary defined threshold are selected for further consideration. First, the LDF for a position is estimated by the classifier for PAS-strong sites; if it is not classified as PAS-strong site, the LDF for that position is estimated by the classifier for PAS-weak sites; otherwise it is estimated by the classifier for PAS-less sites. Estimation of LDFs for the candidate positions is performed by applying thresholds for PAS-strong, PAS-weak and PAS-less poly(A) sites defined on the training dataset. For the final selection of potential CSs, the following criterion is applied:
(i) for any pair of predicted PAS-strong and PAS-weak CSs, or PAS-strong and PAS-weak CSs, within 100 bp of each other, only PAS-strong site is retained;
(ii) for any pair of predicted PAS-weak and PAS-less CSs, within 100 bp of each other, only PAS-weak site is retained;
(iii) for any pair of predicted CSs of the same class, within 100 bp of each other, only the one with the highest score is retained.
Statistics of initial testing of POLYAR program on three classes of CS/poly(A) sites
Striking differences in the accuracy of predictions in the 3 sets were observed: quite high prediction sensitivity and correlation coefficient for the PAS-strong sequences contradict with very low sensitivity and correlation coefficient for the PAS-weak and PAS-less sequences. Such a variation in accuracy of predictions indicates that we are still far from understanding the regulatory architecture of pre-mRNA 3'-end processing regions lacking the upstream strong PAS site. This gap in our knowledge seems to be a general problem rather than a challenge inherent only in the current approach.
Moreover, the real task of poly(A) site prediction is quite different from just discriminating between CS and non-CS regions: the most probable poly(A) site in a long genomic sequence should be identified. Therefore, we tested the POLYAR recognition algorithm on genomic sequences and compared its performance with polya_svm, till date the most accurate tool for prediction of human poly(A) sites , and polyadq program . All three programs were tested on five sets of sequences: (1) [-300:+300] regions relative to the mapped CSs (+1) from 5225 PAS-strong, 2475 PAS-weak and 561 PAS-less test sequences; (2) coding sequences (CDS) of 17600 human genes; (3) 19600 human intronic sequences; (4) 5'-UTR sequences of 3748 human genes; (5) 8261 randomized poly (A) site regions generated by simple randomization of original poly(A) site sequences, with the same nucleotide frequency. Moreover, we tested POLYAR and polya_svm also on [-1000:+1000] gene end (+1) regions of 11916 sequences with the mapped poly(A) site around the [-100:+100] gene end (for comparison of the CPU times used by POLYAR and polya_svm see: Additional file 7).
Comparative testing results of POLYAR, polya_svm and polyadq rograms in [-300:+300] regions around the mapped PAS-strong, PAS-weak and PAS-less CS/poly (A) sites
Predicted CSs, total
For the PAS-strong sites, POLYAR was found to be more sensitive than polya_svm and polyadq (80.8% versus 65.7% and 53.7%, respectively), as well as it being significantly higher in specificity (66.4% versus 51.7%) than polya_svm. At the same time, by comparison with POLYAR, polyadq showed slightly higher specificity (70.4%). The last finding, probably, results from the totally low predictive power of polyadq program. For PAS-weak sites, both POLYAR and polya_svm show very low sensitivity and specificity at almost the same level (25.2% versus 26.1%, 28.4% versus 29.5%, respectively). For PAS-less sites the lowest sensitivity and specificity of prediction by both programs was observed, though the sensitivity of POLYAR was higher (13.5% versus 7.5%).
Comparative testing results of POLYAR, polya_svm and polyadq programs on coding sequences of 17600 human genes
POLYAR, All 1
Comparative testing results of POLYAR, polya_svm and polyadq programs on 19600 intronic sequences
POLYAR, All 1
Searching for PAS-strong CS/poly(A) sites in human sequences, the POLYAR program, as compared to the polya_svm and polyadq programs, had a significantly higher prediction sensitivity (80.8% versus 65.7% and 53.7%, respectively). POLYAR has also higher specificity than polya_svm (66.4% versus 51.7%). However, polyadq showed slightly higher specificity (70.4%) than POLYAR, which may be due to the totally low predictive power of polyadq program. At the same time, in search for PAS-weak and PAS-less CSs both POLYAR and polya_svm programs show very low prediction accuracy. Moreover, POLYAR program shows almost the same accuracy in analysis of mouse and rat gene end sequences (data not shown). These observations indicate that the current knowledge about factors (protein and DNA/RNA sequence requirements) involved in the selection of poly(A) sites is not sufficient to identify polyadenylation regions containing neither AAUAAA nor AUUAAA signals. Therefore, it is important to identify the additional characteristics involved in determining the"PAS-weak" and "PAS-less" CS/poly(A) sites. In particular, we can not exclude that an alternative criterion is required for classification of poly(A) sites with degenerated or without PAS-by signal: there may be different signal(s) which initiate the polyadenylation of pre-mRNA. If this assumption is true, then regrouping of such poly(A) sites into new sub-classes might allow to reveal new conservative motifs around polyadenylation points. Our studies along these lines are continuing.
This work was supported by the Higher Education Commission of Pakistan through a research grant entitled "Comparative Computer Analysis of Promoter Architecture and Expression Patterns of Plant Genes", which is collaborative research project jointly conducted by the Department of Biosciences (COMSATS Institute of Information Technology, Islamabad, Pakistan), the Department of Molecular-genetic bases of Production Processes (Institute of Botany, Azerbaijan National Academy of Sciences, Baku, Azerbaijan) and Department of Computer Science (Royal Holloway, University of London, UK).
- Colgan DF, Manley JL: Mechanism and regulation of mRNA polyadenylation. Genes Dev. 1997, 11: 2755-2766. 10.1101/gad.11.21.2755.PubMedView ArticleGoogle Scholar
- Zhao J, Hyman L, Moore C: Formation of mRNA 3' ends in eukaryotes: mechanism, regulation, and interrelationships with other steps in mRNA synthesis. Microbiol Mol Biol Rev. 1999, 63: 405-445.PubMed CentralPubMedGoogle Scholar
- Calvo O, Manley JL: Evolutionarily conserved interaction between CstF-64 and PC4 links transcription, polyadenylation and termination. Mol Cell. 2001, 7: 1013-1023. 10.1016/S1097-2765(01)00236-2.PubMedView ArticleGoogle Scholar
- Beaudoing E, Gautheret D: Identification of alternate polyadenylation sites and analysis of their tissue distribution using EST data. Genome Res. 2001, 11: 1520-1526. 10.1101/gr.190501.PubMed CentralPubMedView ArticleGoogle Scholar
- Proudfoot NJ, Furger A, Dye MJ: Integrating mRNA processing with transcription. Cell. 2002, 108: 501-512. 10.1016/S0092-8674(02)00617-7.PubMedView ArticleGoogle Scholar
- Plant KE, Dye MJ, Lafaille C, Proudfoot NJ: Strong polyadenylation and weak pausing combine to cause efficient termination of transcription in the human Gγ-globin gene. Mol Cell Biology. 2005, 25: 3276-3285. 10.1128/MCB.25.8.3276-3285.2005.View ArticleGoogle Scholar
- Tian B, Pan Z, Lee JY: Widespread mRNA polyadenylation events in introns indicate dynamic interplay between polyadenylation and splicing. Genome Res. 2007, 17: 156-165. 10.1101/gr.5532707.PubMed CentralPubMedView ArticleGoogle Scholar
- Wang S-W, Stevenson AL, Kearsey SE, Watt S, Bahler J: Global role for polyadenylation-assisted nuclear RNA degradation in posttranscriptional gene silencing. Mol Cell Biology. 2008, 28: 658-665.Google Scholar
- Wood AJ, Schulz R, Woodfine K, Koltowska K, Beechey CV, Peters J, Bourc'his D, Oakey RJ: Regulation of alternative polyadenylation by genomic imprinting. Genes and Dev. 2008, 22: 1141-1146. 10.1101/gad.473408.PubMed CentralPubMedView ArticleGoogle Scholar
- Retelska D, Iseli C, Bucher P, Jongeneel CV, Naef F: Similarities and differences of polyadenylation signals in human and fly. BMC Genomics. 2006, 7: 176-10.1186/1471-2164-7-176.PubMed CentralPubMedView ArticleGoogle Scholar
- Mandel CR, Bai Y, Tong L: Protein factors in pre-mRNA 3'-end processing. Cell Mol Life Sci. 2008, 65: 1099-1122. 10.1007/s00018-007-7474-3.PubMed CentralPubMedView ArticleGoogle Scholar
- Lutz CS: Alternative Polyadenylation: A Twist on mRNA 3' End Formation. Acs Chem Bio Rev. 2008, 3: 609-617. 10.1021/cb800138w.View ArticleGoogle Scholar
- Brown KM, Gilmartin GM: A mechanism for the regulation of pre-mRNA 3' processing by human cleavage factor Im. Molecular Cell. 2003, 12: 1467-1476. 10.1016/S1097-2765(03)00453-2.PubMedView ArticleGoogle Scholar
- Zarudnaya MI, Kolomiets IM, Potyahaylo AL, Hovorun DM: Downstream elements of mammalian pre-mRNA polyadenylation signals: primary, secondary and higher-order structures. Nucleic Acids Res. 2003, 31: 1375-1386. 10.1093/nar/gkg241.PubMed CentralPubMedView ArticleGoogle Scholar
- Proudfoot NJ, Brownlee GG: 3' Non-coding region sequences in eukaryotic messenger RNA. Nature. 1976, 263: 211-214. 10.1038/263211a0.PubMedView ArticleGoogle Scholar
- Legendre M, Gautheret D: Sequence determinants in human polyadenylation site selection. BMC Genomics. 2003, 4: 7-10.1186/1471-2164-4-7.PubMed CentralPubMedView ArticleGoogle Scholar
- Beaudoing E, Freier S, Wyatt JR, Claverie JM, Gautheret D: Patterns of variant polyadenylation signal usage in human genes. Genome Res. 2000, 10: 1001-1010. 10.1101/gr.10.7.1001.PubMed CentralPubMedView ArticleGoogle Scholar
- LeTexier V, Riethoven JJ, Kumanduri V, Gopalakrishnan C, Lopez F, Gautheret D, Thanaraj TA: AltTrans: transcript pattern variants annotated for both alternative splicing and alternative polyadenylation. BMC Bioinformatics. 2006, 7: 169-181. 10.1186/1471-2105-7-169.View ArticleGoogle Scholar
- Yan J, Marr TG: Computational analysis osf 3'-ends of ESTs shows four classes of alternative polyadenylation in human, mouse, and rat. Genome Res. 2005, 15: 369-375. 10.1101/gr.3109605.PubMed CentralPubMedView ArticleGoogle Scholar
- Pan Z, Zhang H, Hague LK, Lee JY, Lutz CS, Tian B: An intronic polyadenylation site in human and mouse CstF-77 genes suggests an evolutionarily conserved regulatory mechanism. Gene. 2006, 366: 325-334. 10.1016/j.gene.2005.09.024.PubMedView ArticleGoogle Scholar
- Tian B, Hu H, Zhang H, Lutz CS: A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 2005, 33: 201-212. 10.1093/nar/gki158.PubMed CentralPubMedView ArticleGoogle Scholar
- Muro EM, Herrington R, Janmohamed S, Frelin C, Andrade-Navarro MA, Iscove NN: Identification of gene 3' ends by automated EST cluster analysis. Proc Natl Acad Sci USA. 2008, 105: 20286-20290. 10.1073/pnas.0807813105.PubMed CentralPubMedView ArticleGoogle Scholar
- Lopez F, Granjeaud S, Ara T, Ghattas B, Gautheret D: The disparate nature of "intergenic" polyadenylation sites. RNA. 2006, 12: 1794-1801. 10.1261/rna.136206.PubMed CentralPubMedView ArticleGoogle Scholar
- Brockman JH, Singh P, Liu D, Quinlan S, Salisbury J, Graber JH: PACdb: PolyA Cleavage Site and 3'-UTR Database. Bioinformatics. 2005, 21: 3691-3693. 10.1093/bioinformatics/bti589.PubMedView ArticleGoogle Scholar
- Lee JY, Yeh I, Park JY, Tian B: PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes. Nucleic Acids Res. 2007, 35: D165-D168. 10.1093/nar/gkl870.PubMed CentralPubMedView ArticleGoogle Scholar
- Yada T, Ishikawa M, Totki Y, Okubo K: Statistical Analysis of human DNA sequences in the vicinity of poly(A) signal. ICOT Technical Report. 1994, TR-876Google Scholar
- Kondrakhin YuV, Shamin VV, Kolchanov NA: Construction of a generalized consensus matrix for recognition of vertebrate pre-mRNA 3' terminal processing sites. CABIOS. 1994, 10: 597-603.PubMedGoogle Scholar
- Salamov AA, Solovyev VV: Recognition of 3'-processing sites of human mRNA precursors. CABIOS. 1997, 13: 23-28.PubMedGoogle Scholar
- Tabaska JE, Zhang MQ: Detection of polyadenylation signals in human DNA sequences. Gene. 1999, 231: 77-86. 10.1016/S0378-1119(99)00104-3.PubMedView ArticleGoogle Scholar
- Graber JH, McAllister GD, Smith TF: Probabilistic prediction of Saccharomyces cerevisiae mRNA 3'-processing sites. Nucleic Acids Res. 2002, 30: 1851-1858. 10.1093/nar/30.8.1851.PubMed CentralPubMedView ArticleGoogle Scholar
- Hajarnavis A: A probabilistic model of 3' end formation in Caenorhabditis elegans. Nucleic Acids Res. 2004, 32: 3392-3399. 10.1093/nar/gkh656.PubMed CentralPubMedView ArticleGoogle Scholar
- Bajic VB, Charn TH, Xu JX, Panda SK, TKrishnan SP: Prediction Models for DNA Transcription Termination Based on SOM Networks. Conf Proc IEEE Eng Med Biol Soc. 2005, 5: 4791-4794.PubMedGoogle Scholar
- Hu J, Lutz CS, Wilusz J, Tian B: Bioinformatic identification of candidate cis-regulatory elements involved in human mRNA polyadenylation. RNA. 2005, 11: 1485-1493. 10.1261/rna.2107305.PubMed CentralPubMedView ArticleGoogle Scholar
- Cheng Y, Miura RM, Bin Tian B: Prediction of mRNA polyadenylation sites by support vector machine. Bioinformatics. 2006, 22: 2320-2325. 10.1093/bioinformatics/btl394.PubMedView ArticleGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralPubMedView ArticleGoogle Scholar
- Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. J of the Royal Statistical Society. Series B (Methodological). 1977, 39: 1-38.Google Scholar
- Cardon L, Stormo G: Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J Mol Biol. 1992, 5: 159-170. 10.1016/0022-2836(92)90723-W.View ArticleGoogle Scholar
- Afifi AA, Azen SP: Statistical Analysis. A Computer Oriented Approach. 1979, Academic Press, NYGoogle Scholar
- Lawrence CE, Reilly AA: An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins. 1990, 7: 41-51. 10.1002/prot.340070105.PubMedView ArticleGoogle Scholar
- Solovyev VV, Salamov AA, Lawrence CB: Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res. 1994, 22: 5156-5163. 10.1093/nar/22.24.5156.PubMed CentralPubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.