Distinct polyadenylation landscapes of diverse human tissues revealed by a modified PA-seq strategy
- Ting Ni†1, 2Email author,
- Yanqin Yang†1,
- Dina Hafez3,
- Wenjing Yang1, 4,
- Kurtis Kiesewetter1,
- Yoshi Wakabayashi1,
- Uwe Ohler3, 5,
- Weiqun Peng4 and
- Jun Zhu1Email author
© Ni et al.; licensee BioMed Central Ltd. 2013
Received: 28 March 2013
Accepted: 10 September 2013
Published: 11 September 2013
Polyadenylation is a key regulatory step in eukaryotic gene expression and one of the major contributors of transcriptome diversity. Aberrant polyadenylation often associates with expression defects and leads to human diseases.
To better understand global polyadenylation regulation, we have developed a polyadenylation sequencing (PA-seq) approach. By profiling polyadenylation events in 13 human tissues, we found that alternative cleavage and polyadenylation (APA) is prevalent in both protein-coding and noncoding genes. In addition, APA usage, similar to gene expression profiling, exhibits tissue-specific signatures and is sufficient for determining tissue origin. A 3′ untranslated region shortening index (USI) was further developed for genes with tandem APA sites. Strikingly, the results showed that different tissues exhibit distinct patterns of shortening and/or lengthening of 3′ untranslated regions, suggesting the intimate involvement of APA in establishing tissue or cell identity.
This study provides a comprehensive resource to uncover regulated polyadenylation events in human tissues and to characterize the underlying regulatory mechanism.
Polyadenylation is a crucial step during the maturation of pre-messenger RNAs (pre-mRNA) in eukaryotes. With a few exceptions (e.g., histone), protein coding transcripts are cleaved at their 3′ ends and subsequently polyadenylated, resulting in a characteristic poly(A) tail . While 3′-end processing takes place in nucleus, it has profound effects on gene expression regulation, including transcription termination, terminal intron removal, mRNA export, translation initiation, and mRNA stability [2–5]. As one of the key regulated steps in mRNA maturation, improper cleavage and polyadenylation is often associated with expression defects and lead to human diseases [6–9].
3′-end formation of nascent transcripts is mediated by a large multi-protein complex, which in human constitutes of more than 80 proteins . Several key subunits of this large complex has been characterized, including the cleavage and polyadenylation specificity factor (CPSF), cleave stimulation factor (CstF), cleavage factor I and II (CFIm, CFIIm), and poly(A) polymerase (PAP) . In mammals, recruitment of the 3′-end processing complex to pre-mRNAs is mediated by a bipartite sequence element: CPSF interacts with a conserved AAUAAA or AUUAAA upstream motif, while CstF recognizes an U- or GU-rich downstream sequence element (DSE). Other factors (e.g., CFIIm and PAP) are subsequently recruited to form a functional 3′-end processing complex. Furthermore, additional sequences upstream (USEs) and/or downstream of the core elements may also play an auxiliary role in complex assembly and 3′-end formation . Lastly, it has been shown that nucleosomes are preferentially depleted at the polyadenylation sites (PA sites), implying that nucleosome position might be involved in defining authentic 3′-end formation .
Polyadenylation is a highly regulated event and alternative polyadenylation (APA) is one of the major contributors of transcriptome diversity [12, 13]. Earlier estimation suggested that approximately half of human and mouse transcripts (51.25% and 46.97%, respectively) harbor multiple poly(A) sites, leading to heterogeneous 3′ end formation . A more recent survey showed that up to 80% of the human genes exhibit tissue-specific variants resulting from tandem 3′ untranslated region (3′ UTR) events . Emerging evidence suggests that APA is coordinated with other regulatory events (e.g. transcriptional activity, alternative splicing and miRNA targeting) to ensure robust tissue- or cell-specific gene expression [8, 12, 15–18]. 3′ UTR length also plays a critical role in cell differentiation, proliferation and human diseases [8, 15, 16, 19, 20].
Global understanding of polyadenylation events in vertebrates has come primarily from cDNAs and expressed sequence tags . Several genome-wide approaches have been recently developed based on tilling array  and sequencing based approaches [19, 23–32]. Here, we present a modified polyadenylation sequencing or PA-seq strategy, allowing for precise identification of polyadenylation site at the genome scale in a cost-effective manner. By monitoring polyadenylation profiles of 13 human tissues, tissue-specific APA signatures were identified. In addition, noncoding transcripts, similar to protein-coding genes, are extensively regulated at the polyadenylation level. Lastly, we showed distinct patterns of 3′ UTR shortening/lengthening among different tissues, suggesting that APA may play a critical role in establishing tissue or cell identity. Together, the PA-seq strategy provided a comprehensive polyadenylation landscape of human transcripts across diverse tissues. It can be broadly employed to monitor polyadenylation profiles of eukaryotic transcriptomes.
A paired-end sequencing strategy to map global polyadenylation sites
PA-seq was employed to monitor polyadenylation profiles of 13 human tissues, including fetal and adult brains, breast, colon, heart, kidney, liver, lung, pancreas, prostate, skeletal muscle, spleen and testis. We obtained ~108 million paired 51-mer reads. Of them, 78% (~84 millions) of the read pairs can be uniquely mapped to the reference genome (Additional file 1). Another 7% of reads were mapped to multiple genomic locations, possibly due to repetitive regions in the human genome. For downstream data analyses, we focused on polyadenylation cleavage sites derived from uniquely mapped sequencing pairs.
As the initial assessment of PA-seq data, we interrogated the relative location of experimentally identified PA sites (all uniquely mapped pairs) in respect to PolyA_DB annotations . In terms of coverage, ~60% of all PA sites in the PolyA_DB have concordant reads observed in our data set (Figure 1b). An additional 18% of PolyA_DB sites are within 5nt of our PA-seq data (dashed green line in Figure 1b). Conversely, more than 60% of all uniquely mapped pairs are consistent with PA sites annotated in PolyA_DB (Figure 1c). We also compared the distance between PA-seq data and RefSeq annotation, which showed similar concordant results (Additional file 2). Non-redundant read pairs were also used to avoid potential biases for highly expressed genes or due to polymerase chain reaction (PCR) artifacts. While known PA sites remained as the dominant category, distal PA sites became more prominent (Additional file 3), suggesting that novel PA sites remained to be identified. A representative genomic region with four genes showed that the majority of the reads overlapped with the annotated PA sites (Figure 1d). Together, these results strongly suggested that PA-seq is a reliable approach for direct monitoring of genome-wide polyadenylation events.
To examine whether the PA-seq results reflect relative gene expression level, the tag counts of individual genes were compared with the available microarray-based expression datasets of the same tissue origin. With minimal normalization of both the array and the sequence data, PA-seq and array expression profiles were considerably correlated (R = 0.64, Additional file 4). This result was comparable with the correlation observed between typical RNA sequencing (RNA-seq) and array-based methods [33, 34]. PA-seq expression profile was also correlated with RNA-seq (R = 0.76, Additional file 5). Therefore, the read count generated with the PA-seq approach can potentially be used to reflect transcript abundance.
Prevalent APA in protein-coding and noncoding transcripts
For the protein-coding PA clusters, their relative locations within respective genes were determined (Figure 2b). 40.3% of the clusters are overlapped with known polyadenylation sites, 36.82% and 3.11% clusters are located in the annotated and extended 3′ UTR regions (500 bp downstream of the most proximal 3′ PA sites), respectively. These alternative PA sites are expected to modulate the length of 3′ UTRs and may affect the stability, localization and/or translation of the corresponding transcripts [14, 23]. In addition, a considerable proportion (15.26%) of the PA clusters fall into introns, suggesting the potential coupling of alternative splicing and polyadenylation . A small subset of the PA clusters is mapped to exons (1.81%) as well as the upstream regions of the transcripts (5′ UTR: 2.5%; transcription start sites: 0.01%; promoter: 0.2%; Figure 2b). These 5′ proximal PA clusters are interesting and may have resulted from transcriptional read-through , enhancer RNAs [37, 38] or other novel mechanisms . Together, our data support the notion that 3′-end formation in the human transcriptomes is much more complex than previously appreciated [23, 30, 36].
Validation of novel PA clusters
Sequence motifs involved in precise PA site selection
It has been shown that PA clusters may vary significantly in size and shape . We therefore broadly categorized PA clusters into 3 groups: Narrow Peak (NP), Broad with Peak (BP) and Weak Peak (WP). Both NP and BP have a dominant peak (mode ± 2bp > 50% of total reads), whereas WP exhibits a more dispersed pattern of cleavage sites (see Methods for detail). Strikingly, majority of the PA clusters belong to the NP (17,183 clusters or 45.1%) and BP (16,257 clusters or 42.7%) categories. WP clusters (4,640 clusters) only account for a small percentage (12.2%) of all PA clusters. This is in sharp contrast to transcription initiation clusters where WP is the dominant class . We speculated that the precision in PA site selection might be achieved by sequence conservation of cis-acting elements (e.g. PA signal) and/or tight regulation of polyadenylation machinery.
Next we compared the local sequence content between focused (NP and BP) and dispersed (WP) polyadenylation clusters. Interestingly, the distribution of A(A/T)TAAA motif is much tighter for the peaked (NP/BP) PA clusters than that of dispersed clusters (WP) (Figure 4d, top 2 panels), suggesting that the broad PA cleavage patterns may be resulting from multiple closely spaced polyadenylation signals. In addition, the sequences downstream of PA cleavage site differ significantly between NP/BP and WP clusters. For NP/BP clusters, they tend to have an extended G-rich region compared to the background (intergenic regions), whereas the U-rich region (‘TTTTTT’ at DNA level) is rather restricted to immediate sequences downstream of the cleavage site (Figure 4d, bottom panels). These results are consistent with notion that PA cleavage sites are often followed by U/GU-rich (DSE: downstream sequence element) and G-rich sequences (Auxiliary DSE), which are bound by CstF and CFIIm, respectively . In contrast, WP clusters tend to have an extended U-rich downstream of cleave site with a narrow G-rich region at nucleotide 34–38 positions (Figure 4d). Two additional motifs, ‘UGUA’ and ‘UGUG’, were also analyzed. Consistent with the previous report , our results showed that for both NP/BP and WP clusters these two motifs are enriched at the upstream and downstream regions of PA cleavage sites, respectively (Additional file 11). Together, these data suggested that intrinsic sequence motifs might play an important role in determining the precision of polyadenylation cleavage events.
Identification of PA signature in diverse tissues
Tissue specific PA sites are not necessarily resulting from tissue-specific APA. Several other mechanisms, such as tissue-specific transcription and/or alternative splicing, may lead to PA sites with preferential tissue usage. To determine the potential contribution at the transcriptional level, tissue-specific PA sites were used to compile their host transcripts, the promoter regions of which were subsequently searched for overrepresented sequence motifs. Interestingly, several transcription factor-binding sites were identified, which are known to promote tissue-specific gene expression (Figure 5, right panel). For instance, the steroidogenic factor 1 (SF1) is enriched in transcripts with testis-specific PA clusters. SF1 has recently been shown to cooperate with Sry to initiate testis development from early bipotential gonads . It is worth noting that tissue-restricted expression is not the sole contributor of tissue-specific PA usage.
3′ UTR shortening and gene expression regulation
Alternative polyadenylation (APA) is an important mechanism for regulated eukaryotic gene expression. It has been proposed that transcriptional activity is coupled with APA , and highly expressed loci tend to have shortened 3′ UTR by favoring the proximal APA sites . These earlier studies arbitrarily divided the 3′ UTR into “constant” and “alternative” regions based on prior APA annotations, and the shortening of 3′ UTR was inferred from microarray or RNA-seq data. Since our data contain experimentally defined PA sites as well as their counts, it provides a unique opportunity to interrogate global 3′ UTR patterning among different tissues.
Three distinct enrichment patterns were discovered among tissues examined. Brain, testis, lung and breast are enriched for genes with abundant transcripts and a shortened 3′ UTR (Figure 6b middle panel and Additional file 12 middle panel), while low-abundance transcripts in several other tissues (e.g. heart, skeletal muscle) tend to have a lengthened 3′ UTR (Figure 6b top panel and Additional file 12 top panel). The rest of tissues (e.g., pancreas and spleen) are by virtue an admixture and favor both gene categories (Figure 6b bottom panel and Additional file 12 bottom panel). Despite the tissue preferences in 3′ UTR shortening/lengthening, these observations supported the notion of reverse correlation between 3′ UTR length and expression level, indicating the potential coordination between transcriptional control and alternative polyadenylation . Other possibilities may also exist; for instance, 3′ UTR length might affect mRNA stability, a potential mechanism does not necessarily involve transcription regulation.
It has been proposed that 3′ UTR shortening may promote gene expression, possibly by avoiding miRNA targeting [43, 44]. Conversely, transcripts with longer 3′ UTR are more likely to be targeted by miRNAs and/or other regulatory molecules. With regard to APA, two different paradigms might be involved to help establish tissue identities. The preferential usage of proximal PA sites in tissue-enriched genes (e.g. brain and testis) is expected to reinforce transcriptional decision on genes required for proper tissue functions. Supporting this notion, Gene Ontology (GO) analysis of genes with high expression rank and shortened 3′ UTR in adult and fetal brain shows significant enrichment of neural functions (Additional file 13 and Additional file 14). One interesting example is SNCA, which encodes α-synuclein (non A4 component of amyloid precursor) and is involved in the regulation of dopamine release and transport . SNCA is highly expressed in adult and fetal brain compared to other tissue analyzed. In addition, it utilizes a proximal PA site only in the two brain tissues examined, while the distal PA site is employed in other tissues (e.g. pancreas and testis, Figure 6c). These data supported the notion that APA might play important roles in establishing tissue or cell identity. In contrast, certain lowly expressed transcripts might be the result of leaky transcription and 3′ UTR lengthening of these transcripts may help prevent their expression at the posttranscriptional level. It is worth noting that the two above mechanisms are not mutually exclusive. As demonstrated in the pancreas and spleen tissues (Figure 6b bottom panel), they might complement each other to ensure a broader genome-wide coordination.
To profile global polyadenylation events, several sequencing based methods have been developed [19, 25, 28, 30]. In these methods, the final sequencing libraries contained a stretch of repetitive sequence introduced by oligo(dT) priming. Due to low sequence complexity downstream of the PA cleavage sites of interest, such experimental design often led to unnecessary challenges at the sequencing step (requiring long sequence read if sequenced from 5′ end or using oligo(dT) as sequencing primer, which might not be optimal, if sequenced from 3′ end). To solve this problem, our PA-seq strategy introduced a modified base (dUTP) in the oligo(dT), which can help remove the stretch of ‘T’ by Uracil-Specific Excision Reagent (USER) after second strand synthesis, and thus enables direct interrogation of polyadenylation sites in a more effective manner. Although our PA libraries can be sequenced directly from the 3′ end, paired-end sequencing was employed in this study, which is expected to further improve mapping specificity and efficiency as well as recapitulate local sequence complexity (e.g. splicing events in the 3′ UTR region). A detailed comparison between the PolyA-seq  and PA-seq (this study) in mappability and identification of PA sites can be found in Additional file 15 and Additional file 16, respectively.
Another innovative polyadenylation method, 3P-Seq, has also been developed . One major advantage of this method is the use of splint oligo ligation to avoid internal priming events. In addition, it employed RNase H digestion to shorten poly(A) tail of the target mRNAs. However, the PA profiles resulting from 3P-Seq strategy is not as quantitative as conventional methods, possibly due to the involvement of multiple ligation steps that may not be highly efficient . An alternative approach, 3′READS, has also been developed to address the internal priming issue . These methods can be used to generate a reference list of authentic PA sites, which can help further improve the computational strategy of our PA-seq analysis to achieve more reliable and quantitative profiling of genome-wide polyadenylation events.
Alternative polyadenylation is known to be an important mechanism in regulating tissue-specific gene expression and proteome diversity. One striking observation from this study is the distinct tissue preference in 3′ UTR shortening/lengthening. Since our analyses focused on tandem PA sites in the 3′ UTR region, it is unlikely that 3′ UTR shortening/lengthening is a direct result of alternative splicing. However, it does not rule out that splicing factors, in addition to other RNA binding proteins, may play a direct role in controlling PA choices. An early profiling study has identified the expression signature of splicing factors in different tissues; and snRNP (small nuclear ribonucleoprotein particle) is among the splicing factor genes that are highly differentially expressed particular in brain and testis . Accumulated evidence showed that U1 snRNP plays a determinative role in regulating polyadenylation and mRNA length [24, 48]. Therefore, it will be of great interest to further characterize the direct involvement of component of basal and alternative splicing machinery in regulated polyadenylation.
Strikingly, we found selective 3′ UTR shortening in genes highly expressed in fetal and adult brain, while others have found that brain transcripts on average have the longest 3′ UTR in length , a subset of which may even have extended 3′ UTR beyond current annotations . Instead of averaging among transcribed loci, our analysis is a gene-centric approach, comparing the normalized expression level as well as the effective UTR length of the same gene among diverse tissues, to uncover potential coordination between gene expression and polyadenylation. These opposing observations may not be contradictory as they seem to be at the initial glimpse. Together, these data implied that a subset of brain-enriched genes preferentially shorten their 3′ UTRs to ensure a high level of expression, despite that a different subset of transcripts expressed in the brain may have a rather extended 3′ UTR. As one prevalent phenomenon in neurons, it has been shown that brain-enriched miRNAs tend to co-express with their target genes, which also exhibit brain-specific expression . The paradox can be in part explained by selective 3′ UTR shortening and target avoidance, suggesting alternative polyadenylation might be critical for establishing neuronal cell identity.
Lastly, we found ncRNA, similar to protein coding transcripts, are also extensively regulated at the polyadenylation level. The data sets presented here provided a starting point to further characterize the functional significance of such regulatory events. It is plausible that different ncRNA isoforms may have distinct regulatory functions, cellular localization and/or stability. For instance, comparison of polyadenylation transcripts at different cellular compartments is expected to interrogate potential links between alternative polyadenylation and subcellular localization. Further integration of function and interaction data (e.g. the secondary structure of noncoding transcripts with or without alternative region) is required for to characterizing this additional level of transcriptome complexity and deserves future study.
One major challenge of profiling transcriptome 3′ end is to avoid sequencing the poly(A) stretch. We used Uracil-Specific Excision Reagent to recognize and cleave the modified base (dUTP) near the polyadenylation site, therefore eliminating the potential complication due to these low complexity sequences. Using this modified PA-seq strategy, tissue-specific PA signatures were identified. Furthermore, we showed that noncoding transcripts, similar to protein-coding genes, are extensively regulated at the 3′ end. In addition, downstream G-rich motif and upstream U-rich motif played an important role in the regulation of tissue-specific polyadenylation. Lastly, we showed distinct patterns of 3′ UTR shortening/lengthening among different tissues, which suggested that APA may play a critical role in establishing tissue or cell identity. Together, the PA-seq strategy provided a comprehensive polyadenylation landscape of human transcripts across diverse tissues.
PA-seq library construction
Total RNA derived from 13 normal human tissues, including adult brain, fetal brain, kidney, liver, prostate, testis, breast, colon, heart, lung, pancreas, spleen and skeletal muscle were purchased from Biochain or Clontech. 10 μg of DNA-free total RNA was sheared into 200–300 nt fragments by heating (94°C for 3 minutes) with magnesium. Sheared RNA was precipitated by ethanol with GlycoBlue (Ambion) as a carrier. Reverse transcription was carried out using a modified oligo(dT) primer (5′-bio-TTTTTTTTTTTTTTTTdUTTTVN-3′, ‘bio’ denotes duo biotin group, ‘dU’ stands for deoxyuricile, ‘V’ represents any nucleotide except T and ‘N’ denotes any nucleotide). Incubate the reverse transcription reaction at 42°C for 2 min before adding Superscript reverse transcriptase II (Invitrogen) to increase specificity. After second strand synthesis, Dynabeads MyOne C1 (Invitrogen) was used to pull down the resulting dsDNA. Incubate the beads with APex Heat-Labile Alkaline Phosphatase (Epicentre) to remove phosphate group, which enables strand specificity at the later PCR step since only the bottom strand cDNA can be ligated and thus amplified. To release dsDNA from MyOne beads, USER enzyme (NEB) was added. The released dsDNA was then end repaired, followed by adding an ‘A’ base at the ends. Illumina paired-end Y linker was ligated and size selected. A 16-cycle PCR was then carried out with Phusion Hot Start High-Fidelity DNA Polymerase (Finnzymes) to generate the final PA-seq libraries, which were sequenced using Illumina HiSeq2000 platform.
Paired-end reads were split into 13 tissue samples by barcode and further correct the strand according to ‘TTT’ at the beginning of the reads. Processed raw data were then aligned by bwa  to human genome (version hg19) allowing two mismatches and processed by samtools . All uniquely mapped pairs were used for downstream analyses.
Comparison of PA-seq data with PolyA_DB and RefSeq
PolyA_DB (version 1, only contain human and mouse poly(A) sites) was downloaded from Bin Tian’s website (http://exon.umdnj.edu/polya_db/) . PolyA_DB data was collected from all cDNA/ESTs sequences in the UniGene database from NCBI (July and August 2005 version) and aligned to genome sequences build hg17 using BLAT. ESTs sequences came from a variety of tissues including eye, retina, skin, testis, pancreas, stomach, colon, brain and mixed tissues. For PolyA_DB sequences, coding and intronic regions were taken into consideration for those falling in 3′ UTR with 1kb upstream and 1kb downstream. For PA-seq data, we selected those pairs with more than one unique 5′ reads. Since in PolyA_DB, each gene has at least one poly(A) site, some genes actually appear more than once, and hence the minimum distance varies accordingly. For comparison with RefSeq annotation, we download the database from UCSC. Both protein-coding and non-coding genes were computed. All mappable pairs and non-redundant pairs were calculated separately.
Peak calling and cluster analysis
After mapping, the information of exact cleavage and polyadenylation sites was extracted. F-seq  was used for the peak calling of PA sites. The PA-seq data from 13 tissues were combined so that a unified peak-calling scheme can be applied. We resized the PA clusters to the shortest distance that contained 95% of the reads according to our previous publication . To filter internal priming, we removed PA clusters with 15 ‘A’ in the 20 nucleotides region downstream of peak mode. With a minimum of 50 tags, we classified the PA clusters into three patterns by the following definitions : Narrow Peak (NP) clusters contained ≥ 50% of the reads within ±2 nt of the mode and span < 10 nt; Broad with Peak (BP) clusters were those that contained ≥ 50% of the reads within ±2 nt of the mode and are ≥ 10 nt in length; All other clusters were classified as Weak Peak (WP). RefSeq annotation was used for the genomic location analysis. For peaks falling into multiple categories, we set priority as PA > 3′-UTR > Extended_3′-UTR > Exon > Intron > 5′-UTR > TSS > Promoter. PA defined as annotated PA site plus upstream and downstream 10bp. Extended_3′-UTR denotes downstream 1 kb of 3′ UTR. TSS represents exact transcriptional start site plus upstream and downstream 10bp. Promoter defined as upstream 250 bp of TSS. For Figure 4d, we applied 5 bp window as threshold of NP and BP.
Motif and information content analysis
Homer software from UCSD (http://biowhat.ucsd.edu/homer/) was used to perform the motif analysis in Figure 4d and Additional file 11. 6mer motifs were analyzed. Weblogo version 3 (http://weblogo.threeplusone.com) was used to draw the information content with classic color scheme for Figure 4b,c. Top four enriched 6mer motifs (p value < 1 × 10-5) were counted in 100-bp regions surrounding NP/BP and WP PA clusters separately. 4000 random intergenic sequences each with 100 bp in length were also scanned by top four enriched 6mer motifs. The frequency of each enriched motif was calculated by normalizing total input sequences.
Shannon entropy analysis of tissue-specific polyadenylation
Shannon entropy  was used for the analysis of identifying tissue-specific polyadenylation sites. First, we normalized each tissue PA-seq data by its library size, and then performed quantile normalization for all 13 tissues PA-seq data. Shannon entropy score H was computed for each polyadenylation site. Q scores were computed based on H score to estimate the expression specificity of each PA site for a particular tissue. Based on H and Q scores’ distribution and variation, we applied the cut-off corresponding approximately less than the median minus two standard deviations to identify the regulated tissue specific polyadenylation sites. On the contrary, we detected constitutive PA sites when both H and Q values are greater than mean plus two standard deviations.
Clustering of tissue specific gene expression
Unsupervised two-way hierarchical clustering of 13 tissues and identified tissue specific polyadenylation sites was performed based on Pearson correlation and the average linkage method. In each specific tissue cluster, 1kb upstream sequences from PA cluster start site were scanned for finding transcription factor (TF) binding sites common to at least 60% of input sequences by MatInspector in Genomatix Genome Analyzer. Tissue-specific TFs were selected by p value < 10-7.
Gene ontology analysis
DAVID (Database for Annotation, Visualization and Integrated Discovery)  was used for the GO analysis of enriched genes from Figure 6b. Three terms (GOTERM_BP_FAT, GOTERM_CC_FAT and GOTERM_MF_FAT) were selected for analysis. The top ten GO terms in the “Functional Annotation Chart” were shown in Additional file 13 and Additional file 14. The default population background in enrichment calculation consists of all corresponding genes in the genome that have at least one annotation in the analyzing categories.
Define 3′ UTR shortening index
We used Entrez gene ID to cluster the RefSeq transcript IDs. Overlapping genes on the same strand were removed from further consideration to avoid wrongful assignment of PA site to 3′ UTR. Furthermore, we restricted our attention to genes (represented by Entrez IDs) whose annotated transcripts (represented by RefSeq IDs) that shared the same stop codon. Based on the consolidated gene model above, the pipeline considers both the locations (L) and the tag counts (C) of the PA clusters identified in the 3′ UTR of each transcribed locus to compute “Effective_UTR_Length” (Effective_UTR_Length = Σ(C i × L i )/Σ C i , see Figure 6a). 3′ UTR shortening index (USI) is defined as the length difference between annotated and effective 3′ UTR (USI = Annotated_UTR_Length − Effective_UTR_Length, see Figure 6a). The effective 3′ UTR length and PA cluster expression were ranked from low to high separately among 13 tissues. Each box in a given 2D plot (Figure 6b) represents Z-score of the observed number of PA clusters (genes) in the specific tissue.
RT-PCR validation of PA clusters
To validate the mode of a PA cluster, a junction primer (+, half complementary to 3′ end sequence and half complementary to poly(A) sequence) together with an upstream gene-specific primer were designed by Primer3 (version 0.4.0) and synthesized by IDT. The control primer (−) lacks the sequence complementary to poly(A). Reverse transcription of 1μg DNA-free total RNA was performed with SuperScript II reverse transcriptase (Invitrogen) in a 20 μl reaction, containing 4 pmol oligo(dT) primer (5′-TTTTTTTTTTTTTTTTTTTTVN-3′), 40 units of RNasin (Promega) and 6 ng/μl freshly-made actinomycin D. RT reaction was incubated at 42°C for 2 min before adding reverse transcriptase. We then incubate the reaction at 42°C for 60 min and 75°C for 15 min. PCR was performed in a 20 μl reaction, containing 1 μl of RT template, 1× PCR buffer (Qiagen), 0.4 nmol dNTP, 0.4 pmol of each forward and reverse primer, and 1 μl of Taq DNA Polymerase (Qiagen). Thermal cycling was carried out as the following: 94°C for 30s; 30 ~ 40 cycles of 94°C for 30s, 52 ~ 58°C for 30s and 72°C for 30s; 72°C for 10 min; hold at 4°C. 3μl of PCR product was taken out and run in a 2% agarose gel.
PA-seq raw data can be found at the NCBI Sequence Read Archive (SRA) with submission number SRA059064. Custom tracks of 13 tissues can be found in Additional file 17.
A step-by-step protocol of PA-seq can be found in Additional file 18.
We are grateful to Dr. Delong Liu for the help of gene expression analysis between PA-seq and microarray data. We are indebted to Drs. Kang Tu, Yuan Gao for their advices on data analysis. This work is supported by the Intramural Research Program of National Heart Lung and Blood Institute (NHLBI), National Institutes of Health (NIH), the National Basic Research Program of China (2013CB530700), and the National Science Foundation of China (31271348).
- Soller M: Pre-messenger RNA processing and its regulation: a genomic perspective. Cell Mol Life Sci. 2006, 63 (7–8): 796-819.View ArticlePubMedGoogle Scholar
- Buratowski S: Connections between mRNA 3′ end processing and transcription termination. Curr Opin Cell Biol. 2005, 17 (3): 257-261. 10.1016/j.ceb.2005.04.003.View ArticlePubMedGoogle Scholar
- Gallie DR: A tale of two termini: a functional interaction between the termini of an mRNA is a prerequisite for efficient translation initiation. Gene. 1998, 216 (1): 1-11. 10.1016/S0378-1119(98)00318-7.View ArticlePubMedGoogle Scholar
- Misquitta CM, Iyer VR, Werstiuk ES, Grover AK: The role of 3′-untranslated region (3′-UTR) mediated mRNA stability in cardiovascular pathophysiology. Mol Cell Biochem. 2001, 224 (1–2): 53-67.View ArticlePubMedGoogle Scholar
- Millevoi S, Vagner S: Molecular mechanisms of eukaryotic pre-mRNA 3′ end processing regulation. Nucleic Acids Res. 2010, 38 (9): 2757-2774. 10.1093/nar/gkp1176.PubMed CentralView ArticlePubMedGoogle Scholar
- Danckwardt S, Hentze MW, Kulozik AE: 3′ end mRNA processing: molecular mechanisms and implications for health and disease. Embo J. 2008, 27 (3): 482-498. 10.1038/sj.emboj.7601932.PubMed CentralView ArticlePubMedGoogle Scholar
- Jan CH, Friedman RC, Ruby JG, Bartel DP: Formation, regulation and evolution of Caenorhabditis elegans 3′UTRs. Nature. 2010, 469 (7328): 97-101.PubMed CentralView ArticlePubMedGoogle Scholar
- Mayr C, Bartel DP: Widespread shortening of 3′UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell. 2009, 138 (4): 673-684. 10.1016/j.cell.2009.06.016.PubMed CentralView ArticlePubMedGoogle Scholar
- Shi Y, Di Giammartino DC, Taylor D, Sarkeshik A, Rice WJ, Yates JR, Frank J, Manley JL: Molecular architecture of the human pre-mRNA 3′ processing complex. Mol Cell. 2009, 33 (3): 365-376. 10.1016/j.molcel.2008.12.028.PubMed CentralView ArticlePubMedGoogle Scholar
- Sartini BL, Wang H, Wang W, Millette CF, Kilpatrick DL: Pre-messenger RNA cleavage factor I (CFIm): potential role in alternative polyadenylation during spermatogenesis. Biol Reprod. 2008, 78 (3): 472-482. 10.1095/biolreprod.107.064774.View ArticlePubMedGoogle Scholar
- Spies N, Nielsen CB, Padgett RA, Burge CB: Biased chromatin signatures around polyadenylation sites and exons. Mol Cell. 2009, 36 (2): 245-254. 10.1016/j.molcel.2009.10.008.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456 (7221): 470-476. 10.1038/nature07509.PubMed CentralView ArticlePubMedGoogle Scholar
- Xing D, Li QQ: Alternative polyadenylation: a mechanism maximizing transcriptome diversity in higher eukaryotes. Plant Signal Behav. 2009, 4 (5): 440-442. 10.4161/psb.4.5.8345.PubMed CentralView ArticlePubMedGoogle Scholar
- Tian B, Hu J, Zhang H, Lutz CS: A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 2005, 33 (1): 201-212. 10.1093/nar/gki158.PubMed CentralView ArticlePubMedGoogle Scholar
- Sandberg R, Neilson JR, Sarma A, Sharp PA, Burge CB: Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science. 2008, 320 (5883): 1643-1647. 10.1126/science.1155390.PubMed CentralView ArticlePubMedGoogle Scholar
- Ji Z, Lee JY, Pan Z, Jiang B, Tian B: Progressive lengthening of 3′ untranslated regions of mRNAs by alternative polyadenylation during mouse embryonic development. Proc Natl Acad Sci USA. 2009, 106 (17): 7028-7033. 10.1073/pnas.0900028106.PubMed CentralView ArticlePubMedGoogle Scholar
- Majoros WH, Ohler U: Spatial preferences of microRNA targets in 3′ untranslated regions. BMC Genomics. 2007, 8: 152-10.1186/1471-2164-8-152.PubMed CentralView ArticlePubMedGoogle Scholar
- Ji Z, Luo W, Li W, Hoque M, Pan Z, Zhao Y, Tian B: Transcriptional activity regulates alternative cleavage and polyadenylation. Mol Syst Biol. 2011, 7: 534-PubMed CentralView ArticlePubMedGoogle Scholar
- Fu Y, Sun Y, Li Y, Li J, Rao X, Chen C, Xu A: Differential genome-wide profiling of tandem 3′ UTRs among human breast cancer and normal cells by high-throughput sequencing. Genome Res. 2011, 21 (5): 741-747. 10.1101/gr.115295.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Lin Y, Li Z, Ozsolak F, Kim SW, Arango-Argoty G, Liu TT, Tenenbaum SA, Bailey T, Monaghan AP, Milos PM: An in-depth map of polyadenylation sites in cancer. Nucleic Acids Res. 2012, 40 (17): 8460-8471. 10.1093/nar/gks637.PubMed CentralView ArticlePubMedGoogle Scholar
- Lee JY, Yeh I, Park JY, Tian B: PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes. Nucleic Acids Res. 2007, 35: D165-D168. 10.1093/nar/gkl870.PubMed CentralView ArticlePubMedGoogle Scholar
- Lian Z, Karpikov A, Lian J, Mahajan MC, Hartman S, Gerstein M, Snyder M, Weissman SM: A genomic analysis of RNA polymerase II modification and chromatin architecture related to 3′ end RNA polyadenylation. Genome Res. 2008, 18 (8): 1224-1237. 10.1101/gr.075804.107.PubMed CentralView ArticlePubMedGoogle Scholar
- Ozsolak F, Kapranov P, Foissac S, Kim SW, Fishilevich E, Monaghan AP, John B, Milos PM: Comprehensive polyadenylation site maps in yeast and human reveal pervasive alternative polyadenylation. Cell. 2010, 143 (6): 1018-1029. 10.1016/j.cell.2010.11.020.PubMed CentralView ArticlePubMedGoogle Scholar
- Berg MG, Singh LN, Younis I, Liu Q, Pinto AM, Kaida D, Zhang Z, Cho S, Sherrill-Mix S, Wan L: U1 snRNP determines mRNA length and regulates isoform expression. Cell. 2012, 150 (1): 53-64. 10.1016/j.cell.2012.05.029.PubMed CentralView ArticlePubMedGoogle Scholar
- Jenal M, Elkon R, Loayza-Puch F, van Haaften G, Kuhn U, Menzies FM, Oude Vrielink JA, Bos AJ, Drost J, Rooijers K: The poly(A)-binding protein nuclear 1 suppresses alternative cleavage and polyadenylation sites. Cell. 2012, 149 (3): 538-553. 10.1016/j.cell.2012.03.022.View ArticlePubMedGoogle Scholar
- Martin G, Gruber AR, Keller W, Zavolan M: Genome-wide analysis of pre-mRNA 3′ end processing reveals a decisive role of human cleavage factor I in the regulation of 3′ UTR length. Cell Rep. 2012, 1 (6): 753-763. 10.1016/j.celrep.2012.05.003.View ArticlePubMedGoogle Scholar
- Derti A, Garrett-Engele P, Macisaac KD, Stevens RC, Sriram S, Chen R, Rohl CA, Johnson JM, Babak T: A quantitative atlas of polyadenylation in five mammals. Genome Res. 2012, 22 (6): 1173-1183. 10.1101/gr.132563.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Mangone M, Manoharan AP, Thierry-Mieg D, Thierry-Mieg J, Han T, Mackowiak SD, Mis E, Zegar C, Gutwein MR, Khivansara V: The landscape of C. elegans 3′ UTRs. Science. 2010, 329 (5990): 432-435. 10.1126/science.1191244.PubMed CentralView ArticlePubMedGoogle Scholar
- Jan CH, Friedman RC, Ruby JG, Bartel DP: Formation, regulation and evolution of caenorhabditis elegans 3′ UTRs. Nature. 2011, 469 (7328): 97-101. 10.1038/nature09616.PubMed CentralView ArticlePubMedGoogle Scholar
- Shepard PJ, Choi EA, Lu J, Flanagan LA, Hertel KJ, Shi Y: Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq. RNA (New York, NY. 2011, 17 (4): 761-772. 10.1261/rna.2581711.View ArticleGoogle Scholar
- Yao C, Biesinger J, Wan J, Weng L, Xing Y, Xie X, Shi Y: Transcriptome-wide analyses of CstF64-RNA interactions in global regulation of mRNA alternative polyadenylation. Proc Natl Acad Sci USA. 2012, 109 (46): 18773-18778. 10.1073/pnas.1211101109.PubMed CentralView ArticlePubMedGoogle Scholar
- Smibert P, Miura P, Westholm JO, Shenker S, May G, Duff MO, Zhang D, Eads BD, Carlson J, Brown JB: Global patterns of tissue-specific alternative polyadenylation in Drosophila. Cell Rep. 2012, 1 (3): 277-289. 10.1016/j.celrep.2012.01.001.PubMed CentralView ArticlePubMedGoogle Scholar
- Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008, 453 (7199): 1239-1243. 10.1038/nature07002.View ArticlePubMedGoogle Scholar
- Ni T, Corcoran DL, Rach EA, Song S, Spana EP, Gao Y, Ohler U, Zhu J: A paired-end sequencing strategy to map the complex landscape of transcription initiation. Nat Methods. 2010, 7 (7): 521-527. 10.1038/nmeth.1464.PubMed CentralView ArticlePubMedGoogle Scholar
- Boyle AP, Guinney J, Crawford GE, Furey TS: F-Seq: a feature density estimator for high-throughput sequence tags. Bioinformatics. 2008, 24 (21): 2537-2538. 10.1093/bioinformatics/btn480.PubMed CentralView ArticlePubMedGoogle Scholar
- Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermuller J, Hofacker IL: RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007, 316 (5830): 1484-1488. 10.1126/science.1138341.View ArticlePubMedGoogle Scholar
- Ren B: Transcription: enhancers make non-coding RNA. Nature. 2010, 465 (7295): 173-174. 10.1038/465173a.View ArticlePubMedGoogle Scholar
- Kim TK, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, Harmin DA, Laptewicz M, Barbara-Haley K, Kuersten S: Widespread transcription at neuronal activity-regulated enhancers. Nature. 2010, 465 (7295): 182-187. 10.1038/nature09033.PubMed CentralView ArticlePubMedGoogle Scholar
- ENCODE: Post-transcriptional processing generates a diversity of 5′-modified long and short RNAs. Nature. 2009, 457 (7232): 1028-1032. 10.1038/nature07759.View ArticleGoogle Scholar
- Schug J, Schuller WP, Kappen C, Salbaum JM, Bucan M, Stoeckert CJ: Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol. 2005, 6 (4): R33-10.1186/gb-2005-6-4-r33.PubMed CentralView ArticlePubMedGoogle Scholar
- Sekido R, Lovell-Badge R: Sex determination involves synergistic action of SRY and SF1 on a specific Sox9 enhancer. Nature. 2008, 453 (7197): 930-934. 10.1038/nature06944.View ArticlePubMedGoogle Scholar
- Pinto PA, Henriques T, Freitas MO, Martins T, Domingues RG, Wyrzykowska PS, Coelho PA, Carmo AM, Sunkel CE, Proudfoot NJ: RNA polymerase II kinetics in polo polyadenylation signal selection. Embo J. 2011, 30 (12): 2431-2444. 10.1038/emboj.2011.156.PubMed CentralView ArticlePubMedGoogle Scholar
- Stark A, Brennecke J, Bushati N, Russell RB, Cohen SM: Animal MicroRNAs confer robustness to gene expression and have a significant impact on 3′ UTR evolution. Cell. 2005, 123 (6): 1133-1146. 10.1016/j.cell.2005.11.023.View ArticlePubMedGoogle Scholar
- Shkumatava A, Stark A, Sive H, Bartel DP: Coherent but overlapping expression of microRNAs and their targets during vertebrate development. Genes Dev. 2009, 23 (4): 466-481. 10.1101/gad.1745709.PubMed CentralView ArticlePubMedGoogle Scholar
- Abeliovich A, Schmitz Y, Farinas I, Choi-Lundberg D, Ho WH, Castillo PE, Shinsky N, Verdugo JM, Armanini M, Ryan A: Mice lacking alpha-synuclein display functional deficits in the nigrostriatal dopamine system. Neuron. 2000, 25 (1): 239-252. 10.1016/S0896-6273(00)80886-7.View ArticlePubMedGoogle Scholar
- Hoque M, Ji Z, Zheng D, Luo W, Li W, You B, Park JY, Yehia G, Tian B: Analysis of alternative cleavage and polyadenylation by 3′ region extraction and deep sequencing. Nature methods. 2013, 10 (2): 133-139.PubMed CentralView ArticlePubMedGoogle Scholar
- Grosso AR, Gomes AQ, Barbosa-Morais NL, Caldeira S, Thorne NP, Grech G, von Lindern M, Carmo-Fonseca M: Tissue-specific splicing factor gene expression signatures. Nucleic Acids Res. 2008, 36 (15): 4823-4832. 10.1093/nar/gkn463.PubMed CentralView ArticlePubMedGoogle Scholar
- Kaida D, Berg MG, Younis I, Kasim M, Singh LN, Wan L, Dreyfuss G: U1 snRNP protects pre-mRNAs from premature cleavage and polyadenylation. Nature. 2010, 468 (7324): 664-668. 10.1038/nature09479.PubMed CentralView ArticlePubMedGoogle Scholar
- Miura P, Shenker S, Andreu-Agullo C, Westholm JO, Lai EC: Widespread and extensive lengthening of 3′ UTRs in the mammalian brain. Genome Res. 2013, 23 (5): 812-825. 10.1101/gr.146886.112.PubMed CentralView ArticlePubMedGoogle Scholar
- Tsang J, Zhu J, van Oudenaarden A: MicroRNA-mediated feedback and feedforward loops are recurrent network motifs in mammals. Mol Cell. 2007, 26 (5): 753-767. 10.1016/j.molcel.2007.05.018.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
- da Huang W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4 (1): 44-57.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.