Quantification of alternative polyadenylation with LABRAT
To quantify relative alternative polyadenylation site usage from RNAseq data, LABRAT takes a genome annotation file and first searches the annotation for tags that define transcripts with ill-defined 3′ ends in order to filter and remove them from further analysis (Fig. 1A). Because annotations are isoform-based, they are often rigid in their explicit connection of upstream alternative splicing events to downstream APA sites, even though this connection may not be accurate. Therefore, to exclude spurious contributions of upstream alternative splicing events to APA site quantification, we extracted the final two exons of every transcript and the expression of these transcript “terminal fragments” was quantified using Salmon [30].
For each gene, alternative polyadenylation sites are then defined using terminal fragments. Terminal fragments with 3′ ends within 25 nt of other 3′ ends are grouped together to define a single polyadenylation site, and the sites are ordered from most gene-proximal to most gene-distal. Each APA site within a gene is assigned a value, m, which is defined as its position within this proximal-to-distal ordering, beginning with 0. Each gene is assigned a value, n, which is defined as the number of distinct APA sites that it contains. The expression (TPM) of every terminal fragment belonging to a given APA site is then summed to define the expression level of the APA site, and this process is repeated for every APA site within a gene. The expression level of each APA site is then scaled according to the following formula:
$$ {TPM}_{scaled}={TPM}_{unscaled}\ \left(\frac{m}{n-1}\right) $$
To quantify a gene’s relative APA site usage, we defined a term, ψ. Scaled and unscaled TPM values are summed across all APA sites within a gene, and ψ is defined as the ratio between these summed values:
$$ \psi =\frac{\sum {TPM}_{scaled}}{\sum {TPM}_{unscaled}} $$
With this strategy, genes that show exclusive usage of the most gene-proximal APA site will be assigned a ψ value of 0, while those that show exclusive usage of the most gene-distal APA site will be assigned a ψ value of 1 (Fig. 1B). Usage of both sites will result in a ψ value between 0 and 1 depending on the relative usage of the sites. Importantly, this strategy also applies to genes with more than 2 APA sites. In these cases, one ψ value is assigned to the entire gene without the need to do multiple pairwise comparisons between APA sites.
After calculating ψ values for genes in all samples, LABRAT compares ψ values of experimental replicates across experimental conditions to identify genes with statistically significantly different ψ values between conditions. This is done using a mixed linear effects model that tests the relationship between ψ values and experimental condition. A null model is also created in which the term denoting the experimental condition has been removed. A likelihood ratio test compares the goodness of fit of these two models to the observed data and assigns a p value for the probability that the real model is a better fit than the null model. In simple comparisons between two conditions, this approach mimics a t-test. However, this technique has the advantage of being able to easily incorporate covariates into significance testing. After performing this test on all genes, the raw p values are corrected for multiple hypothesis testing using a Benjamini-Hochsberg correction [32].
In addition, LABRAT determines whether a gene’s APA sites conform to either the tandem UTR or ALE structures (Fig. 1B) and designates the gene accordingly. For genes with more than 2 APA sites, it is possible to contain both tandem UTR and ALE structures. These genes are designated as having a “mixed” APA structure.
Identifying tissue-specific differences in APA isoform abundance with LABRAT
To demonstrate the ability of LABRAT to identify and quantify differences in APA isoform abundance, we analyzed RNAseq data from mouse brain and liver tissues [33]. Because neuronal tissues are known to be highly enriched for the use of distal APA sites [17], we reasoned that comparison of these two tissues might provide a positive control for LABRAT’s ability to identify differential APA isoform abundance.
We found 470 genes that displayed differential relative APA isoform abundance between the tissues (FDR < 0.05) (Fig. 1C). As expected, 68% of these genes showed increased usage of distal APA sites in brain, indicating a significant enrichment for the use of downstream APA sites in this tissue (binomial p = 3.2e-15). To further explore changes in ψ value for specific genes, we plotted read coverages over two genes that showed significantly more downstream APA site usage in brain tissue: Slc16a7 and Elavl1 (Fig. 1D, S1A). For both genes, we observed significantly lower read coverages corresponding to usage of the distal APA site in the liver samples relative to the brain samples. Accordingly, LABRAT assigned these genes to have low ψ values in the liver samples, and high ψ values in the brain samples, indicating that LABRAT can accurately quantify APA.
To perform similar analyses in human samples, we analyzed over 5000 RNAseq samples from over 30 different human tissues produced as part of the Genotype-Tissue Expression (GTEx) project [34]. We quantified APA isoform abundance in these samples and observed relationships between tissue APA using PCA analysis (Figure S1B). In this analysis, brain and testis samples were clear outliers. Interestingly, performing the PCA analysis using only tandem UTR (Figure S1C) or ALE (Figure S1D) genes produced very similar results, suggesting that these two forms of APA are broadly coregulated across many tissues.
To understand more about APA in human brain and testis, we compared their APA profiles to those observed in human liver samples. As expected, we observed that brain samples exhibited a significant bias for the use of downstream APA sites (p < 2.2e-16) (Figure S1E). Conversely, testis samples exhibited a similar bias for the use of upstream APA sites (p < 2.2e-16) (Figure S1F). The propensity of testis to use upstream APA sites has been previously observed [35,36,37] and is likely a key feature of spermatogenesis [38]. Overall, these results demonstrate the ability of LABRAT to recapitulate previously reported observations and gave us confidence in its results moving forward.
Quantifying APA isoform abundance with 3′ end sequencing data using LABRAT
Although the plethora of available RNAseq datasets make it possible to observe APA trends in a variety of contexts, RNAseq is not perfectly suited to APA quantification. Library preparations that enrich for reads near cleavage and polyadenylation sites provide a more direct, and potentially more accurate, quantification of APA isoforms [24, 25].
To allow LABRAT to quantify APA isoforms in 3′ end sequencing data, we included the ‘librarytype’ parameter. If this parameter is designated as ‘3pseq’, LABRAT uses the counts of reads assigned to polyA sites for quantification rather than the length-normalized TPM metric. Because 3′ end data is produced using oligo-dT anchors, length normalization is not necessary and if utilized would unfairly penalize long transcripts.
To assess the accuracy of APA isoform quantification with LABRAT from 3′ end data, we used a dataset in which the authors prepared 3′ end libraries from RNA isolated from human peripheral blood mononuclear cells (PBMCs) with and without treatment with poly dI:dC [39]. Salmon, the transcript quantification tool utilized by LABRAT, has been shown to accurately quantify transcript abundances using 3′ end data [39]. We calculated ψ values from this data using LABRAT (Fig. 1E, S1G) and compared them to values produced by the more classical approach of simply counting the aligned reads associated with each APA site (see Methods). We found that the ψ values produced by LABRAT were in strong agreement with those produced by alignment-dependent method (R ~ 0.92) (Figure S1H). It is important to note here that the deviation from perfect agreement is possibly due to the greater ability of Salmon (and therefore LABRAT) to accurately assign reads that could be consistent with multiple polyadenylation sites.
An open question in the field of APA quantification is the extent to which APA isoforms can be accurately quantified using RNAseq as opposed to 3′ end sequencing. To address this question, we took advantage of the fact that in the PBMC study, the authors prepared RNAseq and 3′ end sequencing libraries from the same RNA samples [39]. We used LABRAT to calculate ψ values from the RNAseq and 3′ end data using ‘RNAseq’ and ‘3pseq’ librarytype parameters, respectively. We found that ψ values from the two library preparation methods were reasonably and reproducibly correlated (R ~ 0.67) while ψ values from samples produced using the same highly correlated (R ~ 0.97) (Fig. 1F). RNAseq is therefore able to quantify APA isoform abundance with generally acceptable accuracy. Further, both methods accurately segregated samples into treatment and control groups using ψ values (Fig. 1F), giving confidence in the ability of RNAseq libraries to accurately reflect APA status and opening up tens of thousands of RNAseq datasets for quantification.
Comparison of LABRAT to similar methods of APA isoform quantification
To compare LABRAT with other APA analysis tools, we generated a synthetic RNAseq dataset containing 50 million reads in which 1250 genes displayed increased distal APA site usage, 1250 genes displayed increased proximal APA site usage, and 2500 genes displayed no change in APA site usage [40]. We used the software packages QAPA [26], DaPars [27], and Roar [28] in addition to LABRAT to quantify APA isoforms in these data.
QAPA, like LABRAT, uses lightweight alignments to quantify APA. Reassuringly, we found that ψ values calculated by LABRAT were highly correlated to the analogous metric used by QAPA, PPAU (R = 0.81) (Figure S1I). In comparing the four methods, LABRAT was the best suited to accurately identify differential APA in the simulated data (Fig. 1G). We further found that the accuracy of LABRAT was not noticeably affected by read depth down to one million reads (Figure S1J).
Alternative polyadenylation isoforms are differentially localized in cell bodies and projections
Multiple studies have found that alternative polyadenylation decisions made during nuclear processing can influence the subcellular localization of the resulting transcript, particularly in neuronal cells [41,42,43]. However, it has been unclear how widespread this effect is and whether it was driven primarily by tandem UTRs or ALEs. To address this, we used LABRAT to analyze the relative APA status of 26 paired transcriptomic datasets from cell body and projection samples from neuronal cells, NIH 3 T3 cells, and MDA-MB231 cells [41, 42, 44,45,46,47,48,49,50].
For all samples, we identified genes whose ψ value was significantly different between subcellular compartments (FDR < 0.05), finding between 10 and 740 genes that fit this criterion in each sample (Fig. 2A). Many of these genes were shared across multiple samples (Figure S2A). For these genes, we then compared their ψ values across compartments by subtracting the ψ value in the cell body from the ψ in the projection to define Δψ. Genes with positive Δψ values therefore had their distal APA isoform enriched in projections while those with negative Δψ values had their proximal APA isoform enriched in projections.
We found that for 19 of these 26 samples, over 50% of significant genes had positive? Δψ values, indicating a broad connection between the use of distal APA sites and localization of the resulting transcript to cell projections (Fig. 2A), consistent with previous reports [41, 42]. Further, we observed a relationship between the amount of time that the projection had been allowed to grow and the fraction of genes with positive Δψ values. Of the samples in which the projections had grown for 2 days or less, 15 out of 20 showed a significant bias for the association of distal APA sites with projections. Conversely, of the samples in which the projections had grown for 6 days or more, 0 out of 6 showed a significant bias for the association of distal APA sites with projections. This suggests that distal APA transcripts may play a role in early projection outgrowth but may be less important in mature projections.
Given the conflicting reports about the relative contributions of distal APA produced by tandem UTR and ALEs to the transcriptomes of cell projections [41,42,43], we analyzed these two classes of APA isoforms separately. Across the 26 subcellular comparisons, we found a strong, significant correlation (R = 0.58, p = 0.0024) between the fraction of ALE genes with positive Δψ values and the fraction of tandem UTR genes with positive Δψ values (Fig. 2B). This indicates that both classes of genes are preferentially contributing their distal APA isoforms to projections and suggests that these two classes of alternative poly(A) site selection may be regulated by a common mechanism.
Alternative polyadenylation isoforms are differentially localized in biochemically defined cytosolic and membrane fractions
To further explore connections between APA and RNA localization beyond cell projections, we used LABRAT to analyze RNAseq data from a biochemical fractionation of 3 cell types, Drosophila DM-D17-C3 (D17) cells, human HepG2 cells, and human K562 cells [51]. In these data, cells were fractionated into nuclear, cytosolic, membrane-associated and insoluble fractions. RNA was isolated from each of these fractions and prepared for high-throughput sequencing using either polyA-selection-based or ribosomal RNA-depletion-based library preparation. For each fraction, two replicates of each library preparation method were sequenced.
As with the projection data, we compared ψ values for genes across cellular compartments. Hierarchical clustering of samples based on ψ values revealed that samples from the same fraction generally clustered with each other, indicating the high quality of the data (Figure S2B-D). To minimize the effect of library preparation on the identification of genes with significantly different ψ values across compartments, we included the library preparation method as a covariate in LABRAT’s linear model. This allowed us to pool all of the samples for a given compartment in order to identify genes with significantly different ψ values between compartments regardless of library preparation method.
We first identified genes with significantly different ψ values across any pairwise comparison between cytosolic, membrane-associated, and insoluble fractions (FDR < 0.05). Based on our observations relating distal APA and RNA localization to projections, we then asked if any of these fractions were associated with higher ψ values than the other two. We visualized these comparisons using simplex plots (Fig. 2C). In these plots, each dot represents a gene, and its position is determined by the relative ψ values in each fraction. A gene with a ψ value of 1 in a fraction and ψ values of 0 in the other two would be placed at that fraction’s vertex while a gene with equal ψ values in all 3 fractions would be placed equidistant from each vertex at the intersection of the dotted lines. We found that genes tended to have higher ψ values in the membrane fraction (Fig. 2C, S2E, F), indicating a preferential association of downstream APA isoforms with that fraction.
Because of the apparent bias toward distal APA site use among membrane-associated transcripts, we next focused on comparing the cytosolic and membrane fractions. When comparing the cytosolic and membrane fractions of HepG2 cells, we identified 552 genes that had significantly different ψ values between the fractions (FDR < 0.01). Of these, 492 (89%) had a ψ value that was higher in the membrane fraction than the cytosolic fraction, indicating a broad association between transcripts produced using distal APA sites and the membrane fraction (Fig. 2D). We observed highly similar results when comparing the cytosolic and membrane fractions from K562 cells and D17 cells (Figure S2G, H).
We then queried whether the same genes had differential APA isoform associations with the cytosolic and membrane fractions in the HepG2 and K562 samples. To test this, we calculated Δψvalues (membrane - cytosol) for all genes expressed in both cell lines. We observed a strong correlation (R = 0.73) between Δψ values in the two cell lines (Fig. 2E), suggesting that the effects of APA on transcript membrane association are shared between cell lines and are therefore likely transcript-specific with a conserved mechanistic basis.
The ER comprises a large fraction of cellular membranes, and RNA localization to the ER is important for cotranslational access to the secretory pathway. We therefore asked whether transcripts with significant membrane vs. cytosol Δψ values were more or less likely than expected to encode the peptide-based signal sequences required for RNA transport to the ER through cotranslational targeting. We identified ER signal sequences using SignalP [52]. Interestingly, we found that in both the HepG2 and K562 samples, genes that had significant membrane vs. cytosol Δψ values were significantly less likely to contain an ER signal sequence than other genes (Fig. 2F). This observation therefore suggests two alternative modes of RNA localization to the ER: one for transcripts that encode signal peptides and another for those that do not. Specifically, mRNAs that are not cotranslationally targeted by signal peptide recognition appear to be targeted by a mechanism involving distal APA use.
The transcription speed of RNA polymerase II regulates alternative polyadenylation site choice
The speed of transcription by RNA Polymerase II (Pol II) regulates multiple co-transcriptional processes, including alternative splicing, and termination that is coupled to poly(A) site cleavage [23, 53,54,55,56]. To assess how changes in Pol II speed can affect APA, we used LABRAT to analyze RNAseq samples from HEK293 cells that expressed either wildtype or slow Pol II [55]. The slow Pol II mutant used in these studies is a single amino acid substitution in the funnel domain of the Pol II large subunit Rpb1 (R749H).
During transcription, a gene-proximal APA site is necessarily transcribed before a gene-distal APA site. There exists a time, therefore, during which the proximal site is the only APA site that exists on the transcript. Reducing the speed of Pol II transcription would increase this time in which the proximal site is free from competition with the distal site. We hypothesized that this would lead to an increase in usage of the proximal APA site (Fig. 3A). Indeed, we found that for many genes, proximal APA site usage was increased in slow Pol II samples (Fig. 3B), and that overall there was a shift towards increased usage of the proximal site (Fig. 3C).
If the shift in APA was related to the amount of time during which the proximal site was exclusive, then the shift should be most pronounced in genes in which the distance between proximal and distal sites is large. Consistent with this hypothesis, we found that this “inter-polyA distance” for genes that displayed increased proximal APA was significantly longer than expected (Fig. 3D), further suggesting that changes in Pol II kinetics can predictably alter APA.
We found no correlation between a gene’s inter-polyA distance and its ψ value from the wildtype Pol II sample alone (R = 0.01). However, in this analysis across genes, other factors that influence APA (e.g. the relative strengths of upstream and downstream polyadenylation sites) can dominate. Comparing within genes but across conditions, as done above using Δψ, alleviates this concern.
If alternative polyadenylation of tandem UTRs and ALEs were generally coregulated, then it would be expected that changes in Pol II speed would affect both classes of genes. To test this, we examined the increase in proximal APA site usage caused by slow transcription in the context of tandem UTR and ALE genes separately. We found that proximal APA usage was increased for both tandem UTR and ALE genes (Fig. 3E, F), indicating that the two classes of genes are similarly affected by changes in Pol II speed and consistent with the idea that they are coregulated by a common mechanism.
Dozens of RNA-binding proteins (RBPs) regulate relative APA isoform abundance across many genes
To investigate the contributions that individual RBPs can have to the regulation of APA isoform abundance, we analyzed the ENCODE RBP knockdown RNAseq datasets with LABRAT [57, 58]. This resource contains 523 shRNA-mediated RBP knockdown RNAseq experiments spread across human HepG2 and K562 cell lines. We compared ψ values for all expressed genes between RBP knockdown and control knockdown samples for 191 RBPs that were expressed in both cell lines. To identify genes that had significantly different ψ values (FDR < 0.05) between RBP knockdown and control knockdown samples, we incorporated the cell line of the experiment as a covariate in LABRAT’s linear model.
We began by assessing the reproducibility of changes in APA isoform abundance upon RBP knockdown between the two cell lines. To do this, we correlated Δψ values (control knockdown - RBP knockdown) for all expressed genes in a given RBP knockdown in HepG2 cells with their Δψ values upon knockdown of the same RBP in K562 cells. We therefore end up with one correlation coefficient per RBP knockdown. As a control, we compared these values to correlations of Δψ values where the RBP that was knocked down was different between the cell lines (Fig. 4A). Reassuringly, we found that correlations between experiments in which the expression of the same RBP was knockdown were significantly higher than those in which the expression different RBPs were knocked down (p = 1.5e-19, Wilcoxon ranksum test). When we restricted the comparison to genes that had significantly different ψ values between RBP and control knockdowns (FDR < 0.05), we observed a much higher correlation of Δψ values between cell lines (Fig. 4A). These results gave us confidence that we could accurately quantify APA isoform abundance in the ENCODE datasets.
For each RBP knockdown experiment, we then took the genes with significantly different ψ values between RBP and control knockdowns and analyzed the distribution of their Δψ values (control knockdown - RBP knockdown) (Fig. 4B, Table S1, S2). We observed that many RBP had distributions of Δψ values that were skewed towards being mostly positive or mostly negative. We defined a term, α, as the fraction of these genes with positive Δψ values. RBPs with α values greater than 0.5 therefore were broadly associated with increased distal APA isoform abundance while those with α values less than 0.5 were associated with increased proximal APA isoform abundance. 94 RBPs had α values that were significantly skewed from the expected value of 0.5 (binomial p < 0.01), and of these 52 had α values of greater than 0.5 while 42 had α values less than 0.5 (Fig. 4C).
The effects of specific RBPs on APA isoform abundance have been reported for a handful of RBPs. CPSF6 has been found to promote distal APA isoforms [11, 13], and LABRAT analysis of the ENCODE RBP knockdown data agreed with this finding (α = 0.86, p = 5e-20). Fip1 has been found to promote proximal APA isoforms [11]. While Fip1 was not present in the ENCODE data, a related protein, Fip1l1 was present, and LABRAT analysis also found that it promotes proximal APA isoforms (α = 0.39, p = 0.007). Similarly, CSTF2 has been noted to promote proximal APA isoforms [27]. While CSTF2 was not present in the ENCODE data, its related protein CSTF2T was, and LABRAT analysis also found that it promotes proximal isoforms (α = 0.31, p = 0.0001). These LABRAT analyses reflecting prior literature on specific RBPs gave us increased confidence in the ability of LABRAT to probe RBP-specific effects on APA isoform abundance.
For each RBP knockdown experiment we then calculated α values for tandem UTR and ALE genes separately. α values for these two APA types were highly correlated (R = 0.62), further indicating that these two mechanisms of APA regulation are not independent and share elements in common (Fig. 4D, Figure S3A).
If changes in APA isoform abundance upon RNAi were directly due to loss of the RBP, then we would expect that the RBP would directly bind the 3′ UTRs of the genes whose APA isoform abundance it regulates. To test this, we analyzed RBP/RNA interactions as measured by the eCLIP experiments performed as part of the ENCODE project [59]. We observed that some RBPs displayed highly promiscuous 3′ UTR binding while others bound very few 3′ UTRs (Figures S3B, C).
In HepG2 cells, 84 RBPs had both RNAseq data from RNAi experiments and eCLIP data. For each RBP, we calculated how many of the genes with significant changes in ψ value upon RBP knockdown also contained an eCLIP peak for that RBP in their 3′ UTR. We then calculated whether this overlap of RBP binding and function was statistically significant (binomial p < 0.05). For 21 of these RBPs, we observed a significant overlap between the RBPs functional APA isoform regulatory targets and the 3′ UTRs it bound (Fig. 4E). To assess whether this was more or less than the number of expected significant RBPs, we shuffled the relationships between RBPs and their lists of APA targets and bound 3′ UTRs and again calculated the number of RBPs that showed significant overlap between APA and eCLIP data. Repeating this process 1000 times gave us a null distribution of the expected number of RBPs with significant overlaps and indicated that the observed number of overlaps was significant in HepG2 cells (p = 0.006).
Although we did not observe a similar significant relationship between APA and eCLIP data in K562 cells (p = 0.4) (Figure S3D), overall, these results indicate that many of the RBPs tested are modulating relative APA isoform abundance through direct interactions. We then repeated the analysis, but looked for eCLIP peak overlaps throughout the gene bodies of APA targets (Figure S3E,F). In this analysis, we did not observe that RBPs were preferentially bound to their APA isoform regulatory targets. Taken together with the results of the 3′ UTR-centric analysis, these results indicate that RBPs that regulate relative APA isoform abundance likely do so through binding the 3′ UTRs of their targets.
Misregulation of alternative polyadenylation is cancer type specific and correlates with patient survival
Changes APA have long been known to be associated with cancer [60, 61]. Most often, APA is thought to contribute to cancer phenotypes through a general increased usage of proximal APA sites, which are thought to be associated with increased expression of oncogenes and proliferation of cell lines [18, 20]. To further explore this phenomenon, we used LABRAT and data from The Cancer Genome Atlas (TCGA) [62] to examine changes in APA isoform abundance between matched tumor and normal samples from 671 patients across 21 different cancers.
For each cancer, we identified between 130 and 3043 genes that displayed significant differences in ψ values (FDR < 0.05) between tumor and normal samples. We then defined Δψ values (tumor - normal) to ask whether proximal or distal sites showed increased usage in tumor samples. For some cancers, including Lung Squamous Cell Carcinoma (LUSC), Urothelial Bladder Carcinoma (BLCA) and Lung Adenocarcinoma (LUAD), tumors displayed the expected pattern of increased proximal APA in tumors (Fig. 5A). Conversely, Thyroid Cancer (THCA) and Kidney Renal Clear Cell Carcinoma (KIRC) showed strong biases in the opposite direction with increased distal APA in tumors. Mechanisms that drive APA dysregulation are therefore likely specific to different cancer types, and it is not true that increased proximal APA is a general feature of cancer cells.
We then compared ψ values in the TCGA data for tandem UTR genes and ALE genes separately. For each pair of tumor and normal samples, we calculated the fraction of genes with significantly different ψ values across conditions (FDR < 0.05) in which the ψ value was greater in the tumor sample than the normal sample. Put another way, for each patient, we calculated the fraction of significant tandem UTR and ALE genes with positive Δψ (tumor - normal) values (Fig. 5B). The tandem UTR- and ALE-derived fractions were strongly correlated with each other (R = 0.74), again suggesting that these two modes of APA may be coregulated.
We wondered if APA was misregulated in the same genes across many different cancer types or whether the set of genes with misregulated APA was cancer type specific. Although many APA misregulated genes were specific to certain cancers, we did observe that hundreds of genes repeatedly showed misregulation across multiple cancers (Fig. 5C). We defined a set of genes that repeatedly showed increased proximal APA usage in BLCA, LUAD, and LUSC tumors. Using gene ontology analysis, we found that these genes were significantly enriched for those encoding single-stranded RNA binding proteins [63]. Similarly, we defined a set of genes that repeatedly showed increased distal APA usage in THCA and KIRC. These genes were enriched for being involved in programmed cell death and responses to stress.
We enquired whether transcripts we identified whose APA status correlates with membrane association (Fig. 2C, D) are among those subject to misregulation in tumors. Many of these membrane-associated mRNAs showed significantly different ψ values between tumor and normal samples, suggesting that the subcellular localization of these transcripts may be altered in cancerous cells.
To determine if the degree of APA misregulation was related to patient prognosis, we performed survival analyses for patients from the TCGA dataset. In Fig. 5A, we defined genes with tumor-specific APA misregulation by comparing ψ values in tumor and matched normal patient samples. For each tumor, we then calculated a median ψ value across these genes in thousands of tumor RNAseq samples. Using this median ψ of misregulated genes, we ranked patients and separated them into quartiles. The extreme quartiles (patients with the highest and lowest ψ values for misregulated genes) for each cancer were compared. We found that for head and neck squamous cell carcinoma (HNSC), a cancer that typically exhibits increased proximal APA, patients with lower ψ values in misregulated genes had poorer prognoses (p = 0.0069) compared to patients with higher ψ values for the same genes (Fig. 5D). Conversely, for kidney renal clear cell carcinoma (KIRC), a cancer that typically exhibits increased distal APA, we found the opposite. Patients with lower ψ values in misregulated genes had better outcomes compared to patients with higher ψ values (p < 0.0001) (Fig. 5E). Therefore, the direction of APA misregulation is cancer-specific, and both increased proximal and distal APA are associated with poor patient prognosis, depending on the cancer type.
Usage of distal APA sites is broadly but weakly associated with decreased RNA expression
Some of the original studies on the relationship between APA and RNA expression reported that distal APA is associated with a decrease in RNA levels [20] while more recent genome-wide studies have reported that the relationship is less clear [64, 65]. To comprehensively examine the relationship between APA and gene expression, we compared changes in ψ and changes in RNA levels across the 191 ENCODE RBP knockdown sample pairs and the 671 TCGA tumor/normal sample pairs. To do so, we defined a term, rho (ρ), as the correlation between changes in ψ and changes in gene expression across two samples (Fig. 6A). Sample comparisons where Δψ and gene expression changes are positively correlated indicate that distal APA and increased RNA levels were associated, and these comparisons will have positive ρ values. Conversely, sample comparisons where Δψ and gene expression changes are negatively correlated indicate that distal APA and decreased RNA levels were associated, and these comparisons will have negative 휌 values.
We calculated ρ values across all genes for each RBP knockdown in the ENCODE data. In both the HepG2 and K562 samples, these ρ values overwhelmingly tended to be negative, but weakly so (Fig. 6B). We similarly calculated ρ values across all genes for every patient-derived tumor/normal pair in the TCGA data (Fig. 6C). Again, these ρ values were consistently but weakly negative. These results indicate that although distal APA is generally associated with decreased gene expression, its contribution to changes in RNA levels is modest when comparing all genes in aggregate.
It could be the case, though, that for specific genes, APA and gene expression may be more strongly linked. To explore this, we calculated ρ values for each gene individually across all of the ENCODE and TCGA sample pairs (Fig. 6D, E). The median genes again had weakly negative 휌 values (− 0.12 in the ENCODE data, − 0.20 in the TCGA data). ENCODE- and TCGA-derived ρ values for genes were correlated with each other (Fig. 6F, Table S3). Tandem UTR genes and ALE genes displayed similar distributions of ρ values, indicating that relationships between gene expression and APA are of approximately equal strength in these two APA classes (Figure S4A-D). Relaxing the transcript expression threshold used by LABRAT from 5 TPM to 1 TPM had a minimal effect on these results as the median ρ value remained negative (− 0.10 in the ENCODE data, − 0.14 in the TCGA data).
The tails of the ρ value distributions were long, indicating that there were genes whose changes in ψ value and changes in expression were highly correlated across conditions. We selected three of these, RPLP1, NOLC1, and UBE2G1, for further analysis. Given that each of these genes had strong negative ρ values in both the ENCODE and TCGA data (Fig. 6G), we reasoned that there may be elements in their distal UTRs downstream of the proximal APA site that confer reduced steady-state RNA levels. To test this experimentally, we fused the proximal and distal UTRs of each of these genes to the coding region of Firefly luciferase. Each construct was then site-specifically incorporated into the genome of HeLa cells through Cre-mediated recombination [66]. The Firefly luciferase transcripts were coexpressed from a bidirectional tet-On promoter with unmodified Renilla luciferase. The RNA level of each Firefly-UTR fusion was measured using Taqman qRT-PCR with the Renilla luciferase transcript as a normalizing control. For all three tested genes, fusion of the distal UTR to Firefly luciferase significantly reduced the steady-state level of the RNA relative to a fusion with the proximal UTR, indicating that sequence elements downstream of the proximal APA sites likely have a role in reducing RNA expression (Fig. 6H). We conclude that by comparing changes in gene expression and APA, we can identify functional elements within 3′ UTRs that regulate mRNA abundance.
Features enriched in UTRs associated with gene expression changes
To better understand sequence elements downstream of proximal APA sites that may reduce RNA expression, we used the ρ values calculated for individual genes using ENCODE and TCGA sample sets to assign genes to positively correlated, negatively correlated or not correlated (control) gene sets (Fig. 6I, Figure S4E). These gene sets behave differently: positively correlated genes are more highly expressed when downstream PAS are used (increased ψ) while negatively correlated genes become less expressed as they utilize more downstream PAS.
This analysis was simplified by only considering genes with two APA isoforms such that RNA expression could be explained by proximal or distal UTR usage. The analyzed UTR sequences were unique, meaning that tandem UTRs were separated into proximal and distal UTRs such that distal UTRs lacked their shared 5′ sequence (Fig. 6H). This allowed us to identify sequence characteristics of distal UTRs that explain the differences in RNA expression of the positively correlated and negatively correlated gene sets.
Negatively correlated genes were found to have longer distal UTRs with lower GC content than expected (Fig. 6J, K). This increased 3′ UTR length may make these isoforms more susceptible to NMD, partially explaining their decreased expression [67, 68]. Additionally, negatively correlated genes were generally enriched for AU rich five-mers including the canonical AU rich element (ARE) “AUUUA” (Fig. 6L, Figure S4F). Conversely the distal UTRs of positively correlated genes were depleted for AU-rich five-mers (Figure S4G). Unsurprisingly given their AU-richness, negatively correlated genes were enriched for ARE binding protein motifs in their distal UTRs and contained more AREs as scored by AREScore [69] (Fig. 6M, N). AREs are destabilizing RNA elements bound by several ARE binding proteins that facilitate RNA degradation. The presence of AREs in distal UTRs of negatively correlated genes is consistent with lower RNA expression when downstream PAS are utilized. It is important to note that the distal UTRs of positively correlated genes are depleted for AREs consistent with their higher expression. These results suggest that APA can regulate gene expression through the inclusion of destabilizing AREs in a transcript’s 3′ UTR. Further, given how these results mirror previously observed effects of 3′ UTR AREs [18, 70, 71], they lend further confidence to the ability of LABRAT to accurately quantify relative APA isoform abundance and derive insights regarding its regulation.
Regulatory effects of RBPs on APA isoform abundance inferred from ENCODE data can be observed in TCGA data
The relation between RBP expression and the widespread misregulation of APA in cancer cells is poorly understood. We investigated this problem by examining expression in patient samples of the 191 RBPs that potentially influence APA isoform abundance revealed by our analysis of ENCODE knockdown RNAseq results (Fig. 4B, C). Based on the ENCODE RBP knockdown data, we defined α values for RBPs where values of greater than 0.5 indicated an RBP that promoted distal APA isoform abundance while values of less than 0.5 indicated an RBP that promoted proximal APA isoform abundance. To compare α values to RBP effects on APA isoform abundance observed in the TCGA data, we defined another term, β, as the correlation between the change in RNA expression of an RBP between tumor and matched normal TCGA samples and the median Δψ of genes with significantly different APA between the samples (FDR < 0.05) (Fig. 7A). RBPs with positive β values are therefore associated with increased distal APA isoform abundance in patient samples while those with negative β values are associated with increased proximal APA isoform abundance.
If ENCODE-derived effects of RBPs on APA isoform abundance were recapitulated in the TCGA data, we would expect to see a positive correlation between the α and β values for RBPs. We restricted this comparison to the 94 RBPs that had α values significantly different from the expected value of 0.5 (p < 0.01, binomial test). For these RBPs, α and β values were positively correlated (R = 0.23, p = 0.03). RBPs with α values greater than 0.5 had significantly higher β values than those with α values less than 0.5 (Fig. 7B). Further, when we correlated α and β values across all RBPs for all sample pairs within a cancer type, we observed positive correlations in all 12 cancers tested (Fig. 7C). These results further suggest that dozens of RBPs have the ability to regulate relative APA isoform abundance of many genes in a coordinated, directional manner and that the misregulation of APA seen in many cancers may be due to altered expression of specific RBPs.