RNA-Seq data
We used a copy of the TCGA RNA-Seq data hosted by the Institute for Systems Biology-Cancer Genomics Cloud (ISB-CGC) pilot in Google Cloud Storage, part of the Google Cloud Platform (GCP), which mirrors the repository hosted at the NCI Genomic Data Commons (GDC, https://gdc.cancer.gov/). In total, 10,668 samples were analyzed, each identified by a unique analysis ID. Sample types were generalized as ‘normal’ (solid tissue normal) and ‘tumor’ (which includes primary solid tumor, metastatic, recurrent solid tumor, and additional - new primary). A more detailed description of the data collection protocols is provided in Additional file 4: Supplementary Methods.
Design of the targeted CS prediction pipeline
RNA-Seq reads were first filtered against the candidate genes using the biobloomcategorizer utility from BioBloomTools (BBT) [35]. The categorized reads were then assembled into contigs with Trans-ABySS [36], and the contigs were in turn aligned to the reference human genome with GMAP [37]. The raw reads were aligned both to the assembled contigs with BWA [38] and to the reference genome with GSNAP [39]. The contig-to-genome and read-to-contig alignments were used to predict CSs with KLEAT [21], and the read-to-genome alignments were used both for expression level quantification and for assessing the KLEAT predictions.
Implementation of the pipeline
The pipeline was implemented in Python with the Ruffus framework [40]. The software used includes SAMtools-0.1.18 [41], BioBloomTools-2.0.12 [35], Trans-ABySS-1.5.2 [36], ABySS-1.5.2 [42], GMAP-2014-12-28 [37, 39], and BWA-0.7.12 [38]. The source code also includes a copy of the specific version of KLEAT.py used in this study. For use on the GCP, a Docker image of the pipeline can be built from the Dockerfile included in the source code.
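To illustrate how the stages chain together under Ruffus, a minimal sketch is given below. The file names, helper function, and command-line flags are hypothetical placeholders rather than the exact invocations used in the study; only the stage order reflects the pipeline described above.

```python
import subprocess
from ruffus import transform, suffix, pipeline_run

def run(cmd):
    """Run a shell command, raising on non-zero exit."""
    subprocess.check_call(cmd, shell=True)

# NOTE: the command lines below are illustrative placeholders, not the
# exact flags used in the study.
@transform(["sample.fq"], suffix(".fq"), ".categorized.fq")
def filter_reads(infile, outfile):
    # BioBloomTools: keep only reads matching the candidate-gene filter
    run(f"biobloomcategorizer -f candidate_genes.bf {infile} > {outfile}")

@transform(filter_reads, suffix(".categorized.fq"), ".contigs.fa")
def assemble(infile, outfile):
    # Trans-ABySS: de novo assembly of the categorized reads
    run(f"transabyss --se {infile} --outdir asm && cp asm/final.fa {outfile}")

@transform(assemble, suffix(".contigs.fa"), ".c2g.out")
def align_contigs(infile, outfile):
    # GMAP: contig-to-genome alignment
    run(f"gmap -d hg19 {infile} > {outfile}")

if __name__ == "__main__":
    pipeline_run([align_contigs])
```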
Execution of the pipeline
We executed the pipeline on the ISB-CGC, powered by the GCP. For each RNA-Seq sample, we used a virtual machine (VM) instance with four vCPUs, 20 GB of memory, and a sample-size-dependent amount of persistent disk. For each instance, we requested a disk large enough to store the input data together with intermediate and final results, calculated as 30 × Size(sample) + 50 GB. The scaling factor of 30 was based on experience from pilot runs, and the extra 50 GB was reserved for reference data. The Google Genomics Pipelines API (https://cloud.google.com/genomics/reference/rest/v1alpha2/pipelines) was used to orchestrate all VM instance tasks, including VM creation, deletion, and data transfer, which substantially reduced the administrative workload.
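As a worked example, the provisioning rule reduces to a one-line calculation (the constants 30 and 50 GB are those given above):

```python
def disk_size_gb(sample_size_gb):
    """Persistent disk to request for one sample: 30x the sample size for
    inputs, intermediates, and results, plus 50 GB reserved for reference
    data."""
    return 30 * sample_size_gb + 50

# e.g. a 10 GB RNA-Seq sample would be given a 350 GB persistent disk
assert disk_size_gb(10) == 350
```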
The reference data included an hg19 reference genome [43], the GMAP/GSNAP [37, 39] index of hg19, a prebuilt BioBloom filter [35] of all the candidate genes’ transcripts, and a specific version of the gene annotation used by KLEAT [21] (more details on annotation are available in Additional file 4: Supplementary Methods).
The BioBloom filter was built with the biobloommaker utility from BBT [35]. As input to biobloommaker, we used all transcripts of all the candidate genes from the Ensembl annotation [25]. The annotated sequences were augmented with 300 bp of flanking sequence on both ends of each transcript to capture RNA-Seq reads that aligned only partially to them.
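The flank augmentation amounts to widening each transcript interval before its sequence is extracted. A minimal sketch, assuming transcripts are represented as 0-based, half-open genomic intervals:

```python
def add_flanks(start, end, chrom_len, flank=300):
    """Widen a transcript interval (0-based, half-open) by `flank` bp on
    both ends, clamped to the chromosome boundaries."""
    return max(0, start - flank), min(chrom_len, end + flank)

# e.g. a transcript at [500, 2000) on a 100 kb chromosome
print(add_flanks(500, 2000, 100_000))  # (200, 2300)
```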
During the de novo assembly of transcripts for each sample, three k-mer sizes were used, depending on the corresponding read length: {22, 32, 42}, {32, 52, 72}, and {32, 62, 92} were used for samples with read lengths of 45–50, 75–76, and 100 bp, respectively.
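The read-length-to-k-mer mapping can be written down directly from the values above; the error handling for unobserved read lengths is our own addition:

```python
def kmer_sizes(read_length):
    """Return the three assembly k-mer sizes for a sample's read length."""
    if 45 <= read_length <= 50:
        return [22, 32, 42]
    if read_length in (75, 76):
        return [32, 52, 72]
    if read_length == 100:
        return [32, 62, 92]
    raise ValueError(f"unexpected read length: {read_length}")
```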
Annotation pre-processing
The Ensembl annotation was downloaded from http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz and pre-processed before use. First, we extracted the annotated CSs of all protein-coding and NMD transcripts that were CDS 3′ complete (i.e., without the cds_end_NF tag, https://www.gencodegenes.org/gencode_tags.html) for all candidate genes. To calculate 3′ UTR lengths, we also extracted the mapping between annotated CSs and stop codons from the transcripts. A more detailed description of the extraction process can be found in Additional file 4: Supplementary Methods. After extraction, since a predicted CS associated with multiple transcripts lacks transcript-level resolution, we discarded transcript-level information from the annotation and removed redundant mapping relationships caused by multiple transcripts sharing the same CS and stop codon. Lastly, we clustered the annotated CSs as described in CS clustering below.
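The de-redundancy step can be illustrated with a small pandas sketch; the column names are hypothetical, and dropping transcript_id collapses transcripts that share the same CS and stop codon into a single gene-level mapping row:

```python
import pandas as pd

# Hypothetical columns extracted from the GTF; T1 and T2 share a CS and
# stop codon, so they collapse into one mapping row.
annot = pd.DataFrame({
    "gene_id":       ["G1", "G1", "G1"],
    "transcript_id": ["T1", "T2", "T3"],
    "cleavage_site": [1000, 1000, 2000],
    "stop_codon":    [900, 900, 1800],
})
mappings = (annot.drop("transcript_id", axis=1)
                 .drop_duplicates()
                 .reset_index(drop=True))
print(mappings)  # two rows remain: (G1, 1000, 900) and (G1, 2000, 1800)
```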
CS prediction and post-processing
The CSs were predicted by KLEAT with all parameters set to default. We post-processed the KLEAT results before any CS usage frequency analysis (Additional file 1: Figure S3A). Specifically, we parsed the 10,668 KLEAT output files (one per sample), using the information from the following fields: gene, transcript_strand, chromosome, cleavage_site, length_of_tail_in_contig, number_of_bridge_reads, and max_bridge_read_tail_length. In total, 67,544,140 CSs were predicted across the 10,668 samples. First, we filtered out off-target CSs by retaining only those associated with the candidate genes, which kept 17% of the predictions. After this initial gene-based filtering, we reassigned each remaining CS to the closest clustered annotated CS (see Annotation pre-processing) and calculated the signed distance between them. We also located PAS hexamer motifs, if present, by searching a window of up to 50 bp upstream of each predicted CS; when multiple PAS hexamers existed in the window, the strongest one was picked [26]. Next, we applied another filter to select the most confident predictions. Specifically, a predicted CS had to meet at least one of the following two criteria to be retained (a minimal sketch of this filter follows the list):
1) Its distance to the closest annotated CS was 25 bp or less. The 25-bp threshold was chosen by plotting the distribution of distances and taking a threshold at the plateau. This criterion was designed to select CSs that had already been annotated.
2) One of the two strongest PAS hexamers, AATAAA or ATTAAA [27], was present within the 50 bp window, and at least one of the following polyadenylation evidence conditions was satisfied: length_of_tail_in_contig ≥ 4, number_of_bridge_reads ≥ 2, or max_bridge_read_tail_length ≥ 4. This second criterion is empirical and independent of annotation, and it was designed mainly to select potential novel CSs.
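A minimal sketch of this retention filter is shown below. The threshold values come from the criteria above; dist_to_annot (the signed distance to the closest clustered annotated CS) and pas_hexamer (the strongest hexamer found in the 50 bp upstream window, or None if absent) are assumed to be pre-computed per prediction:

```python
STRONG_PAS = {"AATAAA", "ATTAAA"}

def keep_cs(dist_to_annot, pas_hexamer,
            tail_len, n_bridge, max_bridge_tail):
    """Return True if a predicted CS passes at least one retention criterion."""
    # Criterion 1: within 25 bp of the closest clustered annotated CS
    if dist_to_annot is not None and abs(dist_to_annot) <= 25:
        return True
    # Criterion 2: a strong PAS hexamer in the 50 bp upstream window,
    # plus at least one line of polyadenylation evidence
    if pas_hexamer in STRONG_PAS and (
            tail_len >= 4 or n_bridge >= 2 or max_bridge_tail >= 4):
        return True
    return False
```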
We verified that AATAAA and ATTAAA were the two most frequent PAS hexamers associated with the predicted CSs both before and after the second filtering step (Additional file 1: Figure S3C). After the two filtering steps, about 5% of the CSs were retained and clustered as described in the CS clustering section. The CSs filtered out by this process were considered insufficiently robust and were omitted from further analysis to reduce false positives. We also confirmed that there was no gene overlap among the 114 genes investigated here. The post-processing steps resulted in 2136 unique predicted CSs in the 114 candidate genes across all samples.
CS clustering
We used the single-linkage hierarchical clustering algorithm to combine CSs that were ≤ 20 bp apart, iterating when necessary for clusters to converge. After clustering, we selected the mode of the CS coordinates within each cluster as its representative location; if multiple modes existed, the median of the modes was used. Every CS was then associated with one of the representative CSs, and multiple CSs associated with the same representative CS were merged within each sample. The clustering method was independently applied to both annotated and predicted CSs.
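A minimal sketch of this procedure, using the SciPy hierarchical clustering cited in Python libraries used (the coordinates below are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Single-linkage clustering of 1-D CS coordinates with a 20 bp cutoff
coords = np.array([100, 105, 118, 500, 515])
Z = linkage(coords.reshape(-1, 1), method="single")
labels = fcluster(Z, t=20, criterion="distance")

def representative(cluster_coords):
    """Mode of the cluster's coordinates; median of the modes on ties."""
    vals, counts = np.unique(cluster_coords, return_counts=True)
    modes = vals[counts == counts.max()]
    return int(np.median(modes))

reps = {lab: representative(coords[labels == lab]) for lab in set(labels)}
```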
The clustering process inevitably decreases the prediction resolution, so our analysis cannot distinguish CSs that are closer than the clustering cutoff (20 bp). However, we verified that the clustering results were insensitive to different cutoff values, even though the number of clusters could vary.
CS usage frequency calculation
For a given CS in each gene, cancer type, and sample type (normal/tumor), its frequency is calculated as the fraction of samples that were predicted to use it:

f = s / g,

where s is the number of samples predicted to use the CS, and g is the total number of samples with sufficient expression of the gene for that cancer type and sample type. For each sample, the gene expression level was considered sufficient if at least one CS was predicted within the gene; otherwise, the expression was considered insufficient, and the sample was excluded from the frequency calculation.
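In tabular form, the calculation reduces to two group-wise sample counts followed by a merge; a hypothetical pandas sketch, with one input row per (sample, gene, predicted CS):

```python
import pandas as pd

# Hypothetical minimal input: one row per (sample, gene, predicted CS)
preds = pd.DataFrame({
    "cancer_type": ["BRCA"] * 4,
    "sample_type": ["tumor", "tumor", "tumor", "normal"],
    "sample_id":   ["s1", "s2", "s2", "s3"],
    "gene":        ["G1"] * 4,
    "cs":          [1000, 1000, 2000, 1000],
})

group = ["cancer_type", "sample_type", "gene"]
# g: samples with at least one predicted CS in the gene (sufficient expression)
g = preds.groupby(group)["sample_id"].nunique().rename("g")
# s: samples predicted to use each particular CS
s = preds.groupby(group + ["cs"])["sample_id"].nunique().rename("s")
freq = s.reset_index().merge(g.reset_index(), on=group)
freq["frequency"] = freq["s"] / freq["g"]  # f = s / g per CS
```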
Comparison of cleavage patterns between normal and tumor samples
For every predicted CS of every gene in each cancer type, we calculated its frequencies in both normal and tumor samples, and evaluated the significance of the difference with Fisher’s exact test. The input to the test was the 2 × 2 table of the numbers of normal and tumor samples with and without the CS predicted. The frequencies of the multiple CSs within one gene collectively formed a cleavage pattern for that gene, and to report a difference in patterns between normal and tumor, we required the co-occurrence of at least one significant increase (p < 0.01) and one significant decrease in the frequencies of two distinct CSs.
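For a single CS, the test takes the 2 × 2 contingency table directly; a sketch with hypothetical counts, using scipy.stats:

```python
from scipy.stats import fisher_exact

# 2 x 2 table for one CS in one cancer type (hypothetical counts):
# rows are tumor/normal; columns are CS predicted / CS not predicted.
table = [[30, 70],   # tumor:  30 of 100 samples were predicted to use the CS
         [55, 45]]   # normal: 55 of 100 samples were predicted to use the CS
odds_ratio, p_value = fisher_exact(table)
print(p_value)  # small p indicates a significant frequency change
```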
To estimate the false discovery rate (FDR), we obtained an upper bound for the p-value at the gene-cancer pair level by multiplying the lowest p-values of its corresponding significant (p < 0.01) increase and decrease in CS frequency; pair-level p-values were therefore less than 0.0001 (0.01 × 0.01). Gene-cancer type pairs that were not reported were assigned an arbitrary p-value of 1. In total, 1596 hypothesis tests (114 genes × 14 cancer types) were conducted, and applying the Benjamini-Hochberg procedure [44], we obtained an FDR < 0.002. Note that our FDR calculation is conservative, since we considered only two CSs when estimating the pair-level p-values, while 40% (31/77) of the reported APA events had three or more CSs undergoing significant frequency changes.
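The FDR bound can be reproduced with a small Benjamini-Hochberg calculation. The sketch below implements the standard step-up adjustment directly rather than calling a library, and uses the counts reported above (77 reported pairs, 1596 tests, pair-level p bounded by 0.0001):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    adj = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]  # enforce monotonicity
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

# 77 reported gene-cancer pairs with pair-level p bounded by 1e-4;
# the remaining pairs are assigned p = 1, as in the text.
pvals = np.concatenate([np.full(77, 1e-4), np.ones(1596 - 77)])
print(bh_adjust(pvals)[:77].max())  # ~0.002, consistent with the FDR bound
```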
Resolution of the 3′ UTR length change trends
We first mapped each predicted CS to the closest annotated one. If it was > 25 bp away, the predicted CS was considered potentially novel and was excluded from length trend resolution because of the uncertainty of its corresponding stop codon. The 25-bp cutoff was selected, after trying a range of values, as the point at which the number of unmapped CSs reached a plateau.
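A minimal sketch of this nearest-CS mapping, assuming non-empty, same-chromosome annotated CS coordinates (predicted CSs farther than 25 bp from any annotated CS are returned as None, i.e., potentially novel):

```python
import numpy as np

def map_to_annotated(predicted, annotated, max_dist=25):
    """Map each predicted CS coordinate to the closest annotated CS;
    return None for CSs farther than max_dist (potentially novel)."""
    annotated = np.sort(np.asarray(annotated))
    idx = np.searchsorted(annotated, predicted)
    mapped = []
    for p, i in zip(predicted, idx):
        # the closest annotated CS is one of the two flanking coordinates
        candidates = annotated[max(0, i - 1):i + 1]
        nearest = int(candidates[np.argmin(np.abs(candidates - p))])
        mapped.append(nearest if abs(nearest - p) <= max_dist else None)
    return mapped

print(map_to_annotated([1010, 1500], [1000, 2000]))  # [1000, None]
```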
After mapping, we determined the associated stop codons for each CS, also based on the annotation. We did not assume that a CS could be associated with all upstream stop codons, in accordance with the transcript annotations, which do not support an all-to-all relationship (arcs in Figs. 2, 3 and Additional file 3: Figure S4). If a CS was associated with only a single stop codon, its corresponding 3′ UTR length was unambiguously calculated and used for trend resolution. All CSs mapped to multiple stop codons were ignored. A detailed description of the trend resolution approach is provided in Additional file 4: Supplementary Methods.
Python libraries used
In addition to the aforementioned Ruffus framework [40], we used several other Python libraries for scientific computing [45] to facilitate our analysis. The hierarchical clustering algorithm implemented in SciPy-0.18.1 [46, 47] was used for CS clustering. Pandas-0.19.0 [48] was used for tabular data transformation and analysis. Matplotlib-1.5.3 [49] was used for plotting. Jupyter-1.0.0 notebook [50] was used for tracking analysis steps and results.