In a degraded chronic myelogenous leukemia (CML) cell line the BCR-ABL fusion was not detected while the reciprocal ABL-BCR fusion was found
KU812 is a myeloid cell line established from a patient with chronic myelogenous leukemia (CML) [21]. One of the hallmarks of CML is the t (9;22) (q34;q11.2) [24] translocation that results in a BCR-ABL fusion [25] which is a driver of the disease. In an attempt to characterize the effect of RNA degradation on the detection of fusions by RNA-seq we artificially degraded RNA isolated from the KU812 cell line to various RIN values (10, 7, 5, and 3) and performed RNA-seq.
Using the intact RNA (RIN 10) from the KU812 cell line we identified a BCR-ABL fusion with 1.82 supporting events per million reads (around 27 supporting reads in a sample with 15 million reads) and its reciprocal ABL-BCR fusion with 0.63 supporting events per million reads (around 9 supporting reads in sample with 15 million reads) (Fig. 1a). In the degraded samples (RIN values 7, 5 and 3) we detected the ABL-BCR fusion but did not detect the BCR-ABL fusion (Fig. 1a). We noticed that the two fusion products have substantially different distances from the fusion breakpoint to the 3′ end of the transcript (approximately 1.5 kb for ABL-BCR and 5.3 kb for BCR-ABL) (Fig. 1b). We speculated that this might explain the differences in fusion product supporting reads identified for the ABL-BCR and BCR-ABL fusion at different RIN values.
A plot of the number of reads starting from the 3′end up to the BCR-ABL breakpoint (Fig. 1c, Additional file 1: Figure S1a) and ABL-BCR fusion breakpoint (Fig. 1d, Additional file 1: Figure S1b) reveals that the level of coverage is relatively constant for the sample with a RIN of 10. However, for both fusions RNA degradation caused an increase in read counts at the 3′ end of the transcript, with a subsequent non-linear decrease in read counts as a function of the 3′ distance. The number of reads close to the ABL-BCR breakpoint is similar across all levels of degradation (Fig. 1d, Additional file 1: Figure S1b) resulting in the detection of the ABL-BCR fusion at all four RIN values (Fig. 1a). In contrast, the number of reads close to the BCR-ABL breakpoint drops dramatically and results in the inability to detect such a fusion in the degraded samples (Fig. 1c, Additional file 1: Figure S1a).
Median coverage decreases exponentially as a function of the distance from the 3′ end
To more accurately characterize the effect RNA degradation has on read coverage of the transcripts we chemically degraded Universal Human Reference RNA (UHR) to six RIN values (8.6, 8.4, 7.6, 5.9, 4.9 and 3.9) and performed RNA-Seq analysis. We measured the read coverage for all expressed genes as a function of transcript 3′ end distance (Fig. 2a, Additional file 2: Figure S2a) and observed that read coverage decreased as a function of both 3′ distance and sample RIN value. Furthermore, there was an exponential relationship between the median coverage and the distance from the 3′ end (Fig. 2a, Additional file 3: Table S2). The rate of exponential decay increased for highly degraded samples. We obtained similar results (Additional file 2: Figure S2b, Additional file 3: Table S2) in a replicate study using a public dataset [11] that had degraded and sequenced RNA from a U251 MG brain glioblastoma cell line using similar methods.
For a degraded sample we can increase the sequencing depth to obtain the coverage levels of an intact sample. We estimated this increase in sequencing depth as a function of the 3′ distance for the UHR sample at different levels of degradation (Additional file 2: Figure S2c). Given that the read coverage decreases exponentially as a function of the 3′ distance we have that even for a mildly degraded sample (RIN = 8.4) we need around 1.8x more reads to achieve adequate coverage at a distance of 2.5 kb from the 3′ end and around 4.9x more sequencing depth to achieve the same level of coverage at 5 kb from the 3′ end. For a highly degraded sample (RIN = 3.9) we need 9.4x and 28.3x more reads at a distance of 2.5 kb and 5 kb from the 3′ end, respectively. This fact makes the strategy of adding sequencing depth prohibitively expensive in practice.
Coverage and decay profile of fusion related genes
We defined the median coverage decay rate per kilobase (decay rate for short) as the rate of decay of the median read coverage as a function of the distance from the 3′ end of the mRNA multiplied by 1 kb. The median read decay rates for the UHR and U251 samples at different RIN values are shown in Fig. 2b and similar median read decay rates were observed for independent samples with similar RIN values. This relationship can be modeled as Median coverage decay rate/kb = − 1.45 + 0.13 RIN with an R
2 = 0.92. As expected, samples with lower RIN values had a larger negative decay rates and samples with high RIN values had a decay rate close to 0. It should be noted that there were samples that deviated from the model, for example our UHR sample with a RIN of 8.6 has a decay rate close to 0 even though it would have been expected to have a decay rate of approximately −0.2.
To characterize any bias associated with genes known to be part of cancer-related fusions, we queried all fusion partner genes in the COSMIC database [26]. Based on this gene set, the median distance of fusion breakpoints from the 3′ end is 2.7 kb, with approximately 80 % of the breakpoints occurring within 5 kb of the 3′ end and 95 % occurring within 7 kb of the 3′end (Fig. 2c). We then expanded this analysis to include fusion partners reported in the Atlas of Genetics and Cytogenetics in Oncology and Hematology [27] and scientific literature for potentially clinically significant gene fusions in solid and hematologic tumors as well as oncogenes not currently known to be involved in gene fusions that have the potential to be activated through gene fusions. This curated list of 545 genes (further referred to as 545 gene set) (Additional file 3: Table S3) has a median size of 3.8 kb (Additional file 4: Figure S3a), which is higher than the median size of 2.5 kb for all genes reported in UCSC (Additional file 4: Figure S3a). Evaluating the 545 gene set in the UHR sample data, we show the median coverage follows a similar exponential decay dependent on the distance from the 3′ end, but with a 12 % greater decay rate than that computed for all expressed genes (Additional file 4: Figure S3b).
To further validate our findings we isolated RNA from 20 samples, which included 14 tumor specimens (7 cell lines and 7 snap frozen tumor specimens) and 6 snap frozen normal tissue specimens (Additional file 3: Table S4). RNA from the 7 cell line samples had high RIN values (average = 9.9, minimum = 9.7, maximum = 10), while RNA from the 13 tumor and normal fresh tissue samples had lower RIN values (average 8.0, minimum = 3.4, maximum = 9.3). No intentional degradation was performed on these samples. We calculated the decay rate for the 545 gene set on these 20 samples (Additional file 3: Table S4, Fig. 2d). Reflective of our previous findings, there was a direct relationship between sample RIN and the decay rate modelled as Median coverage decay rate/kb = − 1.51 + 0.15 RIN with an. R
2 = 0.66. It should be noted that samples with similar RIN values did not always have the same decay rate and that individual samples deviated from the trend line. These differences were more pronounced for samples with lower RIN values. For example, the two most degraded samples (Additional file 3: Table S4) had very different RIN values (3.4 and 6.4) but similar decay rates (−0.74 and −0.76).
The sensitivity of fusion detection depends on fusion breakpoint distance from the 3′ end
Using the 545 gene set, the probability of detecting a gene fusion (i.e. sensitivity on y-axis of plot) whose breakpoint occurs at a specific distance from the 3′ was determined by calculating the fraction of expressed genes whose coverage was ≥10x at that distance from the 3′ end of the gene (Fig. 3a). This calculation assumes at least 10 total reads are required to detect a heterozygous fusion product at our lower detection limit, which requires at least 5 supporting reads. Using the 545 gene set we plotted the estimated fusion detection sensitivity (Fig. 3b). The sensitivity decreased with the distance from the 3′ end, with greater reductions in sensitivity for the more heavily degraded UHR samples. However, the sensitivity remained high (>85 %) for all samples for fusion breakpoints less than 1 kb from the 3′ end regardless of RIN value.
The UHR is RNA isolated from 10 tumor cell lines [22], some which contain well characterized fusions (e.g. BCR-ABL fusion in the CML tumor cell line). We plotted the estimated sensitivity per RIN score for five known fusions in UHR with breakpoints having varied distances from the 3′ end of the gene (Fig. 3c); also computing whether that particular fusion had more than 5 supporting reads (Fig. 3c, Additional file 3: Table S5). As expected, the fusion detection for each of these 5 fusions is consistent with the estimated sensitivity, in that fusions with breakpoints further away from the 3′ end (e.g. BCR-ABL at 5.3 kb or ARHGEF2-SULF2 at 3.1 kb) are not detected as the level of degradation increases. At the same time, fusions with breakpoint distance less than 1 kb to 3′end (e.g. FOXA1-TTC6 or BCAS3-BCAS4) were detected regardless of RIN value (Fig. 3c). Interestingly, the number of supporting reads for these fusions increased as the RIN value decreased down to 5.9 (Additional file 3: Table S5), which is consistent with a majority of the reads coming from the 3′ end. No false positives were found at different levels of degradation.
We tested our approach on our previously described set of 20 specimens. Out of the 14 tumor samples from such set, 12 had known fusions. Using RNA-seq we were able to detect the known fusions from the 12 cases. We calculated the estimated sensitivity for the distance from the 3′ end of the breakpoint of each fusion in each sample (Additional file 3: Table S6). The fusions that were detected had high values of estimated sensitivity at its corresponding distance from the 3′ end (average = 92 %, minimum = 75 %, maximum = 96 %).