- Methodology article
- Open Access
Open-access synthetic spike-in mRNA-seq data for cancer gene fusions
https://doi.org/10.1186/1471-2164-15-824
© Tembe et al.; licensee BioMed Central Ltd. 2014
- Received: 21 April 2014
- Accepted: 24 September 2014
- Published: 30 September 2014
Abstract
Background
Oncogenic fusion genes underlie the mechanism of several common cancers. Next-generation sequencing based RNA-seq analyses have revealed an increasing number of recurrent fusions in a variety of cancers. However, the absence of publicly available, gene-fusion focused RNA-seq data impedes comparative assessment and collaborative development of novel gene fusion detection algorithms. We have generated nine synthetic poly-adenylated RNA transcripts that correspond to previously reported oncogenic gene fusions. These synthetic RNAs were spiked at known molarity over a wide range into total RNA prior to construction of next-generation sequencing mRNA libraries to generate RNA-seq data.
Results
Leveraging a priori knowledge about replicates and molarity of each synthetic fusion transcript, we demonstrate utility of this dataset to compare multiple gene fusion algorithms’ detection ability. In general, more fusions are detected at higher molarity, indicating that our constructs performed as expected. However, systematic detection differences are observed based on molarity or algorithm-specific characteristics. Fusion-sequence specific detection differences indicate that for applications where specific sequences are being investigated, additional constructs may be added to provide quantitative data that is specific for the sequence of interest.
Conclusions
To our knowledge, this is the first publicly available synthetic RNA-seq data that specifically leverages known cancer gene-fusions. The proposed method of designing multiple gene-fusion constructs over a wide range of molarity allows granular performance analyses of multiple fusion-detection algorithms. The community can leverage and augment this publicly available data to further collaborative development of analytical tools and performance assessment frameworks for gene fusions from next-generation sequencing data.
Keywords
- RNA-seq
- Gene fusions
- Cancer genomics
Background
Oncogenic fusion genes underlie the mechanism of several common cancers and also constitute or encode important diagnostic and therapeutic targets. Fusions may drive oncogenic growth by joining a proliferation-inducing gene to an active promoter, by disrupting the function of tumor suppressor genes, or by creating novel functional products that rewire the biochemical pathways that regulate cellular division [1]. Research has led to identification of drugs that are currently used to target fusions in different malignancies. Examples include imatinib, tretinoin, and crizotinib, which target the BCR-ABL, PML-RAR, and EML4-ALK fusion products associated with chronic myelogenous leukemia [2, 3], acute promyelocytic leukemia [4–6], and non-small cell lung carcinoma [7–9], respectively. These established associations and clinical applications underscore the need to comprehensively and accurately detect fusions in cancer samples.
Next-generation sequencing technologies, particularly RNA sequencing (RNA-seq), have revealed an increasing number of recurrent fusions in a variety of cancers, and it is likely that their detection will have growing diagnostic and prognostic utility. As such, validating the laboratory and analysis methods to establish analytical parameters, including the limit of detection, linearity, sensitivity, and specificity of fusion detection in tumor RNA specimens, is critical for adoption in clinical research settings. For example, does a fusion transcript present at higher molarity (higher transcript abundance) yield a higher number of fusion-supporting sequencing reads? Do detection algorithms differ in efficacy for specific fusion sequences, independent of abundance? Answering such questions and establishing robust metrics is difficult due to the lack of publicly available RNA-seq data specifically generated to capture gene fusions.
Summary of nine synthetic fusion gene transcripts, excluding the poly-A tail.
Methods
Generation of synthetic gene fusion RNA (SGFR) constructs
Vector design: the gene sequence was synthesized by IDT and inserted into a pUCIDT vector.
RNA sequencing
RNA aliquots were washed in 70% ice-cold ethanol, resuspended in 50 μL TE buffer (10 mM Tris–HCl pH 8.0, 1 mM EDTA), then quantitated using UV absorption. 2.2 ng of each RNA spike were pooled in a PCR plate, and the volume was brought up to 50 μL with RNase-free water. A cDNA library was prepared using the TruSeq Stranded mRNA LT Sample Prep Kit (Illumina®, cat# RS-122-2101) and sequenced on an Illumina MiSeq to confirm the sequences of the mRNA transcripts as a final QC step. Fresh aliquots of RNA were taken from storage, washed with 70% ice-cold ethanol, resuspended in 1 × TE, and quantitated using RiboGreen (Invitrogen). RNA spikes were mixed together to create a high concentration pool with 40 nM of each spike. This pool was diluted and titrated into 1 μg aliquots of COLO-829 total RNA (ATCC 1974). cDNA libraries were prepared using the TruSeq Stranded mRNA LT Sample Prep Kit (Illumina®, cat# RS-122-2101) following the manufacturer’s protocol. The resulting libraries were sequenced on the Illumina HiSeq2500 in Rapid Run mode using paired-end reads with 101 cycles in each read.
In summary, equimolar amounts of all nine SGFRs were pooled together and this pool was titrated into total RNA from the melanoma cell line COLO-829 [10] at ten different abundances. Each SGFR abundance pool was prepared in duplicate. Libraries were prepared for sequencing using the Illumina TruSeq Stranded mRNA LT Sample Preparation Kit and sequenced on an Illumina HiSeq 2500 (2 × 101 cycles).
Bioinformatics
Command-line parameters used for running the three fusion detection tools. Reference genome was GRCh37. In each case, custom scripts were developed internally to extract statistics about fusion-supporting reads.
Results and discussion
Analytically, gene fusions are typically detected from RNA-seq data by: 1) Aligning reads to a reference genome or transcriptome assembly; 2) Identifying discordant read pairs, i.e., pairs for which genomic distance between the two ends’ alignments is significantly different from the expected genomic distance based on library preparation; 3) Extracting split sections of the same read that align to different regions of the genome, thereby, indicating a potential fusion; 4) Algorithm-specific additional steps, such as contig construction, sequence homology search, guided analyses based on exon junction annotation files, etc.
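Steps 2 and 3 above can be illustrated with a minimal sketch. It is not taken from any of the tools discussed here: the `Alignment` record, the function names, and the `max_span` threshold of 500,000 bases are all hypothetical choices made for illustration. The sketch flags a read pair as discordant when the mates align to different chromosomes or implausibly far apart, and flags a read as split when consecutive segments of the same read align to different regions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Alignment:
    """One aligned segment (hypothetical record, not a tool's data model)."""
    chrom: str
    start: int  # 0-based leftmost coordinate
    end: int    # exclusive end coordinate


def is_discordant(a: Alignment, b: Alignment, max_span: int = 500_000) -> bool:
    """True if two alignments lie on different chromosomes or span more
    than max_span bases. max_span is an assumed, illustrative threshold;
    it is large because RNA-seq mates can legitimately straddle introns."""
    if a.chrom != b.chrom:
        return True
    span = max(a.end, b.end) - min(a.start, b.start)
    return span > max_span


def is_split(segments: Sequence[Alignment], max_span: int = 500_000) -> bool:
    """True if any two consecutive segments of one read align to
    different regions, which suggests a potential fusion junction."""
    return any(
        is_discordant(left, right, max_span)
        for left, right in zip(segments, segments[1:])
    )
```

Real tools apply further evidence and filtering (step 4), so a flag from this sketch marks only a candidate, not a reported fusion.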
We emphasize here that our focus is to demonstrate the utility of the SGFR constructs for evaluating assay performance and to make them available to the clinical and research communities to further active research in gene-fusion detection methods. To that end, the choice of three representative algorithms, TopHat-Fusion (THF), ChimeraScan (CHS), and SnowShoes-FTD (SSH), and of the analysis framework is based on our experience in analyzing such data. Since the emphasis is on making RNA-seq gene fusion data publicly available, we do not attempt to provide a detailed comparative assessment, a list of pros and cons, or a performance characterization of the growing number of gene fusion detection tools discussed elsewhere [14, 15]. However, to highlight the differences in the underlying analytical methods of these three fusion-detection tools, we briefly describe each approach and direct readers to the bibliography [11–13] for complete details. THF builds on TopHat: it uses Bowtie [16] to align the two ends of each read pair independently, without any annotation, and then maps segments of the unaligned reads; both sets of alignments are used together to identify candidate fusion junctions. Next, a spliced fusion contig index is created and read segments are remapped using BLAST (in the TophatFusionPost step), followed by stitching all segments together into full read alignments that are further filtered on criteria such as the number of fusion-supporting reads. SSH uses 50-bp reads that are aligned by BWA [17], guided by a customized exon annotation file, to identify potential fusions as well as unmapped reads. In our SSH analysis, we retained the first 50 bases of each read from the FASTQ files, and the SnowShoes-FTD authors provided the annotation file (personal communication). Subsequent steps consist of using Megablast and a junction database to identify overlapping, spanning, and split reads to detect fusions, which are further filtered using a false-positive list provided by the SnowShoes-FTD authors.
CHS uses known junctions from an annotation file to guide Bowtie alignment and find discordant read pairs and unmapped reads. Trimmed unmapped reads are aligned and used in conjunction with the previous alignments to identify chimeric events by examining exon junctions from the annotation file. Thus, the three methods share an overall approach of identifying fusions by aligning paired-end reads and detecting evidence of a fusion junction. However, they differ with respect to the underlying alignment algorithm, read length, guidance from an optionally provided annotation file, post-alignment processing to assemble fusion contigs, and the parameters used to retain fusions from the candidate list.
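The read-length preprocessing mentioned above for SSH (retaining the first 50 bases of each read) can be sketched as follows. This is an illustrative sketch rather than the script used in the study; the function names are hypothetical, and it assumes simple four-line FASTQ records with no line-wrapped sequences.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumed well-formed)."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it)
        next(it)  # skip the '+' separator line
        qual = next(it)
        yield header, seq, qual


def trim_records(records: Iterable[FastqRecord], length: int = 50) -> Iterator[FastqRecord]:
    """Keep only the first `length` bases and matching quality values."""
    for header, seq, qual in records:
        yield header, seq[:length], qual[:length]


def format_fastq(record: FastqRecord) -> str:
    """Serialize one record back to four-line FASTQ text."""
    header, seq, qual = record
    return f"{header}\n{seq}\n+\n{qual}\n"
```

Passing an open file handle to `parse_fastq` and writing `format_fastq` of each trimmed record to a new file would produce 50-base reads from the 101-base reads described in the Methods.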
We also verified by running a separate parallel sample that the COLO-829 cell line provided a neutral background, i.e., it did not contain any of the nine SGFRs. Therefore, SGFRs in our experiment were not barcoded prior to spiking into the total RNA. However, barcoded SGFRs should be preferred in other cell lines to avoid mixing of spiked-in fusions and potential endogenous fusions.
Three algorithms, TopHat-Fusion (THF), ChimeraScan (CHS), and SnowShoes-FTD (SSH), were used to identify fusion-supporting reads for the SGFRs, and the number of such reads is plotted against experimental input abundance. Triangles correspond to data for sample replicate 1 (R1) and diamonds to data for the second replicate (R2). Complete data are included as a table in the supplementary materials.
To independently verify the presence of fusion reads (true positives) in the sequencing data, the data were aligned using GSNAP to a combined reference sequence consisting of the human genome GRCh37 build and the nine fusion transcripts. For each fusion, the number of fusion-supporting reads identified by GSNAP (blue squares), THF (purple triangle), CHS (red triangle), and SSH (inverted green triangle) is plotted for replicates R1 and R2.
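When reads are aligned directly to a fusion transcript included in the reference, one simple way to count fusion-supporting reads is to count alignments that cross the junction position with some minimum overhang on each side. The sketch below illustrates that idea only; it is not the custom script used in this study, the function names are hypothetical, and the `min_overhang` default of 10 bases is an assumed value. It covers reads that cross the junction themselves, not read pairs whose mates fall on opposite sides of it.

```python
from typing import Iterable, Tuple


def spans_junction(read_start: int, read_end: int, junction: int,
                   min_overhang: int = 10) -> bool:
    """True if a read aligned to [read_start, read_end) on the fusion
    transcript covers at least min_overhang bases on both sides of the
    junction. `junction` is the 0-based index of the first base that
    comes from the 3' partner gene."""
    left = junction - read_start
    right = read_end - junction
    return left >= min_overhang and right >= min_overhang


def count_junction_reads(alignments: Iterable[Tuple[int, int]], junction: int,
                         min_overhang: int = 10) -> int:
    """Count (start, end) alignments that span the junction."""
    return sum(
        spans_junction(start, end, junction, min_overhang)
        for start, end in alignments
    )
```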
Correlation between replicates based on number of fusion supporting reads. Panel (a) shows fusion-supporting reads (X-axis: Replicate 1, Y-axis: Replicate 2) for high read count (>100). Pearson correlation was CHS: 0.9613, THF: 0.9990, SSH: 0.9986, All: 0.9955. Panel (b) shows data for low read count (<=100) with Pearson correlation values CHS: 0.3209, THF: 0.2577, SSH: 0.7292, All: 0.4025.
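The replicate comparison above can be reproduced from a table of paired read counts with a short sketch such as the following. The function names are hypothetical, and the rule for assigning a pair to the high or low group is an assumption: the caption does not state whether the cutoff of 100 applies to one replicate or both, so here a pair is treated as high when either replicate exceeds the cutoff.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def split_by_count(pairs: Sequence[Tuple[int, int]], cutoff: int = 100
                   ) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Split (R1, R2) read-count pairs into high and low groups.
    Assumption: a pair is 'high' if either replicate exceeds the cutoff."""
    high = [p for p in pairs if max(p) > cutoff]
    low = [p for p in pairs if max(p) <= cutoff]
    return high, low
```

Calling `pearson` separately on the R1 and R2 columns of each group gives one coefficient per panel.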
Variance of fusion-supporting reads across molarity. For each fusion-transcript molarity (X-axis), the variance of the fraction of fusion-supporting reads across the nine fusions was calculated. Variances for the two replicates tend to be more similar at higher molarity than at lower molarity, indicating more consistent identification of fusion-supporting reads at higher molarity.
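A per-molarity variance of this kind could be computed as sketched below. Two details are assumptions, since the caption does not specify them: the fraction is taken relative to a caller-supplied total read count, and the sample variance (n − 1 denominator) is used. The function name is hypothetical.

```python
from statistics import variance
from typing import Sequence


def fraction_variance(fusion_read_counts: Sequence[int], total_reads: int) -> float:
    """Sample variance, across fusions, of the fraction of reads that
    support each fusion at one molarity in one replicate.
    Assumptions: the denominator is a total read count chosen by the
    caller, and the sample (n - 1) variance is wanted."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    fractions = [count / total_reads for count in fusion_read_counts]
    return variance(fractions)
```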
Sensitivity of the three algorithms at various levels of fusion-supporting reads cutoff (2, 5, 10, 25, 50, and 100).
Fusions detected by each algorithm. For two example thresholds of 2 (left matrix) and 50 (right matrix) on minimum number of fusion-supporting reads, number of fusions detected at different concentrations for two replicates R1 and R2 are shown. Brown cell: fusion detected. Blue cell: fusion missed. For example, at minimum threshold of 2, BRD4-NUT was positively identified most frequently (59/60 times) and TMPRSS2-ETV1 was detected least frequently (20/60 times).
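Because every spiked-in fusion is a known true positive, sensitivity at a given cutoff reduces to the fraction of spiked (fusion, abundance, replicate) combinations whose fusion-supporting read count reaches that cutoff. The sketch below illustrates the calculation for the cutoffs listed above; the function names and input layout are hypothetical. The 60 detection opportunities per fusion quoted above are consistent with three tools × ten abundances × two replicates.

```python
from typing import Dict, Sequence

CUTOFFS = (2, 5, 10, 25, 50, 100)  # cutoffs listed in the text


def sensitivity(read_counts: Sequence[int], cutoff: int) -> float:
    """Fraction of spiked-in fusion instances detected at a cutoff.
    read_counts holds one fusion-supporting read count per spiked
    (fusion, abundance, replicate) combination for a single tool; an
    instance counts as detected when its count is >= cutoff."""
    if not read_counts:
        raise ValueError("read_counts must not be empty")
    detected = sum(1 for count in read_counts if count >= cutoff)
    return detected / len(read_counts)


def sensitivity_curve(read_counts: Sequence[int],
                      cutoffs: Sequence[int] = CUTOFFS) -> Dict[int, float]:
    """Sensitivity at each cutoff, keyed by cutoff."""
    return {cutoff: sensitivity(read_counts, cutoff) for cutoff in cutoffs}
```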
Notably, some fusions were not detected by one or more tools irrespective of molarity, as shown by the points on the X-axis in Figure 4. As shown in Figure 5, irrespective of fusion transcript abundance, all three tools detected EWS-ATF1, two tools detected EML4-ALK, and only one tool detected TMPRSS2-ETV1. On further investigation of the SSH workflow, we discovered that fusion-supporting reads for both EML4-ALK and TMPRSS2-ETV1 were present in the initial candidate fusion list. However, these fusions were subsequently discarded by the SSH workflow when the final list of fusions was reported. As end-users of the tool, we could not identify the specific reasons for this filtering, and a detailed investigation of the SSH algorithm implementation is out of scope of this study. To explore why THF did not report the TMPRSS2-ETV1 fusion, we extracted known fusion-supporting reads from the GSNAP alignments and searched for them in the alignment files (generally known as accepted_hits.bam) generated by THF. We discovered that several fusion-supporting reads were aligned against the TMPRSS2 (chr21:42.84-42.9 Mb) and ETV1 (chr7:13.93-14.03 Mb) loci across various molarities, as shown in Additional file 1: Table S4. However, the TMPRSS2-ETV1 fusion was not reported in the final list of fusions after the TophatFusionPost step was executed. A detailed investigation of the actual THF algorithm implementation and the specific reasons for filtering out the fusion is out of scope of this study. However, these observations on unreported fusions highlight the critical importance of tool-specific criteria and parameters that might lead to false negatives or false positives: evidence for fusions from alignment data was processed differently by different tools, yielding different results. For the sake of completeness, we also note that each detection tool has a large number of input parameters that significantly affect its detection ability.
Figure 4 depicts the overall trend in capturing fusion-supporting reads based on our experimental design and chosen parameters. However, assessing the dynamic range and limits of detection for analytical tools will require extensive combinatorial selection of parameters, an in-depth analysis of algorithm implementation, and a much larger number of SGFRs across a wide range of transcript abundance as part of testing and validation. These are out of scope of this study, which is primarily focused on making data publicly available for collaborative research and on highlighting some of the issues in RNA-seq based gene fusion detection using our analysis framework.
Conclusion
The key contribution of this work is the first publicly available gene fusion RNA-seq data that specifically targets known oncogenic gene fusions that are gaining increasing importance in clinical genomics based on next-generation sequencing. The community can augment this dataset and the proposed analytical framework to further collaborative development of advanced analytical tools for gene fusion detection from RNA-seq data.
Data availability
All sequencing data are available in FASTQ format from the Sequence Read Archive (formerly the Short Read Archive) under accession number SRP043081.
Declarations
Acknowledgements
Research partially supported by a Stand Up To Cancer – Melanoma Research Alliance Melanoma Dream Team Translational Cancer Research Grant (#SU2C-AACR-DT0612). Stand Up To Cancer is a program of the Entertainment Industry Foundation administered by the American Association for Cancer Research. The authors thank TGen’s IT division for computational resources.
References
- Villanueva MT: Genetics: gene fusion power. Nat Rev Clin Oncol. 2012, 9: 188.
- Goldman JM, Melo JV: Chronic myeloid leukemia–advances in biology and new approaches to treatment. N Engl J Med. 2003, 349: 1451-1464. 10.1056/NEJMra020777.
- Saglio G, Morotti A, Mattioli G, Messa E, Giugliano E, Volpe G, Rege-Cambrin G, Cilloni D: Rational approaches to the design of therapeutics targeting molecular markers: the case of chronic myelogenous leukemia. Ann N Y Acad Sci. 2004, 1028: 423-431. 10.1196/annals.1322.050.
- Kakizuka A, Miller W, Umesono K, Warrel R, Franekl S, Murty V, Dmitrovsky E, Evans R: Chromosomal translocation t(15;17) in human acute promyelocytic leukemia fuses RAR alpha with a novel putative transcription factor, PML. Cell. 1991, 66: 663-674. 10.1016/0092-8674(91)90112-C.
- Huang ME, Ye YC, Chen SR, Cai JR, Lu JX, Zhoa L, Gu LJ, Wang ZY: Use of all-trans retinoic acid in the treatment of acute promyelocytic leukemia. Blood. 1988, 72: 567-572.
- Castaigne S, Chomienne C, Daniel M, Ballerini P, Berger R, Fenaux P, Degos L: All-trans retinoic acid as a differentiation therapy for acute promyelocytic leukemia: I. Clinical results. Blood. 1990, 76: 1704-1709.
- Gerber DE, Minna JD: ALK inhibition for non-small cell lung cancer: from discovery to therapy in record time. Cancer Cell. 2010, 18: 548-551. 10.1016/j.ccr.2010.11.033.
- Ou SH, Bazhenova L, Camidge DR, Solomon BJ, Herman J, Kain T, Bang YJ, Kwak EL, Shaw AT, Salgia R, Maki RG, Clark JW, Wilner KD, Iafrate AJ: Rapid and dramatic radiographic and clinical response to an ALK inhibitor (crizotinib, PF02341066) in an ALK translocation-positive patient with non-small cell lung cancer. J Thorac Oncol. 2010, 5: 2044-2046. 10.1097/JTO.0b013e318200f9ff.
- Kwak EL, Bang YJ, Camidge DR, Shaw AT, Solomon B, Maki RG, Ou SH, Dezube BJ, Jänne PA, Costa DB, Varella-Garcia M, Kim WH, Lynch TJ, Fidias P, Stubbs H, Engelman JA, Sequist LV, Tan W, Gandhi L, Mino-Kenudson M, Wei GC, Shreeve SM, Ratain MJ, Settleman J, Christensen JG, Haber DA, Wilner K, Salgia R, Shapiro GI, Clark JW, et al: Anaplastic lymphoma kinase inhibition in non-small-cell lung cancer. N Engl J Med. 2010, 363: 1693-1703. 10.1056/NEJMoa1006448.
- Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, Greenman CD, Varela I, Lin ML, Ordóñez GR, Bignell GR, Ye K, Alipaz J, Bauer MJ, Beare D, Butler A, Carter RJ, Chen L, Cox AJ, Edkins S, Kokko-Gonzales PI, Gormley NA, Grocock RJ, Haudenschild CD, Hims MM, James T, Jia M, Kingsbury Z, Leroy C, Marshall J, Menzies A, et al: A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2010, 463: 191-196. 10.1038/nature08658.
- Iyer MK, Chinnaiyan AM, Maher CA: ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics. 2011, 27: 2903-2904. 10.1093/bioinformatics/btr467.
- Kim D, Salzberg SL: TopHat-fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 2011, 12: R72. 10.1186/gb-2011-12-8-r72.
- Asmann YW, Hossain A, Necela BM, Middha S, Kalari KR, Sun Z, Chai HS, Williamson DW, Radisky D, Schroth GP, Kocher JP, Perez EA, Thompson EA: A novel bioinformatics pipeline for identification and characterization of fusion transcripts in breast cancer and normal cell lines. Nucleic Acids Res. 2011, 39: e100. 10.1093/nar/gkr362.
- Carrara M, Beccuti M, Lazzarato F, Cavallo F, Cordero F, Donatelli S, Calogero RA: State-of-the-art fusion-finder algorithms sensitivity and specificity. Biomed Res Int. 2013, 2013: 340620.
- Wang Q, Xia J, Jia P, Pao W, Zhao Z: Application of next generation sequencing to human gene fusion detection: computational tools, features and perspectives. Brief Bioinform. 2013, 14: 506-519. 10.1093/bib/bbs044.
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9: 357-359. 10.1038/nmeth.1923.
- Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26: 589-595. 10.1093/bioinformatics/btp698.
- Wu TD, Nacu S: Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010, 26: 873-881. 10.1093/bioinformatics/btq057.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.