Indel detection model
The core component of the transIndel algorithm is the ability to infer large deletions and medium-sized insertions from chimeric alignments. Reads with small indels can be represented as a single linear alignment to the reference genome and hence can be used to detect indels from the direct evidence of the alignment by available indel callers. However, as the indel size increases, short read aligners such as STAR [19] fail to map reads with those indels linearly in a single record. Instead, it detects large deletions as splicing junctions (Additional file 2: Figure S1a). Conversely, a chimera-aware aligner such as BWA-MEM [10] aligns short reads in a chimeric alignment, which consists of two linear alignments with each of the hits marked by soft-clipping in the alignment file that may account for half of the soft-clipped reads (Additional file 2: Figure S1b). By leveraging the alignment details for those chimeric reads, it is possible to reconstruct the linear alignment with mid-sized insertions and large deletions from the initial short read alignment and provide a redefined alignment output for downstream indel detection.
First, transIndel searches chimeric alignments from BAM files generated by BWA-MEM and selects those containing two linear alignments that do not have large overlaps but align on the same chromosome and strand. Second, the type and size of indels are determined by comparing the differences between target offset and read length (Fig. 1). In the case of deletions, the target offset is larger than the read length. In cases of mid-sized novel sequence insertions or tandem duplications, the target offset is shorter than the read length.
Validation on synthetic and real data
We compared the performance of transIndel with seven existing indel detection algorithms (GATK HaplotypeCaller [20], Pindel [21], Scalpel [22], Platypus [23], Fermikit [24], Delly [25] and NovoBreak [26]) on a synthetic data set. This comparison showed that Delly, Pindel and transIndel robustly detected large deletions from low (10×) to high (50×) coverage data with sizes ranging from100bp to 1 kbp (Fig. 2). Fermikit performed the best for detecting large insertions as it is the only one among these tools carrying out global assembly for indel detection. When examining mid-sized and small indels (< 100 bp), Pindel had the highest recall and precision, followed by transIndel (Additional file 3: Figure S2). Delly did not perform as well as Pindel or transIndel in terms of small indels (< 20 bp) (Additional file 3: Figure S2).
To assess performance with real DNA-seq data, we applied each tool to a 2X100bp, 50× coverage whole genome sequencing (WGS) data set from human individual NA12878. We compared the indel predictions against two reference call sets. One was available from the Genome in a Bottle (GIAB) Consortium, which identified mostly small indels less than 20 bp [27]. The other reference set was composed of mid-sized and large deletions > 50 bp provided by 1000 Genomes Phase 3. We computed the precision, recall (i.e., sensitivity), and harmonic mean of precision and recall (F-measure). We observed that transIndel achieved the highest sensitivity for detecting small indels (Fig. 3a-b). For large deletion detection, transIndel displayed the best performance (measured by F1 score) relative to other structural variation detection algorithms including Delly, Pindel, Platypus, Fermikit as well as our previously developed indel caller, ScanIndel. (Fig. 3c).
Next, we applied transIndel to detect 10 validated large deletions (> 1 kb) within the androgen receptor (AR) gene locus from prostate cancer specimens [14]. TransIndel achieved 100% sensitivity with estimated variant allele frequency (VAF) ranging from 3% to 93% (Additional file 4: Table S1). Notably, we found our calculated VAFs based on SAMTools pileup were lower than the reported VAFs in the original study. For example, a 3433 bp deletion from site A of patient C-6 had a 30% VAF but estimated as 47% by SHEAR originally. Since SAMTools directly used chimeric reads with recovered indels to calculate the VAF which may only account for partial variant depth, it was likely to underestimate the VAF. To make the prediction more accurate, we applied VarDict to predict the VAF of this deletion. VarDict enables accurate estimation of the VAF for indels by performing supervised and unsupervised local realignment of soft-clipped reads [12]. Interestingly, we found VarDict did not detect this deletion using the BWA-MEM generated BAM file (Additional file 5: Figure S3a), but with the redefined BAM file from transIndel, VarDict identified this deletion with 85% VAF, suggesting transIndel improved the sensitivity of existing tools for indel detection (Additional file 5: Figure S3b).
Application to whole-exome data of primary and metastatic prostate cancer
To test whether transIndel could enhance indel detection in a large dataset relevant to human disease, we first applied transIndel to whole exome sequencing (WES) data of 333 tumor and matched-normal primary prostate cancer (PC) sample pairs from The Cancer Genome Atlas (TCGA) study [15] and 59 metastatic castration-resistant prostate cancer (mCRPC) and matched-normal sample pairs from the AACR-PCF Stand-Up-To-Cancer (SU2C) study [28]. Since existing algorithms can reliably detect SNVs and small indels less than 10 bp (Additional file 3: Figure S2), we focused on indels equal to or larger than 10 bp. In addition, we set the minimal VAF to 10% to keep our approach consistent with thresholds applied in initial analyses of these WES data [15, 28]. We noted that the size of deletions initially called by transIndel varied significantly, with many falling into the structural variation category scale, rather than the indel scale. We therefore further limited our study to indels that were contained within a single gene only. With these filtering thresholds, we detected 1043 somatic indels in PC and 2034 somatic indels in mCRPC. The size range of deletions observed in PC and mCRPC were 10 to 27,293 bp and 10 to 113,612 bp, respectively and the size range of insertions in PC and mCRPC were 10 to 66 bp and 10 to 85 bp, respectively. Compared to the indel detection methods employed in the original TCGA and SU2C studies, transIndel detected more medium- and large-sized indels (Fig. 4a,b). Among these newly detected indels, we found that ten patients in the TCGA cohort harbored deletions in FOXA1 with sizes larger than 10 bp. FOXA1 mutations define one of the 7 distinct molecular subtypes of PC, yet nine of these patients had been assigned to a molecular subtype other than FOXA1 molecular subtype (Additional file 6: Table S2). Re-assignment of the nine patients positive for FOXA1 indels from the non-FOXA1 molecular subtype to the FOXA1 molecular subtype of PC would increase the proportion of this group from 3% to 5% of all PC cases (Fig. 4c).
To ensure we did not miss any additional FOXA1 indels in the TCGA cohort, we leveraged VarDict to estimate the VAF of our detected indels and also applied Delly, Pindel, Scalpel, GATK, Platypus and Fermikit for detection. We found none of these tools could identify all ten FOXA1 deletions. Moreover, no additional mid-sized and large indels were found in FOXA1 by these tools (Additional file 6: Table S2).
Application to RNA-seq data of mCRPC
To validate our findings from WES data and evaluate effects of detected genomic indels on transcriptome structure, we applied transIndel to RNA-seq data from 59 mCRPC samples in the SU2C cohort. RNA-seq raw reads were aligned with BWA-MEM and transIndel was used to infer candidate indels. Deletion candidates were subjected to several filters to discriminate genomic deletions from RNA splicing events, including removal of deletions neighboring canonical splicing motifs or annotated splice sites (Fig. 5a). This pipeline identified genomic deletions that were misclassified as splicing junctions (Additional file 7: Figure S4a-d) and genomic insertions that were missed (Additional file 7: Figure S4e-h) by the STAR algorithm, which is a component of GATK best practice for indel discovery in RNA-seq (https://software.broadinstitute.org/gatk/best-practices/rnaseq.php). Compared to WES, transIndel called more indels from RNA-seq data, and these were more abundant in untranslated regions (UTRs), splicing regions, and coding exonic regions (Fig. 5b). This situation likely occurred because the exome capture kits used for WES are primarily designed to capture the protein-coding regions in the genome (~ 2% of human genome), whereas RNA-seq also covers UTRs and retained introns in the transcriptome (~ 5% of human genome) [29]. Indeed, we found the largest proportion of detected indels from RNA-seq data were intronic indels (Additional file 8: Figure S5).
A surprising result was that WES indel calls overlapped with indels discovered in RNA-seq to a very limited extent (Fig. 5b). To investigate the reasons for this low degree of overlap, we examined differences in WES and RNA-seq read coverage at locations corresponding to the 2034 WES indels (Additional file 9: Figure S6a) and 6734 RNA-seq indels (Additional file 9: Figure S6b). These results indicated that coverage difference between WES and RNA-seq at the locations of predicted indels was the main reason for this low overlap in indel calls.
As anticipated, the fraction of indels in coding exons was markedly higher in the overlap between WES and RNA-seq than in either dataset alone (Fig. 5b). Interestingly, the VAFs of the 71 common indels called from WES and RNA-seq data showed weak correlation (r = 0.31;p = 0.0079). Linear regression analysis indicated that indels called from RNA-seq data were present at higher VAF than those same indels called from WES data (Fig. 5c). This is likely due to inefficient hybridization-based capture of DNA fragments harboring medium- to large-scale indels, which is a standard step in WES workflows. An example of this scenario is shown for FOXA1 in Figure S7 (Additional file 10). We found the WES didn’t cover FOXA1 well between aa240–380, which might prevent detecting additional FOXA1 mutations. These results suggest that detection of indels in RNA-seq data using transIndel could complement WES for detection and validation of genomic alterations, which is crucial for clinical diagnostics.
Since WES and RNA-seq data are both expected to display enrichment in exonic regions, we focused on the indels discovered in coding exons. Surprisingly, we found only 12 coding indels called from WES data that overlapped with coding indels called from RNA-seq data (Fig. 5d). Explanations for this surprisingly low overlap could be that the genes impacted by indels were not expressed (RPKM < 1), or gene expression was arising from the unaffected, reference allele (Fig. 5d). Overall, there were more coding indels called from RNA-seq data. In several instances, manual inspection of WES data indicated the presence of individual DNA-seq reads supporting these indel events. However, these would have ultimately been filtered out in the DNA-seq indel calling pipeline due to low coverage (< 10×). Conversely, many coding indels called from RNA-seq data appeared to be RNA-only events, as there was no evidence of corresponding DNA-levels events despite high WES coverage in these regions. One explanation for this phenomenon could be a type of non-canonical splicing event, termed exonic introns (exitron) [30]. One of the distinguishing features of exitrons is overrepresentation of indel sizes that are multiples of three nucleotides [31]. Indeed, by comparing the size distribution of detected exitron events with all RNA-seq indels, we found that exitron events were enriched for indels with sizes that were multiples of three (Additional file 11: Figure S8).
Although the transIndel pipeline for indel detection from RNA-seq data included the removal of deletions that could be splicing junctions, the catalog of alternative splicing events in prostate cancer, including exitrons, is incomplete. Remarkably, many of the detected exitrons were recurrent within genes known to be altered at high frequency in cancer (Fig. 5e). Over half of the cancer genes displaying non-canonical exitron splicing were known tumor suppressor genes (TSGs). We also found that KMT2B, a known regulator of mCRPC [32], displayed exitron splicing in 7% of mCRPC samples. To test the accuracy of these RNA-level calls, we assessed their presence in the LNCaP cell line by analyzing LNCaP WES and RNA-seq data using transIndel. LNCaP cells displayed evidence for exitron splicing in ZBTB18, which was the most frequent exitron splicing events observed in mCRPC specimens. We performed RT-PCR with LNCaP RNA and PCR with LNCaP DNA, and confirmed by Sanger sequencing that ZBTB18 displayed bona fide exitron splicing at the RNA level only (Fig. 5f). ZBTB18 encodes a transcriptional repressor involved in neural development. It has been reported as a possible tumor suppressor as its expression reduces cell proliferation [33]. Additionally, it has been reported that the ZBTB18 exitron splicing expressed in both ERBB2-positive breast cancer and normal breast tissues, but it has higher percent of spliced in value in tumor compared with normal samples [30].