ChimPipe: accurate detection of fusion genes and transcription-induced chimeras from RNA-seq data

Background Chimeric transcripts are commonly defined as transcripts linking two or more different genes in the genome, and can be explained by various biological mechanisms such as genomic rearrangement, read-through or trans-splicing, but also by technical or biological artefacts. Several studies have shown their importance in cancer, cell pluripotency and motility. Many programs have recently been developed to identify chimeras from Illumina RNA-seq data (mostly fusion genes in cancer). However outputs of different programs on the same dataset can be widely inconsistent, and tend to include many false positives. Other issues relate to simulated datasets restricted to fusion genes, real datasets with limited numbers of validated cases, result inconsistencies between simulated and real datasets, and gene rather than junction level assessment. Results Here we present ChimPipe, a modular and easy-to-use method to reliably identify fusion genes and transcription-induced chimeras from paired-end Illumina RNA-seq data. We have also produced realistic simulated datasets for three different read lengths, and enhanced two gold-standard cancer datasets by associating exact junction points to validated gene fusions. Benchmarking ChimPipe together with four other state-of-the-art tools on this data showed ChimPipe to be the top program at identifying exact junction coordinates for both kinds of datasets, and the one showing the best trade-off between sensitivity and precision. Applied to 106 ENCODE human RNA-seq datasets, ChimPipe identified 137 high confidence chimeras connecting the protein coding sequence of their parent genes. In subsequent experiments, three out of four predicted chimeras, two of which recurrently expressed in a large majority of the samples, could be validated. Cloning and sequencing of the three cases revealed several new chimeric transcript structures, 3 of which with the potential to encode a chimeric protein for which we hypothesized a new role. Applying ChimPipe to human and mouse ENCODE RNA-seq data led to the identification of 131 recurrent chimeras common to both species, and therefore potentially conserved. Conclusions ChimPipe combines discordant paired-end reads and split-reads to detect any kind of chimeras, including those originating from polymerase read-through, and shows an excellent trade-off between sensitivity and precision. The chimeras found by ChimPipe can be validated in-vitro with high accuracy. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-3404-9) contains supplementary material, which is available to authorized users.


ChimPipe filtered chimeric junction file description
. This table explains the different fields/columns present in the filtered chimeric junction file output by ChimPipe (tsv file). For each field the name of the field (as indicated in the header), the meaning of the field and an example of value of the field are indicated.     Table S6. This table provides for each benchmark dataset (PE76 simulated, Berger and Edgren) and each program, the number of reference chimeric junctions, the number of predicted junctions corresponding to correctly predicted gene pairs and the average (AVG) and standard deviation (STD) of the distance between the predicted and the true junction for those. This distance is computed as the sum of the distance between the predicted and the true donor and the distance between the predicted and the true acceptor.   Table S9. This table provides for each primer pair used for RT-PCR validation (see Table S8), its purpose (individual gene in completementary or genomic DNA, chimeric junction) and in case it was targetting a gene, the name of this gene. Figure S1. Two evaluation levels: gene pair level and junction level. On top is one reference junction with its associated gene pair (gene A, gene B). At the bottom are two predicted junctions of which the first one exactly corresponds to the reference junction, and is therefore considered both a junction level and a gene pair level true positive (TP). The second junction does not exactly correspond to the reference junction, and is therefore not considered a junction level TP, however since its first part overlaps an exon of gene A and its second part an exon of gene B, it is considered a gene pair level TP.

RT-PCR primer sequences
Gene A Gene B

Reference junction
Predicted junctions Junction level TP?
Gene pair level TP?

Yes Yes
Yes No Figure S3. Benchmark results when excluding read-through events for simulated data. This is the same as figure 3 but excluding read-through events when benchmarking the programs on simulated data. Note that figures S3B and S3D are the same as figures 3B and 3D, and are present here for the purpose of comparison between results on simulated and real data.   Figure S4. Distance between predicted and true junction on simulated data. This figure shows for the PE50 (A), PE76 (B) and PE101 (C) simulated sets, and for each program, the distance between the predicted and the true/reference junction for predicted junctions belonging to TP gene pairs. The distance between the predicted and the true junction is computed as the sum of the distance between the predicted and the true donor/upstream/5' splitce sites and the distance between the predicted and the true acceptor/downstream/3' splice sites.    Figure S5. RT-PCR validation method. For each chimeric junction attempted to be validated by RT-PCR, 3 tests were actually performed, each of them requiring different pairs of primers: (1) the actual validation of the chimeric junction is done by doing RT-PCR on a cDNA library using a pair of primers that are located externally to the junction but in the exons overlapped by each part of the junction; (2) genomic DNA tests (starting from genomic DNA) are done for each of the parent gene and for the chimeric junction, in order to check whether a genomic rearrangement could explain the chimeric junction and to see if the parent genes are present at the DNA level; (3) mRNA checks are done for each parent gene (starting from cDNA and using pairs of primers specific to each gene) in order to check whether they are present at the mRNA level. Figure S6. RT-PCR validation results for 4 chimeras in four cell lines (HeLa, HL60, MCF-7, K562). Here we show the products of the RT-PCR amplification of 4 chimeras and their parent genes, from the cDNAs of 4 different cell lines: HeLa (H), HL60 (6), MCF-7 (M) and K562 (K). For each chimeric junction we also provide a negative control (-) for comparison, and higlight the bands that show the presence of the chimeras and of the parent genes. For the 3 successfully validated cases (3 first ones, i.e. UBA2-WTIP, PICALM-SYTL2 and RPL38-TTYH2), the bands show the presence of the chimeric RNAs at the expected size, except for PICALM-SYTL2, which chimera size is higher than expected, and of all parent mRNAs except SYTL2, although this could be due to a low expression level of this gene. Figure S7. Genomic DNA analysis for the 3 successfully validated chimeras. For the 3 chimeras that were successfully validated by RT-PCR, we show the products of the genomic amplification of the 6 chimeras and their parent genes in the same 4 cell lines as the ones where the mRNA analysis was done (see supplementary Figure S5 above). These tests show that the parent genes are present at the DNA level, but not the chimeras. Indeed we see some unspecific amplification in the genomic DNA for the chimeras, but the band intensities are too low to consider them as genomic rearrangements. For the RPL38-TTYH2 chimera, there are some clearer unspecific products, but they are probably due to a primer contamination since they are also present in the negative control. H: HeLa, 6: HL60, M: MCF-7, K: K562, -: negative control. Figure S8. Colony PCR check and selection for RT-PCR validated chimeras. For the 3 RT-PCR validated chimeras, the RT-PCR products were purified from the gel bands, and cloned into pGEMTeasy vectors. E. coli bacterias were then transformed with these vectors, white colonies were selected and colony PCR was performed with the results indicated on the picture. Selected colonies (indicated by red boxes) were grown, plasmid purified and sent for sequencing.

Simulating reads from transcripts
Learning sequencing quality profile from Illumina PE data of the same read length In order to produce simulated reads that are as close as possible to real Illumina reads, we generated read quality profiles from real data of the same read length using the art_profiler_illumina program of the ART suite (version 2.3.7), and then gave it as input to the art_illumina program of the same suite (see below). In order not to learn sample specific quality profiles and since art_profiler_illumina is able to average profiles learnt from several samples, we used 4 samples for each read length (8 fastq files in total since the data is PE): the 4 Edgren PE50 samples, 4 ENCODE PE76 samples and 2 ENCODE PE101 samples (GEO accession numbers provided in Additional file 2).
Here is the command line that we used for a given read length:

art_profiler_illumina illumina${rl}bp $readDir fastq.gz
where: -$rl is the read length -$readDir is a directory containing the fastq files of the 4 samples of read length $rl of which we want to learn the read quality profile This resulted in 2 profile matrices for each read length, one for each mate, which are used by art_illumina for read simulation (see below).

Simulating Illumina unstranded PE reads
We used the art_illum ina program of the ART suite version 2.3.7 to simulate Illumina non directional paired end reads from the 102,399 chimeric and non chimeric transcripts previously generated. Since we wanted to see the influence of the read length of the programs' behaviour, we used art_illumina three times, asking for PE50, PE76 and PE101 reads. We asked for a coverage of 20 in all cases (-f option, see below).
Here is the command line that we used for a given read length: During this process, we discarded two validated fusion transcripts from the SK-BR-3 cell line: CSE1L-ENSG00000236127 and NFS1-PREX1. The first case was discarded since the gene ENSG00000236127 was not annotated in gencode v19 and the second one because it had a too short cDNA sequence on one side of the junction site, making this piece of sequence impossible to align with Blat.

Running ChimPipe and state-of-the art programs
Here we give the commands used for the 5 programs on the simulated datasets, but the parameters were the same for the real datasets.
Chimpipe version 0.9.3 was run with this command, using SAMtools 1.2 and bedtools 2.21: ChimPipe.sh --fastq_1 $read1 --fastq_2 $read2 -g $genome -a $annot -t where: -$ism is the mean insert size -$iss is the insert size standard deviation -$index is the index file of the genome and annotation -$read1 is the fastq.gz file of the first reads of the pairs -$read2 is the fastq.gz file of the second reads of the pairs and the output file considered for the evaluation was chimeras.bedpe .
For TopHatFusion version 2.0.12 we ran the following 2 successive steps, the first one for each read length separately, and the second one for all read lengths together, using Bowtie 1. -$mid is the mate inner distance, which is the insert length minus twice the read length -$iss is the standard deviation of the above distance, which we took as the insert length standard deviation -$index is the indexed genome file -$read1 is fastq.gz file of the first reads of the pairs -$read2 is the fastq.gz file of the second reads of the pairs and the output file considered for the evaluation was tophatfusion_out/result.txt In order to evaluate the programs in an automatic way, their output was first made into chimpipe format, and the same benchmark script was applied to each of them, taking as input the reference sets of junctions and the gencode v19 long annotation (see below).

Benchmarking the programs
To benchmark the programs' predicted chimeric junctions on both simulated and real data, we  Table S7) with default parameters and the --no-clean-up option. We then used the candidate chimera file from each experiment to select the chimeras that did not have a strong exonic sequence similarity between the 2 connected genes (a high sequence similarity being defined as a blast hit of more than 30 bp with more than 80% identity) and with the support of at least 10 staggered split-reads and 5 discordant PE reads. We then gathered all such chimeras across the 106 experiments.
Total DNA and RNA were extracted independently from four different cell types (HeLa, HL60, K562 and MCF7) using DNeasy® Blood & Tissue kit (QIAGEN) and SV Total RNA Isolation System (Promega), respectively. Then, 1 µg of total RNA was used to synthetize cDNA using ImProm-II™ Reverse Transcription System (Promega) according to manufacturer's instructions.
Several pair of primers were designed to validate the chimeric junctions. A first pair of primers was designed to analyze chimeric junctions in the cDNA preparations, and to discard a genomic reorganization using genomic DNA (gDNA) as template. Another pair of primers was then used to confirm the presence of each gene individually in gDNA and the presence of each individual parental transcripts in the cDNA preparations (Additional file 1: Figure S5). PCRs were performed using GoTaq Green Master Mix (Promega). The PCR products from the three positive chimeric junctions were gel purified using Wizard® SV Gel and PCR Clean-Up System (Promega) and cloned into pGEMT-easy (Promega), Finally, cloned chimeric junctions were sequenced at the Genomic facility of the Instituto de Investigación Biomédicas Alberto Sols.
For the most interesting case, the UBA-WTIP chimera, primers were designed in order to identify each possible complete chimeric transcript (Q1, Q3 and Q5 in Figure 6), based on the annotation of both parent genes and the exact localization of the chimeric junction. PCRs and sequencing were performed as described above.
Additional file 1: Tables S8-9 provide the primers' sequences and their characteristics.
Common recurrent chimeras between human and mouse ChimPipe 0.9.2 was run with default parameters (--no-clean-up option) on each human and mouse ENCODE CSHL PE RNA-seq experiment (bio-replicate of a sample) requiring each reported junction to be supported by at least 1 split-read and 1 discordant PE read (starting from the candidate chimera file output by ChimPipe). For human we used the hg19 genome assembly and the gencode v19 long gene annotation, and for mouse we used the mm9 genome assembly and the ensembl v65 long gene annotation.
A chimera was defined as detected in a given sample if the sum of its number of supporting split-reads in the 2 bio-replicates of the sample was at larger or equal to 2. A recurrent chimera was defined as a chimera detected in at least 2 samples.