A fast detection of fusion genes from paired-end RNA-seq data

Background Fusion genes are known to be drivers of many common cancers, so they are potential markers for diagnosis, prognosis or therapy response. The advent of paired-end RNA sequencing enhances our ability to discover fusion genes. While there are available methods, routine analyses of large number of samples are still limited due to high computational demands. Results We develop FuSeq, a fast and accurate method to discover fusion genes based on quasi-mapping to quickly map the reads, extract initial candidates from split reads and fusion equivalence classes of mapped reads, and finally apply multiple filters and statistical tests to get the final candidates. We apply FuSeq to four validated datasets: breast cancer, melanoma and glioma datasets, and one spike-in dataset. The results reveal high sensitivity and specificity in all datasets, and compare well against other methods such as FusionMap, TRUP, TopHat-Fusion, SOAPfuse and JAFFA. In terms of computational time, FuSeq is two-fold faster than FusionMap and orders of magnitude faster than the other methods. Conclusions With this advantage of less computational demands, FuSeq makes it practical to investigate fusion genes in large numbers of samples. FuSeq is implemented in C++ and R, and available at https://github.com/nghiavtr/FuSeqfor non-commercial uses. Electronic supplementary material The online version of this article (10.1186/s12864-018-5156-1) contains supplementary material, which is available to authorized users.

Many tools have been developed for detecting fusion transcripts using RNA-seq data, and their comparisons are available in several recent publications [9,10]. These methods use various approaches, but generally include three main steps: (i) read alignment, (ii) fusion candidate detection and (iii) false positive elimination. Read alignment is usually done by standard read alignment methods in RNA-seq, such as TopHat-Fusion [11], SnowShoe-FTD [12], EricScript [13], JAFFA [14], or by its own method as FusionMap [15]. To determine fusion candidates, most of the methods use discordant reads such as spanning read pairs and/or split reads. Spanning reads contain one read located in different genes, while split reads indicates a single read overlapping on two different genes. The final step contains filtering and/or scoring systems to remove false positive fusion candidates. This step varies from method to method and has been summarized in several reviews [9,10].
Most of the current fusion detection methods require significant computational demands. As reported recently [9] for 15 fusion detection methods, including BreakFusion [16], Chimerascan [17], defuse [18], EricScript [13], FusionCatcher [3,19], FusionHunter [20], FusionMap [15], FusionQ [21], JAFFA [14], MapSplice [22], PRADA [23], ShortFuse [24], SnowShoes-FTD [12], SOAPfuse [25], TopHat-Fusion [11], they require from 7 to 240 h to analyse a single prostate cancer sample containing ∼118M 100bp-long read pairs. This makes it impractical to perform routine fusion gene detection in datasets with a large number of samples. To address this, we develop FuSeq, a novel fusion detection method utilizing a recent quasi-mapping method for alignment that is substantially faster than traditional alignment methods [26]. FuSeq consists of two separate pipelines based on mapped read-pairs (MR) and junction split-reads (SR) that are combined in the final step. For the MR pipeline, FuSeq introduces a new concept of fusionequivalence class to generate fusion candidates. In the SR pipeline, fusion candidates are collected from split reads where two different genes share the same read, and each gene has at least k-mer mapped bases. In addition, various filters and statistical tests are applied to the fusion candidates, primarily for false-positive reduction. We apply FuSeq to four validated datasets and show that it outperforms commonly used methods Tophat-fusion, SOAPfuse and JAFFA in sensitivity, specificity and discovery operating characteristics, while FuSeq is orders of magnitude faster in computational time.

Methods
The pipeline of the proposed fusion detection is presented in Fig. 1. The key steps are: (i) quasi-mapping to detect mapped reads and split reads of fusion candidates, and (ii) statistical tests and filtering, separately for mappedread pipeline and split-read pipeline to eliminate likely false positives. Before exporting the final results, de novo assembly can be used as an extra step to verify and determine the fusion sequence. The details of each phase are presented in the following sections.

Mapped reads and split reads
We utilise the quasi-mapping from Rapmap to generate mapped reads and split reads for FuSeq. The split reads and mapped reads in FuSeq are determined depending on k-mer length k and read-length r used in the quasimapping. A mapped read is from a read-pair where each read is completely or mostly (with length >= r − k − 1) mapped to a different gene. In split reads, a single read is partially mapped in both genes and the mapped sequences must have a length >= k. Additional file 1: Figure S1 intuitively demonstrates different cases of mapped reads and split reads. It is worth noting that, by definition, mapped reads in FuSeq will span over the fusion junction-break.

Fusion-gene candidates from split reads
In the quasi-mapping of Rapmap, each read is mapped to a transcriptome stored in a k-mer index system. The result of a quasi-mapping for a read is a list of k-mers (ordered from left to right) of the read. For each split read, we extract the important mapping information of the k-mers at the first and the last of the list for downstream analysis. The information includes mapping directions, query positions of the k-mers, mapped transcripts and corresponding genes, mapped positions of the transcripts, and the mapped position of the other read of the pair. The pair of the split read must be mapped to the same transcript on either side of the fusion. Additional file 1: Figure S2 presents the details of the data structure extracted from split reads. All possible split reads are collected to input to statistical tests and filtering steps.

Fusion-gene candidates from mapped read
Since the number of mapped reads are significantly higher than the number of split reads, to speed up calculation, we introduce a novel concept "fusion equivalence class" to organize and generate fusion-gene candidates.

Fusion equivalence class
We first explain the concept of fusion equivalence class, which is motivated by the transcript equivalence class used for transcript quantification by Patro et al. [27]. For a given read pair (r 1 , r 2 ), using quasi-mapping from Rapmap [26], we extract T 1 and T 2 as the sets of transcripts that r 1 and r 2 map to, respectively. We define S ≡ T 1 ∩ T 2 as the concordant transcript set between T 1 and T 2 . We denote S 1 ≡ T 1 − S and S 2 ≡ T 2 − S as the sets of discordant mapped transcripts of r 1 and r 2 , respectively. From now on, we define a fusion transcript (ftx) as an ordered combination of two transcripts tx u and tx v belonging to the discordant mapped transcript sets: where tx u ∈ S 1 and tx v ∈ S 2 . Additional file 1: Figure S3 displays a simple example where S 1 = {tx 1 , tx 2 } and S 2 = {tx 3 , tx 4 , tx 5 }. From S 1 and S 2 , there are six possible fusion candidates generated from combination between the transcripts in S 1 and the transcripts in S 2 : ftx 1 = (tx 1 , tx 3 ), ftx 2 = (tx 1 , tx 4 ), ftx 3 = (tx 1 , tx 5 ), ftx 4 = (tx 2 , tx 3 ), ftx 5 = (tx 2 , tx 4 ), ftx 6 = (tx 2 , tx 5 ).
For simplicity here we denote each fusion transcripts with a single index, but there is no conflict with the notation in formula 1. We are always able to refer a fusion transcript to an ordered combination of two transcripts, for example ftx 1 = ftx(1, 3) = (tx 1 , tx 3 ). Thus, each readpair maps to or generates a set of fusion candidates (which could be empty). Two read-pairs are said to be equivalent if (and only if ) they map to the same set of fusion transcripts; this means we can group/partition the read-pairs into equivalence classes using the corresponding set of fusion transcripts as the group index. Thus we set up the fusion equivalence (feq) classes -each indexed by the set of potential fusion transcripts (ftx 1 &hellip; ftx n ) -such that every read-pair in a feq must map exactly to all the fusion transcript candidates that defines the feq.
In the example above, a read pair (r 1 , r 2 ) belongs to a fusion equivalence class feq consisting of six fusion transcripts and contributes one count to that equivalence class.
Naturally, from a single fusion transcript ftx(u, v) we can also derive a fusion gene (fge) as a combination of two genes: gene A and gene B as follows: where tx u is a transcript of gene A , and tx v is a transcript of gene B . Thus, a fusion gene is connected to fusion equivalence classes through its corresponding fusion transcripts. If the transcript pair (tx u , tx v ) belongs to feq j , then we say feq j supports the fge. We denote the set of fusion equivalence classes as FEQ = feq 1 ..feq M with a set of corresponding numbers of supporting read pairs C = {c 1 ..c M }, the set of fusion transcript as FTX = {ftx 1 ..ftx N } and the set of fusion gene FGE = fge 1 ..fge K . The FEQ table in Additional file 1: Figure S3 also shows the relationship between FEQ, FTX and FGE. The first and the second rows of the table presents fusion gene and its fusion transcripts. Each following row presents a single fusion equivalence class. The binary value indicates the absence/presence of ftx i in feq j . The last column shows the number of fragments supporting the corresponding fusion equivalence class. From the table, we extract the number of fragments supporting a fusion gene by summing up the c j of its supporting fusion equivalence classes, that is those feq including any ftx included in the fge k set as follows where summation over all j such that feq j 's support fge k . We also can compute the number of fragments supporting a fusion transcript supportCount(ftx i ) by the same formula with the c j supporting fusion transcript ftx i . If many fusion genes share the same fusion equivalence classes then this indicates that the supporting reads from these fusion equivalence classes are not uniquely mapped to a single fusion genes. To compensate this issue, we correct the supporting counts of the fusion genes by adding weights of fusion genes in the fusion equivalence class. For simplicity, we set the weights of all fusion genes of a fusion equivalence class to be equal. Thus, the corrected count is computed as follows where w k,j = 1/|feq j | and |feq j | is the number of fusion transcripts.
We also discard all fragments with a fusion equivalence class containing two transcripts from the same gene. Finally, a list of fusion-gene candidates is extracted from the table of fusion equivalence classes for further analysis.
From hereon the term 'fusion transcript' is defined in relation to the fusion equivalence classes, which is not necessarily a transcript generated from a fusion event. The 'fusion gene' indicates a fusion event occurring between two separate genes, consequently generating a fusion transcript. The aim of our method is to detect a fusion event between two genes, so 'fusion gene' is reported as the final result.

Statistical tests and filtering
Several statistical tests and filtering criteria are applied to limit false-positive fusion-gene candidates. We divide the filters into three main categories: (i) general features of fusion genes; (ii)sequence similarity of constituent genes; (iii) positional distribution of the supporting reads. The applications of the filters and tests might be different from mapped reads to split reads. The details of the filters for practical implementation in each pipeline are supplied in the Supplementary report.

General features of fusion genes
We limit fusion genes to common situations, for example: in selected chromosomes 1-22, X and Y, constituent genes coming from protein-coding genes, large enough distance between constituent genes, and sufficient supporting read count. We also do not allow 'inverted fusion' that if a fusion gene fge(gene A , gene B ) is expressed, the inverted direction fusion gene fge(gene B , gene A ) is not likely expressed. The inverted fusion gene created by exchanging the roles between 5-prime gene and 3-prime gene from one fusion gene. This can create circular fusions that are likely false positives.

Sequence similarity of constituent genes
Different genes with highly similar sequences are often listed in fusion-gene candidates but they are likely false positive. The similarity is frequently observed between a gene and its paralogs or its known read-through (conjoined) genes, that can be collected from the reference database. Since mapped reads do not generally contain junction-break information, we use strict criteria for the mapped read pipeline to reduce false positives. In particular, we do not expect many supporting read pairs to be shared between fusion genes. Furthermore, we utilize the equivalence classes to discover all the possible sequence similarities between two genes that will be used as an extra paralog reference.
Specifically, we first use RNASeqReadSimulator tool (http://alumni.cs.ucr.edu/~liw/rnaseqreadsimulator. htm) to generate a paired-end simulation sample, where the expected read counts of all transcripts of the transcriptome are equal and high enough, herein 1000 read counts. Then, transcript equivalence classes of the simulation sample are generated using Rapmap [26]. Since all transcripts are expressed, two transcripts with similar sequence regions have to appear in at least one equivalence class. Finally, the transcripts are mapped to their corresponding genes to determine which genes share similar sequences.

Positional distribution of supporting reads
For each read pair (r 1 , r 2 ) of mapped reads, we collect all start positions of r 1 and r 2 mapped to the annotation reference. Thus, for each fusion gene gene A , gene B , we are able to compute the distributions of the start positions (startPos) of the supporting reads to gene A and gene B . Similarly we get the distributions of start positions of the k-mer sequences mapped to the genes from the split reads. For simplicity, we call them positional distribution. If the start positions of two single reads are very close to each other, they are likely duplicated (quasi-duplicated). This feature is similar to the criteria of patterns of short reads mentioned in [3,13]. Collections of positional distribution can be used to estimate the fragment length generating a read pair. Then statistical tests for fragment length are used to remove outliers and eliminate false positives. In addition, breaking points and exons information of each site can be estimated from the positional distribution. Thus, FuSeq can allow to check the satisfactions of popular splicing sites such as GT-AG, GC-AG and AT-AC. Moreover, if the distance between two breaking points from the same chromosome (junctionDistance) is too close to each other, the fusion-gene candidate is likely a false positive.

Other statistical tests and filtering
In general, paralogs can easily produce false-positive candidates, but these candidates will have too many sequence similarities. Thus we check the length of the overlapping mapped sequence between two genes in a split read based on the first and the last k-mers. Moreover, a split read is deemed a misalignment if > 85% of the read sequence is fully mapped to either 3' transcript or 5' transcript. Finally, the consistency between expression level of mapped reads and split reads supporting a fusion gene is tested. We also report (but not eliminate) the genes relating to mitochondrial translation, cytosolic ribosomal subunit and ribonucleoprotein, which are filtered out in some fusion detection methods [3,19].

De novo assembly
If the number of supporting read pairs of a fusion gene is large enough, then it is useful to perform de novo assembly to build a sequence-contig capturing the fusion. This step is optional in FuSeq and any denovo assembly tools such as Trinity [28], Oases [29], Trans-ABySS [30], SOAPdenovo-Trans [31] can be used. To get a computationally efficient procedure, the de novo assembly is done as follows. First, we extract all read pairs supporting all fusion-gene candidates from the previous stage of the pipeline. This produces a small set of reads, so the assembly computation is trivial. These read pairs (both mapped reads and split reads) are used as input to a de novo assembly tool to construct the contigs; in practice we use Trinity [28]. Next, we consider these contigs as the reference for Rapmap [26] and do quasi-mapping for the read pairs to the reference. From the mapping results we can associate the contigs to the candidate fusion genes.

Final fusion-gene candidates
The final set of fusion-gene candidates is combined from the candidate lists of mapped-read and split-read pipelines. It is worth noting that the fusion gene candidates from mapped read might have no supporting split reads, and vice versa. The score of each fusion gene is the sum of the corrected count of the supporting mapped-reads and split-reads. In FuSeq, scores of final fusion gene candidates must be at least 3. In the final set, fusion genes are ranked according to their scores.

Implementation
In our implementation, we use the genome reference and annotation from Ensemble version GRCh37.75. The method was implemented in C/C++ for extraction of split reads and fusion equivalence classes of mapped reads, combined with R language for downstream analysis. FuSeq software is available for non-commercial use at https://github.com/nghiavtr/FuSeq. User guides with practical examples are also provided in the website.

Materials
We illustrate the applications of FuSeq to four publicly available and validated real datasets.

Breast-cancer dataset
There are 6 samples from 4 breast-cancer cell lines (BT-474, SK-BR-3, KPL-4 and MCF-7) [3], where BT-474 and SKBR3 have two samples. The samples contain 14-42M paired-end reads of 50bp long, using Qiagen PCR purification kit in library preparation following sequenced by 1G Illumina Genome Analyzer 2X. There are 27 validated fusion genes from the original study [3], and extended to a total of 99 in later publications [3,4,14]. For clarity we will separately call these two validated versions as TP27 and TP99 datasets. They highlight the difficulty in assessing a method when there is no real gold standard. For example, some fusion candidates that are not validated in TP27 dataset turn out to be true positives in the TP99 dataset.

Melanoma dataset
This dataset has 6 samples (501-MEL, M000216, M000921, M010403, M980409 and M990802) from melanoma patients [6] including 8-16M paired-end reads of 50bp long. The library was prepared using the SuperScript Double-Stranded cDNA Synthesis kit (Invitrogen) before sequenced by Illumina Genome Analyzer II. A total of 11 fusion genes has been validated in this dataset.

Spike-in dataset
This dataset contains 9 synthetic spike-in fusion genes titrated into the cell line COLO-829 [32] with ten different abundances, each abundance has one duplicate. The final 20 RNA-seq samples have 72-180M pairedend reads of 100bp long. The library is prepared by the TruSeq Stranded mRNA LT Sample Prep Kit, later sequenced using Illumina HiSeq2500. Thus, the strandness of the read is kept in this dataset. In this dataset, sample SRR1659964 with the medium concentration (-6.17 log10(pMoles)) was compared to other dataset recently (Liu et al., 2015), so it is also considered in a separate comparison.

Competing tools and evaluation
We select TopHat-Fusion, JAFFA and SOAPfuse as the main tools for comparisons. TopHat-Fusion is a widely used fusion gene detection tool, JAFFA is one of the most recent and top performing methods, and SOAPfuse is the best method in the recent comparison [9]. We compare the performance in terms of sensitivity, specificity and computational time. We utilize results from a recent study [14], because it contains performances of JAFFA, SOAPfuse and TopHat-Fusion for the breastcancer dataset and the glioma dataset. The original study of the spike-in dataset [32] is used for the comparison of all 20 samples with TopHat-Fusion. We also use a comparative study of fusion detection methods [9], since it contains information of the melanoma dataset and sample SRR1659964 of the spike-in dataset. Finally, for comparison of computational time, we select FusionMap [15], the fastest method from the comparative study, and a recent method TRUP [5], which demonstrated similar computational performances to FusionMap and TopHatfusion. We use the version of FusionMap migrated in the Oshell pipeline at http://www.arrayserver.com/wiki/ index.php?title=Oshell since all older versions of Fusion-Map are inactive. We use version TRUP_2015-21-05 from the TRUP webpage https://github.com/ruping/TRUP/.
Since the performances of methods depend on the parameter settings [9], for fair comparisons, we set the parameters in our method similar to the corresponding values from the previous comparisons. For example, similar to the other methods, FuSeq also depends on the minimum number of supporting spanning reads (mapped reads in FuSeq) to remove false positives. Thus, we set this value as 2 for the comparison of the breast-cancer dataset, the glioma dataset and all 20 samples of spike-in dataset, and 3 for the comparison of the melanoma dataset and sample SRR1659964 of the spike-in dataset. The minimum supporting reads is set to 1 as common. We also implement de novo assembly to verify the true positives in a few samples from the datasets. Since there are no particular settings of FusionMap and TRUP for the minimum number of supporting spanning reads or spliting reads, these criteria are not applicable. We also used GSNAP (default) for the mapper of TRUP since it gets more sensitivity than STAR [5]. For other setting parameters of FusionMap and TRUP, the default values are used. Since the breast-cancer dataset and the melanoma dataset have short reads (50 bp), a low value (=1) is set for the parameter consisCount (number of consistent read pairs with discordant mapping) of the step "runlevel 3" as the suggestion of the tool. However, using either the default value (consisCount = 5) or the suggested value (consisCount = 1), TRUP reported no fusion gene candidates. Since we are not sure if this is the proper result, and we could not find any obvious solutions, we do not report the performance metrics ("-") for TRUP for these two datasets (Table 1).
In addition, to be able to generate split reads in FuSeq, the k-mer length must be less than a half of read length. The default setting of k-mer length 31 in Rapmap is not suitable for dataset with short reads (50 bp read long). Therefore, we set this value as 21 for the short read datasets (breast-cancer and melanoma datasets with 50bp read long) and as 31 for long read datasets (gliomas and spike-in datasets with 100 bp read long). Figure 2 shows one example of a true fusion event involving AKAP9-BRAF genes in chromosome 7, discovered by FuSeq in the spike-in dataset SRR1659964. The full data contains 93.9M read-pairs, and the fusion gene contains 3318bp and 1158bp from AKAP9 and BRAF, respectively. In Fuseq output, the fusion is supported by 60 read-pairs. It is a highly confident fusion, as it is also supported by a contig constructed from the de novo assembly; this contig is 639bp long with 252bp belonging AKAP9.

Discovery performance
The discovery operating characteristics for the breastcancer, melanoma and glioma datasets are presented in Table 1. The results reported here are pooled from all the samples of each dataset. We use recall, precision and F1 score for the comparison. Recall or sensitivity is defined  as the ratio between the discovered true positives and the total validated fusion genes. Precision is the ratio between the discovered true positives and the total fusion candidates discovered by the fusion-detection methods. The precision conveys the specificity of the methods. F1 score or F-measurement is a balanced metric between precision and recall, and is calculated by The list of validated true positives in each dataset is likely incomplete; assuming this list is a random subset of all true positives, the reported recall is unbiased, but precision is only a lower bound of the true values.
For the breast-cancer data, FuSeq discovers 22 of 27 validated fusions in TP27, and 36 of 99 extended validated fusions in TP99. Compared to FusionMap and TopHat-Fusion, FuSeq discovers more true positives and has a smaller candidate list. JAFFA has the smallest final fusiongene list, but it loses many validated fusion genes. The F1 score of FuSeq is equal to SOAPfuse and slightly smaller than that of JAFFA in TP27 (0.55 vs 0.58). However, F1 scores of both FuSeq and SOAPfuse are significantly greater in TP99 (0.47 and 0.51 vs 0.40). As mentioned in the "Methods" section, TRUP reported no fusion gene candidates, thus we consider its metrics as not available ("-") for the comparisons in this dataset, and similarly in the melanoma dataset.
As described in a recent study [9], the melanoma dataset represents a difficult dataset for fusion-gene detection. Among 15 competing fusion gene detection tools, most of them find less than 6 validated fusion genes. Fusion-Map, TopHat-Fusion and JAFFA report 3, 4 and 2 true positives, respectively; this is shown in Table 1. SOAPfuse discovers most true positives (10) but introduces a lot of false positive fusion genes (recall=0.09). FuSeq reports 7 true positives and recommends 14 other candidates. FuSeq also has the best F1 score (0.44) as compared with the other methods.
In the glioma dataset, except for TRUP and SOAPfuse (with 20 and 22 discovered true positives, respectively), all the other methods discover most of validated fusion genes, 28 for FusionMap, 29 for both TopHat-Fusion and FuSeq, and 30 for JAFFA. However, FuSeq discovers the smallest fusion candidate set (188 candidates in total), which makes the F1 score of FuSeq higher than that of all other methods.
For the spike-in dataset, due to the huge library sizes (72-180M read pairs), there are not many available results from current fusion detection methods for all 20 samples. The original publication of the spike-in dataset [32] reports the total number of spike-in fusion genes detected from TopHat-Fusion and ChimeraScan and SnowShoes-FTD tools for all 20 samples, but without information of other fusion-gene candidates. In this study, only 142, 80, 133 and 138 out of 180 true positives are discovered from FusionMap, TRUP, TopHat-Fusion and JAFFA, respectively ( Table 2). As shown in the table, FuSeq obtains a much better result by missing only one true positive. We also find out another report [9] of the spike-in dataset, but only for a single sample SRR1659964 (-6.17 log10(pMoles) concentration). In the report, all 15 fusion gene detection tools are not able to detect all 9 spike-in fusion genes. From that report, JAFFA, SOAPfuse and TopHat-Fusion detect 5, 4 and 1 true positives respectively (Table 2). From the table, FusionMap and TRUP also discover 7 and 4 true positives, respectively. In contrast, FuSeq can detect all 9 spike-in fusion genes, and they are among the top 11 in fusion ranks (Fig. 3). FuSeq also shows the best F1 score (0.53) in this dataset.
We further evaluate whether the differences of the precision values of FuSeq to other methods are statistically significant. To do this, we collect the number of true positives (TP) and the total discovered fusion-gene candidates (Total) from Tables 1 and 2. Fisher's exact test is then applied, comparing the precision between FuSeq vs each of the other methods. Two-sided p-values are reported in the tables. The results show that the tests Table 2 Fusion discoveries in the spike-in dataset. The results for TopHat-Fusion, SOAPfuse and JAFFA for sample SRR1659964 are collected from a recent study [9] One sample(SRR1659964) 20 samples   [14] and Liu et al. [9], respectively. For purpose of visual comparison, the x-axis of the plots is limited mostly by the curves of FuSeq for the precision difference between FuSeq and TopHat-Fusion are significant in the breast cancer datasets TP27 (9.1e-06) and TP99 (1.1e-08). For the Melanoma data, the test is significant vs SOAPfuse (p-value = 0.02). Moreover, in the Glioma datasets, the higher precision of FuSeq is significant vs two methods including JAFFA (5.5e-08) and SOAPfuse (0.02). There are no significant p-values vs any methods in sample SRR1659964; this likely due to the small sample problem. Finally, the higher precision of FuSeq is significant vs FusionMap (p-value = 0.04) and TopHat-Fusion (1.6e-22) in the full spike-in dataset.

Evaluation by discovery rates
Since we do not have information on true negatives in the real datasets, we further evaluate the discovery rates of FuSeq through the ranks of validated fusion genes. In general, the validated fusion genes of each dataset should have low ranks in the final set. In the breast-cancer, melanoma and glioma datasets, most of the validated fusion genes are in the top 10 in the final set of a sample (Additional file 1: Table S1, S2 and S3). These fusion genes usually contains a high number of supporting reads (≥ 10). Additional file 1: Figure S4 presents the ranks of 9 spikein fusion genes over 20 samples detected by FuSeq. The samples in the x axis are ordered by the concentration levels from high to low and replication 2 to replication 1. In general, all spike-in fusions are on the top of ranks, indicating the stability of FuSeq. Half of samples in the right side, which have high levels of concentration, reveal all 9 spike-in fusion genes at the top 9 of ranks. This indicates that the spike-in fusion genes have the strong signals in these samples with high supporting reads (about more than 50 counts, see Additional file 1: Table S4). The trend of the ranks of spike-in fusion genes only slightly increases when the level of concentration decreases. The higher ranks of spike-in fusion genes in the low levels of concentration might be caused by higher signals of endogenous fusions in the baseline cell line. The more details of the ranks of each spike-in fusion genes crossing over 20 samples are reported in Additional file 1: Table S5. We also report candidates for endogenous fusion of the spike-in dataset in Additional file 1: Table S6. These endogenous fusion candidates are replicated at least a half of samples (10 times).
Furthermore, in order to compare to other fusion detection methods, we plot operating characteristic (OC) curves in Fig. 3. First, we extract the number of supporting reads of all fusion genes in the datasets. Then, we rank fusion genes and report the number of true positives discovered by FuSeq. For ranking, in a single sample, we sort fusion genes by their scores in decreasing order. To summarize the information from all samples of a dataset, we rank fusion genes by the maximum scores across the samples of the dataset. The results from FusionMap, TRUP, JAFFA, SOAPfuse and TopHat-Fusion are presented by crosses, circles, stars, diamonds and triangles in the plots.
For the breast-cancer and glioma datasets, we collect the list of ranked fusion genes from previous study [14] for JAFFA, SOAPfuse and TopHat-Fusion to build their OC curves. For melanoma and spike-in dataset, no ranking information of fusion gene available. The OC curve for all 20 samples of the spike-in dataset is not reported since we do not have information of the total number of fusion genes discovered by the these methods. As shown in the plots, the points of JAFFA, SOAPfuse and TopHat-Fusion generally follow the trends of the curves. However, FuSeq potentially discovers more true positives than JAFFA and TopHat-Fusion and less false positives than SOAPfuse. The OC curves of two methods TopHat-Fusion and SOAPfuse are worse than both JAFFA and FuSeq. In the glioma dataset, the OC curve of FuSeq is better than that of JAFFA. For the breast-cancer dataset, FuSeq are better and competitive with JAFFA at the beginning and the end of the OC curves. Moreover, in both TP27 and TP99 datasets, FuSeq discovers more true positives than JAFFA.

Verification using de novo assembly
We illustrate the de novo assembly for one sample from each dataset. In each dataset, we select the sample with the highest number of validated fusion genes detected by FuSeq: SRR064439 (7.9M read-pairs) from the breast-cancer dataset, SRR018266 (14.9M read-pairs) from the melanoma dataset, and SRR934930 (28.7M readpairs) from the glioma dataset. For the spike-in dataset, for consistency with previous works, we select sample SRR1659964 (93.9M read-pairs). The numbers of supporting read pairs from fusion-gene candidates used as input into Trinity software are 321, 165, 310 and 481 for the breast-cancer, melanoma, glioma and spike-in samples respectively. The results are summarized in Table 3.
The numbers of fusion-gene candidates and the validated fusion genes discovered by methods are presented TP= true positive fusion genes, Total= total discovered fusion-gene candidates in rows Total and TP (true positive), respectively. The results in column 'FuSeq + de novo assembly' indicate the fusion-gene candidates supported by contigs from de novo assembly. For both breast-cancer and melanoma datasets, only a few fusion-gene candidates are verified by the assembly, but all of them are true positive. For the long read datasets (the glioma and spike-in datasets), all validated and spike-in fusion genes discovered by FuSeq are verified again by the de novo assembly. It also introduces the evidences of contigs from 14 and 3 extra fusion genes for the glioma dataset and the spike-in dataset, respectively. These examples show that de novo assembly increases the specificity of the fusion discovery, but a sufficient number of supporting reads is needed to achieve similar sensitivity as the mapping-based approach. In general, de novo assembly is a useful step to add confidence in a fusion event if the fusion gene is associated with a contig. However, if the number of supporting reads are too low (low-abundant fusion genes), or there are no split read input, there might be not enough information to build a contig. Therefore, a true positive might not be supported by a contig. We can also use this result as an indication that in general a de novo assembly-based method is not a good detection method for low-abundant fusion genes.

Computational time
The computational cost of FuSeq comes from two steps including (i) quasi-mapping and (ii) statistical tests and filtering. The first step gets the benefit from the excellent efficient performances of the light-weight hash-tablebased quasi-mapping process from Rapmap [26] for speed and memory usage. The time and memory of the second step is linear to the number of reads supporting the fusion genes candidates which is usually proportional to the sample's library size. Figure 4 demonstrates that our method computational time is linear to the number of reads of samples. Similar to other methods, FuSeq keeps all the reads of fusion candidates that can be helpful for users. Therefore the storage requirements of FuSeq is also linear to the number of reads.
Due to the speed advantage of the quasi-mapping step [26] and the fusion equivalence class structure, FuSeq shows an excellent performance in computational time, see Fig. 4 and Additional file 1: Table S7. The time starts from the processing of the FASTQ file until the production of final candidates. For small datasets such as the breast-cancer and melanoma datasets, it takes 4-6 CPU hours to finish 6 samples (less than an hour per sample). It slightly increases to 1.15 h in average to process the longread glioma dataset. It takes 45.47 CPU hours to complete all 20 samples of the big spike-in dataset (an average of 112M 100bp read-pairs per sample), an average of 2.27 CPU hours per sample.
We now compare this to other methods based on a recent report [9], where they compare 15 fusion detection methods for a single prostate cancer 171T sample (118M 100bp read-pairs). As reported, these methods require from 7.3 up to 240 h to analyse the sample. Among those, TopHat-Fusion, SOAPfuse and JAFFA need 42.7, 148.75 and 154 h respectively. Figure 4 presents the comparison between FuSeq and the three methods by the average computational time per sample over sample sizes. Thus, FuSeq provides a significant improvement (∼19, 66 and 68 times faster than TopHat-Fusion, SOAPfuse and JAFFA, respectively) in computational time.
We further compare performace times of FuSeq with FusionMap [15] known as the fastest methods in the comparison study, and TRUP [5] reported with similar total time compared to FusionMap and TopHat using their datasets. We select sample SRR1659964 containing 94 million reads and run three methods on this sample. The results show FuSeq requires 1.83 h, which is more than 2-fold faster than FusionMap (4.08 h), while TRUP requires a lot more time (136.04 h). These results are presented in Fig. 4 and details of computational time and memory usage are given in Additional file 1: Table S8.

Discussion and conclusion
We have developed a novel method called FuSeq for fast and accurate discovery of fusion genes from RNA-seq data. The experiments of the method on four different real datasets with validated fusion genes reveal that FuSeq compares well against TopHat-Fusion, SOAPfuse and JAFFA in terms of sensitivity, specificity and F1 score. FuSeq also substantially improves on the computational time compared to the other fusion detection methods. Overall, FuSeq makes it easier to perform fusion gene discoveries from large RNA-seq datasets, e.g. involving large numbers of samples.