Volume 17 Supplement 1
Selected articles from the Fourteenth Asia Pacific Bioinformatics Conference (APBC 2016): Genomics
NMFP: a nonnegative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNAseq data
 Yuting Ye^{1} and
 Jingyi Jessica Li^{2, 3}Email author
DOI: 10.1186/s1286401523048
© Ye and Li. 2015
Published: 11 January 2016
Abstract
Background
The advent of nextgeneration RNA sequencing (RNAseq) has greatly advanced transcriptomic studies, including systemwide identification and quantification of mRNA isoforms under various biological conditions. A number of computational methods have been developed to systematically identify mRNA isoforms in a highthroughput manner from RNAseq data. However, a common drawback of these methods is that their identified mRNA isoforms contain a high percentage of false positives, especially for genes with complex splicing structures, e.g., many exons and exon junctions.
Results
We have developed a preselection method called “Nonnegative Matrix Factorization Preselection” (NMFP) which is designed to improve the accuracy of computational methods in identifying mRNA isoforms from RNAseq data. We demonstrated through simulation and real data studies that NMFP can effectively shrink the search space of isoform candidates and increase the accuracy of two mainstream computational methods, Cufflinks and SLIDE, in their identification of mRNA isoforms.
Conclusion
NMFP is a useful tool to preselect mRNA isoform candidates for downstream isoform discovery methods. It can greatly reduce the number of isoform candidates while maintaining a good coverage of unknown true isoforms. Adding NMFP as an upstream step, computational methods are expected to achieve better accuracy in identifying mRNA isoforms from RNAseq data.
Keywords
mRNA isoform discovery Nextgeneration RNA sequencing (RNAseq) Nonnegative matrix factorization (NMF) Cufflinks SLIDEBackground
In molecular biology and transcriptomics, an important question is to understand the alternative splicing process, a key step of gene transcription in diverse species, from invertebrates to mammals. Alternative splicing generates at least two mRNA isoforms (i.e., transcripts) for one gene, and these mRNA isoforms as well as their protein products can play different roles in various biological phenomena. For example, studies have found that aberrant structures or abundance of mRNA isoforms can cause various human diseases [1, 2]. Hence, systematic identification and quantification of mRNA isoforms can greatly help understand gene regulation mechanisms for both basic biology and translational medicine. For decades, mRNA isoforms of individual genes have been accurately identified and quantified by lowthroughput technologies such as cDNA cloning and qPCR. However, it was impossible to conduct this task on a systemwide scale for thousands of genes simultaneously until the invention of highthroughput technologies.
More than a decade ago, microarray technologies established a highthroughput platform for identifying and quantifying mRNA isoforms of genes with known sequences. SPACE [3, 4] is a method using nonnegative matrix factorization (NMF) to predict mRNA isoforms and estimate their abundance from microarray data. NMF is a popular pattern recognition method and is distinguished from other matrix factorization methods by its use of nonnegativity constraints [5]. A main advantage of NMF is its interpretable factorization results, which are positive and sparse combinations of features. In the context of mRNA isoform discovery, SPACE uses NMF to find isoforms as combinations of exons and exon junctions. SPACE was demonstrated to have good power in detecting mRNA isoforms under ideal simulation settings. However, many open questions remain about how to use NMF to accurately find mRNA isoforms from real microarray data. Reasons for these questions include the following issues inherent with NMF and microarray. First, in many scenarios NMF outputs nonunique factorization results [6], making SPACE identify ambiguous mRNA isoforms. Second, NMF requires a prespecified matrix rank for factorization results, and how to determine this rank remains a difficult question [7–9]. Third, microarray can only detect exons and exon junctions with at least partially known sequences, and thus it cannot provide information on novel exons or exon junctions. This hinders SPACE from discovering novel isoforms that contain novel exons or exon junctions.
Compared to microarray, the more recent nextgeneration RNA sequencing (RNAseq) technologies yield deeper and wider insights into transcriptomes at base level resolution [10]. RNAseq measures the expression of genomewide exons and exon junctions, regardless of prior knowledge on their existence. Hence, unlike microarray, RNAseq enables the identification of novel mRNA isoforms containing previously unknown exons or exon junctions. Despite its many advantages, a major drawback of Illumina RNAseq, the most popular RNAseq platform, is that it can only generate RNAseq reads with short lengths (≤2×250 bp for pairedend reads). Such short RNAseq reads cannot cover a fulllength mRNA isoform, making isoform identification a difficult computational problem. Since the advent of RNAseq, various computational methods have been developed to discover mRNA isoforms from RNAseq data. For the species with goodquality reference genomes, including human and model organisms, mainstream isoform discovery methods (e.g., Cufflinks [11], Scripture [12], IsoLasso [13] and SLIDE [14]) use RNAseq reads first mapped to a reference genome. In this paper, our discussion focuses on these socalled “refernecebased methods”, which attempt to identify fulllength mRNA isoforms from mapped RNAseq reads using various approaches. For example, Cufflinks, Scripture and IsoLasso first build graphs of exonic regions containing mapped reads, and then find mRNA isoforms as the maximal paths in the graph based on a parsimony assumption. These three methods are also often referred to as “de novo methods,” because their isoform identification approach can completely rely on RNAseq reads without using existing annotations. Another referencebased method SLIDE uses a different approach. It constructs a search space of possible isoforms by using known boundaries of genes and exons from existing annotations, with the possible addition of novel genes and exons inferred from RNAseq data. SLIDE then identifies isoforms from the search space based on a regularized linear model, which accounts for the RNAseq read generating mechanism and read distributions. Because of its use of annotations, SLIDE is also called an “annotation aided method”, in contrast to the previous “de novo methods.” Note that the recent versions of Cufflinks also added an annotation aided option, which can simulate RNAseq reads from annotations and use the simulated reads together with RNAseq data for isoform discovery [15]. Both de novo and annotation aided methods have advantages and drawbacks. The graphbased de novo methods are more robust to RNAseq data noise thanks to their use of the parsimony assumption, which, however, also prevents them from finding overlapping isoforms. SLIDE, on the other hand, is capable of finding overlapping isoforms but is more sensitive to RNAseq data noise. Despite their differences, a common obstacle for both types of methods is the large search space of possible mRNA isoforms for genes with complex splicing structures (i.e., many exons). For the graphbased de novo methods, a gene with many exons will lead to a graph with large numbers of nodes and edges, making it difficult to search for isoforms as maximal paths. For SLIDE, a gene with n exons will give rise to a total of 2^{ n }−1 possible isoforms, placing great difficulty on the regularized linear model to find the correct isoforms. Although SLIDE is coupled with a naïve preselection procedure to filter out the isoform candidates with unsupported exon junctions by RNAseq reads, this procedure is not ideal for lowly expressed isoforms, whose exon junctions may not be supported by reads due to the low read coverage.
Given the wide existence of RNAseq noise and the large search space of possible isoforms, existing computational methods for isoform discovery have unsatisfactory performance in many cases. In an evaluation conducted by the RGASP consortium [16], Cufflinks only achieved precision rates of 23 %, 50 % and 47 % and recall rates of 18 %, 40 % and 39 % for discovering fulllength mRNA isoforms from H. sapiens, D. melanogaster, and C. elegans RNAseq data respectively. For SLIDE, the corresponding precision rates were 37 %, 80 % and 63 %, and the corresponding recall rates were 7 %, 49 % and 59 %. This evaluation showed that both Cufflinks and SLIDE, as well as other evaluated methods, have low accuracy for discovering human isoforms, mainly because many human genes have large numbers of exons and thus a huge search space of possible isoforms. In statistical literature, for highdimensional linear models with far more features than observations, it was found that adding a screening procedure to filter out excessive features in the search space can improve the feature identification performance of Lasso [17]. Since Lasso was used in SLIDE to find isoforms as features, we are motivated to reduce the search space of possible isoforms by designing a screening (or preselection) procedure, so as to improve the performance of SLIDE and hopefully other isoform identification methods.
Also motivated by SPACE and the good properties of NMF, we propose a preselection method “Nonnegative Matrix Factorization Preselection” (NMFP) to improve the accuracy of referencebased isoform discovery methods. Our goal is to use NMFP to greatly reduce the search space of isoforms, and have the selected isoform candidates cover most of the true isoforms. NMFP employs NMF to select isoform candidates for a given gene by decomposing its RNAseq read count matrix. We demonstrate the performance of NMFP by simulation and real data studies. In the simulation, we generate RNAseq reads from D. melanogaster (fly) and M. musculus (mouse) using Flux Simulator [18] and show that NMFP can greatly improve the accuracy of Cufflinks and SLIDE in isoform discovery. We also show the efficacy of NMFP on discovering D. melanogaster isoforms from real RNAseq data. Moreover, we evaluate the robustness of NMFP to the choice of NMF ranks, the number of input samples in the read count matrix, and gene expression levels. In addition, we show that NMFP can increase the robustness of SLIDE to its regularization parameter choice.
Methods
Our proposed method NMFP shares the same root as SPACE, because they both use NMF to identify mRNA isoforms. However, unlike SPACE that aims to directly identify isoforms via NMF, NMFP has a less ambitious goal, which is to identify a set of good isoform candidates for downstream isoform discovery methods. To achieve this goal, in NMFP we use highfrequency filtering to alleviate the nonuniqueness issue of NMF results, and design a model selection approach to choose the NMF rank. Moreover, we develop a new NMF algorithm to penalize the isoforms that have conflicting exons or exon junctions.
Data processing
Suppose there are m RNAseq data sets (samples) and a gene has n subexons, which were defined as the exonic regions between two adjacent splicing sites in SLIDE [14]. For this gene, we first categorize RNAseq reads in every sample into bins, which are defined as twodimensional vectors (k, l), representing the mapped reads (singleend reads or two ends of pairedend reads) with starting position in the kth subexon and ending position in the lth subexon, 1≤k≤l≤n. Hence, this gene has a total number of \(p = n + {n \choose 2}\) bins. We count the numbers of reads in every bin in all samples and generate a read count matrix U with dimensions p×m, with rows corresponding to bins and columns corresponding to samples
where U _{ ij } is the (i,j)th entry of U, representing the read count of the ith bin in the jth sample, R _{ j } is the total number of reads mapped to this gene in the jth sample, and P _{ i } is the probability that a read falls into the ith bin (0≤P _{ i }≤1 and \(\sum _{i=1}^{p} P_{i}=1\)). We assume that P _{ i } is proportional to the length of the ith bin. Although this assumption can be violated by the nonuniformity of read distribution due to alternative splicing and sequencing noise, it provides an approximate and simple solution to removing bin length effects from the read count matrix.
 1.Scale the read counts of each sample inverse proportionally with its total number of readswhere \(\bar {R}=\frac {1}{m}\sum _{j=1}^{m} R_{j}~\).$$U'_{ij}=U_{ij} \cdot \frac{\bar{R}}{R_{j}}~,~i=1,2,\ldots,p~;~j=1,2,\ldots,m~, $$
 2.Further scale the read counts of each bin inverse proportionally with its bin lengthand denote V=(V _{ ij }), where V _{ ij } represents the normalized read count of the ith bin in the jth sample.$$V_{ij}=U'_{ij} \cdot \frac{1}{p P_{i}}~,~i=1,2,\ldots,p~;~j=1,2,\ldots,m~, $$
Matrix decomposition model
where V is a p×m matrix representing the intensities of p probes in m samples; A is a p×p diagonal matrix representing the affinity of p probes; G is a p×s indicator matrix with binary elements indicating the existence of the p probes in s transcripts; T is an s×m matrix representing the abundance of s transcripts in m samples; ε is a p×m error matrix.
where V is the p×m normalized reads count matrix (from last section “Data processing”), and W is a p×s isoform composition matrix, whose rows and columns are bins and mRNA isoforms respectively. Each column of W is a pdimensional vector with elements ∈ [ 0,1], indicating the existence of bins in an isoform, with the values subject to RNAseq noise and biases. Theoretically, all the existing bins in an isoform should have the same read density assuming that RNAseq reads follow a uniform distribution in that isoform. However, due to RNAseq noise and biases, the uniformity assumption is often violated, and nonexisting bins may contain erroneously mapped reads. To account for these issues, we model the existence of bins in an isoform as values between 0 and 1, where 1 indicates a bin having the maximal read intensity among all the bins, and 0 indicates a bin containing no mapped reads. H is a s×m isoform abundance matrix, whose rows and columns are mRNA isoforms and samples respectively. Each row of H is an mdimensional vector representing the isoform’s abundance levels in the m samples. Based on our definition of W, h _{ ij } (the abundance value of the ith isoform in the jth sample) should be equal to the maximum bin abundance of the ith isoform in the jth sample. ε is an error matrix.
where A is a p×p diagonal matrix with elements ∈ [ 0,1], representing the biases of read counts in different bins. Such biases are attributable to the features of the bins, such as GC contents and the bins’ positions in transcripts. G is a p×s indicator matrix with binary values to describe the existence of bins in isoforms. Each column of G represents an isoform.
Modified NMF approach
In other words, \(\hat {A}_{\textit {ii}}\), the bias factor of read counts in the ith bin, is considered as \(\max _{k} (\hat {W}_{\textit {ik}})\), the largest “adjusted existence” of the ith bin in all isoforms.
where I(·) is an indicator function, and c∈IR is a threshold to make \(\hat {\mathbf {G}}\) a binary matrix. We used c=0.4 in this paper.
Prior to that, we estimate W by decomposing V using NMF. There exist many available NMF algorithms [5, 20–22], which have different preferences over the factorization results. Since our goal is to find isoform candidates by decomposing V, one constraint we need is the compatibility of bins in every column of \(\hat {\textbf {W}}\), i.e., every isoform found by NMF. For example, Bin (1,3) and Bin (2,2) are conflicting and should not coexist in an isoform, because the former indicates that Subexon 2 is skipped, while the latter implies the existence of Subexon 2.
where p is the number of bins, m is the number of samples, and α is a regularization parameter controlling the weight of the penalty term (e.g., α=0.1). i⊥j denotes that Bin i and Bin j are conflicting (See Section 1 in the Additional file 1 for more detail). We estimate W by minimizing the objective function D(V, WH) subject to a prespecified s=rank(W), which is equivalent to the number of columns of W. How to determine the rank remains an open question. In NMFP, we choose the rank by gap statistic, which was originally designed to select the number of clusters in kmeans clustering [23] (See Section 2 in the Additional file 1 for more detail). After obtaining \(\hat {\textbf {W}}\), we use Eqs. (6) and (7) to find \(\hat {\textbf {G}}\), which indicates a set of isoform candidates.
The penalty term in the objective function (8) cannot fully prevent the coexistence of conflicting bins in every isoform (i.e, column in \(\hat {\textbf {G}}\)). To resolve this issue and recover potentially true isoforms, we consider both the existing and skipped statuses of the ambiguous subexons that lead to conflicting bins. For example, if both Bin (1,3) and Bin (2,2) exist, Subexon 2 is ambiguous. We will consider the cases of Subexon 2 being existing and skipped as two different isoforms. If there are n ambiguous subexons in an isoform candidate, 2^{ n } isforms will be considered, but only those whose subexons and subexon junctions are supported by RNAseq reads will be kept as isoform candidates.
Aggregation of multiple NMF runs
To address the issue of nonunique NMF results, we use a highfrequency filtering approach to find isoform candidates identified in a number of NMF runs. Specifically, we implement the NMF estimation procedure described in the previous two subsections for N times (e.g. N=100) and aggregated the identified isoform candidates into a candidate pool. In the filtering step, we first remove the isoform candidates, which contain subexon junctions not supported by reads, from the pool. Then among the rest of candidates in the pool, we only retain the highfrequency ones, which are selected in at least r NMF runs (e.g., r=20), as final isoform candidates.
Results
We conduct simulations and realdata analyses in many scenarios to demonstrate the effectiveness of NMFP in improving the performance of downstream isoform discovery methods. The following results show that NMFP can help Cufflinks and SLIDE achieve better isoform discovery accuracy. We also study the robustness of NMFP to the choice of NMF rank, the number of input samples in the read count matrix, and the gene expression levels.
Simulation results in D. melanogaster
Profiles of 4125 genes in chr3R of D. melanogaster
# of subexons n  n≤2  3≤n≤10  11≤n≤15  16≤n≤20  n>20  Sum 
1618 (39.2 %)  2132 (51.7 %)  232 (5.6 %)  79 (1.9 %)  64 (1.6 %)  4125  
# of annotated isoforms q  q=1  q=2  q=3  q=4  q>4  Sum 
2450 (59.4 %)  878 (21.3 %)  356 (8.6 %)  180 (4.4 %)  261 (6.3 %)  4125 
To check the effects of NMFP on improving the isoform discovery performance of SLIDE, we define a new procedure called “NMFP + SLIDE”, which combines NMFP and SLIDE by inputting the isoform candidates found by NMFP into SLIDE. In the original SLIDE method, 2^{ n }−1 possible isoforms are enumerated for an nsubexon gene, and SLIDE uses a simple preselection to filter out the isoforms with subexons or subexon junctions not supported by reads. NMFP + SLIDE replaces this preselection step by NMFP.
Similarly, we define “NMFP + Cufflinks”, which inputs the isoform candidates found by NMFP as a GTF file into Cufflinks and uses Cufflinks with the “–GTF” option to estimate the abundance of these candidates. The candidates with nonzero estimated abundance are kept as the isoforms identified by NMFP + Cufflinks. This is different from the original Cufflinks, because it does not rely on a connectivity graph or the parsimony assumption as Cufflinks does.

Cufflinks

NMFP + Cufflinks

SLIDE(fewer)

NMFP + SLIDE(fewer)

SLIDE(more)

NMFP + SLIDE(more)

NMFP
Evaluation at nucleotide level
We first compare the seven methods at the nucleotide level, that is, how many nucleotides are shared by the identified isoforms of each method and the true isoforms in the annotation. For each identified isoform, we match it to the annotated isoform that shares the most nucleotides with it. We rank the identified isoforms by their percentages of overlapping nucleotides with their matched annotated isoforms, from the highest percentage to the lowest. If there are k identified isoforms and r annotated isoforms and k>r, only the top r identified isoforms would be paired with an annotated isoform, and the rest k−r identified isoforms would be counted as false positives with zero precision rates. For each identified isoform paired with an annotated isoform, we calculate its precision rate as \(\frac {\#\text {overlapping nucleotides with the paired annotated isoform}}{\# \text {nucleotides in the identified isoform}}\). The precision rate of a gene was defined as the average of the individual precision rates of its identified isoforms.
Similarly, we match each annotated isoform with an identified isoform. We calculate the recall rate of each annotated isoform with a paired identified isoform as \(\frac {\#\, \text {overlapping nucleotides with the paired identified isoform}}{\#\, \text {nucleotides in the annotated isoform}}\). If there are more annotated isoforms than identified ones, the unpaired annotated isoforms would have zero recall rates. For a given gene, its recall rate is defined as the average of the individual recall rates of its annotated isoforms. Given the precision and recall rates of a gene, we calculate the F score as the their harmonic mean. That is, F score \(= \frac {2}{1/\text {precision}+1/\text {recall}}\).
In terms of F scores, which combine precision and recall rates, NMFP is shown to have greatly improved the performance of Cufflinks, by increasing the F scores of approximately 0.6 for Cufflinks to more than 0.8 for NMFP + Cufflinks. Strikingly, with the addition of NMFP, Cufflinks outperforms SLIDE, which originally had better F scores than Cufflinks without NMFP [14]. NMFP + Cufflinks becomes the top performer in F scores. On the other hand, NMFP does not much improve the performance SLIDE(more) and even slightly deproves the performance SLIDE(fewer) at the nucleotide level. The reason is that the original SLIDE has reasonably good performance at the nucleotide level, though its identified isoforms are often similar to but not exactly the annotated isoforms (see the subsection “Evaluation at transcript level”).
Evaluation at exon level
We next evaluate the performance of these seven methods at the exon level. If an exon in an identified isoform overlaps at least 50 % of an annotated exon, we call the two exons “overlapping”. Similar to our matching scheme at the nucleotide level, we match identified isoforms with annotated isoforms at the exon level. Then for each identified isoform, its precision rate at the exon level is defined as \(\frac {\#\,\text {overlapping exons with the paired annotated isoform}}{\#\,\text {exons in the identified isoform}}\). Similarly for each annotated isoform, its recall rate is defined as \(\frac {\#\,\text {overlapping exons with the paired identified isoform}}{\#\,\text {exons in the annotated isoform}}\). Then for every gene, its precision and recall rates are calculated as the average precision rate of its identified isoforms and the average recall rate of its annotated isoforms. Its F score is the harmonic mean of its precision and recall rates. For every method in every sample, the precision, recall and F score are calculated by taking the average of 951 genes.
Evaluation at transcript level
Compared to the previous evaluation at nucleotide and exon levels, the evaluation at the transcript level is more important, as the ultimate goal of isoform discovery methods is to correctly identify fulllength mRNA isoforms. At the transcript level, if an identified isoform and an annotated isoform have every exon matched by the criterion we defined for matching at the exon level, we called these two isoforms “matched”. Then for every gene, its precision and recall rates at the transcript level are defined as \(\frac {\#\,\text {identified isoforms matched with annotated isoforms}}{\#\,\text {identified isoform}}\) and \(\frac {\#\,\text {annotated isoforms matched with identified isoforms}}{\#\,\text {annotated isoform}}\) respectively. Its Fscore is the harmonic mean of its precision and recall rates. For every method in every sample, the precision, recall and F score are calculated by taking the average of 951 genes.
Robustness of NMFP to the choice of NMF rank
Simulation results in M. musculus
We also evaluate the performance of NMFP in M. musculus (mouse) (reference genome mm10 and annotation GRCm38 of release 81), a mammalian model organism with more complex gene structures than D. melanogaster. We generate 100 RNAseq samples from mouse chr1 using Flux Simulator, with pairedend reads of length 2×76 bp. The 100 samples have 10 different numbers of RNA molecules and 10 different numbers of reads. The numbers of RNA molecules range from 4,200,000 to 6,000,000, increasing in steps of 200,000. The number of reads range from 11,000,000 to 20,000,000, increasing in steps of 1,000,000. More detail on the 100 samples is summarized in Section 3.2 and Table S1 of the Additional file 1. On chr1, there are 3432 genes, among which we are interested in 852 genes with 310 subexons and 217 annotated isoforms (more detail in Additional file 1: Table S2). In this simulation, we calculate precision and recall rates and F scores at the transcript level for every method, same as what we did in the simulation in D. melanogaster.
Increased robustness of NMFP + SLIDE to the choices of parameter λ
Robustness of NMFP to the number of samples
Simulation study on the effects of expression levels
Once the gene expression level goes above 3 RPKM, the number of needed candidates decreases to approximately 30 and has a reduced variance. This result demonstrates that for a gene with low expression in a sample, NMFP can take advantage of other samples to help reduce the number of isoform candidates of this gene. This property makes it possible for downstream isoform discovery methods to explore isoform structures of lowly expressed genes in a sample given the availability of other samples where the genes are more highly expressed.
Real data case study
We further demonstrate the performance of NMFP and compare the seven methods on 74 real RNAseq data sets of D. melanogaster (reference genome dm5 and annotation BDGP5 of release 66). Description of the 74 data sets is in Section 5 of the Additional file 1. Since we have no knowledge on the true isoforms existing in real RNAseq data sets, we interpret the results by studying a few genes in Integrative Genomics Viewer (IGV) [24, 25], where we plot the read distribution in an RNAseq data set (D. melanogaster pupae 3 days) and the isoforms identified by the seven methods.
Conclusion and discussions
NMF is inherently capable of providing interpretable results for the problem of mRNA isoform discovery. From an input matrix of RNAseq read counts, NMF can find a sparse structure to indicate which bins are likely to form one isoform. However, NMF alone is not competent for recovering isoforms from RNAseq data because of several reasons: (i) inaccurate read coverages and missing junction reads due to RNAseq biases and mapping difficulties, (ii) nonuniqueness of NMF solutions leading to ambiguous isoform discovered, (iii) conflicting bins coexisting in discovered isoforms making the results biologically invalid, and (iv) difficulty in determining the NMF rank. Those issues can be largely alleviated or resolved when NMF is used as a method to preselect isoform candidates rather than determining final isoforms. For (i), (ii) and (iv), since each nonunique NMF solution provides isoform candidates that may include some true isoforms among many false positives, aggregating multiple NMF solutions to find the highfrequency isoforms will increase the chance of discovering the true isoforms. We demonstrate in the simulation study of M. musculus, that the NMFP results are not sensitive to the choice of NMF ranks. We address (iii) by adding a penalty term to NMF and decomposing the invalid isoform candidates with conflicting bins into biologically valid isoforms.
We propose NMFP as a preselection method for improving the accuracy of downstream isoform discovery methods. NMFP takes input from annotated genes and exon boundaries (e.g. UCSC and Ensemble annotations) and/or de novo assemblies (e.g. Cufflinks output). It outputs a small pool of isoform candidates, which are expected to well cover true isoforms and can form a reduced search space for downstream methods. From a general perspective, combination of methods with opposing advantages and drawbacks may improve over individual methods. NMFP and SLIDE are both built on the idea of selecting isoforms from all possibilities, while Cufflinks assembles transcripts directly via an overlap graph. A possible reason for NMFP + Cufflinks to become the top performer is that it combines the strengths of two different approaches: selection and assembly. This may also explain why NMFP + SLIDE, which combines two selection based methods, is outperformed by NMFP + Cufflinks, though SLIDE alone outperforms Cufflinks without NMFP. However compared to SLIDE itself, NMFP + SLIDE still has improved performance at the transcript level, and even becomes more robust to the regularization parameter choices. Overall, we demonstrate by simulation and real data studies that NMFP is a useful tool for improving isoform discovery accuracy and robustness in many aspects.
Nevertheless, NMFP still encounters several problems and there remains substantial room for improvement. The noise and bias generated during RNAseq still has large influence on NMFP results. Some methods for denosing and debiasing may help the performance of NMFP [26, 27]. In addition, NMFP requires enough samples to implement NMF, for which normalization is important but difficult for samples generated by different platforms or from different sources. Normalization methods for RNAseq data (e.g., [28]) may improve the performance of NMFP. Moreover, the current NMFP algorithm has high computational complexity, and optimizing the algorithm is one of our top priority. We also intend to build a database of isoform candidates found by NMFP for different species, so that researchers can conveniently use them for downstream analysis without aggregating multiple samples and going through the whole NMFP pipeline.
Notes
Abbreviations
 NMF:

nonnegative matrix factorization
 NMFP:

nonnegative matrix factorization preselection
 mRNA:

messenger RNA
 cDNA:

complementary DNA
 qPCR:

quantitative polymerase chain reaction
Declarations
Acknowledgements
This work was supported by the startup fund of Department of Statistics at University of California, Los Angeles, and the Hellman Fellowship from the Hellman Foundation. The publication costs for this article were funded by the Hellman Foundation. We thank YuCheng T. Yang in School of Life Sciences, Tsinghua University and Zhiyi Qin and Dr. Xuegong Zhang in Department of Automation, Tsinghua University for their comments and help on this work. We also thank the three anonymous reviewers for their insightful comments that helped us improve our manuscript.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Faustino NA, Cooper TA. PremRNA splicing and human disease. Genes Dev. 2003; 17(4):419–37.View ArticlePubMedGoogle Scholar
 Wang GS, Cooper TA. Splicing in disease: disruption of the splicing code and the decoding machinery. Nat Rev Genet. 2007; 8(10):749–61.View ArticlePubMedGoogle Scholar
 Anton MA, Gorostiaga D, Guruceaga E, Segura V, Carmonasaez P, Pascualmontano A. SPACE : an algorithm to predict and quantify alternatively spliced isoforms using microarrays. Genome Biol. 2008; 9(2):46. doi:10.1186/gb200892r46.View ArticleGoogle Scholar
 Anton Ma, Aramburu A, Rubio A. Improvements to previous algorithms to predict gene structure and isoform concentrations using Affymetrix Exon arrays. BMC Bioinf. 2010; 11(1):578. doi:10.1186/1471210511578.View ArticleGoogle Scholar
 Lee DD, Seung HS. Learning the parts of objects by nonnegative matrix factorization. Nature. 1999; 401(6755):788–91. doi:10.1038/44565.View ArticlePubMedGoogle Scholar
 Donoho D, Stodden V. When Does NonNegative Matrix Factorization Give a Correct Decomposition into Parts? In: Advances in Neural Information Processing Systems: 2003.
 Brunet JP, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA. 2004; 101(12):4164–9. doi:10.1073/pnas.0308531101.View ArticlePubMedPubMed CentralGoogle Scholar
 Pascualmontano A, Carmonasaez P, Chagoyen M, Tirado F, Carazo JM, Pascualmarqui RD. bioNMF: a versatile tool for nonnegative matrix factorization in biology. BMC Bioinf. 2006; 9:1–9. doi:10.1186/147121057366.Google Scholar
 Fogel P, Young SS, Hawkins DM, Ledirac N, Hall F, Ne CS, et al.Inferential, robust nonnegative matrix factorization analysis of microarray data. Bioinformatics (Oxford, England). 2007; 23(1):44–9. doi:10.1093/bioinformatics/btl550.View ArticleGoogle Scholar
 Wang Z, Gerstein M, Snyder M. RNAseq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009; 10(1):57–63.View ArticlePubMedPubMed CentralGoogle Scholar
 Trapnell C, Williams Ba, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al.Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010; 28(5):511–5. doi:10.1038/nbt.1621.171.View ArticlePubMedPubMed CentralGoogle Scholar
 Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, et al.Ab initio reconstruction of cell type – specific transcriptomes in mouse reveals the conserved multiexonic structure of lincRNAs. Nat Biotechnol. 2010; 28(5):503–10. doi:10.1038/nbt.1633.View ArticlePubMedPubMed CentralGoogle Scholar
 Li W, Feng J, Jiang T. IsoLasso: a LASSO regression approach to RNASeq based transcriptome assembly (Extended abstract). Lect Notes Comput Sci (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2011; 6577 LNBI:168–88. doi:10.1007/9783642200366_18.Google Scholar
 Jessica J, Jiang Cr, Brown JB, Huang H, Bickel PJ. Sparse linear modeling of nextgeneration mRNA sequencing (RNASeq) data for isoform discovery and abundance estimation. PNAS. 2011. doi:10.1073/pnas.1113972108.
 Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using rnaseq. Bioinformatics. 2011; 27(17):2325–29.View ArticlePubMedGoogle Scholar
 Steijger T, Abril JF, Engström PG, Kokocinski F, The RGASP Consortium, Hubbard TJ, et al.Assessment of transcript reconstruction methods for RNAseq. Nat Methods. 2013; 10(12):1177–84. doi:10.1038/nmeth.2714.View ArticlePubMedGoogle Scholar
 Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B (Stat Methodol). 2008; 70(5):849–911.View ArticleGoogle Scholar
 Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigó R, et al.Modelling and simulating generic RNASeq experiments with the flux simulator. Nucleic Acids Res. 2012; 40(20):10073–83. doi:10.1093/nar/gks666.View ArticlePubMedPubMed CentralGoogle Scholar
 Li C, Wong WH. Modelbased analysis of oligonucleotide arrays: expression index computation and outlier detection. PNAS. 2001; 98:31–6. doi:10.1073/pnas.98.1.31.View ArticlePubMedPubMed CentralGoogle Scholar
 Li SZ, Hou X, Zhang H, Cheng Q. Learning Spatially Localized, PartsBased Representation. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference, vol. 1. IEEE: 2001. p. 207–212. http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=990477&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D990477.
 Feng T, Li SZ, Shum Hy, Zhang H. Local NonNegative Matrix Factorization as a Visual Representation. In: Proceedings of the 2nd International Conference on Development and Learning(ICDL’02): 2002. p. 178–183. http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1011835&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1011835.
 Pascualmontano A, Carazo JM, Member S, Kochi K, Lehmann D, Pascualmarqui RD, et al.Nonsmooth Nonnegative Matrix Factorization (ns NMF). IEEE Trans Pattern Anal Mach Intell. 2006; 28(3):1–13.View ArticleGoogle Scholar
 Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistics. J R Stat Soc Ser B (Stat Methodol). 2001; 63(2):411–23.View ArticleGoogle Scholar
 Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al.Integrative genomics viewer. Nat Biotechnol. 2011; 29(1):24–6.View ArticlePubMedPubMed CentralGoogle Scholar
 Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative genomics viewer (igv): highperformance genomics data visualization and exploration. Briefings in bioinformatics. 2013; 14:178–192. http://bib.oxfordjournals.org/content/14/2/178.long.View ArticlePubMedPubMed CentralGoogle Scholar
 Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010; 38(12):1–7. doi:10.1093/nar/gkq224.View ArticleGoogle Scholar
 Li J, Jiang H, Wong WH. Modeling nonuniformity in shortread rates in RNASeq data. Genome Biol. 2010; 11(5):50. doi:10.1186/gb2010115r50.View ArticleGoogle Scholar
 Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNAseq data using factor analysis of control genes or samples. Nat Biotechnol. 2014; 32(9):1–10. doi: 10.1038/nbt.2931.View ArticleGoogle Scholar