NMFP: a non-negative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNA-seq data

Background The advent of next-generation RNA sequencing (RNA-seq) has greatly advanced transcriptomic studies, including system-wide identification and quantification of mRNA isoforms under various biological conditions. A number of computational methods have been developed to systematically identify mRNA isoforms in a high-throughput manner from RNA-seq data. However, a common drawback of these methods is that their identified mRNA isoforms contain a high percentage of false positives, especially for genes with complex splicing structures, e.g., many exons and exon junctions. Results We have developed a preselection method called “Non-negative Matrix Factorization Preselection” (NMFP) which is designed to improve the accuracy of computational methods in identifying mRNA isoforms from RNA-seq data. We demonstrated through simulation and real data studies that NMFP can effectively shrink the search space of isoform candidates and increase the accuracy of two mainstream computational methods, Cufflinks and SLIDE, in their identification of mRNA isoforms. Conclusion NMFP is a useful tool to preselect mRNA isoform candidates for downstream isoform discovery methods. It can greatly reduce the number of isoform candidates while maintaining a good coverage of unknown true isoforms. Adding NMFP as an upstream step, computational methods are expected to achieve better accuracy in identifying mRNA isoforms from RNA-seq data. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-2304-8) contains supplementary material, which is available to authorized users.


Contradicting bins
We define bins as two-dimensional vectors that describe the exon indices of the starting and ending positions of mapped reads (single-ended reads or paired-end reads decomposed into two ends). For example, Bin (4,4) contains reads whose starting and ending positions Figure 1: Definition of the subexon are both in Subexon 4. For reads that cannot originate from the same transcript, their corresponding bins are mutually exclusive. We call them contradicting bins. For example, Bins (4,4) and (3,5) are contradicting bins, because Bin (4,4) indicates the existence of Subexon 4 but Bin (3,5) indicates the skipping of Subexon 4.

Decomposing isoforms candidates containing contradicting bins
After non-negative matrix factorization (NMF) is completed, a basis matrix W would be obtained, and each column of W represents an isoform candidate (See the main text). However, isoform candidates may contain contradicting bins, and such candidates cannot be true isoforms. To resolve this issue without losing possibly true isoforms, we decompose an isoform candidate with two contradicting bins into two isoform candidates, each containing one of the two bins. We use Figure 1 as an example. Suppose an isoform candidate contains contradicting Bins (4,4) and (3,5), which indicate contradicting status of Subexon 4. Suppose all the other bins are non-contradicting and indicate the existence of Subexons 1, 2, 3, 5, 6, and 7. Then we decompose the isoform candidate into two candidates: 1111111 and 1110111, where the former contains all subexons and supports Bin (4,4) while the latter excludes Subexon 4 and supports Bin (3,5). This procedure is to reduce our chance of missing true isoforms. ||V − W H|| F and additional orthogonality constraint on H, i.e., H T H = I, NMF can be regarded as one type of K-means clustering on the bins (rows of V ) with non-negativity constraint. The reason is that the purpose of NMF is to cluster bins into bin groups, which are sub-structures of isoforms and can form into multiple isoforms including the true ones. This motivated us to use the gap statistic, a method for choosing the number of cluster K in K-means clustering, to select the rank of NMF. Gap statistic was proposed by Tibshirani et al. [2] and has since been a widely used metric for choosing K in K-means clustering because of its good performance in estimating the number of well separated clusters.

K-means clustering
Suppose there are n p-dimensional data points, X 1 , X 2 , . . . , X n ∈ R p and the goal is to cluster them into K clusters C 1 , . . . , C K . Given K, K-means clustering would assign the n data points to K clusters, i.e., find the cluster memberships C 1 , . . . , C K by minimizing the following objective function arg min where µ r is the mean of cluster C r , which is a subset of the n data points. Formula (1) is equivalent to arg min n r is the number of points in cluster C r and d ij is the distance between X i and X j , i.e. ||X i −X j ||. There are many choices for the distance metric, such as the Euclidean distance. The objective funciton W K = K r=1 1 2nr i,j∈Cr d ij is the within-cluster variance, which is a basic statistic for determining K.

Gap statistic
Gap statistic is defined as Gap n (k) = E * k [log(W k )]−log(W k ). The first term is the expected W k under a reference distribution with no clusters, and the second term is the observed W k . The idea is to choose the number of clusters as the value of k that leads to the largest Gap n (k). To estimate E * k [log(W k )], the simplest reference distribution is the uniform distribution in all the p dimensions over the range of the observed data. The gap statistic algorithm sketched below is from the original gap statistics paper [2].
1. Vary the number of clusters k = 1, . . . , T , and cluster the data X 1 , . . . , X n by Kmeans clustering into k clusters, resulting in W k , k = 1, . . . , T , where T is the upper bound on k.

Application of gap statistic to NMF rank determination
NMF is a way of K-means clustering that clusters the bins with similar expression levels into the bin groups, i.e., splicing structures that can be reconstructed into isoforms. In most cases, the number of bin groups is close to the number of isoforms. For exmple, assume there is a 5-subexon gene with 3 isoform, 11111, 11011 and 11101. The relative abundance of the three isoforms are 50%, 35% and 15% respectively. Then the subexons have relative expression levels as 100%, 100%, 65%, 85% and 100% sequentially. Therefore, Subexons 1, 2 and 5 will be clustered into one bin group, while Subexon 3 and Subexon 4 will each be clustered as one bin group respectively. In this example, both the number of isoforms and the number of bin groups are 3. For genes with more complicated splicing structures, the number of bin groups may be more than the number of isoforms. In such cases, our estimated number of bin groups,K from gap statistic, could be larger than the number of true isoforms. However, from our simulation results, we observed that NMFP is not sensitive to the NMF rank choice and performs reasonably well as long as the rank is no less than the number of annotated isoforms. (See the section Low sensitivity of NMFP to ranks in the main text.) Combined with the fact that gap statistic tends to be conservative [2], the NMF rank should be better chosen as larger thanK, the number of clusters chosen by the gap statistic on V . In our results, we chose the NMF rank aŝ K + 1. Continued from the main text, here we attach two more simulation examples to illustrate that NMFP is not sensitive to the choice of NMF rank. In Figure 2(a), Gene FBgn0259821 has three annotated isoforms (Ensemble BDGP6 of release 80) with 13 subexons. NMFP is able to capture all the annotated isoforms (recall rate = 1) regardless of the rank choices. The precision rate of NMFP is 1 when the rank equals 2. Although it decreases when the rank increases to 3 because higher ranks would lead to more isoform candidates, it becomes relatively stable after rank equals 4. In Figure 2(b), Gene FBgn0037215 has 5 annotated isoforms with 12 subexons. NMFP has stable performance across all the rank choices.

Detailed information of genes on chromosome chr1 of Mus musculus
In the section Simulation results in Mus musculus in the main text, we did another simulation to demonstrate the performance of NMFP on mouse transcriptome. Apart  from what has been already stated in the main text, some supplementary detail ( Table  2) is provided here about the genes we selected to work on from chromosome chr1 of Mus musculus (reference genome mm10 and annotation GRCm38 of release 81). Here, a brief description is given as following about the parameters for flux simulator to simulate the 100 samples. The 100 samples are equally split into 10 groups, Group 1 (Sample 1, Sample 2, . . . , Sample 10), Group 2 (Sample 11, Sample 12, . . . , Sample 20), . . . , Group 10 (Sample 91, Sample 92, . . . , Sample 100). For Group i, NB MOLECULES for Expression step is (20 + i) · 200000. For the (10 · (i − 1) + j) th sample within Group i, READ NUMBER for Sequencing step is (10 + j) · 1000000 ( Table 1). All the samples share other parameters such as that they all use paired-end reads with length of 2 × 76 bp.

Increased robustness of NMFP+SLIDE to the choices of parameter λ (More results)
Continued from the main text, here we include two more simulation results to show that NMFP can help SLIDE achieve better isoform discovery accuracy at lower values of λ, the regularization parameter used in the LASSO step in SLIDE. Hence, the choice of a proper value for λ becomes an easier task for SLIDE+NMFP than for SLIDE. In Figure 3  as 4. NMFP+SLIDE has much higher F scores than SLIDE for λ < 0.04. In Figure 3 (b), Gene ENSMUSG00000025930 has 4 annotated isoforms with 8 subexons. The NMF rank is set as 4. We also observe that NMFP+SLIDE has better performance than SLIDE especially when λ < 0.015. Since NMFP can largely reduce the isoform candidate pool for SLIDE, it is recommended to use a small λ value for NMFP+SLIDE.

Real data case study (More results)
Continued from the main text, we use another two cases to show that NMFP has good performance on real data. In Figure 4, RNA-seq reads for gene FBgn0019936 were generated by the modENCODE consortium [3] from D. melanogaster L3 stage larvae and 12 hours post-molt (SRA accession: SRS004682; see the Supplemental Material "Updated Table S2.xlsx" in [4] for more details). This gene has 1 annotated isoform (shown in orange), which is well supported by the RNA-seq reads (shown in gray). Cufflinks alone connected the latter three exons together with the introns in between into one piece (shown in light blue). NMFP+Cufflinks accurately assembled the annotated isoform and recovered another isoform (shown in dark blue), which reflects the low read counts of the Exon 3. Similarly, NMFP+SLIDE at λ = 0.2 ("more", shown in dark green) and λ = 0.01 ("fewer", shown in dark red) achieved better isoform discovery results than their SLIDE counterparts (shown in light green and light red). In Figure 5, RNA-seq reads for gene FBgn0038145 were also generated by the modENCODE corsortium from the heads of mated female D.melanogaste after 1 day of eclosion (SRA accession: SRR070434, SRR070435 andSRR100279; see the Supplemental Material "Updated Table S2.xlsx" in [4] for more details). FBgn0038145 has a complicated splicing structure and 5 annotated isoforms (shown in orange). Cufflinks alone assembled one transcript (shown in light blue) similar to the first annotated one except that part of Exon 1 is missed. NMFP+Cufflinks identified 4 isoforms (shown in dark blue) among which 2 are annotated. NMFP also improved the performance of SLIDE at both λ = 0.2 ("more", shown in light and dark green) and λ = 0.01 ("fewer", shown in light and dark red). One significant contribution of NMFP to Cufflinks and SLIDE is capturing Exon 2, which is missed by Cufflinks and SLIDE alone because of its low read coverage compared to the other exons.

Summary of 74 real data samples
Tables 3-10 summarize the description of the 74 D. melanogaster RNA-seq data sets [3,4] we use in our real data case study.