lncDIFF: a novel quasi-likelihood method for differential expression analysis of non-coding RNA

Background Long non-coding RNA (lncRNA) expression data have been increasingly used in finding diagnostic and prognostic biomarkers in cancer studies. Existing differential analysis tools for RNA sequencing do not effectively accommodate low abundant genes, as commonly observed in lncRNAs. Results We investigated the statistical distribution of normalized counts for low expression genes in lncRNAs and mRNAs, and proposed a new tool lncDIFF based on the underlying distribution pattern to detect differentially expressed (DE) lncRNAs. lncDIFF adopts the generalized linear model with zero-inflated Exponential quasi-likelihood to estimate group effect on normalized counts, and employs the likelihood ratio test to detect differential expressed genes. The proposed method and tool are applicable to data processed with standard RNA-Seq preprocessing and normalization pipelines. Simulation results showed that lncDIFF was able to detect DE genes with more power and lower false discovery rate regardless of the data pattern, compared to DESeq2, edgeR, limma, zinbwave, DEsingle, and ShrinkBayes. In the analysis of a head and neck squamous cell carcinomas data, lncDIFF also appeared to have higher sensitivity in identifying novel lncRNA genes with relatively large fold change and prognostic value. Conclusions lncDIFF is a powerful differential analysis tool for low abundance non-coding RNA expression data. This method is compatible with various existing RNA-Seq quantification and normalization tools. lncDIFF is implemented in an R package available at https://github.com/qianli10000/lncDIFF. Electronic supplementary material The online version of this article (10.1186/s12864-019-5926-4) contains supplementary material, which is available to authorized users.


Background
Long noncoding RNAs (lncRNAs) are transcripts longer than 200 nucleotides with no or limited protein-coding capability. It is estimated that, in the human genome, there are at least four times more lncRNA genes than protein-coding genes [1]. Currently, there are more than 14,000 human lncRNAs annotated in GENCODE (https://www.gencodegenes.org/). Overall, lncRNA genes have fewer exons, lower abundance and are under selective constraints compared to protein-coding genes. LncRNAs are involved in diverse regulatory mechanisms and in some critical pathways. For example, they can act as scaffolds to create higher-order protein complexes, as decoys to bind sequester transcription factors, and as guides of protein-DNA interactions [2][3][4]. Emerging evidence suggests that lncRNAs serve as essential regulators in cancer cell migration and invasion, as well as in other cancerous phenotypes [5,6]. Therefore, lncRNAs are becoming attractive potential therapeutic targets and a new class of biomarkers for the cancer prognosis and diagnosis. For example, the lncRNA PCA3 (prostate cancer antigen 3) is an FDA-approved biomarker for prostate cancer prediction. The overexpression of lncRNA HOTAIR in breast cancer patients is reported to be associated with patient survival and risk of metastasis [7]. Another important lncRNA ANRIL (CDKN2-AS1) is one of the most frequently alerted genes in human cancers and has been reported to increase the risks of diverse cancers.
Although a large number of lncRNAs have been identified, only a very small proportion of them have been characterized for cellular and molecular functions. Similar to protein-coding genes, the biomarker discovery of lncRNAs can start from a genome-wide differential expression (DE) analysis. One advantage of lncRNAs research in cancer is that we can leverage the large collection of previously published RNA-seq data and perform secondary analyses. Unlike the miRNAs counterparts, the expression of a large number of lncRNAs can be detected by standard RNA-seq with sufficient sequencing depth. Through downloading RNA-seq BAM files and recalling using GENCODE genomic coordinates, more than 8000 human tumor samples across all major cancer types in The Cancer Genome Atlas (TCGA) and other published studies have been reanalyzed for the lncRNAs expression profile [8,9]. There is a limited number of non-tumor samples sequenced for RNA-seq in TCGA. If necessary, the database such as the GTEx (http://gtexportal.org) can serve as additional tissue-specific controls, which provides over 9600 RNA-seq samples across 51 tissues.
lncRNAs expression data have several features that pose significant challenges for the data analysis, including low abundance, large number of genes, and rough annotations. To ensure detection reliability, a common practice is to filter out lncRNA genes with low average Reads Per Kilobase per Million mapped reads (RPKM), e.g. < 0.3. We recommend using the two-step filter proposed by Yan et al. [9]: first eliminates the genes with 50th-percentile RPKM =0, and then only keep the genes with 90th-percentile RPKM < 0.1. About two-thirds of lncRNAs are excluded after this filtering procedure. Interestingly, excess zeros or low expression values are still observed in the downsized dataset. It is well known that excess zero read counts in RNAseq data can distort model estimation and reduce power in differential expression analysis. The popular R packages DESeq2 and edgeR assume a negative binomial (i.e. over-dispersed Poisson) distribution for the count data. Methods based on zero-inflated negative binomial (ZINB) and zeroinflated GLM have been proposed to explicitly address the issue of excess zeros in RNA-seq data [10]. These methods have been recently applied to single-cell RNAseq (scRNA-seq) data, which has high dropout rates. Since the difference in gene expression variance is biologically interesting, multiple methods have been developed to incorporate the testing of variance in the differential model. However, for biomarkers in clinical settings, genes with pronounced group contrast in mean expression level usually have more translation value. Gene-wise expression variability can generate from different sources and vary widely from study to study, especially with different normalization methods. Hence, we focus on the group comparison of mean gene expression levels in this study.
In a large-scale secondary analysis of expression data such as in lncRNAs studies, it is common to only have access to normalized data (such as RPKM), due to either limited data availability or less ideal performance of other normalization methods [11,12]. Packages such as DESeq2, however, are not applicable to lncRNA normalized counts because they do not allow non-integer normalized expression or zero as input. In this case, a plausible practice is to round continuous expression values into integers and then to add 1 to each value to remove zeroes. Another commonly-adopted approach is using log 2 (x + 1) transformed normalized data in R package like limma [13], i.e., assuming a log-transformed Gaussian distribution as in microarray intensity levels. The core function in limma, which runs a moderated ttest after an empirical Bayes correction, is more generic and more suitable for the differential expression of processed lncRNA expression data. In a very recent study, a total of 25 popular methods for testing differential expression genes were comprehensively evaluated with special emphasis on low-abundance mRNAs and lncRNAs [14]. It was observed that linear modeling with empirical Bayes moderation (implemented in limma with variance stabilizing transformation [15], voom [16] or trend), and a non-parametric method based on Wilcoxon rank sum statistics (implemented in SAMSeq) showed overall good balance of false discovery rate (FDR) and reasonable detection sensitivity. However, none of the methods compared can outperform other tools and all tools exhibited substandard performance for lncRNAs in terms of differential testing, often with higher FDR and true positive rate (TPR) than for mRNAs. This study also concluded that accurate differential expression inference of lncRNAs requires more samples than that of mRNAs. Even methods like limma can exhibit an excess of false discoveries under specific scenarios, making these methods unreliable in practical applications.
In this paper, we first investigated the distribution of lncRNA and low-abundance mRNA via the relation between gene-wise coefficient of variation and mean. The patterns for these RNAs were compared with high abundant mRNA, providing evidence for an underlying Exponential distribution in most genes of lower expression, especially those in lncRNA. Based on the assumption of Exponential-distributed non-zero abundance for the majority of lncRNA genes, we presented the lncDIFF, an efficient and reliable toolset in a zero-inflated Exponential quasi-likelihood strategy on the Generalized Linear Model. The quasi-likelihood provides unbiased estimations for biological group effect on lncRNA gene expression, including a small proportion of lncRNA genes with expression following Negative Binomial or Log Normal distribution. It thus provides a simple and versatile approach to model gene expression data without making strong distributional assumptions about the underlying variation, but still being compatible with existing RNA-Seq quantification and normalization tools. The flexibility in allowing for the estimation of calibration and variance parameters is especially important for lncRNAs differential analysis. lncDIFF is thus able to integrate desirable features from the aforementioned two topperforming methods (limma and SAMSeq [14]) for lncRNA differential analysis. lncDIFF is compared with existing tools using an extensive simulation study and lncRNA DE analysis on TCGA head and neck squamous cell carcinomas (HNSC), with data downloaded from TANRIC [8]. Results suggest that lncDIFF is powerful and robust in a variety of scenarios and identifies DE lncRNA genes of low expression with higher accuracy.

Simulation study to assess lncDIFF performance
We conducted a comprehensive simulation study to assess the performance of lncDIFF, and compared with existing common tools DESeq2, edgeR and limma (with log transformation), along with recently developed single-cell tools zinbwave [17] incorporated in DESeq2 (i.e. zinbwave+DESeq2), and DEsingle [18]. We rounded decimals to integers as input for DESeq2 and selected the quasi-likelihood estimation method in edgeR. lncDIFF and the compared methods were applied to low-abundance RNA-Seq genes sampled from zeroinflated NB or LN families. ShrinkBayes [19,20] is a Bayesian approach that also adopts zero-inflated NB for low counts RNA-Seq DE analysis, designed for small sample size studies but slower in computation compared to other tools. Hence, ShrinkBayes was not applied to simulated datasets, and was compared to lncDIFF only based on TCGA HNSC datasets. We adopted the gene-wise estimated dispersion or log variance from TCGA HNSC [21] lncRNA RPKM as the density parameters for data generation. Based on the dispersion and log variance estimate for the data in this TCGA study, we used ϕ = 1, 2, 10, 20, σ 2 = 0.01, 0.25, 1, 2.25, and fixed ϕ, σ 2 values to generate RPKM of each genes across all samples in the same simulation scenario. Each scenario was defined by the unique gene-wise nonzero proportion π = 0.5, 0.7, 0.9, sampling distribution function (NB or LN) and value of ϕ, σ 2 , with sample size varying at N = 100, 200, 300. In order to generate data similar to lncRNA RPKM, we first obtained binary outcomes (0-1) for all samples in one scenario from the Bernoulli sampling, and then replace the 1's by positive abundance value sampled from NB or LN densities. The HNSC study includes 40 pairs of matched normal-tumor tissues. We used the 40 normal samples to calculate the mean RPKM as baseline group parameter β i1 in simulation. Similar to the common filtering criteria in existing lncRNA analysis, we removed the genes in the real data with mean RPKM < 0.3 [22,23] and zero expression in more than half of the samples, reducing to 1100 genes used for simulation.
In the simulation study, we only considered two-group comparison to illustrate the contrast between different methods. RPKM of the first group was randomly generated by the specified density function and the baseline parameter, while the second group had the mean parameter of the baseline times a shift, i.e., the tumor/normal fold change in TCGA HNSC data. We manually set the shift between two simulated groups at 1 if the absolute log2 fold change for the corresponding gene is less than 0.5. Simulated genes with between-groups shift at 1 are the null genes and the remaining are DE genes. For each simulated scenario, we generated 100 replicates to assess the performance of different methods by the mean of false discovery rate (FDR) and true positive rate (TPR), and area under the curve (AUC) of receiver operating characteristics (ROC) with FDR threshold 0.05. We ordered the scenarios by the scale of variance (with 1-4 representing the smallest to the largest), proportion of nonzero expression, and sample size to investigate the impact of parameters on performance metrics. Figure 1 and Additional file 1: Figures S4-S5 presented the AUC, FDR and TPR of all scenarios, illustrating that lncDIFF outperforms the other methods, especially for scenarios with LN density.
AUC for all methods in Fig. 1 decrease as the genewise variation increases, and lncDIFF's performance is close to the optimal method (DESeq2) for NB density. The change of AUC across different sample sizes implies that adding more samples improves the performance of lncDIFF and DESeq2, but does not have impact on edgeR, limma and DEsingle. Furthermore, the AUC of lncDIFF in NB density is equivalent to or slightly larger than that of DESeq2 at sample size N = 300. According to AUC and TPR, the outperformance of DESeq2 compared to lncDIFF in NB sampling was not as pronounced as the outperformance of lncDIFF compared to DESeq2 in LN distribution. The single-cell RNA-Seq tool zinbwave improves DE detection power of DESeq2, but only for small gene-wise variance and many samples having simulated counts > 4. TPR of limma was higher than the other methods except for lncDIFF on LN distributed data in smaller sample sizes. On the other hand, the FDR shows that lncDIFF has similar performance of DESeq2 in most scenarios regardless of density and greatly outperforms the other two methods, although lncDIFF in large-variance LN scenarios presents performance close to edgeR and limma. The change of performance of DESeq2 under either sampling distribution brought by zinbwave illustrates that it is the Exponential likelihood rather than zero-inflated point mass contributing towards the outperformance of lncDIFF. In summary, lncDIFF is an ideal method for DE analysis of lncRNA RPKM with different distributions, while DESeq2 is a preferred tool if the non-zero counts are relatively high and NB-distributed.  The results in Figs. 2-3(b) suggested that lncDIFF provided ideal power or alternative TPR (75%) in DE analysis for LFC < 0.5, with approximated FPR below 5%. ShrinkBayes has detection power close to lncDIFF only for DE genes in SS1. Figures 2 and 3c-e displayed the group contrast on genes identified as DE (positive) by one method but non-DE (negative) by the other method using boxplot of RPKM at log2 scale per group. The group contrast on the DE genes identified only by lncDIFF was much larger than that in DE genes identified only by each of the compared methods. In other words, lncDIFF identifies 'true' DE genes with more power and is less likely to 'miss' the DE genes with pronounced group contrast.  (Table 1). There are 11 overlapped genes in the top 20 significant gene list of paired and unpaired analysis, some of which are associated with overall survival time. For each the overlapped significant genes, we divided the 426 HNSC tumor samples into two groups by the median of RPKM per DE gene, and then apply Cox Proportional Hazard model to survival association analysis. The Kaplan-Meier curves and the log-rank test p-values reveal marginal or significant associations between genes ERVH48-1, HCG22, LINC00668, LINC02582 and the overall survival months (Additional file 1: Figure S6). For the same set of HNSC tumor samples, we also used the mRNA normalized counts to select 20 mRNA genes highly correlated with the 11 tumor-normal DE lncRNA genes by Spearman correlation (Additional file 3).

Computational performance of lncDIFF
The GLM group effect estimation was implemented in the R function ZIQML.fit, separated from the likelihood ratio testing included in another function ZIQML.LRT. The GLM group effect estimate in lncDIFF is based on the zero-inflated Exponential likelihood with either identity or log link function, which is also valid and unbiased for low-expression lncRNA genes distributed as NB or LN. The choice of link function does not have any impact on the group effect estimate and LRT results ( Table 2), but the log link function can avoid NA values produced The distribution of p-values from lncDIFF was also investigated and compared with the other methods in TCGA HNSC tumor vs. normal analysis, using simulated p-values from sample permutation. We randomly selected three genes with different RPKM density patterns to generate the null p-values and then visualized the p-values distribution via QQ plots in Fig. 4. Figure 4(b)-(c) showed that the p-values of lncDIFF and DESeq2 (with or without zinbwave) were close to the expected distribution aligned on the identity line, while the other methods resulted in a large proportion of small pvalues (< 0.1). The histogram and density plot of RPKM presented in Fig. 4(a) implied that the null p-values of lncDIFF and DESeq2 for higher expressed lncRNA genes (ENSG00000130600.11) followed the expected uniform distribution, while those for low abundance genes (ENSG00000152931.7, ENSG00000153363.8) may deviate from the assumed uniform distribution. To avoid the distorted distribution of LRT p-values, we also implemented the option of empirical p-value and FDR based on the zero-inflated Exponential likelihood in the R function ZIQML.LRT.
We further illustrated the computation efficiency of lncDIFF by running on the TCGA HNSC matched tumor-normal samples with~1130 filtered genes. The processing time (in seconds) of this biological data analysis by lncDIFF, DESeq2, edgeR, limma, zinbwave+DE-Seq2, DEsingle, and ShrinkBayes are 3.17, 4.31, 3.37, 0.02, 55.33, 52.67, and 341.47, respectively. If the option of simulated p-value is enabled, the running time of lncDIFF on this real dataset is increased to 267.86 s for default 100 permutations, but the correlation between observed and simulated p-values or FDR's is around 0.9. In order to illustrate normalization methods having no impact on lncDIFF performance, we simply applied lncDIFF DE analysis to three different types of normalized counts (i.e., FPKM, TMM and UQ) of low abundance mRNA in TCGA HNSC tumor-normal samples (N = 546) . The low abundance genes were selected with mean FPKM in the range of (0.3, 2) and no more than 20% zero expression, similar to the majority of lncRNA genes. The Pearson correlation of log10 adjusted p-values between   [24], baySeq [25], and QuasiSeq [26]. Hence, the lncDIFF DE analysis can be incorporated in existing RNA-Seq quantification and normalization pipeline, regardless of the models employed in the preprocessing tools.

Conclusions
We implemented GLM with zero-inflated Exponential likelihood and LRT for either identity or logarithmic link function in lncDIFF, along with an option of simulated p-values and FDR generated from permutations. This package allows the input expression matrix to be either continuous or discrete and requires group or phenotype factor provided in the design matrix format. lncDIFF is a powerful differential analysis tool for zero-inflated lowcounts RNA-Seq data, especially for lncRNA and largescale studies, with improved DE detection power and computational performance compared to others. This is an efficient DE analysis method compatible with various RNA-Seq quantification and normalization tools.

Low-abundance RNA-Seq data distribution
In RNA-Seq analysis, the type of RNAs and the selected alignment, quantification and normalization tools usually have substantial impacts on the distribution pattern of transcript abundance [27], especially on the level of gene expression dispersion, i.e. the mean-variance relation. Most of the existing RNA-Seq analysis tools, such as DESeq [28], edgeR [29], and baySeq [25], estimate gene-wise dispersion to perform normalization or downstream differential expression analysis. However, algorithms based on gene-wise dispersion may not be suitable for low counts in RNA-Seq studies, such as lncRNA and low-expression genes in mRNA [14]. Existing analysis on RNA-Seq data usually assumes Negative Binomial (NB) or the Log Normal (LN) distribution for RNA-Seq normalized counts X mapped to a gene [14,16], with gene-wise dispersion summarized as a quadratic mean-variance relation Var(X) = c • E(X) 2 . The square root of c coincides with coefficient of variation (CV) and depends on the assumed statistical distribution, i.e. c ¼ ϕ þ 1 μ 1 for NB and c = exp(σ 2 ) − 1 for LN [28], where μ 1 , ϕ are the mean and dispersion parameters of NB, and σ is the log standard deviation of LN, not related to log mean. Obviously, a drop in the gene-wise CV is expected to occur along with an increase in gene-wise mean, if the NB distribution assumption is valid for RNA-Seq counts. On the other hand, the gene-wise CV and mean should be independent if the assumed LN distribution is valid.
We first used the lncRNA and mRNA FPKM in the TCGA HNSC study [21] to investigate the dispersion patterns for three types of RNAs, i.e. high-abundance mRNA, low-abundance mRNA and lncRNA. Genes in lncRNA dataset were filtered by the criteria proposed by Yan et al. [9], while genes in mRNA dataset with more than 30% zero expression were removed. The cutoff between high vs. low abundant mRNA genes was the 85th percentile of gene-wise mean FPKM. We used the violin-box plots in Fig. 5 to illustrate the CV-mean relation for different RNAs in three panels. The totals of genes for each type of RNA are 9561 high-abundance mRNA genes, 8362 lowabundance mRNA genes, and 1322 lncRNA genes.
CVs for the majority of high-abundance mRNA genes were less than 1 and display a drop in higher expressed genes (Fig. 5). In contrast, the CV level for most of lncRNA and low-abundance mRNA genes in the lower two panels were close to CV = 1 and did not change along with gene-wise mean, especially for lncRNA genes with mean below the 80th percentile. The other genes in these panels severely deviated from CV = 1, and a negative CV-mean relation still existed in low-abundance mRNA when gene-wise mean increases from the 70th percentile to higher. We visualized and confirmed such CV-mean patterns via mRNA and lncRNA FPKM data in another two TCGA studies, i.e. Lung Squamous Cell Carcinoma (LUSC) and Lung Adenocarcinoma (LUAD), shown in Additional file 1: Figures S1-S2. We further assessed the dispersion patterns of mRNA low counts normalized by TMM and UQ methods [30,31] (Additional File 1: Figure S3). The similarity between different normalized mRNA counts implies that TMM or UQ normalized lncRNA counts also follow the CV-mean pattern of lncRNA RPKM in Fig. 5, although TMM and UQ normalized lncRNA counts in TCGA HNSC study were not publically available.
The expected CV level for lncRNA and low-abundance mRNA in Fig. 5 revealed an underlying statistical distribution in a large proportion of low abundant genes, which should have CV = 1 or Var(X) = E(X) 2 . This naturally leads to the Exponential distribution with density function f ðXÞ ¼ 1 λ e − X λ , and E(X) = λ, Var(X) = λ 2 . In the light of fewer statistical parameters, it is of interest to consider the Exponential family as a latent distribution for low-counts RNA-Seq data, especially for lncRNA. We cannot ignore the fact that expression of certain lncRNA genes and lowabundance mRNA genes are still distributed as the wellknown NB or LN family, illustrated by the genes with CV deviating from CV = 1 ( Fig. 5 and Additional file 1: Figs. S1-S3). Therefore, in the presence of NB or LNdistributed counts, we adopted exponential family to account for the latent distribution of lncRNA genes and perform differential expression analysis.

GLM with exponential likelihood
Let Y ij be the lncRNA normalized counts for gene i in sample j, belonging to phenotype or treatment group k, k = 1, …, K. The generalized linear model (GLM) with the Exponential family is w jk and β ik are design matrix elements and coefficients for groups, while v jm and γ m (m = 1, …, M) are the M covariates and corresponding coefficients. Since Y ij has been normalized for library size, this model does not include the RNA sequencing normalization factor, although it is a common parameter in existing tools based on NB assumption [28,29,32,33].
In the absence of zero counts, lncDIFF uses the Exponential GLM for lncRNA DE analysis. Let β i = (β i1, …, β iK ) and γ = (γ 1, …, γ m ), for gene i with negligible zero occurrence (< 1%), the GLM likelihood based on the exponential dens- λ ij with identity or log link function is Logarithmic link : The exponential likelihood estimate for mean gene expression is the maximizer of L(β i , γ), that is ðβ i ;γÞ ¼ argmax Lðβ i ; γÞ. Statistical models similar to Exponential GLM had been proposed and assessed in previous studies [34][35][36][37].

Zero-inflated exponential likelihood
In lncRNA expression data, it is common to observe zero values in most genes at a non-negligible proportion (i.e., at least 1%) of samples. The excess zeroes and low counts for lncRNA cannot be addressed by integer models like Poisson and Negative Binomial (or Gamma- Fig. 5 Gene-wise coefficient of variation. Violin and box plots for gene-wise coefficient of variation (CV) based on RPKM of three types of RNAs in TCGA HNSC study: high-abundance mRNA, lncRNA, and low-abundance mRNA. For each type of RNA, genes are divided into ten groups by the gene-wise mean percentiles Poisson), especially for non-integer normalized counts in the range of (0, 2). Rounding decimals to integers and then applying Poisson or NB density [38,39] or using data transformation, e.g. log2, voom, or VST [15,16,38] with limma [13,40] may lead to errors in DE analysis. Therefore, we propose the zero-inflated quasi likelihood of Exponential GLM to account for the gene-wise inflation of zeros.
In order to incorporate the zero-inflated pattern, we re-write the normalized counts for gene i in sample j by a multiplicative error model [41][42][43] with random error ϵ ij , that is The random errors ϵ ij also have the occurrence of excess zeros with a prior probability mass P(ϵ ij = 0) = 1 − π i , P(ϵ ij > 0) = π i , and a continuous density at positive value with E(ϵ ij | Y ij > 0) = γ, similar to [42,44,45]. If the non-zero expression Y ij | Y ij > 0 follows an Exponential distribution (so does ϵ ij |Y ij > 0), then the density functions for Y ij including zero occurrence is Equation (4) is derived in the Additional file 2. The corresponding likelihood function is l j * (π i , β i , γ) is defined according to the selected link function as Logarithmic link : l j Ã π i ; β i ; γ The zero-inflated maximum likelihood (ZI-ML) estimate for group-wise mean expression is the maximizer of L * (π, β i , γ) in eq. (6), that iŝ It is worthwhile to note that the likelihood function L * (π i , β i , γ) in eq. (5) reduces to eqs. (1) and (2) if the proportion of zero expression is negligible, i.e. no more than 1%.

Estimate group wise mean
For each gene, lncDIFF utilizes ðπ i ;β i ;γÞ ZI−ML in eq. (8) to estimate the mean expression level per group. We can prove mathematically that this estimate is asymptotically unbiased at large sample size, even though RNA-Seq low counts are usually a mixture of multiple distributions as previously reported [34][35][36]. Zero-inflated Poisson, NB, or LN likelihood may result in biased estimate for group wise mean gene expression in lncRNA low counts, due to limited mathematical power of these functions. Mathematical proof for unbiased estimate of group wise mean gene expression in lncDIFF is elaborated in the Additional File 2.
To illustrate the estimation accuracy of ðπ i ;β i ;γÞ ZI−ML , we simply generated normalized lncRNA counts for a gene in three biological groups (i.e. groups A, B, C) without covariate effects by sampling from zero-inflated Exponential, NB, LN distributions, respectively. Each scenario contained 1000 replicates. The mean and median of 1000 estimated group effects were listed in Table 3, indicating that the presence of NB and LNdistributed low-counts did not have impact on the accuracy of group effect estimate in lncDIFF. In other words, lncRNA counts may occasionally deviate from Exponential family but does not affect the performance of lncDIFF. Hence, lncDIFF is a pseudo or quasi- Table 3 Estimated group effect on a gene by lncDIFF on simulated low-abundance expression. Low-abundance expressions were sampled from three statistical distributions and two scenarios of parameters (defined by β 's and CV). 1000 replicates were generated resulting in 1000 estimates per scenario