 Methodology article
 Open Access
 Published:
lncDIFF: a novel quasilikelihood method for differential expression analysis of noncoding RNA
BMC Genomics volume 20, Article number: 539 (2019)
Abstract
Background
Long noncoding RNA (lncRNA) expression data have been increasingly used in finding diagnostic and prognostic biomarkers in cancer studies. Existing differential analysis tools for RNA sequencing do not effectively accommodate low abundant genes, as commonly observed in lncRNAs.
Results
We investigated the statistical distribution of normalized counts for low expression genes in lncRNAs and mRNAs, and proposed a new tool lncDIFF based on the underlying distribution pattern to detect differentially expressed (DE) lncRNAs. lncDIFF adopts the generalized linear model with zeroinflated Exponential quasilikelihood to estimate group effect on normalized counts, and employs the likelihood ratio test to detect differential expressed genes. The proposed method and tool are applicable to data processed with standard RNASeq preprocessing and normalization pipelines. Simulation results showed that lncDIFF was able to detect DE genes with more power and lower false discovery rate regardless of the data pattern, compared to DESeq2, edgeR, limma, zinbwave, DEsingle, and ShrinkBayes. In the analysis of a head and neck squamous cell carcinomas data, lncDIFF also appeared to have higher sensitivity in identifying novel lncRNA genes with relatively large fold change and prognostic value.
Conclusions
lncDIFF is a powerful differential analysis tool for low abundance noncoding RNA expression data. This method is compatible with various existing RNASeq quantification and normalization tools. lncDIFF is implemented in an R package available at https://github.com/qianli10000/lncDIFF.
Background
Long noncoding RNAs (lncRNAs) are transcripts longer than 200 nucleotides with no or limited proteincoding capability. It is estimated that, in the human genome, there are at least four times more lncRNA genes than proteincoding genes [1]. Currently, there are more than 14,000 human lncRNAs annotated in GENCODE (https://www.gencodegenes.org/). Overall, lncRNA genes have fewer exons, lower abundance and are under selective constraints compared to proteincoding genes. LncRNAs are involved in diverse regulatory mechanisms and in some critical pathways. For example, they can act as scaffolds to create higherorder protein complexes, as decoys to bind sequester transcription factors, and as guides of proteinDNA interactions [2,3,4]. Emerging evidence suggests that lncRNAs serve as essential regulators in cancer cell migration and invasion, as well as in other cancerous phenotypes [5, 6]. Therefore, lncRNAs are becoming attractive potential therapeutic targets and a new class of biomarkers for the cancer prognosis and diagnosis. For example, the lncRNA PCA3 (prostate cancer antigen 3) is an FDAapproved biomarker for prostate cancer prediction. The overexpression of lncRNA HOTAIR in breast cancer patients is reported to be associated with patient survival and risk of metastasis [7]. Another important lncRNA ANRIL (CDKN2AS1) is one of the most frequently alerted genes in human cancers and has been reported to increase the risks of diverse cancers.
Although a large number of lncRNAs have been identified, only a very small proportion of them have been characterized for cellular and molecular functions. Similar to proteincoding genes, the biomarker discovery of lncRNAs can start from a genomewide differential expression (DE) analysis. One advantage of lncRNAs research in cancer is that we can leverage the large collection of previously published RNAseq data and perform secondary analyses. Unlike the miRNAs counterparts, the expression of a large number of lncRNAs can be detected by standard RNAseq with sufficient sequencing depth. Through downloading RNAseq BAM files and recalling using GENCODE genomic coordinates, more than 8000 human tumor samples across all major cancer types in The Cancer Genome Atlas (TCGA) and other published studies have been reanalyzed for the lncRNAs expression profile [8, 9]. There is a limited number of nontumor samples sequenced for RNAseq in TCGA. If necessary, the database such as the GTEx (http://gtexportal.org) can serve as additional tissuespecific controls, which provides over 9600 RNAseq samples across 51 tissues.
lncRNAs expression data have several features that pose significant challenges for the data analysis, including low abundance, large number of genes, and rough annotations. To ensure detection reliability, a common practice is to filter out lncRNA genes with low average Reads Per Kilobase per Million mapped reads (RPKM), e.g. < 0.3. We recommend using the twostep filter proposed by Yan et al. [9]: first eliminates the genes with 50thpercentile RPKM =0, and then only keep the genes with 90thpercentile RPKM < 0.1. About twothirds of lncRNAs are excluded after this filtering procedure. Interestingly, excess zeros or low expression values are still observed in the downsized dataset. It is well known that excess zero read counts in RNAseq data can distort model estimation and reduce power in differential expression analysis. The popular R packages DESeq2 and edgeR assume a negative binomial (i.e. overdispersed Poisson) distribution for the count data. Methods based on zeroinflated negative binomial (ZINB) and zeroinflated GLM have been proposed to explicitly address the issue of excess zeros in RNAseq data [10]. These methods have been recently applied to singlecell RNAseq (scRNAseq) data, which has high dropout rates. Since the difference in gene expression variance is biologically interesting, multiple methods have been developed to incorporate the testing of variance in the differential model. However, for biomarkers in clinical settings, genes with pronounced group contrast in mean expression level usually have more translation value. Genewise expression variability can generate from different sources and vary widely from study to study, especially with different normalization methods. Hence, we focus on the group comparison of mean gene expression levels in this study.
In a largescale secondary analysis of expression data such as in lncRNAs studies, it is common to only have access to normalized data (such as RPKM), due to either limited data availability or less ideal performance of other normalization methods [11, 12]. Packages such as DESeq2, however, are not applicable to lncRNA normalized counts because they do not allow noninteger normalized expression or zero as input. In this case, a plausible practice is to round continuous expression values into integers and then to add 1 to each value to remove zeroes. Another commonlyadopted approach is using log_{2}(x + 1) transformed normalized data in R package like limma [13], i.e., assuming a logtransformed Gaussian distribution as in microarray intensity levels. The core function in limma, which runs a moderated ttest after an empirical Bayes correction, is more generic and more suitable for the differential expression of processed lncRNA expression data. In a very recent study, a total of 25 popular methods for testing differential expression genes were comprehensively evaluated with special emphasis on lowabundance mRNAs and lncRNAs [14]. It was observed that linear modeling with empirical Bayes moderation (implemented in limma with variance stabilizing transformation [15], voom [16] or trend), and a nonparametric method based on Wilcoxon rank sum statistics (implemented in SAMSeq) showed overall good balance of false discovery rate (FDR) and reasonable detection sensitivity. However, none of the methods compared can outperform other tools and all tools exhibited substandard performance for lncRNAs in terms of differential testing, often with higher FDR and true positive rate (TPR) than for mRNAs. This study also concluded that accurate differential expression inference of lncRNAs requires more samples than that of mRNAs. Even methods like limma can exhibit an excess of false discoveries under specific scenarios, making these methods unreliable in practical applications.
In this paper, we first investigated the distribution of lncRNA and lowabundance mRNA via the relation between genewise coefficient of variation and mean. The patterns for these RNAs were compared with high abundant mRNA, providing evidence for an underlying Exponential distribution in most genes of lower expression, especially those in lncRNA. Based on the assumption of Exponentialdistributed nonzero abundance for the majority of lncRNA genes, we presented the lncDIFF, an efficient and reliable toolset in a zeroinflated Exponential quasilikelihood strategy on the Generalized Linear Model. The quasilikelihood provides unbiased estimations for biological group effect on lncRNA gene expression, including a small proportion of lncRNA genes with expression following Negative Binomial or Log Normal distribution. It thus provides a simple and versatile approach to model gene expression data without making strong distributional assumptions about the underlying variation, but still being compatible with existing RNASeq quantification and normalization tools. The flexibility in allowing for the estimation of calibration and variance parameters is especially important for lncRNAs differential analysis. lncDIFF is thus able to integrate desirable features from the aforementioned two topperforming methods (limma and SAMSeq [14]) for lncRNA differential analysis. lncDIFF is compared with existing tools using an extensive simulation study and lncRNA DE analysis on TCGA head and neck squamous cell carcinomas (HNSC), with data downloaded from TANRIC [8]. Results suggest that lncDIFF is powerful and robust in a variety of scenarios and identifies DE lncRNA genes of low expression with higher accuracy.
Results
Simulation study to assess lncDIFF performance
We conducted a comprehensive simulation study to assess the performance of lncDIFF, and compared with existing common tools DESeq2, edgeR and limma (with log transformation), along with recently developed singlecell tools zinbwave [17] incorporated in DESeq2 (i.e. zinbwave+DESeq2), and DEsingle [18]. We rounded decimals to integers as input for DESeq2 and selected the quasilikelihood estimation method in edgeR. lncDIFF and the compared methods were applied to lowabundance RNASeq genes sampled from zeroinflated NB or LN families. ShrinkBayes [19, 20] is a Bayesian approach that also adopts zeroinflated NB for low counts RNASeq DE analysis, designed for small sample size studies but slower in computation compared to other tools. Hence, ShrinkBayes was not applied to simulated datasets, and was compared to lncDIFF only based on TCGA HNSC datasets.
We adopted the genewise estimated dispersion or log variance from TCGA HNSC [21] lncRNA RPKM as the density parameters for data generation. Based on the dispersion and log variance estimate for the data in this TCGA study, we used ϕ = 1, 2, 10, 20, σ^{2} = 0.01, 0.25, 1, 2.25, and fixed ϕ, σ^{2} values to generate RPKM of each genes across all samples in the same simulation scenario. Each scenario was defined by the unique genewise nonzero proportion π = 0.5, 0.7, 0.9, sampling distribution function (NB or LN) and value of ϕ, σ^{2}, with sample size varying at N = 100, 200, 300. In order to generate data similar to lncRNA RPKM, we first obtained binary outcomes (0–1) for all samples in one scenario from the Bernoulli sampling, and then replace the 1’s by positive abundance value sampled from NB or LN densities. The HNSC study includes 40 pairs of matched normaltumor tissues. We used the 40 normal samples to calculate the mean RPKM as baseline group parameter β_{i1} in simulation. Similar to the common filtering criteria in existing lncRNA analysis, we removed the genes in the real data with mean RPKM < 0.3 [22, 23] and zero expression in more than half of the samples, reducing to 1100 genes used for simulation.
In the simulation study, we only considered twogroup comparison to illustrate the contrast between different methods. RPKM of the first group was randomly generated by the specified density function and the baseline parameter, while the second group had the mean parameter of the baseline times a shift, i.e., the tumor/normal fold change in TCGA HNSC data. We manually set the shift between two simulated groups at 1 if the absolute log2 fold change for the corresponding gene is less than 0.5. Simulated genes with betweengroups shift at 1 are the null genes and the remaining are DE genes. For each simulated scenario, we generated 100 replicates to assess the performance of different methods by the mean of false discovery rate (FDR) and true positive rate (TPR), and area under the curve (AUC) of receiver operating characteristics (ROC) with FDR threshold 0.05. We ordered the scenarios by the scale of variance (with 1–4 representing the smallest to the largest), proportion of nonzero expression, and sample size to investigate the impact of parameters on performance metrics. Figure 1 and Additional file 1: Figures S4S5 presented the AUC, FDR and TPR of all scenarios, illustrating that lncDIFF outperforms the other methods, especially for scenarios with LN density.
AUC for all methods in Fig. 1 decrease as the genewise variation increases, and lncDIFF’s performance is close to the optimal method (DESeq2) for NB density. The change of AUC across different sample sizes implies that adding more samples improves the performance of lncDIFF and DESeq2, but does not have impact on edgeR, limma and DEsingle. Furthermore, the AUC of lncDIFF in NB density is equivalent to or slightly larger than that of DESeq2 at sample size N = 300. According to AUC and TPR, the outperformance of DESeq2 compared to lncDIFF in NB sampling was not as pronounced as the outperformance of lncDIFF compared to DESeq2 in LN distribution. The singlecell RNASeq tool zinbwave improves DE detection power of DESeq2, but only for small genewise variance and many samples having simulated counts > 4. TPR of limma was higher than the other methods except for lncDIFF on LN distributed data in smaller sample sizes. On the other hand, the FDR shows that lncDIFF has similar performance of DESeq2 in most scenarios regardless of density and greatly outperforms the other two methods, although lncDIFF in largevariance LN scenarios presents performance close to edgeR and limma. The change of performance of DESeq2 under either sampling distribution brought by zinbwave illustrates that it is the Exponential likelihood rather than zeroinflated point mass contributing towards the outperformance of lncDIFF. In summary, lncDIFF is an ideal method for DE analysis of lncRNA RPKM with different distributions, while DESeq2 is a preferred tool if the nonzero counts are relatively high and NBdistributed.
Application of lncDIFF to TCGA HNSC data
We employed the above methods along with ShrinkBayes to perform DE analysis on the TCGA HNSC lncRNA data for matched (or paired) tumor and normal samples, with results summarized in Figs. 23. The Venn diagrams in Figs. 23(a) show the overlap and difference of the DE genes identified by different methods. We do not have prior knowledge about the ‘true’ DE genes for HNSC tumor vs normal. Thus, the genes with log2 fold change > 0.5, 1, or 1.5 were considered as ‘pseudo’ or ‘surrogate’ DE genes, respectively, labeled as Surrogate Set 1–3 (SS1 SS3) of DE genes. For each set, the proportion detected by each method is a surrogate true positive rate (SS1.TPRSS3.TPR), while the surrogate false positive rate (SS1.FPRSS3.FPR) is the percentage of those not in surrogate DE genes set but detected as positive by each method, listed in Figs. 23(b). The significance threshold for tumor vs normal DE gene is adjusted pvalue< 0.05. We further visualized the contrast between lncDIFF and the other methods by boxplots in Figs. 2 and 3ce, with each panel showing the tumor vs normal group effect on the lncDIFF positive genes identified as negative by other methods. We only include the genes with upregulation for normal tissues and LFC > 0.5 in the boxplots.
The results in Figs. 23(b) suggested that lncDIFF provided ideal power or alternative TPR (75%) in DE analysis for LFC < 0.5, with approximated FPR below 5%. ShrinkBayes has detection power close to lncDIFF only for DE genes in SS1. Figures 2 and 3ce displayed the group contrast on genes identified as DE (positive) by one method but nonDE (negative) by the other method using boxplot of RPKM at log2 scale per group. The group contrast on the DE genes identified only by lncDIFF was much larger than that in DE genes identified only by each of the compared methods. In other words, lncDIFF identifies ‘true’ DE genes with more power and is less likely to ‘miss’ the DE genes with pronounced group contrast.
We also applied the same analysis to the unpaired tumor (N = 426) and normal (N = 40) samples in the TCGA HNSC study by lncDIFF, and compared the top significant genes in the paired and unpaired DE analysis results (Table 1). There are 11 overlapped genes in the top 20 significant gene list of paired and unpaired analysis, some of which are associated with overall survival time. For each the overlapped significant genes, we divided the 426 HNSC tumor samples into two groups by the median of RPKM per DE gene, and then apply Cox Proportional Hazard model to survival association analysis. The KaplanMeier curves and the logrank test pvalues reveal marginal or significant associations between genes ERVH48–1, HCG22, LINC00668, LINC02582 and the overall survival months (Additional file 1: Figure S6). For the same set of HNSC tumor samples, we also used the mRNA normalized counts to select 20 mRNA genes highly correlated with the 11 tumornormal DE lncRNA genes by Spearman correlation (Additional file 3).
Discussion
Computational performance of lncDIFF
The GLM group effect estimation was implemented in the R function ZIQML.fit, separated from the likelihood ratio testing included in another function ZIQML.LRT. The GLM group effect estimate in lncDIFF is based on the zeroinflated Exponential likelihood with either identity or log link function, which is also valid and unbiased for lowexpression lncRNA genes distributed as NB or LN. The choice of link function does not have any impact on the group effect estimate and LRT results (Table 2), but the log link function can avoid NA values produced in numerical optimization of the likelihood function. lncDIFF provides the option of either identity or log link function in the function ZIQML.fit.
The distribution of pvalues from lncDIFF was also investigated and compared with the other methods in TCGA HNSC tumor vs. normal analysis, using simulated pvalues from sample permutation. We randomly selected three genes with different RPKM density patterns to generate the null pvalues and then visualized the pvalues distribution via QQ plots in Fig. 4. Figure 4(b)(c) showed that the pvalues of lncDIFF and DESeq2 (with or without zinbwave) were close to the expected distribution aligned on the identity line, while the other methods resulted in a large proportion of small pvalues (< 0.1). The histogram and density plot of RPKM presented in Fig. 4(a) implied that the null pvalues of lncDIFF and DESeq2 for higher expressed lncRNA genes (ENSG00000130600.11) followed the expected uniform distribution, while those for low abundance genes (ENSG00000152931.7, ENSG00000153363.8) may deviate from the assumed uniform distribution. To avoid the distorted distribution of LRT pvalues, we also implemented the option of empirical pvalue and FDR based on the zeroinflated Exponential likelihood in the R function ZIQML.LRT.
We further illustrated the computation efficiency of lncDIFF by running on the TCGA HNSC matched tumornormal samples with ~ 1130 filtered genes. The processing time (in seconds) of this biological data analysis by lncDIFF, DESeq2, edgeR, limma, zinbwave+DESeq2, DEsingle, and ShrinkBayes are 3.17, 4.31, 3.37, 0.02, 55.33, 52.67, and 341.47, respectively. If the option of simulated pvalue is enabled, the running time of lncDIFF on this real dataset is increased to 267.86 s for default 100 permutations, but the correlation between observed and simulated pvalues or FDR’s is around 0.9.
lncDIFF on different normalization methods
In order to illustrate normalization methods having no impact on lncDIFF performance, we simply applied lncDIFF DE analysis to three different types of normalized counts (i.e., FPKM, TMM and UQ) of low abundance mRNA in TCGA HNSC tumornormal samples (N = 546). The low abundance genes were selected with mean FPKM in the range of (0.3, 2) and no more than 20% zero expression, similar to the majority of lncRNA genes. The Pearson correlation of log10 adjusted pvalues between the three normalization methods were FPKM vs. TMM 0.82, FPKM vs. UQ 0.92, TMM vs. UQ 0.96, implying similar DE analysis results. Therefore, we only used RPKM of lncRNA in TCGA HNSC to illustrate the application and performance of lncDIFF in this study. In addition to TMM and UQ, the quasilikelihood parameter estimation in lncDIFF is still robust for gene expression processed from modelbased RNASeq quantification and normalization tools, such as RSEM [24], baySeq [25], and QuasiSeq [26]. Hence, the lncDIFF DE analysis can be incorporated in existing RNASeq quantification and normalization pipeline, regardless of the models employed in the preprocessing tools.
Conclusions
We implemented GLM with zeroinflated Exponential likelihood and LRT for either identity or logarithmic link function in lncDIFF, along with an option of simulated pvalues and FDR generated from permutations. This package allows the input expression matrix to be either continuous or discrete and requires group or phenotype factor provided in the design matrix format. lncDIFF is a powerful differential analysis tool for zeroinflated lowcounts RNASeq data, especially for lncRNA and largescale studies, with improved DE detection power and computational performance compared to others. This is an efficient DE analysis method compatible with various RNASeq quantification and normalization tools.
Methods
Lowabundance RNASeq data distribution
In RNASeq analysis, the type of RNAs and the selected alignment, quantification and normalization tools usually have substantial impacts on the distribution pattern of transcript abundance [27], especially on the level of gene expression dispersion, i.e. the meanvariance relation. Most of the existing RNASeq analysis tools, such as DESeq [28], edgeR [29], and baySeq [25], estimate genewise dispersion to perform normalization or downstream differential expression analysis. However, algorithms based on genewise dispersion may not be suitable for low counts in RNASeq studies, such as lncRNA and lowexpression genes in mRNA [14].
Existing analysis on RNASeq data usually assumes Negative Binomial (NB) or the Log Normal (LN) distribution for RNASeq normalized counts X mapped to a gene [14, 16], with genewise dispersion summarized as a quadratic meanvariance relation Var(X) = c ∙ E(X)^{2}. The square root of c coincides with coefficient of variation (CV) and depends on the assumed statistical distribution, i.e. \( c=\phi +\frac{1}{\mu_1} \) for NB and c = exp(σ^{2}) − 1 for LN [28], where μ_{1}, ϕ are the mean and dispersion parameters of NB, and σ is the log standard deviation of LN, not related to log mean. Obviously, a drop in the genewise CV is expected to occur along with an increase in genewise mean, if the NB distribution assumption is valid for RNASeq counts. On the other hand, the genewise CV and mean should be independent if the assumed LN distribution is valid.
We first used the lncRNA and mRNA FPKM in the TCGA HNSC study [21] to investigate the dispersion patterns for three types of RNAs, i.e. highabundance mRNA, lowabundance mRNA and lncRNA. Genes in lncRNA dataset were filtered by the criteria proposed by Yan et al. [9], while genes in mRNA dataset with more than 30% zero expression were removed. The cutoff between high vs. low abundant mRNA genes was the 85th percentile of genewise mean FPKM. We used the violinbox plots in Fig. 5 to illustrate the CVmean relation for different RNAs in three panels. The totals of genes for each type of RNA are 9561 highabundance mRNA genes, 8362 lowabundance mRNA genes, and 1322 lncRNA genes.
CVs for the majority of highabundance mRNA genes were less than 1 and display a drop in higher expressed genes (Fig. 5). In contrast, the CV level for most of lncRNA and lowabundance mRNA genes in the lower two panels were close to CV = 1 and did not change along with genewise mean, especially for lncRNA genes with mean below the 80th percentile. The other genes in these panels severely deviated from CV = 1, and a negative CVmean relation still existed in lowabundance mRNA when genewise mean increases from the 70th percentile to higher. We visualized and confirmed such CVmean patterns via mRNA and lncRNA FPKM data in another two TCGA studies, i.e. Lung Squamous Cell Carcinoma (LUSC) and Lung Adenocarcinoma (LUAD), shown in Additional file 1: Figures S1S2. We further assessed the dispersion patterns of mRNA low counts normalized by TMM and UQ methods [30, 31] (Additional File 1: Figure S3). The similarity between different normalized mRNA counts implies that TMM or UQ normalized lncRNA counts also follow the CVmean pattern of lncRNA RPKM in Fig. 5, although TMM and UQ normalized lncRNA counts in TCGA HNSC study were not publically available.
The expected CV level for lncRNA and lowabundance mRNA in Fig. 5 revealed an underlying statistical distribution in a large proportion of low abundant genes, which should have CV = 1 or Var(X) = E(X)^{2}. This naturally leads to the Exponential distribution with density function \( f(X)=\frac{1}{\lambda }{e}^{\frac{X}{\lambda }} \), and E(X) = λ, Var(X) = λ^{2}. In the light of fewer statistical parameters, it is of interest to consider the Exponential family as a latent distribution for lowcounts RNASeq data, especially for lncRNA. We cannot ignore the fact that expression of certain lncRNA genes and lowabundance mRNA genes are still distributed as the wellknown NB or LN family, illustrated by the genes with CV deviating from CV = 1 (Fig. 5 and Additional file 1: Figs. S1S3). Therefore, in the presence of NB or LNdistributed counts, we adopted exponential family to account for the latent distribution of lncRNA genes and perform differential expression analysis.
GLM with exponential likelihood
Let Y_{ij} be the lncRNA normalized counts for gene i in sample j, belonging to phenotype or treatment group k, k = 1, …, K. The generalized linear model (GLM) with the Exponential family is
Identity link: \( {\lambda}_{ij}=\sum \limits_{k=1}^K{\beta}_{ik}{w}_{jk}+\sum \limits_{m=1}^M{\gamma}_m{v}_{jm.} \)
Logarithmic link: \( \log \left({\lambda}_{ij}\right)=\sum \limits_{k=1}^K{\beta}_{ik}{w}_{jk}+\sum \limits_{m=1}^M{\gamma}_m{v}_{jm} \)
w_{jk} and β_{ik} are design matrix elements and coefficients for groups, while v_{jm} and γ_{m} (m = 1, …, M) are the M covariates and corresponding coefficients. Since Y_{ij} has been normalized for library size, this model does not include the RNA sequencing normalization factor, although it is a common parameter in existing tools based on NB assumption [28, 29, 32, 33].
In the absence of zero counts, lncDIFF uses the Exponential GLM for lncRNA DE analysis. Let β_{i} = (β_{i1, …,}β_{iK}) and γ = (γ_{1, …,}γ_{m}), for gene i with negligible zero occurrence (< 1%), the GLM likelihood based on the exponential density \( f\left({Y}_{ij}\right)=\frac{1}{\lambda_{ij}}{e}^{\frac{Y_{ij}}{\lambda_{ij}}} \) with identity or log link function is
The exponential likelihood estimate for mean gene expression is the maximizer of L(β_{i}, γ), that is \( \left({\hat{\beta}}_i,\hat{\gamma}\right)= argmax\ L\left({\beta}_i,\gamma \right) \). Statistical models similar to Exponential GLM had been proposed and assessed in previous studies [34,35,36,37].
Zeroinflated exponential likelihood
In lncRNA expression data, it is common to observe zero values in most genes at a nonnegligible proportion (i.e., at least 1%) of samples. The excess zeroes and low counts for lncRNA cannot be addressed by integer models like Poisson and Negative Binomial (or GammaPoisson), especially for noninteger normalized counts in the range of (0, 2). Rounding decimals to integers and then applying Poisson or NB density [38, 39] or using data transformation, e.g. log2, voom, or VST [15, 16, 38] with limma [13, 40] may lead to errors in DE analysis. Therefore, we propose the zeroinflated quasi likelihood of Exponential GLM to account for the genewise inflation of zeros.
In order to incorporate the zeroinflated pattern, we rewrite the normalized counts for gene i in sample j by a multiplicative error model [41,42,43] with random error ϵ_{ij}, that is
The random errors ϵ_{ij} also have the occurrence of excess zeros with a prior probability mass P(ϵ_{ij} = 0) = 1 − π_{i}, P(ϵ_{ij} > 0) = π_{i}, and a continuous density at positive value with E(ϵ_{ij} Y_{ij} > 0) = γ, similar to [42, 44, 45]. If the nonzero expression Y_{ij} ∣ Y_{ij} > 0 follows an Exponential distribution (so does ϵ_{ij}Y_{ij} > 0), then the density functions for Y_{ij} including zero occurrence is
Equation (4) is derived in the Additional file 2. The corresponding likelihood function is
l_{j}^{∗}(π_{i}, β_{i}, γ) is defined according to the selected link function as
The zeroinflated maximum likelihood (ZIML) estimate for groupwise mean expression is the maximizer of L^{∗}(π, β_{i}, γ) in eq. (6), that is
It is worthwhile to note that the likelihood function L^{∗}(π_{i}, β_{i}, γ) in eq. (5) reduces to eqs. (1) and (2) if the proportion of zero expression is negligible, i.e. no more than 1%.
Estimate group wise mean
For each gene, lncDIFF utilizes \( {\left({\hat{\uppi}}_i,{\hat{\beta}}_i,\hat{\gamma}\right)}_{ZI ML} \) in eq. (8) to estimate the mean expression level per group. We can prove mathematically that this estimate is asymptotically unbiased at large sample size, even though RNASeq low counts are usually a mixture of multiple distributions as previously reported [34,35,36]. Zeroinflated Poisson, NB, or LN likelihood may result in biased estimate for group wise mean gene expression in lncRNA low counts, due to limited mathematical power of these functions. Mathematical proof for unbiased estimate of group wise mean gene expression in lncDIFF is elaborated in the Additional File 2.
To illustrate the estimation accuracy of \( {\left({\hat{\uppi}}_i,{\hat{\beta}}_i,\hat{\gamma}\right)}_{ZI ML} \), we simply generated normalized lncRNA counts for a gene in three biological groups (i.e. groups A, B, C) without covariate effects by sampling from zeroinflated Exponential, NB, LN distributions, respectively. Each scenario contained 1000 replicates. The mean and median of 1000 estimated group effects were listed in Table 3, indicating that the presence of NB and LNdistributed lowcounts did not have impact on the accuracy of group effect estimate in lncDIFF. In other words, lncRNA counts may occasionally deviate from Exponential family but does not affect the performance of lncDIFF. Hence, lncDIFF is a pseudo or quasilikelihood [33] approach rather than a ‘true’ likelihood method for lncRNA low counts analysis.
Detect differential expression by likelihood ratio test
For genes with nonExponential low counts, the group wise mean expression level is independent of variance. Applying lncDIFF to these genes only detects the group effect on mean expression. On the other hand, declaring an Exponentialdistributed lowcounts gene as DE via lncDIFF implies significant group effect on both mean expression and variance, as logmean is always half of logvariance in Exponential family. For differential analysis in lncDIFF, we apply the Likelihood Ratio Test (LRT) to the zeroinflated exponential likelihood function in eq. (5) to test hypothesis: H_{0} : β_{i} = β_{null} vs H_{1} : β_{i} = β_{full}, where β_{null} is the design matrix coefficients with some equal to zero and β_{full} is the coefficients without zero.
The test statistic of LRT is D = − 2L^{∗}(β_{null}) + 2L^{∗}(β_{full}) with β_{null} and β_{full} being the design matrix coefficients for null and alternative models. Let m_{null} and m_{full} be the number of distinct coefficients in β_{null} and β_{full}. Test statistic D asymptotically follows χ^{2} distribution with degrees of freedom m_{full} − m_{null}. The pvalues from LRT are adjusted for multiple testing using the procedure of Benjamin and Hochberg false discovery rate [46]. The choice of link function does not affect the power of LRT, as illustrated by simulation study. An alternative algorithm to compute pvalues for LRT is to use empirical distribution of LRT statistics D [39]. The empirical distribution of statistics D per gene can be generated by randomly shuffling the samples into K groups for P times and then calculate the LRT statistics for each permutation, that is D_{1}, …, D_{P}. Let the test statistics for the true groups be D_{0}, then the empirical pvalue is \( \frac{\sum_{p=1}^P{I}_{\left({D}_p>{D}_0\right)}\ }{P} \), and can be adjusted by Benjamin and Hochberg procedure.
Availability of data and materials
The public TCGA HNSC RNAseq data was retrieved from: https://ibl.mdanderson.org/tanric/_design/basic/download.html
IncDIFF sourcecode can be found at: https://github.com/qianli10000/lncDIFF
Abbreviations
 AUC:

Area under the curve
 CV:

Coefficient of variation
 DE:

Differential expression
 FDR:

False discovery rate
 FPR:

False positive rate
 GLM:

Generalized linear model
 HNSC:

Head and neck squamous cell carcinomas
 HPV:

Human Papillomavirus
 LFC:

Log2 fold change
 LN:

Log Normal
 lncRNA:

Long noncoding RNA
 LRT:

Likelihood Ratio Test
 NB:

Negative Binomial
 RPKM:

Reads Per Kilobase per Million mapped reads
 TCGA:

The Cancer Genome Atlas
 TPR:

True positive rate
 ZIML:

Zeroinflated maximum likelihood
 ZINB:

Zeroinflated negative binomial
References
Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007;316(5830):1484–8.
Batista PJ, Chang HY. Long noncoding RNAs: cellular address codes in development and disease. Cell. 2013;152(6):1298–307.
Guttman M, Rinn JL. Modular regulatory principles of large noncoding RNAs. Nature. 2012;482(7385):339.
Ulitsky I, Bartel DP. lincRNAs: genomics, evolution, and mechanisms. Cell. 2013;154(1):26–46.
Huarte M. The emerging role of lncRNAs in cancer. Nat Med. 2015;21(11):1253.
Chaudhary R, Lal A. Long noncoding RNAs in the p53 network. Wiley Interdiscip Rev: RNA. 2017;8(3):e1410.
Gupta RA, Shah N, Wang KC, Kim J, Horlings HM, Wong DJ, Tsai MC, Hung T, Argani P, Rinn JL. Long noncoding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature. 2010;464(7291):1071.
Li J, Han L, Roebuck P, Diao L, Liu L, Yuan Y, Weinstein JN, Liang H. TANRIC: an interactive open platform to explore the function of lncRNAs in cancer. Cancer Res. 2015;2015:canres. 0273.
Yan X, Hu Z, Feng Y, Hu X, Yuan J, Zhao SD, Zhang Y, Yang L, Shan W, He Q. Comprehensive genomic characterization of long noncoding RNAs across human cancers. Cancer Cell. 2015;28(4):529–40.
Ran D, Daye ZJ. Gene expression variability and the analysis of largescale RNAseq studies with the MDSeq. Nucleic Acids Res. 2017;45(13):e127.
Zhang W, Yu Y, Hertwig F, ThierryMieg J, Zhang W, ThierryMieg D, Wang J, Furlanello C, Devanarayan V, Cheng J, et al. Comparison of RNAseq and microarraybased models for clinical endpoint prediction. Genome Biol. 2015;16(1):133.
Bouckenheimer J, Fauque P, Lecellier CH, Bruno C, Commes T, Lemaître JM, De Vos J, Assou S. Differential long noncoding RNA expression profiles in human oocytes and cumulus cells. Sci Rep. 2018;8(1):2202.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. Limma powers differential expression analyses for RNAsequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.
Assefa AT, De Paepe K, Everaert C, Mestdagh P, Thas O, Vandesompele J. Differential gene expression analysis tools exhibit substandard performance for long noncoding RNAsequencing data. Genome Biol. 2018;19(1):96.
Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNAseq data. BMC Bioinformatics. 2013;14(1):91.
Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014;15(2):R29.
Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert JP. A general and flexible method for signal extraction from singlecell RNAseq data. Nat Commun. 2018;9(1):284.
Miao Z, Deng K, Wang X, Zhang X. DEsingle for detecting three types of differential expression in singlecell RNAseq data. Bioinformatics. 2018;34(18):3223–4.
van de Wiel MA, Neerincx M, Buffart TE, Sie D, Verheul HM. ShrinkBayes: a versatile Rpackage for analysis of countbased sequencing data in complex study designs. BMC Bioinformatics. 2014;15(1):116.
Van De Wiel MA, Leday GGR, Pardo L, Rue H, Van Der Vaart AW, Van Wieringen WN. Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics. 2012;14(1):113–28.
The Cancer Genome Atlas N. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. 2015;517:576.
Tsoi LC, Iyer MK, Stuart PE, Swindell WR, Gudjonsson JE, Tejasvi T, Sarkar MK, Li B, Ding J, Voorhees JJ, et al. Analysis of long noncoding RNAs highlights tissuespecific expression patterns and epigenetic profiles in normal and psoriatic skin. Genome Biol. 2015;16(1):24.
Tang Z, Wu Y, Yang Y, Yang YCT, Wang Z, Yuan J, Yang Y, Hua C, Fan X, Niu G, et al. Comprehensive analysis of long noncoding RNAs highlights their spatiotemporal expression patterns and evolutional conservation in Sus scrofa. Sci Rep. 2017;7:43166.
Li B, Dewey CN. RSEM: accurate transcript quantification from RNASeq data with or without a reference genome. BMC Bioinformatics. 2011;12(1):323.
Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;11(1):422.
Lund Steven P, Nettleton D, McCarthy Davis J, Smyth Gordon K. Detecting differential expression in RNAsequence data using quasilikelihood with shrunken dispersion estimates. Stat Appl Genet Mol Biol. 2012;11:15446115.
Li P, Piao Y, Shon HS, Ryu KH. Comparing the normalization methods for the differential analysis of Illumina highthroughput RNASeq data. BMC Bioinformatics. 2015;16(1):347.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics. 2010;11:94.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010;11(3):R25.
LeónNovelo L, Fuentes C, Emerson S. Marginal likelihood estimation of negative binomial parameters with applications to RNAseq data. Biostatistics. 2017;18(4):637–50.
Robinson MD, Smyth GK. Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008;9(2):321–32.
Hiejima Y. Interpretation of the quasilikelihood via the tilted exponential family. J Japan Stat Soc. 1997;27(2):157–64.
Rathouz PJ, Gao L. Generalized linear models with unspecified reference distribution. Biostatistics. 2009;10(2):205–18.
SIN CY. QMLE of a standard exponential ACD model: asymptotic distribution and residual correlation. Ann Financ Econ. 2014;09(02):1440009.
Jahan F, Siddika B, Islam M. An application of the generalized linear model for the geometric distribution, vol. 16; 2016.
Li Q, NoelMacDonnell JR, Koestler DC, Goode EL, Fridley BL. Subject level clustering using a negative binomial model for small transcriptomic studies. BMC Bioinformatics. 2018;19(1):4741.
Chu C, Fang Z, Hua X, Yang Y, Chen E, Cowley AW, Liang M, Liu P, Lu Y. deGPS is a powerful tool for detecting differential expression in RNAsequencing studies. BMC Genomics. 2015;16(1):455.
Smyth GK. Limma: linear models for microarray data. In: Bioinformatics and computational biology solutions using R and Bioconductor. New York: Springer; 2005. p. 397–420.
Brownlees CT, Cipollini F, Gallo GM. Multiplicative error models; 2011.
Hautsch N. Capturing common components in highfrequency financial time series: a multivariate stochastic multiplicative error model. J Econ Dyn Control. 2008;32(12):3978–4015.
MT A. Predicting and correcting Bias caused by measurement error in line transect sampling using multiplicative error models. Biometrics. 2004;60(3):757–63.
Pierson E, Yau C. ZIFA: dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biol. 2015;16(1):241.
Wu Z, Zhang Y, Stitzel ML, Wu H. Twophase differential expression analysis for single cell RNAseq. Bioinformatics. 2018. https://doi.org/10.1093/bioinformatics/bty329.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995;57:289–300.
Acknowledgments
The authors wish to thank the editor and two anonymous reviewers for helpful comments. The authors would like to thank colleagues at Department of Biostatistics and Bioinformatics at Moffitt Cancer Center for providing feedback. The authors would also like to thank Noah Sulman and Kevin M. Counts at Health Informatics Institute at University of South Florida for providing support in high performance computing.
Funding
XW was supported by Institutional Research Grant number 14–18919 from the American Cancer Society, and the National Cancer Institute grant P50 CA168536 (Moffitt Skin Cancer SPORE). QL was supported in part by the Environmental Determinants of Diabetes in the Young (TEDDY) study, funded by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). No funding body influenced the development of this study or the writing of the manuscript.
Author information
Authors and Affiliations
Contributions
QL, XY, and XW conceived the study. QL proposed the algorithm, provided mathematical proof and programming, and drafted the manuscript. XW designed the analysis method and the software, supervised all aspects of the research, and drafted the manuscript. QL, XY, RC, RJCS, CHC and XW interpreted the results and reviewed/edited the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
Figure S1. Violinbox plots for genewise CV and mean for high, lowabundance mRNA and lncRNA FPKM in TCGA LUSC. Figure S2. Violinbox plots for genewise CV and mean for high, lowabundance mRNA and lncRNA FPKM in TCGA LUAD. Figure S3. Violinbox plots for genewise CV and mean for lowabundance mRNA counts normalized by TMM and UQ methods. Figure S4. Mean FDR for DE analysis on simulated data. Scenarios are in the order of genewise variance scale, from the smallest to the largest. Figure S5. Mean TPR for DE analysis on simulated data. Scenarios are in the order of genewise variance scale from the smallest to the largest. Figure S6. Survival time association with DE genes identified in both paired and unpaired TCGA HNSC tumor vs normal analysis. The 426 tumor samples are divided into two groups by the median of RPKM per gene. (A)(D) are the KaplanMeier survival curves for genes ERVH48–1, LINC00668, HCG22, LINC02582 individually. (DOCX 2191 kb)
Additional file 2:
Supplementary Methods. (DOCX 22 kb)
Additional file 3:
Table S1. mRNA genes highly correlated with each lncRNA DE gene for TCGA HNSC tumor vs normal. (XLS 26 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, Q., Yu, X., Chaudhary, R. et al. lncDIFF: a novel quasilikelihood method for differential expression analysis of noncoding RNA. BMC Genomics 20, 539 (2019). https://doi.org/10.1186/s1286401959264
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286401959264
Keywords
 lncRNA
 Differential analysis
 Quasilikelihood
 Head and neck squamous cell carcinomas