 Methodology article
 Open Access
 Published:
BALLI: Bartlettadjusted likelihoodbased linear model approach for identifying differentially expressed genes with RNAseq data
BMC Genomics volume 20, Article number: 540 (2019)
Abstract
Background
Transcriptomic profiles can improve our understanding of the phenotypic molecular basis of biological research, and many statistical methods have been proposed to identify differentially expressed genes (DEGs) under two or more conditions with RNAseq data. However, statistical analyses with RNAseq data are often limited by small sample sizes, and global variance estimates of RNA expression levels have been utilized as prior distributions for genespecific variance estimates, making it difficult to generalize the methods to more complicated settings. We herein proposed a BartlettAdjusted Likelihoodbased LInear mixed model approach (BALLI) to analyze more complicated RNAseq data. The proposed method estimates the technical and biological variances with a linear mixedeffects model, with and without adjusting small sample bias using Bartlkett’s corrections.
Results
We conducted extensive simulations to compare the performance of BALLI with those of existing approaches (edgeR, DESeq2, and voom). Results from the simulation studies showed that BALLI correctly controlled the type1 error rates at various nominal significance levels and produced better statistical power and precision estimates than those of other competing methods in various scenarios. Furthermore, BALLI was robust to variation of library size. It was also successfully applied to Holstein milk yield data, illustrating its practical value.
Conclusions;
BALLI is statistically more efficient and valid than existing methods, and we conclude that it is useful for identifying DEGs in RNAseq analysis.
Background
Transcriptomic profiles can improve our understanding of the phenotypic molecular basis of biological research, and many attempts have been made to identify differentially expressed genes (DEGs) by microarray analysis. However, microarray analysis often suffers from many systematic errors, such as hybridization and dyebased detection bias, hampering the detection of true DEGs [1, 2]. Recently, highthroughput sequencing technology has markedly improved. RNA sequencing (RNAseq), also called wholetranscriptome shotgun sequencing, uses nextgeneration sequencing to quantify the abundance of transcripts with several desirable features, such as increased dynamic range and freedom from a priori chosen probes [3]. Furthermore, RNAseq is robust against systematic errors and has therefore emerged as a successful alternative to microarray analysis [4].
RNAseq quantifies the numbers of reads aligned to particular transcripts or genes, and various approaches have been proposed to manage RNAseq data [5]. There are two different types of statistical methods: readcountbased approaches and transformationbased approaches. Readcountbased approaches assume that observed read counts follow negative binomial distribution, and generalized linear regression with a logarithm as a link function is utilized. These approaches typically assume that variances include biological and technical variances; the latter indicates variance observed among measurements of the same biological unit, and the former indicates variance between different biological units, such as different subjects or different tissues of the same subject. If technical replicates are analyzed, observed read counts from technical replicates have the same means under the same conditions. Marioni et al. (2008) demonstrated that the data follow a Poisson distribution, and variances in technical replicates are expected to be the same as their means for each gene [6]. However, if biological replicates are available, means and variances of read counts differ among different biological units. Bullard et al. (2010) carefully examined such variability and concluded that the biological variances were usually larger than technical variances, supporting the presence of overdispersion [7]. Thus, negative binomial distribution has often been utilized; edgeR [8] and DESeq2 [9] are such methods. Transformationbased approaches assume that the transformed read counts for each gene follow normal distribution. For example, voom calculated proportions of read counts for each gene per subject, and the logtransformed values were then assumed to follow normal distribution, assuming that the relative proportion of technical variances becomes smaller if the read count grows larger [10].
Negativebinomial distributions for read counts and normal distributions for logtransformation of counts per million (CPM) successfully describe distributions of RNAseq data. However, RNAseq is relatively expensive compared with microarray, and thus, further adjustment has been made to handle the problem of small sample size. If the sample size is small, the estimated variance can have large standard errors, and thus, multiple methods that incorporate prior knowledge have been proposed. For example, variances of read counts are assumed to be positively related to their means, and their relationships can be estimated by comparing the means and variances of read counts for all genes. This can often be utilized to shrink variance parameters [11, 12]. edgeR and DESeq2 estimate the overall dispersion parameter for all genes and are then combined with genewise dispersion parameters for each gene using empirical Bayesian rules. Voom assumes that the variances of logtransformed CPM (logcpm) are functionally related to their means. Locally weighted scatterplot smoothing (LOWESS) curves between the mean and residual variances of genes are then utilized to weight variance estimates of each gene.
Existing methods shrink the variance estimate of each gene toward global variance estimates or use variance estimates based on the relationships between means and variances. Such assumptions are often very useful if the sample sizes are small. However, there are multiple factors that can distort these relationships, and if they are violated, the performance of existing approaches can be affected. For example, the quality of data is highly dependent on the preparation steps, and unexpected noise, such as noise from different storage periods or sequencing organization of samples, can occur during preparation steps. Moreover, read counts of cancer tissues are more heterogeneous than those of normal tissues, and biological variances can be affected by disease status [13]. Thus, their effects can distort the relationship between technical and biological variances. Multiple studies have shown that misspecified relationships can lead to substantial biases [14, 15]. For example, variance estimators for random effects, which are assumed to follow normal distribution, can be seriously biased unless they are normally distributed [14]. General approaches that are not sensitive to these problems are necessary.
In this article, we present new methods for identifying DEGs with RNAseq data, namely, BALLI and LLI. Statistical analyses with logtransformed read counts are often more powerful than other existing methods and are relatively insensitive to various errors [16, 17]. Thus, we considered logcpm as response variables and used linear mixedeffects models to estimate technical and biological variance. Furthermore, Bartlettadjusted likelihood ratio tests were applied to correct the small sample bias [18]. By allowing model comparisons among different models, our models enabled robust analyses for various scenarios. For our simulation studies, artificial RNAseq data were generated based on real data and negative binomial distributions. Our studies showed that the proposed method performed better than existing methods. The proposed methods were applied to Holstein milk yield data at the false discovery rate (FDR)adjusted 0.1 significance level and uniquely produced significant results. The proposed methods were implemented as an R package and are freely downloadable at CRAN (http://cran.rproject.org) or http://healthstat.snu.ac.kr /software/balli/.
Methods
Notations
We assumed that there were M different groups, and the averages of the expressed read counts for each gene were compared among these groups. For casecontrol studies, M = 2. We assumed that there were n_{m} subjects in group m and denoted the total sample size by N; thus, we got N = n_{1} + … + n_{M}. We defined dummy variables for subject i in group m by
A design matrix for group variables is defined by
We assumed that indexes of all subjects were sorted in ascending order of groups. Thus, the first n_{1} subjects were in group 1; the second n_{2} subjects were in group 2; and so on. The effect of continuous variables can also be tested, and then some or all of the columns of the design matrix, X, become their realization. We assumed that expressed read counts were observed for G genes and were denoted by r_{gi} for gene g of subject i (g = 1, …., G). Then, the library size for subject i, R_{i}, was equivalent to \( {R}_i=\sum \limits_g{r}_{gi} \). We transformed count data into the logtransformed read counts per million (logcpm) using the cpm function in the edgeR R package. If we denoted the normalized R_{i} with the trimmed mean of the Mvalue [19] by \( {R}_i^{\ast } \), the logcpm of gene g for subject i were defined by:
and their vector Y_{g} was defined as
Technical and biological variances of y _{gi}
We assumed that r_{gi} followed a negative binomial distribution, and its mean and variance were μ_{gi} and \( {\mu}_{gi}+{\mu}_{gi}^2{\phi}_g \), respectively. If we let the mean and variance of log_{2}r_{gi} be λ_{gi} and \( {s}_{gi}^2 \), respectively, then var(y_{gi}) can be obtained by the first order approximation as follows [10]:
Here, \( {\mu}_{gi}^{1} \) and ϕ_{g} indicate the variances attributable to the technical and biological replicates, respectively. By second order approximation, the technical variance, 1/μ_{gi}, can be expressed in terms of λ_{gi} and \( {s}_{gi}^2 \) as follows:
λ_{gi} and \( {s}_{gi}^2 \) are functionally related, and both were estimated with the method used for voomtrend [10] as follows:

i.
For all genes, g = 1, … , G, fit linear regressions, \( {y}_{gi}={\alpha}_g+{x}_i^t{\beta}_g+{\epsilon}_{gi} \), and calculate \( {\hat{y}}_{gi} \). Residual variances are used as \( {\hat{s}}_g^2 \). If environmental effects affect y_{gi}, then they should be included as covariates.

ii.
Calculate \( {\hat{\lambda}}_g={\overline{y}}_g+{\log}_2\overset{\sim }{R}{\log}_2{10}^6, \) where \( {\overline{y}}_g \) is the average of y_{gi}; \( \overset{\sim }{R} \) is the geometric mean of \( \left({R}_i^{\ast }+1\right) \); and g = 1, … , G.

iii.
For \( \left({\hat{\lambda}}_g,{\hat{s}}_g^2\right),g=1,\dots, G \), obtained from (i) and (ii), fit LOWESS curve \( {\hat{s}}_g^{1/2} \) on \( {\hat{\lambda}}_g \).

iv.
Calculate \( {\hat{\lambda}}_{gi}={\log}_2{\hat{r}}_{gi}={\hat{y}}_{gi}+{\log}_2\left({R}_i^{\ast }+1\right){\log}_2{10}^6 \) and apply LOWESS curve from (iii) to obtain \( {\hat{s}}_{gi}^{1/2} \).

v.
Calculate \( {\hat{\mu}}_{gi} \) by incorporating \( {\hat{\lambda}}_{gi} \) and \( {\hat{s}}_{gi} \) to the following equation:
Linear mixedeffects model
We denoted a design matrix for nuisance effects, including an intercept by Z. We let b_{g} and e_{g} be vectors for random effects and measurement errors, respectively. Denoting a w × w dimensional identity matrix by I_{w}, we considered the following linear mixedeffects model:
Here, ψ_{g}Σ_{g, b} and \( {\boldsymbol{\sigma}}_{\boldsymbol{g}}^{\mathbf{2}} \) indicate technical and biological variances, respectively, according to Eq. (2), and the random effect, b_{g}, and measurement error, e_{g}, were used to model the technical and biological variances, respectively. The proposed linear mixed effects model may be conceptually useful for understanding the variance structure of the RNAseq data. Notably, elements of Σ_{g, b} are obtained from Eq. (3), and ψ_{g} and \( {\boldsymbol{\sigma}}_{\boldsymbol{g}}^{\mathbf{2}}=\Big({\sigma}_{g1}^2,\dots, {\sigma}_{gM}^2 \)) are estimated. Equation (2) shows that ψ_{g} becomes 1, and we assumed that \( {\sigma}_{g1}^2=\dots ={\sigma}_{gM}^2 \). Then, our final model becomes
which is equivalent to
Bartlettadjusted profile likelihood ratio tests
The likelihood ratio test (LRT) is very flexible and can be utilized for various purposes such as testing two or more variables at once. However, statistical analyses with RNAseq data often involve few samples, and the LRT statistic has a bias with order O_{p}(N^{−1}) to its null distribution. In such cases, an adjustment can be applied to reduce the bias, such as the Bartlett adjustment or CoxReid adjustment [18, 20]. As Zucker et al. (2000) showed that the latter estimated p values more conservative than the former did in linear mixed models [21], we selected the Bartlettadjusted LRT for identifying DEGs, which reduces the order of bias to O_{p}(N^{−2}) and controls type1 error rates well when the sample size is small [18]. If we let β_{g} = (β_{g, 1}, … , β_{g, M − 1})^{t}, \( {\mathbf{V}}_{\mathbf{g}}={\boldsymbol{\Sigma}}_{g,\mathbf{b}}+{\sigma}_g^2\mathbf{I} \) and \( {\boldsymbol{\uptheta}}_{\boldsymbol{g}}=\left({\boldsymbol{\upalpha}}_{\boldsymbol{g}},{\sigma}_g^2\right) \), the likelihood for the proposed linear mixed model is
If we let \( {\hat{\boldsymbol{\uptheta}}}_{\boldsymbol{g}0} \) be a maximum likelihood estimate (mle) under the parameter space for the null hypothesis H_{0}: β_{g} = 0, and \( \left({\hat{\uptheta}}_{\boldsymbol{g}},{\hat{\upbeta}}_{\boldsymbol{g}}\right) \) be mles of (θ_{g},β_{g}) under the parameter space for null or alternative hypothesis, the LRT for the null hypothesis H_{0}: β_{g} = 0 can be obtained by
The Bartlettadjusted LRT (\( {LR}_g^{\ast } \)) for gene g can be expressed by
C_{g} can be obtained based on the results of Melo et al. (2009) [22], as follows:
Here, D_{g}, M_{g}, P_{g}, ν_{g}, and τ_{g} are scalars, and if we let X^{′} = [I − Z(Z^{T}V_{g}^{−1}Z)^{−1}Z^{t}V_{g}^{−1}]X and \( {\dot{\mathbf{X}}}^{\prime }=\mathbf{Z}{\left({\mathbf{Z}}^t{{\mathbf{V}}_{\mathbf{g}}}^{\mathbf{1}}\mathbf{Z}\right)}^{\mathbf{1}}{\mathbf{Z}}^t{{\mathbf{V}}_{\mathbf{g}}}^{\mathbf{2}}{\mathbf{X}}^{\prime } \), they are
and
The forms of D_{g}, M_{g}, P_{g}, ν_{g}, and τ_{g} depend on the structure of V_{g}, and counterparts to general V_{g} s are shown in Additional file 1.
Parameter estimation
The loglikelihood function for our final model is given by
\( {\hat{\boldsymbol{\uptheta}}}_g \) and \( {\hat{\boldsymbol{\upbeta}}}_g \) can be estimated by maximizing the loglikelihood function. Then, if we let P = V_{g}^{−1} − V_{g}^{−1}(Z, X)((Z, X)^{t}V_{g}^{−1}(Z, X))(Z, X)^{t}V_{g}^{−1}, the profile loglikelihood of \( {\sigma}_g^2 \) becomes
Here, V_{g} is a function of \( {\sigma}_g^2 \), and \( {\hat{\sigma}}_g^2 \) can be obtained by maximizing \( {l}_P\left({\sigma}_g^2\right) \) with Fisher’s scoring method. \( {\hat{\sigma}}_g^{2(m)} \) at the m step was updated by
, where \( {I}_{o\left(m1\right)}=E\left(\frac{\partial^2{l}_P}{\partial {\hat{\sigma}}_g^{2\left(m1\right)}\partial {{\hat{\sigma}}_g^{2\left(m1\right)}}{~}^T}\right)\ \mathrm{and}\ U\left({\hat{\sigma}}_g^{2\left(m1\right)}\right)=\frac{\partial {l}_P}{\partial {\hat{\sigma}}_g^{2\left(m1\right)}} \). We found that Fisher’s scoring method was sometimes unsuccessful for estimating \( {\hat{\sigma}}_g^{2(m)} \), and in such cases, we used Brent’s derivativefree method with the optimize function in R [23]. It always converged and successfully estimated parameters, at least in our simulation. We assumed that \( {\hat{\sigma}}_g^2 \) was nonnegative. \( {\hat{\mathbf{V}}}_{\mathbf{g}}={\boldsymbol{\Sigma}}_{g,\mathbf{b}}+{\hat{\sigma}}_g^2\mathbf{I} \), and if \( {\hat{\sigma}}_g^2 \) is equal to zero, \( {\hat{\mathbf{V}}}_{\mathbf{g}} \) becomes Σ_{g, b}. Then, \( {\hat{\upbeta}}_{\boldsymbol{g}} \) and \( {\hat{\upalpha}}_{\boldsymbol{g}} \) can be obtained by
The Bartlettadjusted LRT requires \( {\hat{\boldsymbol{\uptheta}}}_{\boldsymbol{g}0} \), which maximizes the likelihood under the null hypothesis. Under the null hypothesis, if we let P_{0} = V_{g}^{−1} − V_{g}^{−1}Z(Z^{t}V_{g}^{−1}Z)Z^{t}V_{g}^{−1}, the profile loglikelihood of \( {\sigma}_g^2 \) becomes
\( {\hat{\sigma}}_{g0}^2 \) was estimated with Fisher’s scoring method, and if we incorporated \( {\hat{\sigma}}_{g0}^2 \) to V_{g0} and denoted it as \( {\hat{\mathbf{V}}}_{\mathbf{g0}} \), \( {\hat{\boldsymbol{\upalpha}}}_{\mathbf{g}} \) could be obtained by
Datasets
We considered two real datasets consisting of samples from unrelated Nigerian people and Holstein cattle, respectively. Nigerian subjects participated in the International HapMap Project and comprised 29 male and 40 female participants [24]. The read counts were downloaded from the ReCount website [25]. Holstein data were obtained to identify genes associated with milk yield and consisted of high and lowmilkyielding groups, with nine and 12 samples per group, respectively [26]. Furthermore, parity and lactation periods were available and were considered as covariates. We obtained the read counts from Gene Expression Omnibus (GSE60575). Based on count data, we generated simulation data; the steps are described in Additional file 2.
Results
Simulation studies with RNAseq data from Nigerian individuals
We applied the proposed linear mixed models to the simulated data based on Nigerian individuals’ RNAseq data and calculated empirical type1 error rates and statistical powers with these models. The data were then compared with DESeq2 (v1.20.0), edgeR (v3.22.3), and voom (v3.36.5). Table 1, Additional file 3, and Fig. 1 show results from simulation studies based on Nigerian individuals’ RNAseq data. Nigerian individuals’ RNAseq data consisted of 52,580 genes, and after filtering genes whose total read counts across samples were smaller than onetenth of the sample size, each replicate had around 10,000–10,500 genes. Empirical type1 error rates and powers were estimated with 50 replicates. Table 1 and Additional file 3 assumed δ = 0, and thus, their estimates indicated the empirical type1 error rates. For the proposed methods, we assumed that ψ_{g} = 1 and \( {\sigma}_{g1}^2={\sigma}_{g2}^2 \), and the proposed methods with and without Bartlett’s corrections are denoted as BALLI and LLI, respectively, for the remainder of this article. According to Table 1 and Additional file 3, BALLI and voom always controlled the nominal type1 error rates correctly if N was greater than or equal to 12. LLI also successfully controlled the nominal type1 error rates if N was greater than or equal to 20. However, if N was less than 20, p values by LLI were inflated. edgeR showed the least performance, and the estimated type1 error rates were always inflated at 0.05, 0.01, and 0.005 nominal significance levels. Interestingly, DESeq2 tended to be conservative at 0.1 and 0.05 but liberal at 0.01 and 0.005 nominal significance levels. Thus, we could conclude that the proposed linear mixed model with Bartlett’s correction reasonably controlled the type1 error, and Bartlett’s correction was required if the sample size was less than 20.
Figure 1 shows estimated powers and precisions at the FDRadjusted 0.1 significance level when δ = 0.5σ or 1σ, and N = 12, 16, 20, 24, 28, 40, 64, or 68. Figure 1a and c show the statistical power estimates, and Fig. 1b and d show the precision. Precision indicates the proportions of DEGs among genes for which FDRadjusted p values are less than 0.1. According to Fig. 1a and c, LLI outperformed other methods in terms of power (Fig. 1a and c). However, it should be noted that this method showed worse precision than BALLI and voom if N was less than 40 (Fig. 1b and d), suggesting that LLI had larger falsepositive rates than those of BALLI and voom when N was less than 40. The precision of LLI was increased if N was sufficiently large. In terms of both power and precision, the best performance was always obtained by BALLI. For example, when N = 20 and δ = σ, the estimated power of BALLI was 0.597, followed by voom (0.548) and DESeq2 (0.399). The estimated precision of BALLI was 0.940, and those of voom and DESeq2 were 0.930 and 0.907, respectively. If N = 40 and δ = 0.5σ, the estimated power and precision of BALLI were 0.342 and 0.943, which were higher than those of DESeq2 (0.205, 0.897) and voom (0.298, 0.932).
Our method was also applied to simulation data based on RNAseq data from Holstein cows. The results were similar to those of the simulation data based on Nigerian people’s data. Additional file 4 shows that BALLI and LLI controlled the nominal type1 error rates if N ≥ 8 and if N ≥ 20, respectively. Power estimates were highest for LLI, but it always had a smaller precision value than those of BALLI and voom (Additional file 5).
Simulation studies with randomly generated RNAseq data
RNAseq data are generally known to follow negative binomial distribution, and we conducted simulation studies with RNAseq data generated from negative binomial distributions. First, we assumed that library sizes were the same among subjects. The overall trend of the estimated type1 error rate was similar to that of simulation studies based on real RNAseq data. Estimated type1 error rates by voom and BALLI usually maintained the nominal significance levels (Table 2 and Additional file 6). P values obtained from LLI and edgeR tended to be inflated. DESeq2 generally showed deflation of type1 error rates at a 0.1 nominal significance level. Second, we considered the effects of library size variance on statistical analyses. Data with unequal library sizes among subjects were generated by negative binomial distribution whose mean parameters (a_{gi}) were the product of mean estimates, under the equal library size assumption, and random numbers from U(u, 2 − u), where u = 0.2, 0.4, 0.6, 0.8, or 1, and dispersion parameters were estimated from \( {\left(0.2+{a}_{gi}^{1/2}\right)}^2\times {\delta}_g \), where \( 40/{\delta}_g\sim {\chi}_{40}^2 \). If u became smaller, the library size had larger variances. Figure 2 shows the estimated type1 error rates at the 0.05 significance level according to different choices of u. Figure 2 shows that voom was sensitive to the amount of library size variation and became conservative in the context of large library size variation. Compared with voom, BALLI and LLI were robust with regard to u. The estimated type1 error rates of LLI were affected by sample size. If N was larger than or equal to 40, LLI controlled the type1 error rates the most correctly and was not affected by the library size variation. BALLI was slightly conservative, but the amount remained constant. Results at the 0.005 significance level are provided in Additional file 7, and the general pattern was similar to that in Fig. 2, except that DESeq2 was conservative at a significance level of 0.05 and liberal at 0.005 (Additional file 7). Therefore, we could conclude that the performances of BALLI and LLI were robust, and we recommend using BALLI if 10 ≤ N ≤ 40, and LLI if N > 40.
Figure 3 and Additional file 8 show the estimated statistical powers and precision according to different choices of u. BALLI usually had the best estimated power and precision, as was observed in simulation studies based on real RNAseq data. For example, when u = 0.2, N = 28, and δ = 0.5σ, the estimated power by BALLI was 0.438, whereas those for DESeq2 and voom were 0.317 and 0.352, respectively (Fig. 3a). Results when u = 0.2 and δ = 1σ in Fig. 3c are very similar as those for Fig. 3a. Figure 3b and d also show that BALLI achieves the best estimated precisions. Similar patterns were observed when u = 0.4, 0.6, 0.8, and 1 (Additional file 8). In summary, we can conclude that BALLI shows better performance than that of other methods.
DEGs of Holstein milk data
Holstein milk data of 21 Holstein cows were generated to detect genes related to the daily productivity of milk. High and low milk yields were considered the primary exposure variables, and parity and lactation period were included as covariates, as in the original study [26]. In this study, 12 tentative DEGs were chosen and technically validated using quantitative realtime polymerase chain reaction (qRTPCR). qRTPCR was conducted with QuantiTect SYBR Green RTPCR Master Mix (Qiagen, Valencia, CA, USA), and a 7500 Fast Sequence Detection System (Applied Biosystems, Foster City, CA, USA) was used to confirm whether the 12 tentative genes were true DEGs. Among the 12 genes, nine (TOX4, HNRNPL, SPTSSB, NOS3, C25H16orf88, KALRN, SLC4A1, NLN, and PMCH) were significantly validated. According to Seo et al. (2016), however, no DEGs, including the nine genes, were found at an FDR 0.1 significance level by DESeq2, voom, and their methods owing to the lack of statistical power [26]. Our proposed methods and existing methods (DESeq2, edgeR, and voom) were applied to the data analysis. LLI only detected significant differences for the TOX4 gene between the high and low milk yield groups at the FDRadjusted 0.1 significance level, but other methods did not detect any significant genes. The FDRadjusted p value of TOX4 by BALLI was 0.1267, which was much smaller than those of DESeq2, edgeR, and voom. Table 3 shows p values for the nine genes, including TOX4. P values for most of the nine genes obtained by LLI and BALLI were smaller than those obtained by other methods. The nine genes were not significantly affected by parity or lactation period (Additional file 9). We also analyzed all genes by the proposed methods; Fig. 4 shows the number of genes that were significant at the 0.001 nominal significance level. There were no DEGs that were commonly significant only for all existing methods (DESeq2, edgeR and voom). Three genes, including HNRNPL, were detected as DEGs by only BALLI and LLI (Fig. 4). Table 4 shows 12 genes that were commonly significant by BALLI, DESeq2, edgeR, LLI, and voom at the 0.005 nominal significance level. Of the 12 genes, all genes except C4BPA had the lowest p values in LLI, and three genes had lower p values in BALLI than in DESeq2, edgeR, and voom. This analysis with BALLI took 153.35 s using an Intel Xeon E7–4820 2.00 GHz processor, which was quite a bit longer than other methods, voom (2.81 s), DESeq2 (27.21 s) and edgeR (33.25 s) (Additional file 10). However, BALLI can conduct the multithreaded analyses with a simple option unlike other methods and analyses of whole genes can be completed within a reasonable time. For example, the analyses with 20 cores took only 11.99 s under the same conditions (Additional file 10). Simulation studies revealed that LLI tended to be liberal, with the results likely to be inflated. However, BALLI controlled the nominal significance level, and p values by BALLI were expected to be statistically valid. Therefore, we concluded that the proposed method, BALLI, worked well for real data analysis.
Discussion
In this article, we suggested new methods, designated BALLI and LLI, for identifying DEGs with RNAseq data. We assumed that logcpm values of read counts asymptotically followed normal distributions, and the linear mixedeffects model with Bartlett’s correction was proposed. The proposed methods were compared with existing methods, such as DESeq2, edgeR, and voom, with extensive simulation studies. According to our results, negativebinomialbased approaches often failed to preserve the nominal type1 error rates. For example, p values from edgeR were inflated. DESeq2 tended to be conservative and suffered from large falsenegative rates. However, the proposed method with Bartlett’s correction, BALLI, preserved the nominal type1 error rates and was the most powerful method other than LLI. Unless sample sizes were small, LLI controlled the type1 error rates as well and was the most powerful method. Therefore, we recommend using LLI if the sample size is sufficiently large (e.g., larger than 40); otherwise, it is better to use BALLI.
Furthermore, we evaluated the effects of library size variations on statistical analyses. We found that library size variance could affect the estimated type1 error rates, and the effect was the largest for voom. Library sizes are affected by multiple factors, such as the amount of mRNA and the sequencing instrument, which can generate substantial variation among library sizes for subjects. Our simulation studies showed that BALLI was robust with regard to library size variation in samples of various sizes and was a reasonable choice if large library size variance was observed.
The proposed methods assumed that logcpm values of read counts asymptotically followed a normal distribution and that their variances were approximately equal to 1/μ + ϕ with first order approximation. In addition, voom considered logcpm value as a response and assumed that they were normally distributed. However, our simulation studies revealed the superiority of the proposed methods compared with voom, which was found to be attributable to their different variance structures. For the proposed methods, 1/μ + ϕ was derived from the first order approximation of the negative binomial distribution and thus may be a natural assumption for RNAseq data. Furthermore, for 1/μ + ϕ, ϕ obviously indicates the overdispersion parameter, and biological and technical variances can be estimated with BALLI. However, voom assumes ϕ/μ, and the amount attributable to biological or technical variances cannot be clearly defined.
We also suggested the most flexible and general linear mixed model for logcpm. The proposed model assumed that the variance of logcpm was φ/μ + ϕ and had the most generalized variance parameter space. Incorporation of φ = 1 yielded BALLI and LLI, and ϕ = 0 yielded voom. We found that BALLI was the most efficient in the considered scenarios; however, in real data analyses, various factors affected variance structure. For example, subjects with different ethnicities can cause φ to be larger than 1, and thus, a better model may differ according to RNAseq data. φ and ϕ can be estimated with the proposed linear mixed model by implementing only a simple modification, and thus, we can choose the best model using AIC or LRTs. The selected models can then be utilized to identify DEGs. This model was implemented as an R package and can be downloaded from CRAN (http://cran.rproject.org) or http://healthstat.snu.ac.kr/software/balli/. Furthermore, in contrast to methods based on a generalized linear mixed model such as MACAU [27], the proposed methods can be easily extended to various scenarios with a simple modification. For example, repeatedly observed data or multivariate phenotypes can be analyzed by adding some random effects. Maximizing the likelihood for negative binomial or Poisson distributions with random effects is computationally intensive, but the proposed methods can easily obtain variance parameter estimates using existing R packages, such as lme4 and nlme.
With simulation studies for various scenarios, we showed that the proposed methods were usually the most efficient. However, Bartlett’s correction seems inappropriate when N ≤ 6 (Additional files 3, 4, and 6) and the corrected LR statistic was smaller than expected in some cases, especially in simulation studies based on a negative binomial distribution (Table 2). Further studies are necessary to adjust this. Additionally, results from simulation studies obviously depended on various factors. Our results were obtained from simulation data based on RNAseq data from Nigerian individuals and Holstein cows RNAseq data and random samples from negative binomial distributions, but any systematic differences in RNAseq data could generate different results, depending on sequencing errors or differences in preparation steps. Multiple studies have revealed some possible differences in these relationships, and our conclusions based on simulation studies could be limited to the considered scenarios. However, despite such limitations, we believe that our results illustrate the practical value of the proposed methods. Further studies are needed to confirm our findings and expand on the work presented herein.
Conclusions
In this article, we proposed likelihoodbased linear mixed model approaches with and without Bartlett’s correction to analyze more complicated RNAseq data. The proposed methods consider logcpm values of genes as response variables, and technical and biological variances are estimated with a linear mixed model. According to our simulation studies and real data analysis, our methods are statistically more efficient than existing methods and correctly control the type1 error rates. We found that the statistical performance of our method, BALLI and LLI, depends on the sample size; we recommend using LLI if the sample size is larger than 40 and otherwise using BALLI.
Availability of data and materials
We downloaded Nigerian samples from ReCount website (http://bowtiebio.sourceforge.net/recount/countTables/montpick_count_table.txt) and Holstein samples from Gene Expression Omnibus (GSE60575).
Abbreviations
 BALLI:

BartlettAdjusted Likelihood based LInear mixed model approach
 CPM:

Counts Per Million
 DEGs:

Differentially Expressed Genes
 FDR:

False Discovery Rate
 LogCPM:

Logtransformed CPM
 LOWESS:

Locally WEighted Scatterplot Smoothing
 LRT:

Likelihood Ratio Test
 MLE:

Maximum Likelihood Estimate
 qRTPCR:

quantitative Real Time Polymerase Chain Reaction
 RNAseq:

RNA sequencing
References
 1.
Okoniewski MJ, Miller CJ. Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinf. 2006;7:276.
 2.
Dobbin KK, Kawasaki ES, Petersen DW, Simon RM. Characterizing dye bias in microarray experiments. Bioinformatics. 2005;21(10):2430–7.
 3.
Zhao S, FungLeung WP, Bittner A, Ngo K, Liu X. Comparison of RNASeq and microarray in transcriptome profiling of activated T cells. PLoS One. 2014;9(1):e78644.
 4.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008;5(7):621–8.
 5.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, et al. A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2013;14(6):671–83.
 6.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17.
 7.
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinf. 2010;11(1):94.
 8.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
 9.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15(12):550.
 10.
Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014;15(2):R29.
 11.
Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–7.
 12.
Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001;98(9):5116–21.
 13.
McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNASeq experiments with respect to biological variation. Nucleic Acids Res. 2012;40(10):4288–97.
 14.
Litière S, Alonso A, Molenberghs G. The impact of a misspecified randomeffects distribution on the estimation and the performance of inferential procedures in generalized linear mixed models. Stat Med. 2008;27(16):3125–44.
 15.
Chavance M, Escolano S. Misspecification of the covariance structure in generalized linear mixed models. Stat Methods Med Res. 2016;25(2):630–43.
 16.
Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNAseq studies. Brief Bioinform. 2015;16(1):59–70.
 17.
Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNAseq data. BMC Bioinf. 2013;14:91.
 18.
Bartlett MS. Properties of sufficiency and statistical tests. Proc R Soc Lond A Math Phys Sci. 1937;160(901):268–82.
 19.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010;11(3):R25.
 20.
Cox DR, Reid N. Parameter orthogonality and approximate conditional inference. J R Stat Soc Ser B Methodol. 1987;49(1):1–18.
 21.
Zucker DM, Lieberman O, Manor O. Improved small sample inference in the mixed linear model: Bartlett correction and adjusted likelihood. J R Stat Soc Series B Stat Methodol. 2000;62(4):827–38.
 22.
Melo TF, Ferrari SL, CribariNeto F. Improved testing inference in mixed linear models. Comput Stat Data Anal. 2009;53(7):2573–82.
 23.
Brent RP. Algorithms for minimization without derivatives; 1973.
 24.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010;464(7289):768–72.
 25.
Frazee AC, Langmead B, Leek JT. ReCount: a multiexperiment resource of analysisready RNAseq gene count datasets. BMC Bioinf. 2011;12:449.
 26.
Seo M, Kim K, Yoon J, Jeong JY, Lee HJ, Cho S, Kim H. RNAseq analysis for detecting quantitative traitassociated genes. Sci Rep. 2016;6:24375.
 27.
Sun S, Hood M, Scott L, Peng Q, Mukherjee S, Tung J, Zhou X. Differential expression analysis for RNAseq using Poisson mixed models. Nucleic Acids Res. 2017;45(11):e106.
Acknowledgements
Sungho Won and Taesung Park contributed equally to this work as corresponding authors.
Funding
This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI16C2037) and the National Research Foundation of Korea (2017M3A9F3046543). The funding body was not involved in the study design, data collection, analysis and interpretation, and writing the manuscript.
Author information
Affiliations
Contributions
KP, TP, and SW designed and led the study, and wrote the manuscript. KP developed BALLI R package. JA, JG, MS, and WL advised on statistical methods and data analysis. All authors have read and approved the final version of the manuscript.
Corresponding author
Correspondence to Sungho Won.
Ethics declarations
Ethics approval and consent to participate
Data used in this article were publicly available.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
The forms of C_{g}'s components depending on the structure of V_{g}. (DOCX 17 kb)
Additional file 2:
Steps of generating simulation data. (DOCX 18 kb)
Additional file 3:
Estimated type1 error rates with simulation data for N = 4, 6, 8, 28, 40, 64, and 68 based on Nigerian people’s data. (DOCX 21 kb)
Additional file 4:
Estimated type1 error rates with simulation data for N = 4, 6, 8, 12, 16 and 20 based on Holstein cow’s data. (DOCX 20 kb)
Additional file 5:
Estimated powers and precisions with simulation data when δ = 0.5σ or 1σ and N = 12, 16 and 20 based on Holstein cow’s data. (DOCX 197 kb)
Additional file 6:
Estimated type1 error rates with simulation data for N = 4, 6, 8, 28, 40, 64, and 68 based on simulated data from negative binomial distribution. (DOCX 21 kb)
Additional file 7:
Effect of varying library sizes on the type1 error rates when u = 0.2, 0.4, 0.6, 0.8, and 1 and N = 12, 16, 20, 24, 28, 40, 64, or 68 at the 0.005 nominal significance level. (DOCX 371 kb)
Additional file 8:
Effect of varying library sizes on the statistical power and precision when u = 1, 0.8, 0.6, or 0.4, δ = 1σ and N = 12, 16, 20, 24, 28, 40, 64, or 68. (DOCX 593 kb)
Additional file 9:
Analysis of covariables, parity and lactation period, for true nine DEGs of Holstein cow's data. (DOCX 18 kb)
Additional file 10:
Computation time for each method when analyzing Holstein cow’s data. (DOCX 14 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Differentially expressed genes
 RNA sequencing
 Linear mixed model
 Bartlett’s correction