BALLI: Bartlett-Adjusted Likelihood-based LInear Model Approach for Identifying Differentially Expressed Genes with RNA-seq Data

Motivation: Transcriptomic profiles can improve our understanding of the molecular basis of phenotypes, and many statistical methods have been proposed to identify differentially expressed genes under two or more conditions with RNA-seq data. However, statistical analyses with RNA-seq data often suffer from small sample sizes, and global variance estimates of RNA expression levels have been utilized as prior distributions for gene-specific variance estimates, making it difficult to generalize the methods to more complicated settings. We herein propose a Bartlett-Adjusted Likelihood-based LInear mixed model approach (BALLI) to analyze more complicated RNA-seq data. The proposed method estimates the technical and biological variances with a linear mixed effects model, with and without adjusting for small-sample bias using Bartlett's corrections.
Results: We conducted extensive simulations to compare the performance of BALLI with those of existing approaches (edgeR, DESeq2, and voom). Results from the simulation studies showed that BALLI correctly controlled the type-1 error rates at the various nominal significance levels and produced better statistical power and precision than the competing methods in various scenarios. Furthermore, BALLI was robust to variation of library size. It was also successfully applied to Holstein milk yield data, illustrating its practical value.
Availability and Implementation: BALLI is implemented as an R package and freely available at http://healthstat.snu.ac.kr/software/balli/.
Contact: won1@snu.ac.kr
Supplementary Information: Supplementary data are available at Bioinformatics online.


INTRODUCTION
Transcriptomic profiles can improve our understanding of the molecular basis of phenotypes, and many attempts have been made to identify differentially expressed genes (DEGs) by microarray analysis. However, microarray analysis often suffers from systematic errors, such as hybridization and dye-based detection bias, hampering the detection of true DEGs (Dobbin, et al., 2005; Okoniewski and Miller, 2006). Recently, high-throughput sequencing technology has markedly improved. RNA sequencing (RNA-seq), also called whole-transcriptome shotgun sequencing, uses next-generation sequencing to quantify the abundance of transcripts and offers several desirable features, such as an increased dynamic range and freedom from a priori chosen probes (Zhao, et al., 2014). Furthermore, RNA-seq is robust against systematic errors and has therefore emerged as a successful alternative to microarray analysis (Mortazavi, et al., 2008).
RNA-seq quantifies the numbers of reads aligned to particular transcripts or genes, and various approaches have been proposed to manage the RNA-seq data (Dillies, et al., 2013).
There are two different types of statistical methods: read-count-based approaches and transformation-based approaches. Read-count-based approaches assume that observed read counts follow a negative binomial distribution, and generalized linear regression with a logarithm link function is utilized. These approaches typically assume that variances include biological and technical components; the latter is the variance observed among measurements of the same biological unit, and the former is the variance between different biological units, such as different subjects or different tissues of the same subject. If technical replicates are analyzed, observed read counts from technical replicates have the same means under the same conditions. Marioni et al. (2008) demonstrated that such data follow a Poisson distribution, so the variances in technical replicates are expected to equal their means for each gene (Marioni, et al., 2008). However, if biological replicates are available, the means and variances of read counts differ among biological units. Bullard et al. (2010) carefully examined such variability and concluded that biological variances were usually larger than technical variances, supporting the presence of overdispersion (Bullard, et al., 2010). Thus, the negative binomial distribution has often been utilized; edgeR (Robinson, et al., 2010) and DESeq2 (Love, et al., 2014) are such methods.
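The overdispersion argument above can be checked numerically. The sketch below simulates biological replicates for a single gene from a negative binomial distribution; the mean and dispersion values are illustrative, not taken from any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, phi = 100.0, 0.2     # illustrative mean and dispersion for one gene
n_rep = 100_000          # biological replicates to simulate

# Under the negative binomial model, Var(r) = mu + phi * mu^2 > mu.
# numpy parameterizes NB by (size, p); convert from (mu, phi):
nb_size = 1.0 / phi
p = nb_size / (nb_size + mu)
counts = rng.negative_binomial(nb_size, p, n_rep)

# The sample variance clearly exceeds the sample mean (overdispersion),
# unlike the Poisson case where the two would coincide.
overdispersion = counts.var() / counts.mean()
```

With these values the theoretical variance is mu + phi * mu^2 = 2100, a variance-to-mean ratio of roughly 21, which a Poisson model (ratio 1) cannot accommodate.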
Transformation-based approaches assume that transformed read counts for each gene follow a normal distribution. For example, voom calculates the proportion of read counts for each gene per subject and assumes that the log-transformed values follow a normal distribution, with the relative proportion of technical variance becoming smaller as the read count grows larger (Law, et al., 2014).
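The log-cpm quantity used throughout can be computed as below. The 0.5 and +1 offsets follow the form published with voom; treat this as an illustration rather than the package's exact implementation.

```python
import numpy as np

def log_cpm(counts):
    """Log2 counts-per-million for a genes x samples count matrix.

    Uses the offset form from the voom paper:
    log2((r + 0.5) / (R + 1) * 1e6), where r is a read count and
    R is the sample's library size.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=0)              # per-sample library sizes
    return np.log2((counts + 0.5) / (lib_size + 1.0) * 1e6)

# Two genes, three samples with unequal library sizes; each gene keeps
# the same within-sample proportion, so its log-cpm is nearly constant.
y = log_cpm([[10, 20, 100],
             [90, 180, 900]])
```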
Negative binomial distributions for read counts and normal distributions for log-transformed counts per million (CPM) successfully describe the distributions of RNA-seq data. However, RNA-seq is relatively expensive compared with microarray, and thus further adjustments have been made to handle the problem of small sample size. If the sample size is small, the estimated variance can have a large standard error, and multiple methods that incorporate prior knowledge have therefore been proposed. For example, the variances of read counts are assumed to be positively related to their means, and their relationship can be estimated by comparing the means and variances of read counts across all genes. This is often utilized to shrink variance parameters (Robinson and Smyth, 2007; Tusher, et al., 2001). edgeR and DESeq2 estimate an overall dispersion parameter across all genes, which is then combined with gene-wise dispersion parameters using empirical Bayesian rules. voom assumes that the variances of log-transformed CPM (log-cpm) are functionally related to their means.
A locally weighted scatterplot smoothing (LOWESS) curve of the residual variances against the mean expressions of genes is then utilized to weight the variance estimate of each gene.
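The mean-variance weighting idea can be sketched without the LOWESS machinery itself: sort genes by mean expression, smooth the residual standard deviations along that ordering, and weight each gene by the inverse squared trend. The running-mean smoother below is a dependency-free stand-in for the LOWESS fit, and all numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes = 2000

# Synthetic per-gene summaries with a decreasing mean-variance trend,
# mimicking the typical RNA-seq pattern (noisier at low expression).
mean_logcpm = rng.uniform(0.0, 12.0, n_genes)
resid_sd = 1.5 / np.sqrt(1.0 + mean_logcpm) + rng.normal(0.0, 0.05, n_genes)

# Smooth sd against mean expression; voom fits a LOWESS curve here,
# and a centered running mean is used as a simple stand-in.
order = np.argsort(mean_logcpm)
window = 101
trend = np.convolve(resid_sd[order], np.ones(window) / window, mode="same")

# Precision weights: genes at noisy expression levels are down-weighted
# by the inverse of the squared smoothed standard deviation.
weights = 1.0 / trend ** 2
```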
Existing methods shrink the variance estimate of each gene toward global variance estimates or use variance estimates based on the relationships between means and variances.
Such assumptions are often very useful if the sample sizes are small. However, there are multiple factors that can distort these relationships, and if they are violated, the performance of existing approaches can be affected. For example, the quality of data is highly dependent on the preparation steps, and unexpected noise, such as noise from different storage periods or sequencing organization of samples, can occur during preparation steps. Moreover, read counts of cancer tissues are more heterogeneous than those of normal tissues, and biological variances can be affected by disease status (McCarthy, et al., 2012). Thus, their effects can distort the relationship between technical and biological variances. Multiple studies have shown that misspecified relationships can lead to substantial biases (Chavance and Escolano, 2016;Litière, et al., 2008). For example, variance estimators for random effects, which are assumed to follow a normal distribution, can be seriously biased unless they are normally distributed (Litière, et al., 2008). General approaches that are not sensitive to those problems are necessary.
In this article, we present new methods for identifying DEGs with RNA-seq data, BALLI and LLI. Statistical analyses with log-transformed read counts are often more powerful than other existing methods and are relatively insensitive to various errors (Seyednasrollah, et al., 2015; Soneson and Delorenzi, 2013). Thus, we consider the log-cpm values as response variables and use linear mixed effects models to estimate technical and biological variances.
Furthermore, Bartlett-adjusted likelihood ratio tests were applied to correct the small-sample bias (Bartlett, 1937). By allowing comparisons among different models, our approach enables robust analyses for various scenarios. For our simulation studies, artificial RNA-seq data were generated based on real data and negative binomial distributions. Our studies showed that the proposed methods performed better than existing methods. The proposed methods were applied to Holstein milk yield data at the false discovery rate (FDR)-adjusted 0.1 significance level and uniquely produced significant results. The proposed methods were implemented as an R package and are freely downloadable at http://healthstat.snu.ac.kr/software/balli/.

Notations
We assumed that there were M different groups, and the averages of the expressed read counts for each gene were compared among these groups. For case-control studies, M = 2. We assumed that there were n_m subjects in group m and denoted the total sample size by N. We assumed that the indexes of all subjects were sorted in ascending order of groups. Thus, the first n_1 subjects were in group 1, the second n_2 subjects were in group 2, and so on. We assumed that expressed read counts were observed for G genes.

Linear mixed effect model
We denoted the design matrix for nuisance effects, including an intercept, by Z. We let b_g and e_g be the vectors of random effects and measurement errors, respectively. Denoting the w × w identity matrix by I_w, we considered the following linear mixed effects model:
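The model display itself did not survive extraction; the generic form below is a reconstruction consistent with the definitions in this section, with the group-effect design matrix X, the coefficient vectors, and the variance components σ²_{bg}, σ²_{eg} labeled by assumption:

```latex
\mathbf{y}_g = X\boldsymbol{\beta}_g + Z\boldsymbol{\alpha}_g + \mathbf{b}_g + \mathbf{e}_g,
\qquad
\mathbf{b}_g \sim N\!\left(\mathbf{0},\ \sigma^2_{bg} I_N\right),
\quad
\mathbf{e}_g \sim N\!\left(\mathbf{0},\ \sigma^2_{eg} I_N\right),
```

where y_g denotes the vector of log-cpm values for gene g across the N subjects, β_g the group (DEG) effects, and α_g the nuisance effects.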

Bartlett-adjusted profile likelihood ratio tests
Statistical analyses with RNA-seq data often use small samples, and we selected the Bartlett-adjusted likelihood ratio test for identifying DEGs. Bartlett's adjustment brings the likelihood ratio statistic closer to its null distribution by reducing the order of the approximation error from O(N^-1) to O(N^-2), and it controls the type-1 error rates well when the sample size is small (Bartlett, 1937).
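The paper's analytic correction requires model-specific expectations of the likelihood ratio statistic; the sketch below only illustrates the principle behind Bartlett's adjustment, using a textbook variance test and a Monte Carlo estimate of the scaling factor rather than BALLI's closed-form one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 8, 20_000          # small sample, many null replicates

# Likelihood ratio test of H0: sigma^2 = 1 for N(0, sigma^2) data:
#   T = n * (s - 1 - log s),  where s = mean(x_i^2) is the MLE.
x = rng.standard_normal((reps, n))
s = (x ** 2).mean(axis=1)
T = n * (s - 1.0 - np.log(s))

# Under H0, E[T] = df * (1 + b/n) + O(n^-2) for some constant b;
# Bartlett's adjustment divides T by this factor so that the corrected
# statistic matches its chi-square (df = 1) reference more closely.
bartlett_factor = T.mean()          # Monte Carlo estimate of E[T] / df
T_adj = T / bartlett_factor
```

For this particular test E[T] ≈ 1 + 1/(3n), so with n = 8 the uncorrected statistic is inflated by roughly 4% and naive chi-square p-values are slightly anti-conservative; the rescaled statistic removes that inflation.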

Parameter estimation
The log-likelihood function for our final model was maximized with Fisher's scoring method (Viechtbauer, 2007). We found that Fisher's scoring method was sometimes unsuccessful for estimating the gene-wise variance components, and in such cases, we used Brent's derivative-free method (Brent, 1973) with the optimize function in R.
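The fallback strategy can be sketched as follows. The objective here is a stand-in one-dimensional profile log-likelihood (an i.i.d. normal sample with the mean profiled out), since the paper's own likelihood did not survive extraction; SciPy's bounded minimizer plays the role of R's optimize, both being Brent-type derivative-free searches.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
y = rng.normal(0.0, 2.0, size=500)        # toy data, true variance 4

def neg_profile_loglik(sigma2):
    """Negative profile log-likelihood in sigma^2 for an i.i.d. normal
    sample, with the mean profiled out at its MLE (stand-in objective)."""
    r = y - y.mean()
    return 0.5 * (len(y) * np.log(sigma2) + (r ** 2).sum() / sigma2)

# Brent-type derivative-free search over a bounded interval, mirroring
# the optimize() fallback used when Fisher's scoring fails to converge.
res = minimize_scalar(neg_profile_loglik, bounds=(1e-6, 100.0),
                      method="bounded")
sigma2_hat = res.x                        # matches the closed-form MLE
```

For this stand-in, the minimizer recovers the closed-form MLE mean((y - ybar)^2), which makes the fallback easy to sanity-check.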

Datasets
We considered two real datasets consisting of unrelated Nigerian and Holstein samples, respectively. The Nigerian subjects participated in the International HapMap Project and comprised 29 males and 40 females (Pickrell, et al., 2010). The read counts were downloaded from the ReCount website (Frazee, et al., 2011). The Holstein data were obtained to identify genes associated with milk yield and consisted of high and low milk-yielding groups of 9 and 12 subjects, respectively (Seo, et al., 2016). Furthermore, parity and lactation periods were available and were considered as covariates. The steps for transformation from the raw sequence data to read counts are shown in Supplementary Text B. Based on the count data, we generated simulation data; the steps are described in Supplementary Text C.

Simulation studies with Nigerian RNA-seq data
We applied the proposed linear mixed models to the simulated data based on Nigerian RNA-seq data and calculated empirical type-1 error rates and statistical powers with these models. The results were then compared with DESeq2 (v1.14.1), edgeR (v3.16.5) and voom (v3.30.13). Figure 1 shows the results from the simulation studies based on Nigerian RNA-seq data. The Nigerian RNA-seq data consisted of 52,580 genes, and after filtering genes whose total read counts across samples were smaller than one tenth of the sample size, each replicate had around 10,000-10,500 genes. Empirical type-1 error rates and powers were estimated with 20 replicates. According to Figure 1a and 1c, LLI outperformed the other methods in terms of power. However, it should be noted that LLI showed worse precision than BALLI and voom if N was less than 20 (Figure 1b and 1d), suggesting that LLI had larger false-positive rates than BALLI and voom when N was less than 20. The precision of LLI increased when N was sufficiently large. In terms of both power and precision, BALLI performed best.

Simulation studies with randomly generated RNA-seq data
RNA-seq data are generally known to follow the negative binomial distribution, and we conducted simulation studies with RNA-seq data generated from negative binomial distributions. First, we assumed that library sizes were the same among subjects. The overall trend of the estimated type-1 error rates was similar to that of the simulation studies based on Nigerian RNA-seq data. The type-1 error rates estimated by voom and BALLI usually maintained the nominal significance levels. If u became smaller, the library size had larger variances. Figure 2 shows the estimated type-1 error rates at the 0.05 significance level according to different choices of u: voom was sensitive to the amount of library size variation and became conservative in the context of large library size variation. Compared with voom, BALLI and LLI were robust with regard to u. The estimated type-1 error rates of LLI were affected by sample size. If N was larger than or equal to 40, LLI controlled the type-1 error rates the most correctly and was not affected by the library size variation. BALLI was slightly conservative, but the amount remained constant. Results at the 0.005 significance level are provided in Supplementary Figure 1, and the general pattern was similar to that in Figure 2, except that DESeq2 was relatively closer to the baseline (Supplementary Figure 1). Therefore, we could conclude that the performances of BALLI and LLI were robust to library size variation (Figure 2). In summary, we can conclude that BALLI shows better performance than the other methods.

DISCUSSION
In this article, we suggested new methods, designated BALLI and LLI, for identifying DEGs with RNA-seq data. We assumed that the log-cpm values of read counts asymptotically followed normal distributions, and a linear mixed effects model with Bartlett's correction was proposed. The proposed methods were compared with existing methods, such as DESeq2, edgeR, and voom, in extensive simulation studies. According to our results, negative-binomial-based approaches often failed to preserve the nominal type-1 error rates. For example, p-values from edgeR were inflated. DESeq2 tended to be conservative and suffered from large false-negative rates. However, the proposed method with Bartlett's correction, BALLI, preserved the nominal type-1 error rates and was the most powerful method other than LLI. Unless sample sizes were small, LLI controlled the type-1 error rates as well and was the most powerful method. Therefore, we recommend using LLI if the sample size is sufficiently large (e.g., larger than 40); otherwise, it is better to use BALLI.
Furthermore, we evaluated the effects of library size variations on statistical analyses. We found that library size variance could affect the estimated type-1 error rates, and the effect was the largest for voom. Library sizes are affected by multiple factors, such as the amount of mRNA and the sequencing instrument, which can generate substantial variation among library sizes for subjects. Our simulation studies showed that BALLI was robust with regard to library size variation in samples of various sizes and was a reasonable choice if large library size variance was observed.
The proposed methods assumed that the log-cpm values of read counts asymptotically followed a normal distribution and that their variances were approximately equal to 1/μ + φ with a first-order approximation. In addition, voom considers the log-cpm values as responses and assumes that they are normally distributed. However, our simulation studies revealed the superiority of the proposed methods compared with voom, which was found to be attributable to their different variance structures.
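The first-order approximation referenced above can be recovered with the delta method. Assuming a negative binomial count r with mean μ and dispersion φ, so that Var(r) = μ + φμ², expanding log r around μ gives:

```latex
\operatorname{Var}(\log r)
\;\approx\;
\left(\left.\frac{d \log r}{d r}\right|_{r=\mu}\right)^{2} \operatorname{Var}(r)
\;=\;
\frac{\mu + \phi \mu^{2}}{\mu^{2}}
\;=\;
\frac{1}{\mu} + \phi,
```

so the variance of the log count decays like 1/μ at low expression and approaches the dispersion φ at high expression.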