aFold – using polynomial uncertainty modelling for differential gene expression estimation from RNA sequencing data

Background: Data normalization and identification of significant differential expression represent crucial steps in RNA-Seq analysis. Many available tools rely on assumptions that are often not met by real data, including the common assumptions of a symmetrical distribution of up- and down-regulated genes, the presence of only few differentially expressed (DE) genes, and/or few outliers. Moreover, the cut-off for selecting significantly differentially expressed genes for further downstream analysis often depends on arbitrary choices.
Results: We here introduce a new tool for estimating differential expression in noisy real-life data. It employs a novel normalization procedure (qtotal), which takes into account the overall distribution of read counts during data standardization, enhancing the reliable identification of differential gene expression, especially in the case of asymmetrical distributions of up- and down-regulated genes. The tool then introduces a polynomial algorithm (aFold) to model the uncertainty of read counts across treatments and genes. We extensively benchmark aFold on a variety of simulated and validated real-life data sets (e.g., ABRF, SEQC and MAQC-II) and show a higher ability to correctly identify differentially expressed genes under most tested conditions. aFold infers fold change values that are comparable across experiments, thereby facilitating data clustering, visualization, and other downstream applications.
Conclusions: We here present a new transcriptomics analysis tool that includes both a data normalization method and a differential expression analysis approach. The new tool is shown to enhance the reliable identification of significant differential expression across distinct data distributions. It outcompetes alternative procedures in the case of asymmetrical distributions of up- versus down-regulated genes and in the presence of outliers, both common to real data sets.
Electronic supplementary material The online version of this article (10.1186/s12864-019-5686-1) contains supplementary material, which is available to authorized users.

data from two samples generated via a ribo-depleted protocol, namely RNA from cancer cell lines and RNA from pooled normal human brain tissues.
We thus exclude data from mixtures of these samples and data based on other protocols. The raw data and counts tables are available at the Gene Expression Omnibus database under accession number GSE48035. The considered RNA-Seq data compare two conditions (UHR and BHR), whereby the same RNA samples were analyzed in three different laboratories. Any variation between these laboratories should thus not be due to relevant biological differences, but rather result from variation across sites in environmental and procedural factors.
The MAQC-II data set consists of seven replicates for each condition and was generated by the MicroArray Quality Control (MAQC) study to evaluate the performance of different gene expression analysis methods [6]. The raw data of MAQC-II are available from the NCBI SRA database under SRA010153, and the counts table was downloaded from http://bowtie-bio.sourceforge.net/recount/ [7].
The SEQC data set consists of five replicates and was generated by the Sequencing Quality Control (SEQC) study; it is available under GSE49712. The ABRF reads were aligned to the External RNA Control Consortium (ERCC) transcripts using STAR 2.5.3a [8] to obtain the ERCC information. The PrimePCR data set is based on the PrimePCR qRT-PCR approach and includes more than 20,000 validated DE genes from SEQC (MAQC III), available under GSE56457. The SEQC data set was additionally used to assess the influence of sequencing depth on method performance. We derived five data sets, which contained only 10%, 20%, 30%, 40%, and 50% of the read numbers of the original SEQC data set, and analyzed them with the various DE methods and normalization procedures.
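The text does not specify how the reduced-depth data sets were derived. One standard way to simulate shallower sequencing from an existing counts table is binomial thinning, sketched below; the function name and the use of NumPy are our own, not part of the original pipeline.

```python
import numpy as np

def thin_counts(counts, fraction, seed=0):
    """Simulate shallower sequencing of the same libraries by keeping
    each read independently with probability `fraction` (binomial
    thinning of a genes-x-samples counts matrix)."""
    rng = np.random.default_rng(seed)
    return rng.binomial(counts, fraction)

# Toy counts table: rows are genes, columns are samples
counts = np.array([[1000, 2000],
                   [500,  800],
                   [0,     10]])
thinned_10pct = thin_counts(counts, 0.10)  # roughly 10% of original depth
```

Repeating this with fractions 0.1 through 0.5 would yield depth-reduced versions analogous to the five derived SEQC data sets described above.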
Three additional real data sets were downloaded from http://bowtie-bio.sourceforge.net/recount/ [7]. The first of these is the modencodefly data set from the modENCODE project [9], which assesses gene expression during the development of Drosophila melanogaster [10], covering 30 distinct developmental stages. Each stage comprises 4 to 6 technical replicates, which provides an opportunity to construct subgroups per developmental stage to study stochastic variation in the absence of true DE. We accordingly subsampled from each stage to construct a 2:2 pairwise study.
The next real data set is the HapMap-CEU data set [11], which includes 41 samples based on immortalized B-cells from 41 unrelated CEPH grandparents.
It contains a relatively large sample size (17 female and 24 male samples) and high variation in read counts due to genetic diversity. It is well-studied and useful for measuring the ability of DE detection models to handle large sample sizes and high variation [12-14].
We also considered the Bottomly data set, which is from a study that characterized transcriptomic differences between two inbred mouse strains (C57BL/6J and DBA/2J) with 10 and 11 replicates, respectively [15]. We filtered out genes with zero read counts across all samples before analysis.
The basic statistics for all real data sets are summarized in Table 1 of the main text, including the average total number of read counts, the number of present genes, the sample size and the average number of DE genes. Genes with zero counts in all samples were filtered out before analysis.
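The zero-count filter applied to all data sets is straightforward; a minimal sketch (the helper name is ours):

```python
import numpy as np

def filter_zero_genes(counts, gene_ids):
    """Drop genes whose read count is zero in every sample.

    `counts` is a genes-x-samples matrix; `gene_ids` labels its rows.
    Returns the filtered matrix and the retained gene labels.
    """
    keep = counts.sum(axis=1) > 0
    return counts[keep], [g for g, k in zip(gene_ids, keep) if k]

counts = np.array([[5, 0, 3],
                   [0, 0, 0],   # all-zero gene: removed
                   [1, 2, 0]])
filtered, kept_ids = filter_zero_genes(counts, ["geneA", "geneB", "geneC"])
```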

Normalization
Read counts of RNA-Seq data require normalization before DE inference in order to reduce possible biases from sequencing depth, library preparation or even analysis in different sequencing lanes [16,17]. Current approaches for normalization rely on the assumption that the majority of genes (or at least those with high expression) are not DE and/or that the distribution of up- and down-regulated DE genes is symmetrical. These assumptions might not be valid for certain biological processes, such as development or aging, when gene expression shows dramatic biological variation, or for samples with asymmetrical DE. In contrast to current normalization approaches, we assess the overall distribution of DE across samples and then use it to estimate the true library size ratio ρ (i.e., the ratio between the total numbers of reads from two samples) from the observed library size ratio ρo between two samples (e.g., samples A and B),
where  is the probability of a gene to be up-regulated in the current sample and  is the mean of the half normal distribution. While  for the entire data should be around 0.5 (equal chance for up-or down-regulation), the observed library size ratio between A and B could be thus approximately represented as Under (1), the mean ratio of read count between A and B for any subsets in A and B ( {A}  and {B}  ) can be written as where {A || B} indicates a subset of read counts from sample A or B, respectively. Therefore, from (4) we obtain Since up-regulation increases gene expression and thus top-ranked genes are more likely up-regulated (or non-DE), (i.e., {A}&{B} 1  → in (5)), (5) for the highly expressed subsets (e.g., a large quantile 0 0.95 q = after ranking via reads count) can be approximated by where c indicates read count. While the SD across is determined by biological variation  , it is also scaled by DE with where sd and 0 sd indicates SDs after and before DE, respectively.  is the where j and w are the start and size of sliding window, respectively. A sliding window with small size (e.g, 0.05 w = ) will maximize the influence of  on 0 sd (i.e., | | 1  → , up-or down-regulation has the same effect on all genes in the subset). The true library size ratio can be estimated from (4), (6) and (9) In practice, we select one sample (default is the one with the largest number of read counts) as control and then apply this procedure on all samples to obtain the size factor for normalization. This procedure is implemented as qtotal in ABSSeq and set as default normalization procedure for aFold.
For the additional DE analysis methods, we used the default normalization procedures (voom and TMM for Voom and edgeR, respectively; the geometric mean approach for DESeq2; qtotal for ABSSeq; quartile for baySeq; and total counts for ROTS).

Outlier detection
Outliers influence DE detection by shifting both the mean and the variance [13,14] and thus need to be corrected for. Here we integrate the procedure from our previous ABSSeq approach into aFold, which uses the median absolute deviation (MAD) to detect outliers in log-transformed read counts and shrinks the read counts of outliers toward the median read count within a condition.
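A minimal sketch of MAD-based outlier shrinkage, illustrative only; ABSSeq's actual implementation differs in details such as the exact shrinkage target and per-condition handling.

```python
import numpy as np

def shrink_outliers(counts, cutoff=3.0):
    """Flag counts whose log-transformed value deviates from the median
    by more than `cutoff` MADs, and pull them back to the median count.

    `counts` holds the replicates of one gene within one condition.
    """
    logc = np.log2(counts + 1.0)
    med = np.median(logc)
    mad = np.median(np.abs(logc - med)) * 1.4826  # normal-consistency factor
    out = np.abs(logc - med) > cutoff * mad
    shrunk = counts.astype(float).copy()
    shrunk[out] = np.median(counts)  # shrink outliers toward the median
    return shrunk, out

counts = np.array([8, 10, 9, 11, 500])  # 500 is an obvious outlier
shrunk, out = shrink_outliers(counts)
```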

Moderating uncertainty of read count
Due to biological and/or other sources of variance, the observed expression value of the i-th gene gi is given by its mean μi plus an uncertainty term. In practice, the uncertainty is represented as the standard deviation (SD) across samples if the SD is independent of the mean. However, in RNA-Seq or microarray data the SD is not independent of μi and can generally be written as a function of μi, where ai is a coefficient describing the mean-variance relationship of the i-th gene. This implies a propagation of error (uncertainty) in the measurement of the SD based on μi. Therefore, an accurate measurement of read count uncertainty should also include the propagation of error from (12). In theory, the propagated uncertainty of the SD can be written as a term that is dominated by the mean read count of each gene [14], from which the uncertainty of the expression level follows. We leave out the second term of the polynomial function of (16) in (18), yielding (19), where mi is defined as the effective sample size of each gene. We use the effective sample size instead of the real sample size in (19) to capture the data structure (i.e., the overall dispersion of CVs). Following G.4.2 in [18], a global effective sample size (effective degrees of freedom) can be obtained, where k is the coefficient factor for the expression level.
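The mean-variance dependence described above can be made concrete by computing per-gene coefficients of variation (CV = SD / mean) across replicates; a small sketch with our own function name:

```python
import numpy as np

def per_gene_cv(counts):
    """Mean, SD and coefficient of variation per gene across replicates.

    Illustrates the point made in the text: in count data the SD is
    not independent of the mean, so an uncertainty measure based on
    raw SDs inherits a mean-driven error term.
    """
    mean = counts.mean(axis=1)
    sd = counts.std(axis=1, ddof=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        cv = np.where(mean > 0, sd / mean, 0.0)
    return mean, sd, cv

# Two toy genes: a high-count gene and a low-count gene
counts = np.array([[100, 110, 90],
                   [10,  14,  6]])
mean, sd, cv = per_gene_cv(counts)
```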

Moderating fold change by uncertainty of read count
In our previous study [13], we showed that the log fold change can be described by a specific relationship. This relationship specifies that the log fold change depends on the expression level and on the mean read count difference between the two conditions (denoted as A and B). The fold change is thus shrunk toward 0 according to the uncertainty or variance. As a result, the fold change from (24) presents a robust way of measuring differential expression, since it fully accounts for the mean and variance of the expression values. We thus term this inference procedure the accurate fold change (aFold) approach.
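The shrinkage idea can be illustrated in simplified form: adding uncertainty-derived pseudocounts to each condition's mean pulls the log fold change of noisy, low-count genes toward 0 while leaving high-count genes nearly untouched. The pseudocount values below are placeholders; aFold derives them from its uncertainty model rather than fixing them by hand.

```python
import numpy as np

def moderated_log_fold_change(mean_a, mean_b, pseudo_a, pseudo_b):
    """Log2 fold change moderated by uncertainty-derived pseudocounts.

    Larger pseudocounts (i.e., higher uncertainty) shrink the fold
    change toward 0; as counts grow relative to the pseudocounts, the
    moderated value approaches the raw log2 ratio.
    """
    return np.log2(mean_a + pseudo_a) - np.log2(mean_b + pseudo_b)

raw = np.log2(4 / 1)                                  # raw log2 FC = 2
shrunk = moderated_log_fold_change(4, 1, 10, 10)      # low counts: strong shrinkage
stable = moderated_log_fold_change(100, 100, 5, 5)    # equal means: FC stays 0
```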

Determination of the cut-off of aFold
While ordinary log fold changes usually follow a normal distribution with zero mean [14,20,21], aFold values likewise follow a zero-centered normal distribution (Figure S9B, HapMap-CEU). Therefore, the cut-off (significance threshold) of aFold can be determined by estimating the SD of this zero-centered normal distribution. Notably, the aFold calculation is equivalent to adding the pseudocounts (from (22)) to the read counts. This has no influence on the read count variance, but it stabilizes the CV (variance stabilization; insets in Figures S9A, S9C and S9D). Based on this procedure (i.e., adding the pseudocounts), we obtain for each data set a general CV at the count level, or equivalently an SD for the log-transformed counts (which also serves as the SD for aFold). We can next calculate this general SD under log transformation via moments estimation. After adjustment for multiple testing (Benjamini-Hochberg by default), a data-specific aFold cut-off is obtained at the chosen significance level.
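A sketch of the cut-off determination under the stated normality assumption: estimate the SD of the zero-centered aFold distribution by moments, convert each value to a two-sided normal p-value, and adjust with Benjamini-Hochberg. This is a simplified stand-in; the actual aFold code may estimate the SD differently (e.g., more robustly against extreme values).

```python
import numpy as np
from math import erfc, sqrt

def afold_significance(afold_values, alpha=0.05):
    """BH-adjusted significance calls from zero-centered aFold values."""
    x = np.asarray(afold_values, dtype=float)
    sd = np.sqrt(np.mean(x ** 2))  # moments estimate of SD, assuming zero mean
    # two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = np.array([erfc(abs(v) / (sd * sqrt(2.0))) for v in x])
    # Benjamini-Hochberg step-up adjustment
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    padj = np.empty_like(p)
    padj[order] = np.minimum(ranked, 1.0)
    return padj < alpha, padj

# 99 near-zero fold changes plus one clearly large one
sig, padj = afold_significance([0.01] * 99 + [3.0])
```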
The aFold model is here developed for simple designs. It is possible to adjust it to complex experimental designs with the help of the linear models from limma [22]. Such complex designs will be assessed and implemented in the future.
Supplementary Tables

Table S1. Comparison of AUCs in Figure 1