ABSSeq: a new RNA-Seq analysis method based on modelling absolute expression di(cid:27)erences

This vignette is intended to give a brief introduction of the ABSSeq R package by analyzing the simulated data from Soneson et al. [2] . For details about the approach, consult Yang [1]. Currently, ABSSeq can just be applied on pairwise study. We assume that we have counts data from an experiment, which consists of two conditions and several replicates for each condition in a matrix. The expected expression of each gene is estimated from number of read count, proportional to the expectation value of the true concentration of count. As a result, a normalization method need to be apply on the original counts. The normalized counts usually have enormous variation across genes and compared conditions. The reliable identification of differential expression (DE) genes from such data requires a probabilistic model to account for ambiguity caused by sample size, biological and technical variations, levels of expression and outliers. ABSSeq infers differential expression directly by the counts difference between conditions. It assumes that the sum counts difference between conditions follow a Negative binomial distribution with mean mu proportional to expression level and dispersion factor r (size). The mu and r is determined by variation in the experiment, i.e., biological variation, sequencing and mapping biases. Typically, the number of replicates in a study is small and not enough to reveal all variation. To overcome this problem, a common approach is to borrow information across genes. Here, we use local regression to smooth dispersion across genes. The smoothed dispersions are then used to produce pseudocounts in the mu estimation to accounts for dynamic dispersions at expression level, which in turn moderates the fold-change across expression level. However, the information borrowed across genes based on dispersions is usually incomplete since it often utilizes part of genes with a positive dispersion, which might load to underestimation and influence the DE inference. To overcome it, ABSSeq introduces a penalty for dispersion estimation, that helps avoid extremely siginicant DEs with small change at high expression level. ABSSeq tests counts difference directly against a baseline estimated from the data set (mu), and therefore reports p-values related to magnitude of difference (fold-change). In addition, ABSSeq moderates the fold-changes by two steps: the expression level and gene-specific dispersion, that might facilitate the gene ranking by fold-change and visualization (Heatmap). New alternative approach


Introduction
This vignette is intended to give a brief introduction of the ABSSeq R package by analyzing the simulated data from Soneson et al. [4].For details about the approach, consult Yang [3].Currently, ABSSeq integrates two modules: basic model for pairwise comparison and linear model for complex design.
RNA-Seq quanties gene expression with reads count, which usually consists of conditions (or treatments) and several replicates for each condition.The expected expression of each gene is estimated from number of read count, proportional to the expectation value of the true concentration of count.As a result, a normalization method need to be apply on the original counts.The normalized counts usually have enormous variation across genes and compared conditions.The reliable identication of dierential expression (DE) genes from such data requires a probabilistic model to account for ambiguity caused by sample size, biological and technical variations, levels of expression and outliers.
ABSSeq infers dierential expression directly by the counts dierence between conditions.It assumes that the sum counts dierence between conditions follow a Negative binomial distribution with mean mu (proportional to expression level) and dispersion factor r (size).The mu and r is determined by variation in the experiment, i.e., biological variation, sequencing and mapping biases.Typically, the number of replicates in a study is small and not enough to reveal all variation.To overcome this problem, a common solution is to borrow information across genes.Here, we use local regression to smooth dispersion across genes.The smoothed dispersions are then used to produce pseudocounts in the mu estimation to accounts for dynamic dispersions across expression levels, which in turn moderates the fold-change.
ABSSeq tests counts dierence directly against a baseline estimated from the data set (mu), and therefore reports p-values related to magnitude of dierence (fold-change).In addition, ABSSeq moderates the fold-changes by two steps: the expression level and gene-specic dispersion, that might facilitate the gene ranking by fold-change and visualization (Heatmap).New alternative approach (named aFold) was introduced, which calls DE genes via log fold-change(see second Section for example).aFold uses a polynomial function to model the uncertainty (or variance) of read count, and thus takes into consideration the variance of expression levels across treatments and genes.aFold produces accurate estimation of fold changes.In combination with the linear model from limma [2], aFold is capable to analyze data set with complex experimental design (see last Section for example).

Pairwise study
We rstly import the ABSSeq package.
> library(ABSSeq) Then, we load a simulated data set.It is a list object and contains three elements: the counts matrix, denoted by 'counts', the groups, denoted by 'groups' and dierential expression genes, denoted by 'DEs'.
An ABSDataSet object is required for ABSSeqand could be constructed with the ABSDataSet function by providing counts matrix and dened groups.Here, we can also initiate a paired comparison for specic samples, such as data for cancer and normal tissue from same individuals, by setting the paired parameter in ABSDataSet object.

> obj <-callDEs(obj) > plotDifftoBase(obj)
In the end, we model the counts dierences with Negative binomial distribution and calculate the pvalue for each gene.DE calling is performed by the function callDEs, which reports pvalues as well as adjusted pvalues, that can be accessed by results with names of pvalue and adj.pvalue.Noticely, this function also provides fold-change moderation according to gene-specic dispersion by utilizing aFold, which will report fold-changes closer to gene's dispersion.In the end, ABSSeq produces three kinds fold-changes: the original (denoted by 'rawFC'), corrected by expression level (denoted by 'lowFC') and moderated by expression level and gene-specic dispersion (denoted by 'foldChange'), which are stored in the ABSDataSet object and could be also retrieved by results.> obj <-callDEs(obj) > head(results(obj,c("rawFC","lowFC","foldChange","pvalue","adj.pvalue")))In addition to a step-by-step analysis in above, DE calling could be simply performed by the function ABSSeq, which runs a default analysis by calling above functions in order and returns a ABSDataSet object with all information.

DE analysis with complex design
In combination with linear model from limma [2], aFold is capable to analyze data set with complex experimental design, which is performed by the function ABSSeqlm.Here is an example.

Figure 1 :
Figure 1: 'Absolute log2 fold-change against expression level'-plot for count data.We show the tted and raw data with dierent colors.
141438 4.176475 -1.30085645 0.003116490 0.01707491 2 10.760856 10.176462 -0.56268049 0.153013673 0.35063405 3 11.63551611.389556 -0.23985467 0.845037624 1.00000000 4 4.958562 3.671725 -0.73879630 0.089056224 0.23482571 5 10.799432 12.100000 1.25952601 0.001112932 0.00740706 6 9.984289 9.958485 -0.02476103 0.999999092 1.00000000 3 Detecting DE via aFold Recently, ABSSeq integrates a new method for DE detection: aFold.aFold utilizes a polynormial function to model the uncertainty of observed reads count and moderate the fold change calculation.aFold takes into account variations among samples and genes and reports DE and fold changes in a reliable way.The fold changes produced by aFold may help the experimentalist to avoid arbitrary choice of cut-o thresholds and may enhance subsequent downstream functional analyses.Here is the example for how to use aFold in ABSSeq.