In the post-genomic era, the development of technologies for sequencing the genome and transcriptome has become a key issue in the global analysis of biological systems. Even with the lowering cost of sequencing data, the majority of RNA-Seq experiments are still suffering from low replication numbers. The identification of differential expressed (DE) genes and transcripts is still a key question of interest in many biological studies. To date, there are many methods that provide a test of whether a gene is DE or not [1], including cufflinks [2], DESeq [3] and edgeR [4]. A feature in all of these methods is moderation of gene-wise variance estimates to improve DE inference. Moderation is important in small samples size comparisons, increasing both the power and accuracy of a DE test [5]. The key differences between these methods are the extent of the moderation and their *common variance* estimate--the variance estimate that the procedure is moderating towards.

DESeq and edgeR account for the heteroscedasticity observed in the read counts of genes by modelling the relationship between expected value of the count and its variability. We propose using additional information, such as gene length and variance estimates from external datasets, as explanatory variables to further model the heterogeneity seen in the observed gene variances. Combining these improved models of gene variance with a moderation method [6] creates a robust tool for estimating gene variances and hence calling differential gene expression. When evaluated on publicly available data this tool offers both improved gene ranking and power of detection when compared to DESeq and edgeR. This method is implemented in the R package sydSeq available on http://www.maths.usyd.edu.au/u/jeany/software.htm.

### RNA-Seq

The development of high throughput sequencing technologies has made it possible to sequence the transcriptome at a much higher resolution and coverage than was previously available. Sequencing of cDNA samples (RNA-Seq) has a dynamic range larger than that of microarrays [7]. This, combined with its high level of reproducibility [8] and falling cost, makes high throughput sequencing technologies an increasingly attractive alternative to microarrays for transcriptome analysis. In the following we will describe how variances are estimated for RNA-Seq data with small samples sizes.

A typical RNA-seq data analysis work flow consists of many steps. These steps generally consist of mapping, summarisation, normalisation, differential expression analysis and systems biology [9]. The summarisation step counts how many reads were mapped to each gene or transcript. We will consider the case of mapping to genes rather than transcripts, so for us summarisation results in a matrix of counts, where the rows and columns correspond to genes and samples respectively.

Let *y*
_{
ij
} be the observed read count for the *i*
^{
th
} gene in the *j*
^{
th
} sample where sample *j* belongs to treatment *t*(*j*) = 1, 2. Let
and *μ*
_{
i
} be the variance and mean read count for gene *i*. For ease of presentation we will assume that all effects that are generally normalised for or modelled, such as library sizes and GC content, remain constant across samples. The technical variability for a gene count in RNA-Seq can be modelled quite reliably as Poisson [10, 11]. This is attractive in situations of low replication as one parameter can be estimated to describe both the mean and variance of a gene. Modelling the data as Poisson will give a very reliable estimate of which genes have changed in expression between any two samples. However many experiments are not simply focused on the detection of gene expression differences between any two samples focusing instead on the differences between any two types of cells for example. This distinction is important as it requires us to not only model the technical variability of the experiment but to also model the biological variability of a particular cell type (or experimental condition).

An over-dispersed Poisson, a discrete distribution with dispersion greater than a Poisson, can be modelled using a Negative Binomial. A negative binomial random variable,

*Y* , can be parametrised as

This standard formulation is generally referred to as NB2. Under this formulation, the biological variability of the expression of a gene is modelled as a quadratic function of its mean expression

*μ*:

where as

gets small the negative binomial will approach a Poisson. The parameter

*b* has been referred to as the coefficient of biological variation. A negative binomial is generally parametrised as a function of

*r* and

*p*. However, by parametrising a negative binomial in terms of its mean

*μ* and variance

*σ*
^{2} where

and

*σ*
^{2} >

*μ*, a negative binomial can then be used to model counts that have untraditional mean-variance relationships. This relationship is generally expressed as

where *f*(*μ*) explains the biological variability can be fitted by some form of local regression [3]. This formulation of *σ*
^{2} highlights that *σ*
^{2} should always be greater than or equal to *μ*.

In current RNA-Seq experiments it is still quite common to see experiments with very little biological replication. Estimating variances from few observations is unstable [12]. To improve the stability and accuracy of these variance estimates there have been many methods proposed to shrink the variances to some common value for microarrays [12] and RNA-Seq [1]. We will refer to this as moderation. By stabilising the variances and sharing information moderation also increase the power of a statistic as this increases the degrees of freedom of a variance estimate [5].

### Heterogeneous gene variances

It is well accepted that some genes have a higher variance than other genes [13]. That is, some genes vary in expression more from cell to cell, person to person, or treatment to treatment than other genes. In RNA-Seq datasets, genes with larger average expression have on average larger observed variances. Instead of shrinking the estimate of a genes variance towards some common value (as is often done in microarrays) to improve stability [12], edgeR and DESeq shrink the estimate towards some fitted curve describing the relationship between mean and variance. We refer to this fitted curve as the *common variance*. In doing this they are making the strong assumption (although not an unreasonable one) that genes with a similar average count should have a similar variance.

We incorporate external data from RNA-Seq and microarrays on mouse striatum and RNA-Seq data from different tissues to better estimate variances and hence identify DE genes between C57BL/6J and DBA/2J mouse striatum samples.