High-throughput RNA-sequencing technologies have taken over the field of gene expression level estimation, previously dominated by microarray technologies. Among the favorable characteristics of high-throughput RNA-seq data are that there is no limit on the number of genes surveyed, nor do the genes have to be pre-selected. These technologies thus provide a wider dynamic range and also make it possible to discover new sequence variants and transcripts. Since mRNA sequencing technology is not particularly target specific, up to tens of thousands of genes are "active" at different levels. While this is fertile ground for functional discoveries, possibly of multiple pathways simultaneously, it poses the onerous task of teasing apart thousands of genes based on their activity (expression) levels across assays. This requires reliable computational techniques to make accurate and meaningful inferences from the massive data generated by the experiments. Moreover, the data are discrete, i.e., given as read counts or numbers of read fragments rather than intensity measurements in a continuous domain. As in other computational methods, the following assumptions are made:
1 Most of the genes are stable.
2 The DE genes are globally unbiased, i.e., some are over-expressed while some are under-expressed.
Let $x_g$ be the observed abundance of gene g, i.e., the number of tags (reads) that map to gene g in the experiment; the length of the gene is absorbed in this value. Let $\mu_g$ be the true abundance. $M_k$ is the library size and $S_k$ represents the total RNA output of a sample. $S_k$ is usually not available, but the relative RNA production fold change of two samples, $S_k/S_{k'}$, may be [11].
The mathematical framework
Overview. The primary objective is to compare the expression levels of genes in high-throughput sequencing (HTS) data between a pair of biological samples. Due to the un-targeted nature of HTS, it is only natural to assume that the observed expression values of the genes are not independent. Second, the technologies produce data in terms of integer read counts, in contrast to earlier technologies which produced real-valued intensity measures. Moreover, whereas the total number of genes simultaneously assayed was previously determined by the technology, in the sequencing approaches the read counts are the result of a mapping software pipeline, at whose core reside sophisticated (parametric) string matching algorithms. Such a setting of read count computation is not naturally conducive to normalization. Hence the read count of a gene g is simply divided by the total number of reads (in millions) in the assay and by the length of gene g (in kilobases) to yield a quantity called the RPKM value of g. Instead of directly working with the expression value of g, we define a character function for each gene g. The two most desirable properties of this function are that (1) it depends on the expression values of all the other genes in the assay and (2) it is scale invariant. We also define another family of distributions (called ψ), built on these functions, to study the overall behavior of the changes captured by the two transcriptomes.
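As a concrete illustration of this normalization, the following sketch computes RPKM values from raw read counts and gene lengths; the function name and the example numbers are ours, for illustration only.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """Reads Per Kilobase of transcript per Million mapped reads.

    counts          -- integer read counts per gene
    gene_lengths_bp -- gene lengths in base pairs
    """
    counts = np.asarray(counts, dtype=float)
    kilobases = np.asarray(gene_lengths_bp, dtype=float) / 1e3  # gene length in kb
    millions = counts.sum() / 1e6                               # library size in millions of reads
    return counts / millions / kilobases

# Toy example (hypothetical counts and lengths):
print(rpkm([500, 1200, 0], [2000, 1500, 800]))
```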
Notation
Let G = {g_1, g_2, ..., g_L} be a set of L genes. Then let an L-tuple of random variables

$\mathcal{X} = (X_{g_1}, X_{g_2}, \ldots, X_{g_L})$   (1)

represent a biological sample with L genes G. $X_g$ is a random variable representing the observed expression or abundance of gene g. In this model-less framework, we make no assumptions about the distribution of each of the components $X_g$. Next, consider the L-tuple of random variables

$\Phi = (\phi_{g_1}, \phi_{g_2}, \ldots, \phi_{g_L})$   (2)

whose components, $\phi_g$, termed character functions, are defined below. In particular, these functions are scale invariant.
Definition 1 (character functions)
For a fixed P > 1, we call a surjective map

$\phi_g : (x_{g_1}, x_{g_2}, \ldots, x_{g_L}) \mapsto t_g \in \{1, 2, \ldots, P\}$   (3)

a character function if there exists a $\delta_P \ge 0$ satisfying the following conditions:
1 if the relative abundances of g and g′ are equal, then $|t_g - t_{g'}| \le \delta_P$, and
2 if the relative abundance of g is sufficiently smaller than that of g′, then $t_g < t_{g'}$,
for all g, g′ ∈ G and all abundance tuples $(x_{g_1}, \ldots, x_{g_L})$.
Notice that such a character function satisfies the following additional properties.
1 $\phi_g$ is scale-invariant: $\phi_g(c \cdot x_{g_1}, \ldots, c \cdot x_{g_L}) = \phi_g(x_{g_1}, \ldots, x_{g_L})$ for all c > 0 and all g ∈ G. This follows from the first condition of the definition of $\phi_g$.
2 For all triplets g, g′, g″ with $t_g < t_{g'} < t_{g''}$, the ordering of the images reflects the ordering of the underlying abundances. This property follows from the second condition of the definition. In fact this property leads to the algorithms for actually computing the scale-invariant maps.
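To make the scale-invariance property concrete, the toy map below bins genes by the rank of their relative abundance. This is an illustrative stand-in of our own, not the segmentation-based map used later; the final assertion checks that multiplying all abundances by a constant leaves the images unchanged.

```python
import numpy as np

def quantile_character(x, P=20):
    """Toy character-like map: bin genes into 1..P by the rank of their
    relative abundance (illustration only)."""
    x = np.asarray(x, dtype=float)
    frac = x / x.sum()                   # relative abundances, unchanged by scaling
    ranks = frac.argsort().argsort()     # rank of each gene, 0 = smallest
    return 1 + (ranks * P) // len(x)     # map ranks onto bins 1..P

x = np.array([3.0, 0.5, 12.0, 7.0, 1.5, 0.25, 25.0, 4.0])
t = quantile_character(x, P=4)
assert np.array_equal(t, quantile_character(10.0 * x, P=4))  # scale invariance
```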
For two samples a and b defined on the same gene set G, let a pair of L-tuples of random variables, $\mathcal{X}^a$ and $\mathcal{X}^b$, represent the two samples. Let the corresponding scale-invariant character maps be $\phi^a$ and $\phi^b$. Recall from Eqn (3) that the image of $\phi_g$ is {1, 2, .., p, .., P}. Next, consider the P-tuple of random variables

$\Psi^{ab} = (\psi^{ab}_1, \psi^{ab}_2, \ldots, \psi^{ab}_P)$   (4)

whose components $\psi^{ab}_p$, termed dispersion functions, are defined below.
Definition 2 (dispersion function)
For samples a and b (with possibly a = b), $\psi^{ab}_p$ is defined by the following distribution:

$\Pr\left[\psi^{ab}_p = q\right] = \Pr\left[\phi^b_g = q \mid \phi^a_g = p\right]$,   (5)

where q ∈ {1, 2, .., P}.
Informally speaking, $\psi^{ab}_p$ is the dispersion of image p in sample b with respect to that of a. Also, $\psi^{aa}_p$ specifies the variance of image p in sample a. Based on Eqn (5), for each p, probability measures of over-expression (O) and under-expression (U) are defined as follows:

$O_p = \Pr\left[\psi^{ab}_p > p\right]$,   (6)

$U_p = \Pr\left[\psi^{ab}_p < p\right]$.   (7)

Along similar lines, the dispersion within a, for each p, is captured by the distribution of $\psi^{aa}_p$. Thus a natural measure of neutrality, for each p, is

$N_p = \Pr\left[\psi^{ab}_p = p\right] = 1 - (O_p + U_p)$.   (8)

This intrinsic measure can be used to evaluate the accuracy of the estimation of the distributions of the random variables $X_g$, based on single or multiple replicates. In practice, we have used $N_p$ to evaluate the character functions (and thus estimate the parameter P as well).
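Assuming the forms of Eqns (6)-(8) given above, the three measures can be read directly off an estimated dispersion distribution. In the sketch below, psi is a hypothetical (P × P) array with psi[p-1, q-1] = Pr[ψ^{ab}_p = q]; the function name is ours.

```python
import numpy as np

def over_under_neutral(psi):
    """O_p, U_p, N_p (Eqns 6-8) from a (P x P) dispersion matrix with
    psi[p-1, q-1] = Pr[psi^{ab}_p = q]."""
    P = psi.shape[0]
    q = np.arange(1, P + 1)
    O = np.array([psi[p - 1, q > p].sum() for p in range(1, P + 1)])  # mass above p
    U = np.array([psi[p - 1, q < p].sum() for p in range(1, P + 1)])  # mass below p
    N = 1.0 - (O + U)                                                 # mass at p
    return O, U, N
```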
Estimating distributions of the random variables $\phi_g$ and $\psi_p$
Critical to the computation of the $\phi$ and $\psi$ functions is the estimation of the distribution of the random variables $X_g$ of Eqn (1). Our model is influenced by the observation that the total number of genes (L) in the RNA-Seq experiment is determined by the biological sample and not by the technology, and that this also renders the observed abundances of the genes not quite independent of each other. This is reflected in the details of the computations discussed here.
Let $(x_g)_{g \in G}$ be the observed abundances in the experiment. Without loss of generality, let $c_g$ be the number of reads (read counts) of gene g, and $\sum_{g \in G} c_g = M$. To estimate the distribution of the $X_g$ we use a Bernoulli process with parameters $c_g/M$, for g ∈ G, and M.

The following is a natural extension of the method for the single replicate. Let $(x^k_g)_{g \in G}$ be the observed abundances in the K experiments under identical conditions. As before, let $c^k_g$ be the number of reads (read counts) of gene g in the kth replicate, for 1 ≤ k ≤ K. Let $M_k = \sum_{g \in G} c^k_g$. We use K Bernoulli processes, each with parameters $c^k_g/M_k$, for g ∈ G, and $M_k$, as in the single replicate case. Let $c_g = \sum_{k=1}^{K} c^k_g$ and let $M = \sum_{k=1}^{K} M_k$. We estimate the probability, $p_g$, of gene g to be $c_g/M$.
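A minimal sketch of the pooled estimate just described is given below; counts is a hypothetical K × L array with counts[k][g] the read count of gene g in replicate k.

```python
import numpy as np

def estimate_gene_probabilities(counts):
    """Pooled per-gene probabilities p_g = c_g / M from a K x L array of replicate counts."""
    counts = np.asarray(counts, dtype=float)  # shape (K, L)
    c_g = counts.sum(axis=0)                  # c_g = sum over replicates of c^k_g
    M = c_g.sum()                             # M = sum over replicates of M_k
    return c_g / M

# Example with K = 2 replicates of L = 4 genes (hypothetical numbers):
print(estimate_gene_probabilities([[10, 0, 5, 85], [12, 1, 4, 83]]))
```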
Thus, modeling the experiment, with K ≥ 1 replicates, as a Bernoulli process, $\mathcal{X}$ of Eqn (1) follows a multinomial distribution with parameters M and $(p_g)_{g \in G}$. A closed form of the distribution of $\phi_g$ or $\psi_p$ is not straightforward to estimate and, moreover, the argument to $\phi_g$ is an L-tuple $(x_{g_1}, \ldots, x_{g_L})$. Hence we simulate the Bernoulli process in our implementation. Then, based on Defn (1), for a fixed P, $\phi_g$ is evaluated for each g using the L-tuple, $t_i$, of trial i. Similarly, based on Defn (2), $\psi_p$ is evaluated for each p using the L-tuples of the two experiments. The details are discussed below.
Simulating the character functions
Simulating the Bernoulli process for one experiment with n genes is described below, for some number of iterations I (e.g., I = 10^3), and some number of reads R (e.g., R = 10^6), using a fixed P (e.g., P = 20). Consider iteration i:
1 Repeat for each of the R reads: assign the read to gene g ∈ G with probability $p_g$.
2 Order the genes according to decreasing assigned read count. Assign the lowest image value, $\phi_g = 1$, to genes g with no reads.
3 Apply linear (regression) segmentation with P − 1 segments to the cumulative sum of the ordered read counts; assign $\phi_g = P$ for genes g in the first segment, P − 1 for the second segment, and so forth. Least-squares linear segmentation can be performed in time O(n²P), see [12]. In practice, when n is large, one may want to subsample the genes, e.g., only use every 10th gene for the segmentation. A sketch of this procedure follows.
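The sketch below follows the three steps above for a single iteration. The multinomial draw implements step 1; for step 3, we substitute a simple equal-mass split of the cumulative sum in place of the least-squares linear segmentation of [12], so this is an approximation of the procedure, with function and variable names of our choosing.

```python
import numpy as np

def simulate_character_function(p_g, R=10**6, P=20, rng=None):
    """One simulation iteration: returns an image value in 1..P for each gene.

    p_g -- per-gene probabilities c_g / M, summing to 1
    R   -- number of simulated reads
    P   -- number of image values (bins)
    """
    rng = np.random.default_rng(rng)
    p_g = np.asarray(p_g, dtype=float)

    # Step 1: assign each of the R reads to a gene with probability p_g.
    counts = rng.multinomial(R, p_g)

    # Step 2: order genes by decreasing read count; genes with no reads get image 1.
    order = np.argsort(-counts)
    phi = np.ones(len(p_g), dtype=int)
    expressed = order[counts[order] > 0]
    if len(expressed) == 0:
        return phi

    # Step 3 (simplified): split the cumulative sum of the ordered counts into
    # P - 1 segments of roughly equal mass and assign images P, P-1, ..., 2.
    csum = np.cumsum(counts[expressed])
    segment = np.minimum((csum - 1) * (P - 1) // csum[-1], P - 2)
    phi[expressed] = P - segment
    return phi

# Hypothetical usage with observed counts c:
c = np.array([120, 0, 55, 310, 7])
print(simulate_character_function(c / c.sum(), R=10**4, P=4, rng=0))
```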
The image values are stored in each iteration i and together make up the distribution of $\phi_g$. The values of the dispersion function $\psi_p$ are computed from these stored distributions. In practice, to compute $\Pr[\psi^{ab}_p = q]$, one counts the number of pairs of iterations (i, i′) ∈ (1...I, 1...I) where $b^a_i(g) = p$ and $b^b_{i'}(g) = q$, where $b^a_i(g)$ denotes the bin of gene g in iteration i in dataset a (accumulated over the genes g). The probability $\Pr[\psi^{ab}_p = q]$ is the frequency of such pairs among all the pairs with $b^a_i(g) = p$.
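In the same vein, a sketch of the pair-counting computation of the dispersion distribution is given below; bins_a[i, g] is assumed to hold the image of gene g in iteration i of dataset a, and the normalization over q for each p follows our reading of Defn 2.

```python
import numpy as np

def dispersion_distribution(bins_a, bins_b, P):
    """Estimate Pr[psi^{ab}_p = q] from stored per-iteration bins (shape I x L)."""
    I, L = bins_a.shape
    counts = np.zeros((P, P))
    for g in range(L):
        # Counting pairs of iterations (i, i') with bin p in a and bin q in b
        # equals the outer product of the two per-gene bin histograms.
        h_a = np.bincount(bins_a[:, g] - 1, minlength=P)
        h_b = np.bincount(bins_b[:, g] - 1, minlength=P)
        counts += np.outer(h_a, h_b)
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```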
An example. We compute the different distributions for a dataset with 10,000 genes and their (observed) abundances c. Figure 6(a) shows $\phi_g$ for all the genes g (as defined in Eqn 3), Figure 6(b) shows the dispersion distributions $\psi_p$, for each p (according to Eqn 5), while (c) summarizes the Ψ distributions as $O_p$, $U_p$, and $N_p$ (according to Eqns 6-8).
Comparison across a pair $\mathcal{X}^a$ and $\mathcal{X}^b$
Instead of directly studying $\mathcal{X}^a$ and $\mathcal{X}^b$, we observe the behavior of the scale-invariant maps $\phi^a$ and $\phi^b$. We also distinguish two classes of genes: one whose members are differentially expressed (DE), and the other whose members are stable across the two samples a and b with sufficient confidence (based on a threshold on some measure). The definition of stable genes in this context, and detecting them via the combinatorial problem of the longest increasing subsequence, is discussed in Additional file 1 (a sketch of the underlying computation is given below). Unfortunately, these two classes, which appear complementary in their characteristics, do not exactly partition the gene set G into two, because of the need to meet the confidence thresholds.
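The precise stability criterion is in Additional file 1; its combinatorial core is the longest increasing subsequence, for which a minimal O(n log n) sketch is shown below, applied to hypothetical per-sample gene ranks.

```python
import bisect

def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k] = smallest possible tail of an increasing subsequence of length k + 1
    for x in seq:
        k = bisect.bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)

# Hypothetical use: ranks of genes in sample b, listed in the order induced by
# sample a; genes lying on a long increasing subsequence keep their relative
# order across the two samples and are thus candidates for being stable.
print(lis_length([1, 3, 2, 5, 4, 6, 9, 7, 8]))  # -> 6
```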
Differentially expressed (DE) genes
Character functions $\phi^a_g$ and $\phi^b_g$ are used to estimate the DE genes. We define a real-valued distance, dis(·, ·) ≥ 0, between the two functions as discussed below. For some fixed δ > 0, if $dis(\phi^a_g, \phi^b_g) > \delta$, then g is estimated to be differentially expressed. It is important to note that the mathematical framework discussed here is agnostic to the underlying models in the diverse read count simulators prevalent in the literature.
To define the amount of differential expression between datasets a and b, we rank each gene according to the differences in its distributions $\phi^a_g$ and $\phi^b_g$. The ranking process is described below.
The estimated character functions are essentially histograms on P ordered bins x = 1, ..., P, denoted $f^a_g$ and $f^b_g$. We adopt the maximum norm as the DE measurement between treatments, denoting the largest difference between the cumulative distributions for gene g in experiments a, b:

$dis(\phi^a_g, \phi^b_g) = \max_{x \in \{1, \ldots, P\}} \left| C^a_g(x) - C^b_g(x) \right|$,   (9)

where C denotes the cumulative distribution, $C_g(x) = \sum_{x' \le x} f_g(x')$. This is the test statistic of the Kolmogorov-Smirnov test. However, the character functions are discrete, so the K-S test is not applicable. Due to the finite number of iterations when estimating the distributions, two genes often have the same maximum norm distance. In that case we further compare the genes by the distance between their modes in $f^a_g$ and $f^b_g$. In the case of K replicates, $f^{a,i}_g$ is calculated for the ith replicate, and we perform DE analysis on the average character functions, $\bar{f}^a_g$:

$\bar{f}^a_g(x) = \frac{1}{K} \sum_{i=1}^{K} f^{a,i}_g(x)$.   (10)
The genes are ordered by decreasing maximum norm, then by decreasing mode distance, and the genes at the top of the list are the most likely DE candidates.
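A compact sketch of this ranking is given below; hist_a and hist_b are assumed to be L × P arrays of the (replicate-averaged) character-function histograms, with each row summing to one.

```python
import numpy as np

def de_ranking(hist_a, hist_b):
    """Order genes from most to least likely DE using Eqns (9)-(10)."""
    # Maximum norm between the cumulative distributions (Eqn 9).
    c_a = np.cumsum(hist_a, axis=1)
    c_b = np.cumsum(hist_b, axis=1)
    max_norm = np.abs(c_a - c_b).max(axis=1)

    # Tie-break by the distance between the modes of the two histograms.
    mode_dist = np.abs(hist_a.argmax(axis=1) - hist_b.argmax(axis=1))

    # Sort by decreasing maximum norm, then by decreasing mode distance.
    return np.lexsort((-mode_dist, -max_norm))
```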
Comparison with existing methods
We compare RoDEO against existing methods baySeq [13], edgeR [14], and sSeq [15] for identifying differentially expressed genes. In an effort to avoid biases often associated with generating simulated data using the exact distribution assumed by the methods (negative binomial in this case), we only used real data. Specifically, we used read counts and quantitative expression data from the MAQC human tissue samples, for which DE genes have been experimentally validated by qRT-PCR [1, 2]. The aim here is to detect differentially expressed genes between human reference RNA and brain RNA samples.
The methods were run on all 33k genes for which there were reads sequenced with Ion Torrent Proton technology (MAQC PRO dataset), while the DE results were evaluated only on the 17k genes with available qRT-PCR quantification. We used a log-fold change cutoff of > 1.5 to define DE genes and < 1.0 to define non-DE genes (other genes remained unassigned). Each method was run with default parameters (if any); RoDEO was run with P = 20 bins, I = 100 iterations, and R = 10^7 reads. Each compared method provides a list of genes ordered by the extent of their DE, and this list is used in the evaluation.
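The evaluation protocol can be summarized in a short sketch: genes are labeled DE or non-DE from the qRT-PCR log-fold changes with the cutoffs above, and a method's ordered gene list is scored by AUC via the rank-sum formulation. Function names and the manual AUC computation are ours.

```python
import numpy as np

def qrt_pcr_labels(log_fold_change, de_cutoff=1.5, non_de_cutoff=1.0):
    """1 = DE, 0 = non-DE, -1 = unassigned, from absolute qRT-PCR log-fold changes."""
    lfc = np.abs(np.asarray(log_fold_change, dtype=float))
    labels = np.full(lfc.shape, -1, dtype=int)
    labels[lfc > de_cutoff] = 1
    labels[lfc < non_de_cutoff] = 0
    return labels

def auc_from_ranking(ranked_genes, labels):
    """AUC of a DE ranking (most DE first) against the labels; unassigned genes are ignored."""
    position = np.empty(len(ranked_genes))
    position[np.asarray(ranked_genes)] = np.arange(len(ranked_genes))  # 0 = most DE
    keep = labels >= 0
    pos = position[keep][labels[keep] == 1]
    neg = position[keep][labels[keep] == 0]
    # Fraction of (DE, non-DE) pairs where the DE gene is ranked ahead.
    wins = sum((p < neg).sum() for p in pos)
    return wins / (len(pos) * len(neg))
```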
Figure 7 shows the DE methods' performance on this MAQC PRO data consisting of two treatments, each with three replicates. Results for a single-replicate case using the first replicate only are shown in Figure 8. The four panels illustrate (i) for the number of top DE candidates on the x-axis, how many are false positives; (ii) false positive rate vs. true positive rate, including the AUC (Area Under the Curve) measurement; (iii) for the top up to x = 500 true DE genes, how many are included in the method's DE list of size x (ideally y = x); (iv) AUC when varying the threshold for calling DE genes from the qRT-PCR results (threshold 1.5 is used in panels i-iii).
RoDEO outperforms the other DE methods on this dataset, based on the AUC measurement at varying levels of the PCR DE threshold (panel iv), in both the multiple-replicate and single-replicate cases.