SERE: Single-parameter quality control and sample comparison for RNA-Seq
© Schulze et al.; licensee BioMed Central Ltd. 2012
Received: 11 July 2012
Accepted: 18 September 2012
Published: 3 October 2012
Assessing the reliability of experimental replicates (or global alterations corresponding to different experimental conditions) is a critical step in analyzing RNA-Seq data. Pearson’s correlation coefficient r has been widely used in the RNA-Seq field even though its statistical characteristics may be poorly suited to the task.
Here we present a single-parameter test procedure for count data, the Simple Error Ratio Estimate (SERE), that can determine whether two RNA-Seq libraries are faithful replicates or globally different. Benchmarking shows that the interpretation of SERE is unambiguous regardless of the total read count or the range of expression differences among bins (exons or genes): a score of 1 indicates faithful replication (i.e., samples are affected only by Poisson variation of individual counts), a score of 0 indicates data duplication, and scores >1 correspond to true global differences between RNA-Seq libraries. By contrast, the interpretation of Pearson’s r is generally ambiguous and highly dependent on sequencing depth and the range of expression levels inherent to the sample (difference between lowest and highest bin count). Cohen’s simple Kappa results are also ambiguous and are highly dependent on the choice of bins. For quantifying global sample differences SERE performs similarly to a measure based on the negative binomial distribution yet is simpler to compute.
SERE can therefore serve as a straightforward and reliable statistical procedure for the global assessment of pairs or large groups of RNA-Seq datasets by a single statistical parameter.
Massively parallel shotgun RNA-Sequencing (RNA-Seq) has become the technology of choice for transcriptome analysis because of its potential to yield extensive biological information with digital precision. The development of effective statistical data analysis methods has been essential to the utility of RNA-Seq and has been a focus since the original reports on the technology [1, 2]. The statistical analysis of RNA-Seq variability has been the focus of several comprehensive studies [3, 4] and remains a topic of active investigation. A common task in RNA-Seq statistical analysis is to determine whether two RNA-Seq datasets are faithful replicates and, if not, whether the two datasets differ only slightly or very markedly. Sophisticated statistical tools for the analysis of Next-Generation Sequencing data are beginning to appear, e.g. the edgeR and DESeq packages. Here we focus on a more targeted approach that is useful in both quality control and early analysis. An ideal measure for this task should have three features: (1) Sensitivity: the measure is sensitive to actual differences; (2) Calibration: there is a known baseline value that corresponds to success; (3) Stability: the behavior is independent of the sequencing depth, total number of exons, and other experimental conditions not relevant to the question. Simple computation is a desirable but not essential attribute.
Pearson’s correlation coefficient r has been widely used to affirm that pairs of RNA-Seq datasets are faithful replicates [1, 2, 7–9] and continues to be in use [10–12]. However, as a quasi-standard in the RNA-Seq literature r may be problematic, as it may suffer not only from the general pitfalls that have long been recognized (e.g. Chambers et al.) but also from additional shortcomings specific to count data.
As an alternative, McIntyre et al. recently suggested a measure of concordance based on the Kappa statistic to compare RNA-Seq samples. Applying the Kappa procedure to multiple RNA-Seq samples, the authors concluded that replicates of the same RNA-Seq library from different “lanes” (compartments in an Illumina genome analyzer flow cell used to separate samples) were subject to a systematic bias, a finding that appeared to contradict previous observations by others [3, 4, 14]. Similarly to Pearson’s r, Kappa may be subject to confounding factors of the experimental design such as the total read count. Furthermore, interpreting Kappa under the premise that it should be 1 for perfect replicates may be naive.
Here we propose a new candidate statistic for RNA-Seq sample comparison based on the ratio of observed variation to what would be expected from an ideal Poisson experiment. We show that the Simple Error Ratio Estimate (SERE), unlike r and Kappa, can be expected to be 1 for perfect replicates, affected only by random sampling. We evaluated the 3 statistics on the above criteria (calibration, sensitivity, stability) using original RNA-Seq data from rat neural tissue that contained multiple technical and biological replicates. From this we created ideal in silico replicates by randomly splitting an observed lane into 2 pseudo-lanes, and also simulated various degrees of contamination. This allowed us to examine the behavior of the methods under known outcomes and revealed serious deficiencies in the correlation and concordance approaches. Finally, the methods were compared on the actual datasets.
Pearson’s correlation coefficient r was the main competitor because of the ubiquity of its use in the current RNA-Seq literature. The Kappa statistic was recently proposed as an alternative; it requires the counts to be binned, and here we used the same binning as suggested in that paper. The SERE statistic is the ratio of the observed standard deviation between replicates to the value that would be expected from an ideal experiment.
For the correlation and concordance measures the value 1 is usually viewed as the “ideal”. This is achieved only for duplication, a situation in which the randomness inherent to the process of read sampling is not allowed and a greater than expected congruence between the members of a sample pair is instead forced, resulting in an extreme case of “underdispersion”. Data from an actual ideal experiment (0% contamination) had on average correlation values of 0.89 and concordance values of 0.41. SERE, on the contrary, yielded the expected baseline value of 1 for perfect in silico replicates (0% contamination) and detected contamination at levels as low as 25%. Marked differences appeared when contamination reached 50%. The SERE measure also clearly marks the duplication comparison as unusual (SERE = 0). The sensitivity of both the correlation and concordance measures is much lower, making it difficult to distinguish contaminated samples from the ideal experiment.
r and Kappa were slightly lower for the biological replicates as compared to the technical replicates and decreased further when comparing the two experimental conditions. However, these differences were small compared with those caused by total read counts (Figures 2 and 3), suggesting that both measures are poor candidates for detecting global alterations in practice. SERE was highly sensitive to global differences, with scores of approximately 1.15 for biological replicates and 1.7 to 2.0 for comparisons between different experimental conditions.
This study was focused on a global approach that is useful in both quality control and early analysis of RNA-Seq experiments. Therefore, an ideal measure for this task was defined to be easy to compute and have three features of sensitivity, calibration and stability. The SERE measure does well, but the correlation and concordance have serious flaws. Why?
Deficiencies in the correlation coefficient have long been known. Chambers et al., for instance, showed a panel of 8 graphs with very different patterns, all with the same value of r. Of most relevance here is that r can be dominated by values at the extremes of the data. Count data for RNA-Seq is very skewed. For example, in the first lane of “control 1”, 55.4% of the observed exons had a count of 10 or less out of 5,512,030 total uniquely mapped reads, while the highest had a count of 95,660. Even under a log-transformation, the largest few counts have an inordinate influence. Further examination of the results underlying Figures 2 and 3 shows that the value of r for the ideal samples is essentially determined by the range of the counts, which in turn is closely related to the total sequencing depth (Additional file 1: Figure S2). This causes r to have both a varying target value (the expected result for a perfect replicate) and high variability. Even a low value (e.g., <0.8) might result from an “ideal” experiment. At the same time, markedly different samples can yield correlation coefficients of >>0.8 if the total read count is high. Moreover, Pearson’s r can be computed in several ways: on the square root of the raw counts, on count data after addition of a pseudo-count and log-transformation, or on RPKM data after adding a pseudo-count and log-transformation. The latter is the most commonly used normalization and therefore employed here. These different approaches can substantially change the outcome since they influence the distance between the extremes (data not shown).
By categorizing the data into bins, as performed by the Kappa statistic, one avoids the susceptibility to values on the extreme of the scale. However, the choice of the bin sizes becomes the driving factor for this statistic. Additional file 1: Figure S3 demonstrates how the binning influences the result of Kappa for a single dataset. If the bins are chosen as reported previously by McIntyre et al. (Additional file 1: Figure S3A), Kappa is 0.41. When the bins are chosen wider (Additional file 1: Figure S3D), the value of Kappa rises to approximately 0.73. By choosing very small bin sizes (Additional file 1: Figure S3C), the Kappa value decreases to approximately 0.3. We also computed a weighted Kappa, employing the most common definition in which disagreement is proportional to the distance from the diagonal. The result shown in Additional file 1: Figure S4 demonstrated the same characteristics as shown for the unweighted procedure.
For the simulation study, we chose the unweighted Kappa. We took the same bin sizes as proposed by McIntyre et al., with 0 counts as the smallest bin. Therefore, whenever the expression of an exon is so sparse that only a single read is detected among two or more samples (the exon is a singleton), the exon will be scored as “off the diagonal,” since it will fall into the bin “0” for one sample and into “1-10” for the other. The fraction of singletons in our in silico samples with 5 million UMRs is 11.56-11.96%, which alone limits Kappa to a maximum of about 0.89. The total fraction of singletons decreased with increasing total read count, and the calculated Kappa value rose accordingly, as seen in Figure 2.
Computational simulation can be helpful in estimating the expected values of Pearson’s r or Kappa but must take the specific experimental conditions into account and cannot easily be generalized. Therefore the use of r and Kappa to investigate whether two RNA-Seq datasets are faithful replicates or subject to systematic differences or bias leads to ambiguity in most cases.
The third candidate statistic, SERE, appears to be a useful measure for identifying global differences between RNA-Seq data, fulfilling the set criteria of a good measure. A primary reason is that it compares the observed variation to an expected value, and the latter accounts for the impact of varying read depth. It is also easy to compute.
A “perfect” SERE of 1 indicates that samples differ exactly as would be expected due to Poisson variation. If RNA-Seq samples are truly different, this is identified by values >1 (overdispersion). Values below 1 are also readily interpretable and indicate “underdispersion,” e.g. through artefactual duplication of data. A value of 0 would constitute perfect identity, such as might occur from accidentally duplicating a file name. Interestingly, detection of underdispersion has been important in detecting data falsification; faithful randomness is difficult to fabricate.
A constructed replicate with 25% contamination was correctly flagged as overdispersed by SERE. As soon as one dataset contains 50% of its reads from another biological replicate, the overdispersion becomes even more obvious. Thus, SERE is well suited to detecting processing errors and other sources of variation.
In RNA-Seq experiments the read counts per exon in a sample vary, either due to the rarity of the exon within the sample or due to the total number of reads. The expected variation between lanes for that exon also changes. Because SERE explicitly accounts for this, comparing observed to expected counts, it is largely unaffected by these changes, regardless of the sequencing depth. This was confirmed by 200 in silico simulations performed for various numbers of reads, where SERE was 1 on average. However, each simulation is subject to variation and therefore will deviate slightly from 1, either in the direction of under- (<1) or overdispersion (>1). To characterize the range of this variation we calculated the confidence interval (CI) for all the simulations. As seen in Figure 2, the 99% CI was narrow, ranging from 0.99 to 1.01 regardless of the total read count.
As shown in Figure 5, we can also use the measure to more finely dissect the variation found in an experiment. This is a useful extension to the quality control assessment. However, when comparing samples on a single gene level, e.g. multiple treatment groups, both the expected Poisson variation exploited by SERE and the biological variation between and within treatment groups play an important role, and methods that take both into account would be preferred for deeper inquiry (e.g. DESeq package of Anders et al. and edgeR by Robinson et al.). Yet, SERE remains useful as an initial diagnostic tool.
Li et al. recently introduced the “Irreproducible Discovery Rate” (IDR) as a measure of reproducibility and demonstrated its utility for the analysis of a variety of sequencing-based high-throughput data types. We tested the IDR on two scenarios considered in this study, namely a comparison of identical datasets and a comparison of perfect replicates. In the case of identical datasets, the change of correspondence curve generated by IDR analysis indicated perfect correspondence (Additional file 1: Figure S5) but lacked any indication that this is a case of extreme underdispersion, suggestive of an unexpected event such as human error or mischief. In the case of perfect replicates (generated in silico and differing only due to Poisson variation), the change of correspondence curve became noticeably noisy when the total number of read counts in the simulation was small (Additional file 1: Figure S6). The IDR was developed for comparison of datasets with unknown or differing distribution types. It is therefore unsurprising that it appears to be less useful than SERE for the detection of under- and overdispersion in the comparison of replicate datasets affected by Poisson variation. IDR was developed for situations where two experimental replicates or methods could reasonably be expected to agree without anticipating perfection. The statistical concept underlying SERE is unrelated to IDR and was developed for situations where two experimental replicates could reasonably be expected to differ only due to stochasticity.
SERE provides an efficient single-parameter statistical measure of reproducibility for RNA-Seq datasets. Unlike two other measures currently in use, Pearson’s correlation coefficient r and the concordance measure based on the Kappa statistic, SERE is independent of typically varying experimental circumstances (such as the total read count). The interpretation of SERE is straightforward, steering clear of the ambiguities resulting from misinterpretation of r and Kappa. A SERE of 1.0 corresponds to the normal degree of dispersion resulting from the Poisson variation of raw read counts. Because SERE is a measure of dispersion, its interpretation extends to two situations of practical importance: underdispersion, indicative of, for example, data duplication, and overdispersion, usable as a global measure of the degree of alteration, which is agnostic to the relative importance of different genomic regions because SERE weights each observed read identically. SERE may in principle be applicable to a broad range of read count datasets, such as those from ChIP-seq, or for comparison of alternately processed read data such as the counts of a k-mer analysis. The present study suggests that SERE has superior characteristics to previously used measures in the case of RNA-Seq.
RNA-Seq read data used in the present analysis was taken from a previous study and is available through the sequence read archive (SRA id GSE20895). Fourteen RNA-Seq datasets were used, each corresponding to the sequencing reads from one “lane,” which is a physical compartment in the flow cell of the Illumina GA-II instrument. The 14 lanes corresponded to 4 RNA-Seq libraries, whereby 3 lanes of sequence data were available for each of two of the libraries and 4 lanes for each of the other two. Each library was synthesized from a different source of RNA. RNA sources corresponded to two experimental conditions, “SNL” (spinal nerve ligation) or “control.” Two independent RNA-Seq libraries, “biological replicates,” were available for each condition. The published analysis found that biological replicates were similar and that the 2 experimental conditions were meaningfully different. Technical replicates were not compared in the previous report (instead, reads originating from different lanes corresponding to the same RNA-Seq library were pooled) but are included in the comparisons made in the present study.
Additional file 1: Table S1 lists the condition and replicate identification corresponding to each of the 14 lanes. The table illustrates how the available data allowed for three types of comparisons between RNA-Seq datasets: pairs of technical replicates (same library sequenced on different lanes); pairs of biological replicates (same experimental condition); and the two experimental conditions “SNL” versus “control”.
RNA-Seq reads (50 bp) were aligned to the rat reference genome (RGSC 3.4) with Bowtie. We allowed a maximum of two mismatches and considered only the uniquely mapped reads for the downstream analysis. This filtration step resulted in 108,636,496 UMRs to the genome. Genome annotation ENSEMBL 65 was used, including 22,921 protein-coding genes. Overlapping exons of genes having multiple isoforms were combined, resulting in a total of 222,097 exons (Additional file 2). For the subsequent analysis exons served as bins, i.e., reads aligning to each exon were counted and the sum noted in the master read count file (Additional file 3). The file has 222,097 rows (bins corresponding to exons) and 14 columns (corresponding to each of the lanes described above).
All uniquely mapped reads (from lanes 1 to 3) from the first “control” RNA-Seq sample were combined, resulting in 22.9 × 10⁶ reads, in order to create a universal pool. The datasets for the in silico duplicates and replicates described below were generated from this pool. The in silico replicates created from the universal pool of RNA-Seq reads by random drawing are by definition different only due to stochastic (Poisson) variation of the sampling process (see Results). Similarly, all 3 lanes from “control 2” were combined to create a second pool used as “contaminant” in the contamination experiments.
A set of RNA-Seq datasets faithfully representing Poisson variation only (perfect non-identical replicates) was generated by randomly choosing sets of 5 × 10⁶ reads from the universal pool using the “sample” function in R (Additional file 4) and a Java script (Additional file 5). This process was repeated 200 times and used as the reference of perfect replicates in the subsequent benchmarking of the different test statistics. The same procedure was further repeated with different set sizes (0.5 to 10 million reads per sample) in order to test for the influence of total read counts on the statistics.
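As a hypothetical sketch of this step (the study itself used R’s “sample” function and a Java script; all names below are illustrative), drawing a fixed number of reads without replacement from a pool can be expressed as a multivariate hypergeometric draw over the pool’s per-exon counts:

```python
import numpy as np

def draw_replicate(pool_counts, n_reads, rng):
    """Draw n_reads reads without replacement from a pooled lane.

    pool_counts: integer vector of per-exon read counts in the pool.
    Returns a vector of per-exon counts for one in silico replicate;
    sampling individual reads without replacement is equivalent to a
    multivariate hypergeometric draw over the exon categories.
    """
    return rng.multivariate_hypergeometric(pool_counts, n_reads)

# Toy pool standing in for the 22.9 million pooled "control 1" reads.
rng = np.random.default_rng(1)
pool = rng.poisson(100, size=1000)        # hypothetical per-exon pool counts
rep_a = draw_replicate(pool, 20000, rng)  # first pseudo-lane
rep_b = draw_replicate(pool, 20000, rng)  # second, independent pseudo-lane
```

By construction the two pseudo-lanes differ only through the stochasticity of the draw, which is the property the benchmark relies on.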
To test whether the statistical measures were sensitive to actual differences, we contaminated one in silico replicate of a pair with 0, 5, 10, 25, 50, 75, or 100% of a biological replicate (“control 2”) via computer simulation. In detail, the first sample was created by randomly drawing 5 million reads from the universal pool of “control 1” and the second by drawing x% of reads from “control 1” and y% of reads from “control 2”, whereby x + y = 100, corresponding to 5 million reads. The procedure was repeated 200 times.
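The contamination scheme can be sketched in the same count-vector representation (a non-authoritative Python illustration; pool sizes and names are toy assumptions):

```python
import numpy as np

def contaminated_sample(pool_a, pool_b, n_reads, frac_b, rng):
    """Draw n_reads reads in total, a fraction frac_b of them from
    pool_b (the 'contaminant') and the remainder from pool_a."""
    n_b = int(round(n_reads * frac_b))
    reads_a = rng.multivariate_hypergeometric(pool_a, n_reads - n_b)
    reads_b = rng.multivariate_hypergeometric(pool_b, n_b)
    return reads_a + reads_b

rng = np.random.default_rng(2)
pool_a = rng.poisson(100, size=1000)  # stands in for the "control 1" pool
pool_b = rng.poisson(100, size=1000)  # stands in for the "control 2" pool
sample = contaminated_sample(pool_a, pool_b, 20000, 0.25, rng)  # 25% contamination
```

Setting frac_b to 0 reproduces the pure in silico replicate, and frac_b of 1 corresponds to a full biological replicate, matching the 0-100% series described above.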
For Pearson’s correlation coefficient and Kappa, the 3 lanes of each of the two “control” libraries and the 4 lanes of each of the two “SNL” libraries were compared in a pairwise fashion, resulting in a total of 18 technical replicate comparisons (see Figure 4). To compare the performance of the statistics on biological replicates we compared the first lane of “control 1” to the first lane of “control 2”, and lane 1 from “SNL 1” to lane 1 from “SNL 2”. To test whether Pearson’s r and Kappa were sensitive to different experimental conditions, we selected the first lane of each “control” and compared it to the first lane of each “SNL” sample. SERE is not restricted to pairwise comparisons but allows multiple samples or lanes to be compared simultaneously. Therefore, SERE yielded 4 values for the technical replicate comparisons, 2 for biological replicates, and 4 for different experimental conditions.
Let y_ij denote the observed read count for exon i (i = 1, …, N) in lane j (j = 1, …, M), and let L_j be the total read count of lane j. Under the null hypothesis that the lanes are faithful replicates differing only by Poisson variation, the expected count for exon i in lane j is estimated as ŷ_ij = L_j (Σ_j y_ij)/(Σ_j L_j), and the per-exon dispersion is

s_i² = [Σ_j (y_ij − ŷ_ij)² / ŷ_ij] / (M − 1)   (equation 2)

The denominator is (M − 1) due to the constraint that Σ_j y_ij = Σ_j ŷ_ij for each exon i.

The SERE estimate is the square root of the average dispersion over all exons:

SERE = √[(1/N) Σ_i s_i²]
Simple algebra shows that for a singleton count, i.e., an exon for which a single read appears in only one of the lanes, the per-exon statistic s_i² equals exactly 1. That is, singletons shrink the overall SERE estimate towards 1, whether or not the samples are actually replicates. Therefore, we modify the average over the s_i² to include only the non-singleton exons. The R code to calculate SERE is provided in Additional file 6.
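A minimal Python analogue of this computation (the authoritative implementation is the R code in Additional file 6; this sketch assumes a counts matrix with one row per exon and one column per lane, and drops empty bins and singletons as described):

```python
import numpy as np

def sere(counts):
    """Simple Error Ratio Estimate for an N x M count matrix
    (rows = exons/bins, columns = lanes).

    Empty bins and singletons (total count of 1 across all lanes)
    are excluded before averaging, as described in the text."""
    counts = np.asarray(counts, dtype=float)
    lane_totals = counts.sum(axis=0)       # L_j, total reads per lane
    keep = counts.sum(axis=1) > 1          # drop empty bins and singletons
    y = counts[keep]
    # Expected counts under the null: y_hat_ij = L_j * (sum_j y_ij) / (sum_j L_j)
    y_hat = np.outer(y.sum(axis=1), lane_totals) / lane_totals.sum()
    n, m = y.shape
    dispersion = ((y - y_hat) ** 2 / y_hat).sum() / (n * (m - 1))
    return float(np.sqrt(dispersion))
```

For a duplicated lane the function returns 0, while for in silico Poisson replicates it fluctuates narrowly around 1, matching the calibration property discussed in the Results.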
The unmodified measure of equation 2 is the measure of Poisson over-dispersion most often used in generalized linear models; see for instance the classic textbook of McCullagh and Nelder. An alternative measure of overdispersion is based on the deviance statistic. Brown and Zhao consider this measure along with two others in the context of random arrival data, e.g. calls to a support center, and show that it is inferior to s_i² (equation 2) whenever there are small expected counts. Their work corresponds to the case where all lanes have the same total count L_j; extending their method shows that (M − 1)s_i² will be distributed as a chi-square random variable with (M − 1) degrees of freedom and non-centrality parameter λ_i = Σ_j L_j (p_ij − p̄_i)²/p̄_i. The parameter p_ij is the true fraction of exon i within sample j and p̄_i the fraction of exon i in the (hypothetical) pooled sample. Under the null hypothesis of sample equality the non-centrality parameter is zero, and N(M − 1)s̄², where s̄² is the average of the s_i² (i.e., the squared SERE), will follow a chi-squared distribution with N(M − 1) degrees of freedom. This can be used to set 99% confidence intervals on the non-centrality parameter, and through that on the SERE estimate. We provide the corresponding R code in Additional file 7. Bins containing a read count of 0 in both samples of a pair contain no information and were therefore excluded from the comparative analyses of SERE and the two alternative parameters described below.
For a pair of lanes, we calculated the RPKM for each lane. In the RPKM calculation we added a pseudo-count of one read to each of the exons. The RPKM values were then log transformed and Pearson’s correlation was calculated. The process was repeated using the raw counts to check for any influence of the RPKM transform itself, and using both the log and the square root of the raw counts. The latter is the variance-stabilizing transform for Poisson data. None of these had any substantive impact on the results (data not shown).
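A sketch of this calculation, assuming exon lengths in bp are available (adding the pseudo-count before the RPKM normalization is one plausible reading of the description above):

```python
import numpy as np

def pearson_log_rpkm(counts_a, counts_b, exon_lengths_bp):
    """Pearson's r on log-RPKM with a pseudo-count of one read per exon.

    counts_a, counts_b: per-exon read counts for the two lanes.
    exon_lengths_bp: exon model lengths in base pairs."""
    def rpkm(counts):
        per_million = counts.sum() / 1e6                # library size in millions
        return (counts + 1.0) / (exon_lengths_bp / 1e3) / per_million
    x = np.log(rpkm(np.asarray(counts_a, dtype=float)))
    y = np.log(rpkm(np.asarray(counts_b, dtype=float)))
    return float(np.corrcoef(x, y)[0, 1])
```

Swapping the log-RPKM transform for raw counts or their square root only requires changing the transform inside this function, mirroring the robustness check described above.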
The read counts were normalized to RPKM and divided into 9 bins: 0, 1-10, 11-20, 21-40, 41-80, 81-160, 161-320, 321-1000, and greater than 1000 RPKM, as suggested by McIntyre et al. In order to compare a pair of replicates, a 9 × 9 table of counts was constructed, whereby each exon pair contributed to one cell of the table (see Additional file 8). In this way, exons that were in agreement added to the diagonal of the table, whereas the total fraction of off-diagonal entries contributed to a measure of non-agreement.
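The binning and the unweighted Kappa computation can be sketched as follows (a Python illustration; the bin edges encode the RPKM ranges listed above):

```python
import numpy as np

# Upper-open edges for the 9 RPKM bins: <1, 1-10, 11-20, 21-40,
# 41-80, 81-160, 161-320, 321-1000, >1000.
BIN_EDGES = np.array([1, 11, 21, 41, 81, 161, 321, 1001])

def bin_index(rpkm):
    """Map each RPKM value to one of the 9 bins (labels 0..8)."""
    return np.digitize(rpkm, BIN_EDGES)

def cohens_kappa(bins_a, bins_b, k=9):
    """Unweighted Cohen's Kappa from two vectors of bin labels."""
    table = np.zeros((k, k))
    np.add.at(table, (bins_a, bins_b), 1)  # 9 x 9 agreement table
    table /= table.sum()
    p_observed = np.trace(table)           # fraction on the diagonal
    p_chance = table.sum(axis=1) @ table.sum(axis=0)
    return float((p_observed - p_chance) / (1 - p_chance))
```

Because Kappa is computed from the table alone, changing BIN_EDGES changes the result for the very same data, which is the bin-dependence problem discussed in the text.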
Simple error ratio estimate
Pearson’s correlation coefficient
Spinal nerve ligation
Uniquely mapped reads
Reads per kilobase of exon model per million mapped reads
Irreproducible discovery rate.
This research was supported by NINDS.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.