### Candidate statistical measures

Pearson’s correlation coefficient *r* was the main competitor because of the ubiquity of its use in the current RNA-Seq literature. The Kappa statistic was recently proposed as an alternative[5]. This requires the counts to be binned. Here, we used the same binnig as suggested in that paper. The SERE statistic is a ratio of the observed standard deviation between replicates divided by the value that would be expected from an ideal experiment.

### Sensitivity experiment

Figure1 shows results from constructed datasets representing two lanes from an ideal replication experiment (“perfect” in silico replicates affected only by the random sampling effect), and pairs of lanes where one of the two has various amounts of contamination. The contaminating sample in this experiment was from a litter mate under the same experimental condition (biological replicate) in order to make detection purposefully difficult. Each point of the figure represents the average of 200 independent realizations of the simulation experiment. The last point for each method represents a duplication, i.e., comparing a lane with itself (This could be the result of a copy/paste error during analysis, for instance).

For the correlation and concordance measures the value 1 is usually viewed as the “ideal”. This is only achieved for the duplication, a situation where the randomness inherent to the process of read sampling is not allowed and instead a greater than expected congruence between two sample pairs is forced, resulting in an extreme case of “underdispersion”. Data from an actual ideal experiment (0% contamination) had on average correlation values of 0.89 and concordance values of 0.41. SERE on the contrary yielded the expected baseline value of 1 for perfect in silico replicates (0% contamination) and detected contamination as early as 25%. Marked differences appeared when contamination reached 50%. The SERE measure also clearly marks the duplication comparison as unusual (SERE = 0). The sensitivity of both the correlation and concordance measures is much lower, making it difficult to distinguish contaminated samples from the ideal experiment.

### Stability experiment

Another characteristic, stability, interrogates whether the behavior of the underlying statistic is independent of ancillary aspects of the experiment; the obvious such factor in RNA-Seq is the sequencing depth. Therefore, RNA-Seq perfect replicate datasets of different sizes were generated by drawing random reads from the universal read pool. We simulated two types of scenarios: In our first experiment (Figure2) we decreased the number of reads in both lanes from 10 to 0.5 million. Pearson’s *r* fell markedly from 0.93 to 0.71 when the read counts of both datasets in a pair were reduced (Figure2). Kappa was equally sensitive to the total read count, decreasing from 0.54 for perfect replicate pairs with 10^{7} reads per sample down to 0.09 for pairs with only 0.5x10^{6} reads. All datasets represented perfect replicates by definition as they were generated in silico by sampling from a common pool. Therefore, low values of Pearson’s *r* such as <0.8 and Kappa <0.3 are not in all cases indicative of poor RNA-Seq experimental replication. SERE was unaffected by the total count of RNA-Seq reads, remaining stable at 1.0.

In our 2nd experiment (Figure3) we kept the total read count of 200 in silico replicates constant, but varied the relative size of both samples, simulating multiplexed RNA-samples, where both samples will not always yield the same number of reads. Pearson’s *r* and the Kappa statistic performed continuously worse, as the relative difference between the two perfect replicates became bigger, reaching values of approximately 0.82 and 0.19 respectively at the extremes (1 million versus 9 million uniquely mapped reads per sample). SERE on the contrary stayed at a stable value of 1.0 through all the scenarios. A minor increase of the confidence intervals could be observed as the relative sample size tended to the extremes, yet remaining between ± 0.01.

### Performance of the statistics on empirical data

To put the above findings into perspective, we studied the candidate statistics on an empirical dataset which included technical and biological replicates, as well as samples from different experimental conditions (“control” vs. “SNL”, see Methods). Figure4 shows the result for 14 lanes of data, consisting of 3 replicate lanes for each of the 2 “control” rats and 4 replicate lanes for each of the 2 “SNL” rats. First the replicate lanes of each rat were compared (technical replicates), second the biological replicates (the 2 “controls” and the 2 “SNL” respectively) and third the animals belonging to different experimental groups (“SNL” vs. “control” rats). Note that SERE results in a single value for a set of lanes that are being compared, while the correlation and concordance measures apply only to pairs of lanes.

*R* and Kappa were slightly lower for the biological replicates as compared to the technical replicates and further decreased when comparing the two experimental conditions. However, differences were small compared with those caused by total read counts (Figures2 and3) suggesting that both measures are poor candidates for detection of global alterations in practice. SERE was highly sensitive to global differences, with scores of approximately 1.15 for biological replicates and 1.7 to 2.0 for comparisons between different experimental conditions.

The SERE statistic can also be computed pairwise. For the 3 technical replicates of “control 1” for instance, the overall ratio for the three lanes is 1.005, with pairwise values of 1.003, 1.002, and 1.008. When the overall SERE statistic for a set of lanes is large we can use these individual comparisons to further sort out which lane(s) is the source of concern. A simple way to display this is to use SERE to create a cluster map. Figure5 shows the resulting dendrogram for the 14 lanes of data used in Figure4. The dendrogram clearly reflects to the experimental design by first distinguishing between the two conditions, then separating the biological replicates within an experimental configuration and finally grouping the technical replicates. The vertical axis of dendrograms is often left unlabeled since the values are on an arbitrary scale, but in this case they have a direct interpretation as “excess dispersion”. A more interesting example is shown in Additional file1: Figure S1 where we applied SERE on a drosophila melanogaster dataset (SRA id GSE17107) that was employed by McIntyre et al. Interestingly, it revealed 2 distinct groups (no sign of overdispersion within the groups) although the 5 samples originated from the same RNA-Seq library suggesting a possible batch effect.

### The drawbacks of *r* and Kappa

This study was focused on a global approach that is useful in both quality control and early analysis of RNA-Seq experiments. Therefore, an ideal measure for this task was defined to be easy to compute and have three features of sensitivity, calibration and stability. The SERE measure does well, but the correlation and concordance have serious flaws. Why?

Deficiencies in the correlation coefficient have long been known. Chambers et al.[13] for instance showed a panel of 8 graphs with very different patterns all with the same value of *r*. Of most relevance here is that *r* can be dominated by values at the extremes of the data. Count data for RNA-Seq is very skewed. For example, in the first lane of “control 1”, 55.4% of the observed exons had a count of 10 or less out of 5,512,030 million total uniquely mapped reads, while the highest had a count of 95660. Even under a log-transformation, the largest few counts have in inordinate influence. Further examination of the results underlying Figures2 and3 shows that the value of *r* for the ideal samples is essentially determined by the range of the counts, which in turn is closely related to the total sequencing depth (Additional file1: Figure S2). This causes *r* to have both a varying target value (the expected result for a perfect replicate) and high variability. Even a low value (e.g <0.8) might result from an “ideal” experiment. At the same time, markedly different samples can yield correlation coefficients of >>0.8 if the total read count is high. Moreover, Pearson’s *r* can be computed in several ways: on square root of the raw counts, on count data after addition of a pseudo-count and log-transformation or on RPKM data after adding a pseudo-count and log-transformation. The latter is the most commonly used normalization and therefore employed here. These different approaches can substantially change the outcome since they influence the distance between the extremes (data not shown).

By categorizing the data into bins, as performed by the Kappa statistic, one avoids the susceptibility to values on the extreme of the scale. However the choice of the bin sizes becomes the driving factor for this statistic. Additional file1: Figure S3 demonstrates how the binning influences the result of Kappa for one and the same dataset. If the bins are chosen as reported previously by McIntyre et al. (Additional file1: Figure S3A) Kappa is 0.41. When the bins are chosen wider (Additional file1: Figure S3D), the value for Kappa raises to approximately 0.73. By choosing very small bin sizes (Additional file1: Figure S3C), the Kappa value decreases to approximately 0.3. We also computed a weighted Kappa[15] employing the most common definition where disagreement is proportional to the distance from the diagonal. The result shown in Additional file1: Figure S4 demonstrated the same characteristics of Kappa shown for the unweighted procedure.

For the simulation study, we chose the unweighted Kappa. We took the same bin sizes as proposed by McIntyre et al., which used 0 counts as the smallest bin. Therefore, whenever the expression of an exon is so sparse that only a single read is detected among two or more samples (the exon is a singleton) the exon will be scored as “off the diagonal” since it will fall into the bin “0” for one sample and in “1-10” for the other sample. The fraction of singletons in our in silico samples with 5 million UMRs is 11.56-11.96%, which alone limits the Kappa to a maximum of about 0.89. The total fraction of singletons tended to decrease by increasing the total read count and the calculated Kappa value rises as seen in Figure2.

Computational simulation can be helpful in estimating the expected values for Pearson’s *r* or Kappa but need to take the specific experimental condition into account and cannot be generalized easily. Therefore the use of *r* and Kappa to investigate whether two RNA-Seq datasets are faithful replicates or subject to systematic differences or bias leads to ambiguity in most cases.

### The Simple Error Ratio Estimate (SERE)

The third candidate statistic appears to be a useful measurement to identify global differences between RNA-Seq data by fulfilling the set criteria of a good measure. A primary reason is that it compares the observed variation to an expected value, and the latter accounts for the impact of varying read depth. It is easy to compute and satisfies our three primary criteria.

#### Calibration

A “perfect” SERE of 1 indicates that samples differ exactly as would be expected due to Poisson variation. If RNA-Seq samples are truly different, this is identified by values > 1 (overdispersion). Values below 1 are well interpretable and indicate “underdispersion,” e.g. through artefactual duplication of data. A value of 0 would constitute perfect identity, such as might occur from accidentally duplicating a file name. Interestingly, detection of underdispersion has been important in detecting data falsification[16]; faithful randomness is difficult to fabricate.

#### Sensitivity

A constructed replicate with 25% contamination was successfully indicated as overdispersed by SERE. As soon as one dataset contains 50% of its reads from another biological replicate, the indication of overdispersion becomes even more obvious. Thus, SERE is a qualified measure to detect processing errors and other sources of variation.

#### Stability

In RNA-Seq experiments the read counts per exon in a sample vary, either due to rareness of the exon within the sample or due to total number of reads. The expected variation between lanes for that exon also changes. Because SERE explicitly accounts for this, comparing observed to expected counts, it is largely unaffected by these changes, regardless of the sequencing depth. This was confirmed by 200 in silico simulations performed for various numbers of reads, where SERE was 1 on average. However, each simulation is subject to variation and therefore will slightly deviate from 1 either in the direction of under- (<1) or overdispersion (>1). To characterize the range of this variation we calculated the confidence interval (CI) for all the simulations. As seen in Figure2, the 99% CI was narrow, ranging from 0.99 to 1.01 regardless of the total read count.

As shown in Figure5, we can also use the measure to more finely dissect the variation found in an experiment. This is a useful extension to the quality control assessment. However, when comparing samples on a single gene level, e.g. multiple treatment groups, both the expected Poisson variation exploited by SERE and the biological variation between and within treatment groups play an important role, and methods that take both into account would be preferred for deeper inquiry (e.g. DESeq package of Anders et al. and edgeR by Robinson et al.). Yet, SERE remains useful as an initial diagnostic tool.

Li et al. recently introduced the “Irreproducible Discovery Rate” (IDR) as a measure of reproducibility[17] and demonstrated its utility for the analysis of a variety of sequencing-based high-throughput data types[18]. We tested the IDR on two scenarios considered in this study, namely a comparison of identical datasets and a comparison of perfect replicates. In the case of identical datasets, the change of correspondence curve generated by IDR analysis indicated perfect correspondence (Additional file1: Figure S5) lacking an indicator that this is a case of extreme underdispersion indicative of an unexpected finding such as human error or mischief. In the case of perfect replicates (generated in silico and differing only due to Poisson variation) the change of correspondence curve became noticeably noisy when the total number of read counts in the simulation was small (Additional file1: Figure S6). The IDR was developed for comparison of datasets with unknown or differing distribution types. It is therefore unsurprising that it appears to be less useful than SERE for the detection of under- and overdispersion in the comparison of replicate datasets affected by Poisson variation. IDR was developed for situations where two experimental replicates or methods could reasonably expected to agree without anticipating perfection. The statistical concept underlying SERE is unrelated to IDR and was developed for situations where two experimental replicates could reasonably be expected to differ only due to stochasticity.