Supplementary results: On the utility of RNA sample pooling to optimize cost and statistical power in RNA sequencing experiments

1.1 Proofs for the mean and variance of Yk We present here the proof for Var {Yk} in equation (3) in the manuscript. We first prove the expressions analytically and later we present the empirical confirmation using Monte-Carlo simulations. For a reminder, n is the number of biological samples, m is the number of pools, q is the number of biological samples per pool (q = n/m), Uj is the read counts of a given gene in biological sample j = 1, 2, . . . , n, Yk is the gene expression level in pool k = 1, 2, . . . ,m, Wjk denote the mixing weight for biological sample j in pool k, and Ajk is an indicator defined as 1 if biological sample j is in pool k, and 0 otherwise. Proof. The data generating model in (1) is conditional on the pool size q, which is assumed to be fixed. This implies that Ajk is subject to the additional constraint ∑n j=1 Ajk = q, which affects the variance calculation. Without the constraint, if we let Q = ∑n j=1 Ajk ∈ {0, 1, . . . , n}, then Q ∼ Binomial(n, 1/m). Similarly, let Q(j) = ∑ i 6=j Aik ∈ {0, 1, . . . , n− 1}, then Q(j) ∼ Binomial(n− 1, 1/m). Therefore, P(Ajk = 1|Q = q) = P(Q = q|Ajk = 1)P(Ajk = 1) P(Q = q) = P(Q(j) = q − 1) 1 m P(Q = q) . This is because P(Q = q|Ajk = 1) = P(Q(j) = q − 1). Therefore,

The same result can also be obtained if we translate the problem to an m × n contingency table with fixed row and column totals (similar constraints we have). That is, if pools (k = 1, 2, . . . , m) are in the rows and the biological samples (j = 1, 2, . . . , n) are in the columns, then all the row totals will be q and all the column totals will be 1, and A jk are the (ij) th ellements of the table. This setting will give exactly the same mean and variance of A jk .
In addition, under the assumption that the pooling weights W k ∼ Dirichlet(1, 1, . . . , 1) in a given pool k, for the biological sample j (with A jk = 1), E {W jk } = 1/q and Var {W jk } = q−1 q 2 (q+1) . Therefore, based on these results For the variance, upon using the result that for n independent random variables X 1 , X 2 , . . . X n , q(n − q) n 2 + q 2 n 2 q − 1 q 2 (q + 1) + 1 q 2 σ 2 j + µ 2 j − q 2 n 2 1 q 2 µ 2 j + σ 2 = 2 n(q + 1) If U j ∼ Negative Binomial(µ j , φ), where µ j = ρL 0 j , ρ is the relative abundance, L 0 j is the library size in biological sample j (virtual library size), φ is the over-dispersion parameter, then, Var {U j } = σ 2 j = µ j + φµ 2 j . Therefore, the mean and variance of Y k becomes, To verify the mean and variance of Y k (also the coefficient of variation) based on the expressions in (1) and (2), we set up a Monte-Carlo (MC) simulation with 2000 runs. In a given MC simulation i, i = 1, 2, . . . , 2000, generate n = 60 read counts from negative binomial distribution U (i) j ∼ NB(µ j , φ) and subsequently generates Y (i) k using the data generating model (see equation (1) in the manuscript) for a pool size q. In a single MC simulation run i, the mean and variance of Y i . Different choices of µ j , φ, and q were considered. The results in Figure S1 show that the expressions in (1) and (2) are equivalent to the their corresponding MC approximations. This confirms that the expressions in (1) and (2) describe the true mean and variance of Y k , respectively.  Figure S1: The Monte-Carlo and the analytical estimate of the mean (µ Y ) and variance of (σ Y ) of Y k at different pool size (q). The solid red line indicates the analytical estimates, whereas the boxplots show the distribution of the the sample estimates in each Monte-Carlo simulation at each q. The Monte-Carlo estimates are the average across simulations in each q and they are indicated by solid black points on each boxplot.

Estimation of the relative abundance and log-fold-change
The moment estimator of the relative abundance ρ of a particular gene based on the gene expressions U j from the n individual biological samples, is given bŷ where L 0 = n j=1 L o j is the total virtual library sizes, and L o j is the virtual library size in biological sample j. Similarly, we can drive the moment estimator of ρ based on gene expressions from the pooled samples Y k , k = 1, 2, . . . , m starting from the sample mean of Y k . That is,Ȳ = 1 m m k=1 Y k ⇒ E Ȳ = 1 n ρL 0 , and henceρ Now we compare the expectation and variance ofρ from the the standard experiment (3) and the pooled experiments (4). For this purpose, letρ andρ * denote the estimates of the relative abundance from the standard and pooled experiment, respectively. Since we have used the moment estimators of ρ for both settings, it immediately follows that E {ρ} = E {ρ * } = ρ.
It can be shown that Using equation (3) of the main manuscript, it follows that The right two terms of (6) are nearly 0 (division by very large number L 2 0 ) and have a negligible contribution to Var {ρ * }. Consequently, we find Var The expression in (7) implies that pooling leads to an estimate of the relative abundance which is at least 2q/(q + 1) times more variable than the estimate we can obtain without pooling.
In DGE analysis, one essential statistic is the estimate of the biological effect (effect size). In many parametric methods, the log-fold-change (LFC) is commonly used to caliberate the biological effect size. As a result, we compare the LFC estimates from the standard and the pooled experiments. For testing DGE between two independent groups, the LFC of a particular gene is defined as θ = log ρ2 ρ1 , where ρ k is the relative abundance in group k ∈ {1, 2}. The estimate of θ for the standard experiment is given byθ = logρ 2 ρ1 and for the pooled Using the second-order Taylor expansion (the Delta method), we can approximate the variance ofθ andθ * as and Therefore, it follows that This also indicates that the LFC estimate from pooled experiments is at least 2q/(q + 1) times more variable than that of the standard experiment. This is an important characteristic that affects the statistical power of a DGE test as shown in the next section.

Power calculation
Assume there is no pooling and we want to test for DGE between two independent groups of biological samples. Let U jk denotes the read counts in biological sample j = 1, 2, . . . , n k of group k ∈ 1, 2. Again we assume that U jk ∼ Negative Binomial(µ jk , φ), where φ is the over-dispersion parameter (assumed to be constant for all samples and all groups), and µ jk = E {U jk } = ρ k L 0 jk , where ρ k is the relative abundance in group k and L 0 jk is the library size of biological sample j in group k. Let A jk be the group label of U jk , such that A jk = 0 if k = 1 and A jk = 1 if k = 2. n k denotes the number of biological samples in group k, with n = n 1 + n 2 . We want to test the null hypothesis H 0 : ρ 1 = ρ 2 against the alternative H A : ρ 1 = ρ 2 at the α level of significance. In this section, we will calculate the statistical power of testing this hypothesis based on the method discussed in Zhu and Lakkis (2014).
We can fit the following negative binomial regression with L 0 jk as offset, where β 0 is the intercept and β 1 is the coefficient of the factor A. In this model, the parameter β 1 represents the LFC between the two groups, that is β 1 = log ρ2 ρ1 . This means, β 1 is equivalent to the LFC parameter introduced earlier as θ. Therefore, we can rewrite the hypothesis of DE as H 0 : β 1 = 0 against the alternative H A : β 1 = 0.
Ifβ 1 is the maximum-likelihood estimator of β 1 (under H A ), then the variance ofβ 1 is given by where R = n 2 /n 1 ,L o = n −1 L 0 (the mean library size across all biological samples). It is also easy to show that Var β 1 = Var θ , shown in (8). Under the null hypotsis, whereρ 1 andρ 2 are the true relative abundances under H 0 , such thatρ 1 =ρ 2 = ρ 1 .
Recall that our objective is to determine the power of testing the above hypothesis using the pooled experiment. Therefore, letβ * 1 is the estimate of β 1 using the gene expression data from the pooled samples. β * 1 is the equivalent LFC in the pooled experiment, which was denoted byθ * earlier, i.eβ * 1 =θ * . In (10), we have established that Var θ * ≥ Var θ 2q q+1 . Consequently, under the alternative hypothesis Var β * 1 ≥ 2q q+1 Var β 1 and under the null hypothesis Var β * 1 | H0 ≥ 2q q+1 Var β 1 | H0 . Therefore, given the pool size (q), the number of RNA samples in groups 1 and 2 (n 1 and n 2 , respectively), the effect size to be detected θ, and over-dispersion φ, the power of the two-sided likelihood ratio test at significance level α can be calculated as, where Φ(.) is the cumulative standard normal distribution, and Z α/2 is the (1 − α/2)100% quantile of the standard normal distribution. Note that in pooled experiments, n 1 and n 2 are the number of RNA samples before library prepartion.
In Figure S2 and S3, we present the relationship between the power and the total cost of data generation for different experimental design, including the sample pooling. In particular, we compare three cost-saving strategies and a reference scenario (the full budget experiment). These are • reference: contains a total of n biological samples from two groups (each with n/2 samples) and there is no pooling. The average library size per sample is 20×10 6 . The total cost is C t = C SP ×n+C LP ×n+C S ×L 0 , where C SP , C LP , & C S are sample preparation cost, library preparation cost and sequencing cost per 10 6 , respectively.
• Strategy A: pooling experiment with pool size q. The n/2 RNA samples in each group are pooled to m pools m = n/2q with average library size per pool is 20 × 10 6 . Hence, the total cost is C t = C SP × n + C LP × 2m + C S × L 0 /q. This strategy reduces the library preparation and sequencing costs.
• Strategy B: reducing the number of biological samples (n) without pooling. Instead of the n total number of samples (in the reference design) we use n s samples with n s /2 per group with average library size per samples is 20 × 10 6 . Hence, the total cost is C t = C SP × n s + C LP × n s + C S × (nL 0 /n s ). This strategy reduces the sample preparation, library preparation and sequencing costs.
• Strategy C : reducing the sequencing depth. This is similar to the reference scenario, except that the average library size is reduced to L, where L < 20 × 10 6 . Hence, this strategy reduces only the sequencing cost by a factor l, l = 20 × 10 6 /L.  Figure S2: Zodiac plot representing power (at 5% significance level) versus the total cost of data generation. The gene expression levels are generated from NB(ρL j , φ). This particular plot is for a gene with relative abundace of ρ = 10 −7 (low-abundance gene) in one of the groups. The reference strategy (denoted by a diamond shape) contains n = 120 biological samples (without pooling) with a mean library size of 20 × 10 6 per sample. Each panel represents a different LFC (θ) between the two groups and over-dispersion parameter φ (reflecting biological variability). The relative cost is determined as the total cost of each strategy divided by the total cost of the reference design. relative cost power Figure S3: Zodiac plot representing power (at 5% significance level) versus the total cost of data generation. The gene expression levels are generated from NB(ρL j , φ). This particular plot is for a gene with relative abundace of ρ = 10 −6 (moderate level of expression) in one of the groups. The reference strategy (denoted by a diamond shape) contains n = 120 biological samples (without pooling) with a mean library size of 20 × 10 6 per sample. Each panel represents a different LFC (θ) between the two groups and over-dispersion parameter φ (reflecting biological variability). The relative cost is determined as the total cost of each strategy divided by the total cost of the reference design.  Figure S4: Zodiac plot representing power (at 5% significance level) versus the total cost of data generation. The gene expression levels are generated from NB(ρL j , φ). This particular plot is for a gene with relative abundace of ρ = 10 −5 (high level of expression) in one of the groups. The reference strategy (denoted by a diamond shape) contains n = 120 biological samples (without pooling) with a mean library size of 20 × 10 6 per sample. Each panel represents a different LFC (θ) between the two groups and over-dispersion parameter φ (reflecting biological variability). The relative cost is determined as the total cost of each strategy divided by the total cost of the reference design.   Figure S7: Two-dimensional visualization of neuroblastoma RNA samples (before and after pooling) using principal component analysis. The groups are defined as the MYCN status (group=1 for MYCN amplified and group=0 for MYCN non-amplified samples). In particular, the PCA was applied on the log-CPM transforemated read counts.   Figure S9: Differential gene expression results for pooling scenarios generated using the Zhang RNA-seq dataset. A) The number of DE genes detected at 5% FDR; B) The fraction of overlap (concordance) defined as the fraction of DE genes detected in a test scenario that are also detected in the reference scenario.

MYCN pathway top 200 DE genes
coefficients of variation log2−mean expression  Figure S11: Summary of gene level characteristics for pooling scenarios generated using the NGP nutlin dataset. A) The distribution of log 2 -mean normalized expression of genes, the log 2 -vriance of normalized expression of genes, and the log-fold-change (LFC) between nutlin-3 and control; B) the estimated bias (mean absolute bias) of the three test scenarios relative to the reference scenario A0. Figure S13: Zodiac plot representing power (at 5% significance level) versus the total cost of data generation for NGP cell line data. The gene expression levels are generated from NB(ρL j , φ), where φ is the common overdispersion parameter for the NGP nutlin data (estimated using the edgeR package). The plots are generated for low, medium and high abundance genes with relative abundace of ρ = 10 −7 , ρ = 10 −6 andρ = 10 −5 , respectively. One unpooled design (q = 1) and two pooled designs (q = 2 and q = 3) were compared. These designs have equal number of replicates (3 replicates per group). That is, 3 individual cell lines (q = 1), 3 pools of 2 cell lines (q = 2) and 3 pools of 3 cell lines (q = 3). The mean library size per cell line is 15 × 10 6 . The curves are generated for two different minimum LFCs (θ) between the two groups, θ ≥ 0.5 and θ ≥ 1. The relative cost is determined as the total cost of each strategy divided by the maximum total cost without pooling (9 cell lines per group).