
Table 1 Scaling normalization approaches derive their technical bias estimates from ratios of proportions

From: Analysis and correction of compositional bias in sparse sequencing count data

| Technique | Proposed abundance measure, scale factor | Signal for compositional scale in |
| --- | --- | --- |
| Total sum | \( \begin{array}{c} \frac{y_{gji}}{\tau_{gj} \cdot \Lambda_{gj}^{-1}}, \\ \Lambda_{gj}^{-1}=1 \end{array} \) | |
| TMM | \( \begin{array}{c} \frac{y_{gji}}{\tau_{gj} \cdot \Lambda_{gj}^{-1}}, \\ \Lambda_{gj}^{-1} = e^{\left[ \sum_{i: y_{ij}>0 ~\cap ~i\in \mathbf{trimmed\ set\ for\ j}} w_{ij} \log \left(\frac{q_{gji}}{q_{1ji}}\right) \right]} \end{array} \) | \(\frac{q_{gji}}{q_{1ji}}\), ratio of proportions |
| DESeq | \( \begin{array}{c} \frac{y_{gji}}{C \cdot \tau_{gj} \cdot \Lambda_{gj}^{-1}} \propto \frac{y_{gji}}{\tau_{gj} \cdot \Lambda_{gj}^{-1}}, \\ \Lambda_{gj}^{-1} = \text{median}_{i} ~ \frac{q_{gji}}{\left[ \prod_{k} q_{ik} \right]^{\frac{1}{n}}} \end{array} \) | \(\frac{q_{gji}}{\left[ \prod_{k} q_{ik} \right]^{\frac{1}{n}}}\), ratio of proportions |
| Median | \( \begin{array}{c} \frac{y_{gji}}{\tau_{gj} \cdot \Lambda_{gj}^{-1}}, \\ \Lambda_{gj}^{-1}=\text{median}_{i} ~q_{gji} \propto \text{median}_{i} ~\frac{q_{gji}}{1/p} \end{array} \) | \(\frac{q_{gji}}{1/p}\), ratio of proportions |
| Upper quartile | \( \begin{array}{c} \frac{y_{gji}}{\tau_{gj} \cdot \Lambda_{gj}^{-1}}, \\ \Lambda_{gj}^{-1}=\text{upper quartile}_{i} ~q_{gji} \propto \text{upper quartile}_{i} ~\frac{q_{gji}}{1/p} \end{array} \) | \(\frac{q_{gji}}{1/p}\), ratio of proportions |
| CLR Transformation | \( \begin{array}{c} \log \left(\frac{y_{gji}}{\left[\prod_{i} y_{gji}\right]^{\frac{1}{p}}} \right) \equiv \log \left(\frac{q_{gji}}{\left[\prod_{i} q_{gji}\right]^{\frac{1}{p}}} \right) \equiv \log \left(\frac{y_{gji}}{\tau_{gj} \cdot \Lambda_{gj}^{-1}} \right), \\ \text{with} ~\Lambda_{gj}^{-1}=\left[\prod_{i} q_{gji}\right]^{\frac{1}{p}} \propto \left[\prod_{i} \frac{q_{gji}}{1/p}\right]^{\frac{1}{p}} \end{array} \) | \(\frac{q_{gji}}{1/p}\), closely tracks Median factors above; ratio of proportions |
| Scran | \( \begin{array}{c} \frac{y_{gji}}{\tau_{gj} \cdot \Lambda_{gj}^{-1}}, \\ \Lambda_{gj}^{-1}= \text{fit linear models to} ~\left\{ \frac{q_{1ji}}{\overline{q_{++i}}}, \dots, \frac{q_{nji}}{\overline{q_{++i}}} \right\}_{i=1}^{p} \end{array} \) | \(\frac{q_{gji}}{\overline{q_{++i}}}\), ratio of proportions |
| Wrench | \( \begin{array}{c} \frac{y_{gji}}{\tau_{gj} \cdot \Lambda_{gj}^{-1}}, \\ \Lambda_{gj}^{-1} = \frac{1}{p}\sum_{i} w_{ij} \frac{q_{gji}}{\overline{q_{++i}}} \end{array} \) | \(\frac{q_{gji}}{\overline{q_{++i}}}\), ratio of proportions |
  1. For each scaling normalization technique (rows of the table, named in the first column), the second column presents the transformation it applies to the raw count data to produce normalized counts. The third column shows how each technique uses statistics based on ratios of proportions to derive its scale factor. In the table, \(i=1 \dots p\) indexes features (genes/taxonomic units), and each sample is considered to arise from its own singleton group, so \(g=1 \dots n\) and \(j=1\); \(\tau_{gj}\) is the sample depth of sample j, \(q_{gji}\) is the proportion of feature i in sample j, \(w_{ij}\) is a weight specific to each technique, and \(\overline{q_{++i}}\) is the average proportion of feature i across the dataset. In the second column, the first row in each cell gives the transformation applied to the raw count data by the respective normalization approach. All of them adjust a sample’s counts by the sample depth (\(\tau_{gj}\)) and a compositional scale factor \(\Lambda_{gj}^{-1}\). As noted in the third column, the estimate of \(\Lambda_{gj}^{-1}\) is based on the ratio of sample-wise relative abundances/proportions (\(q_{gji}\)) to a reference, and each reference is some robust measure of central tendency in the count data. The logarithmic transform accompanying CLR does not affect its relevance here: the log-transformation often makes it possible to apply statistical tests based on normal distributions to the rescaled data, which is in line with applying log-normal assumptions to the rescaled data obtained with the rest of the techniques. \(C=\left[ \prod_{j} \tau_{gj} \right]^{-1/n}\) is a constant factor independent of sample, so its presence does not matter. For the same reason, Median and Upper Quartile scalings and CLR transforms can be thought of as basing their estimates on a reference that assigns equal mass (\(1/p\)) to all the features or, if the reader wishes, on a more complicated reference that behaves proportionally.
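To make the shared structure of these transformations concrete, the following is a minimal Python sketch of three of the scale factors in the table (Median, CLR, and a DESeq-style median of ratios), each computed from per-sample proportions. The function names, the toy count vectors, and the use of the standard library only are illustrative assumptions, not the implementations shipped in the corresponding packages; the trimming and weighting of TMM and the model fitting of Scran and Wrench are deliberately left out.

```python
import math
import statistics


def proportions(counts):
    """q_i = y_i / tau for one sample, where tau is the sample depth."""
    tau = sum(counts)
    return [y / tau for y in counts]


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def median_factor(counts):
    """Median scaling: the scale factor is the median proportion in the sample."""
    return statistics.median(proportions(counts))


def clr_factor(counts):
    """CLR: the scale factor is the geometric mean of the sample's proportions."""
    return geometric_mean(proportions(counts))


def deseq_style_factors(samples):
    """Median over features of q_i divided by a per-feature reference.

    The reference here is the geometric mean of feature i's proportion
    across samples (an illustrative reading of the DESeq row).
    """
    props = [proportions(s) for s in samples]
    p = len(props[0])
    reference = [geometric_mean([q[i] for q in props]) for i in range(p)]
    return [statistics.median([q[i] / reference[i] for i in range(p)]) for q in props]


def normalize(counts, factor):
    """Apply y_i / (tau * factor), the common form in the second column."""
    tau = sum(counts)
    return [y / (tau * factor) for y in counts]


def clr_transform(counts):
    """log(q_i / geometric mean of q), i.e. the log of the CLR-scaled counts."""
    return [math.log(v) for v in normalize(counts, clr_factor(counts))]


# Toy data (hypothetical): the second sample is the first sequenced twice as deep.
samples = [[10, 20, 30, 40], [20, 40, 60, 80]]
for s in samples:
    print(median_factor(s), clr_factor(s))
print(deseq_style_factors(samples))
```

Because the two toy samples have identical proportions, every factor above agrees between them; only the sample depth differs, and that is absorbed by \(\tau_{gj}\) in the denominator.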
When most features are zero, values arising from classical scale factors can be severely biased or undefined, as we shall illustrate in the rest of the paper.
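The "undefined" case can be seen directly with a small sketch. The sparse count vector below is made up for illustration, and the helper names are hypothetical; the point is only that a median of proportions becomes zero, and a geometric mean of proportions has no finite logarithm, once most features are zero.

```python
import math
import statistics


def median_scale_factor(counts):
    """Median of the sample's proportions (Median row of the table)."""
    tau = sum(counts)
    return statistics.median([y / tau for y in counts])


def geometric_mean_scale_factor(counts):
    """Geometric mean of the sample's proportions (CLR row of the table).

    Returns None when any proportion is zero, because log(0) is undefined.
    """
    tau = sum(counts)
    props = [y / tau for y in counts]
    if any(q == 0 for q in props):
        return None
    return math.exp(sum(math.log(q) for q in props) / len(props))


def scaled_counts(counts, factor):
    """y_i / (tau * factor), or None when the factor is zero or undefined."""
    if factor is None or factor == 0:
        return None
    tau = sum(counts)
    return [y / (tau * factor) for y in counts]


# Hypothetical sparse sample: six of eight features are zero.
sparse_sample = [0, 0, 0, 7, 0, 3, 0, 0]

median_f = median_scale_factor(sparse_sample)           # median proportion is zero
geomean_f = geometric_mean_scale_factor(sparse_sample)  # undefined, returned as None

print(median_f, scaled_counts(sparse_sample, median_f))
print(geomean_f, scaled_counts(sparse_sample, geomean_f))
```

Returning `None` here is a design choice made for the sketch, so that the failure is visible rather than raised as a division or logarithm error.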