Centering, scaling, and transformations: improving the biological information content of metabolomics data

van den Berg, Robert A; Hoefsloot, Huub CJ; Westerhuis, Johan A; Smilde, Age K; van der Werf, Mariët J

doi:10.1186/1471-2164-7-142

BMC Genomics

Table 1 Overview of the pretreatment methods used in this study. The Unit column states the unit of the data after pretreatment: O represents the original unit, and (-) represents dimensionless data. The mean is estimated as ${\bar{x}}_{i} = \frac{1}{J} \sum_{j = 1}^{J} x_{i j}$ and the standard deviation is estimated as $s_{i} = \sqrt{\frac{\sum_{j = 1}^{J} {(x_{i j} - {\bar{x}}_{i})}^{2}}{J - 1}}$. $\tilde{x}$ and $\hat{x}$ represent the data after different pretreatment steps.

Class	Method	Formula	Unit	Goal	Advantages	Disadvantages
I	Centering	${\tilde{x}}_{i j} = x_{i j} - {\bar{x}}_{i}$	O	Focus on the differences and not the similarities in the data	Remove the offset from the data	When data is heteroscedastic, the effect of this pretreatment method is not always sufficient
II	Autoscaling	${\tilde{x}}_{i j} = \frac{x_{i j} - {\bar{x}}_{i}}{s_{i}}$	(-)	Compare metabolites based on correlations	All metabolites become equally important	Inflation of the measurement errors
	Range scaling	${\tilde{x}}_{i j} = \frac{x_{i j} - {\bar{x}}_{i}}{(x_{i_{\max}} - x_{i_{\min}})}$	(-)	Compare metabolites relative to the biological response range	All metabolites become equally important. Scaling is related to biology	Inflation of the measurement errors and sensitive to outliers
	Pareto scaling	${\tilde{x}}_{i j} = \frac{x_{i j} - {\bar{x}}_{i}}{\sqrt{s_{i}}}$	O	Reduce the relative importance of large values, but keep data structure partially intact	Stays closer to the original measurement than autoscaling	Sensitive to large fold changes
	Vast scaling	${\tilde{x}}_{i j} = \frac{(x_{i j} - {\bar{x}}_{i})}{s_{i}} \cdot \frac{{\bar{x}}_{i}}{s_{i}}$	(-)	Focus on the metabolites that show small fluctuations	Aims for robustness, can use prior group knowledge	Not suited for large induced variation without group structure
	Level scaling	${\tilde{x}}_{i j} = \frac{x_{i j} - {\bar{x}}_{i}}{{\bar{x}}_{i}}$	(-)	Focus on relative response	Suited for identification of e.g. biomarkers	Inflation of the measurement errors
III	Log transformation	$\begin{array}{l} {\tilde{x}}_{i j} = \log_{10} (x_{i j}) \\ {\hat{x}}_{i j} = {\tilde{x}}_{i j} - {\bar{\tilde{x}}}_{i} \end{array}$	Log O	Correct for heteroscedasticity, pseudo scaling. Make multiplicative models additive	Reduce heteroscedasticity, multiplicative effects become additive	Difficulties with values with large relative standard deviation and zeros
	Power transformation	$\begin{array}{l} {\tilde{x}}_{i j} = \sqrt{x_{i j}} \\ {\hat{x}}_{i j} = {\tilde{x}}_{i j} - {\bar{\tilde{x}}}_{i} \end{array}$	√O	Correct for heteroscedasticity, pseudo scaling	Reduce heteroscedasticity, no problems with small values	Choice of the square root is arbitrary
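
The formulas in Table 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the data are held in a 2-D array with one row per variable i and one column per measurement j (j = 1..J), and the function name `pretreat` and its method labels are hypothetical names chosen for this example.

```python
import numpy as np


def pretreat(x, method):
    """Apply one pretreatment method from Table 1 to a 2-D array.

    Assumed layout (not stated in the table): rows are indexed by i,
    columns by j. The name `pretreat` and the method labels are
    hypothetical. No guarding is done: rows with zero standard
    deviation, zero mean, or zero range give inf/nan, and the log
    transformation needs strictly positive values.
    """
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=1, keepdims=True)
    # ddof=1 gives the J - 1 denominator used in the table caption.
    sd = x.std(axis=1, ddof=1, keepdims=True)
    centered = x - mean

    if method == "centering":
        return centered
    if method == "autoscaling":
        return centered / sd
    if method == "range":
        span = x.max(axis=1, keepdims=True) - x.min(axis=1, keepdims=True)
        return centered / span
    if method == "pareto":
        return centered / np.sqrt(sd)
    if method == "vast":
        return (centered / sd) * (mean / sd)
    if method == "level":
        return centered / mean
    if method == "log":
        # Transform first, then center the transformed values.
        t = np.log10(x)
        return t - t.mean(axis=1, keepdims=True)
    if method == "power":
        # Square-root transform, then center the transformed values.
        t = np.sqrt(x)
        return t - t.mean(axis=1, keepdims=True)
    raise ValueError(f"unknown method: {method!r}")


# Illustrative data only: two variables, three measurements each.
x = np.array([[1.0, 2.0, 3.0],
              [10.0, 100.0, 1000.0]])

for name in ("centering", "autoscaling", "range", "pareto",
             "vast", "level", "log", "power"):
    print(name, pretreat(x, name).round(3).tolist())
```

Each method is computed from the same row-wise mean and standard deviation, so the scaling methods in class II differ only in the divisor applied to the centered data. The two class III methods center after transforming, which matches the two-step formulas in the table.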

ISSN: 1471-2164