Genomic data integration tutorial, a plant case study

Mardoc, Emile; Sow, Mamadou Dia; Déjean, Sébastien; Salse, Jérôme

doi:10.1186/s12864-023-09833-0

BMC Genomics

Table 1. Thirteen selected R tools for omics data integration

		Method and tool's characteristics				Data's characteristics	
Tool's name	Scientific question	Supervised / Unsupervised	Method family	Summary	Updated	Omics	Hypothesis
BCC (Bayesian Consensus Clustering)	I) Description of samples' interactions	Unsupervised	Statistics	Computes a clustering of samples for each omics dataset by using a probabilistic model, then merges clusters to get a consensus cluster across omics datasets	No	Multi-omics (quantitative)	Normal distribution; Different omics on the same set of samples
iCluster (iClusterPlus / iClusterBayes)	I) Description of samples' interactions	Unsupervised	Statistics / Dimension reduction	Starts with a latent variable regression across datasets by using a probabilistic model, then uses these joint latent variables to cluster samples	Yes	Multi-omics (quantitative and qualitative)	Linearity assumption; Normal noise distribution; Different omics on the same set of samples
JIVE (Joint and Individual Variation Explained)	I) Description of samples/variables' interactions	Unsupervised	Dimension reduction	Decomposes each dataset into three terms: a joint effect (across datasets), an individual effect (specific to the dataset) and a noise effect	No	Multi-omics (quantitative)	Linearity assumption
LRAcluster (Low-Rank Approximation Cluster)	I) Description of samples' interactions	Unsupervised	Statistics / Dimension reduction	Probabilistically computes a common low-dimensional subspace across omics, then uses the K-means algorithm to cluster samples on this subspace	Yes	Multi-omics (quantitative and qualitative)	Linearity assumption; Different omics on the same set of samples
MCIA (Multiple co-inertia analysis) (MCOA)	I) Description of samples/variables' interactions	Unsupervised	Dimension reduction	Projects each dataset on a subspace, then maximizes co-inertia between subspaces to extract the main information shared by datasets	Yes	Multi-omics (quantitative)	Linearity assumption; Different omics on the same set of samples
mixKernel	I) Description of samples/variables' interactions; II) Variables selection; III) Phenotype prediction	Supervised / Unsupervised	Dimension reduction	Transforms datasets with kernels, then applies usual dimension reduction methods	Yes	Multi-omics (quantitative and qualitative)	Datasets with the same rows or columns
mixOmics (with PCA, PLS, rCCA, Diablo…)	I) Description of samples/variables' interactions; II) Variables selection; III) Phenotype prediction	Supervised / Unsupervised	Dimension reduction	Contains many matrix factorization methods for multivariate analysis and functions for data visualization. The main analysis method for a single dataset is PCA. For two or more datasets, the main methods are PLS and rCCA, and their extensions for discriminant analysis, variable selection ('sparse') and multi-block analysis	Yes	Multi-omics (quantitative and qualitative)	Linearity assumption; Datasets with the same rows or columns
moCluster (from MOGSA)	I) Description of samples' interactions	Unsupervised	Statistics / Dimension reduction	Computes latent variables by using an extension of PCA, then clusters them and finally selects the best subtype model	Yes	Multi-omics (quantitative)	Linearity assumption; Different omics on the same set of samples
MOFA (Multi-Omics Factor Analysis) (MOFA2)	I) Description of samples' interactions; III) Phenotype prediction	Unsupervised	Statistics / Dimension reduction	Factorizes datasets with a Bayesian approach to get a small number of latent factors usable for different purposes	Yes	Multi-omics (quantitative and qualitative)	Linearity assumption
NEMO (NEighborhood based Multi-Omics clustering)	I) Description of samples' interactions	Unsupervised	Similarity-based	Creates one similarity matrix per dataset, then merges them and finally clusters the merged matrix by spectral clustering	No	Multi-omics (quantitative)	Euclidean distance metric
PINS (Perturbation clustering for data INtegration and disease Subtyping) (PINSPlus)	I) Description of samples' interactions	Unsupervised	Similarity-based / Network	Performs several clusterings to identify how often samples are clustered together. Clusterings are run on different datasets, with data perturbed by adding Gaussian noise, and with different clustering methods	Yes	Multi-omics (quantitative)	Different omics on the same set of samples
RGCCA (Regularized Generalized Canonical Correlation Analysis) (sGCCA)	I) Description of samples/variables' interactions; II) Variables selection; III) Phenotype prediction	Supervised / Unsupervised	Dimension reduction	Computes latent variables for each dataset by maximizing correlations within and/or between datasets	Yes	Multi-omics (quantitative and qualitative)	Linearity assumption; Different omics on the same set of samples
SNF (Similarity Network Fusion)	I) Description of samples' interactions	Unsupervised	Similarity-based / Network	Creates a similarity matrix then an associated network for each dataset, then iteratively fuses the networks to keep only strong correlations between samples across omics	No	Multi-omics (quantitative and qualitative)	Different omics on the same set of samples; Euclidean distance metric

Rows correspond to the 13 selected tools and columns to the main characteristics to consider when selecting a tool for omics data integration. ‘Scientific question’ indicates which of the three main biological questions presented in the article the tool aims to answer. ‘Method and tool’s characteristics’ specifies whether the tool can be used for supervised or unsupervised analysis, which method family it belongs to (statistics, dimension reduction, networks, similarity-based, artificial neural networks), summarizes how it works, and indicates whether the tool is still updated; source code repositories are listed in Supplementary Table 1. ‘Data’s characteristics’ describes the types of omics data the tool can handle. The last column presents the main hypotheses on the data, namely that they (1) follow a normal distribution (which can be tested using, for instance, Shapiro tests or QQ-plots), (2) share identical rows or columns (e.g. omics data produced on the same genes or individuals), (3) have linear interactions (i.e. more complex interactions such as polynomial interactions are not considered) or (4) are compared using only the Euclidean distance (i.e. other similarity metrics such as correlations are not considered).
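
Hypotheses (2) and (4) can be illustrated with a minimal sketch. It is written in Python with the standard library only, as an illustration and not as code from any of the 13 tools; the dataset names, sample IDs, values and helper functions (`shared_samples`, `euclidean`, `pearson`) are all hypothetical. It first finds the samples common to two omics datasets, then shows that two profiles with the same shape but a different scale are far apart by Euclidean distance while being perfectly correlated.

```python
import math


def shared_samples(datasets):
    """Return the sorted sample IDs present in every dataset (hypothesis 2)."""
    id_sets = [set(samples) for samples in datasets.values()]
    return sorted(set.intersection(*id_sets))


def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation between two profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical toy data: sample ID -> measured values.
expression = {
    "s1": [1.0, 2.0, 3.0],
    "s2": [10.0, 20.0, 30.0],  # same shape as s1, ten times larger
    "s3": [3.0, 2.0, 1.0],     # close to s1 in scale, opposite trend
}
methylation = {
    "s1": [0.2, 0.4],
    "s2": [0.1, 0.9],
    "s4": [0.5, 0.5],
}

# Hypothesis (2): keep only samples measured in every omics dataset.
common = shared_samples({"expression": expression, "methylation": methylation})

# Hypothesis (4): Euclidean distance and correlation can disagree.
d_12 = euclidean(expression["s1"], expression["s2"])
d_13 = euclidean(expression["s1"], expression["s3"])
r_12 = pearson(expression["s1"], expression["s2"])
r_13 = pearson(expression["s1"], expression["s3"])

print("common samples:", common)
print("Euclidean s1-s2 vs s1-s3:", d_12, d_13)
print("Pearson   s1-s2 vs s1-s3:", r_12, r_13)
```

In this toy case, s1 is closer to s3 than to s2 by Euclidean distance, whereas correlation groups s1 with s2. A tool that relies only on the Euclidean metric would therefore treat the scaled profiles as dissimilar unless the data are normalized beforehand.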

ISSN: 1471-2164
