Table 1. 13 selected R tools for omics data integration

From: Genomic data integration tutorial, a plant case study

Tool's name

Scientific question

Method and tool's characteristics: Supervised / Unsupervised, Method families, Summary, Updated

Data’s characteristics: Omics, Hypothesis

BCC (Bayesian Consensus Clustering)

I) Description of samples' interactions

Unsupervised

Statistics

Clusters the samples of each omics dataset using a probabilistic model, then merges the clusters into a consensus clustering across omics datasets

No

Multi-omics (quantitative)

Normal distribution

Different omics on the same set of samples

iCluster (iClusterPlus / iClusterBayes)

I) Description of samples' interactions

Unsupervised

Statistics / Dimension reduction

Starts with a latent-variable regression across datasets using a probabilistic model, then uses the joint latent variables to cluster samples

Yes

Multi-omics (quantitative and qualitative)

Linearity assumption

Normal noise distribution

Different omics on the same set of samples

JIVE (Joint and Individual Variation Explained)

I) Description of samples/variables' interactions

Unsupervised

Dimension reduction

Decomposes each dataset into three terms: a joint effect (across datasets), an individual effect (specific to the dataset) and a noise effect

No

Multi-omics (quantitative)

Linearity assumption

LRAcluster (Low-Rank Approximation Cluster)

I) Description of samples' interactions

Unsupervised

Statistics / Dimension reduction

Probabilistically computes a common low-dimensional subspace across omics, then uses the K-means algorithm to cluster samples on this subspace

Yes

Multi-omics (quantitative and qualitative)

Linearity assumption

Different omics on the same set of samples

MCIA (Multiple co-inertia analysis) (MCOA)

I) Description of samples/variables' interactions

Unsupervised

Dimension reduction

Projects each dataset onto a subspace, then maximizes the co-inertia between subspaces to extract the main information shared by the datasets

Yes

Multi-omics (quantitative)

Linearity assumption

Different omics on the same set of samples

mixKernel

I) Description of samples/variables' interactions

II) Variables selection

III) Phenotype prediction

Supervised

Unsupervised

Dimension reduction

Transforms datasets with kernels, then applies usual dimension reduction methods

Yes

Multi-omics (quantitative and qualitative)

Datasets with the same rows or columns

mixOmics (with PCA, PLS, rCCA, Diablo…)

I) Description of samples/variables' interactions

II) Variables selection

III) Phenotype prediction

Supervised

Unsupervised

Dimension reduction

Contains many matrix factorization methods for multivariate analysis, plus functions for data visualization. For a single dataset, the main method is PCA. For two or more datasets, the main methods are PLS and rCCA, with extensions for discriminant analysis, variable selection ('sparse') and multi-block analysis

Yes

Multi-omics (quantitative and qualitative)

Linearity assumption

Datasets with the same rows or columns

moCluster (from MOGSA)

I) Description of samples' interactions

Unsupervised

Statistics / Dimension reduction

Computes latent variables using an extension of PCA, clusters them, and finally selects the best subtype model

Yes

Multi-omics (quantitative)

Linearity assumption

Different omics on the same set of samples

MOFA (Multi-Omics Factor Analysis) (MOFA2)

I) Description of samples' interactions

III) Phenotype prediction

Unsupervised

Statistics / Dimension reduction

Factorizes datasets with a Bayesian approach to get a small number of latent factors usable for different purposes

Yes

Multi-omics (quantitative and qualitative)

Linearity assumption

NEMO (NEighborhood based Multi-Omics clustering)

I) Description of samples' interactions

Unsupervised

Similarity-based

Creates one similarity matrix per dataset, then merges them and finally clusters the merged matrix by spectral clustering

No

Multi-omics (quantitative)

Euclidean distance metric

PINS (Perturbation clustering for data INtegration and disease Subtyping) (PINSPlus)

I) Description of samples' interactions

Unsupervised

Similarity-based / Network

Runs several clusterings to identify how often samples are clustered together. The clusterings are run on the different datasets, on data perturbed by adding Gaussian noise, and with different clustering methods

Yes

Multi-omics (quantitative)

Different omics on the same set of samples

RGCCA (Regularized Generalized Canonical Correlation Analysis) (sGCCA)

I) Description of samples/variables' interactions

II) Variables selection

III) Phenotype prediction

Supervised

Unsupervised

Dimension reduction

Computes latent variables for each dataset by maximizing correlations within and/or between datasets

Yes

Multi-omics (quantitative and qualitative)

Linearity assumption

Different omics on the same set of samples

SNF (Similarity Network Fusion)

I) Description of samples' interactions

Unsupervised

Similarity-based / Network

Creates a similarity matrix then an associated network for each dataset, then iteratively fuses the networks to keep only strong correlations between samples across omics

No

Multi-omics (quantitative and qualitative)

Different omics on the same set of samples

Euclidean distance metric

Rows correspond to the 13 selected tools and columns to the main characteristics to consider when selecting a tool for omics data integration. ‘Scientific question’ indicates which of the three main biological questions presented in the article the tool aims to answer. ‘Method and tool's characteristics’ indicates whether the tool can be used for supervised or unsupervised analysis, which method family it belongs to (statistical, dimension reduction, networks, similarity-based, artificial neural networks), summarizes how it works, and indicates whether the tool is still updated; the source code repositories are listed in Supplementary Table 1. ‘Data’s characteristics’ describes the types of omics data the tool can handle. The last column presents the main hypotheses about the data, namely that they (1) follow a normal distribution (which can be tested using, for instance, Shapiro tests or QQ-plots), (2) share identical rows or columns (e.g. omics data produced on the same genes or individuals), (3) have linear interactions (i.e. more complex interactions, such as polynomial ones, are not considered) or (4) are considered similar according to the Euclidean distance only (i.e. other similarity metrics, such as correlations, are not considered)
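As a rough illustration of hypotheses (1) and (4) in the note above, the following Python sketch uses only the standard library. It is not taken from the article or from any of the listed R tools: the sample values and the helper names (`pearson`, `euclidean`, `qq_correlation`) are made up for illustration. It shows that a QQ-plot-style check separates a normal-like variable from a skewed one, and that the Euclidean distance and the correlation can disagree about which samples are similar. In R, the usual counterparts would be `shapiro.test()`, `qqnorm()`, `dist()` and `cor()`.

```python
import math
from statistics import NormalDist, mean


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def euclidean(x, y):
    """Euclidean distance between two equally long numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def qq_correlation(values):
    """Correlation between the sorted values and standard normal quantiles.

    These pairs are the points of a QQ-plot: a correlation close to 1 means
    the points fall near a straight line, as expected for normal data.
    """
    n = len(values)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(sorted(values), theoretical)


# Hypothesis (1): normal-like versus skewed values (hypothetical data).
z = [NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
normal_like = [3.0 + 2.0 * v for v in z]   # shifted and scaled normal quantiles
skewed = [math.exp(v) for v in z]          # log-normal shape, right-skewed

qq_normal = qq_correlation(normal_like)
qq_skewed = qq_correlation(skewed)
print("QQ correlation, normal-like:", round(qq_normal, 3))
print("QQ correlation, skewed:     ", round(qq_skewed, 3))

# Hypothesis (4): Euclidean distance versus correlation (hypothetical profiles).
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [10.0, 20.0, 30.0, 40.0, 50.0]  # same trend as a, ten times larger
sample_c = [5.0, 4.0, 3.0, 2.0, 1.0]       # opposite trend, same magnitude as a

d_ab, d_ac = euclidean(sample_a, sample_b), euclidean(sample_a, sample_c)
r_ab, r_ac = pearson(sample_a, sample_b), pearson(sample_a, sample_c)

# By Euclidean distance, a is closer to c; by correlation, a matches b.
print("Euclidean a-b:", round(d_ab, 2), " a-c:", round(d_ac, 2))
print("Pearson   a-b:", round(r_ab, 2), " a-c:", round(r_ac, 2))
```

With these made-up profiles, a tool relying on the Euclidean distance would group sample a with sample c, whereas a correlation-based similarity would group it with sample b. This is why the last column of the table is worth checking against how the data were scaled or normalized.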