 Research article
 Open access
MOSCATO: a supervised approach for analyzing multi-Omic single-Cell data
BMC Genomics volume 23, Article number: 557 (2022)
Abstract
Background
Advancements in genomic sequencing continually improve personalized medicine, and recent breakthroughs generate multimodal data on a cellular level. We introduce MOSCATO, a technique for selecting features across multimodal single-cell datasets that relate to clinical outcomes. We summarize the single-cell data using tensors and perform regularized tensor regression to return clinically associated variable sets for each ‘omic’ type.
Results
Robustness was assessed through simulations based on available single-cell simulation methods, and applicability was assessed through an example using CITE-seq data to detect genes associated with leukemia. We find that MOSCATO performs favorably in selecting network features while also being applicable to real multimodal single-cell data.
Conclusions
MOSCATO is a useful analytical technique for supervised feature selection in multimodal single-cell data. The flexibility of our approach enables future extensions on distributional assumptions and covariate adjustments.
Background
Classic bulk genetic sequencing averages signature levels across all cells in a sample. Different sequencers may sequence different types of molecules such as ribonucleic acid (RNA), proteins, DNA methyl groups, etc. Disease progression, therapy success, and other clinical outcomes often vary among individuals suffering from complex diseases [3, 7, 17, 35], and the heterogeneity in their outcomes may be better understood through the intricacies of a patient’s molecular signatures [1, 5, 24, 26]. This has led to an explosive demand for multi-omics, which integrates multiple types of molecular information to take a more systems-biology approach. For example, in breast cancer patients with resistance to lapatinib therapy, Komurov et al. suggested additional therapy targets by identifying combinations of RNA and proteins involved in the glucose deprivation response associated with that resistance [18].
Methods for identifying graphs and gene regulatory networks within a single molecular type have been well studied [14, 20]; however, different methods should be considered when integrating multiple types of molecular information in order to accommodate both the between- and within-molecular-type relationships [6]. Each molecular type often contains thousands of features, and integrating them creates a higher-dimensional problem with more sophisticated relationships both within and between molecular types. For example, the Decomposition of Network Summary Matrix via Instability (DNSMI) method decomposes a matrix of network strengths by fitting a series of models for the expected relationships across molecular types and with the disease outcome [38]. Supervised sparse canonical correlation analysis (SCCA) attempts to optimize the correlation matrix between molecular types through lasso-constrained linear combinations of the features and also eliminates features weakly correlated with the outcome [36].
In bulk sequencing experiments, rare cells or smaller cell types are diluted by the averaging across all cells within the sample. This motivated single-cell sequencing techniques, where molecular information can be sequenced on a cell-by-cell basis. While initial protocols were limited to RNA [23, 29], newer technology may now sequence multiple types of molecular information within each cell, denoted as multimodal (or multi-omic) single-cell sequencing. For example, CITE-seq simultaneously sequences both cell surface proteins and RNA on each cell of a sample [32]. Although still a growing technology, applications have already been considered using this novel sequencing approach. For example, Kendal et al. utilized CITE-seq technology to compare tendons in healthy individuals to those with tendinopathy [15]. Figure 1 displays an example of single-cell data from each patient.
Many bulk sequencing problems utilize matrix decomposition techniques for various analytical goals, where the dimensions correspond to different feature sets (e.g., RNA or proteins). However, single-cell sequencing essentially creates an added dimension for cells that did not previously exist in bulk sequencing. Tensors provide a general framework for organizing high-dimensional data, and a matrix is a special case of a two-dimensional tensor. Therefore, leveraging tensors for single-cell sequencing is an appealing approach to accommodate the added cellular dimension. For example, scLRTC and scLRTD utilize tensor decompositions to impute dropouts or missing data found in single-cell sequencing data [25, 28]. scTenifoldNet also utilizes tensor decomposition to identify unsupervised gene regulatory networks in single-cell RNA-seq data [27].
This manuscript proposes a novel method, Multi-Omic Single-Cell Analysis using TensOr regression (MOSCATO), for identifying the superstructure of a semi-directed graph, or network, within multimodal single-cell data that relates to a disease or phenotypic outcome. Current single-cell methods are often limited to one subject/experiment, limited to just RNA, unsupervised with no clinical outcome associations, or intended for cell clustering rather than feature selection. MOSCATO uses regularized tensor regression to address each of these common limitations. “Preliminaries” section describes preliminary tensor concepts, “Results” section presents the MOSCATO results from a series of simulations and a real data application, “Discussion” and “Conclusions” sections discuss future work and limitations, and “Methods” section describes the mathematical details behind MOSCATO. MOSCATO is applicable when two ‘omic’ types of single-cell data are present (e.g., RNA and proteins) with a univariate outcome of interest.
Preliminaries
MOSCATO utilizes regularized tensor regression, and this section describes existing and relevant tensor concepts. “Tensor definitions” section defines tensors and basic tensor operations, and “Tensor regression” section uses the operations and definitions from “Tensor definitions” section to describe tensor regression and regularization techniques.
Tensor definitions
High-dimensional data may be organized into a tensor, and a matrix may be thought of as a 2-dimensional tensor. Utilizing familiar tensor notation as provided by Kolda and Bader [16], we let \(\boldsymbol {\mathcal {Z}} \in \mathbb {R}^{p_{1} \times p_{2} \times \dots \times p_{D}}\) denote a D-dimensional tensor where dimension d contains p_{d} variables for d=1,...,D. For example, D=1 denotes a vector and D=2 denotes a matrix. Many mathematical operations for tensors build on mathematical operations used in matrices. For example, Definition 1 describes outer products between D vectors to create a D-dimensional tensor, where ∘ denotes the vector outer product.
Definition 1
Let b_{1},b_{2},...,b_{D} denote vectors where \(\boldsymbol {b_{d}}\in \mathbb {R}^{p_{d}}\). Then the outer product of those vectors, b_{1}∘b_{2}∘⋯∘b_{D}, creates a D-dimensional tensor of size p_{1}×p_{2}×⋯×p_{D}, and each (i_{1},...,i_{D})^{th} element equals \(\prod _{d=1}^{D} b_{d_{i_{d}}}\).
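For readers who wish to experiment with these operations, the outer product of Definition 1 can be sketched in a few lines of NumPy; the vector lengths below are illustrative choices, not values from this manuscript.

```python
import numpy as np

# Sketch of Definition 1: the outer product of D vectors forms a
# D-dimensional tensor (here D = 3 with p1 = 2, p2 = 3, p3 = 2).
b1 = np.array([1.0, 2.0])
b2 = np.array([3.0, 4.0, 5.0])
b3 = np.array([6.0, 7.0])

# np.multiply.outer chains pairwise outer products into a p1 x p2 x p3 tensor.
Z = np.multiply.outer(np.multiply.outer(b1, b2), b3)
print(Z.shape)  # (2, 3, 2)

# Each (i1, i2, i3) element is the product b1[i1] * b2[i2] * b3[i3].
assert Z[1, 2, 0] == b1[1] * b2[2] * b3[0]  # 2 * 5 * 6 = 60
```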
It may also be convenient to reorganize a tensor into a lower-dimensional space by vectorizing it or matricizing it along a mode. Definitions 2 and 3 describe these reorganization techniques.
Definition 2
Let \(\boldsymbol {\mathcal {Z}} \in \mathbb {R}^{p_{1} \times p_{2} \times \dots \times p_{D}}\) denote a D-dimensional tensor. Then \(\boldsymbol {\mathcal {Z}}\) may be reorganized into a column vector through the vec operator \(vec(\boldsymbol {\mathcal {Z}}) \in \mathbb {R}^{\prod _{d=1}^{D} p_{d}}\), where the \(j=1+\sum _{d=1}^{D}(i_{d} - 1)\prod _{d^{\prime }=1}^{d-1}p_{d^{\prime }}\) element of \(vec(\boldsymbol {\mathcal {Z}})\) corresponds to the (i_{1},...,i_{D})^{th} value in \(\boldsymbol {\mathcal {Z}}\).
Definition 3
Let \(\boldsymbol {\mathcal {Z}} \in \mathbb {R}^{p_{1} \times p_{2} \times \dots \times p_{D}}\) denote a D-dimensional tensor. Then \(\boldsymbol {\mathcal {Z}}\) may be reorganized into a matrix through the mode-d matricization operator \(\boldsymbol {\mathcal {Z}}_{(d)} \in \mathbb {R}^{p_{d} \times \prod _{d^{\prime } \ne d} p_{d^{\prime }}}\), where the (i_{d},j)^{th} element within \(\boldsymbol {\mathcal {Z}}_{(d)}\) equals the (i_{1},...,i_{D})^{th} value within \(\boldsymbol {\mathcal {Z}}\) and \(j=1 + \sum _{d^{\prime } \ne d}(i_{d^{\prime }} - 1)\prod _{d^{\prime \prime }< d^{\prime }, d^{\prime \prime }\ne d} p_{d^{\prime \prime }}\).
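As a sketch of Definitions 2 and 3, note that both index maps are column-major (the first index varies fastest), so NumPy's Fortran-order reshapes reproduce them directly; the tensor below is an arbitrary example of our own choosing.

```python
import numpy as np

# Arbitrary 2 x 3 x 4 example tensor.
p = (2, 3, 4)
Z = np.arange(np.prod(p), dtype=float).reshape(p)

# Definition 2: vec(Z); in 0-based terms, element (i1, i2, i3) lands at
# index i1 + i2*p1 + i3*p1*p2.
vec_Z = Z.flatten(order='F')

# Definition 3: mode-d matricization puts dimension d on the rows.
def matricize(Z, d):
    return np.reshape(np.moveaxis(Z, d, 0), (Z.shape[d], -1), order='F')

Z1 = matricize(Z, 0)   # 2 x 12
Z2 = matricize(Z, 1)   # 3 x 8

# Check one entry against both index formulas for (i1, i2, i3) = (1, 2, 3).
i1, i2, i3 = 1, 2, 3
assert vec_Z[i1 + i2 * p[0] + i3 * p[0] * p[1]] == Z[i1, i2, i3]
assert Z2[i2, i1 + i3 * p[0]] == Z[i1, i2, i3]
```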
As with matrix operations, it may be of interest to multiply two tensors with comparable dimensions via inner products, as described in Definition 4.
Definition 4
Suppose two tensors \(\boldsymbol {\mathcal {B}} \in \mathbb {R}^{p_{1} \times \cdots \times p_{D}}\) and \(\boldsymbol {\mathcal {Z}} \in \mathbb {R}^{p_{1} \times \cdots \times p_{D}}\). The inner product may be obtained by

$$\langle \boldsymbol{\mathcal{B}}, \boldsymbol{\mathcal{Z}} \rangle = \sum_{i_{1}=1}^{p_{1}} \cdots \sum_{i_{D}=1}^{p_{D}} b_{i_{1},...,i_{D}} \, z_{i_{1},...,i_{D}}.$$
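A minimal NumPy sketch of Definition 4, using arbitrary example tensors:

```python
import numpy as np

# Sketch of Definition 4: the inner product of two equally sized tensors is
# the sum of their element-wise products.
rng = np.random.default_rng(0)
B = rng.normal(size=(2, 3, 4))
Z = rng.normal(size=(2, 3, 4))

inner = np.sum(B * Z)

# Equivalent view: the dot product of the two vectorized tensors.
assert np.isclose(inner, B.ravel() @ Z.ravel())
```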
Furthermore, it may be of interest to multiply a matrix along the d^{th} dimension of a tensor through d-mode products as described in Definition 5.
Definition 5
Suppose a tensor \(\boldsymbol {\mathcal {Z}} \in \mathbb {R}^{p_{1} \times \cdots \times p_{d} \times \cdots \times p_{D}}\) and a matrix \(\mathbf {U} \in \mathbb {R}^{q \times p_{d}}\). The d-mode product between \(\boldsymbol {\mathcal {Z}}\) and U may be expressed as \(\boldsymbol {\mathcal {Z}} \times _{d} \mathbf {U} \in \mathbb {R}^{p_{1} \times \cdots \times q \times \cdots \times p_{D}}\) where the (i_{1},...,i_{d−1},j,i_{d+1},...,i_{D})^{th} value equals \(\sum _{i_{d}=1}^{p_{d}} z_{i_{1},...,i_{D}}u_{j,i_{d}}\).
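Definition 5 can be sketched with `np.tensordot`; the shapes and the all-ones matrix below are illustrative choices that make the result easy to verify.

```python
import numpy as np

# Sketch of Definition 5: multiply a matrix U along dimension d of tensor Z.
def mode_product(Z, U, d):
    # Contract U's columns against dimension d of Z, then move the new
    # axis of length q back into position d.
    return np.moveaxis(np.tensordot(U, Z, axes=(1, d)), 0, d)

Z = np.arange(24.0).reshape(2, 3, 4)
U = np.ones((5, 3))                 # q = 5, p_d = 3, acting on d = 1

Y = mode_product(Z, U, 1)
print(Y.shape)  # (2, 5, 4)

# With U all ones, each output entry is the sum over dimension d of Z.
assert np.allclose(Y[0, 0, :], Z[0, :, :].sum(axis=0))
```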
The rank of a matrix denotes the maximum number of linearly independent rows/columns in the matrix. Building on those concepts, the rank of a tensor may be thought of as the minimum number of outer products of vectors that must be summed to replicate the tensor, as shown in Definition 6.
Definition 6
Assume a D-dimensional tensor \(\boldsymbol {\mathcal {Z}} \in \mathbb {R}^{p_{1} \times p_{2} \times \dots \times p_{D}}\). Then \(\boldsymbol {\mathcal {Z}}\) has rank R if R is the smallest integer such that

$$\boldsymbol{\mathcal{Z}} = [[\mathbf{Z}_{1}, \mathbf{Z}_{2},..., \mathbf{Z}_{D}]],$$
where Definition 7 defines \([[\cdot ]]\), \(\boldsymbol {z_{d}^{(r)}} \in \mathbb {R}^{p_{d}}\), and \(\mathbf {Z}_{d} = \left [\boldsymbol {z_{d}^{(1)}},...,\boldsymbol {z_{d}^{(R)}}\right ] \in \mathbb {R}^{p_{d} \times R}\) for some set of Z_{d} matrices for d=1,...,D.
Definition 7
Let \(\mathbf {Z}_{d} \in \mathbb {R}^{p_{d} \times R}\) for d=1,...,D. Furthermore, let \(\boldsymbol {z_{d}^{(r)}}\) denote the r^{th} column of Z_{d}. Then

$$[[\mathbf{Z}_{1}, \mathbf{Z}_{2},..., \mathbf{Z}_{D}]] = \sum_{r=1}^{R} \boldsymbol{z_{1}^{(r)}} \circ \boldsymbol{z_{2}^{(r)}} \circ \cdots \circ \boldsymbol{z_{D}^{(r)}}.$$
The true rank of a tensor is often difficult to determine due to the high dimensionality, motivating decomposition techniques that, for a given rank, estimate vectors approximating the tensor, as shown in Definition 8.
Definition 8
Assume a D-dimensional tensor \(\boldsymbol {\mathcal {Z}} \in \mathbb {R}^{p_{1} \times p_{2} \times \dots \times p_{D}}\). A rank-R CP decomposition aims to use R vector sets (one vector per dimension) to approximate \(\boldsymbol {\mathcal {Z}}\) by

$$\boldsymbol{\mathcal{Z}} \approx [[\mathbf{Z}_{1}, \mathbf{Z}_{2},..., \mathbf{Z}_{D}]] = \sum_{r=1}^{R} \boldsymbol{z_{1}^{(r)}} \circ \boldsymbol{z_{2}^{(r)}} \circ \cdots \circ \boldsymbol{z_{D}^{(r)}},$$
where \(\mathbf {Z}_{d} = \left [\boldsymbol {z_{d}^{(1)}},...,\boldsymbol {z_{d}^{(R)}}\right ] \in \mathbb {R}^{p_{d} \times R}\).
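The sum-of-outer-products form in Definition 8 can be sketched directly in NumPy; the factor matrices below are randomly generated stand-ins, not output of any fitted decomposition.

```python
import numpy as np

# Sketch of Definition 8: a rank-R CP representation rebuilds a tensor as a
# sum of R outer products, one column per dimension per component r.
def cp_reconstruct(factors):
    # factors[d] is the p_d x R matrix Z_d from the definition above.
    R = factors[0].shape[1]
    terms = []
    for r in range(R):
        term = factors[0][:, r]
        for Zd in factors[1:]:
            term = np.multiply.outer(term, Zd[:, r])
        terms.append(term)
    return np.sum(terms, axis=0)

# A tensor built from R = 2 known factor matrices has the expected shape.
rng = np.random.default_rng(1)
factors = [rng.normal(size=(p_d, 2)) for p_d in (2, 3, 4)]
Z = cp_reconstruct(factors)
print(Z.shape)  # (2, 3, 4)
```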
Kolda and Bader present additional details on decomposition and other tensor operations [16].
Tensor regression
Building on the notation covered in “Tensor definitions” section, this section briefly describes tensor regression as originally presented by Zhou et al. [39]. Tensor regression builds on Generalized Linear Model (GLM) concepts, where each subject has an outcome y that follows some exponential family distribution with link function g(·) and mean μ. A classic GLM uses univariate independent variables to predict the outcome, but tensor regression extends these concepts by additionally allowing a predictor tensor. This is accomplished by multiplying the dimensions of the tensor through coefficient vectors that collapse the tensor to a univariate value that may then predict the outcome.
Figure 2 shows a simple example of rank-1 tensor regression with D=2 dimensions for the predictor tensor. Each dimension in the predictor tensor corresponds to a different feature set, and each subject contains its own D=2 predictor tensor used to predict their univariate outcome y.
Imaging modalities such as magnetic resonance imaging (MRI) present an excellent application for tensor regression models. For example, a particular brain MRI slice may be of interest for an outcome, say, disease status. This image slice can be expressed in an array structure (i.e., a 2dimensional tensor) where the value at a given pixel denotes the MRI signal associated with that location. Referring to the model shown in Fig. 2, the MRI image slice would correspond to the predictor tensor, \(\boldsymbol {\mathcal {Z}}\), and the estimated coefficients would indicate regions of interest associated with the outcome.
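To make the rank-1, D=2 model of Fig. 2 concrete, the sketch below computes each subject's linear predictor; the shapes, random coefficients, and logit link are illustrative assumptions of ours, not the paper's fitted model.

```python
import numpy as np

# Rank-1, D = 2 tensor regression: each subject's p1 x p2 predictor tensor
# is collapsed to a scalar through one coefficient vector per dimension.
rng = np.random.default_rng(2)
n, p1, p2 = 5, 4, 3
Z = rng.normal(size=(n, p1, p2))   # one predictor matrix per subject
beta1 = rng.normal(size=p1)
beta2 = rng.normal(size=p2)

# eta_i = beta1^T Z_i beta2; with a logit link, mu_i = 1 / (1 + exp(-eta_i)).
eta = np.array([beta1 @ Z[i] @ beta2 for i in range(n)])
mu = 1.0 / (1.0 + np.exp(-eta))
print(eta.shape)  # (5,)
```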
Tensor regression may also involve higher-rank problems with the more formal representation

$$g(\mu) = \boldsymbol{\gamma}^{T}\mathbf{U} + \left\langle \sum_{r=1}^{R} \boldsymbol{\beta_{1}^{(r)}} \circ \boldsymbol{\beta_{2}^{(r)}} \circ \cdots \circ \boldsymbol{\beta_{D}^{(r)}}, \boldsymbol{\mathcal{Z}} \right\rangle,$$
where U contains the univariate independent variables (e.g., age or sex). A rank-R tensor regression estimates R coefficient vectors for each dimension in the predictor tensor, but for simplicity, this manuscript assumes rank-1. The Block Relaxation Algorithm is used to estimate the coefficient vectors, with additional details described by Zhou et al. [39]. Zhou et al. also note that regularization in tensor regression may be accomplished by simply imposing constraints when fitting the models on each dimension.
If one naively vectorized the predictor tensor and fit a classic GLM, it would require estimating \(\prod _{d=1}^{D}p_{d}\) coefficients for the tensor. This approach would not only ignore the inherent structure of the data by treating each element in the tensor as independent with no distinction between the dimensions, it would also attempt to estimate many more coefficients than the \(R\sum _{d=1}^{D}p_{d}\) coefficients in tensor regression. Consequently, this naive approach may be unrealistic in high-dimensional problems given a typically much smaller sample size. This reduction in parameters highlights the benefits of tensor regression. However, tensor regression is subject to limitations such as uniqueness and identifiability. For example, suppose a rank-1 model with \(D=2\) and \(g(\mu) = \boldsymbol {\beta _{1}}^{T}\boldsymbol {\mathcal {Z}}\boldsymbol {\beta _{2}}\). Then for any nonzero scalar τ, we could derive an equally optimal model \(g(\mu) = \boldsymbol {\widetilde {\beta }_{1}}^{T}\boldsymbol {\mathcal {Z}}\boldsymbol {\widetilde {\beta }_{2}}\) where \(\boldsymbol {\widetilde {\beta }_{1}} = \tau \boldsymbol {\beta _{1}}\) and \(\boldsymbol {\widetilde {\beta }_{2}}=\boldsymbol {\beta _{2}}/\tau \). Additionally, the Block Relaxation Algorithm may converge to a local maximum, as opposed to the global maximum, when maximizing the log-likelihood. Zhou et al. [39] describe measures that may be used to check whether these issues are present in the estimated tensor model.
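Two claims in the preceding paragraph can be checked numerically: the coefficient-count reduction of tensor regression over a vectorized GLM, and the scaling non-identifiability of the rank-1 model. The sizes below are illustrative.

```python
import numpy as np

# Coefficient counts: vectorized GLM vs. rank-R tensor regression.
p = (100, 50)         # p1, p2
R = 1
n_glm = int(np.prod(p))          # 5000 coefficients if Z is vectorized
n_tensor = R * int(np.sum(p))    # 150 coefficients under rank-1
print(n_glm, n_tensor)

# Rescaling (beta1, beta2) by (tau, 1/tau) leaves the linear predictor unchanged.
rng = np.random.default_rng(3)
Z = rng.normal(size=p)
beta1, beta2 = rng.normal(size=p[0]), rng.normal(size=p[1])
tau = 2.5
assert np.isclose(beta1 @ Z @ beta2, (tau * beta1) @ Z @ (beta2 / tau))
```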
Results
MOSCATO was applied to both simulated and real data. Results from the simulations are presented in “Simulation results” section, and results from the real data application are presented in “Real data application results” section. To our knowledge, no existing methods directly match the goals of MOSCATO. Nevertheless, to create a sensible competitor, we employed a reasonable, but ad hoc, alternative: MOSCATO was benchmarked against a feature selection technique using the area under the receiver operating characteristic (ROC) curve (AUC). In short, the AUC methods select features that best predict the outcome according to estimated ROCs. AUC selections were based either on Bonferroni-adjusted p-values or on whether the AUC was less than 0.3 or greater than 0.7.
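As a hedged sketch of this AUC competitor (not the authors' exact implementation), note that the per-feature AUC against a binary outcome equals the Mann-Whitney U statistic scaled to [0, 1]; the 0.3/0.7 cutoffs come from the text, while the data below are simulated placeholders.

```python
import numpy as np

# AUC of one feature against a binary outcome via the Mann-Whitney statistic.
def feature_auc(x, y):
    pos, neg = x[y == 1], x[y == 0]
    # Probability a random positive exceeds a random negative (ties count half).
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

rng = np.random.default_rng(4)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 6))
X[y == 1, 0] += 2.0               # make feature 0 clearly predictive

aucs = np.array([feature_auc(X[:, j], y) for j in range(X.shape[1])])
# Cutoff rule from the text: keep features with AUC < 0.3 or > 0.7.
selected = np.where((aucs < 0.3) | (aucs > 0.7))[0]
```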
Simulation results
Two ‘omic’ types, denoted \(\boldsymbol {\mathcal {G}}\) and \(\boldsymbol {\mathcal {X}}\), along with an outcome, were simulated for each subject, and simulations were performed under 9 different settings accounting for differing numbers of cells per subject and levels of technical noise. For additional details on the simulations, refer to “Simulation details” section.
Figure 3 displays the tuned maximum network sizes for \(\boldsymbol {\mathcal {G}}\) and \(\boldsymbol {\mathcal {X}}\), denoted max_{G} and max_{X}. In a perfect execution of MOSCATO, max_{G} would be tuned to 10 and max_{X} to 15, matching the known network sizes in the simulated data. All simulations tuned max_{G} and max_{X} to values greater than the true number of network features, regardless of the simulation setting. For max_{X}, smaller values were tuned as technical noise decreased and the number of cells increased, but this trend did not persist when tuning max_{G}.
Figure 4 displays the sensitivity and specificity across the 9 simulation settings for data types \(\boldsymbol {\mathcal {G}}\) and \(\boldsymbol {\mathcal {X}}\) under the 3 feature selection methods (MOSCATO, AUC using p-values, and AUC using cutoffs). Sensitivity measures the probability that network features are properly included in the selections, and specificity measures the probability that non-network features are properly excluded. As shown in Fig. 4, the sensitivity and specificity under MOSCATO generally improve as the number of cells per subject increases and as technical noise decreases. Conversely, the specificity declines for AUC selections using p-values as the number of cells increases and the technical noise improves. AUC selection based on p-values not only produced counterintuitive results, where performance actually degraded as technical noise decreased, but also selected so many features that the results were not remotely sparse. Thus, while its sensitivity remained high across all simulations, this is simply because nearly all features were selected under that criterion. AUC selection based on cutoffs had the opposite problem: it selected hardly any features, producing poor sensitivity with nearly perfect specificity.
MOSCATO reproduced the superstructure of the graph (i.e., the network) reasonably well, with generally high sensitivity and limited false positives, especially when compared against the AUC-based approaches. However, when high levels of technical noise are expected with limited cells per subject, caution should be used when considering MOSCATO.
Real data application results
Leukemia is a broad disease that encompasses all cancers that occur in blood cells. The 5-year survival rate is about 65% according to data from 2011 to 2017, and about 459,000 people were living with leukemia in the United States in 2018 [12]. Leukemia may be classified based on progression speed, where chronic denotes slow progression and acute denotes aggressive progression. In addition to progression, leukemia may be subtyped by the type of cells where the cancer forms. For example, lymphocytic leukemia describes cancer developing from white blood cells, and myelogenous leukemia describes cancer developing in blood-forming cells within the bone marrow. Although rare, one may have both lymphocytic and myelogenous leukemia, which is denoted mixed phenotype leukemia.
To assess MOSCATO in practice, we applied it to real single-cell data with multiple data types. Limited data are currently available due to the infancy of multimodal single-cell sequencing, so we combined data across multiple studies that all used CITE-seq protocols [32] on bone marrow / peripheral blood cells. CITE-seq simultaneously produces cellular-level RNA information and cell surface protein abundance via antibody-derived tags (ADTs), and our outcome of interest is leukemia versus healthy status. Our goal is to apply MOSCATO to these data in order to obtain a subset of RNA and ADT features associated with leukemia. After combining the data across studies, we have 14 healthy patients and 7 patients with leukemia. Of the 7 leukemia subjects, 1 had chronic lymphocytic leukemia while the other 6 had mixed-phenotype acute leukemia. The studies used and further details are described in “Real data application details” section.
The data were preprocessed and integrated using Seurat version 4.0.3 [10], and Fig. 5 displays the Uniform Manifold Approximation and Projection (UMAP) [2] plots of the cell clusters established from the integrated data across all 21 subjects. The Seurat workflow normalizes the scRNA-seq data, identifies the highly variable genes, scales the data, performs reciprocal principal component analysis using the highly variable genes, finds the nearest neighbors for each cell, and clusters the cells. After integration, 8 of the cell clusters (clusters 0, 1, 2, 3, 4, 5, 6, and 14 from Fig. 5), 17,991 RNA features, and 5 ADTs (CD3, CD4, CD14, CD19, and CD56) were measured across all subjects.
After preprocessing, each subject contained 8 scRNA-seq matrices (one per cell cluster), where each matrix contained 17,991 columns and rows corresponding to the cells within that cluster. In addition, each subject contained another 8 matrices (one per cell cluster) for their single-cell ADT abundance, where each matrix contained 5 columns and rows corresponding to the same cells as in the scRNA-seq datasets. Finally, each subject had a binary disease status (healthy or leukemia). The feature selection techniques were applied to the 8 pairs of matrices separately (i.e., each cell cluster treated independently), resulting in 8 different MOSCATO and AUC selection results. The first step of MOSCATO estimates the correlation between each subject’s scRNA-seq and single-cell ADT features, resulting in 21 correlation matrices (one per subject), each of size 17,991×5.
Analysis was performed on each cell cluster separately, and the selections were later reviewed for biological relevance. The rest of this section focuses on the results from cell cluster 0, but the complete results from MOSCATO, AUC selections based on p-values, and AUC selections based on cutoffs for each of the 8 cell clusters are provided in Additional file 2. In summary, the feature sets selected by MOSCATO and by AUC cutoffs were similarly sized, but AUC selections based on p-values resulted in non-sparse feature sets. DAVID [11, 31] was used to analyze and organize the gene ontology information from the selected RNA genes. DAVID clusters genes based on common annotations and functional information, and it only allows clustering on gene sets with fewer than 3000 genes. Since the AUC selections based on p-values resulted in RNA selections well over this 3000-gene restriction for most cell clusters, we only examined gene clusters from the MOSCATO and AUC cutoff selections.
MOSCATO selected 96 RNA features within cell cluster 0, and DAVID identified 2 gene clusters. The strongest gene cluster (based on highest enrichment score) used 7 of these 96 genes. This was the most notable gene cluster: it included the genes CD3D, CD3E, and CD3G, which are part of the KEGG pathway for Human T-cell Leukemia Virus type 1 (HTLV-I) infection (KEGG pathway hsa05166), and HTLV-I infections are a known risk factor for developing adult T-cell leukemia/lymphoma [13]. Additionally, the genes CD2, CD3D, CD3E, CD3G, and CD4 within this gene cluster belong to the KEGG pathway for hematopoietic cell lineage (hsa04640), which assists in producing blood cells. Given that the disease of interest is leukemia (i.e., cancer in tissues which produce blood), it is reassuring that this gene cluster contains genes associated with blood production. Also, the genes CD2, CD3D, CD3E, CD3G, KLRB1, CD247, and CD4 within the gene cluster are associated with the gene ontology term for the cell surface receptor signaling pathway (GO:0007166), which makes sense given that MOSCATO summarized the single-cell data between RNA and cell surface proteins (i.e., ADTs). Of the 5 cell surface proteins considered, MOSCATO selected the ADTs CD3 and CD4 for this cell cluster. Figure 6 displays the RNA features within the strongest functional DAVID cluster, along with the ADT selections. No features selected by MOSCATO under cell cluster 0 were selected by AUC cutoffs, and although 83 of the 96 MOSCATO RNA selections were also selected by AUC based on p-values, that criterion selected nearly half of all RNA features considered. AUC selection based on cutoffs chose 11 RNA features and 0 ADTs for cell cluster 0, and DAVID was not able to discern any gene clusters from the genes selected.
In conclusion, the selections made by MOSCATO under cell cluster 0 resulted in a concise gene cluster supported by previously known annotations and functionalities. These functionalities related not only to the disease of interest (i.e., leukemia) but also to cell surface functionality. This highlights that MOSCATO considers not only supervised information (e.g., disease versus no disease) but also the relationships across data types (e.g., RNA and cell surface proteins). Performing feature selection using solely the AUC neglects the between-data-type relationships and was not able to return a concise set of genes related to leukemia. Furthermore, prespecified AUC cutoffs are arbitrary choices, and the p-value selections did not produce sparse solutions. Also, AUC selections were based directly on normalized expression values, whereas MOSCATO performs selections based on the similarities between data types. This possibly helps reduce batch effects found in individual subjects by standardizing each value between −1 and 1.
Discussion
MOSCATO was performed on both simulated and real data. The simulations produced fairly accurate results, with sensitivity and specificity close to 1 in many settings, although MOSCATO did not perform as well in situations with high technical noise or low cell counts per subject. Since MOSCATO’s predictor tensor is the estimated correlation of the datasets per individual, it is unsurprising that cell count contributes to the accuracy of the correlation estimates. Although not used in this manuscript, covariate adjustments could easily be made in MOSCATO by simply adding them to the tensor regression.
The real data application involved 21 subjects, and given the highly variable nature of single-cell sequencing data, the features selected should be interpreted with some caution. Furthermore, since the data were collected across different studies, only the 5 proteins common to all 21 subjects were used. As single-cell technologies become more available and cost-feasible, future experiments should be considered with more consistent protein panels across experiments and larger sets of subjects.
MOSCATO currently assumes all cells come from the same cell type, but it would be more interesting to accommodate situations in which multiple cell types are present; we are pursuing higher-dimensional extensions of MOSCATO in a future manuscript. A reasonable solution could incorporate another step that estimates a similarity matrix for each cell type and adds another dimension to the predictor tensor for cell type. Additionally, MOSCATO was only tested on experiments with two data types, but extensions should be considered for situations with more than two data types. This could be accomplished by extending the predictor tensor with a dimension per data type, although higher-dimensional summary measures would need to be considered.
MOSCATO was only tested using Pearson’s correlation as the summary measure to construct the predictor tensor \(\boldsymbol {\mathcal {Z}}\), but other summaries should be considered. For example, mutual information or inverse covariance matrices might be interesting avenues for future work. Graphical lasso [8] is a popular technique that obtains both the nodes and edges of a graph by applying lasso regressions to estimate the inverse covariance matrix, and this estimated inverse covariance matrix could be explored as the predictor tensor input for MOSCATO.
This study only used rank-1 tensors, but higher ranks could be considered and may unveil other patterns and networks in the data. The proper rank could be selected using typical model selection criteria such as cross-validation, although interpreting the results may not be as straightforward.
Zhou et al. discuss hypothesis testing via asymptotic normality results [39], and these hypothesis testing schemes could be explored to assess network strength. For example, one could perform a global test of whether the model coefficients for the network selections equal zero.
Although MOSCATO returns the superstructure of a graph, it does not provide information on directionality and does not currently consider directional consistency. For example, two genes may be positively correlated with each other yet have conflicting correlative directions with the outcome. This inconsistency makes interpreting the results more difficult, and it may be mitigated by additionally tuning to optimize directional balance. This concept has been considered in bulk-level analyses with a single ‘omic’ type [34], and it could be considered in future work for multimodal data.
Conclusions
MOSCATO is a statistical technique for analyzing multimodal single-cell data where the study goals are to identify which features within the ‘omic’ datasets relate to each other and to a clinical outcome. MOSCATO was found to perform favorably through a series of simulations and a real data application. Multimodal single-cell data continue to grow in popularity, and feature selection techniques such as MOSCATO may be critical to fully leverage their potential for highlighting biological markers in complex diseases.
Methods
Multimodal network analysis using tensor regression
In classic bulk sequencing, the data contain one record per subject. Supposing n subjects with two data types, a bulk sequencing study would contain two datasets (i.e., one per data type), \(\boldsymbol {\mathcal {G}} \in \mathbb {R}^{n \times p}\) and \(\boldsymbol {\mathcal {X}} \in \mathbb {R}^{n \times q}\). In single-cell sequencing, there are multiple records per subject, where each row corresponds to a cell within the subject. Consequently, for a given subject i with two data types, their single-cell data contain two datasets \(\boldsymbol {\mathcal {G}_{i}} \in \mathbb {R}^{m_{i} \times p}\) and \(\boldsymbol {\mathcal {X}_{i}} \in \mathbb {R}^{m_{i} \times q}\), where m_{i} denotes the number of cells for subject i. Since the number of cells typically differs across subjects (i.e., \(m_{i} \ne m_{i^{\prime }}\) for i≠i^{′}), we organize each subject’s data into separate datasets (i.e., \(\boldsymbol {\mathcal {G}_{i}}\) and \(\boldsymbol {\mathcal {X}_{i}}\)) rather than organizing the input data directly into a 3-dimensional tensor with a dimension for cells. It should also be noted that each subject’s data often consist of thousands of cells, and concatenating the single-cell data in long format may be computationally inefficient. Furthermore, we assume each subject i has a univariate outcome y_{i} for i=1,...,n, and we may express the outcomes as a vector y=[y_{1},y_{2},...,y_{n}]^{T}. For simplicity, we may denote the two data types as \(\boldsymbol {\mathcal {G}}\) and \(\boldsymbol {\mathcal {X}}\) without the i subscript, although, as described previously, each subject’s single-cell data are stored as separate matrices rather than in the long format found in bulk sequencing.
MOSCATO aims to identify a subset of features within \(\boldsymbol {\mathcal {G}}\) and \(\boldsymbol {\mathcal {X}}\) that relate to each other and to the outcome. In graphical modelling terms, MOSCATO identifies the superstructure of a semi-directed graph with undirected nodes for features within \(\boldsymbol {\mathcal {G}}\) and \(\boldsymbol {\mathcal {X}}\) and some path directed to the outcome y. MOSCATO accomplishes this by imposing elastic net constraints on a tensor regression model [40].
As in the MRI image example from the “Tensor regression” section, multimodal single-cell data contain multidimensional data per subject (i.e., features within \(\boldsymbol {\mathcal {G}}\) and features within \(\boldsymbol {\mathcal {X}}\)) with a univariate outcome, which motivates the use of tensor regression. Tensor regression not only efficiently accommodates multidimensional input data with a univariate outcome, it also supports regularization and allows additional covariate adjustments (e.g., age, sex, race, etc.). However, tensor regression requires each subject’s input tensor to have the same dimensions, and single-cell sequencing experiments nearly always return differing numbers of cells. Therefore, to standardize the dimensions across subjects, the first step of MOSCATO collapses the cellular dimension altogether by estimating a correlation matrix between the data type matrices,

\(\boldsymbol {\mathcal {Z}_{i}} = \left [\hat {\rho }_{jk}\right ] \in \mathbb {R}^{q \times p},\)
where \(\hat {\rho }_{jk}=\hat {corr}(x_{ij},g_{ik})\), letting x_{ij} denote the j^{th} feature in \(\boldsymbol {\mathcal {X}_{i}}\) and g_{ik} the k^{th} feature in \(\boldsymbol {\mathcal {G}_{i}}\) for the i^{th} subject. Although many summary matrices could be considered, such as the inverse covariance matrix or mutual information, Pearson’s correlation provides a simple interpretation while also standardizing the values within \(\boldsymbol {\mathcal {Z}_{i}}\) to between −1 and 1.
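As an illustration, this cellular collapse can be sketched in a few lines. This is our own Python sketch with toy data (the MOSCATO implementation itself is in R); it correlates every column of a subject's \(\boldsymbol {\mathcal {X}_{i}}\) with every column of \(\boldsymbol {\mathcal {G}_{i}}\), so subjects with different cell counts all yield a q×p summary:

```python
import numpy as np

def correlation_summary(X_i, G_i):
    """Collapse the cellular dimension for one subject by correlating every
    feature in X_i (cells x q) with every feature in G_i (cells x p),
    returning Z_i with entries rho_jk = corr(x_ij, g_ik) in [-1, 1]."""
    Xc = X_i - X_i.mean(axis=0)
    Gc = G_i - G_i.mean(axis=0)
    # Cross-products of standardized columns give Pearson correlations.
    Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    Gs = Gc / np.sqrt((Gc ** 2).sum(axis=0))
    return Xs.T @ Gs  # shape (q, p)

rng = np.random.default_rng(1)
# Two toy subjects with different cell counts but shared feature dimensions.
m1, m2, q, p = 300, 450, 4, 6
Z = np.stack([correlation_summary(rng.normal(size=(m, q)),
                                  rng.normal(size=(m, p)))
              for m in (m1, m2)])  # n x q x p input for the tensor regression
```

Stacking the per-subject summaries then gives the n×q×p array that feeds the tensor regression, regardless of how many cells each subject contributed.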
In broad strokes, MOSCATO applies a rank-1 tensor regression on \(\boldsymbol {\mathcal {Z}}\) to determine the elements in each ‘omic’ type that are associated with each other and with the outcome. Specifically, using the \(\boldsymbol {\mathcal {Z}_{i}}\) tensors for i=1,...,n to estimate the coefficients, a tensor regression model similar to (5) and depicted in Fig. 2 is fit with elastic net constraints. The elastic net constraint balances a weighted average of an L^{1}-norm and an L^{2}-norm penalty, where the L^{1}-norm truncates small coefficients to zero and the L^{2}-norm better handles highly correlated features. In summary, the elastic net constraint typically denoted as \(\lambda ((1-\alpha)/2\|\boldsymbol {\beta }\|_{2}^{2} + \alpha \|\boldsymbol {\beta }\|_{1})\) in the classical GLM setting will now involve

\(\lambda _{X}\left ((1-\alpha _{X})/2\|\boldsymbol {\beta _{X}}\|_{2}^{2} + \alpha _{X}\|\boldsymbol {\beta _{X}}\|_{1}\right) + \lambda _{G}\left ((1-\alpha _{G})/2\|\boldsymbol {\beta _{G}}\|_{2}^{2} + \alpha _{G}\|\boldsymbol {\beta _{G}}\|_{1}\right),\)
where \(\boldsymbol {\beta _{X}} \in \mathbb {R}^{q}\) denotes the coefficient vector for \(\boldsymbol {\mathcal {X}}\), \(\boldsymbol {\beta _{G}} \in \mathbb {R}^{p}\) denotes the coefficient vector for \(\boldsymbol {\mathcal {G}}\), \(\beta _{X_{j}}\) denotes the coefficient for the j^{th} feature in \(\boldsymbol {\mathcal {X}}\), and \(\beta _{G_{k}}\) denotes the coefficient for the k^{th} feature in \(\boldsymbol {\mathcal {G}}\). The hyperparameters α_{X}∈[0,1] and α_{G}∈[0,1] denote the weights placed on the L^{1}-norm constraints for \(\boldsymbol {\mathcal {X}}\) and \(\boldsymbol {\mathcal {G}}\), respectively. The hyperparameter λ in the classical GLM setting denotes the overall weight placed on the constraint and may be any positive number. Since tuning λ over this unbounded parameter space may be difficult, we instead use max_{X} and max_{G} to denote the maximum number of nonzero values within β_{X} and β_{G}, respectively. This reparameterization of the constraints drastically simplifies the hyperparameter space and subsequent tuning, as described in the “Tuning on stability” section.
For some fixed α_{X},α_{G},max_{X}, and max_{G}, the tensor regression model is fit to obtain \(\boldsymbol {\hat {\beta }_{X}} \in \mathbb {R}^{q}\) and \(\boldsymbol {\hat {\beta }_{G}} \in \mathbb {R}^{p}\). Because the L^{1}-norm in the elastic net constraint truncates small values to zero, only a subset of values within \(\boldsymbol {\hat {\beta }_{X}}\) and \(\boldsymbol {\hat {\beta }_{G}}\) will be nonzero. Thus, the final network features within data type \(\boldsymbol {\mathcal {X}}\) are the set \(\{j: \hat {\beta }_{X_{j}}\ne 0,j=1,...,q\}\), and the final network features within data type \(\boldsymbol {\mathcal {G}}\) are the set \(\{k: \hat {\beta }_{G_{k}}\ne 0,k=1,...,p\}\). Algorithm 1 summarizes the steps of MOSCATO.
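Given fitted coefficient vectors, the penalty and the resulting selections can be sketched as follows. This is a minimal Python illustration with made-up coefficients (MOSCATO itself is implemented in R); it evaluates the classical elastic net penalty on each mode's coefficients and extracts the nonzero index sets:

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Weighted average of the L2 (ridge) and L1 (lasso) penalties on one
    coefficient vector, as in the classical elastic net."""
    return lam * ((1 - alpha) / 2 * np.sum(beta ** 2)
                  + alpha * np.sum(np.abs(beta)))

def selected_features(beta_hat):
    """Indices of the nonzero coefficients, i.e. the final network features."""
    return {j for j in range(len(beta_hat)) if beta_hat[j] != 0}

# Toy fitted coefficients: the L1 part has truncated most entries to zero.
beta_X_hat = np.array([0.0, 1.2, 0.0, -0.4, 0.0])
beta_G_hat = np.array([0.7, 0.0, 0.0])
total_penalty = (elastic_net_penalty(beta_X_hat, lam=1.0, alpha=0.5)
                 + elastic_net_penalty(beta_G_hat, lam=1.0, alpha=0.5))
```

Here the selected sets would be {1, 3} for \(\boldsymbol {\mathcal {X}}\) and {0} for \(\boldsymbol {\mathcal {G}}\), mirroring how the final network features are read off the fitted model.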
Tuning on stability
MOSCATO assumes fixed values for α_{X},α_{G},max_{X}, and max_{G}, which are tuned using an extension of the Stability Approach to Regularization Selection (StARS) method [22]. Tuning on accuracy, such as by cross-validation or the Bayesian information criterion, tends to produce overly dense solutions in high-dimensional problems with results that are not reproducible [22]. The most extreme scenarios for stability are perfectly stable results from selecting no features (i.e., complete sparsity) or selecting all features (i.e., no sparsity). Building on that logic, StARS initializes the parameters at the sparsest solution and gradually relaxes the sparsity until some instability threshold ϕ is met. Instability is estimated from subsamples of the data by performing the feature selection under each subsample and summarizing the consistency of the results across subsamples. Although ϕ may initially seem an arbitrary cutoff between 0 and 1, it is easily interpreted as the amount of allowable instability: a smaller ϕ implies a sparser but more stable result. The motivation for allowing some instability, as opposed to fixing ϕ at 0, is to let some noise be selected so that no true signal is missed in the final feature selection. In statistical terms, StARS prioritizes reducing type II errors.
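The subsampling-based instability estimate can be sketched as follows. This is our own Python sketch of the StARS-style statistic (averaging 2θ̂(1−θ̂) over features, where θ̂ is each feature's selection frequency); `select_fn` is a hypothetical user-supplied selection routine, not part of the MOSCATO package:

```python
import numpy as np

def estimate_instability(data, select_fn, n_features, R=50, subsample=0.8,
                         rng=None):
    """StARS-style instability: run the feature selection on R subsamples,
    then average 2*theta*(1-theta) over features, where theta is each
    feature's selection frequency across the subsamples."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(data)
    counts = np.zeros(n_features)
    for _ in range(R):
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        for j in select_fn([data[i] for i in idx]):
            counts[j] += 1
    theta = counts / R
    return np.mean(2 * theta * (1 - theta))

# A perfectly stable selector (always picks feature 0) has instability 0.
stable = estimate_instability(list(range(100)), lambda d: {0}, n_features=10)
```

A selector whose choices depend on the subsample yields a strictly positive instability, which is what the tuning procedure compares against ϕ.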
Although the StARS method was developed for tuning a single sparsity parameter, the four hyperparameters α_{X},α_{G},max_{X}, and max_{G} are tuned using similar logic, one dimension at a time. We initialize to a sparse solution with some small max_{X} and, holding max_{X} fixed, estimate the instability for a range of α_{X} values between 0 and 1. We select the α_{X} value with the lowest instability and, if that instability is less than ϕ, increase max_{X} and repeat. This continues until the instability threshold ϕ is reached, fixing max_{X} and α_{X} at the largest max_{X}, with its corresponding optimal α_{X}, whose instability remains below ϕ. The same process is then repeated for max_{G} and α_{G}, using the selected \(\hat {max}_{X}\) and \(\hat {\alpha }_{X}\). During this first stage of tuning, α_{G} and max_{G} are held at generous fixed values, say α_{G}=0.5 and max_{G}=floor(p/2), so that an overly sparse \(\boldsymbol {\mathcal {G}}\) does not distort the stability estimates for \(\boldsymbol {\mathcal {X}}\). This is summarized in Algorithm 2.
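The per-dimension sweep can be sketched as follows (a Python sketch of ours; the instability surface `mock` is entirely hypothetical and stands in for the subsampling estimate):

```python
def tune_dimension(instability, grid_alpha, phi, max_start, max_limit):
    """Sketch of the per-dimension StARS-style sweep: starting from a sparse
    max, pick the alpha with the lowest estimated instability; keep growing
    max while that lowest instability stays below the threshold phi."""
    best = (max_start, grid_alpha[0])
    for max_k in range(max_start, max_limit + 1):
        alpha = min(grid_alpha, key=lambda a: instability(max_k, a))
        if instability(max_k, alpha) >= phi:
            break  # threshold hit: keep the previous (stable) settings
        best = (max_k, alpha)
    return best

# Hypothetical instability surface: grows with max and with |alpha - 0.5|.
mock = lambda m, a: 0.004 * m + 0.01 * abs(a - 0.5)
chosen = tune_dimension(mock, grid_alpha=[0.2, 0.5, 0.7], phi=0.05,
                        max_start=5, max_limit=20)
```

Under this toy surface the sweep keeps relaxing the sparsity until the estimated instability crosses ϕ, then returns the last settings that were still stable.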
Simulation details
Theoretical details of simulations
Splatter [37] is a popular technique for simulating single-cell RNA-seq (scRNA-seq) data, and it has been shown to mimic distributions from real scRNA-seq data. The general Splatter schematic begins by simulating a gene mean and then adjusts it to account for variation in outliers, library size, and dispersion. It then simulates the “observed” scRNA-seq values from a Poisson distribution using the adjusted gene mean, and the values are then randomly truncated to zero to replicate dropouts. Although Splatter realistically portrays scRNA-seq distributions, a few extensions were required in order to simulate multimodal single-cell data with supervised gene networks.
To accomplish this, we leverage the latent structures for multimodal supervised networks detailed in Zhang et al. [38]. Figure 7 shows the expected causal relationships within data containing a supervised multimodal network. In summary, we expect: a subset of features within \(\boldsymbol {\mathcal {G}}\) and \(\boldsymbol {\mathcal {X}}\) that relate to the outcome but not to each other; a subset that relate to each other but not to the outcome; a subset that are independent of each other and of the outcome; and finally, the target of the analysis, a subset of network features within \(\boldsymbol {\mathcal {G}}\) that relate to network features within \(\boldsymbol {\mathcal {X}}\), which in turn relate to the outcome. The expected relationships among these latent components may be represented through a covariance matrix whose off-diagonals are nonzero where relationships exist and zero where independence is expected. More details are provided in Additional file 1.
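For intuition, a miniature version of such a latent covariance matrix might look like the following. This is a hypothetical 4-component example of ours, far smaller than the simulation's actual latent structure; the point is that zero off-diagonals encode the assumed independencies while the matrix remains a valid (positive definite) covariance:

```python
import numpy as np

# Hypothetical 4-component latent covariance. Order of components:
# [G network, X network, a component tied only to y, outcome y].
Sigma = np.array([
    [1.0, 0.6, 0.0, 0.4],   # G network: related to the X network and to y
    [0.6, 1.0, 0.0, 0.4],   # X network
    [0.0, 0.0, 1.0, 0.5],   # relates to y but not to the network features
    [0.4, 0.4, 0.5, 1.0],   # outcome y
])
# Zero off-diagonals encode independence; positive eigenvalues confirm
# the matrix is a valid covariance for the multivariate normal draw.
eigvals = np.linalg.eigvalsh(Sigma)
```

Any such matrix can then seed the multivariate normal draw of the latent values described next.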
To extend Splatter for supervised multimodal networks, we first simulate the latent values (S, H, X, G, G^{′},X^{′}, and y) for each subject from a multivariate normal distribution with zero mean and the latent variables’ covariance matrix. Since the latent values are simulated from a multivariate normal, each is normally distributed with mean 0. The Splatter simulation assumes the initial gene mean comes from a gamma distribution, so we take the square of each latent value divided by its standard deviation. The transformed latent values are then gamma distributed with shape equal to 1/2 and scale equal to 2 times the variance. These transformed latent values are used as the initial gene means for each subset of features, and the latent value for y is used as the mean to simulate an observed outcome from a normal distribution. Initial gene means for the noise subset of features are simulated independently. Additional theoretical details may be found in Additional file 1.
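The normal-to-gamma step rests on a standard identity: the square of a mean-zero normal variable with variance σ² is gamma distributed with shape 1/2 and scale 2σ² (dividing by the standard deviation simply rescales the gamma scale). A quick empirical check of that identity (our own Python illustration, not the Splatter code):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                                  # variance of the latent value
z = rng.normal(0.0, np.sqrt(sigma2), size=200_000)
g = z ** 2                                    # squared latent values

# If Z ~ N(0, sigma^2) then Z^2 ~ Gamma(shape=1/2, scale=2*sigma^2), so
# mean = shape*scale = sigma^2 and variance = shape*scale^2 = 2*sigma^4.
emp_mean, emp_var = g.mean(), g.var()
```

With σ²=4 the theoretical mean and variance are 4 and 32, and the empirical moments land very close, confirming the transformation yields valid gamma-distributed gene means.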
Dispersion is adjusted on a subject level, library size is adjusted on a cellular level within a subject, outliers are adjusted on a feature level within a subject, and dropouts are accounted for on a cellular/feature level within a subject. The Splatter simulation is performed using the transformed gene means (combining all latent components within \(\boldsymbol {\mathcal {G}}\) and \(\boldsymbol {\mathcal {X}}\)), and then the features are later separated by data type for each subject.
Simulation specifications
MOSCATO was applied to a series of simulations using the techniques described in the previous section. Simulations were performed under 9 different settings accounting for the average number of cells per subject (250, 500, or 1000) and amount of technical noise (low, moderate, or high). Simulations were replicated 50 times under each simulation setting and each simulation had 100 subjects. \(\boldsymbol {\mathcal {G}}\) contained 1440 total features where only 10 belonged to the network, and \(\boldsymbol {\mathcal {X}}\) contained 1555 total features where only 15 belonged to the network. Table 1 describes the total number of features contained within each of the latent variables described in Fig. 7.
For tuning the hyperparameters, we set grid_{α}={0.2,0.5,0.7}, ϕ=0.02, R=50, and c=50. \(max_{G_{0}}\) and \(max_{X_{0}}\) were initially set to 5, although these may be increased to reduce runtimes as long as the instability remains below ϕ at initialization. Ideally, MOSCATO would tune max_{G} to 10 and max_{X} to 15 in order to select the proper network size according to Table 1, but this is unlikely given the mechanics of the StARS tuning method described in the “Tuning on stability” section, which prioritizes reducing type II error over type I error.
In addition to applying MOSCATO, we also applied a competing method based on the AUC. Seurat provides a popular single-cell sequencing workflow, and following methods similar to those used by the authors of Seurat [10], the AUC approach was performed using the presto version 1.0.0 R package [19]. Selections using the AUC were made under two different criteria. The first selected features whose Bonferroni-adjusted p-value was less than the nominal significance level (set to 0.05) under the null hypothesis that the AUC equals 0.5. The second selected features with an AUC either less than 0.3 or greater than 0.7. Since the AUC requires a categorical outcome, we binarized the outcome at its median (i.e., outcomes below the median were recoded as ‘0’ and outcomes above the median as ‘1’).
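The AUC criterion can be sketched with a rank-based (Mann–Whitney) computation after binarizing the outcome at its median. This is our own minimal Python sketch of the comparison method, not presto's implementation:

```python
import numpy as np

def auc(feature, label):
    """AUC of a single feature against a binary label via the Mann-Whitney
    U statistic: P(feature value in class 1 > value in class 0), ties = 1/2."""
    pos, neg = feature[label == 1], feature[label == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0.1, 0.4, 2.0, 3.5, 5.0, 6.2])       # continuous outcome
label = (y > np.median(y)).astype(int)              # binarize at the median
feat = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0]) # separates the classes
selected = auc(feat, label) > 0.7 or auc(feat, label) < 0.3
```

A feature that perfectly separates the binarized groups reaches an AUC of 1.0 (or 0.0 when anti-correlated), so it passes the 0.3/0.7 cutoff criterion.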
Real data application details
Data were combined across multiple studies of healthy and leukemia subjects. The studies used to obtain the data are summarized in Table 2. Only non-perturbed, baseline cells were considered.
Seurat version 4.0.3 [10] was used to normalize the data, cluster cells, and integrate the cell types across subjects.
We applied MOSCATO to each of these cell clusters separately, each with grid_{α}={0.01,0.05,0.1}. Since Seurat clusters cells by maximizing correlation between features, multicollinearity across features is consequently high, requiring more weight on the L^{2}-norm (i.e., a lower α within the elastic net constraint).
Due to the modest sample size, we tuned the hyperparameters using a subsampling size of 20 (out of 21 total subjects) to estimate stability in a “leave-one-out” scheme. Although StARS suggests setting the instability threshold ϕ to 0.05 for most applications [22], in this application, with a small sample size (i.e., only 21 subsamples) and a large number of RNA features (i.e., 17991 variables), the estimated instability under the sparsest solutions is much smaller than 0.05. For example, suppose all 21 subsamples select completely disjoint feature sets; because of the high number of variables in consideration, many variables are still consistently excluded across all subsampled results. Since StARS considers both consistency in selections and consistency in exclusions, the estimated instability remains quite small due to the consistent exclusions, even though the few selected features are completely disjoint across subsamples. Therefore, ϕ was set to 0.001.
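To see why the sparsest solutions here sit far below 0.05, consider a back-of-the-envelope check of the scenario above using the StARS instability statistic (our own Python illustration): 21 leave-one-out subsamples each select a completely disjoint set of 5 features out of 17991.

```python
import numpy as np

n_features, n_subsamples, picks = 17991, 21, 5
counts = np.zeros(n_features)
# Worst case for the selections: every subsample picks a disjoint 5-feature set.
for b in range(n_subsamples):
    counts[b * picks:(b + 1) * picks] += 1
theta = counts / n_subsamples             # per-feature selection frequency
instability = np.mean(2 * theta * (1 - theta))
# Despite totally disjoint selections, the consistently excluded features
# dominate the average, so the estimated instability stays tiny.
```

Even this fully disjoint scenario yields an instability of roughly 0.0005, two orders of magnitude below the conventional 0.05 threshold, which motivates the much smaller choice of ϕ=0.001.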
To compare selections with another method, the MOSCATO results were compared to selections based on the AUC, again using the presto version 1.0.0 R package [19]. As with MOSCATO, the AUC feature selections were performed on each cell cluster separately, under two selection criteria: a Bonferroni-adjusted p-value less than 0.05 under the null hypothesis that the AUC equals 0.5, or an AUC either less than 0.3 or greater than 0.7. Since p-values would likely be small with many cells (i.e., large sample sizes produce p-values sensitive to minuscule AUC deviations from 0.5), both the p-value approach and the approach based on AUC cutoffs were considered.
The real data application may be reproduced by following the steps provided at https://github.com/lorinmil/MOSCATOLeukemiaExample.
Availability of data and materials
All code in this manuscript was written in R version 4.1.0 [30]. Code to perform MOSCATO and replicate the simulations may be found publicly on GitHub at https://github.com/lorinmil/MOSCATO.
The real data application utilized publicly available data across three studies (accession numbers ERP124005, GSE152469, and GSE139369). The data for ERP124005 was downloaded through the Human Cell Atlas at https://data.humancellatlas.org/explore/projects/efea6426510a4b609a19277e52bfa815/projectmatrices, the data from GSE152469 was downloaded through the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4616298, and the data from GSE139369 was downloaded through NCBI’s Gene Expression Omnibus at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE139369. Steps and code to replicate the real data application may be found at https://github.com/lorinmil/MOSCATOLeukemiaExample.
Abbreviations
ADT: Antibody Derived Tag
AUC: Area Under the receiver operating Curve
DNSMI: Decomposition of Network Summary Matrix via Instability
GLM: Generalized Linear Model
HTLV-1: Human T-cell Leukemia Virus type 1
MOSCATO: Multi-Omic Single-Cell Analysis using TensOr regression
MRI: Magnetic Resonance Imaging
RNA: Ribonucleic Acid
SCCA: Sparse Canonical Correlation Analysis
scRNA-seq: single-cell RNA-seq
StARS: Stability Approach to Regularization Selection
UMAP: Uniform Manifold Approximation and Projection
References
Balmain A, Gray J, Ponder B. The Genetics and Genomics of Cancer. Nat Genet. 2003; 33(3):238–44.
Becht E, McInnes L, Healy J, Dutertre CA, Kwok IW, Ng LG, Ginhoux F, Newell EW. Dimensionality Reduction for Visualizing Single-Cell Data Using UMAP. Nat Biotechnol. 2019; 37(1):38–44.
Benatar D, Bondmass M, Ghitelman J, Avitall B. Outcomes of Chronic Heart Failure. Arch Intern Med. 2003; 163(3):347–52.
Cadot S, Valle C, Tosolini M, Pont F, Largeaud L, Laurent C, Fournie JJ, Ysebaert L, Quillet-Mary A. Longitudinal CITE-Seq Profiling of Chronic Lymphocytic Leukemia During ibrutinib Treatment: Evolution of Leukemic and Immune Cells at Relapse. Biomark Res. 2020; 8(1):1–13.
Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M. Mapping Complex Disease Traits with Global Gene Expression. Nat Rev Genet. 2009; 10(3):184–94.
Creixell P, Reimand J, Haider S, Wu G, Shibata T, Vazquez M, Mustonen V, GonzalezPerez A, Pearson J, Sander C, et al. Pathway and Network Analysis of Cancer Genomes. Nat Methods. 2015; 12(7):615.
Elissen AM, Steuten LM, Lemmens LC, Drewes HW, Lemmens KM, Meeuwissen JA, Baan CA, Vrijhoef HJ. MetaAnalysis of the Effectiveness of Chronic Care Management for Diabetes: Investigating Heterogeneity in Outcomes. J Eval Clin Pract. 2013; 19(5):753–62.
Friedman J, Hastie T, Tibshirani R. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics. 2008; 9(3):432–41.
Granja JM, Klemm S, McGinnis LM, Kathiria AS, Mezger A, Corces MR, Parks B, Gars E, Liedtke M, Zheng GX, et al. Single-Cell Multiomic Analysis Identifies Regulatory Programs in Mixed-Phenotype Acute Leukemia. Nat Biotechnol. 2019; 37(12):1458–65.
Hao Y, Hao S, Andersen-Nissen E, Mauck III WM, Zheng S, Butler A, Lee MJ, Wilk AJ, Darby C, Zagar M, Hoffman P, Stoeckius M, Papalexi E, Mimitou EP, Jain J, Srivastava A, Stuart T, Fleming LB, Yeung B, Rogers AJ, McElrath JM, Blish CA, Gottardo R, Smibert P, Satija R. Integrated Analysis of Multimodal Single-Cell Data. Cell. 2021. https://doi.org/10.1016/j.cell.2021.04.048.
Huang DW, Sherman BT, Lempicki RA. Bioinformatics Enrichment Tools: Paths Toward the Comprehensive Functional Analysis of Large Gene Lists. Nucleic Acids Res. 2009; 37(1):1–13.
Surveillance Research Program (SRP) in NCI’s Division of Cancer Control and Population Sciences (DCCPS). Cancer Stat Facts: Leukemia. 2021. https://seer.cancer.gov/statfacts/html/leuks.html. Accessed 25 Aug 2021.
Ishitsuka K, Tamura K. Human T-cell Leukaemia Virus Type I and Adult T-cell Leukaemia-lymphoma. Lancet Oncol. 2014; 15(11):517–26.
Karlebach G, Shamir R. Modelling and Analysis of Gene Regulatory Networks. Nat Rev Mol Cell Biol. 2008; 9(10):770–80.
Kendal AR, Layton T, Al-Mossawi H, Appleton L, Dakin S, Brown R, Loizou C, Rogers M, Sharp R, Carr A. Multi-Omic Single Cell Analysis Resolves Novel Stromal Cell Populations in Healthy and Diseased Human Tendon. Sci Rep. 2020; 10(1):1–14.
Kolda TG, Bader BW. Tensor Decompositions and Applications. SIAM Rev. 2009; 51(3):455–500.
Komarova NL, Thalhauser CJ. High Degree of Heterogeneity in Alzheimer’s Disease Progression Patterns. PLoS Comput Biol. 2011; 7(11):1002251.
Komurov K, Tseng JT, Muller M, Seviour EG, Moss TJ, Yang L, Nagrath D, Ram PT. The Glucose-Deprivation Network Counteracts Lapatinib-Induced Toxicity in Resistant ErbB2-Positive Breast Cancer Cells. Mol Syst Biol. 2012; 8(1):596.
Korsunsky I, Nathan A, Millard N, Raychaudhuri S. Presto Scales Wilcoxon and auROC Analyses to Millions of Observations. BioRxiv. 2019:653253.
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008; 9:559. https://doi.org/10.1186/147121059559.
Lawlor N, Nehar-Belaid D, Grassmann JD, Stoeckius M, Smibert P, Stitzel ML, Pascual V, Banchereau J, Williams A, Ucar D. Single Cell Analysis of Blood Mononuclear Cells Stimulated Through Either LPS or Anti-CD3 and Anti-CD28. Front Immunol. 2021; 12:691.
Liu H, Roeder K, Wasserman L. Stability approach To Regularization Selection (StARS) for High Dimensional Graphical Models. Adv Neural Inf Process Syst. 2010; 24(2):1432.
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly Parallel Genome-Wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell. 2015; 161(5):1202–14.
McCarthy MI. Genomics, Type 2 Diabetes, and Obesity. N Engl J Med. 2010; 363(24):2339–50.
Ni Z, Zheng X, Zheng X, Zou X. scLRTD: A Novel Low Rank Tensor Decomposition Method for Imputing Missing Values in Single-Cell Multi-Omics Sequencing Data. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics: 2020.
O’Donnell CJ, Nabel EG. Genomics of Cardiovascular Disease. N Engl J Med. 2011; 365(22):2098–109.
Osorio D, Zhong Y, Li G, Huang JZ, Cai JJ. scTenifoldNet: A Machine Learning Workflow for Constructing and Comparing TranscriptomeWide Gene Regulatory Networks from SingleCell Data. Patterns. 2020; 1(9):100139.
Pan X, Li Z, Qin S, Yu M, Hu H. ScLRTC: Imputation for Single-Cell RNA-seq Data via Low-Rank Tensor Completion. BMC Genomics. 2021; 22(1):1–19.
Picelli S, Faridani OR, Björklund ÅK, Winberg G, Sagasser S, Sandberg R. Full-length RNA-seq from Single Cells Using Smart-seq2. Nat Protoc. 2014; 9(1):171–81.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2021. https://www.R-project.org/.
Sherman BT, Lempicki RA, et al. Systematic and Integrative Analysis of Large Gene Lists Using DAVID Bioinformatics Resources. Nat Protoc. 2009; 4(1):44–57.
Stoeckius M, Hafemeister C, Stephenson W, HouckLoomis B, Chattopadhyay PK, Swerdlow H, Satija R, Smibert P. Simultaneous Epitope and Transcriptome Measurement in Single Cells. Nat Methods. 2017; 14(9):865–68. https://doi.org/10.1038/nmeth.4380.
Towle-Miller LM, Miecznikowski JC, Zhang F, Tritchler DL. SuMO-Fil: Supervised multi-omic filtering prior to performing network analysis. PLoS ONE. 2021; 16(8):0255579.
Tritchler D, Towle-Miller LM, Miecznikowski JC. Balanced Functional Module Detection in Genomic Data. bioRxiv. 2020.
Turner NC, ReisFilho JS. Genetic Heterogeneity and Cancer Drug Resistance. Lancet Oncol. 2012; 13(4):178–85.
Witten DM, Tibshirani RJ. Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data. Stat Appl Genet Mol Biol. 2009; 8(1):Article28. https://doi.org/10.2202/15446115.1470.
Zappia L, Phipson B, Oshlack A. Splatter: Simulation of SingleCell RNA Sequencing Data. Genome Biol. 2017; 18(1):174.
Zhang F, Miecznikowski JC, Tritchler DL. Identification of Supervised and Sparse Functional Genomic Pathways. Stat Appl Genet Mol Biol. 2020; 19(1):20180026.
Zhou H, Li L, Zhu H. Tensor Regression with Applications in Neuroimaging Data Analysis. J Am Stat Assoc. 2013; 108(502):540–52.
Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. J R Stat Soc Ser B Stat Methodol. 2005; 67(2):301–20.
Acknowledgments
The authors would like to thank Dr. David Tritchler for providing valuable feedback for the work presented in this manuscript.
Funding
LTM was funded under the Empire Clinical Research Investigator Program (ECRIP) grant (PI: Dr. Ekaterina Noyes). The funding body played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Authors and Affiliations
Contributions
LTM performed the simulations, real data application, and was a major contributor in drafting the manuscript. JM was also a major contributor in drafting the manuscript and reviewing methods. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
Additional simulation and application details.
Additional file 2
Gene and protein selections.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Towle-Miller, L.M., Miecznikowski, J.C. MOSCATO: a supervised approach for analyzing multi-omic single-cell data. BMC Genomics 23, 557 (2022). https://doi.org/10.1186/s12864-022-08759-3