Our method builds upon matrix factorization [5], multi-view learning, and deep learning. We describe each component below.
Some notations
Given N samples and V types of -omics data, we can often represent the data using V sample-feature matrices: \(\mathbf {M}^{(i)} \in \mathbb {R}^{N \times p^{(i)}}, i=1,2,\cdots, V\). Each matrix corresponds to one data view, and \(p^{(i)}\) is the feature dimension for view i.
Before describing the Multi-view Factorization AutoEncoder, we first discuss how to process individual views. For ease of description, we drop the superscript (·) when dealing with a single view. For a matrix M, \(M_{ij}\) denotes the element in the ith row and jth column, \(\mathbf{M}_{i,\cdot}\) denotes the ith row vector, and \(\mathbf{M}_{\cdot,j}\) denotes the jth column vector.
Let \(\mathbf {M} \in \mathbb {R}^{N \times p}\) be a feature matrix, with each row corresponding to a sample and each column corresponding to a feature. The features are often not independent. We represent the interactions among these features with a network \(\mathbf {G} \in \mathbb {R}^{p \times p}\). For instance, if these features correspond to protein expressions, then G will be a protein-protein interaction network, which is available in public databases such as STRING [26] and Reactome [27]. G can be an unweighted graph or a weighted graph with non-negative elements. Let D be the diagonal degree matrix with \(D_{ii}=\sum _{j=1}^{p} G_{ij}\); then the graph Laplacian of G is \(\mathbf{L}_G = \mathbf{D} - \mathbf{G}\).
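To make the construction concrete, the short sketch below builds the graph Laplacian from a toy adjacency matrix; the variable names and the example matrix are ours, not part of the method.

```python
import numpy as np

# Toy symmetric, non-negative adjacency matrix G for p = 4 features (illustrative only).
G = np.array([[0.0, 1.0, 0.0, 0.5],
              [1.0, 0.0, 2.0, 0.0],
              [0.0, 2.0, 0.0, 1.0],
              [0.5, 0.0, 1.0, 0.0]])

D = np.diag(G.sum(axis=1))   # degree matrix: D_ii = sum_j G_ij
L_G = D - G                  # graph Laplacian: L_G = D - G

# Sanity checks: each row of L_G sums to zero and L_G is positive semi-definite.
assert np.allclose(L_G.sum(axis=1), 0.0)
assert np.all(np.linalg.eigvalsh(L_G) >= -1e-10)
```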
Low-rank matrix factorization
Matrix factorization techniques [5] are widely used for clustering and dimensionality reduction. In many real-world applications, M often has a low rank. As a result, low-rank matrix factorization can be used for dimensionality reduction and clustering:
\(\mathbf{M} \approx \mathbf{X}\mathbf{Y}\), where \(\mathbf {X} \in \mathbb {R}^{N \times k}, \mathbf {Y} \in \mathbb {R}^{k \times p}, k < p\)
Some additional constraints are often added as regularizers in the objective function or enforced in the learning algorithm to find a good solution {X,Y}. For instance, when M is non-negative, Non-negative Matrix Factorization (NMF) [28] is often a “natural” choice to ensure both X and Y are non-negative.
Generally speaking, the objective function can be formulated as follows:
$$ \arg \min_{\mathbf{X}, \mathbf{Y}} \left\Vert \mathbf{M} - \mathbf{X} \mathbf{Y} \right\Vert^{2}_{F} + \lambda R(\mathbf{X}, \mathbf{Y}) $$
(1)
In Eq. 1, R(X,Y) is a regularization term for X and Y. For instance, R(X,Y) can include L1 and L2 norms for X and Y. In addition, structural constraints based on biological interaction networks can also be incorporated into R(X,Y).
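As an illustration of Eq. 1, the sketch below optimizes a regularized low-rank factorization by gradient descent in PyTorch, using a plain L2 penalty as one possible choice of R(X,Y); the data, dimensions, and hyperparameter values are placeholders.

```python
import torch

N, p, k = 100, 50, 10
M = torch.randn(N, p)                        # toy sample-feature matrix

X = torch.randn(N, k, requires_grad=True)    # sample-factor matrix
Y = torch.randn(k, p, requires_grad=True)    # factor-feature matrix
lam = 1e-3                                   # weight of the regularizer R(X, Y)

optimizer = torch.optim.Adam([X, Y], lr=1e-2)
for step in range(1000):
    optimizer.zero_grad()
    recon = torch.linalg.norm(M - X @ Y, 'fro') ** 2   # ||M - XY||_F^2
    reg = X.norm() ** 2 + Y.norm() ** 2                # L2 instance of R(X, Y)
    loss = recon + lam * reg
    loss.backward()
    optimizer.step()
```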
Interpretation
\(\mathbf {X} \in \mathbb {R}^{N \times k}\) can be regarded as a sample-factor matrix and the inherent non-redundant representation of the N samples, with each column corresponding to an independent factor. These k factors are often latent variables. \(\mathbf {Y} \in \mathbb {R}^{k \times p}\) can be seen as a linear transformation matrix, and its k rows can be regarded as a basis for the underlying factor space. The observable feature matrix M is generated from X by the linear transformation Y. In this sense, the formulation can be seen as a shallow linear generative model.
Limitations
The limitations of matrix factorization techniques stem from their "shallow" linear structure, which has limited representation power. In many real-world applications, however, we need to learn complex nonlinear transformations. Deep neural networks can approximate such transformations well when trained appropriately on a sufficiently large dataset.
Non-linear factorization with AutoEncoder
Since simple matrix factorization techniques are limited in modeling complex nonlinear relationships, we use an autoencoder to reconstruct the observable sample-feature matrix M, as it can approximate complex nonlinear transformations well.
The entire autoencoder is a multi-layer neural network consisting of an encoder and a decoder. We use a neural network with parameters \(\mathbf{\Theta_{e}}\) as the encoder:
$$ Encoder(\mathbf{M}, \mathbf{\Theta_{e}}) = \mathbf{X} \in \mathbb{R}^{N \times k} $$
(2)
Here X can be regarded as a factor matrix containing the essential information for all N samples. The encoder network will transform the observable sample-feature matrix M to its latent representation X. The decoder reconstructs the original data from the latent representation.
$$ Decoder(\mathbf{X}, \mathbf{\Theta_{d}}) = \mathbf{Z} \in \mathbb{R}^{N \times p} $$
(3)
In our framework, to conveniently incorporate biological interaction networks, the encoder (Eq. 2) contains all layers except the last one, and the decoder is the final linear layer. The decoder parameter (Eq. 3) is a linear transformation matrix, the same as in matrix factorization:
$$ \mathbf{\Theta_{d}} = \mathbf{Y} \in \mathbb{R}^{k \times p} $$
(4)
The input sample-feature matrix can be reconstructed as
$$ \mathbf{Z} = Encoder(\mathbf{M}, \mathbf{\Theta_{e}}) \cdot \mathbf{Y} = \mathbf{X} \mathbf{Y} $$
(5)
The reconstruction error can be computed as: \(\left \Vert \mathbf {M} - \mathbf {Z} \right \Vert ^{2}_{F}\).
Unlike matrix factorization, which can be regarded as a one-layer autoencoder, the encoder in our framework is a multi-layer neural network that can learn complex nonlinear transformations through backpropagation. Moreover, the encoder output X can be regarded as the learned patient representations for the N samples, and Y can be seen as the learned feature representations. With the learned patient and feature representations, we can calculate patient similarity networks and feature interaction networks, and add network regularizers to the objective function.
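One possible PyTorch realization of this design is sketched below: a nonlinear encoder followed by a bias-free linear decoder whose weight matrix plays the role of Y. The class name, hidden size, and activation are our own illustrative choices, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class FactorizationAutoEncoder(nn.Module):
    """Nonlinear encoder (Eq. 2) with a linear decoder whose weight acts as Y (Eqs. 3-5)."""
    def __init__(self, p, k, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(p, hidden),
            nn.ReLU(),
            nn.Linear(hidden, k),
        )
        # Bias-free linear decoder: Z = X Y, mirroring matrix factorization.
        self.decoder = nn.Linear(k, p, bias=False)

    def forward(self, M):
        X = self.encoder(M)      # N x k latent factors (patient representations)
        Z = self.decoder(X)      # N x p reconstruction
        return X, Z

    @property
    def Y(self):
        # nn.Linear stores its weight as (out_features, in_features) = (p, k),
        # so Y in the paper's notation (k x p) is the transpose.
        return self.decoder.weight.t()
```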
Incorporate biological knowledge as network regularizers
We aim to incorporate biological knowledge such as molecular interaction networks into our model as inductive biases to increase model generalizability. Denote \(\mathbf {G} \in \mathbb {R}^{p \times p}\) as the interaction matrix among p genomic features, which can be obtained from biological databases such as STRING [26] and Reactome [27].
Since our model can learn a feature representation Y, this representation should ideally be “consistent” with the biological interaction network corresponding to these features. We use a graph Laplacian regularizer to minimize the inconsistency between the learned feature representation Y and the feature interaction network G:
$$ Trace\left(\mathbf{Y} \mathbf{L}_{G} \mathbf{Y}^{T}\right) = \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} G_{ij} \lVert \mathbf{Y}_{\cdot, i} - \mathbf{Y}_{\cdot, j} \rVert^{2} $$
(6)
In Eq. 6, \(\mathbf{L}_G\) is the graph Laplacian matrix of G, and \(G_{ij} \ge 0\) captures how "similar" features i and j are. Each feature i is represented as a k-dimensional vector \(\mathbf{Y}_{\cdot,i}\), so the Euclidean distance between features i and j is \(\lVert \mathbf{Y}_{\cdot,i} - \mathbf{Y}_{\cdot,j} \rVert\). The term \(Trace(\mathbf{Y} \mathbf{L}_{G} \mathbf{Y}^{T})\) therefore serves as a surrogate for the inconsistency between the learned feature representation Y and the known feature interaction network G: it is large when Y is highly inconsistent with G, so minimizing it effectively reduces this inconsistency.
The objective function incorporating biological interaction networks through the graph Laplacian regularizer is as follows:
$$ \arg \min_{\mathbf{\Theta_{e}}, \mathbf{Y}} \left\Vert \mathbf{M} - \mathbf{Z} \right\Vert^{2}_{F} + \alpha \ Trace\left(\mathbf{Y} \cdot \mathbf{L}_{G} \cdot \mathbf{Y}^{T}\right) $$
(7)
In Eq. 7, α≥0 is a hyperparameter weighting the network regularization term. In practice, we normalize G and Y so that \(Trace(\mathbf{Y} \cdot \mathbf{L}_{G} \cdot \mathbf{Y}^{T})\) lies within the range [0,1]. In our implementation, we set \(\lVert \mathbf {G} \rVert _{F} = 1\) and \(\lVert \mathbf {Y}_{\cdot, i} \rVert = \frac {1}{\sqrt {p}}, i=1,2,\cdots, p\) (which also implies \(\lVert \mathbf{Y} \rVert_F = 1\)). This facilitates multi-view integration since the network regularizers from individual views are all on the same scale.
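A minimal sketch of the regularizer in Eq. 6 with the normalization just described, assuming the normalization is applied on the fly inside the loss rather than enforced directly on the parameters (one of several ways to realize the constraint):

```python
import torch

def laplacian_regularizer(Y, G, eps=1e-8):
    """Trace(Y L_G Y^T) with ||G||_F = 1 and each column of Y scaled to norm 1/sqrt(p)."""
    p = G.shape[0]
    G = G / (torch.linalg.norm(G, 'fro') + eps)                # enforce ||G||_F = 1
    Y = Y / (Y.norm(dim=0, keepdim=True) + eps) / p ** 0.5     # enforce ||Y_{:,i}|| = 1/sqrt(p)
    L_G = torch.diag(G.sum(dim=1)) - G                         # graph Laplacian of G
    # Equals 0.5 * sum_ij G_ij ||Y_{:,i} - Y_{:,j}||^2 as in Eq. 6.
    return torch.trace(Y @ L_G @ Y.t())
```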
Measuring feature similarity with mutual information
Eq. 6 uses Euclidean distance to measure the dissimilarity between learned feature representations. Euclidean distance relies on the inner product, which is essentially linear. However, the fact that two molecular entities interact with each other does not imply that their feature representations should be very similar or have a small Euclidean distance. Mutual information can be a better metric for quantifying whether two molecular entities interact with each other.
Let’s briefly review the definition of mutual information between two random variables X and Y.
$$\begin{array}{*{20}l} I(X, Y) &= H(X) - H(X|Y) \\ &= H(Y) - H(Y|X) \\ &= H(X) + H(Y) - H(X,Y) \end{array} $$
For a discrete random variable X∼P(x) (where P(x) is the discrete probability distribution of X), the entropy of X is defined as \(H(X) = \sum _{x} P(x) \log\frac {1}{P(x)}\).
The observed sample-feature matrix \(\mathbf {M} \in \mathbb {R}^{N \times p}\) can be used to measure the pairwise mutual information score between features i and j: \(MutualInfo(\mathbf{M}_{\cdot,i}, \mathbf{M}_{\cdot,j})\). However, due to measurement noise and error, this may not be accurate.
Ideally, the reconstructed signal with the proposed autoencoder model should reduce the noise in the data. Thus we can calculate pair-wise normalized mutual information scores using the reconstructed signal Z (Eq. 5):
$$\mathbf{K} = PairwiseMutualInformation(\mathbf{Z}) \in \mathbb{R}^{p \times p} $$
K can be regarded as a learned similarity matrix based on mutual information. Again we want to ensure that the learned similarity matrix is consistent with the known biological interaction network G. We can estimate the consistency between G and K as:
$$ \left \lVert \mathbf{G} \odot \mathbf{K} \right \rVert^{2}_{F} $$
(8)
Here ⊙ denotes element-wise matrix multiplication. Since G and K are the normalized feature interaction network and the pairwise feature mutual information matrix, respectively, the norm of their element-wise product can serve as an estimate of the consistency between G and K. Adding this mutual information regularization term to the objective in Eq. 7 yields Eq. 9:
$$ \begin{aligned} \underset{\mathbf{\Theta_{e}}, \mathbf{Y}}{\arg \min} &\left\Vert \mathbf{M} - Encoder(\mathbf{M}, \mathbf{\Theta_{e}}) \cdot \mathbf{Y} \right\Vert^{2}_{F} \\ &+ \alpha \ Trace\left(\mathbf{Y} \cdot \mathbf{L}_{G} \cdot \mathbf{Y}^{T}\right) \\ &- \gamma \left \lVert \mathbf{G} \odot \mathbf{K} \right \rVert^{2}_{F} \end{aligned} $$
(9)
α and γ are non-negative hyperparameters. There are numerical methods for measuring the mutual information between two continuous high-dimensional random variables; the simplest is to divide the continuous space into small bins and discretize the variables with these bins. However, accurately estimating mutual information from data requires a large sample size. Due to the difficulty of calculating mutual information accurately from a limited number of data points, we do not include the mutual information term in the following discussion and leave it for future work.
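For completeness, a simple binning-based estimator of the mutual information between two one-dimensional signals (e.g., two columns of Z), implemented with NumPy histograms; as noted above, such estimates are only reliable with a large sample size.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Estimate I(X, Y) in nats by discretizing two 1-D signals into equal-width bins."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()               # joint distribution P(x, y)
    px = pxy.sum(axis=1, keepdims=True)     # marginal P(x), shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)     # marginal P(y), shape (1, bins)
    nz = pxy > 0                            # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```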
Multi-view factorization AutoEncoder with network constraints
We have given the objective function for a single view in Eq. 7. For multiple views, the objective can be formulated as follows:
$$ \begin{aligned} \underset{\{\mathbf{\Theta_{e}}^{(v)}, \mathbf{Y}^{(v)}\}}{\arg \min} & \sum_{v=1}^{V} \left(\left\Vert \mathbf{M}^{(v)} - Encoder\left(\mathbf{M}^{(v)}, \mathbf{\Theta_{e}}^{(v)}\right) \cdot \mathbf{Y}^{(v)} \right\Vert^{2}_{F} \right.\\ &\left. + \alpha \ Trace\left(\mathbf{Y}^{(v)} \cdot \mathbf{L}_{G^{(v)}} \cdot \mathbf{Y}^{(v)^{T}}\right) \right) \end{aligned} $$
(10)
Note that we use a separate autoencoder for each view. In Eq. 10, we minimize the reconstruction losses and the feature interaction network regularizers of all views.
Here \(Encoder(\mathbf{M}^{(v)}, \mathbf{\Theta_{e}}^{(v)}) = \mathbf{X}^{(v)}\) can be seen as the learned latent representation of the N samples. From \(\mathbf{X}^{(v)}\) we can derive a patient similarity network \(\mathbf{S}^{(v)}\), which can also be used for clustering patients into groups. Multiple approaches can be employed to calculate a patient similarity network. For example, we can use cosine similarity:
$$ S_{ij}=\frac{\lvert \mathbf{X}_{i, \cdot} \cdot \mathbf{X}_{j, \cdot} \rvert}{\lVert \mathbf{X}_{i, \cdot}\rVert \cdot \lVert \mathbf{X}_{j, \cdot} \rVert} $$
(11)
We obtain a patient similarity network \(\mathbf{S}^{(v)}\) for each view v (Eq. 11 omits the superscript for clarity). Moreover, the outputs of the multiple encoders can be "fused" together for supervised learning.
$$ \mathbf{X} = \sum_{v=1}^{V} \mathbf{X}^{(v)} = \sum_{v=1}^{V}Encoder\left(\mathbf{M}^{(v)}, \mathbf{\Theta_{e}}^{(v)}\right) $$
(12)
This idea is similar to ResNet [29]. Another approach is to concatenate all views together, as in DenseNet [30]. We tried both in our experiments, and the results were not significantly different.
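A sketch of Eqs. 11 and 12, assuming each encoder maps its view to the same latent dimension so that the outputs can be summed element-wise; the function names are illustrative.

```python
import torch

def patient_similarity(X, eps=1e-8):
    """Absolute cosine similarity between the rows (patients) of X (Eq. 11)."""
    Xn = X / (X.norm(dim=1, keepdim=True) + eps)
    return (Xn @ Xn.t()).abs()

def fuse_views(latents):
    """Element-wise sum of the encoder outputs X^(v) across views (Eq. 12)."""
    return torch.stack(latents, dim=0).sum(dim=0)
```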
With the fused view X, we can again calculate the patient similarity network \(\mathbf{S}_X\) using Eq. 11. Moreover, since \(\mathbf{S}_X\) and \(\mathbf{S}^{(v)}, v=1,2,\cdots,V\) are for the same set of patients, we can fuse them together using affinity network fusion [14]:
$$ \mathbf{S} = \frac{1}{V+1} \left(\sum_{v=1}^{V} \mathbf{S}^{(v)} + \mathbf{S}_{X} \right) $$
(13)
Similar to the feature interaction network regularizer (Eq. 6), we also include a regularization term based on the patient similarity network:
$$ Trace\left(\mathbf{X}^{(v)^{T}} \cdot \mathbf{L}_{S} \cdot \mathbf{X}^{(v)}\right) $$
(14)
Here \(\mathbf{L}_S\) is the graph Laplacian of S. Adding this term to Eq. 10, we get the new objective function:
$$ \begin{aligned} \underset{\{\mathbf{\Theta_{e}}^{(v)}, \mathbf{Y}^{(v)}\}}{\arg \min} \sum_{v=1}^{V} &\left(\left\Vert \mathbf{M}^{(v)} - Encoder\left(\mathbf{M}^{(v)}, \mathbf{\Theta_{e}}^{(v)}\right) \cdot \mathbf{Y}^{(v)} \right\Vert^{2}_{F} \right.\\ & + \alpha \ Trace\left(\mathbf{Y}^{(v)} \cdot \mathbf{L}_{G^{(v)}} \cdot \mathbf{Y}^{(v)^{T}}\right)\\ & \left.+ \beta\ Trace\left(\mathbf{X}^{(v)^{T}} \cdot \mathbf{L}_{S} \cdot \mathbf{X}^{(v)}\right)\right) \end{aligned} $$
(15)
For each type of -omics data, there is one corresponding feature interaction network \(\mathbf{G}^{(v)}\). Different molecular interaction networks involve distinct feature sets and thus cannot be directly merged. However, patient similarity networks are about the same set of patients and can therefore be combined into a fused patient similarity network S using techniques such as affinity network fusion [14]. Our framework uses both molecular interaction networks and patient similarity networks for regularized learning.
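Putting the pieces together, the following sketch assembles the multi-view objective of Eq. 15 from the hypothetical helpers introduced above (FactorizationAutoEncoder, laplacian_regularizer, patient_similarity, fuse_views); the hyperparameter values are placeholders, and details such as whether S is treated as fixed during backpropagation are left open.

```python
import torch

def multiview_loss(models, views, interaction_nets, alpha=0.1, beta=0.1):
    """Reconstruction loss plus feature-network and patient-network regularizers (Eq. 15)."""
    latents, recon, feat_reg = [], 0.0, 0.0
    for model, M, G in zip(models, views, interaction_nets):
        X, Z = model(M)
        latents.append(X)
        recon = recon + torch.linalg.norm(M - Z, 'fro') ** 2      # ||M^(v) - Z^(v)||_F^2
        feat_reg = feat_reg + laplacian_regularizer(model.Y, G)   # Eq. 6, per view

    # Fused patient similarity network S (Eqs. 11-13).
    sims = [patient_similarity(X) for X in latents]
    sims.append(patient_similarity(fuse_views(latents)))
    S = torch.stack(sims, dim=0).mean(dim=0)
    L_S = torch.diag(S.sum(dim=1)) - S                            # graph Laplacian of S

    # Patient similarity regularizer Trace(X^(v)^T L_S X^(v)) for each view (Eq. 14).
    pat_reg = sum(torch.trace(X.t() @ L_S @ X) for X in latents)

    return recon + alpha * feat_reg + beta * pat_reg
```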
Supervised learning with multi-view factorization autoencoder
With multi-view data and feature interaction networks, our framework with the objective function in Eq. 15 can be used for unsupervised learning. When labeled data is available, we can use our model for supervised learning by adding a supervised loss term to Eq. 15:
$$ \begin{aligned} &\underset{\{\mathbf{\Theta_{e}}^{(v)}, \mathbf{Y}^{(v)}\}}{\arg \min} \mathcal{L}(\mathbf{T}, \left(\sum_{v=1}^{V}Encoder\left(\mathbf{M}^{(v)}, \mathbf{\Theta_{e}}^{(v)}\right)\right) \cdot \mathbf{C})\\ &+ \eta \sum_{v=1}^{V} \left(\left\Vert \mathbf{M}^{(v)} - Encoder\left(\mathbf{M}^{(v)}, \mathbf{\Theta_{e}}^{(v)}\right) \cdot \mathbf{Y}^{(v)} \right\Vert^{2}_{F}\right)\\ & + \alpha \sum_{v=1}^{V} Trace\left(\mathbf{Y}^{(v)} \cdot \mathbf{L}_{G^{(v)}} \cdot \mathbf{Y}^{(v)^{T}}\right) \\ & + \beta\ \sum_{v=1}^{V} Trace\left(\mathbf{X}^{(v)^{T}} \cdot \mathbf{L}_{S} \cdot \mathbf{X}^{(v)}\right) \end{aligned} $$
(16)
The first term \(\mathcal {L}(\mathbf {T}, \left (\sum _{v=1}^{V}Encoder\left (\mathbf {M}^{(v)}, \mathbf {\Theta _{e}}^{(v)}\right)\right) \cdot \mathbf {C})\) is the classification loss (e.g., cross entropy loss) or regression loss (e.g., mean squared error for continuous target variables) for supervised learning. T is the true class labels or other target variables available for training the model.
As in Eq. 12, \(\sum _{v=1}^{V}Encoder\left (\mathbf {M}^{(v)}, \mathbf {\Theta _{e}}^{(v)}\right)\) is the sum of the last hidden layers of the V autoencoders; it represents the learned patient representations combining multiple views. C contains the weights of the last fully connected layer typically used in neural network models for classification tasks. The second term \(\sum _{v=1}^{V} \left (\left \Vert \mathbf {M}^{(v)} - Encoder\left (\mathbf {M}^{(v)}, \mathbf {\Theta _{e}}^{(v)}\right) \cdot \mathbf {Y}^{(v)} \right \Vert ^{2}_{F}\right)\) is the reconstruction loss of all the submodule autoencoders. The third and fourth terms are the graph Laplacian constraints for molecular interaction networks and patient similarity networks, as in Eq. 6 and Eq. 14. η, α, and β are non-negative hyperparameters adjusting the weights of the reconstruction loss, the feature interaction network loss, and the patient similarity network loss.
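Finally, a hedged sketch of one training step for the supervised objective in Eq. 16, where a linear classification head (e.g., nn.Linear(k, num_classes)) plays the role of C on top of the fused representation; it reuses the hypothetical helpers above and is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def supervised_step(models, classifier, optimizer, views, interaction_nets, labels,
                    eta=1.0, alpha=0.1, beta=0.1):
    """One optimization step of Eq. 16: classification loss plus weighted regularizers."""
    optimizer.zero_grad()
    latents = [model(M)[0] for model, M in zip(models, views)]
    logits = classifier(fuse_views(latents))        # C applied to the fused representation
    cls_loss = nn.functional.cross_entropy(logits, labels)

    # multiview_loss bundles the reconstruction term and both network regularizers;
    # rescaling alpha and beta by 1/eta and multiplying the result by eta recovers
    # eta * reconstruction + alpha * feature regularizer + beta * patient regularizer.
    reg_loss = eta * multiview_loss(models, views, interaction_nets,
                                    alpha=alpha / eta, beta=beta / eta)
    loss = cls_loss + reg_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```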
A simple illustration of the whole framework combining two views with two-hidden-layer autoencoders is depicted in Fig. 1. The whole framework is end-to-end differentiable. We implemented the model using PyTorch (https://github.com/BeautyOfWeb/Multiview-AutoEncoder).