MultiFacTV: module detection from higher-order time series biological data

Li, Xutao; Ye, Yunming; Ng, Michael; Wu, Qingyao

doi:10.1186/1471-2164-14-S4-S2

Volume 14 Supplement 4

Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Genomics

Research
Open access
Published: 01 October 2013

MultiFacTV: module detection from higher-order time series biological data

Xutao Li^1,3,
Yunming Ye^1,3,
Michael Ng² &
…
Qingyao Wu^1,3

BMC Genomics volume 14, Article number: S2 (2013) Cite this article

2454 Accesses
4 Citations
Metrics details

Abstract

Background

Identifying modules from time series biological data helps us understand biological functionalities of a group of proteins/genes interacting together and how responses of these proteins/genes dynamically change with respect to time. With rapid acquisition of time series biological data from different laboratories or databases, new challenges are posed for the identification task and powerful methods which are able to detect modules with integrative analysis are urgently called for. To accomplish such integrative analysis, we assemble multiple time series biological data into a higher-order form, e.g., a gene × condition × time tensor. It is interesting and useful to develop methods to identify modules from this tensor.

Results

In this paper, we present MultiFacTV, a new method to find modules from higher-order time series biological data. This method employs a tensor factorization objective function where a time-related total variation regularization term is incorporated. According to factorization results, MultiFacTV extracts modules that are composed of some genes, conditions and time-points. We have performed MultiFacTV on synthetic datasets and the results have shown that MultiFacTV outperforms existing methods EDISA and Metafac. Moreover, we have applied MultiFacTV to Arabidopsis thaliana root(shoot) tissue dataset represented as a gene×condition×time tensor of size 2395 × 9 × 6(3454 × 8 × 6), to Yeast dataset and Homo sapiens dataset represented as tensors of sizes 4425 × 6 × 6 and 2920×14×9 respectively. The results have shown that MultiFacTV indeed identifies some interesting modules in these datasets, which have been validated and explained by Gene Ontology analysis with DAVID or other analysis.

Conclusion

Experimental results on both synthetic datasets and real datasets show that the proposed MultiFacTV is effective in identifying modules for higher-order time series biological data. It provides, compared to traditional non-integrative analysis methods, a more comprehensive and better view on biological process since modules composed of more than two types of biological variables could be identified and analyzed.

Background

Identification of biological modules plays a key role in bioinformatics because it can reveal interesting groups of proteins/genes having strong interactions, which may be related to some biological functionalities. In the literature, many methods have been proposed for this purpose. One popular way is to make use of clustering algorithms [1–4], which reveals module patterns by clustering proteins/genes into groups such that intragroup similarities are maximized while inter-group similarities are minimized. The performance of this type of methods relies significantly on the similarity function used during the clustering process. Due to this shortcoming, some researchers also tune to matrix factorization techniques for detecting biological modules. For example, in [5–7], singular value decomposition based methods have been studied and developed to detect modules from gene expression data. In [8–10], nonnegative matrix factorization related methods have been developed to cluster and explore biological data. Recently, CUR decomposition, a new method approximating original data matrix by selecting a set of columns and rows, has been applied to analyze microarray data and SNP data [11, 12] because of its scalability and interpretability. This method may possibly be used to cluster large-scale biological data as well. However, all these methods are developed for analyzing biological data represented as matrix form, which models interactions between only two types of variables.

With rapid acquisition of biological experiments from different laboratories or studies based on different databases, many higher-order biological data representing interactions between more than two types of variables can be obtained. For instance, researchers in different laboratories may be interested in analysing gene co-expression networks under different stimulus, each of which is represented as a gene×gene matrix. Integrating these matrices results in a higher-order biological data, namely a gene×gene×stimulus tensor, and finding module patterns from such data tends to offer a better view of the underlying biological structures. Therefore powerful methods which are able to detect modules with integrative analysis are urgently called for.

In the literature, several integrative analysis methods have already been put forward. In [13], Li et al. developed a framework to find recurrent heavy subgraphs from multiple weighted networks represented as a 3D tensor, i.e., gene × gene × network. In the framework, a tensor objective function is proposed and solved, the solution of which helps to discovery a heavy subgraph. In [14], Omberg et al. employed higher-order singular value decomposition(HOSVD) to perform integrative analysis of multiple mircoarray data from different studies. Zhang et al. extended nonnegative matrix factorization method for exploring protein modules from multiple data sources [15]. In [16], a JointCluster algorithm was proposed to extract coherent clusters from multiple networks. However, all these methods are not suitable for analyzing time series data, which is also a task of particular importance in bioinformatics.

In this paper, we are interested in identifying biological modules from multiple time series data with integrative analysis. There are two ways to build up such data in general. One is to collect and accumulate from different time series data sources [17, 18], and the other is to perform time series biological experiments under different stimulus/conditions [19, 20]. The second way is usually more popular. For instance, in [20], researchers studied the time series expression profiles of genes in Arabidopsis thaliana under several abiotic stimulus; in [19], researchers studied time series gene expression of several sclerosis patients after IFN-β injection. Joining such data together, we can form a higher-order time series tensor, e.g., a gene×condition×time tensor. Identifying modules of genes, conditions and time-points from such tensor data could offer us a better understanding of the corresponding biological processes. For example, Supper et al. proposed EDISA algorithm by extending the 2D iterative signature algorithm to extract and analyze such modules [21].

We propose in this paper, MultiFacTV, a method to find modules from tensor time series data. This method employs a tensor factorization objective function and makes use of the decomposition results to identify modules. As we consider time series data, the modules are expected to be as consecutive as possible in time dimension. Therefore we incorporate a time-related regularization term of total variation into the objective function. Different from the conference version [22], we have re-derived the factorization formulas and updated the algorithm because we do not assume that input biological tensor is nonnegative in this paper. We have compared MultiFacTV with EDISA [21] and MetaFac [23] on synthetic datasets, and the results have shown that MulitFacTV outperforms the other two algorithms. In addition, we have applied MultiFacTV to Arabidopsis thaliana root(shoot) tissue dataset, Yeast dataset and Homo Sapiens dataset, and the results have shown that MultiFacTV indeed identifies some interesting biological modules, most of which have not yet been reported in our conference version. These interesting findings have also been validated and explained by using Gene Ontology analysis with DAVID or other analysis.

Methods

Terminologies

A tensor refers to a multidimensional array or matrix. The order of a tensor is defined to be the number of dimensions, also known as modes, of the corresponding multidimensional array. For instance, given a n₁ × n₂ × n₃ tensor $A = (a_{r, s, t})$ , it is called a third-order tensor. The process of rearranging a tensor into a two-dimensional matrix is called unfolding. A n-th order tensor can be unfolded into n matrices in terms of each of its modes. For example, unfolding the tensor $A$ in terms of mode 1, mode 2 and mode 3, we obtain three matrices A⁽¹⁾, A⁽²⁾ and A⁽³⁾ of sizes n₁ × n₂n₃, n₂ × n₁n₃ and n₃ × n₁n₂ respectively. In this paper, we let A^(p)denote the unfolding matrix of tensor $A$ in terms of mode p.

Let I_n×nbe the n × n identity matrix. Let M^T be the transpose of matrix M. Given a n₁ × n₂ matrix M, we define vec(M) to be a n₁n₂ × 1 vector that is obtained by stacking each column of M. We define shrinkage_α/ρ(·) to be a shrinkage-thresholding operator for each entry of a matrix, i.e.,

shrinkag e_{α / ρ} {(M)}_{i, j} = m_{i, j} - min (α / ρ, | m_{i, j} |) . \frac{m_{i, j}}{| m_{i, j} |},

where $\frac{m_{i, j}}{| m_{i, j} |}$ should be zero when m_i,j = 0.

Let ⊗ and ○ be the Kronecker product operator and outer product operator. Given a n-dimensional vector x = [x₁, x₂, ..., x_n]^T , let x⁺ = {x_i|x_i > 0, 1 ≤ i ≤ n} and x^- = {x_i|x_i < 0, 1 ≤ i ≤ n} denote the sets of its positive entries and negative entries respectively. Besides, we define $\sum x^{+} = \sum_{y \in x^{+}} y$ and $\sum x^{-} = \sum_{y \in x^{-}} y$ . In this paper, max(·) and min(·) are functions used to find the maximum value and minimum value respectively.

MultiFacTV

Our idea to extract modules from higher-order time series biological data is using tensor factorization techniques. A higher-order time series biological data can be represented as a tensor. For example, a gene-condition-time interaction data is represented as a tensor in Figure 1(a). Factorizing this tensor with two decompositions for gene, condition and time-point respectively, we find two modules, i.e., the first module m₁ = {g₁, g₂, g₅, g₇, c₁, c₂, c₄, c₅, c₆, t₁, t₂} and the second module m₂ = {g₃, g₄, g₆, c₃, t₂, t₃, t₄}, by using a threshold (say 4) to cut off the decompositions shown as in Figure 1(b). However, we may not be able to obtain good modules merely based on traditional tensor factorization techniques because the data we are considering includes time dimension. We need to make sure the modules exist consistently in some consecutive time periods, e.g., the time-points involved in module m₁/m₂ are expected to be as consecutive as possible. To achieve this property, some suitable constraints must be incorporated into the factorization process. Next we will formulate the proposed MultiFacTV method.

We assume that the higher-order biological data represents interactions between three types of variables, for example gene × condition × time data. We formulate the proposed MutliFacTV based on such data in this paper. However, it is remarkable that MultiFacTV is a general framework that can be derived similarly for biological data representing interactions more than three types of variables. Suppose we consider the genomic expression profiles of n₁ genes under n₂ conditions over n₃ time-points. The corresponding interactions can be represented as a n₁ × n₂ × n₃ tensor $A = (a_{r, s, t})$ , where a_r_,s,tis a value recording how the gene r responds to the condition s at the time-point t. We note that a_r_,s,tcan be a positive or negative value, i.e., the input tensor $A$ is not necessarily a nonnegative tensor.

Assume we would like to find K modules. The following objective function is proposed to decompose the tensor $A$ into three matrices U, V and W:

\begin{array}{l} min {∥A - \sum_{k = 1}^{K} u_{k} \circ v_{k} \circ w_{k}∥}^{2} + α \sum_{k = 1}^{K} | | B w_{k} | |_{1} \\ s .t . W \geq 0, and 1^{T} \cdot w_{k} = 1 for k = 1, 2, . . ., K \end{array}

(1)

where U = [u₁, u₂, ..., u_K], V = [v₁, v₂, ..., v_K], W = [w₁, w₂, ..., w_K] are three decomposition matrices regarding n₁ genes, n₂ conditions and n₃ time-points respectively; B is a (n₃ - 1) × n₃ matrix satisfying

b_{i, j} = \{\begin{matrix} 1 & i = j \\ - 1 & i = j - 1 \\ 0 & otherwise \end{matrix}

and α > 0 is a regularization parameter. Clearly, $α \sum_{k = 1}^{K} | | B w_{k} | |_{1}$ is a total variation constraint regarding the decomposition matrix of time. With this regularization term, we can control the modules identified such that they exist consistently in some consecutive time periods. Different from the conference version [22], the decomposition matrices U and V do not have nonnegative constraints because we allow negative entries in the tensor $A$ .

MultiFacTV seeks matrices U, V and W that minimize the objective function in (1). As there are three matrices unknown, we need to solve them in an iterative fashion, i.e., changing the optimization problem into three subproblems with one unknown matrix in each, and then solving them iteratively until it converges. Therefore we have three subproblems for MultiFacTV as follows.

Subproblem 1: Fix V and W, and solve U by minimizing the objective function in (1).

In this subproblem, the objective function is transferred into:

min {∥A^{(1)} - U F∥}^{2}

(2)

where F = (W ʘ V)^T. We have the following solution for U:

U = A^{(1)} F^{T} {(F F^{T})}^{- 1}

(3)

Subproblem 2: Fix U and W, and solve V by minimizing the objective function in (1).

In this subproblem, the objective function is transferred into:

min {∥A^{(2)} - V F∥}^{2}

(4)

where F = (W ʘ U)^T. We have the following solution for V:

V = A^{(2)} F^{T} {(F F^{T})}^{- 1} .

(5)

Subproblem 3: Fix U and V, and solve W by minimizing the objective function in (1).

In this subproblem, the objective function is transferred into:

\begin{array}{l} min {∥A^{(3)} - W F∥}^{2} + α \sum_{k = 1}^{k} | | B w_{k} | |_{1} \\ s .t . 1^{T} \cdot w_{k} = 1 for k = 1, 2, . . ., K, \end{array}

(6)

where F = (V ʘ U)^T. In order to solve the matrix W in (6), we introduce two (n₃ - 1) × K auxiliary matrices P and Q and adopt the strategy of Alternating Direction Method of Multipliers (ADMM) [24, 25]. As a result, three updating formulas are derived and obtained (see [22] for the detailed derivation):

v e c (W) = {((F F^{T} \otimes 2 I_{n_{3} \times n_{3}}) + (ρ I_{K \times K} \otimes B^{T} B))}^{- 1} v e c (ρ B^{T} (P - Q / ρ) + 2 A^{(3)} F^{T})

(7)

P = shinkag e_{α / ρ} (B W + Q / ρ)

(8)

Q = Q + ρ (B W - P)

(9)

Here ρ can be any positive number and we use ρ = 1 in our implementation. Clearly, we need to update matrices W, P and Q iteratively until it converges to solve this subproblem. Note that each column of W must be normalized after updating as equation (7) to guarantee its constraints in (1).

Iteratively solving these three subproblems leads to a local minimum of the MultiFacTV objective function in (1) and the solutions for matrices U, V and W at the same time. Different from our conference version, the updating formulas for U and V in Subproblems 1 and 2 change here because we do not have nonnegative constraints on them. Next we summarize the proposed MultiFacTV method in Algorithm 1.

Algorithm 1 The MultiFacTV Algorithm

Input: a n₁ × n₂ × n₃ tensor $A$ , the number of modules K, parameter α, and thresholding parameters τ₁, τ₂, and τ₃

Output: K modules stored in Ω = {Ω₁, Ω₂, ..., Ω_K }

Procedure:

1: Randomly initialize matrices U₍₀₎, V₍₀₎ and W₍₀₎, and set t = 1;

2: Compute U_(t)= A⁽¹⁾F^T (FF^T)^-1 where F = (V_(t-1) ʘ U _(t-1))^T;

3: Compute V(t) = A⁽²⁾F^T (FF^T)^-1 where F = (W_(t-1)ʘ U _(t))^T;

4: Randomly initialize matrices P₍₀₎ and Q₍₀₎, and set F = (V_(t)ʘ U_(t))^T, s= 1, ρ= 1;

5: Iteratively update W_(s), P_(s), Q_(s)as follows:

\begin{array}{l} v e c (W_{(s)}) = {((F F^{T} \otimes 2 I_{n 3 \times n 3}) + (ρ I_{K \times K} \otimes B^{T} B))}^{- 1} v e c (ρ B^{T} (P_{(s - 1)} - Q_{(s - 1)} / ρ) + 2 A^{(3)} F^{T}), \\ P_{(s)} = shinkage {α / ρ}_{} (B W_{(s)} + Q_{(s - 1)} / ρ), \\ Q_{(s)} = Q_{(s - 1)} + ρ (B W_{(s)} - P_{(s)}), \end{array}

until it converges;

6: Set W_(t)= W_(s);

7: If || U_(t)- U_(t-1)||² + ||V_(t)- V_(t-1)||² + || W_(t)- W_(t-1)||² > 0.001, set t = t + 1 and goto Step 2;

otherwise, goto Step 8;

8: For k = 1 to K

Set Ω_k = Ø

$If \sum u_{k}^{+} < - \sum u_{k}^{-}, set u_{k} = - u_{k};$

For r = 1 to n₁

$If u_{r, k} > = 0.5 * τ_{1} * ((max (u_{k}^{+}) + min (u_{k}^{+})), set Ω_{k} = Ω_{k} \cup {gene r};$

$If \sum v_{k}^{+} < - \sum v_{k}^{-}, set v_{k} = - v_{k};$

For s = 1 to n₂

$If v_{s, k} > = 0.5 * τ_{2} * ((max (v_{k}^{+}) + min (v_{k}^{+})), set Ω_{k} = Ω_{k} \cup {condition s};$

For t = 1 to n₃

If w_t,k >= 0.5 ∗ τ₃ ∗ ((max(w_k )+min(w_k )), set Ω_k = Ω_k ∪ {time point t};

9: Return Ω = {Ω₁, Ω₂, ..., Ω_K}.

In this algorithm, we need to input a tensor and five parameters. At the beginning, the algorithm randomly initializes matrices U, V and W in step 1, and then it updates them iteratively from steps 2 to 7. We note that there is an inner loop in step 5 in order to update W. When finishing the computation of U, V and W, the algorithm outputs K modules in step 8 by cutting off each column of U, V and W with thresholding parameters τ₁, τ₂, and τ₃ respectively. Since the decomposition matrices U and V are not necessarily nonnegative, the module extraction in step 8 is also different from the conference version.

Results

In this section, we run MultiFacTV on synthetic datasets, Arabidopsis thaliana dataset, Yeast dataset and Homo sapiens dataset to test its performance and usefulness. The synthetic datasets are generated artificially and the other three real datasets can be found on http://www.ra.cs.uni-tuebingen.de/software/EDISA/downloads/index.htm.

Results on synthetic datasets

In this experiment, we generated gene×condition×time tensor data to test the effectiveness of MultiFacTV. In the synthetic datasets, some "ground-truth" modules containing a set of genes, conditions and consecutive time intervals were generated. There were 400 genes, 400 conditions and 50 time-points. Based on the number of modules included, the datasets were categorized into four types, 3-module dataset, 5-module dataset, 8-module dataset and 10-module dataset. To test the robustness of MultiFacTV, we added different level of noise in the corresponding tensors, i.e., using 0.005, 0.01 and 0.02 as densities to add noise into the tensors respectively. Our objective was to identify the "ground-truth" modules accurately.

As for a comparison, we performed EDISA and MetaFac as well. For MetaFac and MultFacTV, we set K to be the number of modules in the dataset. For EDISA, the sample size was set to be 20 and the iteration number was set to be 50. The parameters τ_g and τ_c were turned in the interval [0,1] with 0.1 as increasing step and then the best parameter values were to produce final result. For MultiFacTV, we used τ₁ = τ₂ = 1.0 and τ₃ = 0.75. All results were evaluated based on the Fscore and NMI (Normalized Mutual Information) by considering the discovered modules and the "ground truth" modules.

Before comparing the performance of MultiFacTV and the other two algorithms, we first demonstrate the convergence of the proposed MultiFacTV and how its performance changes against the tuning of parameter α. In Figure 2, we show the convergence of MultiFacTV based on one synthetic dataset. We see from this figure that the objective function value is decreasing as the number of iterations increases, and after 40 iterations the change is very little and the algorithm is stopped. In Figure 3, we show how the performance of MutliFacTV changes with respect to the tuning of α. We see from this figure that its performance does not change significantly as parameter α changes from 1 to 20, and the best result is yielded when α = 10. Therefore we used this value for α in the experiments. Table 1 shows the results of EDISA, MetaFac and the proposed algorithm on these synthetic datasets. We see from the table that MultiFacTV algorithm outperforms the other two comparison algorithms.

Table 1 Experimental results on synthetic datasets.

Full size table

Results on Arabidopsis thaliana datasets

In this experiment, the MultiFacTV was applied to Arabidopsis thaliana data to explore biological module patterns therein. The data recorded the time-series genomic expression of the root/shoot tissue in Arabidopsis thaliana when different abiotic stresses were considered. For the genomic expression data of root tissue, we constructed a gene×condition×time tensor $A$ of size 2395 × 9 × 6. For the genomic expression data of shoot tissue, we constructed a gene×condition×time tensor $A$ of size 3454 × 8 × 6. Both tensors were nonnegative. We run the MultiFacTV method with K = 40, α = 10, τ₁ = τ₂ = 1.0 and τ₃ = 0.75 on each tensor.

Next we present some biological modules discovered from each of these tensors(data). To validate these modules, we associate them to some functional annotation terms with DAVID analysis [26]. Besides, the corresponding p-values are also given to demonstrate the statistical significance of these functional terms.

Interesting genomic modules in root tissue: Some interesting biological modules detected from root tissue by MultiFacTV are given in the following.

1.
Cold-osmotic modules. In [27], it has been manifested that a large portion of the Arabidopsis genes are sensitive to cold and osmotic stress stimulus. In our results, we found several modules participating in the response to both stresses. We present two of such modules here and their genomic expression profiles are shown in Figure 4. We observe from Figures 4(a) and 4(b) there are distinct expression shapes, where the shapes for cold and osmotic conditions are quite similar. This observation indicates the genes in these two modules co-regulate under these two conditions, suggesting that Arabidopsis may not distinguish between cold and osmotic stresses. The first module is associated to functional terms like "response to water deprivation", "cold acclimation" and "response to cold" (p-values: 3.9 ×10^-4, 7.1 × 10^-4 and 1.2 × 10^-3 respectively) by using DAVID, and the second module is associated to "response to osmotic stress", "response to temperature stimulus" and "response to cold" (p-values: 9.1 × 10^-4, 8.9 × 10^-6 and 1.4 × 10^-2 respectively). These facts confirm that both modules play key roles in the response to cold and osmotic stresses.

2.
Salt module. In Figure 5(a), we show a module detected by MultiFacTV that responds to salt stress. Apparently, this module has quite different expression shapes under salt stress compared to under the other stresses. Moreover, the terms like "response to water deprivation" and "response to salt stress" (p-values: 2.6 × 10^-9 and 5.0 × 10^-3 respectively) are mapped to it, which manifests this module indeed functions under salt stress.

3.
Heat module. We obtained a module participating in the response to heat shock, shown as in Figure 5(b). Clearly, it has quite distinct expression shapes under heat condition. With DAVID, the genes in this module are mapped to "response to heat" and "response to temperature stimulus" (p-values: 1.1 × 10^-55 and 1.3 × 10^-43 respectively).
4.
Uvb-wound modules. We obtained two modules responding to uvb light and wound stresses, see Figure 6. In Figure 6(a), we observe that the module 1 down-regulates slightly from 0.5h to 12h and up-regulates from 12h to 24h. This module is significant for "photosynthesis, light harvesting", "response to light stimulus" and "defense response" (p-values: 1.4 × 10^-8, 2.8 × 10^-2 and 1.2 × 10^-2 respectively). It can be observed from Figure 6(b) that the module 2 has different genomic expression profiles for uvb and wound stresses in comparison with the other stresses. The module is pronounced under "response to light stimulus", "response to UV" and "response to wounding" (p-values: 1.4 × 10^-4, 1.4 × 10^-4 and 8.7 × 10^-3). Clearly, both modules indeed participate in the response to uvb light and wound stresses.

Interesting genomic modules in shoot tissue: In the following, we show two interesting genomic modules in shoot tissue output by the proposed MultiFacTV.

1.
Salt-oxidative-drought module. We found a module participating in the response to salt, oxidative and drought stresses, see Figure 7(a). We observe that the module has similar genomic expression profiles for salt, oxidative and drought stresses. It is annotated to functional terms like "response to salt stress", "oxidoreductase", "oxidation reduction" (p-values: 7.8 × 10^-2, 3.1 × 10^-2 and 4.1 × 10^-2 respectively). This suggests that the module is significant and indeed has biological functionalities related to salt and oxidative stresses.

2.
Wound module. We obtained a module participating in the response to wound stress, see Figure 7(b). It can be observed that the module first up-regulates and then down-regulates from 1h to 12h under wound stress, and its genomic expression shapes are quite different in comparison with the ones for the other stresses. By using DAVID, the module is annotated to functional terms like "defense response" and "response to wounding" (p-values: 2.4 × 10^-4 and 2.5 × 10^-2 respectively). This suggests the module identified by MultiFacTV indeed has wound-related biological functionalities.

Results on yeast dataset

We performed the proposed MultiFacTV on Yeast dataset to explore interesting module patterns. This dataset recorded multiple time series genomic expression of yeast Saccharomyces cerevisiae regarding to different environmental changes [28]. We considered six environmental stresses in this dataset, including heat shock, 0.32mM H₂O₂, 1mM menadione, 2.5mM DTT(dithiothreitol), 1.5mM diamide and 1M sorbitol. Since different time-points were adopted to record the expression under different environmental stresses in the original data, we preprocessed this data by selecting 6 time-points, i.e., 10min, 20min, 30min, 40min, 60min and 80min. The missing time-point was handled by using a linear interpolation of two closest time-points available. Other missing values were replaced with the average expression value at the corresponding time-point. As a result, we constructed a gene×condition×time tensor $A$ of size 4425 × 6 × 6, i.e., there were 4425 genes, 6 stresses and 6 time-points. This tensor was not nonnegative because the genomic expression data included negative values. The MultiFacTV algorithm was performed with K = 20, α = 10, τ₁ = τ₃ = 0.75 and τ₂ = 0.85.

Next we present and analyze some interesting module patterns identified by the proposed MultiFacTV.

1.
H₂O₂-menadione modules. In [28], it has been shown that a large portion of genes in yeast co-regulate under H₂O₂ stress and menadione stress despite that they are supposed to result in different reactive oxygen species. The MultiFacTV obtains similar findings and we present the genomic expression of two modules of such kind, see Figures 8(a) and 8(b). We observe that the module 1 up-regulates from 30min to 40min and down-regulates from 40min to 60min under both stresses, while the module 2 down-regulates from 10min to 20min and up-regulates from 20min to 30min. The analysis with DAVID have shown that the module 1 is functionally related to "reproduction of a single-celled organism", "mating projection tip" and "cell budding" (p-values: 1.9 × 10^-2, 7.8 × 10^-2 and 6.1 × 10^-2 respectively), and the module 2 is functionally associated to "glucose catabolic process", "hexose catabolic process" and "monosaccharide catabolic process" (p-values: 4.2 × 10^-2, 5.2 × 10^-2 and 5.8 × 10^-2 respectively). All these terms may be related to some biological process induced by the oxidative and reductive reactions taking place in the cells.

2.
Heat shock modules. We obtained two interesting modules responding to heat shock in yeast. The genomic expression of both modules are shown as in Figures 9(a) and 9(b). We see that these two modules have opposite expression trends after heat stress where the module 1 down-regulates while the module 2 up-regulates. The analysis with DAVID have shown that the module 1 indeed takes part in the response to heat and temperature stimulus (p-values: 2.5 × 10^-15 and 3.5 × 10^-23 respectively). Moreover, we find this module is annotated to functional terms like "protein catabolic process" and "cellular macromolecule catabolic process" (p-values: 5.2 × 10^-7 and 2.5 × 10^-5 respectively). This can be interpreted by the fact that heat shock usually leads to protein unfolding [28]. The module 2 is annotated to functional terms like "ribonucleoprotein complex biogenesis" and "RNA binding" (p-values: 8.6 × 10^-24 and 7.0 × 10^-7). This may be because the protein unfolding induces the concurrent ribonucleoprotein complex biogenesis.

Results on Homo Sapiens dataset

We applied the proposed MultiFacTV to Homo Sapiens dataset for exploring biological modules. It was a higher-order time series dataset about genomic expression of multiple sclerosis patients after IFN-β injection treatment. We represented this data as a nonnegative gene×patient×time tensor of size 2920 ×14×9, i.e, there were 2920 genes, 14 patients {A, B, C, D, E, F, G, H, I, J, K, L, M, N} and 9 time-points. The MultiFacTV was performed with τ₁ = τ₂ = 0.5, τ₃ = 0.75, α = 10 and K = 40. As a result, we found many interesting modules responding to IFN-β treatment similar to [21]. To exploit the usefulness of those modules from a different view, we made use of them to help us group the patients.

With the 40 modules identified, we constructed a binary matrix M of size 14 × 40 representing the membership of each patient to the modules, where m_i,j = 1 if the i-th patient was associated to the j-th module, otherwise m_i,j = 0. In such case, each of the 14 patients was represented as a 1 × 40 binary vector. Subsequently, we clustered the 14 patients by using k-means algorithm and the clustering results were {A, B, C, D}, {E, F, G, H}, {J, K, L, M }, {I, N }. This grouping result may suggest some differences of patients in their disease histories or progressions. We believe that this result will be beneficial to the designation of personalized medicine for the patients with multiple sclerosis [29].

Conclusions

As more and more time series biological data are being accumulated from different laboratories or databases, identification of modules with integrative analysis become an important and urgent task. One way to accomplish such integrative analysis is assembling multiple time series biological data into a tensor form. In this paper, we have proposed the MultiFacTV method, which extends the tensor factorization objective by introducing a time-related regularization term of total variation, to detect modules from such higher-order time series biological data. We have performed the MultiFacTV method on synthetic datasets, Arabidopsis dataset, Yeast dataset and Homo sapiens dataset to test its performance. The results have shown that the proposed MultiFacTV indeed reveals some interesting module patterns. We have shown and validated these interesting findings with DAVID analysis or other analysis.

In this paper, we assume that the multiple time series genomic expression data have the same size, i.e., the same number of genes and the same number of time-points, so that they can be joined into a tensor. In some cases, the data may be in different sizes. For example, the original Yeast dataset [28] has different number of time-points for different environmental stresses. In the future, it would be interesting to extend the tensor factorization objective function of Tucker1 or Tucker2 [30] in a similar way to perform integrative module detection for such data.

References

Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM et al: Systematic determination of genetic network architecture. Nature genetics. 1999, 22: 281-285. 10.1038/10343.
Article CAS PubMed Google Scholar
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.
Article CAS Google Scholar
Pentney W, Meila M: Spectral clustering of biological sequence data. Proceedings of the national conference on artificial intelligence. 2005, 20: 845-850.
Google Scholar
Cheng Y, Church GM: Biclustering of expression data. Proceedings of the eighth international conference on intelligent systems for molecular biology. 2000, 1: 93-103.
Google Scholar
Wall M, Rechtsteiner A, Rocha L: Singular value decomposition and principal component analysis. A practical approach to microarray data analysis. 2003, 91-109.
Chapter Google Scholar
Lee M, Shen H, Huang JZ, Marron JS: Biclustering via sparse singular value decomposition. Biometrics. 2010, 66 (4): 1087-1095. 10.1111/j.1541-0420.2010.01392.x.
Article PubMed Google Scholar
Sill M, Kaiser S, Benner A, Kopp-Schneider A: Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics. 2011, 27 (15): 2089-2097. 10.1093/bioinformatics/btr322.
Article CAS PubMed Google Scholar
Jung I, Kim D: Linknmf: Identification of histone modification modules in the human genome using nonnegative matrix factorization. Gene. 2012, 518 (1): 215-221.
Article PubMed Google Scholar
Carmona-Saez P, Pascual-Marqui R, Tirado F, Carazo J, Pascual-Montano A: Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC bioinformatics. 2006, 7 (1): 78-10.1186/1471-2105-7-78.
Article PubMed Central PubMed Google Scholar
Mejía-Roa E, Carmona-Sáez P, Nogales R, Vicente C, Vazquez M, Yang XY, Garcia C, Tirado F, Pascual-Montano A: Bionmf: a web-based tool for nonnegative matrix factorization in biology. Nucleic acids research. 2008, 36 (suppl 2): 523-528.
Article Google Scholar
Mahoney M, Drineas P: Cur matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences. 2009, 106 (3): 697-702. 10.1073/pnas.0803205106.
Article CAS Google Scholar
Paschou P, Mahoney M, Javed A, Kidd JR, Pakstis AJ, Gu S, Kidd KK, Drineas P: Intra-and interpopulation genotype reconstruction from tagging snps. Genome Research. 2007, 17 (1): 96-107.
Article PubMed Central CAS PubMed Google Scholar
Li W, Liu CC, Zhang T, Li H, Waterman MS, Zhou XJ: Integrative analysis of many weighted co-expression networks using tensor computation. PLoS Computational Biology. 2011, 7 (6): e1001106-10.1371/journal.pcbi.1001106.
Article PubMed Central CAS PubMed Google Scholar
Omberg L, Golub GH, Alter O: A tensor higher-order singular value decomposition for integrative analysis of dna microarray data from different studies. Proceedings of the National Academy of Sciences. 2007, 104 (47): 18371-18376. 10.1073/pnas.0709146104.
Article CAS Google Scholar
Zhang Y, Du N, Ge L, Jia K, Zhang AD: A collective nmf method for detecting protein functional module from multiple data sources. Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. 2012, 655-660.
Chapter Google Scholar
Narayanan M, Vetta A, Schadt EE, Zhu J: Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS computational biology. 2010, 6 (4): e1000742-10.1371/journal.pcbi.1000742.
Article PubMed Central PubMed Google Scholar
Garcia-Hernandez M, Berardini T, Chen G, Crist D, Doyle A, Huala E, Knee E, Lambrecht M, Miller N, Mueller LA et al: Tair: a resource for integrated arabidopsis data. Functional integrative genomics. 2002, 2 (6): 239-253. 10.1007/s10142-002-0077-z.
Article CAS PubMed Google Scholar
Edgar R, Domrachev M, Lash AE: Gene expression omnibus: Ncbi gene expression and hybridization array data repository. Nucleic acids research. 2002, 30 (1): 207-210. 10.1093/nar/30.1.207.
Article PubMed Central CAS PubMed Google Scholar
Weinstock-Guttman B, Badgett D, Patrick K, Hartrich L, Santos R, Hall D, Baier M, Feichter J, Ramanathan M: Genomic effects of ifn-β in multiple sclerosis patients. The Journal of Immunology. 2003, 171 (5): 2694-2702. 10.4049/jimmunol.171.5.2694.
Article CAS PubMed Google Scholar
Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, Angelo C, Bornberg-Bauer E, Kudla J, Harter K: The atgenexpress global stress expression data set: protocols, evaluation and model data analysis of uv-b light, drought and cold stress responses. The Plant Journal. 2007, 50 (2): 347-363. 10.1111/j.1365-313X.2007.03052.x.
Article CAS PubMed Google Scholar
Supper J, Strauch M, Wanke D, Harter K, Zell A: Edisa: extracting biclusters from multiple time-series of gene expression profiles. BMC bioinformatics. 2007, 8 (1): 334-348. 10.1186/1471-2105-8-334.
Article PubMed Central PubMed Google Scholar
Li XT, Ye YM, Wu QY, Ng MK: Multifactv: Finding modules from higher-order gene expression profiles with time dimension. Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on: 4-7 October 2012. 2012, 1-6. 10.1109/BIBM.2012.6392641.
Google Scholar
Lin YR, Sun J, Castro P, Konuru R, Sundaram H, Kelliher A: Metafac: community discovery via relational hypergraph factorization. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. 2009, 527-536.
Chapter Google Scholar
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J: Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers. 2011
Google Scholar
Ng MK, Weiss P, Yuan X: Solving constrained total-variation image restoration and reconstruction problems via alternating direction methods. SIAM Journal on Scientific Computing. 2010, 32 (5): 2710-2736. 10.1137/090774823.
Article Google Scholar
Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA et al: David: database for annotation, visualization, and integrated discovery. Genome Biol. 2003, 4 (5): P3-10.1186/gb-2003-4-5-p3.
Article PubMed Google Scholar
Kreps JA, Wu YJ, Chang HS, Zhu T, Wang X, Harper J: Transcriptome changes for arabidopsis in response to salt, osmotic, and cold stress. Plant Physiology. 2002, 130 (4): 2129-2141. 10.1104/pp.008532.
Article PubMed Central CAS PubMed Google Scholar
Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Science's STKE. 2000, 11 (12): 4241-4257.
CAS Google Scholar
Vosslamber S, Baarsen L, Verweij CL: Pharmacogenomics of ifn-β in multiple sclerosis: towards a personalized medicine approach. Pharmacogenomics. 2009, 10 (1): 97-108. 10.2217/14622416.10.1.97.
Article CAS PubMed Google Scholar
Acar E, Yener B: Unsupervised multiway data analysis: A literature survey. IEEE Transactions on Knowledge and Data Engineering. 2009, 21 (1): 6-20.
Article Google Scholar

Download references

Acknowledgements

Based on "MultiFacTV: finding modules from higher-order gene expression profiles with time dimension", by Xutao Li, Yunming Ye, Qingyao Wu and Michael Ng which appeared in Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on. ©2012 IEEE [22].

X. Li's research was supported in part by NSFC under Grant No.61100190. Y. Ye's research was supported in part by NSFC under Grant No. 61272538, National Key Technology R&D Program of MOST China under Grant No. 2012BAK17B08, Shenzhen Science and Technology Program under Grant No.CXY201107010206A, and Shenzhen Strategic Emerging Industries Program under Grant No. ZDSY20120613125016389. M. Ng's research was supported in part by Centre for Mathematical Imaging and Vision, HKRGC Grant No. 201812.

Declarations

The publication costs for this article were funded by Yunming Ye.

This article has been published as part of BMC Genomics Volume 14 Supplement S4, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S4.

Author information

Authors and Affiliations

Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, 518055, China
Xutao Li, Yunming Ye & Qingyao Wu
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
Michael Ng
Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, 518055, China
Xutao Li, Yunming Ye & Qingyao Wu

Authors

Xutao Li
View author publications
You can also search for this author in PubMed Google Scholar
Yunming Ye
View author publications
You can also search for this author in PubMed Google Scholar
Michael Ng
View author publications
You can also search for this author in PubMed Google Scholar
Qingyao Wu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Qingyao Wu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

XL, MN and YY participated in designing the algorithm, drafting and revising the manuscript. XL participated in implementing the algorithm and performing experiments. QW participated in the discussions of experimental results. All authors have read and approved the final version of this manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Li, X., Ye, Y., Ng, M. et al. MultiFacTV: module detection from higher-order time series biological data. BMC Genomics 14 (Suppl 4), S2 (2013). https://doi.org/10.1186/1471-2164-14-S4-S2

Download citation

Published: 01 October 2013
DOI: https://doi.org/10.1186/1471-2164-14-S4-S2

Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Genomics

MultiFacTV: module detection from higher-order time series biological data