 Research
 Open access
 Published:
MultiFacTV: module detection from higherorder time series biological data
BMC Genomics volume 14, Article number: S2 (2013)
Abstract
Background
Identifying modules from time series biological data helps us understand biological functionalities of a group of proteins/genes interacting together and how responses of these proteins/genes dynamically change with respect to time. With rapid acquisition of time series biological data from different laboratories or databases, new challenges are posed for the identification task and powerful methods which are able to detect modules with integrative analysis are urgently called for. To accomplish such integrative analysis, we assemble multiple time series biological data into a higherorder form, e.g., a gene × condition × time tensor. It is interesting and useful to develop methods to identify modules from this tensor.
Results
In this paper, we present MultiFacTV, a new method to find modules from higherorder time series biological data. This method employs a tensor factorization objective function where a timerelated total variation regularization term is incorporated. According to factorization results, MultiFacTV extracts modules that are composed of some genes, conditions and timepoints. We have performed MultiFacTV on synthetic datasets and the results have shown that MultiFacTV outperforms existing methods EDISA and Metafac. Moreover, we have applied MultiFacTV to Arabidopsis thaliana root(shoot) tissue dataset represented as a gene×condition×time tensor of size 2395 × 9 × 6(3454 × 8 × 6), to Yeast dataset and Homo sapiens dataset represented as tensors of sizes 4425 × 6 × 6 and 2920×14×9 respectively. The results have shown that MultiFacTV indeed identifies some interesting modules in these datasets, which have been validated and explained by Gene Ontology analysis with DAVID or other analysis.
Conclusion
Experimental results on both synthetic datasets and real datasets show that the proposed MultiFacTV is effective in identifying modules for higherorder time series biological data. It provides, compared to traditional nonintegrative analysis methods, a more comprehensive and better view on biological process since modules composed of more than two types of biological variables could be identified and analyzed.
Background
Identification of biological modules plays a key role in bioinformatics because it can reveal interesting groups of proteins/genes having strong interactions, which may be related to some biological functionalities. In the literature, many methods have been proposed for this purpose. One popular way is to make use of clustering algorithms [1–4], which reveals module patterns by clustering proteins/genes into groups such that intragroup similarities are maximized while intergroup similarities are minimized. The performance of this type of methods relies significantly on the similarity function used during the clustering process. Due to this shortcoming, some researchers also tune to matrix factorization techniques for detecting biological modules. For example, in [5–7], singular value decomposition based methods have been studied and developed to detect modules from gene expression data. In [8–10], nonnegative matrix factorization related methods have been developed to cluster and explore biological data. Recently, CUR decomposition, a new method approximating original data matrix by selecting a set of columns and rows, has been applied to analyze microarray data and SNP data [11, 12] because of its scalability and interpretability. This method may possibly be used to cluster largescale biological data as well. However, all these methods are developed for analyzing biological data represented as matrix form, which models interactions between only two types of variables.
With rapid acquisition of biological experiments from different laboratories or studies based on different databases, many higherorder biological data representing interactions between more than two types of variables can be obtained. For instance, researchers in different laboratories may be interested in analysing gene coexpression networks under different stimulus, each of which is represented as a gene×gene matrix. Integrating these matrices results in a higherorder biological data, namely a gene×gene×stimulus tensor, and finding module patterns from such data tends to offer a better view of the underlying biological structures. Therefore powerful methods which are able to detect modules with integrative analysis are urgently called for.
In the literature, several integrative analysis methods have already been put forward. In [13], Li et al. developed a framework to find recurrent heavy subgraphs from multiple weighted networks represented as a 3D tensor, i.e., gene × gene × network. In the framework, a tensor objective function is proposed and solved, the solution of which helps to discovery a heavy subgraph. In [14], Omberg et al. employed higherorder singular value decomposition(HOSVD) to perform integrative analysis of multiple mircoarray data from different studies. Zhang et al. extended nonnegative matrix factorization method for exploring protein modules from multiple data sources [15]. In [16], a JointCluster algorithm was proposed to extract coherent clusters from multiple networks. However, all these methods are not suitable for analyzing time series data, which is also a task of particular importance in bioinformatics.
In this paper, we are interested in identifying biological modules from multiple time series data with integrative analysis. There are two ways to build up such data in general. One is to collect and accumulate from different time series data sources [17, 18], and the other is to perform time series biological experiments under different stimulus/conditions [19, 20]. The second way is usually more popular. For instance, in [20], researchers studied the time series expression profiles of genes in Arabidopsis thaliana under several abiotic stimulus; in [19], researchers studied time series gene expression of several sclerosis patients after IFNβ injection. Joining such data together, we can form a higherorder time series tensor, e.g., a gene×condition×time tensor. Identifying modules of genes, conditions and timepoints from such tensor data could offer us a better understanding of the corresponding biological processes. For example, Supper et al. proposed EDISA algorithm by extending the 2D iterative signature algorithm to extract and analyze such modules [21].
We propose in this paper, MultiFacTV, a method to find modules from tensor time series data. This method employs a tensor factorization objective function and makes use of the decomposition results to identify modules. As we consider time series data, the modules are expected to be as consecutive as possible in time dimension. Therefore we incorporate a timerelated regularization term of total variation into the objective function. Different from the conference version [22], we have rederived the factorization formulas and updated the algorithm because we do not assume that input biological tensor is nonnegative in this paper. We have compared MultiFacTV with EDISA [21] and MetaFac [23] on synthetic datasets, and the results have shown that MulitFacTV outperforms the other two algorithms. In addition, we have applied MultiFacTV to Arabidopsis thaliana root(shoot) tissue dataset, Yeast dataset and Homo Sapiens dataset, and the results have shown that MultiFacTV indeed identifies some interesting biological modules, most of which have not yet been reported in our conference version. These interesting findings have also been validated and explained by using Gene Ontology analysis with DAVID or other analysis.
Methods
Terminologies
A tensor refers to a multidimensional array or matrix. The order of a tensor is defined to be the number of dimensions, also known as modes, of the corresponding multidimensional array. For instance, given a n_{1} × n_{2} × n_{3} tensor \mathcal{A}=\left({a}_{r,s,t}\right), it is called a thirdorder tensor. The process of rearranging a tensor into a twodimensional matrix is called unfolding. A nth order tensor can be unfolded into n matrices in terms of each of its modes. For example, unfolding the tensor \mathcal{A} in terms of mode 1, mode 2 and mode 3, we obtain three matrices A^{(1)}, A^{(2)} and A^{(3)} of sizes n_{1} × n_{2}n_{3}, n_{2} × n_{1}n_{3} and n_{3} × n_{1}n_{2} respectively. In this paper, we let A^{(p)}denote the unfolding matrix of tensor \mathcal{A} in terms of mode p.
Let I_{n×n}be the n × n identity matrix. Let M^{T} be the transpose of matrix M. Given a n_{1} × n_{2} matrix M, we define vec(M) to be a n_{1}n_{2} × 1 vector that is obtained by stacking each column of M. We define shrinkage_{α/ρ}(·) to be a shrinkagethresholding operator for each entry of a matrix, i.e.,
where \frac{{m}_{i,j}}{\left{m}_{i,j}\right} should be zero when m_{ i,j } = 0.
Let ⊗ and ○ be the Kronecker product operator and outer product operator. Given a ndimensional vector x = [x_{1}, x_{2}, ..., x_{ n }]^{T} , let x^{+} = {x_{ i }x_{ i } > 0, 1 ≤ i ≤ n} and x^{} = {x_{ i }x_{ i } < 0, 1 ≤ i ≤ n} denote the sets of its positive entries and negative entries respectively. Besides, we define \sum {\mathsf{\text{x}}}^{+}={\sum}_{y\in {\mathsf{\text{x}}}^{+}}y and \sum {\mathsf{\text{x}}}^{}={\sum}_{y\in {\mathsf{\text{x}}}^{}}y. In this paper, max(·) and min(·) are functions used to find the maximum value and minimum value respectively.
MultiFacTV
Our idea to extract modules from higherorder time series biological data is using tensor factorization techniques. A higherorder time series biological data can be represented as a tensor. For example, a geneconditiontime interaction data is represented as a tensor in Figure 1(a). Factorizing this tensor with two decompositions for gene, condition and timepoint respectively, we find two modules, i.e., the first module m_{1} = {g_{1}, g_{2}, g_{5}, g_{7}, c_{1}, c_{2}, c_{4}, c_{5}, c_{6}, t_{1}, t_{2}} and the second module m_{2} = {g_{3}, g_{4}, g_{6}, c_{3}, t_{2}, t_{3}, t_{4}}, by using a threshold (say 4) to cut off the decompositions shown as in Figure 1(b). However, we may not be able to obtain good modules merely based on traditional tensor factorization techniques because the data we are considering includes time dimension. We need to make sure the modules exist consistently in some consecutive time periods, e.g., the timepoints involved in module m_{1}/m_{2} are expected to be as consecutive as possible. To achieve this property, some suitable constraints must be incorporated into the factorization process. Next we will formulate the proposed MultiFacTV method.
We assume that the higherorder biological data represents interactions between three types of variables, for example gene × condition × time data. We formulate the proposed MutliFacTV based on such data in this paper. However, it is remarkable that MultiFacTV is a general framework that can be derived similarly for biological data representing interactions more than three types of variables. Suppose we consider the genomic expression profiles of n_{1} genes under n_{2} conditions over n_{3} timepoints. The corresponding interactions can be represented as a n_{1} × n_{2} × n_{3} tensor \mathcal{A}=\left({a}_{r,s,t}\right), where a_{ r }_{,s,t}is a value recording how the gene r responds to the condition s at the timepoint t. We note that a_{ r }_{,s,t}can be a positive or negative value, i.e., the input tensor \mathcal{A} is not necessarily a nonnegative tensor.
Assume we would like to find K modules. The following objective function is proposed to decompose the tensor \mathcal{A} into three matrices U, V and W:
where U = [u_{1}, u_{2}, ..., u_{ K }], V = [v_{1}, v_{2}, ..., v_{ K }], W = [w_{1}, w_{2}, ..., w_{ K }] are three decomposition matrices regarding n_{1} genes, n_{2} conditions and n_{3} timepoints respectively; B is a (n_{3}  1) × n_{3} matrix satisfying
and α > 0 is a regularization parameter. Clearly, \alpha \phantom{\rule{0.3em}{0ex}}\sum _{k=1}^{K}\left\rightB{w}_{k}{}_{1} is a total variation constraint regarding the decomposition matrix of time. With this regularization term, we can control the modules identified such that they exist consistently in some consecutive time periods. Different from the conference version [22], the decomposition matrices U and V do not have nonnegative constraints because we allow negative entries in the tensor \mathcal{A}.
MultiFacTV seeks matrices U, V and W that minimize the objective function in (1). As there are three matrices unknown, we need to solve them in an iterative fashion, i.e., changing the optimization problem into three subproblems with one unknown matrix in each, and then solving them iteratively until it converges. Therefore we have three subproblems for MultiFacTV as follows.
Subproblem 1: Fix V and W, and solve U by minimizing the objective function in (1).
In this subproblem, the objective function is transferred into:
where F = (W ʘ V)^{T}. We have the following solution for U:
Subproblem 2: Fix U and W, and solve V by minimizing the objective function in (1).
In this subproblem, the objective function is transferred into:
where F = (W ʘ U)^{T}. We have the following solution for V:
Subproblem 3: Fix U and V, and solve W by minimizing the objective function in (1).
In this subproblem, the objective function is transferred into:
where F = (V ʘ U)^{T}. In order to solve the matrix W in (6), we introduce two (n_{3}  1) × K auxiliary matrices P and Q and adopt the strategy of Alternating Direction Method of Multipliers (ADMM) [24, 25]. As a result, three updating formulas are derived and obtained (see [22] for the detailed derivation):
Here ρ can be any positive number and we use ρ = 1 in our implementation. Clearly, we need to update matrices W, P and Q iteratively until it converges to solve this subproblem. Note that each column of W must be normalized after updating as equation (7) to guarantee its constraints in (1).
Iteratively solving these three subproblems leads to a local minimum of the MultiFacTV objective function in (1) and the solutions for matrices U, V and W at the same time. Different from our conference version, the updating formulas for U and V in Subproblems 1 and 2 change here because we do not have nonnegative constraints on them. Next we summarize the proposed MultiFacTV method in Algorithm 1.
Algorithm 1 The MultiFacTV Algorithm
Input: a n_{1} × n_{2} × n_{3} tensor \mathcal{A}, the number of modules K, parameter α, and thresholding parameters τ_{1}, τ_{2}, and τ_{3}
Output: K modules stored in Ω = {Ω_{1}, Ω_{2}, ..., Ω_{ K } }
Procedure:
1: Randomly initialize matrices U_{(0)}, V_{(0)} and W_{(0)}, and set t = 1;
2: Compute U_{(t)}= A^{(1)}F^{T} (FF^{T})^{1} where F = (V_{(t1}) ʘ U _{(t1)})^{T};
3: Compute V(t) = A^{(2)}F^{T} (FF^{T})^{1} where F = (W_{(t1)}ʘ U _{(t)})^{T};
4: Randomly initialize matrices P_{(0)} and Q_{(0)}, and set F = (V_{(t)}ʘ U_{(t)})^{T}, s= 1, ρ= 1;
5: Iteratively update W_{(s)}, P_{(s)}, Q_{(s)}as follows:
until it converges;
6: Set W_{(t)}= W_{(s)};
7: If  U_{(t)} U_{(t1)}^{2} + V_{(t)} V_{(t1)}^{2} +  W_{(t)} W_{(t1)}^{2} > 0.001, set t = t + 1 and goto Step 2;
otherwise, goto Step 8;
8: For k = 1 to K
Set Ω_{ k } = Ø
\mathsf{\text{If}}\sum {u}_{k}^{+}\sum {u}_{k}^{},\phantom{\rule{0.3em}{0ex}}\mathsf{\text{set}}\phantom{\rule{0.3em}{0ex}}{u}_{k}\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}{u}_{k};
For r = 1 to n_{1}
\mathsf{\text{If}}\phantom{\rule{0.3em}{0ex}}{u}_{r,k}>\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}=0.\mathsf{\text{5}}*{\tau}_{\mathsf{\text{1}}}*\left((\text{max}\left({u}_{k}^{+}\right)+\text{min}\left({u}_{k}^{+}\right)\right)\mathsf{\text{,set}}{\mathrm{\Omega}}_{k}={\mathrm{\Omega}}_{k}\cup \left\{\mathsf{\text{gene}}r\right\};
\mathsf{\text{If}}\sum {v}_{k}^{+}\sum {v}_{k}^{},\mathsf{\text{set}}{v}_{k}={v}_{k};
For s = 1 to n_{2}
\mathsf{\text{If}}{v}_{s,k}\phantom{\rule{0.3em}{0ex}}=0.\mathsf{\text{5}}*{\tau}_{\mathsf{\text{2}}}*\left((\text{max}\phantom{\rule{0.3em}{0ex}}\left({v}_{k}^{+}\right)+\text{min}\left({v}_{k}^{+}\right)\right),\mathsf{\text{set}}{\mathrm{\Omega}}_{k}={\mathrm{\Omega}}_{k}\cup \left\{\mathsf{\text{condition}}s\right\};
For t = 1 to n_{3}
If w_{ t,k } >= 0.5 ∗ τ_{3} ∗ ((max(w_{ k } )+min(w_{ k } )), set Ω_{ k } = Ω_{ k } ∪ {time point t};
9: Return Ω = {Ω_{1}, Ω_{2}, ..., Ω_{ K }}.
In this algorithm, we need to input a tensor and five parameters. At the beginning, the algorithm randomly initializes matrices U, V and W in step 1, and then it updates them iteratively from steps 2 to 7. We note that there is an inner loop in step 5 in order to update W. When finishing the computation of U, V and W, the algorithm outputs K modules in step 8 by cutting off each column of U, V and W with thresholding parameters τ_{1}, τ_{2}, and τ_{3} respectively. Since the decomposition matrices U and V are not necessarily nonnegative, the module extraction in step 8 is also different from the conference version.
Results
In this section, we run MultiFacTV on synthetic datasets, Arabidopsis thaliana dataset, Yeast dataset and Homo sapiens dataset to test its performance and usefulness. The synthetic datasets are generated artificially and the other three real datasets can be found on http://www.ra.cs.unituebingen.de/software/EDISA/downloads/index.htm.
Results on synthetic datasets
In this experiment, we generated gene×condition×time tensor data to test the effectiveness of MultiFacTV. In the synthetic datasets, some "groundtruth" modules containing a set of genes, conditions and consecutive time intervals were generated. There were 400 genes, 400 conditions and 50 timepoints. Based on the number of modules included, the datasets were categorized into four types, 3module dataset, 5module dataset, 8module dataset and 10module dataset. To test the robustness of MultiFacTV, we added different level of noise in the corresponding tensors, i.e., using 0.005, 0.01 and 0.02 as densities to add noise into the tensors respectively. Our objective was to identify the "groundtruth" modules accurately.
As for a comparison, we performed EDISA and MetaFac as well. For MetaFac and MultFacTV, we set K to be the number of modules in the dataset. For EDISA, the sample size was set to be 20 and the iteration number was set to be 50. The parameters τ_{ g } and τ_{ c } were turned in the interval [0,1] with 0.1 as increasing step and then the best parameter values were to produce final result. For MultiFacTV, we used τ_{1} = τ_{2} = 1.0 and τ_{3} = 0.75. All results were evaluated based on the Fscore and NMI (Normalized Mutual Information) by considering the discovered modules and the "ground truth" modules.
Before comparing the performance of MultiFacTV and the other two algorithms, we first demonstrate the convergence of the proposed MultiFacTV and how its performance changes against the tuning of parameter α. In Figure 2, we show the convergence of MultiFacTV based on one synthetic dataset. We see from this figure that the objective function value is decreasing as the number of iterations increases, and after 40 iterations the change is very little and the algorithm is stopped. In Figure 3, we show how the performance of MutliFacTV changes with respect to the tuning of α. We see from this figure that its performance does not change significantly as parameter α changes from 1 to 20, and the best result is yielded when α = 10. Therefore we used this value for α in the experiments. Table 1 shows the results of EDISA, MetaFac and the proposed algorithm on these synthetic datasets. We see from the table that MultiFacTV algorithm outperforms the other two comparison algorithms.
Results on Arabidopsis thaliana datasets
In this experiment, the MultiFacTV was applied to Arabidopsis thaliana data to explore biological module patterns therein. The data recorded the timeseries genomic expression of the root/shoot tissue in Arabidopsis thaliana when different abiotic stresses were considered. For the genomic expression data of root tissue, we constructed a gene×condition×time tensor \mathcal{A} of size 2395 × 9 × 6. For the genomic expression data of shoot tissue, we constructed a gene×condition×time tensor \mathcal{A} of size 3454 × 8 × 6. Both tensors were nonnegative. We run the MultiFacTV method with K = 40, α = 10, τ_{1} = τ_{2} = 1.0 and τ_{3} = 0.75 on each tensor.
Next we present some biological modules discovered from each of these tensors(data). To validate these modules, we associate them to some functional annotation terms with DAVID analysis [26]. Besides, the corresponding pvalues are also given to demonstrate the statistical significance of these functional terms.
Interesting genomic modules in root tissue: Some interesting biological modules detected from root tissue by MultiFacTV are given in the following.

1.
Coldosmotic modules. In [27], it has been manifested that a large portion of the Arabidopsis genes are sensitive to cold and osmotic stress stimulus. In our results, we found several modules participating in the response to both stresses. We present two of such modules here and their genomic expression profiles are shown in Figure 4. We observe from Figures 4(a) and 4(b) there are distinct expression shapes, where the shapes for cold and osmotic conditions are quite similar. This observation indicates the genes in these two modules coregulate under these two conditions, suggesting that Arabidopsis may not distinguish between cold and osmotic stresses. The first module is associated to functional terms like "response to water deprivation", "cold acclimation" and "response to cold" (pvalues: 3.9 ×10^{4}, 7.1 × 10^{4} and 1.2 × 10^{3} respectively) by using DAVID, and the second module is associated to "response to osmotic stress", "response to temperature stimulus" and "response to cold" (pvalues: 9.1 × 10^{4}, 8.9 × 10^{6} and 1.4 × 10^{2} respectively). These facts confirm that both modules play key roles in the response to cold and osmotic stresses.

2.
Salt module. In Figure 5(a), we show a module detected by MultiFacTV that responds to salt stress. Apparently, this module has quite different expression shapes under salt stress compared to under the other stresses. Moreover, the terms like "response to water deprivation" and "response to salt stress" (pvalues: 2.6 × 10^{9} and 5.0 × 10^{3} respectively) are mapped to it, which manifests this module indeed functions under salt stress.

3.
Heat module. We obtained a module participating in the response to heat shock, shown as in Figure 5(b). Clearly, it has quite distinct expression shapes under heat condition. With DAVID, the genes in this module are mapped to "response to heat" and "response to temperature stimulus" (pvalues: 1.1 × 10^{55} and 1.3 × 10^{43} respectively).

4.
Uvbwound modules. We obtained two modules responding to uvb light and wound stresses, see Figure 6. In Figure 6(a), we observe that the module 1 downregulates slightly from 0.5h to 12h and upregulates from 12h to 24h. This module is significant for "photosynthesis, light harvesting", "response to light stimulus" and "defense response" (pvalues: 1.4 × 10^{8}, 2.8 × 10^{2} and 1.2 × 10^{2} respectively). It can be observed from Figure 6(b) that the module 2 has different genomic expression profiles for uvb and wound stresses in comparison with the other stresses. The module is pronounced under "response to light stimulus", "response to UV" and "response to wounding" (pvalues: 1.4 × 10^{4}, 1.4 × 10^{4} and 8.7 × 10^{3}). Clearly, both modules indeed participate in the response to uvb light and wound stresses.
Interesting genomic modules in shoot tissue: In the following, we show two interesting genomic modules in shoot tissue output by the proposed MultiFacTV.

1.
Saltoxidativedrought module. We found a module participating in the response to salt, oxidative and drought stresses, see Figure 7(a). We observe that the module has similar genomic expression profiles for salt, oxidative and drought stresses. It is annotated to functional terms like "response to salt stress", "oxidoreductase", "oxidation reduction" (pvalues: 7.8 × 10^{2}, 3.1 × 10^{2} and 4.1 × 10^{2} respectively). This suggests that the module is significant and indeed has biological functionalities related to salt and oxidative stresses.

2.
Wound module. We obtained a module participating in the response to wound stress, see Figure 7(b). It can be observed that the module first upregulates and then downregulates from 1h to 12h under wound stress, and its genomic expression shapes are quite different in comparison with the ones for the other stresses. By using DAVID, the module is annotated to functional terms like "defense response" and "response to wounding" (pvalues: 2.4 × 10^{4} and 2.5 × 10^{2} respectively). This suggests the module identified by MultiFacTV indeed has woundrelated biological functionalities.
Results on yeast dataset
We performed the proposed MultiFacTV on Yeast dataset to explore interesting module patterns. This dataset recorded multiple time series genomic expression of yeast Saccharomyces cerevisiae regarding to different environmental changes [28]. We considered six environmental stresses in this dataset, including heat shock, 0.32mM H_{2}O_{2}, 1mM menadione, 2.5mM DTT(dithiothreitol), 1.5mM diamide and 1M sorbitol. Since different timepoints were adopted to record the expression under different environmental stresses in the original data, we preprocessed this data by selecting 6 timepoints, i.e., 10min, 20min, 30min, 40min, 60min and 80min. The missing timepoint was handled by using a linear interpolation of two closest timepoints available. Other missing values were replaced with the average expression value at the corresponding timepoint. As a result, we constructed a gene×condition×time tensor \mathcal{A} of size 4425 × 6 × 6, i.e., there were 4425 genes, 6 stresses and 6 timepoints. This tensor was not nonnegative because the genomic expression data included negative values. The MultiFacTV algorithm was performed with K = 20, α = 10, τ_{1} = τ_{3} = 0.75 and τ_{2} = 0.85.
Next we present and analyze some interesting module patterns identified by the proposed MultiFacTV.

1.
H_{2}O_{2}menadione modules. In [28], it has been shown that a large portion of genes in yeast coregulate under H_{2}O_{2} stress and menadione stress despite that they are supposed to result in different reactive oxygen species. The MultiFacTV obtains similar findings and we present the genomic expression of two modules of such kind, see Figures 8(a) and 8(b). We observe that the module 1 upregulates from 30min to 40min and downregulates from 40min to 60min under both stresses, while the module 2 downregulates from 10min to 20min and upregulates from 20min to 30min. The analysis with DAVID have shown that the module 1 is functionally related to "reproduction of a singlecelled organism", "mating projection tip" and "cell budding" (pvalues: 1.9 × 10^{2}, 7.8 × 10^{2} and 6.1 × 10^{2} respectively), and the module 2 is functionally associated to "glucose catabolic process", "hexose catabolic process" and "monosaccharide catabolic process" (pvalues: 4.2 × 10^{2}, 5.2 × 10^{2} and 5.8 × 10^{2} respectively). All these terms may be related to some biological process induced by the oxidative and reductive reactions taking place in the cells.

2.
Heat shock modules. We obtained two interesting modules responding to heat shock in yeast. The genomic expression of both modules are shown as in Figures 9(a) and 9(b). We see that these two modules have opposite expression trends after heat stress where the module 1 downregulates while the module 2 upregulates. The analysis with DAVID have shown that the module 1 indeed takes part in the response to heat and temperature stimulus (pvalues: 2.5 × 10^{15} and 3.5 × 10^{23} respectively). Moreover, we find this module is annotated to functional terms like "protein catabolic process" and "cellular macromolecule catabolic process" (pvalues: 5.2 × 10^{7} and 2.5 × 10^{5} respectively). This can be interpreted by the fact that heat shock usually leads to protein unfolding [28]. The module 2 is annotated to functional terms like "ribonucleoprotein complex biogenesis" and "RNA binding" (pvalues: 8.6 × 10^{24} and 7.0 × 10^{7}). This may be because the protein unfolding induces the concurrent ribonucleoprotein complex biogenesis.
Results on Homo Sapiens dataset
We applied the proposed MultiFacTV to Homo Sapiens dataset for exploring biological modules. It was a higherorder time series dataset about genomic expression of multiple sclerosis patients after IFNβ injection treatment. We represented this data as a nonnegative gene×patient×time tensor of size 2920 ×14×9, i.e, there were 2920 genes, 14 patients {A, B, C, D, E, F, G, H, I, J, K, L, M, N} and 9 timepoints. The MultiFacTV was performed with τ_{1} = τ_{2} = 0.5, τ_{3} = 0.75, α = 10 and K = 40. As a result, we found many interesting modules responding to IFNβ treatment similar to [21]. To exploit the usefulness of those modules from a different view, we made use of them to help us group the patients.
With the 40 modules identified, we constructed a binary matrix M of size 14 × 40 representing the membership of each patient to the modules, where m_{ i,j } = 1 if the ith patient was associated to the jth module, otherwise m_{ i,j } = 0. In such case, each of the 14 patients was represented as a 1 × 40 binary vector. Subsequently, we clustered the 14 patients by using kmeans algorithm and the clustering results were {A, B, C, D}, {E, F, G, H}, {J, K, L, M }, {I, N }. This grouping result may suggest some differences of patients in their disease histories or progressions. We believe that this result will be beneficial to the designation of personalized medicine for the patients with multiple sclerosis [29].
Conclusions
As more and more time series biological data are being accumulated from different laboratories or databases, identification of modules with integrative analysis become an important and urgent task. One way to accomplish such integrative analysis is assembling multiple time series biological data into a tensor form. In this paper, we have proposed the MultiFacTV method, which extends the tensor factorization objective by introducing a timerelated regularization term of total variation, to detect modules from such higherorder time series biological data. We have performed the MultiFacTV method on synthetic datasets, Arabidopsis dataset, Yeast dataset and Homo sapiens dataset to test its performance. The results have shown that the proposed MultiFacTV indeed reveals some interesting module patterns. We have shown and validated these interesting findings with DAVID analysis or other analysis.
In this paper, we assume that the multiple time series genomic expression data have the same size, i.e., the same number of genes and the same number of timepoints, so that they can be joined into a tensor. In some cases, the data may be in different sizes. For example, the original Yeast dataset [28] has different number of timepoints for different environmental stresses. In the future, it would be interesting to extend the tensor factorization objective function of Tucker1 or Tucker2 [30] in a similar way to perform integrative module detection for such data.
References
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM et al: Systematic determination of genetic network architecture. Nature genetics. 1999, 22: 281285. 10.1038/10343.
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns. Proceedings of the National Academy of Sciences. 1998, 95 (25): 1486314868. 10.1073/pnas.95.25.14863.
Pentney W, Meila M: Spectral clustering of biological sequence data. Proceedings of the national conference on artificial intelligence. 2005, 20: 845850.
Cheng Y, Church GM: Biclustering of expression data. Proceedings of the eighth international conference on intelligent systems for molecular biology. 2000, 1: 93103.
Wall M, Rechtsteiner A, Rocha L: Singular value decomposition and principal component analysis. A practical approach to microarray data analysis. 2003, 91109.
Lee M, Shen H, Huang JZ, Marron JS: Biclustering via sparse singular value decomposition. Biometrics. 2010, 66 (4): 10871095. 10.1111/j.15410420.2010.01392.x.
Sill M, Kaiser S, Benner A, KoppSchneider A: Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics. 2011, 27 (15): 20892097. 10.1093/bioinformatics/btr322.
Jung I, Kim D: Linknmf: Identification of histone modification modules in the human genome using nonnegative matrix factorization. Gene. 2012, 518 (1): 215221.
CarmonaSaez P, PascualMarqui R, Tirado F, Carazo J, PascualMontano A: Biclustering of gene expression data by nonsmooth nonnegative matrix factorization. BMC bioinformatics. 2006, 7 (1): 7810.1186/14712105778.
MejíaRoa E, CarmonaSáez P, Nogales R, Vicente C, Vazquez M, Yang XY, Garcia C, Tirado F, PascualMontano A: Bionmf: a webbased tool for nonnegative matrix factorization in biology. Nucleic acids research. 2008, 36 (suppl 2): 523528.
Mahoney M, Drineas P: Cur matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences. 2009, 106 (3): 697702. 10.1073/pnas.0803205106.
Paschou P, Mahoney M, Javed A, Kidd JR, Pakstis AJ, Gu S, Kidd KK, Drineas P: Intraand interpopulation genotype reconstruction from tagging snps. Genome Research. 2007, 17 (1): 96107.
Li W, Liu CC, Zhang T, Li H, Waterman MS, Zhou XJ: Integrative analysis of many weighted coexpression networks using tensor computation. PLoS Computational Biology. 2011, 7 (6): e100110610.1371/journal.pcbi.1001106.
Omberg L, Golub GH, Alter O: A tensor higherorder singular value decomposition for integrative analysis of dna microarray data from different studies. Proceedings of the National Academy of Sciences. 2007, 104 (47): 1837118376. 10.1073/pnas.0709146104.
Zhang Y, Du N, Ge L, Jia K, Zhang AD: A collective nmf method for detecting protein functional module from multiple data sources. Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. 2012, 655660.
Narayanan M, Vetta A, Schadt EE, Zhu J: Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS computational biology. 2010, 6 (4): e100074210.1371/journal.pcbi.1000742.
GarciaHernandez M, Berardini T, Chen G, Crist D, Doyle A, Huala E, Knee E, Lambrecht M, Miller N, Mueller LA et al: Tair: a resource for integrated arabidopsis data. Functional integrative genomics. 2002, 2 (6): 239253. 10.1007/s101420020077z.
Edgar R, Domrachev M, Lash AE: Gene expression omnibus: Ncbi gene expression and hybridization array data repository. Nucleic acids research. 2002, 30 (1): 207210. 10.1093/nar/30.1.207.
WeinstockGuttman B, Badgett D, Patrick K, Hartrich L, Santos R, Hall D, Baier M, Feichter J, Ramanathan M: Genomic effects of ifnβ in multiple sclerosis patients. The Journal of Immunology. 2003, 171 (5): 26942702. 10.4049/jimmunol.171.5.2694.
Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, Angelo C, BornbergBauer E, Kudla J, Harter K: The atgenexpress global stress expression data set: protocols, evaluation and model data analysis of uvb light, drought and cold stress responses. The Plant Journal. 2007, 50 (2): 347363. 10.1111/j.1365313X.2007.03052.x.
Supper J, Strauch M, Wanke D, Harter K, Zell A: Edisa: extracting biclusters from multiple timeseries of gene expression profiles. BMC bioinformatics. 2007, 8 (1): 334348. 10.1186/147121058334.
Li XT, Ye YM, Wu QY, Ng MK: Multifactv: Finding modules from higherorder gene expression profiles with time dimension. Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on: 47 October 2012. 2012, 16. 10.1109/BIBM.2012.6392641.
Lin YR, Sun J, Castro P, Konuru R, Sundaram H, Kelliher A: Metafac: community discovery via relational hypergraph factorization. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. 2009, 527536.
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J: Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers. 2011
Ng MK, Weiss P, Yuan X: Solving constrained totalvariation image restoration and reconstruction problems via alternating direction methods. SIAM Journal on Scientific Computing. 2010, 32 (5): 27102736. 10.1137/090774823.
Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA et al: David: database for annotation, visualization, and integrated discovery. Genome Biol. 2003, 4 (5): P310.1186/gb200345p3.
Kreps JA, Wu YJ, Chang HS, Zhu T, Wang X, Harper J: Transcriptome changes for arabidopsis in response to salt, osmotic, and cold stress. Plant Physiology. 2002, 130 (4): 21292141. 10.1104/pp.008532.
Gasch AP, Spellman PT, Kao CM, CarmelHarel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Science's STKE. 2000, 11 (12): 42414257.
Vosslamber S, Baarsen L, Verweij CL: Pharmacogenomics of ifnβ in multiple sclerosis: towards a personalized medicine approach. Pharmacogenomics. 2009, 10 (1): 97108. 10.2217/14622416.10.1.97.
Acar E, Yener B: Unsupervised multiway data analysis: A literature survey. IEEE Transactions on Knowledge and Data Engineering. 2009, 21 (1): 620.
Acknowledgements
Based on "MultiFacTV: finding modules from higherorder gene expression profiles with time dimension", by Xutao Li, Yunming Ye, Qingyao Wu and Michael Ng which appeared in Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on. ©2012 IEEE [22].
X. Li's research was supported in part by NSFC under Grant No.61100190. Y. Ye's research was supported in part by NSFC under Grant No. 61272538, National Key Technology R&D Program of MOST China under Grant No. 2012BAK17B08, Shenzhen Science and Technology Program under Grant No.CXY201107010206A, and Shenzhen Strategic Emerging Industries Program under Grant No. ZDSY20120613125016389. M. Ng's research was supported in part by Centre for Mathematical Imaging and Vision, HKRGC Grant No. 201812.
Declarations
The publication costs for this article were funded by Yunming Ye.
This article has been published as part of BMC Genomics Volume 14 Supplement S4, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S4.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
XL, MN and YY participated in designing the algorithm, drafting and revising the manuscript. XL participated in implementing the algorithm and performing experiments. QW participated in the discussions of experimental results. All authors have read and approved the final version of this manuscript.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, X., Ye, Y., Ng, M. et al. MultiFacTV: module detection from higherorder time series biological data. BMC Genomics 14 (Suppl 4), S2 (2013). https://doi.org/10.1186/1471216414S4S2
Published:
DOI: https://doi.org/10.1186/1471216414S4S2