Overview of our approach
The block diagram of our approach PATTERN is shown in Fig. 1. It has a modular structure including four modules. In module 1, pathway data and TF regulated gene sets data from the existing knowledge database are assembled to construct a generic PathRNet. To construct the network, only two types of directed links are allowed, i.e., a link from a signaling pathway node to a TF node is placed if TF is in the downstream of the signaling pathway and a link from a TF node to a signaling or a metabolic pathway is placed if the TF regulated gene set contains genes in the pathway. The resulting PathRNet is a generic network that represents the prior knowledge of regulation under variety of different contexts. In Module 2, context-specific microarray data is introduced and the context-specific PathRNet is obtained by a functional enrichment strategy. Particularly, ECTDISA [14] is applied to determine the enrichment of each node in the generic PathRNet in the time series microarray data and the nodes that are not enriched will be deactivated. Furthermore, ECTDISA also determines the temporal expression pattern (the expression trend and time span) of each node. In Module 3, regulatory PathRNet is constructed to depict the regulatory relationship of nodes based on the expression pattern obtained in Module 2. Inference is carried out to determine the presence of links and their regulatory impact. As a result, a (context-specific) regulatory PathRNet will be obtained. In Module 4, temporal regulatory impact of network during each time period is further analyzed, resulting in a temporal landscape of PathRNet dynamics.
We discuss next the proposed methods for implementing each module.
Generic PathRNet Construction
A generic PathRNet is first constructed based on the existing knowledge database. The prior knowledge databases are collected from NCI-NC [16] ,Transfac [17] and MSigDB [15]. A generic PathRNet is constructed with each node corresponding to a signaling pathway, a transcription factor or a metabolic pathway; and an edge is placed:
a) From a signaling pathway to TFs if TFs are the downstream of the signaling pathways regulate transcription factors.
b) From a TF to signaling or metabolic pathways if TF regulated at least one genes in the pathway.
In the end, a prior network is obtained, which illustrates a generic picture of the interactions between pathways and transcription factors inside a human cell.
Noted that in our approach, there are additional steps to refine the constructed generic PathRNet by removing redundant nodes and edges:
In the context-specific PathRNet construction module, all the nonfunctional nodes and their edges were removed, leaving only the enriched functional pathways /TFs and edges between them.
In the regulatory PathRNet construction module, Bayesian information criterion (BIC) was applied for model selection, and all the edges that are not chosen were removed from the constructed regulatory PathRNet.
In the dynamic PathRNet construction module, (dynamic) edges that indicate no or nominal contributions were further removed from the constructed temporal PathRNet.
Since all the following steps remove redundant nodes/edges, we hope keep as many associations as possible when constructing the generic PathRNet to ensure a high sensitivity of reconstruction. Thus, an edge is placed between a transcription factor (TF) and a pathway as long as the TF regulates at least one gene in the Pathway (the pathway and the TF target gene sets share at least one gene). However, the degree of overlapping between the pathway genes and the TF target gene set might itself indicates the strength of their association. Although this information is not used in our algorithm, it is possible to include this information in further extension to better estimate the regulatory impacts of transcription factors.
Context specific PathRNet construction
To capture the functional components of generic PathRNet and construct a context-specific PathRNet, expression profile is then incorporated. The goal is to identify coregulated pathways enriched by the gene expression profile and their associated time period. To this end, we apply the ECTDISA [14] to the microarray gene expression data. ECTDISA can not only cluster the coregulated genes that are functionally most enriched by a functional annotation database such as pathway databases but also retrieve the timing of the co-regulating clusters.
Particularly, when applying the ECTDISA, a measurement needs to be specified to gauge the coregulation of genes. Since Pearson's correlation is not robust towards noise, and Euclidean distance is not suitable to measure correlation, we proposed a modified Euclidean distance measurement ρ to penalize smaller fold changes without losing the robustness to noise, which is defined as follows:
where x, y are the expression profile of a functional node (pathway or TF), and α is used to penalize smaller fold changes.
Since small fold changes are more vulnerable to noise, the parameter α was chosen to penalize smaller fold change to a reasonable amount. It should not be too small or too large. An experiment is developed on real data to show the sensitivity of the parameter, and result is evaluated in terms of A score and C score, which are defined in [18] for performance evaluation purpose. In general, A score indicates whether all the identified modules are corresponding to true modules; C score indicates whether all the true modules are correctly identified by clustering algorithm. Since our goal is to choose a proper α to uncover all the modules from the data, we are mainly interested in acquiring the highest C score; but at the same time, A score should not be too low. The result is shown in Fig. 2.
It can be seen from the figures that C score reaches its peak when α = 0.3 and at the same time A score is relatively high. This is the value we choose when the approach was applied to real data. To achieve the best result, the use of different α are recommended.
An example of clusters produced by ECTDISA is shown in Fig. 3. A number of the most significantly enriched signaling pathways, transcription factors and metabolic pathways are selected as the functional components in the prior network, and the context specific PathRNet is constructed by retaining only the functional components.
Regulatory PathRNet construction
Note that for now the edges in the functional networks are still defined by the prior knowledge. With the expression data, the edges can be further refined to reflect the regulatory status including up- or down- regulation and its weight. To this end, the expression profiles of the nodes are first determined. For the real case, since even the genes in the same pathways can behave differently, i.e., some are up regulated while some others are down regulated, it possible that the same pathway are enriched in both the up-regulated module and the down-regulated module. To capture the main trend of pathway, we consider the most significantly enriched trend in a pathway by using ECTDISA algorithm. Specifically, for a signaling pathway or a metabolic pathway (nodes labelled with "S-" and "P-" in the network), the expression profile is estimated by averaging all the pathway genes that are most enriched in their temporal trends. Since the trends of their expression profiles are very similar, the averaging expression of those genes is a good estimate of the main trend of the pathway. For example, if at 1 hour, the expressions of 5 genes that are involved in the same pathways are 2, 2.1, 2.2, 0, -3; and the first 3 gene are enriched in a module identified by ECTDISA, then the fold change of the corresponding pathway at 1 hour will be
. Similarly, the expression fold change at other time points can be calculated. Note that, since the expression is calculated by averaging along the gene dimension, the expression fold change of different genes are acquired from exactly the same experimental condition; moreover, the microarray data has been properly normalized [19]
If a pathway is enriched in a temporal transcription module, then the pathway is only active during the inferred time period that the temporal module is active; otherwise its expression is 0. For a transcription factor, the expression fold change profile is inferred by averaging the expression fold changes of its encoding genes.
Next, the regulatory impact is inferred based on the expression of nodes. Let y denote the expression profiles of a node and X = [x1, x2,...,x
k
] the expression profiles of its parent nodes. Then, a linear regulatory model is introduced as:
y
=
X
·
β
+
ε
where β represents regulatory weights and ε is white Gaussian noise. The goal is then to determine the functional parent nodes and the corresponding regulatory weights β. The problem is a variable selection problem and BIC is adopted for the solution. Under the assumption that the model error is normally distributed, the formula for BIC is:
where n is the number of data points in x
i
, k is the number of parameters to be estimated, and the residual sum of squares (RSS) is the sum of squares of residuals, which defines the discrepancy between the data and our estimation model:
RSS=|| y - ŷ||2
where,
is an estimate of y calculated based on the least squares estimate:
Temporal PathRNet construction
In the last module, the dynamic regulatory network is constructed from the basal network by considering the dynamic contribution of the regulatory nodes at each sample time points. Specifically, the dynamic change of the expression level of a node at a sample time t can be expressed as:
where, x
t
= [x1,
t
, x2,
t
,…,x
p,t
], and P is the number of parent nodes of y. The dynamic contribution of all the parent nodes is (x
t
– x
t
–1)∙β, and the dynamic contribution of a parent node from t – 1 to t is:
. Then, the dynamic regulatory networks can be obtained by refining the edges at each t. Particularly, if
is small, the corresponding edge can be removed indicating no or nominal contribution. Otherwise, the width and the color of the edge will be defined by
to illustrate the regulatory influence of the node onto a child node during a particular sample time period. Although small regulatory contribution
does lead to edge removal and thus change of network structure, it is mainly the regulatory time period inferred by the ECTDISA that defines the dynamics of the network structure.