A modulator based regulatory network for ERα signaling pathway

Background Estrogens control multiple functions of hormone-responsive breast cancer cells. They regulate diverse physiological processes in various tissues through genomic and non-genomic mechanisms that result in activation or repression of gene expression. Transcription regulation upon estrogen stimulation is a critical biological process underlying the onset and progress of the majority of breast cancer. ERα requires distinct co-regulator or modulators for efficient transcriptional regulation, and they form a regulatory network. Knowing this regulatory network will enable systematic study of the effect of ERα on breast cancer. Methods To investigate the regulatory network of ERα and discover novel modulators of ERα functions, we proposed an analytical method based on a linear regression model to identify translational modulators and their network relationships. In the network analysis, a group of specific modulator and target genes were selected according to the functionality of modulator and the ERα binding. Network formed from targets genes with ERα binding was called ERα genomic regulatory network; while network formed from targets genes without ERα binding was called ERα non-genomic regulatory network. Considering the active or repressive function of ERα, active or repressive function of a modulator, and agonist or antagonist effect of a modulator on ERα, the ERα/modulator/target relationships were categorized into 27 classes. Results Using the gene expression data and ERα Chip-seq data from the MCF-7 cell line, the ERα genomic/non-genomic regulatory networks were built by merging ERα/ modulator/target triplets (TF, M, T), where TF refers to the ERα, M refers to the modulator, and T refers to the target. Comparing these two networks, ERα non-genomic network has lower FDR than the genomic network. In order to validate these two networks, the same network analysis was performed in the gene expression data from the ZR-75.1 cell. The network overlap analysis between two cancer cells showed 1% overlap for the ERα genomic regulatory network, but 4% overlap for the non-genomic regulatory network. Conclusions We proposed a novel approach to infer the ERα/modulator/target relationships, and construct the genomic/non-genomic regulatory networks in two cancer cells. We found that the non-genomic regulatory network is more reliable than the genomic regulatory network.


Background
Nuclear receptors (NR) are a superfamily of ligandactivated transcription factors that modulate specific gene expression by interacting with specific DNA sequence upstream of their target gene. So far there are over 100 nuclear receptors identified [1][2][3]. Estrogen receptor (ER) is a member of the nuclear receptor superfamily and is categorized into the class of ligand-dependent steroid receptor in the 1960s. The study explained it controls diverse biological processes by mediating the actions of steroid hormone estrogen and afforded an appreciation of its global importance in cell growth, cellular signalling, differentiation, maturation and homeostasis in eukaryotic cells. Finally, the general pathway for steroid hormone action was subsequently elucidated [4].
Unlike conventional transcription factors, ER is composed of several domains including ligand binding, DNA binding, dimerization, and transcriptional activation. The ligand binding domain participates in several activities including hormone binding, homo-and/or heterodimerization, and transcriptional activation and repression. The binding of the estrogen induces conformational changes in ER that could regulate gene expression by directed interaction with DNA (genomic pathway of ER action) or via an undirected connection with the modulation of some specific proteins (non-genomic pathway) [5,6].
In a gene regulatory network, gene transcription variations are controlled by many transcription factors. It has been established that the presence of regulatory sequences is in the proximity of genes and the existence of proteins is able to bind to those elements and to control the activity of genes by either activation or repression of transcription [7]. To understand gene regulation, the inference of its regulatory network is an important research topic [8]. Recent genomic technology, such as genome wide expression array or sequencing, allows us to elucidate the global gene regulatory mechanisms. Due to the well-developed microarray technology, the wealthy information for gene expression allows us to observe the expression levels of thousand of gene at once and helps more accurately predict gene-to-gene interaction according to its similarity or dissimilarity.
One approach to establish the gene regulatory network is to start from gene-gene correlations or interactions. Many computational approaches have been developed aimed to measure associations between mRNA abundant profiles to predict the transcriptional regulatory interaction. Some attempts at determining gene regulation based on the gene expression clustering algorithm. They group the genes that show similar gene expression using correlation coefficient matrix [9] or mutual informationbased algorithm [10,11] under the same condition [8,9]. However, clustering the resembling genes that are coregulated cannot present much more information about the biological mechanisms of gene regulation or regulatory pathway. Thus, some computational algorithms are proposed to reconstruct the gene networks by applying statistical approaches, such as Relevance Network, Bayesian Network, Linear Regression Network [12], and our own Regulation Network [3].
Relevance Network detects the relatedness between two genes from their gene expression profiles and gives a link between transcription factor and its target gene if correlated [13][14][15][16][17]. The typical methods to calculate the relatedness are Pearson Correlation Coefficient and Mutual information. Pearson Correlation Coefficient provides better performance on detecting linear relationships but it is not as intuitive as the Euclidean distance measure [17]. Mutual information (MI) gives good performance on non-linear relationship. For example, ARACNE algorithm [16] estimates the mutual information between the gene expressions of two genes using Gaussian kernel estimator. The measure of relatedness by MI ranges from 0 to 1. Relevance network is a relatively simple model, which computes the pair-wise similarity or dissimilarity between two genes.
Bayesian Network (BN) can identify casual relationships between variables. The topology a BN can provide the dependence or independence of variable [18], BN algorithm can reveal the dynamics of the gene regulation hierarchy. While BN has its advantage of structure model, it is difficult to inform whether a node (gene) is important to be included. Another challenging is its computational stability. It usually results in multiple optimal networks [19]. The high computational requirement leads to almost impossible of inference to a large-scale regulatory network [20]. Also, BN assumes no gene-gene interaction, which can misrepresent the data.
Our proposed ERα regulatory network is a combination of TF binding affinity estimated from ChIP-seq data, up or down regulation using gene expression, and motif conservation in probe sequences. This approach effectively utilized the genomic or non-genomic actions. Unlike previous regression approaches, this method did not use correlation information.
In this paper, our proposed approach analyzes the interaction between TF and target gene conditioned on a group of specific modulator genes. Also, we consider the change of modulators' expression level to perceive its influence on transcriptional activity. We reconstruct gene regulatory networks in related biological subjects via a multiple linear regression approach with interaction term such that the inferred modulator gene is directly embodied and the relationships of the biological subjects they represent are easily exploited. As a result, this reveals deeper insight on how the structure, function, and behaviour of components evolve.
More quality control analyses were performed. A probe is considered as absent if it is called absent in every time point by the BeadStudio Software, and these absent probes were excluded. Then, gene expression was averaged from the existing duplicate present probes. Genes with the small coefficient of variance are also removed. Using the threshold of 0.15, there were 6418 genes in MCF-7 cell and 13872 genes for ZR-75.1 cell left for the network analysis.

ChIP-seq data analysis
ERα ChIP-seq data were prepared by Tim H. M. Huang's lab and generated for MCF-7 cell before and after E2 treatment 0.5, 1, and 24 hours. Sequencing was conducted with Illumina Genome Analyzer II (GA II) as per manufacturer's instructions. Reads are organized into a contiguous assembly of 24 different strands (one for each chromosome) and mapped to human genome reference sequence (HG18) using ELAND provided by Illumina. Four published peak-calling algorithms were applied to call the ERα binding sites: including CisGenome [21], GeneTrack [22], MACS [23], and SISSRS [24] with FDR = 0.001 to predict ERα binding peaks.

Linear regression model with interaction term
A linear regression model (1), with an interaction term is constructed to describe the relationship among TF (ERα), M (modulator), and T (target).
where (TF, T) are all J×1 vectors gene expression, and J is the total sampling time of gene expression samples; M is a binary variable (0 means low, and 1 means high) derived by the expression of a modulator's expression divided by its median; (a 1 , a 2 , b 1 , b 2 ) are regression coefficients. For parameter setting, we did try the continuous scale for M at the beginning, and found that it often times it was highly sensitive to its skewed distribution because of small sample size. Therefore, a binary M is more robust choice. When the expression of modulator is high (M = 1), Eq. (1) becomes T = (a 1 + a 2 ) + (b 1 + b 2 ) TF; otherwise, T = a 1 +b 1 TF when M = 0. Parameter b 2 presents the change of T in response to TF influenced by M. If b 2 is not statistically significant, the data do not have enough evidence to support M modulates the TF's effect on the target T.

Schematic representation
Using a linear regression model with the interaction term, the regression coefficients (a 2 , b 1 , b 2 ) define the functional relationship of the triplicate (T, TF, M). Eq. (2) defines the correlation indicator for all three regression parameters for any triplicate data analysis. The significant means p-value is less than 0.05.
A schematic overview of the triplicate network is shown in Figure 1. There are two lines between TF and T. These two lines mean two types of network construction according to the rule of modulator in the network. The solid line stands for the direct influence on the relationship between TF and T, which is independent of M; while the dashed line represents that the relationship of TF on T depends on M. Based on the criterion of correlation indicator (CI) described above, there are 27 categories of network behaviours from all combinations.

Modulator gene candidates
We focus our modulators on specific molecular functional classes. Using Gene Ontology (GO) molecular function, several functional classes are included: protein kinase activity, phosphoprotein phosphatase activity, acetyltransferase activity, deacetylase activity, methltransferase activity, transcription factors, and transcriptional cofactors. In these 7 specific GO molecular functions, 485 unique modulator genes are found from the pool of 6418 presented genes in MCF-7 cell. The result of GO analysis is shown in Table 1.

ERa genomic and non-genomic target gene selection
Using ChIP-seq analysis, only genes that have ERα bindings at 0, 0.5, 1, and 24 hours are considered genomic target. Four peak finding algorithms are used in predicting the binding sites: CisGenome, GeneTrack, MACS, and SISSRS with FDR = 0.001. Motif Conservation Score (MCS) [25] of each binding site overlapping with the known genes from the 6418 gene profiles is calculated, and a 95% conservation score is chosen as the threshold. As a consequence, 56 genes appearing in the results of 4 diverse algorithms can regard as the reliable target gene candidates in ERα genomic network. For the non-genomic network, we didn't treat all the other genes as nongenomic targets. As a matter of fact, only genes who don't have any binding signals with all the ChIP-seq peak finding tools in all four time points were selected as the non-genomic targets. There are 5877 genes not appearing in the results of 4 diverse algorithms.

Network construction
Using the triplicates generated from the linear model with the interaction, a network model can be constructed. This initial network will include an enormous amount of modulators which have only limited targets. Therefore, a filtering threshold is developed to keep only modulators with significant number of targets. In this network analysis, all the connections among the modulators, targets, and the TF are assumed at random, though the total number of connections is fixed as the observed number from our previous analysis. After 1000 times shuffling in the network connections, a distribution of target number of any modulator is formed, and 95% threshold is chosen for the modulator selection.

ERa regulatory network construction flowchart
The ERα regulatory network construction goes through data pre-processing, modulator selection, genomic target/ non-target selection, linear regression with interaction, and network construction. Figure 2 shows an integration of analytic workflow for the proposed method.

ERa/modulator categories
Using ChIP-seq data, we predicted 56 ERa genomic targets and 5877 non-genomic targets. Together with our pre-specified modulators, we generated 22 (T, TF, M) categories that involve the interactions with modulator genes. Many categories have FDR < 10% in both genomic and non-genomic (T, TF, M) categories. In particularly, genomic regulatory categories have higher FDRs than the non-genomic categories.
The following are four highlighted examples. Figure 3  (a,b) show the modulator genes mediated the transcription activities, which enhance and repress the expression level of their corresponding target genes. Figure 3(c,d) illustrates that CDK6 possesses the indirect influence on target genes according to the diverse level of its gene expression. Figure 3(c) represents CDK6 is inferred as an activator agonist of TF (ERα). Conventionally, ERα has no function on target gene (LOC652683). When the expression level of CDK6 comes to high, it stimulates ERα to turn into an activator to target gene (LOC652683). On the contrary, CDK6 owns the opposite capability to ERα since TF which makes no impact on target gene (CHD9) when CDK6 expression is high, which are illustrated in the Figure 3(d). It is implicated that each modulator gene can stimulate TF to either activate or repress a large number of targets, depending on the cellular context.

ERa regulatory network analysis
Once a biological process is represented by a network, the analysis of network topology uncovers the functional organization and unknown organizing principles of cellular systems [26]. Many network researches investigate network activity for an active node by using the concept of degrees, defined as the number of edges adjacent to the neighbours. As a node has larger or higher connectivities, Total 485 Figure 2 The analytic workflow of gene regulatory network analysis. An integration of analytic workflow for the proposed method is proposed. GRN construction goes through data preprocessing, modulator selection, genomic target/non-target selection, linear regression with interaction, and network construction.
it represents to be the important connector and has more impact on the signalling pathway. For this reason, we zoom in our scope into a small-scale ERα regulatory networks.
To concentrate on those modulators with a large number of targets, ERα regulatory networks are constructed from the triplicates from Table 2. The thresholds on both MCF-7 and ZR-75.1cell were chosen based   Validation studies with ZR-75.1 In the previous studies [27,28] Figure 6. For the results of the overlapping networks, the statistical significance of comparing the ERa network between two cancer cells is not our primary interests. As we have shown in our previous work, the overlap between the two gene regulatory networks in two different time points after E2 stimulation was very small even in a single cell line. We don't expect a significant overlap between two networks between two cells. The simply want to know what are the overlapped components and non-overlapped components. This description itself is very valuable. In addition, both GO and KEGG don't have estrogen signalling pathways, and we don't feel these analyses will add much to our understanding of the estrogen regulatory network.

Conclusions
This paper proposed a regression model based approach in ERα regulatory network model construction. It characterizes the interaction between ERα and its modulators and their associated gene targets. With additional ERα binding information from ChIP-seq data, we are able to construct ERα genomic and non-genomic regulatory models. Comparing these two networks, ERα non-genomic network has lower FDR than the genomic   network. This was validated by the same network analysis on both ZR-75.1 and MCF-7 cells.