Signed weighted gene co-expression network analysis of transcriptional regulation in murine embryonic stem cells
© Mason et al. 2009
Received: 17 February 2009
Accepted: 20 July 2009
Published: 20 July 2009
Skip to main content
© Mason et al. 2009
Received: 17 February 2009
Accepted: 20 July 2009
Published: 20 July 2009
Recent work has revealed that a core group of transcription factors (TFs) regulates the key characteristics of embryonic stem (ES) cells: pluripotency and self-renewal. Current efforts focus on identifying genes that play important roles in maintaining pluripotency and self-renewal in ES cells and aim to understand the interactions among these genes. To that end, we investigated the use of unsigned and signed network analysis to identify pluripotency and differentiation related genes.
We show that signed networks provide a better systems level understanding of the regulatory mechanisms of ES cells than unsigned networks, using two independent murine ES cell expression data sets. Specifically, using signed weighted gene co-expression network analysis (WGCNA), we found a pluripotency module and a differentiation module, which are not identified in unsigned networks. We confirmed the importance of these modules by incorporating genome-wide TF binding data for key ES cell regulators. Interestingly, we find that the pluripotency module is enriched with genes related to DNA damage repair and mitochondrial function in addition to transcriptional regulation. Using a connectivity measure of module membership, we not only identify known regulators of ES cells but also show that Mrpl15, Msh6, Nrf1, Nup133, Ppif, Rbpj, Sh3gl2, and Zfp39, among other genes, have important roles in maintaining ES cell pluripotency and self-renewal. We also report highly significant relationships between module membership and epigenetic modifications (histone modifications and promoter CpG methylation status), which are known to play a role in controlling gene expression during ES cell self-renewal and differentiation.
Our systems biologic re-analysis of gene expression, transcription factor binding, epigenetic and gene ontology data provides a novel integrative view of ES cell biology.
Embryonic stem (ES) cells have two important characteristics: pluripotency, the ability to differentiate into any type of cell in the body, and self-renewal, the ability to replicate indefinitely. As such, they have tremendous therapeutic potential for regenerative medicine [1, 2]. Current work focuses on understanding and extending the network of genes that controls these key characteristics [3–16]. These efforts identified ES cell-specific transcription factors (TFs) that are differentially expressed between ES cells and differentiated cells (fibroblasts). Several studies have identified the targets of these TFs and the mechanism by which they regulate them [4, 8, 17]. Highly differentially expressed TFs (Oct4, Sox2, c-Myc, and Klf4) have been found capable of reprogramming fibroblasts to a pluripotent State .
While standard differential expression analysis techniques have led to remarkable discoveries they ignore the strong correlations that may exist between gene expression profiles. As a consequence, the user of a standard marginal analysis can drown in information but starve in knowledge. This is especially true when considering ES cells where many genes change expression during differentiation. For example, in a data set from Zhou et al 2007, which we consider below, more than 6200 genes were highly differentially expressed (Student t-test p-value smaller than the very stringent threshold of 10-6). It is difficult to further prioritize these genes and to learn the underlying biological pathways. In contrast, co-expression networks, also referred to as 'association,' 'correlation,' or 'influence' networks [18–22], realize that genes can be highly correlated and thus can be grouped into large clusters (co-expression modules). For example, our network analysis of the same data organizes the genes into only 8 large modules. Next our module-centric analysis focuses on understanding the modules and their key regulators. Since it applies significance testing to the level of modules, co-expression network analysis may greatly alleviate the multiple testing problem that plagues standard gene-centric methods . Gene co-expression network methods have been successfully applied in a variety of different settings [18, 19, 21, 22, 24–32].
In this article, we demonstrate that a co-expression network analysis of stem cell data sets provides novel biological insights that cannot be found using conventional techniques. Using external data (including gene ontology, TF binding data, epigenetic regulators), we also contrast the performance of signed and unsigned network construction methods. We find that signed co-expression network analysis performs best in this stem cell application. We identify pluripotency and differentiation related co-expression modules and novel ES cell regulators.
As the unsigned measure , the signed similarity takes on a value between 0 and 1. Note that the unsigned similarity between two oppositely expressed genes (cor(x i , x j ) = -1) equals 1 while it equals 0 for the signed similarity. Similarly, while the unsigned co-expression measure of two genes with zero correlation remains zero, the signed similarity equals 0.5.
Next, an adjacency matrix (network), A = [a ij ], is used to quantify how strongly genes are connected to one another. A is defined by thresholding the co-expression similarity matrix S = [s ij ]. 'Hard' thresholding (dichotomizing) the similarity measure S results in an unweighted gene co-expression network. Specifically an unweighted network adjacency is defined to be 1 if s ij > τ and 0 otherwise, i.e. two genes are considered connected if their similarity measure is above a given threshold τ, and are considered separated otherwise.
A major step in our module centric analysis is to cluster genes into network modules using a network proximity measure. Roughly speaking, a pair of genes has a high proximity if it is closely interconnected. We will use the convention that the maximal proximity between two genes is 1 and the minimum proximity is 0. Specifically, we define the proximity as the topological overlap measure (TOM) [35–37] which can also be defined for weighted networks . The TOM combines the adjacency of two genes and the connection strengths these two genes share with other "third party" genes (see equation 6 in the Methods section and Additional file 1). The TOM is a highly robust measure of network interconnectedness (proximity). This proximity is used as input of average linkage hierarchical clustering. Modules are defined as branches of the resulting cluster tree . This module detection procedure has been used in many applications [23, 25–30, 32, 39, 40] and a comparison to alternative procedures is beyond the scope of this article.
We find it convenient to summarize the gene expression profiles of a given module with the module eigengene, which can be considered as the best summary of the standardized module expression data [33, 41]. The module eigengene of a given module is defined as the first principal component of the standardized expression profiles (see equation 8 in the Methods section).
where n (q) is the number of genes in the q th module. In the case of an unweighted network, simply counts the number of connections to gene i within the q th module. Intramodular connectivity can be interpreted as a measure of module membership: the higher the intramodular connectivity, the more centrally located the gene is in the module and the more certain is its membership with regard to this module. In signed networks, these highly connected hub genes may up-regulate adjacent genes since they are positively correlated with them, while in unsigned networks they may activate or repress their neighboring genes.
where E (q) is the eigengene of the q th module (see equations 9 and 10 in the Methods section) and x i is the expression profile of the gene i. We denote modules by colors. For example, denotes the module membership measure of the i-th gene with regard to the blue module.
Module eigengene based connectivity has several advantages over intramodular connectivity: first, it is naturally scaled to take on values between -1 and 1; second, one can use a correlation test to calculate a corresponding p-value for a gene's module membership; third it can be used in signed networks to identify genes that are anti-correlated with a given module eigengene (i.e. they may repress genes in the module), and fourth, k ME can be computed for any gene on the array (not just genes used in the network construction). In practice, we found that intramodular and module eigengene based connectivity are highly correlated (Additional File 2). A priori, the connectivity measures defined in equations 4 and 5 are quite different. But we show in the Methods section that a simple theoretical relationship between them can be derived in the context of a signed co-expression module. Due to its advantages, we used the module eigengene based connectivity as the measure of module membership in our applications.
We generated unsigned and signed co-expression networks to analyze over 17,000 genes measured across 70 expression arrays from data published in Ivanova et al (2006) . This data set contains expression profiles of ES cells individually depleted for the transcription factors Oct4, Nanog, Sox2, Esrrb, and Tbx3 by RNA interference (RNAi). The data set also includes expression profiles for RNAi knock downs of Tcl1 a co-activator of AKT kinase, and an EST (Mm343880), along with expression profiles of control ES cells carrying an empty RNAi vector and of ES cells differentiated by retinoic acid (RA). Each of these treatments was sampled over approximately eight days. To compare the performances of unsigned and signed WGCNA in identifying gene groups that are important for the regulation of the pluripotent State, we defined gene modules in unsigned and signed networks and assessed module function and importance by determining gene ontology terms associated with each module and examining module membership of genes known to play a role in ES cells. In addition, we analyzed how genes of a given module are bound by chromatin regulators or pluripotency TFs by incorporating independent promoter binding information.
Next we used external data to further study the gene modules defined by the networks and reveal their functional roles. We used two different strategies for this evaluation: first we assigned transcription factors and other regulators with known roles in pluripotency, self-renewal or differentiation to modules [4, 8] and second, we incorporated genome-wide binding data for transcription factors and other regulators implicated in ES cell regulation in order to determine if these modules contain genes that are directly controlled by ES cell related TFs or differentiation suppressors [5, 6, 16].
Many genes known to maintain the pluripotent State of ES cells are found in the black module in the signed network. We defined a measure of gene significance (GS) as the t-statistic from the paired Student's t-test of expression in control RNAi samples and ES cell samples with RNAi knock down of Oct4 (paired by day of treatment). Figure 2c shows GS plotted against its module eigengene based connectivity, k ME , in the black and blue modules of the signed network with marker genes labeled. Since the signed module membership k ME is defined as the correlation between a gene expression profile and the module eigengene, its values lie between -1 and 1 with values near 1 signifying strong module membership to the corresponding signed module. Figure 2c shows a strong linear relationship between k ME and GS in the black module (correlation = 0.5, p-value = 6.5e-13). As expected, most of the genes whose RNAi knock down induced ES cell differentiation in Ivanova et al  belong to the black module (Oct4, Nanog, Sox2, Esrrb, and Dppa4, Fisher's exact test p-value = 3.2 × 10-5). Oct4's high connectivity ( = 0.94) makes it a hub gene in the black module, consistent with its known role as a master regulator of the pluripotent State. Furthermore, many genes that are known to be highly expressed in ES cells are also in the black module (e.g. Klf4, Utf1, and Phc1). Klf4 is one of the four TFs that can reprogram differentiated cells into a pluripotency State . Utf1 interacts with Oct4, affects chromatin regulation in ES cells, and has recently been shown to improve reprogramming efficiency [43–45]. Phc1 is a Polycomb Group (PcG) protein. PcG proteins repress genes that become active upon differentiation of ES cells by mediating histone H3 lysine 27 tri-methylation and histone H2a ubiquitination . The blue module contains Gata6 and Gata4, which are both highly connected ( = 0.93 and 0.88, respectively). These TFs are markers of ES cell differentiation, particularly into endoderm. Below we provide further evidence that the black and blue modules are related to pluripotency and differentiation respectively.
We incorporated genome-wide binding data for TFs (Oct4, Sox2, Nanog, Stat3, Smad1, cMyc, nMyc, Zfx, E2f1) and other regulators (Suz12) implicated in the maintenance of pluripotency and self-renewal, which were obtained by chromatin immunoprecipitation (ChIP) and massive parallel sequencing (ChIP-seq) by Chen et al (2008) . Oct4, Sox2, Nanog, Smad1, and Stat3 are referred to as the Oct4 group of TFs, as they have been shown to often co-bind genomic regions; cMyc, nMyc, E2f1, and Zfx are referred to as the cMyc group of TFs because they also co-bind genomic regions . Together TFs in the Oct4 and cMyc group are thought to activate expression of genes involved in pluripotency and self-renewal. Suz12, is a subunit of the histone H3K27 methyltransferase PcG protein complex, which represses genes that are activated upon differentiation [6, 16].
Transcription Factor Binding in Ivanova et al Networks
No. of genes
Mammalian gene promoters are known to fall into one of at least two major classes: 1) CpG-rich promoters are associated with both ubiquitously expressed 'housekeeping' genes, and genes with more complex expression patterns, particularly those expressed during embryonic development and 2) CpG-poor promoters are generally associated with highly tissue-specific genes. To understand the role of CpG content in our modules we analyzed three CpG content classifications from Mikkelsen et al(2007): high (denoted HCP), low (LCP), and intermediate (ICP). Figure 3 shows that HCP genes contain significantly more black module genes (p = 1.3 × 10-51) and significantly more blue module genes (p = 2.4 × 10-36) than ICP or LCP genes. The LCPs are known to have a very different trimethylation pattern than the HCPs. Few (6.5%) of LCPs have significant H3K4me3 in ES cells and virtually none have H3K27me3. HCPs and LCPs are subject to distinct modes of regulation. In ES cells, all HCPs seem to be targets of trithorax group activity, and may therefore drive transcription unless actively repressed by PcG proteins. In contrast, LCPs seem to be inactive by default, independent of repression by PcG proteins, and may instead be selectively activated by cell-type- or tissue-specific factors .
Figure 3 also shows promoter CpG methylation in relation to module membership. DNA methylation in mammalian cells plays multiple roles in cell physiology, including genome stability, repression of endogenous retroviral elements, genomic imprinting. Levels of DNA methylation are dynamically regulated during embroyogenesis but less is known about the role DNA methylation play in gene expression and maintenance of pluripotency in ES cells . Figure 3 shows that methylated genes are significantly under-enriched for black module (p = 2.0 × 10-14) and significantly under-enriched for blue module genes (p = 5.1 × 10-11). In Additional file 5, we present the data used for cross-referencing module membership to epigenetic regulators.
Module Membership Versus Epigenetic Variables
Source of Variation in kME
kMEblack, Total Prop Var Explained = 8.3%
kMEblue, Total Prop Var Explained = 4.2%
Degrees Of Freedom
Sums of Sq
Prop. Of Total Var
p-value (F test)
Sums of Sq
Prop. Of Total Var
Histone Trimethylation (K4, K27, K4&K27, none)
CPG class (HCP, ICP, LCP)
To further investigate WGCNA's ability to discover functionally important groups of genes, we turned to an independent data set from Zhou et al (2007) . In this study, ES cells were removed from feeder cells and leukemia inhibitory factor (LIF) to induce differentiation. During the course of differentiation, cells were separated based on expression of an Oct4 green fluorescent protein (GFP) reporter gene. Multiple samples were taken from undifferentiated ES cells and cells sorted at days 2, 4, 8, and 15 for high and low Oct4 expression. As before, we first identified gene modules via signed and unsigned methods and then related module membership to external data. In the following we show that a pluripotency/self-renewal and a differentiation module can be found in this new data set. For consistency between data sets, we have colored these modules black and blue, respectively.
Transcription Factor Binding in Zhou et al Networks
No. of genes
Functional Pathways in Highly Connected Pluripotency and Differentiation Related Genes in the Zhou et al Network
Blue Module highly connected genes (kME)
er-golgi transport; protein localization; protein transport; vesicle-mediated transport; secretion by cell; cellular localization; secretory pathway; intracellular transport;
Myl6 (0.994), Sh3glb1 (0.993), Tm9sf3 (0.993), Tram1 (0.992), Derl1 (0.991), Serinc1 (0.991), Lman1 (0.991), Lrp10 (0.991), Mcfd2 (0.99), Mcfd2 (0.99), Tmed10 (0.99), Tpcn1 (0.989), Arl1 (0.989), Tinagl (0.987), Rab2 (0.987), Txndc1 (0.987), Col4a1 (0.987)
Glycan structures - biosynthesis 1; signal-anchor; transferase activity, glycosyltransferase
Glt8d1 (0.993), Creb3 (0.991), Fut8 (0.99), Fkrp (0.99), Extl2 (0.989), Glt8d3 (0.987), Itm2c (0.986), Hs3st1 (0.986), Pofut2 (0.986), Dpagt1 (0.985), Mgat2 (0.983), Abhd6 (0.982), Ddost (0.982), Ndst2 (0.981), B4galnt1 (0.981), St3gal6 (0.98)
membrane; transmembrane; transmembrane region; topological domain:Cytoplasmic
H13 (0.996), Pdgfra (0.994), Cd59a (0.994), Glt8d1 (0.993), Sh3glb1 (0.993), Tm9sf3 (0.993), Tram1 (0.992), Gdpd5 (0.991)
organ development; system development; anatomical structure morphogenesis; cell differentiation; organ morphogenesis
Pdgfra (0.994), Myl6 (0.994), Sh3glb1 (0.993), Lmo4 (0.992), Rgnef (0.989), Syvn1 (0.988), Kit (0.988), Fndc3b (0.988), Txndc1 (0.987), Lama1 (0.987), Barx1 (0.986), Col4a2 (0.986), Ctgf (0.985), Fgf3 (0.985), Crim1 (0.983), Pthr1 (0.983)
Black Module highly connected genes (kME)
response to DNA damage stimulus; DNA damage; DNA repair
Msh6 (0.993), Rif1 (0.983), Mre11a (0.982), Setx (0.974), Xrcc5 (0.971), Chek1 (0.968), Xab2 (0.967), Xrn2 (0.967), Trp53 (0.959), Npm1 (0.958), Tdp1 (0.955), Bccip (0.954)
Mitochondrion; transit peptide; Mitochondrion
Mrpl15 (0.992), Ppif (0.991), Mrps5 (0.987), Hspa9 (0.984), Coq3 (0.984), Tst (0.981), Mrpl45 (0.98), Akap1 (0.979), L2hgdh (0.978), Mrps31 (0.978), Chchd4 (0.976), Abce1 (0.975), Dci (0.975), Fpgs (0.974), Mrpl39 (0.973), Bdh1 (0.971)
nucleus; biopolymer metabolic process; DNA binding; cellular metabolic process; Transcription regulation;
Msh6 (0.993), Pes1 (0.991), Zic3 (0.991), Uchl1 (0.99), Rnf138 (0.99), Rnf138 (0.99), Wdr36 (0.989), Pou5f1 (0.989), Rbpj (0.987), Glo1 (0.987), Tdgf1 (0.987), OTTMUSG00000010173 (0.986), Aarsd1 (0.986), Nup133 (0.985), Xpo1 (0.985), Xpo1 (0.985), Dnajc6 (0.985), Klhl13 (0.984), Dppa4 (0.984),
cell cycle phase; cell cycle process; cell cycle; mitotic cell cycle; mitosis; cell division
Pes1 (0.991), Rif1 (0.983), Mre11a (0.982), Gtpbp4 (0.972), Chek1 (0.968), Mnat1 (0.966), Rcc2 (0.964), Gadd45gip1 (0.963), Rpa1 (0.961), Hells (0.96), Trp53 (0.959), Terf1 (0.959)
Table 4 also shows significant GO terms for genes with the 5% highest . Given that many pluripotency TFs are in the black module, it is not surprising that the functional classifications, DNA binding and transcriptional regulation, are significantly enriched (p-value = 5.4 × 10-8). However, two functional classifications, DNA damage/repair and mitochondrial function, are more significantly enriched than the transcriptional regulation group (p-values = 2.0 × 10-8 and 3.8 × 10-8, respectively) suggesting that these pathways play important roles in maintaining pluripotency and self-renewal.
We also used Ingenuity Pathway Analysis, IPA, to compare functional enrichment in the pluripotency and differentiation modules (Ingenuity Systems, http://www.ingenuity.com). Additional file 7 shows that functional groups similar to those found using DAVID are enriched in the black and blue module respectively. Cell cycle and DNA replication, recombination, and repair are enriched in the black module compared to the blue module and skeletal, muscular, and cardiovascular system development are enriched in the blue module.
Here we compare some of our WGCNA results with those of a standard differential expression analysis. In Figure 2c and Figure 5 we showed that for some modules a strong relationship between module membership (k ME ) and differential expression (gene significance/fold change) can be observed. In , we provide a geometric description of modules for which such a relationship can be observed. While a close relationship may exist between k ME and a Student t-test statistic, it does not imply that corresponding gene ranking procedures are equivalent.
Here we compare signed WGCNA to standard differential expression methods using three different approaches. First, we show that a gene ranking based on k ME is more consistent (reproducible) than that based on the Student t-test in our data. Specifically, we computed two gene rankings for the Ivanova et al data set, one ranked by t-statistic and the other by connectivity to a module of interest. We similarly computed two such rankings for the Zhou et al data set and studied the overlap between the two data sets (Additional File 8). Of the 1000 genes most significantly down regulated upon differentiation in each data set 139 overlap (hyper-geometric p-value = 1.0 × 10-20). However, when ranking genes by connectivity to each data set's pluripotency model there is an increase in overlap to 230 (p-value = 1.7 × 10-75, Additional file 8). This increased consistency is also seen in genes up regulated upon differentiation where 77 genes overlap between the two data sets (p-value = 0.02) when ranking by t-statistic and 161 genes overlap when ranking by connectivity to the differentiation modules (p-value = 2.8 × 10-31, Additional file 8).
We similarly compare the 1000 most highly connected genes in the differentiation (blue) module and the 1000 genes that are most significantly up regulated upon Oct4 RNAi in the Ivanova network. Interestingly, only five genes overlap (Figure 6). By examining those genes that do not overlap we see that ranking by connectivity yields greater significant enrichment for many functional groups important in ES cell differentiation including Organ Development, Tissue Development, Cell Morphology etc.
Similar analysis of pluripotency genes in Zhou et al yields consistent results with DNA Replication, Recombination, and Repair being more enriched when ranked by connectivity (Additional File 9) while analysis of highly connected genes in the differentiation module shows that differential analysis moderately out performs ranking by connectivity. The differences in functional enrichment in the Zhou et al data set are subtle given that there is more overlap between the two rankings (Additional File 8). This large overlap is likely due to the simpliCity of the expression array samples which are filtered into only two groups, those that exhibit Oct4 expression and those that do not. Meanwhile, signed WGCNA is especially useful in Ivanova et al where smaller overlap is caused by the complexity of the expression samples which are made of many different RNAi treatments.
A third approach for comparing gene rankings is to use the enrichment with regard to epigenetic and transcriptional regulators. In Additional file 4 we relate different gene rankings to enrichment significance with regard to the following variables (a) histone H3K4 alone versus all others, (b) bivalent H3K4 & H3K27 versus all others , (c) High CPG class versus all others (i.e. HCG versus ICG and LCG), (d) promoter CPG methylation status , (e) Oct 4 complex binding status, (f) cMyc complex binding status. We report results for 3 different gene rankings using the Ivanova data: the black and blue curve represent gene rankings according to and , respectively. The grey curve represents ranking according to a Student T-test of differential expression. Additional file 4 shows that black and blue module genes can have very different enrichment results that tend to be very different from those of a standard analysis. This analysis illustrates how module membership provides important complementary variables along the Student t-test for understanding differences between genes.
The increased functional enrichment and improved consistency between data sets suggest that signed WGCNA is a complementary method to standard differential analysis. In practice, we recommend to use both k ME and the Student t-test to find highly differentially expressed intramodular hub genes.
Motif Enrichment in Genes bound by Oct4 or cMyc TF Groups
No. of genes
No. of genes
To understand how signed WGCNA is better able to separate genes into functional modules in the Ivanova data set, we plotted genes in the signed black or turquoise module relative to the unsigned turquoise module eigengene (Additional File 10). Note that genes located in the black and turquoise modules in the signed network are clearly separated into two clusters. Because a module eigengene is defined as the first principle component of its module, it describes the main direction in which the module's gene expressions vary. Note that the signed module eigengenes are oriented in the direction of their clusters. The direction of the unsigned turquoise module eigengene is more difficult to interpret. Because the turquoise module in the unsigned network contains two distinct signed modules (black and turquoise), its module eigengene describes the variance between these two sub-modules and the variance within the larger sub-module, the signed turquoise. As such, the unsigned turquoise module eigengene fails to quantify the true importance of highly connected genes in the signed black module. For example, Oct4's in the unsigned turquoise module is -0.74 while it is 0.94 in the signed black module. Thus, Oct4 is not identified as a hub gene in the unsigned network while it is clearly a hub gene in the signed network.
We show that a systems biology approach, which utilizes gene expression, transcription factor binding, genomic, epigenetic and gene ontology data, can be improved by accounting for the sign of co-expression relationships. We also show that signed WGCNA has advantages over standard differential expression methods. Specifically, signed WGCNA has more consistent gene rankings between data sets (see Additional file 8), is better able to identify functionally enriched groups of genes (Figure 6), and its focus on module eigengenes circumvents the multiple testing problems that plague standard gene-based expression analysis. Below, we highlight several novel stem cell related genes that would not have been found using a standard differential expression analysis.
Signed WGCNA provides novel insight into murine ES cell biology, which unsigned WGCNA is unable to provide. Applying these signed methods to previously published data, we identified pluripotency and differentiation gene modules not found in unsigned networks or differential analysis. The results of signed WGCNA are robust as it identifies similar modules in independently published data sets. We show that module eigengene based connectivity k ME is valuable for annotating genes with regard to module membership and for identifying genes related to pluripotency and differentiation. As a resource, we provide a module membership annotation for each gene with regard to the signed modules (Additional Files 11 and 12).
Many current studies focus on the role transcriptional regulators play in ES cell maintenance. As expected, the pluripotency module is enriched with genes active in transcriptional regulation, e.g. Oct4, Sox2, Klf2, Nanog, Jarid1b, Jarid2, Nodal, Tgif1, and Esrrb, and contains other genes expected to play a role in ES cell function, such as Dppa4 and Dppa5. The module also contains genes that have recently been shown to be necessary for maintaining the pluripotent State, Nup133 and Utf1 [45, 62].
Interestingly, the pluripotency module contains genes with roles in two other pathways, DNA repair and mitochondrial function, which are not found by standard differential analysis. The enrichment for genes that respond to DNA damage is not surprising given that ES cells spend a larger portion of their cell cycle in S phase and have a shorter G1 phase than differentiated cells . An emphasis on accurate DNA replication is expected since it helps ES cells maintain a stable genome and prevents errors from being inherited by differentiated cells. Mitochondria in ES cells may assist in the prevention of DNA damage . During aerobic production of adenosine triphosphate (ATP), mitochondria leak superoxides leading to the creation of reactive oxygen species (ROS), which damage DNA. ES cells, however, produce ATP anaerobically and thus minimize the amount of DNA damaging ROS [69, 71]. ES cells also have fewer mitochondria than differentiated cells and their mitochondria are smaller, have fewer cristae, lack dense matrices, and are perinuclearly located [69–71]. Our use of signed WGCNA reveals that in addition to genes involved in transcriptional regulation, genes that prevent or repair DNA damage are key to maintaining pluriotency and self-renewal.
Figure 3 reports significant relationships between module membership, chromatin structure and epigenetic modifications (histone modifications and DNA methylation), which are known to play a role in controlling gene expression during ES cell self-renewal and differentiation. While the relationships are highly significant, we find that epigenetic variables and binding data explain only 8.3% of the variation in module membership and 4.3% of the variation of (Table 2). In Additional file 5, we provide gene annotations with regard to module membership, transcription factor bindings, histone trimethylation status, CpG DNA methylation etc.
Using module eigengene based connectivity we find that many known differentiation related genes are highly connected in the differentiation (blue) module, Cited2, Gata4, and Gata6, along with Ctsl, which has recently been shown to be active in differentiation . We also find that Uqcrh, a gene involved in the electron transport chain, is highly connected in this module, lending support to the argument that ES cell mitochondria differ from those in differentiated cells. Module eigengene based connectivity enabled us to identify novel candidate genes in the differentiation module, like Uqcrh, that warrant experimental validation (Figure 8). For the pluripotency module interesting candidate genes are Msh6, Ppif, Sh3gl2, Rbpj, Elk1, Nrf1, Nup133, Mrpl15, and Zfp39 (Figures 7 and 8). These genes lack significant fold change but are highly connected and thus would not be found using standard differential analysis. Using sequence data with motif analysis we confirm the importance of two genes, Nr5a2 and Elk1, computationally.
We use gene ontology information and literature results to provide strong statistical evidence that these candidate genes are very promising and justify further biological study. Our article provides a resource in form of module based gene annotation tables that could form the starting point of future biological validation studies. Depending on their function, these candidate genes can be tested by RNAi knock down, viral infection in order to increase the efficiency of reprogramming, or, if they bind DNA, analyzing their binding sites. Our article demonstrates that signed WGCNA not only identifies many well known ES cell regulators; it also yields novel insights regarding ES cell function.
Our statistical methods are implemented in the WGCNA R software package . For example, a signed network using the power β = 12 is constructed with the R command ADJ = adjacency (datExpr, power = 12, type = "signed").
where a ij is the above defined adjacency, l ij = ∑u≠i,j a iu a uj , and k i = ∑u≠i a iu .
where the latter approximation assumes that the number of genes n is large. For the above situation (with β = 1) this implies ≈ a unsigned = 1 and ≈ a signed = 0. Thus, the two genes have high interconnectedness in an unsigned network but zero interconnectedness in a signed network. Additional file 1 (part b) shows a simple network where genes 1 and 2 are oppositely correlated with their neighbors. Here the choice of gene co-expression measure results in a very different topological overlap measures, which in turn, leads to different modules.
where E (q) is the module eigengene of module q, , and . Equation (11) implies an approximate linear relationship between and if β = 1. Using real data, we illustrate this relationship in Additional file 2.
Methods developed in Zhou et al  were used to scan sequences for motifs with pre-defined position specific weight matrices. Sequences were determined by extending bound ChIP-seq sites 150 bp up and downstream resulting in regions approximately 330 bp long. Any overlapping regions were then joined into larger meta-regions. A set of control sequences was scanned to determine a motif enrichment ratio. The control group was created by randomly sampling 5,000 probes from the Agilent Mouse Promoter Whole Genome ChIP-on-chip Microarray Set. These probes are distributed -5.5 kb upstream to +2.5 downstream of approximately 17,000 known gene transcription start sites from UCSC's version mm8 genome.
Overlapping control probes were merged into meta-regions as described above. Enrichment of a motif is defined as where M T is the number of observed sites and λ is the number of expected sites, , where M C is the number of sites in the control, N C the length of all control sequences, and N T the length of all bound sequences. Statistical significance is determined by the Poisson distribution with λ as the mean.
SH is supported 1U19AI063603-01, 5P30CA016042-28, P50CA092131, and DK072206. KP is supported by DP2 OD001686-01, RN1-00564-1, RL1-00681-1, and P01 GM081621-01A1. QZ is supported by NSF DMS-0805491. We thank Mark Chin, Noah Dowell, Tom Drake, Tova Fuller, Dan Geschwind, Peter Langfelder, Jake Lusis, Paul Mischel, Mike Oldham, Bernadett Papp, Rupa Sridharan, and Jason Tchieu for useful discussions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.