Incongruences in previous cell cycle lists
Four previous cell cycle studies [6, 7, 9, 10] identified gene sets with periodic expression ranging from 480 cell cycle genes in fibroblasts to 1871 in U2OS cells. Grant et al. noted large differences between the gene lists (Additional file 1: Figure S1A) and phase assignation of the cell cycle genes based on their peaking times exhibited further incongruences. Three studies identified five different phases, namely: G1/S, S, G2, G2/M and M/G1 while in Bar-Joseph data S phase was not assigned. Percentage and number of genes assigned to the phases varied greatly across studies. Bar-Joseph et al. assigned 43% of the genes (221 genes) to G2/M phase while Grant et al. 21% (598 genes) to the same phase (Additional file 1: Figure S1B). Similarly, G2 phase accounted for 6% (29 genes) of the Bar-Joseph gene list and it comprised 21% (203 genes) of Peña-Diaz list. Only 18 genes were annotated as G2/M and 16 genes as G1/S consistently across all four studies while for the other phases (S, G2, M/G1), not even a single gene was identified by all studies (Additional file 1: Figure S1C). We therefore set out to perform a systematic re-analysis of the human core cell cycle transcriptome.
Data processing and generation of a clustered network graph
Data was collected from four microarray gene expression studies [7, 9, 10, 19] generated from four human different cell types: NHDF (primary fibroblasts), HeLa (cervical cancer cell line), HaCat (immortal keratinocytes), and U2OS (osteosarcoma cells), respectively. Low quality arrays were discarded by performing a number of QC metrics (see Methods) (Additional file 1: Figure S2–S5). Each sample set was then normalized separately using RMA normalization and data was further investigated using PCA plots to assess the presence of subtler batch effects and further samples were discarded (Additional file 1: Figure S2–S5). We next mapped probe sets from each dataset to Entrez IDs which were then used as reference to generate a collated dataset of 11,693 unique Entrez IDs and 159 samples. As the average intensity of each sample set was variable (Additional file 1: Figure S6A) we used ComBat, a batch correction algorithm which uses Empirical Bayes models to adjust for batch effects in the data (Additional file 1: Figure S6B). A correlation network from this data was reconstructed using BioLayout Express
3D
. [16] After selecting for genes with Pearson’s correlation coefficient (r) ≥ 0.60, which was high enough for correlation not to occur by chance (Additional file 1: Figure S7A), the resulting network contained 3157 nodes (genes) connected by 21,858 edges (correlations). The network was then clustered using the MCL cluster algorithm [20] generating 68 different clusters of which six are here reported as showing reproducible pattern of expression across all the cell types and/or including relevant biology (Fig. 1a). Other clusters included noisy expression patterns which did not reproduce across samples or reflected artifact expression (Additional file 1: Figure S7B). Cluster 11 for example showed a sharp peak in a single sample not seen in the replicate samples and was therefore not considered for further analysis (Additional file 1: Figure S7C).
Clusters with G1/S-S and G2-M phase specific gene expression
Genes in clusters 4 and 6 were maximally expressed during S and M phases consistent across all cell line assessed (Fig. 1b, blue and red profiles). The expression of genes in cluster 4 was up-regulated upon release from the double-thymidine block (which synchronises cells at the G1/S transition) and raised approximately 16 h after serum-refeeding in the serum starvation experiment in fibroblasts (Fig. 1b, blue profile). This is in agreement with an increased enrichment of S phase-related genes in human fibroblasts reported previously [21–23]. Conversely, cluster 6 genes exhibited low expression upon release from the thymidine-block followed by up-regulation at around 8–12 h and after around 24 h in the starvation experiment (Fig. 1b, red profile). Gene Ontology enrichment analysis of the 155 genes in cluster 4 demonstrated a highly significant enrichment for biological processes linked with S phase including DNA replication (P = 1.58 10−55), DNA repair (P = 1.31 10−39) and G1/S transition (P = 4.3 10−26) (Fig. 1c, blue barplots). The 143 genes found in cluster 6 were instead highly enriched for mitosis-related biological processes such as mitotic cell cycle (P = 3.46 10−72), chromosome segregation (P = 1.29 10−48) and spindle organization (P = 4.47 10−22) (Fig. 1c, red barplots).
Genes in cluster 4 included several factors involved in DNA replication such as various polymerases (POLA1, POLA2, POLD1, POLD3, and POLE2), proliferating cell nuclear antigen (PCNA), cell division control protein (CDC6) and other protein complexes necessary to initiate DNA synthesis e.g. members of the DNA replication complex (GINS2-4), members of the minichromosome maintenance complex (MCM2-7 and 10), and the replication factor complex (RFCs). DNA repair and DNA damage factors known to cooperate in DNA replication were also identified including Fanconi anemia complex components (FANCE, FANCG, FANCI, and FANCL), RAD complex components (RAD51, RAD51AP1 and RAD54L) and Breast cancer type 1 susceptibility protein (BRCA1). Importantly, genes known to regulate G1/S transition including cyclins E (CCNE1 and CCNE2), M phase inducer phosphatase 1 (CDC25A) and Cell division control protein 6 homolog (CDC6) belonged to cluster 4. Genes in cluster 6 included several G2 and mitotic regulators such as mitotic checkpoint serine/threonine-protein kinase (BUB1), cyclin-dependent kinase 1 (CDK1), a master cell cycle regulator, cyclins A and the two isoforms of cyclin B (CCNA2, CCNB1, CCNB2) and M phase inducer phosphatase 2/3 (CDC25B and CDC25C). Various genes involved in kinetochore formation (CENPA, CENPE, CENPF, CENPI) and several motor proteins members of the kinesin-like proteins (KIFs) known to participate in chromosomal and spindle movements during mitosis [24] also belonged to this cluster. Clusters 4 and 6 together accounted for 298 genes which exhibited up-regulation associated with S phase and mitosis across all the four cell lines examined. This number is three fold higher than that previously found to be representing the core cell cycle signature across the human cell lines investigated [10].
Sub-clustering of clusters 4 and 6 allows more specific cell cycle phase association
As in the previous cell cycle studies genes were assigned to at least four different cell cycle phases, we investigated if more detailed phase-specific gene networks could be identified from cluster 2 and 4 by increasing the stringency of the clustering algorithm (see Methods). Cluster 4 separated in 5 sub-clusters, of which two showed subtle differences in their peak of expression (Fig. 2a, left) i.e. genes in cluster 4A displayed a peak in their expression earlier than those of cluster 4B (Fig. 2b, top). These two clusters represent G1/S transition and S phase gene expression respectively as they included several bona fide markers of these two phases. G1/S regulators, discussed in previous section, indeed belonged to cluster 4A (G1/S cluster). This cluster also contained the majority of genes known to be involved in the formation of the pre-replication complex, necessary to initiate DNA replication (MCM 2-7/10, CDC6, CDT1 and ORC1) [25]. On the other hand, in cluster 4B (‘S phase’ cluster) we identified genes playing a role in DNA replication, particularly in the initiation of DNA replication including cell division control protein 45 homolog (CDC45) [25], DNA polymerase alpha catalytic subunit (POLA1) and PCNA associated factor (KIAA0101). DNA metabolism factors including RRM1/2 were present in cluster 4B, responsible for providing precursors necessary for DNA synthesis.
Likewise, increasing the stringency of the clustering split cluster 6 into two sub-clusters, cluster 6A and cluster 6B (Fig. 2a, right), associated with G2 and M phase respectively (Fig. 2b, bottom). Cell cycle regulators CDK1, CCNA2, CDC25B and CDC25C were found in cluster 6A (G2 cluster) and are known to intervene at the G2/M boundary [6]. Kinetochore proteins (CENPE, CENPF, CENPI) and motor proteins (KIF11, KIF14, KIF18A, KIF18B, KIF20B, KIF22, KIF23, KIF2C, KIF5B) were included in this cluster. Cluster 6B was populated with mitotic cyclins such as: CCNB1 and CCNB2, BUB1, involved in the metaphase checkpoint [26] and other gene products playing a role in mitosis. Complete description of the phases assigned to genes in clusters 4 and 6 can be found in Additional file 2: Table S1. It should be emphasized however that whilst sub-division of clusters 4 and 6 may identify regions in the network that are more enriched for genes associated with particular phase of the cell cycle, the network is a continuum and these sub-divisions are relatively arbitrary.
G1-related and early growth response clusters
Cluster 2 showed a partial cell cycle-associated expression with peaks of expression coinciding to those of cluster 6 as observed in the following experiments: in experiment three in U2OS cells, in all the three experiments in HeLa cells and in the second experiment in HaCat cells (Fig. 1b, grey profile). Notably, this cluster profile also showed up-regulation at around 6 h in primary fibroblasts entering cell cycle from quiescence (starvation experiment). Genes in cluster 2 were involved in pathways indicative of an active metabolism such as: cellular metabolic process (P = 2.3 10−4), ribosome biogenesis (P = 3.8 10−2) and macromolecule modification (P = 1.4 10−2) (Fig. 1c, grey barplots). Also were found in this cluster: E2F5, a member of the E2F transcriptional factors family, which plays a role as repressor during G1 phase [27], the retinoblastoma protein (RB1), a main tumor suppressor which inhibits cell cycle progression during this phase by inactivating E2F1 [28] and CDC73, another tumor suppressor which has been reported to interact with cyclin D1 [29]. Cluster 2 also included several mitogen-activated MAP kinases (MAP2K1, MAP3K4, MAP3K7CL, MAP4K3, MAPK6) essential to deliver mitogenic stimuli signals to cell cycle regulators. Interestingly, cluster 1 and 9 (Fig. 1a) also contained G1-related genes with cluster 1 including cyclins D1 and D3 (CCND1 and CCND3), master regulators of G1 progression [28] while cluster 9 included CDK4, a cyclin dependent kinase which operates during G1 phase. [28] These clusters however failed to show expression patterns associated with cell cycle events (Additional file 1: Figure S8, green and red profiles). Cluster 3 showed a conserved sharp peak in expression in the first hours after the release of cells from blockade, with no further induction at other times (Fig. 1b, brown profile). The 128 genes in this cluster were highly enriched with pathways involving transmission of both proliferative and anti-proliferative signals (Fig. 1c, brown barplot). Accordingly, the cluster included several genes activated by mitogenic stimuli and encoding for a variety of cytoplasmatic enzymes, secreted proteins and transcription factors assigned to transduce the signal from the cell membrane to the nucleus [30]. These included early growth response genes 2/3 (EGR2 and EGR3), fos and jun (FOSB and JUNB) which activate transcription upon dimerization [30] and Immediate early response gene 2/3 (IER2 and IER3).
A full list of genes included in the six clusters identified and lists of their enriched GO biological process terms can be found in Additional file 2: Table S1 and Additional file 3: Table S2, respectively.
Validation of clusters analysis with another unsupervised clustering technique
We analyzed the data using weighted correlation network analysis (WGCNA), an unsupervised technique that generates modules (clusters) of correlated genes after construction of an adjacency matrix. We identified (color-coded) modules after hierarchical clustering using the WGNCA package (see Methods) [17] (Additional file 1: Figure S9A). Reassuringly, comparisons of the genes included in the most enriched modules derived from the WGNCA analysis and genes in the clusters identified with BioLayout Express
3D showed high overlap and GO enrichments for each module (Additional file 1: Figure S9B) showed consistency of GO biological process terms, particularly for clusters 2, 3, 4, 6 (Additional file 1: Figure S9C). Moreover, we compared the overlap between the two sets of clusters/modules enriched with cell cycle genes finding 237 genes in common. Though WGCNA analysis identified many other genes included in the two modules (449) (Additional file 1: Figure S9D), the enrichment for the GO_BP term cell cycle in the two clusters found in our analysis was far more significant (Additional file 1: Figure S9E).
A network analysis of the combined data more efficiently identifies commonalities in cell cycle-related genes
We identified 298 cell cycle genes up-regulated during G1/S-S and G2-M phase across independent studies in different human cell lines whereas direct comparison of the results of individual cell cycle studies showed only 96 common genes. To look deeper at the cause of the poor overlap we overlaid the gene sets from the four studies [6, 7, 9, 10] on the network graph. Notably, the highest overlap was in clusters 4 and 6, representing G1/S-S and G2-M phases (Fig. 3a). However many genes in clusters 4 and 6 were not reported by all studies with 63 genes identified by three studies, 62 genes by two, 50 genes by one study and 39 genes not reported by any (Fig. 3b). Nevertheless, their expression profiles did show cell cycle-dependent regulation across all the cell lines and many of them are documented to be involved in cell cycle. We illustrate this by describing few examples below. Their relative expression profiles with superimposed known-cell cycle factors can be seen in Fig. 3c. The Kinetochore-associated protein DSN1 homolog (DSN1), necessary for proper chromosome alignment and segregation during mitosis as part of the MIS12 complex [31] was only reported in Grant et al. study. KIF20A, a mitotic kinesin required for cytokinesis [32], was only found in HaCat and U2OS cells. CDKN3, a tumor suppressor phosphatase intervening during G1/S transition and mitosis, was not identified by Bar-Joseph et al. study and DNA polymerase alpha catalytic subunit (POLA1), essential for DNA replication initiation was only reported by Whitfield et al. study. Genes not supported by any study showing cell cycle-associated expression included structural maintenance of chromosomes protein 2 (SMC2), a central component of the condensing complex assigned to condense chromatin into mitotic-like chromosomes [33] and putative pituitary tumor-transforming gene 3 protein (PTTG3P), potentially involved in chromosome segregation. A table with complete gene listing of the clusters and the overlap from previous studies can be found in Additional file 4: Table S3. In summary, the majority of the genes in cluster 4 and 6 were not identified in all studies despite following a cell-cycle dependent expression pattern. Thus, correlation-based analysis of the collated data enables bypassing incongruences as a result of the independent analyses and finds coherent patterns in the data.
Data comparison with yeast data and further human-derived datasets
As a large body of work on cell cycle transcriptomics have been performed on budding and fission yeast, we sought to compare our results with these studies. To do so, we exploited a web resource called Cyclebase (Cyclebase.org) [13] in which several yeast studies were re-analysed and genes were ranked according to the magnitude of their periodicity scores calculated by a statistical method that proved to give the best performance when compared to others [12]. Ranked list of genes were downloaded from the website for the budding and the fission yeasts. The fission yeast list included 449 ranked periodic genes while the budding yeast list comprised a ranked list of the whole yeast transcriptome of which only the top 500 genes were used for comparison. Yeast orthologues to human genes were retrieved using YeastMine (http://yeastmine.yeastgenome.org) for budding yeast, and for fission yeast using PomBase (http://www.pombase.org/). The list of 298 cell cycle-associated genes identified here included 63 orthologues in budding yeast and 35 from fission yeast. When compared with the results of individual human cell cycle studies, the number of budding yeast orthologues is comparable, although in fission yeast almost a double amount of orthologues are identified by the Grant et al. study (Additional file 1: Figure S10A). Nonetheless the 96 genes overlapping in the four studies included a significantly lower number of orthologues genes compared to this list in both yeasts (Additional file 1: Figure S10A). All in all our list includes a relatively high number of orthologues which are mostly comparable with much larger gene lists and a marked higher number than those found in the 96 gene set. To further verify the quality of the 298 gene set we compared its GO enrichments for cell cycle biological process term across the four cell cycle gene lists derived from the correspondent individual studies and the set of 96 genes derived from their direct comparison (Additional file 1: Figure S10B). Gene lists from the Whitfield study were obtained both from Cyclebase and from the original study. As it can be seen, our list of genes received the highest enrichment for the GO_BP term cell cycle.