Identification of cell cycle-related regulatory motifs using a kernel canonical correlation analysis
© Rhee et al; licensee BioMed Central Ltd. 2009
Published: 3 December 2009
Gene regulation is a key mechanism in higher eukaryotic cellular processes. One of the major challenges in gene regulation studies is to identify regulators affecting the expression of their target genes in specific biological processes. Despite their importance, regulators involved in diverse biological processes still remain largely unrevealed. In the present study, we propose a kernel-based approach to efficiently identify core regulatory elements involved in specific biological processes using gene expression profiles.
We developed a framework that can detect correlations between gene expression profiles and the upstream sequences on the basis of the kernel canonical correlation analysis (kernel CCA). Using a yeast cell cycle dataset, we demonstrated that upstream sequence patterns were closely related to gene expression profiles based on the canonical correlation scores obtained by measuring the correlation between them. Our results showed that the cell cycle-specific regulatory motifs could be found successfully based on the motif weights derived through kernel CCA. Furthermore, we identified co-regulatory motif pairs using the same framework.
Given expression profiles, our method was able to identify regulatory motifs involved in specific biological processes. The method could be applied to the elucidation of the unknown regulatory mechanisms associated with complex gene regulatory processes.
One of the major challenges in current biology is to elucidate the mechanisms governing gene expression. Gene expression programs depend mainly on transcription factors, which bind to upstream sequences by recognizing short DNA motifs called transcription factor binding sites (TFBSs) and thereby regulate the expression of their target genes. Although many regulatory motifs have been identified, a large number of functional elements remain unknown.
Many genome-wide approaches have been developed in an attempt to discover regulatory motifs from upstream sequences. Early computational approaches for identifying regulatory motifs were based on statistical analyses using only the upstream sequences of genes. Statistical methods such as maximum-likelihood estimation or Gibbs sampling are effective for directly searching for significant sequence motifs in multiple upstream sequences [3, 4]. Several computational approaches based on machine learning methods have also been implemented. A SOM (self-organizing map)-based clustering method can find regulatory sequence motifs by grouping relevant sequence patterns, and a graph-theoretic approach identifies regulatory motifs by searching for the maximum density subgraph.
More advanced approaches have been developed that can identify regulatory motifs by linking gene expression profiles and motif patterns. The main advantage of these approaches is that they can identify motifs correlated with specific biological processes. Most early trials used a unidirectional search, such as approaches that search for shared upstream sequence patterns in a set of co-expressed genes found by clustering algorithms [7, 8], or those that determine whether genes with common regulatory elements are co-expressed [9, 10]. It is also possible to link motifs to gene expression patterns using linear regression models or regression trees [11, 12]. Recently, several techniques for a bidirectional search that detects the relationship between regulatory motifs and gene expression profiles have emerged [13, 14]. They search for regulatory motifs more efficiently than unidirectional approaches because they search simultaneously for similar expression patterns and the regulatory motifs correlated with them.
We applied the kernel CCA to a paired set of upstream sequence motifs of genes and their expression profiles in the yeast (Saccharomyces cerevisiae) cell cycle, and explored significant relationships between motifs and expression profiles. We also searched for regulatory motifs correlated with specific expression patterns. Our method retrieved regulatory motifs that play an important role in cell cycle regulation, including several well-known cell cycle regulatory motifs: MCB, SCB and SFF'. Furthermore, we identified motif pairs associated with gene expression to construct a map of combinatorial regulation by regulators.
Results and discussion
We applied a computational method, kernel CCA, to the identification of novel transcriptional regulatory elements. The main purpose of our experiments was to find regulatory motifs that were associated with gene regulation in specific biological processes. Using the kernel CCA, we first found highly correlated features between expression profiles and the sequence motifs. The key motifs in gene regulation were then identified from the weight scheme produced by the kernel CCA (see Methods section). Furthermore, we demonstrate that our method can be applied to identify motif pairs using raw upstream sequences.
Identification of the relationship between gene expression and known motifs
Known regulatory motifs in yeast (Saccharomyces cerevisiae)
The list of top ranked motifs based on the weight scheme by the kernel CCA
Identification of cell cycle-related motifs
High-scored motifs in the first and the second components using 5-mer raw upstream sequences
Combinational effects of regulatory motifs
We searched for motif pairs that have synergistic or co-regulatory combination effects in the yeast cell cycle. The regulatory mechanisms of eukaryotes are highly complex because most genes are synergistically regulated by several different transcription factors. Identifying synergistic motif combinations can therefore contribute to a systematic understanding of the regulatory circuit.
The top 10 ranked motif pairs and their ECRScores
We presented a novel method that can identify candidate condition-specific regulatory motifs by employing kernel-based methods. The application of the kernel CCA enables us to detect correlations between heterogeneous datasets consisting of upstream sequences and expression profiles. From a data-mining perspective, our work can be regarded as a new approach for detecting important features from regulatory sequences and gene expression profiles. We demonstrated that major motifs in a specific biological process can be extracted by a CC score that models a close relationship between two datasets related to gene regulation.
As genome-wide datasets of various types become available, it is important to analyze these datasets in an integrated manner. Novel biological hypotheses can be generated by integrating diverse biological resources produced for specific research purposes. In this respect, the kernel CCA can be regarded as a useful method for extracting biological factors with significant roles by integrating different types of biological data. Many studies for identifying motifs have been based on sequence conservation or sequence characteristics, regardless of the biological processes involved. Our method can therefore be regarded as a complementary approach in the analysis of gene regulation.
Our method found important motifs related to the cell cycle by using raw upstream sequences as well as known motif sets. In the present study we used raw sequences with a window size of l = 5. Enlarging the window size increases the dimension of the sequence features exponentially, while the frequency of each motif decreases. Although the window size used in our experiments was shorter than the length of several known transcription factor binding sequences, it was long enough to obtain worthwhile results.
In future research, we will apply the proposed method to diverse gene expression datasets, especially cancer-related datasets. The cancer-related regulatory program can be elucidated by analyzing regulatory motifs from a set of enriched genes in the cancer transcriptome. Using the kernel CCA, a correlation analysis between regulatory sequences and the cancer transcriptome may directly capture regulatory motifs related to the abnormal gene regulatory program.
Investigation of the relationship between regulatory sequence motifs and expression profiles
Kernel CCA (kernel canonical correlation analysis) is a nonlinear version of CCA in which the kernel trick is used to find nonlinearly correlated features from two datasets [15–17]. CCA is a classical multivariate statistical method for finding linearly correlated features from a pair of datasets. Given a pair of multivariate variables x and y, CCA finds a pair of linear transformations such that the correlation coefficient between the extracted features is maximized. However, if the relationship between the variates is nonlinear, CCA does not always extract useful features.
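For reference, the linear CCA objective described above is commonly written as follows. The symbols w_x and w_y (projection vectors) and C_xx, C_yy, C_xy (within-set and cross-set covariance matrices) are notation introduced here for illustration and do not appear in the original text.

```latex
\rho = \max_{w_x,\, w_y}
  \frac{w_x^{\top} C_{xy}\, w_y}
       {\sqrt{\left(w_x^{\top} C_{xx}\, w_x\right)\left(w_y^{\top} C_{yy}\, w_y\right)}}
```

The extracted features are the projections w_x^T x and w_y^T y, and the maximized value ρ is the first canonical correlation.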
Kernel CCA offers a solution for overcoming the linearity by first projecting the data into a higher dimensional feature space. While CCA is limited to linear features, kernel CCA can capture nonlinear relationships. Kernel CCA has been used for several applications including text retrieval and biological data analysis [15, 37].
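The idea can be sketched in a few lines of Python with NumPy. This is a minimal illustration of one common regularized formulation of kernel CCA, not the Matlab implementation used in this study; the function names, the regularization parameter kappa and its default value are assumptions made for the example.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_cca(Kx, Ky, kappa=0.1, n_components=2):
    """Regularized kernel CCA on two n x n kernel matrices (illustrative sketch).

    Returns the leading regularized canonical correlations and the dual
    coefficient vectors alpha and beta, one column per component.
    """
    n = Kx.shape[0]
    Kx, Ky = center_kernel(Kx), center_kernel(Ky)
    Rx = Kx + kappa * np.eye(n)  # regularized kernels (kappa is an assumed parameter)
    Ry = Ky + kappa * np.eye(n)
    # M = Rx^{-1} Kx Ky Ry^{-1}; its singular values are the regularized correlations.
    left = np.linalg.solve(Rx, Kx)       # Rx^{-1} Kx
    right = np.linalg.solve(Ry, Ky).T    # Ky Ry^{-1}, using symmetry of Ky and Ry
    U, s, Vt = np.linalg.svd(left @ right)
    # Map the unit-norm singular vectors back to dual coefficients.
    alpha = np.linalg.solve(Rx, U[:, :n_components])
    beta = np.linalg.solve(Ry, Vt[:n_components].T)
    return s[:n_components], alpha, beta
```

Given the centered kernels, the canonical scores of the genes are Kx @ alpha[:, k] and Ky @ beta[:, k] for component k. Without regularization (kappa = 0) and with full-rank kernels the correlation is trivially 1, which is why a positive kappa is used in this sketch.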
A high weight value of the specific sequence motif means that the motif is strongly correlated with the expression patterns of genes whose upstream region includes the motif and whose CC scores are high. If a weight of a specific motif has a high absolute value, the motif is more likely to play a regulatory role in the specific biological process. The kernel CCA was implemented using Matlab.
Preparation of the gene expression datasets
where σ is a width parameter, d(·,·) is the Euclidean distance, and x and x' denote two different instances.
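A kernel of this kind can be computed as in the sketch below. It assumes the widely used Gaussian RBF form exp(-d(x, x')^2 / (2σ^2)); the exact expression used in this study is not reproduced in the text here, so the formula, the function name and the default σ are assumptions.

```python
import numpy as np

def gaussian_rbf_kernel(X, sigma=1.0):
    """Gaussian RBF kernel matrix for the rows of X (one expression profile per row).

    Assumed form: K[i, j] = exp(-d(x_i, x_j)^2 / (2 * sigma^2)),
    where d is the Euclidean distance.
    """
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```

Larger values of σ make the kernel smoother, so more distant expression profiles are still treated as similar.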
Preparation of the gene sequence datasets
The sequence data were used in two ways. In the first case, we used the sequences of a total of 42 known motifs (Table 1) extracted by Pilpel et al. We then scanned the upstream regions of ORFs for the presence of these motifs using the AlignACE program. The sequence profile was represented by the occurrence of these motifs in the promoter of each gene in the genome.
In the second case, we analyzed the relationship between the expression profiles and the raw upstream sequences. We extracted ~1 kb upstream sequences of each gene. From these sequences, we calculated the frequency of all possible l-mers in each gene. For l = 5, each gene had 1,024 (= 4^5) different base combinations. The sequence profile was encoded as the frequency of l-mers.
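The l-mer encoding can be illustrated with the short sketch below. It counts overlapping windows on a single strand and skips windows containing characters outside A, C, G and T; both choices, and the function name, are assumptions, since the text does not state how strands or ambiguous bases were handled.

```python
from itertools import product

def lmer_frequencies(sequence, l=5, alphabet="ACGT"):
    """Count every possible l-mer in one upstream sequence.

    Returns (lmers, counts), where lmers lists all len(alphabet)**l words in
    lexicographic order and counts[i] is the number of occurrences of lmers[i].
    """
    lmers = ["".join(p) for p in product(alphabet, repeat=l)]
    index = {word: i for i, word in enumerate(lmers)}
    counts = [0] * len(lmers)
    seq = sequence.upper()
    for start in range(len(seq) - l + 1):
        i = index.get(seq[start:start + l])
        if i is not None:  # skip windows with ambiguous bases such as N
            counts[i] += 1
    return lmers, counts
```

Stacking the count vectors of all genes row by row gives a gene-by-1,024 matrix for l = 5, from which a sequence kernel can be built.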
Measurement of the effect of motif pairs
where N(m_i ∩ m_j) is the number of all pairs of genes whose upstream regions contain both motifs, and N_τ(m_i ∩ m_j) is the number of those gene pairs whose correlation coefficient is larger than the threshold τ. The threshold was chosen based on the fifth percentile of the distribution for correlation coefficients of randomly sampled gene pairs.
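The two counts can be computed as in the following sketch. It assumes that the score for a motif pair is the ratio N_τ / N; the defining equation is not reproduced in the text here, so this ratio, the function name and the argument layout are assumptions made for illustration.

```python
import numpy as np

def ecr_score(expr, has_mi, has_mj, tau):
    """Fraction of gene pairs carrying both motifs whose expression correlation exceeds tau.

    expr   : genes x conditions expression matrix
    has_mi : boolean vector, True if the gene's upstream region contains motif m_i
    has_mj : boolean vector, True if the gene's upstream region contains motif m_j
    tau    : correlation threshold (chosen elsewhere, e.g. from random gene pairs)
    """
    expr = np.asarray(expr, dtype=float)
    both = np.flatnonzero(np.asarray(has_mi, dtype=bool) & np.asarray(has_mj, dtype=bool))
    if len(both) < 2:
        return 0.0  # no gene pair can be formed
    corr = np.corrcoef(expr[both])            # Pearson correlation between genes
    upper = np.triu_indices(len(both), k=1)   # each unordered gene pair once
    pair_corr = corr[upper]                   # N = len(pair_corr)
    return float(np.mean(pair_corr > tau))    # assumed score: N_tau / N
```

A score near 1 indicates that most genes sharing the two motifs are co-expressed, which is the kind of signal expected for a synergistic motif pair.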
Other papers from the meeting have been published as part of BMC Bioinformatics Volume 10 Supplement 15, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Bioinformatics, available online at http://www.biomedcentral.com/1471-2105/10?issue=S15.
This work was supported in part by KEIT through the MARS project (IITA-2009-A1100-0901-1639), KRF Grant funded by the Korean Government (MOEHRD) (KRF-2008-314-D00377) and the BK21-IT program funded by Korean Government (MEST). JHC has been supported by Korean Ministry of Information and Communications under 2005 IT scholarship program. The ICT at Seoul National University provides research facilities for this study.
This article has been published as part of BMC Genomics Volume 10 Supplement 3, 2009: Eighth International Conference on Bioinformatics (InCoB2009): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S3.
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002, 298 (5594): 799-804. 10.1126/science.1075090.
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature. 2005, 434 (7031): 338-345. 10.1038/nature03441.
- Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000, 296 (5): 1205-1214. 10.1006/jmbi.2000.3519.
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994, 2: 28-36.
- Mahony S, Hendrix D, Golden A, Smith TJ, Rokhsar DS: Transcription factor binding site identification using the self-organizing map. Bioinformatics. 2005, 21 (9): 1807-1814. 10.1093/bioinformatics/bti256.
- Fratkin E, Naughton BT, Brutlag DL, Batzoglou S: MotifCut: regulatory motifs finding with maximum density subgraphs. Bioinformatics. 2006, 22 (14): e150-157. 10.1093/bioinformatics/btl243.
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet. 1999, 22 (3): 281-285. 10.1038/10343.
- Brazma A, Jonassen I, Vilo J, Ukkonen E: Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 1998, 8 (11): 1202-1215.
- Pilpel Y, Sudarsanam P, Church GM: Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet. 2001, 29 (2): 153-159. 10.1038/ng724.
- Park PJ, Butte AJ, Kohane IS: Comparing expression profiles of genes with similar promoter regions. Bioinformatics. 2002, 18 (12): 1576-1584. 10.1093/bioinformatics/18.12.1576.
- Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using correlation with expression. Nat Genet. 2001, 27 (2): 167-171. 10.1038/84792.
- Keles S, Laan van der M, Eisen MB: Identification of regulatory elements using a feature selection method. Bioinformatics. 2002, 18 (9): 1167-1175. 10.1093/bioinformatics/18.9.1167.
- Segal E, Yelensky R, Koller D: Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics. 2003, 19 (Suppl 1): i273-282. 10.1093/bioinformatics/btg1038.
- Jeffery IB, Madden SF, McGettigan PA, Perriere G, Culhane AC, Higgins DG: Integrating transcription factor binding site information with gene expression datasets. Bioinformatics. 2007, 23 (3): 298-305. 10.1093/bioinformatics/btl597.
- Hardoon DR, Szedmak S, Shawe-Taylor J: Canonical correlation analysis; An overview with application to learning methods. Technical Report CSD-TR-03-02. 2003, Royal Holloway University of London
- Bach FR, Jordan MI: Kernel independent component analysis. Technical Report UCB//CSD-10-1166. 2001, UC Berkeley
- Akaho S: A kernel method for canonical correlation analysis. International meeting of Psychometric Society (IMP2001). 2001
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9 (12): 3273-3297.
- Dohrmann PR, Butler G, Tamai K, Dorland S, Greene JR, Thiele DJ, Stillman DJ: Parallel pathways of gene regulation: homologous regulators SWI5 and ACE2 differentially control transcription of HO and chitinase. Genes Dev. 1992, 6 (1): 93-104. 10.1101/gad.6.1.93.
- Dohrmann PR, Voth WP, Stillman DJ: Role of negative regulation in promoter specificity of the homologous transcriptional activators Ace2p and Swi5p. Mol Cell Biol. 1996, 16 (4): 1746-1758.
- Morillon A, O'Sullivan J, Azad A, Proudfoot N, Mellor J: Regulation of elongating RNA polymerase II by forkhead transcription factors in yeast. Science. 2003, 300 (5618): 492-495. 10.1126/science.1081379.
- Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, Volkert TL, Wyrick JJ, Zeitlinger J, Gifford DK, Jaakkola TS, et al: Serial regulation of transcriptional regulators in the yeast cell cycle. Cell. 2001, 106 (6): 697-708. 10.1016/S0092-8674(01)00494-9.
- Vershon AK, Johnson AD: A short, disordered protein region mediates interactions between the homeodomain of the yeast alpha 2 protein and the MCM1 protein. Cell. 1993, 72 (1): 105-112. 10.1016/0092-8674(93)90054-T.
- Zhong H, McCord R, Vershon AK: Identification of target sites of the alpha2-Mcm1 repressor complex in the yeast genome. Genome Res. 1999, 9 (11): 1040-1047. 10.1101/gr.9.11.1040.
- Lydall D, Ammerer G, Nasmyth K: A new role for MCM1 in yeast: cell cycle regulation of SW15 transcription. Genes Dev. 1991, 5 (12B): 2405-2419. 10.1101/gad.5.12b.2405.
- Keleher CA, Passmore S, Johnson AD: Yeast repressor alpha 2 binds to its operator cooperatively with yeast protein Mcm1. Mol Cell Biol. 1989, 9 (11): 5228-5230.
- Mead J, Zhong H, Acton TB, Vershon AK: The yeast alpha2 and Mcm1 proteins interact through a region similar to a motif found in homeodomain proteins of higher eukaryotes. Mol Cell Biol. 1996, 16 (5): 2135-2143.
- Das D, Banerjee N, Zhang MQ: Interacting models of cooperative gene regulation. Proc Natl Acad Sci USA. 2004, 101 (46): 16234-16239. 10.1073/pnas.0407365101.
- MacKay VL, Mai B, Waters L, Breeden LL: Early cell cycle box-mediated transcription of CLN3 and SWI4 contributes to the proper timing of the G(1)-to-S transition in budding yeast. Mol Cell Biol. 2001, 21 (13): 4140-4148. 10.1128/MCB.21.13.4140-4148.2001.
- Morrow BE, Johnson SP, Warner JR: Proteins that bind to the yeast rDNA enhancer. J Biol Chem. 1989, 264 (15): 9061-9068.
- Banerjee N, Zhang MQ: Identifying cooperativity among transcription factors controlling the cell cycle in yeast. Nucleic Acids Res. 2003, 31 (23): 7024-7031. 10.1093/nar/gkg894.
- Tsai HK, Lu HH, Li WH: Statistical methods for identifying yeast cell cycle transcription factors. Proc Natl Acad Sci USA. 2005, 102 (38): 13532-13537. 10.1073/pnas.0505874102.
- Hvidsten TR, Wilczynski B, Kryshtafovych A, Tiuryn J, Komorowski J, Fidelis K: Discovering regulatory binding-site modules using rule-based learning. Genome Res. 2005, 15 (6): 856-866. 10.1101/gr.3760605.
- Kasturi J, Acharya R: Clustering of diverse genomic data using information fusion. Bioinformatics. 2005, 21 (4): 423-429. 10.1093/bioinformatics/bti186.
- Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Barrette TR, Ghosh D, Chinnaiyan AM: Mining for regulatory programs in the cancer transcriptome. Nat Genet. 2005, 37 (6): 579-583. 10.1038/ng1578.
- Hotelling H: Relations between two sets of variates. Biometrika. 1936, 28: 312-377.
- Yamanishi Y, Vert JP, Nakaya A, Kanehisa M: Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics. 2003, 19 (Suppl 1): i323-330. 10.1093/bioinformatics/btg1045.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.