Identifying functional relationships within sets of co-expressed genes by combining upstream regulatory motif analysis and gene expression information
© Martyanov and Gross. 2010
Published: 02 November 2010
Skip to main content
© Martyanov and Gross. 2010
Published: 02 November 2010
Existing clustering approaches for microarray data do not adequately differentiate between subsets of co-expressed genes. We devised a novel approach that integrates expression and sequence data in order to generate functionally coherent and biologically meaningful subclusters of genes. Specifically, the approach clusters co-expressed genes on the basis of similar content and distributions of predicted statistically significant sequence motifs in their upstream regions.
We applied our method to several sets of co-expressed genes and were able to define subsets with enrichment in particular biological processes and specific upstream regulatory motifs.
These results show the potential of our technique for functional prediction and regulatory motif identification from microarray data.
DNA sequence motif finders are often used to predict potential regulatory motifs upstream of co-regulated genes, typically identified through gene expression experiments. The importance of upstream regulatory motifs for establishing a link between co-expression and co-regulation has been recognized previously [1–3]. These motifs represent patterns in sequence data important both for transcriptional regulation and protein function prediction . However, the identification of shared motifs does not necessarily mean that the genes are involved in the same biological process. Further, microarray expression data are notoriously noisy, which impacts the ability of motif finders to identify biologically relevant patterns.
It is believed that similar gene expression profiles are the result of similar regulatory mechanisms . In fact, this hypothesis served as the basis for regulatory network discovery from microarray expression experiments. However, gene expression profiles are often based on weak similarities that are unlikely to correlate with true co-regulation . Potentially, there are multiple parallel regulatory mechanisms within a set of co-expressed genes. Therefore, genes displaying similar expression profiles may respond to different external stimuli, represent parallel biosynthetic pathways, and/or be regulated by different transcription factors. Thus, the problem of elucidating functional relationships and identifying potential regulatory motifs among co-expressed genes is quite challenging.
Because of the high noise level of microarray expression data, cluster analysis often returns clusters that are not functionally coherent . Although the application of clustering methods to gene expression data provides numerous insights into cell regulation and disease characterization , the majority of current clustering algorithms do not consider functional relationships within co-expressed genes that comprise the cluster.
The majority of motif finders employ a single search strategy aimed at identifying motifs of a specific type. Because of that, they are not distinguishable from each other in terms of performance over a wide range of datasets from different species. In fact, according to the assessment of performance of thirteen different computational tools , absolute measures of correctness were low and similar for all the motif finders tested. It was suggested that a few tools be used in combination to improve the accuracy of predictions. This need resulted in the development of conceptually different ensemble algorithms. SCOPE (Suite for Computational Identification Of Promoter Elements), the ensemble motif finder developed in our lab  combines three distinct search strategies, each of which looks for a specific kind of motif: non-degenerate (e.g. ACGCGT), degenerate (ASTBKG) and long and bipartite (AYTNNNNNNNNCGT). The results of individual algorithms are then combined using a learning rule which is simply the maximum score returned by the component algorithms. SCOPE has been shown to outperform most commonly used motif finders by a statistically significant margin enjoying both high sensitivity and specificity that result in the best accuracy of transcription factor binding site prediction . SCOPE is also very robust to the presence of extraneous sequences in the input gene set which makes it an excellent tool for the analysis of (often noisy) microarray data. SCOPE’s interface is also very simple and does not require the user to enter any program parameters (such as the length of the expected motif or how many instances of the motif are predicted).
We used four sets of co-expressed genes from Saccharomyces cerevisiae microarray experiments. A set of genes upregulated during G1/S cell cycle transition was retrieved by literature mining [11, 12]. We also analyzed two gene sets from a classic microarray experiment  that correspond to G1 CLN2 and M-G1 MCM clusters. Finally, we analyzed a set of co-expressed potential targets of the filamentous growth pathway previously identified using a rigorous statistical approach . Gene expression data is clustered to generate subclusters of genes that share similar expression profiles. Each subcluster is analyzed by SCOPE to generate significant candidate regulatory motifs, which are used to generate motif profiles for each gene. A motif profile contains the number of occurrences of each motif in each upstream quartile for all significant motifs for that gene. This process generates a vector for each gene consisting of the motif occurrence profile of that gene. These gene-motif vectors are then clustered to generate subclusters of genes that have similar motif profiles. Finally, the genes in each of these last subclusters are analyzed by SCOPE and subjected to biological functional analysis.
The results of SCOPE can be saved as a tab-delimited text file. This file contains list of found motifs with the following information about each individual motif: consensus sequence, count (number of occurrences), Sig value (a measure of statistical significance), coverage (fraction of the input genes containing motif of interest) and a list of upstream locations of each individual instance of the consensus sequence.
For the analysis of SCOPE output, we pursued an approach that we call quartile analysis. We divide the upstream DNA region (default length 800 bps for S. cerevisiae) into four equal quartiles or bins, 200 bps each. The length of 800 bps for yeast seems to be the standard accepted in the community. Bins of 200 bps usually contain a high enough number of motif occurrences while simultaneously allowing one to distinguish between different motifs in terms of their distributions. Using fewer (larger) bins does not distinguish positions with enough detail and using more (smaller) bins results in many bins without any motifs. We then convert the SCOPE text file into a table where each row corresponds to a gene and each column corresponds to the number of occurrences of a motif in a given quartile across upstream regions of all genes in the gene set. Therefore, we create a series of gene-motif vectors that can be compared to each other and clustered together if motif distributions are similar for genes in question.
The actual clustering of genes based on their upstream sequence signatures and visualization of resulting subsets is done by using VxInsight. VxInsight  is a knowledge visualization tool that uses a force-directed placement algorithm to distribute objects (vectors of information) on a plane transforming them into an easily interpretable visual landscape. The algorithm moves similar items closer together while pushing dissimilar objects away and builds up a terrain with peaks and valleys. In our case, the peaks represent genes that have similar motif content and position distributions. The gene-motif vectors are randomly scattered on a plane and subjected to an iterative process that moves similar objects closer together and dissimilar objects further apart. This movement uses a process that is similar to simulated annealing . Iterations are continued until no more movement is observed. VxInsight constructs a 3-D virtual landscape from the concentration of objects (node) on the plane. This approach does not require a pre-defined number of clusters since the data objects are not explicitly members in a particular cluster. When the process is complete, the different peaks can be selected based on the visual analysis and their content (genes) analyzed further (see Figure 1).
Gene sets generated by VxInsight were analyzed via the AmiGO  functional analysis tool. Subsets of the original gene set were compared in terms of either enrichment of existing functional categories or emergence of new ones.
Statistically significant motifs from SCOPE analysis of gene subsets were used as the input to search for similar motifs in the UniPROBE database . This database hosts data from universal protein binding microarray assays on in vitro DNA binding specificities of proteins. UniPROBE enables a user to search for transcription factor binding sites in a query DNA sequence. SCOPE motifs were tested for matches with regulatory motifs in the database.
Gene expression during G1/S transition of the cell cycle in S. cerevisiae is regulated by two transcription factors, MBF and SBF (Mlu1 box and Swi4/6 cell cycle box binding factor, respectively). These are heterodimeric complexes sharing a common regulatory subunit but with different DNA-binding subunits . They regulate transcription of genes involved in DNA synthesis and DNA repair, budding and spindle pole body formation .
These subsets were then analyzed via AmiGO to look for enrichment of functional categories within each subset. The subsets showed different functional associations. Cluster A showed significant enrichment in external encapsulating structure organization and cell wall organization (p-value ∼ 10-6), biological processes not found in the initial gene set analysis. Cluster C was generally similar to the main gene set in terms of overrepresented functional categories. However, it showed significantly better p-value (by 102-104 fold) for several biological processes including DNA repair, response to stress and cellular response to stimulus. Cluster B was very similar to main gene set and did not show any functional enrichment. Most of the genes in cluster A were known to be SBF regulated genes, while those in cluster C were MBF regulated genes. Finally, the highest-scoring motifs from the VxInsight subsets were compared to the biologically predicted regulatory sequences for SBF and MBF. Sequences from cluster A were found to be similar to the SBF transcription factor binding site (CRCGAAA) and motifs from cluster C subset were found to match closely MBF transcription factor binding site (ACGCG) .
The G1 CLN2 cluster contains 76 genes, many of which are involved in DNA replication. Their expression is strongly cell cycle-regulated, with peak expression occurring in the mid-G1 phase .
Results of the SCOPE analysis of the G1 CLN2 show a highest-scoring motif with the consensus sequence DWCGCGW mostly located in the 100-400 bps upstream region (data not shown).
Functional enrichment analyses of G1 CLN2 subsets did not reveal any GO term p-value improvements over the original gene set. Generally, cluster B showed a list of functional annotations highly similar to that for the main gene set with somewhat less statistically significant p-values. Cluster A showed a much less diverse composition in the list of functional annotations with much less statistically significant p-values than the original gene set.
Comparison of G1-CLN2 subset motifs to UniPROBE data.
AmiGO analysis showed enrichment in different functional terms for the three subsets. Cluster A was found to be relatively enriched in the biological processes of mitotic cell cycle, cell cycle phase and regulation of cellular process. Cluster B contained genes involved in ribosome biogenesis. Cluster C was essentially the same as the entire gene set but with less significant p-values.
Comparison of M1-MCM subset motifs to UniPROBE data.
Filamentous growth pathway is a MAPK (mitogen-activated protein kinase) dependent pathway that regulates filamentous growth in yeast . Cullen et al.  identified a set of its potential targets using rigorous statistical techniques. We used a set of 269 putative gene targets for our analysis.
Comparison of Cullen et al. dataset motifs to UniPROBE data.
Cullen et al.
Cullen et al.
Cullen et al.
Results of the SBF-MBF gene set analysis suggest that it is possible to differentiate among transcription factor targets on the basis of sequence motif information alone. Without prior biological knowledge, we were able to associate each of the subsets with a specific regulatory pattern. Genes from cluster A are mostly regulated by SBF (15/16), genes from cluster C are mostly MBF targets (24/27) and genes in cluster B can be regulated by either or both transcription factors. It seems that in this case specific positional bias of the regulatory motif is linked to an enrichment in a particular function for the genes associated with it.
For both the G1 CLN2 and M-G1 MCM gene sets, we were able to find reliable matches with known transcriptional regulatory sequences for numerous upstream motifs predicted by SCOPE. In the case of the M-G1 MCM gene set, we identified unique functional patterns for all subsets that were generated by combining expression data and motif information.
In the case of the Cullen et al. dataset, we successfully partitioned the starting gene set into several smaller gene sets with more coherent functional annotations and unique upstream sequence signatures. Cluster E from Figure 7 is of particular interest since it displays both very significant enrichment in a particular biological process and a distinct pattern of upstream motifs across the gene set.
These results demonstrate the potential of our new approach for subdividing gene sets derived from microarray data into more functionally coherent subclusters by utilizing the upstream motif content and distribution of the genes.
Motif finders have been used to identify regulatory motifs from sets of co-expressed genes determined by microarray analysis. The microarray analysis, however, is often noisy and may include several different subsets of genes that are actually regulated by different transcription factors in parallel rather than by the same transcription factor acting on all genes. By using our motif finder, SCOPE, to identify statistically significant motifs and clustering co-expressed genes based on their upstream motif content and distribution, we were able to identify subsets of genes that might represent independently regulated responses.
We applied this technique to three gene sets derived from microarray analyses and validated it by performing functional enrichment analyses and comparing computationally predicted motifs to the biologically tested regulatory sequences. We were able to generate subsets of genes with functionally enriched or novel categories and specific upstream patterns of regulatory motifs similar to known binding sites for the transcription factors. This shows the usefulness of our approach for partitioning starting gene sets into more functionally coherent subsets and making predictions about putative transcriptional regulatory patterns.
Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM). This research was supported by a grant to RHG from the National Science Foundation (USA), DBI-0445967. This article has been published as part of BMC Genomics Volume 11 Supplement 2, 2010: Proceedings of the 2009 International Conference on Bioinformatics & Computational Biology (BioComp 2009). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.