Systematic discovery of regulatory motifs in Fusarium graminearum by comparing four Fusarium genomes
© Kumar et al. 2010
Received: 9 September 2009
Accepted: 26 March 2010
Published: 26 March 2010
Skip to main content
© Kumar et al. 2010
Received: 9 September 2009
Accepted: 26 March 2010
Published: 26 March 2010
Fusarium graminearum (Fg), a major fungal pathogen of cultivated cereals, is responsible for billions of dollars in agriculture losses. There is a growing interest in understanding the transcriptional regulation of this organism, especially the regulation of genes underlying its pathogenicity. The generation of whole genome sequence assemblies for Fg and three closely related Fusarium species provides a unique opportunity for such a study.
Applying comparative genomics approaches, we developed a computational pipeline to systematically discover evolutionarily conserved regulatory motifs in the promoter, downstream and the intronic regions of Fg genes, based on the multiple alignments of sequenced Fusarium genomes. Using this method, we discovered 73 candidate regulatory motifs in the promoter regions. Nearly 30% of these motifs are highly enriched in promoter regions of Fg genes that are associated with a specific functional category. Through comparison to Saccharomyces cerevisiae (Sc) and Schizosaccharomyces pombe (Sp), we observed conservation of transcription factors (TFs), their binding sites and the target genes regulated by these TFs related to pathways known to respond to stress conditions or phosphate metabolism. In addition, this study revealed 69 and 39 conserved motifs in the downstream regions and the intronic regions, respectively, of Fg genes. The top intronic motif is the splice donor site. For the downstream regions, we noticed an intriguing absence of the mammalian and Sc poly-adenylation signals among the list of conserved motifs.
This study provides the first comprehensive list of candidate regulatory motifs in Fg, and underscores the power of comparative genomics in revealing functional elements among related genomes. The conservation of regulatory pathways among the Fusarium genomes and the two yeast species reveals their functional significance, and provides new insights in their evolutionary importance among Ascomycete fungi.
Collectively, fungal species within the genus Fusarium are among the most important group of plant pathogens, causing disease in nearly every agriculturally cultivated plant . Mycotoxins produced by Fusarium species also pose a significant hazard to food safety and human health [2, 3]. The economic and scientific importance of these fungi has led to whole genome sequencing and functional studies of multiple economically important and phylogenetically related Fusarium species including, F. graminearum (Fg), F. verticillioides (Fv), F. oxysporum (Fo), and F. solani [4–8]. Such rich genomic resources enable the discovery of many biological features related to the genetic mechanisms of organism adaptation and pathogenicity for these species [9–11].
As functional elements, transcription factor binding sites often evolve at a much slower rate than neutral sequences, and therefore they often stand out from the surrounding sequences by virtue of their greater levels of conservation. This property enables the recognition of these conserved elements through comparative genomics. Previous work has demonstrated the power of comparative genomics for discovering novel regulatory motifs in yeasts , Plasmodium , flies , mosquitoes  and mammals [16–18]. Even though there is growing body of knowledge about pathogenicity related gene regulation in the Fg genome, including spore germination  and adaptation to the host environment , very little is known about the transcription factors involved in the process and the regulatory elements targeted by these transcription factors. Comparative Fusarium genomics enables the generation of large-scale unambiguous alignments  among the four sequenced Fusarium genomes. The total divergence among these four genomes, measured by the total branch length of the tree connecting the four species is 0.35 (i.e. 0.35 substitutions per base, roughly the distance between human and dog). This distance is similar to that used successfully for identifying regulatory elements in other species [12, 18].
The whole-genome alignments of the four Fusarium species were generated using PatternHunter (see Methods) and further divided into three functionally distinct genomic regions - promoters, downstream, and introns. The gene annotation of Fg, the first sequenced  and better-annotated  Fusarium genome, was used as a reference. Because most of the Fg genes are computationally predicted and thus do not have well-defined transcriptional start or stop sites, we defined the promoter region to be up to 600 bp upstream of the translation start site (ATG) and not overlapping its neighboring gene. We chose 600 bp based on the mean intergenic distance of ~1200 bp in Fg. Similarly, we defined the downstream region of a gene to be up to 600 bp downstream of the stop codon without overlapping any other genes or potential promoter regions. Based on this definition, the downstream regions could be enriched for 3'UTR sequences, for which little is known in Fusarium genomes. To avoid redundancy, we further post-processed the promoter and downstream datasets to make sure that each genomic region is represented only once. Overall, the promoter data set contains 3.4 million aligned bases, whereas the intronic and downstream sets contain 1.3 and 2.3 million aligned bases respectively. Some segments could not be aligned in all four species, primarily because they represent new sequences recently inserted into the Fg genome, ancestral sequences deleted in one of the other species, or missing sequences in one of the draft genomes.
A score metric called Motif Conservation Score (MCS)  was used to evaluate the conservation properties of all 7-mer motifs within each of these three datasets (see Methods and Additional file 1). We focused our initial efforts on 7-mers because most regulatory motifs are in the range of 6-10 bps, with 6-mers tending to give too many random matches, whereas 8 or higher-mers too few (Note also that 7-mers are only used for initial screening; the exact lengths of the motifs are actually determined later when we introduce degeneracy to motifs). For each dataset, we exhaustively searched for the presence of 7-mers and obtained two values for each of them (i) the total occurrences in the reference genome Fg (N), and (ii) the number of occurrences that are conserved among all four genomes in the aligned regions (k). We called a motif occurrence conserved if the identical heptamer in the aligned regions occurs within a window of -5 bp to +5 bp compared to the reference position. The 5 bp extension at each side was introduced to account for potential errors in local sequence alignments. We next evaluated the MCS value for each heptamer, which essentially represents the difference in conservation of a heptamer from the expected conservation of a random heptamer (see Methods). For instance, the 7-mer CACGTGA occurred N = 264 times in the Fg genome of the promoter dataset. Among the 264 sites in Fg, k = 151 are also conserved in the corresponding orthologous regions of the three other species, corresponding to a conservation rate (C.R.) of 57%. By contrast the conservation rate for a random 7-mer in the promoter dataset is only p 0 = 3%, which indicates that the CACGTGA motif shows a 19 fold enrichment in conservation rate, corresponding to an MCS = (k-Np 0)/sqrt(Np 0(1-p 0)) = 51 s.d.
Regulatory motifs typically contain certain variations without greatly affecting the overall binding affinity, as many of the selected heptamers are slight variations of each other. Therefore, we sought to identify the true motifs associated with the discovered heptamers by introducing degeneracy into certain positions of the heptamers. If the motif, after introducing the degeneracy, is a closer representation of a regulatory element, the MCS score of the modified motif should be higher. Therefore, we developed a greedy algorithm to systematically search for the pattern of degeneracy that can lead to an increase in the MCS score of the modified motifs (see methods and Additional file 1). Briefly, the program randomly selected a base position in the heptamer and added a degenerate character that is consistent with the nucleotide at that position. If this change increased the MCS in the updated motif, we repeated this process for one more base position. As this greedy approach is susceptible to local maxima, we repeated it ten times and selected the degenerate motif with the highest MCS from all the trials. We used this approach on the heptamers in the decreasing order of their MCS values and ignored a heptamer if it could be specified by one of the degenerate motif derived from earlier heptamers. This step finally gave 326 motifs in the promoter regions, 206 in the downstream and 114 in the intronic regions. Some of the heptamers were able to increase their score by as much as 300%. ACGCGTC in promoters, for example, increased its MCS from an original value of 14.3 to 40.6 as it degenerated into nCGCGnC ('n' represents any nucleotide). The complete list of degenerate motifs is available from our project website http://www.broadinstitute.org/annotation/genome/fusarium_group/SupplementalMaterials.html.
The MCS score measures the conservation across all sites of a motif; therefore it should not be very sensitive to the specific definition of promoters we used as long as the distribution of the motif in the promoters is not biased toward long distances. To illustrate this, we generated two additional promoter datasets using the distance of 400 bp and 1000 bp respectively for the motif discovery. For the 400 bp dataset, we discovered 288 motifs (prior to clustering), of which 286 are present in the 326 motifs discovered in the 600 bp dataset. For the 1000 bp dataset, we discovered 394 motifs, which include all of the 326 motifs discovered in the 600 bp dataset. For these 394 motifs, more than 80% of all the sites are located within 600 bp from gene starts with the highest density located at 70 bp from gene starts (Additional file 1, Figure S3). This suggests that the 600 bp is a reasonable choice for the definition of promoter regions, although we note that distant motifs preferentially located far away from gene start sites are likely to be missed with this definition.
Top 30 discovered motifs in the promoter regions as ranked by MCS scores
Genome wide statistics for the aligned region
Functional category enrichment*
Most enriched functional category
Genes in functional category
Motif genes in Fun- Group
Since very little is known about the constituents of regulatory pathways in Fusarium species, we searched for potential associations of promoter motifs with specific gene functions using a combination of Gene Ontology (GO) annotation  and expression profiles . For each motif, we searched for the set of genes that contain the motif or its reverse pair in its promoter region. Each gene set associated with particular promoter motif is divided into GO category group per their cellular functions. Genes within each GO category group were further sub-grouped into clusters of genes with similar expression profiles using K-means clustering of previously published expression data for conidial germination . Gene expression was measured in fresh conidia (0 h) and at three other developmental milestones: spore swelling (2 h), germ tube emergence and elongation (8 h) and hyphal branching (24 h).
The GO cellular annotation provides a total 21 functional categories over 13332 Fg genes. Each category was further divided into 2 to 5 clusters within each GO category through K-means analysis and resulted in a total of 68 sub-clusters (Called FunGroup). A functional survey for each cluster was carried out using MIPS FunCat (5, 6). Gene lists, expression profiles and functional analysis for each cluster can be found on our project website shown above.
We further tested the significance of the enrichment between the promoter motifs and FunGroups through a control experiment, in which we generated a set of control motifs by randomly permuting the bases within each discovered promoter motif. The enrichment analysis on the control motifs resulted in only two enriched clusters (2%) (P-value < 10-5), much less than 22 as for the discovered motifs (30%). This test may suggest that the false discovery rate is below 9%, (P value < 10-5), and the enrichment of the motif containing genes within FunGroups are likely due to biologically bona fide associations.
Interestingly, through the GO category enrichment test, motif M2 (CACGTG) is highly enriched in the upstream promoter regions of genes in ribosome cluster 2 (Figure 3). Within this cluster, 44 out of 47 genes are predicted ribosomal proteins. Remarkably, we observed the expression profile for genes contained in ribosome cluster 2 peaking sharply at 2 and 8 h, periods of spore activation and germ tube emergence where a high level of ribosome biogenesis would be required to cope with an increased demand for nascent proteins. Even though, the components of the ribosome have not been identified as targets of Pho4 in yeast , more than 30% of the 155 proteins interacting with Pho4 TF  are classified as ribosome function related according to SlimGO annotation. In N. crassa, 7 ribosomal proteins are regulated by Nuc-1, and 4 of them contain CACGTG in their promoter regions . The CACGTG motif is also enriched in the upstream promoter regions of genes in cytoskeleton cluster 1,4 and cortex cluster 3. Therefore we hypothesize that in addition to Pi metabolism, Pho4 TF also regulates genes belonging to diverse functional categories including protein biosynthesis and the cell cycle.
The other top scoring motif M3, ACGTCA, matches the binding site of Sko1 in Sc and Atf1 in Sp, both of which are known TFs that regulate response to environmental stress [30, 31]. Even though both Sko1 and Atf1 are known to use the same TF binding sites (ACGTCA), their roles in stress defense and the set of genes they regulate have significantly diverged. Sko1p in Sc regulates a small set of genes in response to osmotic shock , while its ortholog Atf1 in fission yeast Sp regulates a larger gene set induced by environmental stress under diverse conditions that define Atf1 as a core environmental stress response (CESR) regulator . There is a Sko1/Atf1 ortholog in each Fusarium genome (FGSG_10142, FVEG_02866, FOXG_05265 and JGI_69482), although the Fusarium genes are more similar to Atf1 (1e-32 between FGSG_10142 and Atf1, 4.5e-9 between FGSG_10142 and Sko1). About 20 Sko1 target genes were identified in Sc , 9 of which are conserved in Fg. The binding site, M3, is conserved in 8 of them. Similar conservation and motif enrichment result is also observed among the 314 Sp induced CESR gene set (Figure 5). It is likely that the Sko1/Atf1 homolog in Fg regulates both CESR as in Sp and the osmotic defense response as reported in Sc. Consistent with this hypothesis, Sp is reported to be sensitive to osmotic stress, while both Fg and Sc are resistant to osmotic stress , which may indicate that the osmotic regulation through the Sko1 pathway may have either evolved after the split of the Sp lineage, or lost in Sp. The smaller genome size and number of genes in the two yeast systems implies that the yeasts are streamlined derivatives of a common ascomycete ancestor.
In Sc, the CESR is regulated through a different pathway using the duplicated TF gene pair MSN2/4 which bind to a sequence that differs from the Sko1/Atf1 binding site. A single MSN homologous gene is present in each Fusarium genome (FGSG_06871, FVEG_05115, FOXG_01955, and JGI_102564). The Fg genes homologous to the Sc CESR pathway genes are also enriched, but for a slightly different motif than the MSN2/4 binding sites (Figure 5). Based on all these observations, we infer that, in response to environmental stresses, the filamentous fungus Fg employs the combined regulatory pathways used in both yeast systems.
Large sets of genes are also repressed under environmental stress in Sc and Sp respectively [31, 32] and the same TF binding site was reported in both systems. About 60% and 33% of the repressed genes, from Sc and Sp respectively, share homologous genes in Fg. Two very similar motifs M19 and M21 are significantly enriched in two homologous sets accordingly (Figure 5). Clearly similar regulatory mechanisms are operating among these diverse fungal genomes, while there are more genes involved the regulation in the filamentous fungus Fg comparing to the rather streamlined yeast regulatory systems. The functional association study also revealed that these two motifs (M19 and M21) are enriched in the Cellular Ribosome GO category, cluster 2 (Figure 3). In rapidly growing yeast cells, about 60% of total transcription is devoted to ribosomal RNA, and 50% of RNA polymerase II transcription is devoted to ribosomal proteins (RPs) . Therefore repression of the genes encoding ribosomal proteins may ensure the economic use of resources under stressed conditions.
As one of the initial efforts to study whole genome regulation in filamentous fungi, it is perhaps not surprising to see many of the motifs discovered from our study have no matches in S. cerevisiae, including the top-scoring motif M1 (GCGGG). However, comparing these motifs to known TF binding sites from other filamentous fungi identified the M1 motif as highly similar to the binding site of transcription factor CreA (G/CPyGGGG), a well known repressor responsible for carbon catabolite repression in Aspergillus nidulans . By searching for other known Aspergillus transcription factor binding sites within the list of discovered Fusarium promoter motifs, we found two additional matches, including motifs M5 (GCCARG) and M26 (YGATAAG), which show high similarity to the PacC and the AreA DNA-binding sites respectively. PacC plays an important role in mediating the response to ambient pH, while AreA belongs to the GATA family of DNA-binding proteins that binds to HGATAR and mediates nitrogen metabolism repression .
In addition to the promoter motifs, our analysis also yielded a number of motifs showing significant conservation in the downstream and intronic regions (Additional file 1, Table S1), although the overall number is smaller and the motifs are less conserved when compared to the promoter region (Additional file 1, Figure S4). The most conserved intronic motif (GTNAGT, Additional file 1, Table S1), corresponding to a splice donor motif, is both abundant (6648 instances) and well conserved with C.R. (conservation rate) equal to 37%. However, among the downstream conserved motifs we didn't find the mammalian polyadenylation signal (AATAAA). The polyadenylation hexamer, though abundant, is rarely conserved in the Fusarium genome. Among the downstream regions, we found a total of 1138 occurrences, only 11 of which are conserved (C.R.: 0.9% and MCS = -3.69), which is even lower than the rate for a random hexamer (C.R.: 2.7%). The known polyadenylation signals in Sc, TATATA and TTAAGAAC [37, 38], also have a low MCS value of -1.8 (6 conserved from a total of 409 instances) and -0.3 (0 conserved from a total of 9 instances) respectively, suggesting that a different polyadenylation signal may be used in the Fusarium genomes.
The discovered downstream motifs and the splice signals in the introns show a strong directional bias, while the promoter motifs have similar conservation rate in both the forward and reverse strands (Additional file 1, Figure S5). This directional specificity is consistent with the function of the downstream motifs and the splice signals in post-transcriptional processes, in contrast with the promoter motifs that act at the level of DNA, as binding sites during transcription.
There is limited knowledge about transcriptional regulation among filamentous fungi. This study is the first attempt to systematically characterize sequence signatures that are conserved through evolution and may have the potential to function as regulatory elements among several Fusarium species, a group of agricultural important filamentous fungi. The study results in a total of 73 conserved elements in promoter regions. These promoter motifs are enriched in the upstream region of the genes involved in specific cellular functions, indicating their potential functional association with specific biological processes and validating the biological significance of our discovery. Interestingly, we observed the conservation among Fusarium species and two yeast model systems, Sc and Sp for some TFs, their binding sites and target genes regulated by these TFs for some important pathways, including phosphate metabolism and stress response. Both computational and experimental approaches were used to define modularity of regulatory network . The conservation of potential regulatory elements accompanying certain conserved modules was also reported [39, 40].
Constantly exposed to a wide range of environmental changes, including hostile conditions, environmental stress response is essential for the survival of fungi. In that sense, it is not surprising to observe a significant conservation among both induced and repressed gene sets in response to environmental stresses across divergent lineages. Our study suggests that the core environmental stress response (CESR) is a highly conserved physiological mechanism that protects cells and organisms from stressful changes in their environment. The conservation between Sc osmotic stress response pathway and the Fg gene sets suggests that this regulatory pathway may have evolved after the split of Schizosaccharomyces, or was lost in this Ascomycete basal lineage. The more complete list of genes that are repressed under ESR in Fg also suggests that both yeast systems have independently reduced portions of the pathway through evolution.
The focus of this study is on short regulatory motifs with size around 7 bp. Therefore translation-related motifs, which are typically longer, are likely missed by our method. In addition, the translation-related motifs such as internal ribosome entry site (IRES) and upstream ORFs (uORFs) typically observe different conservation characteristics than the transcription-related motifs, since what is preserved for IRES is its secondary structure and what is preserved for uORFs is their short ORF structure. Detecting these translation-related motifs likely will require a new conservation measure that can take these specific characteristics into account. It is interesting to note that a motif containing ATG (ATGACGN with the MCS >30), which could encode the start codon of uORFs, is detected in the upstream regions.
We developed a computational pipeline to systematically discover regulatory motifs in Fg by comparing its genome to three other closely related Fusarium genomes. Our analysis yielded 73 candidate motifs in promoter regions, 69 in the downstream regions, and 39 in the intronic regions. Out of the 73 motifs discovered in the promoter regions, 22 showed strong enrichment in functionally related gene clusters, and 16 showed strong sequence similarity to known motifs in Sc, altogether corresponding to 41% of the discovered motifs, suggesting that many of the motifs are likely true functional elements. Comparison of these motifs with known yeast TF binding sites revealed a significant conservation for signals involved in phosphate metabolism and ESR pathways, providing the first look into the regulation of these biological processes in Fg. Such conservation across millions of years of evolution since the divergence of yeast and Fg indicates the functional importance of these regulatory pathways. The analysis presented here demonstrates the power of comparative genomics for discovering functional elements in the Fusarium genome. As more genomes and better annotation of genes become available, the list of regulatory motifs presented here can be further refined.
Local-alignment anchors were detected using PatternHunter (1e-10) . Contiguous sets of anchors with conserved order and orientation were chained together within a 10 kb distance. The multiple alignments for all the syntenic regions that cover a common segment on the reference genome Fg were conducted using Mlogan  We used the annotation of Fg [5, 6] as a reference to define the coding sequence versus intergenic (600 bp upstream of the start codon and 600 bases downstream of the stop codon) and intronic regions in the aligned regions. For the intronic regions, we extended sequences by 2 bp both upstream and downstream in order to include the intron splice sites.
where K i is the number of conserved occurrences and N i is the number of total occurrences of Hi in the aligned genome of Fg with Fo, Fv and Fs. P o is the fraction of total conserved occurrences among the total occurrences of all heptamers. The same function is used to compute the MCS for degenerate motifs. However, the P o is calculated differently. Since P o denotes the average conservation rate of a motif m, we scan 5000 random heptamers in the genome and calculate its conservation as per the degeneracy profile of m. This conservation score is then used to compute the average conservation rate for a motif whose degeneracy profile is the same as m (See Additional file 1 for details).
We used consensus sequences to represent regulatory motifs. The sequences are spread over 11 alphabets, consisting of four nucleotides A, C, G, T, the six two-fold degenerate characters S = [C or G], W = [A or T], Y = [C or T], R = [A or G], M = [A or C], K = [G or T], and the four-fold degenerate character N = [A, C, G, or T]. A motif m occurs in the Fg genome when each nucleotide in the genome satisfies the corresponding degenerate character in the consensus sequence of m. We used a greedy approach to discover motifs with higher MCS values by adding degeneracy as described in the main text and Additional file 1.
We clustered the degenerate motifs as per their sequence similarity. We quantify the similarity between two motifs by the Pearson correlation of their equivalent position weight matrices. The weight matrices represent the frequencies of different nucleotides at each base. For example, if the base is W = (A or T), its corresponding column in the matrix will be represented as [0.5, 0, 0, 0.5] in the order of A, C, G and T. We computed the Pearson correlation coefficient between two positional weight matrices as described in Xie et al .
Since this value is usually very close to zero, we have used -log10(P-value) to denote the enrichment. For the later analysis of overlap between the set of genes derived as homologs from other eukaryotic species namely Sp and Sc, the N variable is limited to the number of genes in that species that have clear orthologs in Fg. Similarly, F and M are also reduced to only include genes with clear orthologs in Fg. Since only about a quarter of genes have orthologs in Fg, this reduction allows us to compute the effective enrichment without biasing the function towards the genes that are present in set but have no orthologs in Fg.
We derived orthologous genes from Sc and Sp to the genes in Fg for our gene set enrichment analysis based on a one-to-one map from yeast genes to the gene in Fg according to the highest similarity score of Blastp (E ≤ 10-5).
The sequenced strain of F. graminearum (PH-1; ) was used for expression analysis. Gene expression levels were determined using a custom Fusarium graminearum Affymetrix microarray (Fusariuma520094; ). Data from a spore germination time-course were used to create expression clusters  and raw data are available as experiment FG7 at the database PLEXdb . K-means clustering was performed in the Analyst module of Expressionist version 5.1 (Genedata) using raw data imported and normalized according to . The silhouette score was used to determine the optimum number of clusters for genes in each GO category using a specified range of 2-5. Default settings were used for distance, centroid calculation and maximum iterations fields.
We thank the Broad Institute for making the genomes of the three Fusarium species available before publication. LK and LJM are supported by USDA 2005-35600-16405; CK and AB are supported by NSF EnGen 0723451; and XX is supported by NSF DBI0846218.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.