Evolution of cis-regulatory elements in yeast de novo and duplicated new genes
© Tsai et al.; licensee BioMed Central Ltd. 2012
Received: 7 August 2012
Accepted: 18 December 2012
Published: 21 December 2012
New genes that originate from non-coding DNA rather than being duplicated from parent genes are called de novo genes. Their short evolution time and lack of parent genes provide a chance to study the evolution of cis-regulatory elements in the initial stage of gene emergence. Although a few reports have discussed cis-regulatory elements in new genes, knowledge of the characteristics of these elements in de novo genes is lacking. Here, we conducted a comprehensive investigation to depict the emergence and establishment of cis-regulatory elements in de novo yeast genes.
In a genome-wide investigation, we found that the number of transcription factor binding sites (TFBSs) in de novo genes of S. cerevisiae increased rapidly and quickly became comparable to the number of TFBSs in established genes. This phenomenon might have resulted from certain characteristics of de novo genes; namely, a relatively frequent gain of TFBSs, an unexpectedly high number of preexisting TFBSs, or lower selection pressure in the promoter regions of the de novo genes. Furthermore, we identified differences in the promoter architecture between de novo genes and duplicated new genes, suggesting that distinct regulatory strategies might be employed by genes of different origin. Finally, our functional analyses of the yeast de novo genes revealed that they might be related to reproduction.
Our observations showed that de novo genes and duplicated new genes possess mutually distinct regulatory characteristics, implying that these two types of genes might have different roles in evolution.
New genes arise through various mechanisms, including gene duplication, exon shuffling, gene fusion, retroposition, mobile elements, lateral gene transfer, and de novo origination [1–3]. Although new genes are considered to be fairly dispensable, their role in adaptive evolutionary innovation has been investigated. Most of the studies have focused on the cellular, physiological, morphological, behavioral, and reproductive phenotypic traits associated with new genes [1, 5–7]. A recent study found that 30% of the new genes in Drosophila quickly evolved essential functions that allowed them to participate in development. Using pre-existing genes as the raw material, duplicate genes rapidly developed essential functions that were not present in the pre-duplication gene through the processes of neofunctionalization or subfunctionalization. In addition, neofunctionalization and subfunctionalization of transcription factor binding sites (TFBSs) can explain the novelty that occurs in the regulatory region of duplicated new genes [10–12]. The de novo origination of genes, in which genes arise from previously nonfunctional genomic sequences, is a rare and intriguing process [13, 14]. It is believed that a new coding region could emerge through mutations that remove disruptions of a proto-open reading frame. Positive selection in the coding sequences has been reported, suggesting that adaptive protein evolution had occurred.
De novo gene evolution was first investigated in Drosophila melanogaster in 2006. Five novel genes were identified experimentally as derived from ancestral non-coding sequences and evolved as the result of a selection process associated with male reproduction. In Saccharomyces cerevisiae, the first identified de novo gene was BSC4. Population genetic analysis suggested that BSC4 was under strong negative selection at the nonsynonymous sites. A de novo transcript in Mus musculus was found to have emerged in an intergenic region because of indel mutations in the 5’ regulatory region; the transcript was fixed by a selective sweep in M. musculus populations. Other de novo genes have been identified in various species; for example, CLLU1 and FLJ33706 in Homo sapiens [18, 19], MDF1 in S. cerevisiae, DR10 in Oryza sativa, and Noble in D. melanogaster. In addition, several genome-wide analysis studies have identified numerous de novo genes in various species, and the importance of such genes in adaptive evolution has been discussed [23–28]. For example, in D. melanogaster, a study based on expressed sequence tags identified eleven putative de novo genes, and de novo origination was estimated to be responsible for 11.9% of the new genes. In H. sapiens, 60 protein-coding genes were identified as de novo genes that were highly expressed in the cerebral cortex. These findings indicate the importance of de novo genes in phenotypic diversity and evolutionary adaptation. Nevertheless, the regulatory evolution of de novo genes is not yet fully understood. A prevalent view is that de novo genes do not possess complicated regulatory control and, therefore, only a functional transcription start site would be required for transcription initiation. However, because de novo genes might play important roles in development, the view that only a simple regulatory control mechanism is used remains open to speculation.
Several genome-wide studies have attempted to describe the characteristics of regulatory evolution [30, 31]. Frequent gain or loss events of TFBSs (TFBS turnover) have been identified as an important feature of regulatory evolution, and have been found to exhibit lineage specificity in transcriptional regulation [32–34]. A previous study showed that duplicated new genes inherit more than a third of the regulatory interactions from their ancestral genes . Moreover, the expression of duplicated genes often benefits from the preexisting regulatory mechanism . After gene duplication, positive selection on cis-regulatory motifs leading to dramatically accelerated rates of cis-regulation compared with the orthologs has been observed . In S. cerevisiae, it has been shown that the number of shared TFBSs in duplicate genes decreased with evolution time whereas the total number remained unchanged, suggesting that there is a balance between gain in functionally novel TFBSs and either the loss of preexisting TFBSs or the modification of preexisting TFBSs to new functions . Nonetheless, de novo genes evolve from non-coding sequences based on the cryptic presence of functional sites, including a transcriptional start site and upstream regulatory elements . The question of how de novo genes that have no parent gene obtain regulatory elements and further establish complex regulatory mechanisms has yet to be determined.
We conducted a genome-wide investigation of de novo genes in S. cerevisiae to investigate regulatory evolution in the initial stages of gene emergence. One of the challenges is that the conventional methods that are used for de novo gene identification are known to overestimate their numbers because of the high number of false positives that are generated. Recently, Capra et al. developed a computational pipeline to identify de novo genes in yeast and to understand the evolution of protein interaction networks involving the novel genes. They identified 227 de novo genes that originated after whole-genome duplication (WGD), and found that initially the de novo genes had fewer interactions, but subsequently gained interactions more rapidly than duplicated new genes. Here, we modified their pipeline to identify S. cerevisiae-specific de novo genes that emerged after divergence from S. paradoxus, instead of after WGD. The stringent criteria that we used to identify de novo genes would aid our observation of cis-regulatory element evolution during the initial stage of gene emergence. Using our modified method, we identified 34 de novo genes that were specific to S. cerevisiae (i.e., without either paralogous genes or orthologous genes in any other species). To analyze the cis-regulatory evolution of genes that had emerged from different origins and had different ages, we identified duplicated new genes (new genes with paralogous genes) and orthologous genes (well-conserved genes with orthologous genes in all seven yeast species) and compared the characteristics of cis-regulation in each. We found a higher number of TFBS gain events and higher evolution rates in the promoters of new genes (both de novo and duplicated new genes) than in those of old (orthologous) genes. Our findings suggested that the promoters of new genes might experience adaptive evolution as their functions become established.
Furthermore, we investigated the nucleosome architecture in the promoter regions, which might be associated with transcriptional regulation and the evolution of eukaryotic genes [39–46]. Our results revealed significantly lower occupancy of proximal nucleosomes and lower enrichment of the TATA box in promoters of de novo genes compared with duplicated new and orthologous genes, suggesting that de novo genes might employ different regulatory strategies from duplicated genes. Finally, functional analyses revealed that de novo genes might play roles in reproduction-related functions.
We retrieved 481 position frequency matrices from the MYBS database which integrates ChIP-chip data and phylogenetic footprinting data in yeast . To remove redundant motifs, we integrated all the recorded motifs for each transcription factor (TF) using the STAMP web server which calculates the similarity of various motifs and integrates them into a familial binding profile . A total of 175 familial binding profiles were generated and converted into position weight matrices (PWMs) by the PATSER software using the default settings . Putative TFBSs were obtained by scanning PWMs with a threshold p-value of <0.001  (TFBSs identified under different thresholds were also investigated to examine the robustness of our study in the Additional file 2: Supplementary Document). Next, putative TFBSs that were not documented in the curated YEASTRACT database, which documents 48,333 regulatory associations between TFs and their target genes , were excluded. We then characterized TFBSs based on whether they were newly gained (i.e., did not exist before gene origination) or were preexisting TFBSs (i.e., already existed before gene origination). The characterization entailed scanning the corresponding regions of S. paradoxus and S. mikatae, the two yeast species most closely related to S. cerevisiae, for each of the TFBSs that were identified in S. cerevisiae. The corresponding regions, defined as the regions that extended 25 bp upstream and downstream of the aligned region of a TFBS , were retrieved from multiz7way . A TFBS gain event was defined as a TFBS in S. cerevisiae that did not possess an occurrence of its motif within the corresponding regions in S. paradoxus and S. mikatae. A preexisting TFBS was defined as possessing occurrences of its motif within the corresponding regions in S. paradoxus, S. mikatae, or both. TFBS losses of de novo genes were not investigated because no ancient gene exists; that is, no functional TFBS existed before the de novo gene emerged.
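The gain versus preexisting rule described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the coordinate convention, and the input format (a mapping from species name to the set of motif names found in the corresponding region) are assumptions made for illustration, and the motif scanning itself (PATSER in the study) is taken as already done.

```python
from typing import Dict, Set, Tuple

# Hypothetical labels for the two closest relatives used in the comparison.
RELATIVES = ("S. paradoxus", "S. mikatae")


def corresponding_region(aln_start: int, aln_end: int, flank: int = 25) -> Tuple[int, int]:
    """Extend the aligned TFBS interval by `flank` bp on each side.

    Coordinates are assumed to be 0-based positions in the aligned
    sequence of the related species; the start is clipped at 0.
    """
    return max(0, aln_start - flank), aln_end + flank


def classify_tfbs(motif: str, hits_by_species: Dict[str, Set[str]]) -> str:
    """Label a S. cerevisiae TFBS as 'gain' or 'preexisting'.

    `hits_by_species` maps a species name to the motif names detected in
    its corresponding region. A species missing from the mapping is
    treated as having no occurrence (an assumption of this sketch).
    """
    found = any(motif in hits_by_species.get(sp, set()) for sp in RELATIVES)
    return "preexisting" if found else "gain"
```

For example, a site whose motif is detected in the S. paradoxus region but not in S. mikatae is labeled preexisting, whereas a site absent from both regions is counted as a gain event.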
In this study, we used an S. cerevisiae genome-wide reference map of nucleosome positions that integrated six high-resolution genome-wide maps from multiple laboratories and detection platforms. To exclude relatively depleted nucleosomes, only nucleosomes with >50% occupancy were considered. Tirosh et al. defined two gene categories according to different characteristics of the promoter nucleosomes and found that the two categories possessed different regulatory strategies. We modified the procedure proposed by Tirosh et al., and identified two categories according to the presence of nucleosomes in the TSS-proximal region (from TSS up to −100) and the TSS-distal region (from −300 to −400), as follows: (a) genes with a nucleosome in the TSS-proximal region but with none in the TSS-distal region, referred to as occupied proximal nucleosome (OPN) genes; and (b) genes without a nucleosome in the TSS-proximal region but with one in the TSS-distal region, referred to as depleted proximal nucleosome (DPN) genes.
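The OPN/DPN rule can be sketched in a few lines of Python. The sketch assumes that each nucleosome is represented by its center position relative to the TSS and by an occupancy between 0 and 1, and that a nucleosome is "in" a region when its center falls inside it; the text does not specify these details, so they are illustrative choices, as are the function names.

```python
from typing import Iterable, Tuple

# (center position relative to the TSS, occupancy in the range 0..1)
Nucleosome = Tuple[int, float]


def has_nucleosome(nucs: Iterable[Nucleosome], lo: int, hi: int,
                   min_occupancy: float = 0.5) -> bool:
    """True if any nucleosome with occupancy > min_occupancy is centered in [lo, hi]."""
    return any(lo <= pos <= hi and occ > min_occupancy for pos, occ in nucs)


def classify_promoter(nucs: Iterable[Nucleosome]) -> str:
    """Return 'OPN', 'DPN', or 'unclassified' for one promoter."""
    nucs = list(nucs)  # allow the input to be scanned twice
    proximal = has_nucleosome(nucs, -100, 0)    # TSS-proximal region
    distal = has_nucleosome(nucs, -400, -300)   # TSS-distal region
    if proximal and not distal:
        return "OPN"
    if distal and not proximal:
        return "DPN"
    return "unclassified"  # both regions occupied, or neither
```

Under these assumptions, a promoter with a single well-occupied nucleosome centered at −50 is OPN, one with a single nucleosome centered at −350 is DPN, and promoters with nucleosomes in both regions or in neither fall outside the two categories.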
The Serial Pattern of Expression Levels Locator (SPELL) database was used to identify the potential functions of the S. cerevisiae de novo genes. SPELL is a query-driven search engine for large gene expression microarray compendia containing more than 2,400 experimental conditions. It has been used to identify the most informative expression data sets and to interpret relevant genes for a given set of query genes. We queried the SPELL database using the de novo genes and identified the top 100 relevant genes that were most similarly expressed across all data sets. SPELL then assigned the Gene Ontology (GO) terms from the identified genes to the queried de novo genes. Significance was tested using the Bonferroni-corrected Fisher’s exact test with the q-value set to <0.01. We also conducted TFBS enrichment analysis to identify TFs that might be responsible for the regulation of the de novo genes. The identification was based on a binomial test, in which the null hypothesis states that the probability of finding the TFBSs in de novo genes is less than or equal to that of all the other genes in the S. cerevisiae genome.
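The one-sided binomial test for TFBS enrichment can be written out directly. The following Python sketch assumes that the background probability p0 is estimated as the fraction of all other genes carrying a site for the TF, and that the p-value is the upper tail P(X ≥ k); the function names and the simple Bonferroni helper are illustrative and are not taken from the study.

```python
from math import comb


def binomial_enrichment_p(k: int, n: int, p0: float) -> float:
    """Upper-tail binomial p-value P(X >= k) for X ~ Binomial(n, p0).

    k  : number of de novo genes with a site for the TF
    n  : total number of de novo genes tested
    p0 : assumed background probability (fraction of other genes with the site)
    """
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def bonferroni(p: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)
```

A small p-value rejects the null hypothesis that the TF's sites are no more frequent in de novo genes than in the rest of the genome. When many TFs are tested, each p-value would be adjusted for the number of tests before being compared with the significance threshold.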
Another crucial architectural motif in the promoter is the TATA box. The expression of TATA-containing genes is highly regulated, responsive to stress, sensitive to chromatin regulators, and variable across different species [62, 63]. We found that the proportion of TATA-containing genes (consensus TATA(A/T)A(A/T)(A/G) within −50 to −200) was significantly lower in de novo genes (12.1%) than in the whole S. cerevisiae genome (23.3%) (one-sided two-sample proportion test, p = 0.0037). In contrast, the proportions of TATA-containing genes in the duplicated new genes (71.4%) and orthologous genes (35.9%) were significantly higher than in the whole S. cerevisiae genome (one-sided two-sample proportion test, p = 3.5 × 10⁻⁷ and 1.1 × 10⁻⁹, respectively) (Figure 5B). Overall, our findings indicated that de novo genes were dominated by DPN genes and included few TATA-containing genes, whereas duplicated new genes were dominated by OPN genes and TATA-containing genes. These results suggested that the two types of new genes may possess different regulatory strategies.
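As an illustration of the two steps involved here, the Python sketch below searches for the TATA consensus in the −200 to −50 window and compares two proportions with a one-sided z-test. Several details are assumptions rather than the authors' procedure: the upstream sequence is taken to be the sense strand ending immediately before the TSS (so its last base is position −1), the motif must lie entirely within the window, and the proportion test uses a pooled-variance normal approximation without continuity correction, so it is not expected to reproduce the reported p-values exactly.

```python
import re
from math import erfc, sqrt

# Consensus TATA(A/T)A(A/T)(A/G) written as a regular expression.
TATA_RE = re.compile(r"TATA[AT]A[AT][AG]")


def has_tata(upstream: str) -> bool:
    """True if the consensus occurs wholly within positions -200..-50.

    `upstream` is assumed to be the sense-strand sequence whose last
    character is position -1 relative to the TSS.
    """
    window = upstream.upper()[-200:-49]  # positions -200 through -50 inclusive
    return TATA_RE.search(window) is not None


def prop_test_less(x1: int, n1: int, x2: int, n2: int) -> float:
    """One-sided p-value for H1: x1/n1 < x2/n2 (pooled two-sample z-test)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both samples all-success or all-failure: no evidence
    z = (x1 / n1 - x2 / n2) / se
    return 0.5 * erfc(-z / sqrt(2))  # standard normal CDF evaluated at z
```

To test for a higher rather than a lower proportion, the two samples can simply be swapped in the call to `prop_test_less`.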
Predicted GO terms for 34 de novo genes by SPELL
sexual sporulation resulting in formation of a cellular spore
cellular process involved in reproduction
spore wall biogenesis
ascospore wall biogenesis
ascospore wall assembly
spore wall assembly
fungal-type cell wall assembly
cell wall assembly
sporulation resulting in formation of a cellular spore
Predicted GO terms for 56 de novo genes (including 22 de novo genes with short promoters or poor alignments in promoters) by SPELL
M phase of meiotic cell cycle
meiotic cell cycle
cellular process involved in reproduction
We investigated the emergence of cis-regulatory elements in de novo genes. Specifically, 56 de novo genes were identified as having emerged in S. cerevisiae since separation from S. paradoxus approximately 5 million years ago. It has been shown that different approaches for de novo gene identification may yield different results. For example, Capra et al. investigated all the de novo genes since WGD. This strategy ensured that the possibility of having orthologous genes in any species before WGD was avoided, but genes in the closely related species after WGD were allowed. Wu et al., on the other hand, considered only the de novo genes without any orthologous genes but with highly similar orthologous regions and frame-shifts in two closely related species. In short, Capra et al. discussed the evolution of de novo genes over a relatively large time-scale, whereas Wu et al. analyzed the characteristics of de novo genes that originated immediately by one-step mutations from closely related species. In this study, we attempted to understand the evolution of regulatory elements, which requires sufficient evolution time to accumulate mutations. Therefore, we considered a time-scale that fell between the time-scales of the above two studies. We did not focus on the de novo genes that immediately emerged one-step away from non-coding regions as in Wu et al., because the promoters of these genes might not have experienced sufficient evolution time.
Our results showed that the promoters of new genes (of both de novo and duplicated origin) possessed similar numbers of regulatory TFs and TFBSs compared with those in orthologous genes. This finding suggested that TFBSs might be established rapidly after the emergence of a new gene and could be explained by the frequent occurrence of TFBS turnover, a well-documented phenomenon in eukaryote cis-regulation . For example, frequent TFBS gain events in duplicated genes were found to play a critical role in the regulatory evolution of the yeast genome . Papp et al. found that the numbers of TFBSs in the promoters of duplicated genes remained constant over evolutionary time, whereas the numbers of shared motifs from a preexisting gene decreased, perhaps because of a balance between the gain of new TFBSs and the loss of TFBSs from parent genes .
The promoters of de novo genes, which evolved from non-coding regions instead of being duplicated from the promoters of parent genes, might be expected to have a different frequency of TFBS gain events than those of duplicated genes. However, our analyses showed that the de novo and duplicated new genes exhibited similar numbers of TFBS gain events. A simple explanation could be that preexisting TFBSs in the promoters of the de novo genes were more plentiful than previously expected. Indeed, our results indicated that more than half of the TFBSs in the promoters of de novo genes were preexisting TFBSs, which supports this explanation. Together with the observation of high substitution rates in the promoters of de novo genes, our results further suggested that the promoters experienced adaptive evolution and frequent gain events. Both these phenomena would rapidly increase the number of TFBSs in de novo genes to a level comparable with the number found in orthologous genes. In addition, the higher substitution rates in the promoters of de novo genes compared with those of neutral sequences (i.e., the four-fold degenerate sites) suggested that the new genes might experience positive selection during the establishment of cis-regulatory motifs. Our results agree with a previous study of protein interaction networks, which found that, although de novo genes initially had fewer functions and protein interactions than duplicated new genes, de novo genes rapidly gained functions and protein interactions until the numbers were comparable to those of duplicated new genes.
Research has shown that duplicated genes often inherit cis-regulatory elements from their parent genes, thereby benefiting from preexisting regulatory mechanisms [35, 36]. However, we found that de novo genes had a similar proportion of preexisting TFBSs in their promoters as duplicated new genes, and we propose three possible explanations for this observation. First, studies have shown that non-functional TFBSs reside throughout the intergenic regions in the genome; for example, it was reported that TFs can bind to substantial numbers of non-functional TFBSs regardless of their weak binding strength. Second, although we removed head-to-head genes that share core promoters, there still might be cases in which the promoters are shared. The promoter of a de novo gene may partially overlap with the distal promoter of a neighboring gene, especially in yeast, which has relatively short intergenic regions. Moreover, while non-functional TFBSs determined by documented regulatory associations in YEASTRACT have been removed (i.e., the pair of head-to-head genes would not have exactly the same set of TFBSs), some TFBSs may still be shared. These shared TFBSs could explain the unexpectedly high proportion of preexisting TFBSs in de novo genes. Third, there may be a number of false positives in the computational identification of the TFBSs. Although we filtered out non-functional TFBSs in S. cerevisiae according to the regulatory associations documented in the YEASTRACT database, similar information in the other yeast species is insufficient to eliminate all the potential false positives. Thus, the numbers of TFBSs in other yeast species and consequently the number of preexisting TFBSs might have been overestimated.
The promoter architecture of new genes is an intriguing issue to explore because it has been associated with the gene origination mechanisms . We found that duplicated new genes were enriched with OPN genes and TATA-containing genes; whereas, most de novo genes were TATA-less and enriched with DPN genes. The association between DPN and TATA-less promoters in de novo genes is consistent with the report that TATA-less promoters usually have clearer nucleosome free regions than TATA-containing genes [45, 72]. Additionally, TATA box and OPN enrichment has been reported in the promoters of duplicated genes [44, 73]. OPN and TATA-containing genes are relatively adaptable to environmental changes and are associated with processes that require high expression variation, such as transcriptional plasticity, sensitivity to chromatin regulation and genetic perturbations, expression noise, and expression divergence. In addition, TATA-containing genes are often highly regulated and are associated with inducible responses to stress or biotic stimuli [45, 62, 63, 74]. DPN and TATA-less genes, on the other hand, display relatively low expression variation and constitutive expression, and TATA-less genes are lightly regulated by chromatin regulators, unresponsive to stress, and related to basic housekeeping functions in yeast and human [62, 63, 75]. The functions of TATA-less genes are enriched in basic processes such as cell growth and maintenance, protein biosynthesis, large ribosomal subunit, and mitochondrion , and these known functions are consistent with the results of our functional analyses of de novo genes. Furthermore, the promoters of the TATA-containing genes are TAF-independent and dominated by the Spt-Ada-Gcn5 acetyltransferase complex (SAGA), while the promoters of the TATA-less genes are TFIID-dominated and highly TAF-dependent despite there being a common set of TAFs that are shared by SAGA and TFIID . 
Taken together, the differences in TATA enrichment and nucleosome occupancy (OPN or DPN) between the two types of new genes indicate that they employ distinct regulatory mechanisms. These findings agree with the suggestion by Capra et al. that the function and fate of new genes are associated with their origins. Our functional analysis using SPELL suggested that de novo genes might contribute to cellular processes that are involved in reproduction, such as sporulation and the formation of cellular spores and the cell wall. Differences in sporulation patterns and sporulation efficiencies between S. cerevisiae and S. paradoxus have been observed. Also, germinating spores of S. cerevisiae show a higher preference for own-species mating than the spores of S. paradoxus. In addition, the enrichment of DPN genes and TATA-less genes that we found in the de novo genes agrees with the observation that the genes involved in sporulation and division are constitutively expressed.
We used SPELL to predict the functions of de novo genes because of the lack of functional annotations in de novo genes. However, SPELL has various limitations. Given a set of query genes, SPELL identifies the expression microarray datasets that are most informative for these genes. Then additional genes that have the most similar expression profiles to the query genes are identified in the datasets. According to the functions of the additional genes, SPELL generates hypothetical functions for the query genes. However, the assignment of the functions is for the most part limited to the microarray datasets and GO annotation. Moreover, correlations of the expression patterns among a set of co-functional genes might not always be significantly high, because the genes need not be co-expressed at all the experimental time points. Because of these limitations, the functions assigned by SPELL may reveal only partial, and sometimes inaccurate, roles of de novo genes.
In addition to the SPELL functional predictions, we provided further support for the predicted de novo gene function by examining the function of their TFs. We identified BAS1, GCN4 and GCR1 as regulators of de novo genes. Interestingly, studies suggest that all three of these TFs are related to meiotic recombination, a process in reproduction: mutations in BAS1 affect the frequency of aberrant segregation of the recombination hotspot at the HIS4 locus, lessen the recombination distance, and alter the frequency of meiosis-specific double-strand DNA breaks [65, 66]; deletion or constitutive expression of GCN4 affects the frequency of gene conversion and crossing-over at the HIS4 locus; and removal of GCR1-binding sites reduces the expression of REC102, a gene required for the initiation of meiotic recombination. Based on previous studies and the findings in this study, we propose that de novo genes may play an important role in reproduction.
Although the functions of most de novo genes have not been well investigated, some of their specific roles have been addressed [1, 2, 27]. For example, Wu et al. have analyzed the transcriptome of numerous human tissues and found that de novo genes are highly expressed in the testes and cerebral cortex, which plays key roles in cognitive abilities . The authors suggested that the de novo genes might contribute to phenotypic traits that are unique to humans . Our results also suggest that new genes from different origins may play distinct roles in the evolutionary process. While duplicated new genes have been shown to be involved in environmental adaptation , we hypothesized that de novo genes might contribute to evolutionary innovation in reproduction processes like sporulation efficiency. Further studies are required to examine this hypothesis; nevertheless, the computational approaches that were used in this study shed some light on the evolution of cis-regulation in de novo genes.
Our study showed that the number of TFBSs in de novo genes increased rapidly after gene emergence, so that de novo genes soon had a number of TFBSs comparable to that of the orthologous genes. We suggested that frequent TFBS gain events, an unexpectedly large number of preexisting TFBSs, and the lower selection pressure experienced in the promoters of de novo genes compared with orthologous genes could be the major reasons for this finding. Moreover, we found that new genes from different origins (de novo or duplication) have distinct regulatory characteristics (de novo genes were dominated by DPN and TATA-less genes; duplicated new genes were dominated by OPN and TATA-containing genes). Furthermore, we found that the predicted GO terms related to reproduction processes were enriched in de novo genes. Taking all of our results together, we concluded that de novo genes and duplicated new genes might play distinct roles in evolution.
This work was supported by the National Science Council of Taiwan (Grant No: 99-2621-B-001-005-MY2 and NSC100-2628-E-001-006-MY3).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.