Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution
© Neme and Tautz; licensee BioMed Central Ltd. 2013
Received: 6 November 2012
Accepted: 15 February 2013
Published: 21 February 2013
New gene emergence is so far assumed to be mostly driven by duplication and divergence of existing genes. The possibility that entirely new genes could emerge out of the non-coding genomic background was long thought to be almost negligible. With the increasing availability of fully sequenced genomes across broad scales of phylogeny, it has become possible to systematically study the origin of new genes over time and thus revisit this question.
We have used phylostratigraphy to assess trends of gene evolution across successive phylogenetic phases, using mostly the well-annotated mouse genome as a reference. We find several significant general trends and confirm them for three other vertebrate genomes (human, zebrafish and stickleback). Younger genes are shorter, with respect to both gene length and open reading frame length. They also contain fewer exons and have fewer recognizable domains. Average exon length, on the other hand, does not change much over time. Only the most recently evolved genes have longer exons, and they are often associated with active promotor regions, i.e. are part of bidirectional promotors. We have also revisited the possibility that de novo evolution of genes could occur even within existing genes, by making use of an alternative reading frame (overprinting). We find several cases among the annotated Ensembl ORFs where the new reading frame has emerged at a higher phylostratigraphic level than the original one. We discuss some of these overprinted genes, which also include the Hoxa9 gene, where an alternative reading frame covering the homeobox has emerged within the lineage leading to rodents and primates (Euarchontoglires).
We suggest that the overall trends of gene emergence are more compatible with a de novo evolution model for orphan genes than a general duplication-divergence model. Hence de novo evolution of genes appears to have occurred continuously throughout evolutionary time and should therefore be considered as a general mechanism for the emergence of new gene functions.
The hallmark of a new gene (or orphan gene) is that it arises at some time within the evolutionary lineage towards an extant organism and has no similarity with genes in organisms that have split before this time [1–3]. This distinguishes orphan genes from genes that arise through full or partial duplication processes to form paralogous genes or gene families [4, 5]. It has been proposed that orphan genes are likely to play a major role in lineage specific adaptations [1–3, 6] and thus contribute to evolutionary innovations. There are two major models of how orphan genes can arise. The first is the duplication-divergence model, which assumes that they emerge through an initial duplication of other genes, followed by rapid divergence, such that all similarity to the parent gene is lost. The alternative is the de novo evolution model, which assumes that genes can directly arise out of non-coding DNA. Although this second possibility seemed initially rather unlikely, such genes have been found in Drosophila [8–10], yeast [11, 12], mouse, Plasmodium, plants and humans [16–18]. In fact, there is now increasing evidence that de novo evolution may be rather frequent. Studies in yeast have suggested that a large number of transcripts without annotation are actively transcribed and translated [19, 20] and that such transcripts could be a source for de novo gene emergence (called “proto-genes”) [7, 20].
We have developed phylostratigraphy as a method that identifies the genes that have arisen at each stage of a series of phylogenetically relevant splitting events. This allows a systematic study of the characteristics of such genes over time [22–25]. Using this approach we found that gene emergence rates are particularly high in the youngest lineages, implying a very active process of de novo evolution, since the times considered for these youngest lineages are too short for the duplication-divergence model to apply. This is in agreement with the proto-gene concept, where non-coding transcripts are considered as possible sources of new genes [19, 20]. However, a study of emergence trends across the whole phylogeny is still missing.
In the present paper we use the mouse as a focal species, which has a particularly well annotated genome. We show that it is indeed possible to derive distinctive patterns for gene emergence, which appear to be generally in accordance with a de novo evolution model. As a special case of de novo evolution, we revisit the possibility that existing genes have developed an independent second reading frame. Evolution of new genes within such double reading frame arrangements has been known for some time [26, 27] (called “overprinting” by ). Such arrangements have been well studied in viruses [28, 29], but several examples are also known from eukaryotes and have been studied in detail for some genes [30–32]. Chung et al. [33] provided a first systematic approach to identify such alternative reading frames (ARFs) in mammals and suggested 40 candidate genes which appeared to use ARFs. We find here that it is indeed possible to retrieve, even among annotated genes, additional cases of overprinting, where the alternative reading frame maps to a different phylostratum than the original reading frame. This suggests that existing genes may readily become templates for de novo evolution of new gene functions within them, further supporting the notion that de novo evolution of gene functions is possible.
The duplication-divergence model and the de novo evolution model for orphan gene emergence make different predictions with respect to gene emergence over time, for example on length distributions and exon distributions, as detailed below. Apart from looking for such differential predictions, it is also of interest to assess general patterns, such as orphan gene distribution across the genome, as well as the emergence of associated promotors. Below, we first describe how we assign the genes to different age classes and then use this assignment to study gene emergence trends and patterns.
Phylostratigraphy of mouse genes
Approximately 60% of the annotated protein coding genes in the mouse genome originate from prokaryotic and basal eukaryotic ancestors (ps1-2). The rest of the genes have emerged later in the phylogenetic history, with peaks correlating to large scale biological transitions. For example, the peak around ps6 represents the single-cell to multicellular organism transition and the peak around ps11-12 represents the invertebrate to vertebrate transition. Another peak is evident at ps20, representing all genes that have evolved since the rat/mouse split. Although this may partly be ascribed to annotation problems within the youngest group of genes, many of them are likely to represent de novo evolved genes, since mouse and rat are so close to each other that any duplicated gene would easily be traceable, even if it evolved at the rate of a non-functional pseudogene.
Genomic features across ages
List of overprinted genes detected via a phylostratigraphic approach based on annotated ORFs in Ensembl:
- Same start as main gene, but acquired additional exons
- Same start as main gene, but acquired an additional internal exon
- New initiation codon creates second reading frame
- Same start, but new splice variant; paralog of Gm4723
- Same start, but new splice variant; paralog of Gm8898
- New starting exon initiates a separate reading frame
- Same start, alternative splicing leads to new reading frames
- Same start as main gene, but acquired an additional internal exon
- Gain of alternative second exon induces a shift from the older frame
- Alternative first exon and last exons, common second exon
- Alternative transcription start site and start codon
- New starting exon initiates a separate reading frame. Also known as Arf, Pctr1, MTS1, Ink4a
- New initiation codon creates second reading frame. Also known as Nesp, GPSA
A differential prediction can also be made for the expected correlation with protein domain emergence. De novo evolved proteins will initially have no domains which are shared with other genes, while duplicated genes would tend to retain domains of their parental genes. Hence, the de novo evolution model would predict domain gain over time, while no distinct pattern is expected for the duplication-divergence model. Again, we indeed find a strong time-dependence with a continuous trend for domain emergence (Figure 2C; Table 1), supporting the de novo model.
De novo emerged genes should also have initially fewer exons, but could be expected to accumulate additional ones over time. In the duplication-divergence model, on the other hand, one would not expect a time dependency of exon numbers, since this mechanism should work the same at every time horizon. However, we find a strong trend of exon gain over time (Figure 2D; Table 1), supporting the de novo model.
Average exon length, on the other hand, shows no clear age-dependence (Figure 2E). Only the youngest genes (ps20) have significantly longer exons (Figure 2E), suggesting a fast secondary acquisition of introns after gene emergence, or gene fusion effects.
Association with transcriptionally active sites
With respect to the de novo model, it is particularly interesting to ask whether the most recently evolved genes are associated with such marks, since this could imply that they tend to make use of existing promotors upon their emergence. We indeed find a significant over-representation of transcriptional marks for genes that have emerged in ps20 (Figure 5A). This would suggest that the transcription of de novo evolved genes is initially often dependent on the proximity to an existing transcriptionally active region. Intriguingly, however, the ps19 genes show a significant under-representation with respect to the association of these three marks. This would suggest that new genes acquire their own regulatory elements rather quickly, independent of the standard marks.
To explore this pattern further, we analyzed each of the three marks separately and further distinguished between unidirectional and bidirectional promotors (Figure 5B-C). The latter are the most evident candidates of cases where newly evolved genes take advantage of an existing regulatory region. We find that bidirectional promotors are indeed consistently over-represented in genes from ps20 for all three marks.
Testis expressed genes
Testis is known to have the largest number of tissue-specifically expressed genes, many of which are newly evolved genes. It has therefore been suggested that new genes arise predominantly first in the context of testis expression, before acquiring roles in other tissues (the “out of testis hypothesis”).
Alternative reading frames
The second example of overprinting is Polr1d, a subunit of RNA polymerase I and III, which has acquired two additional exons at the end of the ancestral gene. Alternative splicing thus leads to a new protein that shares only the start codon and a few initial amino acids with the ancestral gene (Figure 7B). The ancestral protein maps to ps2, the derived one to ps5, i.e. this arrangement with two protein products from the same gene region is highly conserved.
The third example is Hoxa9, one of the canonical Hox genes involved in anterior-posterior patterning. In this case, the ancestral gene has first acquired an additional intron that leads to a truncated version of a protein, an arrangement that is conserved from birds to mammals (ps14). On top of this, an additional 5′-exon, driven by a new promotor, has evolved within the Euarchontoglires (ps18). This splices to the acceptor of the new intron and thus creates a new reading frame (Figure 7C). Interestingly, this reading frame covers the homeobox and is conserved between primates and rodents.
The trends described above provide new insights into the modes of gene emergence over time. Of the two models, de novo evolution versus duplication-divergence, de novo evolution seems to be more compatible with these trends. But before coming to the interpretations, we would first like to discuss the technical aspects of our approach.
We rely generally on blastp searches for assigning the genes to phylostrata. There have been extensive simulation efforts that have shown that this is an adequate procedure. However, if one added manual curation, including the use of a combination of different search algorithms, one would indeed assign a number of genes to older phylostrata. On the other hand, we are focusing here on general trends, not on absolute numbers. Given that most of these trends are robust, both with respect to statistical testing and with respect to their confirmation in the much less well annotated fish genomes, we consider the possible misclassification problem to be small.
We relate our analysis only to the currently annotated Ensembl reading frames, although these are in a constant flux, due to curation and further refinement of annotation procedures. In fact, it has already been noted that the currently available annotations underestimate the number of orphan genes, since finding a homologue for a gene is one accessory criterion for annotation. This affects mostly the genes from ps20, which are under-represented [3, 9], although they are the best candidates for ongoing de novo evolution. Hence, although some noise is expected in the data and the assignment fidelity, it would be very unlikely that a systematic artifact causes the trends observed.
De novo evolution versus duplication-divergence
The de novo emergence of a gene out of non-coding DNA requires only some form of transcription, as well as simple signals that define its start and its end and possibly splice sites, as well as some open reading frame [3, 7]. Since all of these signals are rather short, they are expected to occur frequently even in random sequences. Genes emerging from such random combination of signals have been called proto-genes [7, 20] and analysis of ribosome association profiles in yeast has suggested that they are abundantly translated [19, 20]. Accordingly, they could easily serve as a continuous source of short genes that are ready to become recruited to functional pathways and can then become more complex over time. Hence, new genes that arise according to this model would initially be short, have few introns and domains and would often be associated with existing regulatory elements. These are indeed the overall trends that we observe.
The duplication-divergence model, on the other hand, seems much less compatible with these trends. Under this model, one would expect that the new gene should inherit the gene structure from the parental gene. Since long and short genes should equally often be the source of new genes, and since duplications should happen similarly at all time horizons, one would not expect to see the dependence between age and length features.
Domain number is also highly correlated with age, with younger genes having far fewer domains. This is not a simple effect of the similarity searches that we have used, since the domain annotation in Interpro is based on a combination of a variety of different procedures that go beyond blastp matches. Hence, this observation confirms that not only new genes, but also new domains can arise over time [42, 43]. On the other hand, only half of the genes contain known domains, i.e. having a domain is not a prerequisite of protein function. In fact, many proteins are known to be intrinsically unstructured [44–46].
It is still unclear how a new gene can acquire its regulatory elements. One possibility is that there are many cryptic transcriptional initiation sites around the genome. Indeed, it appears that most of the genome becomes transcribed at some time [47, 48]. However, much of this may be co-transcription or spurious initiation. Moreover, to allow a transcript to become functional (i.e. to become subject to positive selection), it requires some form of stable and heritable regulation. We have therefore evaluated the possibility that new genes make use of existing promotors. It is known that RNA polymerase II promotors have a general tendency for divergent transcription within the nucleosome-free region associated with most promotors [49, 50]. We find indeed an enrichment of general signatures of active promotors in association with the most recently evolved genes (ps20). This is mostly due to bidirectional promotors, where the general tendency of RNA PolII for bidirectional transcription may have become extended to form a new transcript. Intriguingly, the next phylostratum (ps19) shows an under-representation of genes among bidirectional promotors, which would suggest that a new gene that has become functional could rather quickly gain its own independent promotor elements.
Another way of making use of an existing promotor is to develop an alternative reading frame within an existing gene. This can be caused by the acquisition of an alternative splice variant, whereby the original start codon is retained (e.g. in Polr1d). Alternatively, a separate start codon becomes used that initiates a different reading frame (e.g. Reep6). This has long been thought to be very unlikely, mostly because of the common notion that in eukaryotes only the first AUG serves as a start codon in an mRNA. However, polycistronic mRNAs are known to occur in eukaryotes as well, i.e. the use of additional start codons from the same transcript is not without precedent. The third possibility to initiate an alternative reading frame within an existing gene is a new upstream exon, driven by a new promotor, combined with alternative splicing. This has apparently happened in the case of the Hoxa9 gene. This is also the mechanism that was found for the previously well-studied example of overprinting in the Cdkn2a gene. This raises of course the question of how the new promotor for the new upstream exon has evolved. However, it has been shown that there is a widespread presence of long-range regulatory activities in the mouse genome, which can act on inserted promotors. Thus, it seems quite conceivable that random mutations in such potentially active regions might suffice to create a new regulated initiation site.
We expect that it should be possible to detect many more cases of overprinting if one does not restrict the search to annotated reading frames, as we have done here. For example, Chung et al. [33] have identified 40 candidates for overprinting in humans using a probabilistic search strategy. With the much better genome sampling that we have nowadays, it should be possible to refine the searches even further.
Our search has specifically focused on cases where the overprinted reading frame has emerged later than the original one. Two of the previously well-studied genes fall into this class and we have recovered them. Such secondarily evolved proteins are the ones that give the strongest support for a de novo evolution mechanism, since alternative reading frames of long existing genes can be considered as almost random sequences. Hence, the fact that new proteins can arise out of them is a strong argument for the reality of de novo evolution [26, 27, 33].
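The point that an alternative reading frame of an existing gene encodes an essentially unrelated protein can be illustrated with a minimal sketch. The sequence below is a made-up toy example, not one of the genes discussed here, and the function name `translate` is our own; only the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate a DNA sequence starting at offset `frame` (0, 1 or 2).

    Trailing bases that do not fill a complete codon are ignored;
    stop codons are written as '*'.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


toy_sequence = "ATGGCATGA"  # hypothetical sequence for illustration only
frame0 = translate(toy_sequence, 0)
frame1 = translate(toy_sequence, 1)
```

Here the same nucleotides read in frame 0 and in frame 1 share no codon boundaries, so the two translations have no residues in common by descent, which is why a protein-level search with the overprinted product would not be expected to recover the ancestral one.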
The phylostratigraphy-based analysis of trends associated with gene emergence in the mouse genome is fully compatible with a frequent de novo emergence of orphan genes. This seems to be in contrast to previous assessments, which found only a small fraction of cases of de novo evolution [10, 53, 54]. However, it is necessary to emphasize that this depends very much on the criteria that were used. These early studies were still constrained by the assumption that de novo evolution must be rare, and the criteria were therefore tuned to be very restrictive to be sure that only the best-supported cases were included. In addition, it was initially unclear whether any new gene that includes part of a transposable element should be classified in a separate class, since strictly speaking it contains at least partly a duplicated sequence. On the other hand, if the transposable element fragment does not contribute its reading frame to the new gene, we would now consider it as a de novo gene, given that we also find overprinting in other existing genes. We should also reiterate that our analysis here is strictly based on genes that were annotated as protein coding, whereby the criteria for annotation of genes are still rather restrictive and tend not to consider short open reading frames, although these may be functional as well. Further, all non-coding RNAs are still excluded from this analysis, although the emergence of new de novo genes may be characterized by a phase where they act as non-coding RNAs first [11, 13]. Hence, we conclude that we are only beginning to understand the true impact of de novo gene evolution on shaping the genome and on the emergence of new gene functions.
The phylostratigraphic procedure is a blastp-based sorting of all protein sequences of an organism according to their phylogenetic emergence. The procedure uses the annotated genes of the focal organism and compares them to all available annotated and non-annotated genome data to infer the first time of emergence of a given gene. Accordingly, all available proteins from protein coding loci in version 66 of Ensembl for Mus musculus (obtained through BioMart) were queried against the nr database from NCBI using an e-value threshold of < 10⁻³, which has been shown to be optimal for such an analysis [1, 34]. For phylostratum 12, given the low number of protein sequences for outgroups (Cyclostomata/Chondrichthyes), EST and Trace data were included in a tblastn query (translated nucleotide comparison), using an e-value threshold of < 10⁻¹⁵. The computation of the phylostratigraphic maps was performed on the Phylostrat server of the IRB Institute, Zagreb, Croatia. Twenty phylogenetic age classes, i.e. phylostrata, were defined based on consensus phylogenetic relationships (Figure 1). The age of a locus was assigned taking into account the oldest detectable similarity of any of its protein products. This approach is targeted to the detection of orphan genes, as it neglects events of exon shuffling or gene fusion as genomic novelties.
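The assignment rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code run on the Phylostrat server: the function names are hypothetical, and each hit is assumed to have already been reduced to a pair of (phylostratum at which the hit taxon joins the mouse lineage, e-value).

```python
E_VALUE_CUTOFF = 1e-3   # blastp threshold quoted in the text
YOUNGEST_STRATUM = 20   # ps20, the mouse-specific stratum


def assign_phylostratum(hits, cutoff=E_VALUE_CUTOFF, youngest=YOUNGEST_STRATUM):
    """Return the oldest phylostratum (smallest index) with a qualifying hit.

    hits: iterable of (phylostratum, e_value) pairs (assumed input format).
    A protein without any hit below the cutoff stays in the youngest stratum.
    """
    qualifying = [ps for ps, e_value in hits if e_value < cutoff]
    return min(qualifying, default=youngest)


def locus_age(products, youngest=YOUNGEST_STRATUM):
    """Age of a locus: the oldest stratum reached by any of its protein products."""
    return min(
        (assign_phylostratum(hits, youngest=youngest) for hits in products),
        default=youngest,
    )


# Hypothetical hit lists for two protein products of one locus.
product_a = [(20, 0.0), (12, 1e-30), (2, 0.5)]   # the ps2 hit is too weak
product_b = [(20, 0.0), (18, 1e-8)]
age = locus_age([product_a, product_b])
```

Taking the minimum over all products mirrors the statement that the age of a locus reflects the oldest detectable similarity of any of its proteins; the overprinting screen, in contrast, needs the per-product ages returned by `assign_phylostratum`.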
Gene structure analyses
Structural gene features were obtained from version 66 of Ensembl through BioMart for mouse (Mus musculus), and from version 68 for human (Homo sapiens), zebrafish (Danio rerio) and stickleback (Gasterosteus aculeatus). Domain information from Interpro was also obtained through BioMart, and the number of different entries per gene was used as a proxy for the number of domains. Phylostratigraphic analyses were tested with hypergeometric statistics for discrete features and correlations were calculated for continuous features. A combination of permutations (n=10,000) and Kolmogorov-Smirnov tests was used to assess the significance of each phylostratum per variable. Kolmogorov-Smirnov tests were also applied to distance distributions. Other statistical tests were performed using R version 2.15.1 and PASW version 18.0.0. Circular plots for the mouse genome were done with Circos.
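As an illustration of the hypergeometric statistics used for discrete features, the over-representation p-value for one phylostratum can be computed as below. This is a sketch using only the Python standard library rather than the R and PASW routines used in the study; the function name and the toy counts are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: total number of genes
    K: genes in the genome that carry the feature
    n: genes in the phylostratum being tested
    k: genes in that phylostratum that carry the feature
    """
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible combinations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Toy numbers: 10 genes, 4 with the feature, a stratum of 3 genes, 2 of them marked.
p_value = hypergeom_upper_tail(k=2, K=4, n=3, N=10)
```

A test for under-representation would sum the lower tail (from 0 to k) in the same way.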
Transcription associated regions
Regions of high transcriptional activity from basal promotors were defined as those containing any of these three features: presence of CpG islands, H3K4me3 peaks or DNAseI sensitivity hotspots. These features allow broad range recognition of potential and actual sites with enhanced transcriptional activity. All datasets were taken from the UCSC Genome Browser [60, 61] through the Table Browser tool. Datasets for H3K4me3 ChIP-seq (Mouse ENCODE Consortium, 2012) were obtained from the available tracks from Histone Modification by ChIP-seq at ENCODE/LICR (Ludwig Institute for Cancer Research). Available tissue data at the time of the study include bone marrow, cortex, cerebellum, heart, kidney, liver, lung, mouse embryonic fibroblasts and spleen (all from 8 week old mice). Only peak data were used. Datasets for DNAseI sensitivity assays were obtained from the DNAseI Hypersensitivity by Digital DNAseI from ENCODE/University of Washington tracks. Only hotspot information was used, and only tracks corresponding to C57BL/6 mice. Genes were considered to be associated with these marks if the transcription start site was found at a distance of 1,250 bases or less from the mark, accounting for potential offsets in annotations and allowing the assumption that transcriptional activity might more drastically affect regions within a short range. Analyses of overlap between regions were performed with the BEDtools suite. Phylostratigraphic enrichment was calculated as log-odds and tested using hypergeometric statistics and FDR correction.
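The association criterion and the enrichment measure can be sketched as follows. The study used BEDtools for the overlaps; this stand-alone version is only meant to make the 1,250-base rule explicit. The function names are hypothetical, all coordinates are assumed to lie on the same chromosome, and the log-odds definition (natural log of the odds ratio of the stratum against all other genes) is our assumption, since the exact formula is not given in the text.

```python
from math import log

MAX_DISTANCE = 1250  # bases, as stated in the text


def distance_to_interval(pos, start, end):
    """Distance from a position to the closed interval [start, end]; 0 if inside."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0


def is_associated(tss, marks, max_distance=MAX_DISTANCE):
    """True if any mark interval lies within max_distance of the TSS."""
    return any(distance_to_interval(tss, s, e) <= max_distance for s, e in marks)


def log_odds(stratum_marked, stratum_total, all_marked, all_total):
    """Log odds ratio of being marked inside the stratum versus outside it.

    Assumes all four cells of the 2x2 table are non-zero.
    """
    a = stratum_marked
    b = stratum_total - stratum_marked
    c = all_marked - stratum_marked
    d = (all_total - stratum_total) - c
    return log((a * d) / (b * c))


# Toy example: a TSS at position 1000 and one mark starting 1,250 bases downstream.
associated = is_associated(1000, [(2250, 2300)])
enrichment = log_odds(stratum_marked=10, stratum_total=20, all_marked=30, all_total=100)
```

Positive values of `enrichment` correspond to over-representation of marked genes in the stratum and negative values to under-representation.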
Expression data for testis
Mouse microarray expression data from  were obtained from the authors’ website (http://hugheslab.ccbr.utoronto.ca/supplementary-data/Zhang/). This study was selected because of the wide spectrum of tissues considered, which allows for an unbiased measure of expression for a large set of genes. Given that the study was performed using a draft of the mouse genome, the probes were re-annotated using Blat to match the phylostratigraphic map of the mouse. Ambiguous and poorly matching probes were discarded from the analyses.
Secondary reading frames
This screen was devised to find annotated candidates for emergence of new genes within existing genes based on annotated products. All complete open reading frames corresponding to the same genomic location (ENSMUSG) were considered as candidates if the minimum and maximum age values differed by at least 2 phylostrata (to avoid screening borderline classifications between phylostrata). Within each genomic location, ORFs were aligned at the nucleotide and protein level using global (needle) and local alignments (blastn and blastp, database size adjusted to emulate nr-sized searches). The oldest product was used as reference, and any products with younger phylostrata values were used as query. In the case of multiple older products, comparisons were made against all possible products from the oldest phylostratum. Non-matching protein alignments coming from matching nucleotide alignments were considered as genes with alternative reading frames. These were screened manually in Geneious (version 5.6.5) to identify conservation patterns of start and stop codons in other species. Additionally, using the Compara platform from Ensembl, phylogenetic trees for selected candidates were analyzed.
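The first step of this screen, selecting loci whose products differ in age by at least two phylostrata and splitting them into reference and query products, might look like the sketch below. The identifiers and the input format (gene ID, protein ID, phylostratum) are hypothetical; the alignment and manual curation steps are not covered.

```python
from collections import defaultdict

MIN_STRATUM_GAP = 2  # minimum difference between oldest and youngest product


def overprinting_candidates(orf_ages, min_gap=MIN_STRATUM_GAP):
    """Group products by locus and keep loci with a sufficient age gap.

    orf_ages: iterable of (gene_id, protein_id, phylostratum) tuples.
    Returns {gene_id: (reference_ids, query_ids)}, where the references are all
    products from the oldest stratum and the queries are all younger products.
    """
    by_gene = defaultdict(list)
    for gene_id, protein_id, stratum in orf_ages:
        by_gene[gene_id].append((protein_id, stratum))

    candidates = {}
    for gene_id, products in by_gene.items():
        strata = [stratum for _, stratum in products]
        oldest, youngest = min(strata), max(strata)
        if youngest - oldest >= min_gap:
            references = [pid for pid, stratum in products if stratum == oldest]
            queries = [pid for pid, stratum in products if stratum > oldest]
            candidates[gene_id] = (references, queries)
    return candidates


# Hypothetical input: geneA has products in ps2 and ps5, geneB in ps10 and ps11.
example = [
    ("geneA", "protA1", 2),
    ("geneA", "protA2", 5),
    ("geneB", "protB1", 10),
    ("geneB", "protB2", 11),
]
selected = overprinting_candidates(example)
```

Each reference and query pair returned here would then go through the nucleotide and protein alignments, with matching nucleotide but non-matching protein alignments flagged as alternative reading frames.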
We thank Tomislav Domazet-Lošo for providing access to the Phylostrat server at the IRB in Zagreb, Croatia; Robert Bakaric for development and support of the Phylostrat server; and Sebastian Meyer for work on preliminary tests regarding mouse overprinting. RN is a member of the International Max-Planck Research School (IMPRS) for Evolutionary Biology.
- Domazet- Lošo T, Tautz D: An evolutionary analysis of orphan genes in Drosophila. Genome Res. 2003, 13: 2213-2219. 10.1101/gr.1311003.PubMed CentralView ArticlePubMedGoogle Scholar
- Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch TCG: More than just orphans: are taxonomically-restricted genes important in evolution. Trends Genet. 2009, 25: 404-413. 10.1016/j.tig.2009.07.006.View ArticlePubMedGoogle Scholar
- Tautz D, Domazet- Lošo T: The evolutionary origin of orphan genes. Nat Rev Genet. 2011, 12: 692-702.View ArticlePubMedGoogle Scholar
- Zhang JZ: Evolution by gene duplication: an update. Trends Ecol Evol. 2003, 18: 292-298. 10.1016/S0169-5347(03)00033-8.View ArticleGoogle Scholar
- Kaessmann H: Origins, evolution, and phenotypic impact of new genes. Genome Res. 2010, 20: 1313-1326. 10.1101/gr.101386.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Cai JJ, Petrov DA: Relaxed purifying selection and possibly high rate of adaptation in primate lineage-specific genes. Genome Biol Evol. 2010, 2: 393-409. 10.1093/gbe/evq019.PubMed CentralView ArticlePubMedGoogle Scholar
- Siepel A: Darwinian alchemy: Human genes from noncoding DNA. Genome Res. 2009, 19: 1693-1695. 10.1101/gr.098376.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Levine MT, Jones CD, Kern AD, Lindfors HA, Begun DJ: Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc Natl Acad Sci USA. 2006, 103: 9935-9939. 10.1073/pnas.0509809103.PubMed CentralView ArticlePubMedGoogle Scholar
- Begun DJ, Lindfors HA, Kern AD, Jones CD: Evidence for de novo evolution of testis-expressed genes in the Drosophila yakuba Drosophila erecta clade. Genetics. 2007, 176: 1131-1137.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhou Q, Zhang GJ, Zhang Y, Xu SY, Zhao RP: On the origin of new genes in Drosophila. Genome Res. 2008, 18: 1446-1455. 10.1101/gr.076588.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Cai J, Zhao RP, Jiang HF, Wang W: De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics. 2008, 179: 487-496. 10.1534/genetics.107.084491.PubMed CentralView ArticlePubMedGoogle Scholar
- Li C-Y, Zhang Y, Wang Zhang Y, Cao C, Zhang PW: A human-specific De novo protein-coding gene associated with human brain functions. PLoS Comput Biol. 2010, 6 (3): e1000734-10.1371/journal.pcbi.1000734.PubMed CentralView ArticlePubMedGoogle Scholar
- Heinen T, Staubach F, Haming D, Tautz D: Emergence of a New gene from an intergenic region. Curr Biol. 2009, 19: 1527-1531. 10.1016/j.cub.2009.07.049.View ArticlePubMedGoogle Scholar
- Yang ZF, Huang JL: De novo origin of new genes with introns in Plasmodium vivax. FEBS Lett. 2011, 585: 641-644. 10.1016/j.febslet.2011.01.017.View ArticlePubMedGoogle Scholar
- Donoghue MT, Keshavaiah C, Swamidatta SH, Spillane C: Evolutionary origins of brassicaceae specific genes in Arabidopsis thaliana. BMC Evol Biol. 2011, 11: 47-10.1186/1471-2148-11-47.PubMed CentralView ArticlePubMedGoogle Scholar
- Knowles DG, McLysaght A: Recent de novo origin of human protein-coding genes. Genome Res. 2009, 19: 1752-1759. 10.1101/gr.095026.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Li D, Dong Y, Jiang Y, Jiang HF, Cai J: A de novo originated gene depresses budding yeast mating pathway and is repressed by the protein encoded by its antisense strand. Cell Res. 2010, 20: 408-420. 10.1038/cr.2010.31.View ArticlePubMedGoogle Scholar
- Wu DD, Irwin DM, Zhang YP: De novo origin of human protein-coding genes. PLoS Genet. 2011, 7 (11): e1002379-10.1371/journal.pgen.1002379.PubMed CentralView ArticlePubMedGoogle Scholar
- Wilson BA, Masel J: Putatively Noncoding transcripts show extensive association with ribosomes. Genome Biol Evol. 2011, 3: 1245-1252. 10.1093/gbe/evr099.PubMed CentralView ArticlePubMedGoogle Scholar
- Carvunis AR, Rolland T, Wapinski I, Calderwood MA, Yildirim MA: Proto-genes and de novo gene birth. Nature. 2012, 487: 370-374. 10.1038/nature11184.PubMed CentralView ArticlePubMedGoogle Scholar
- Domazet-Lošo T, Brajkovic J, Tautz D: A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages. Trends Genet. 2007, 23: 533-539. 10.1016/j.tig.2007.08.014.
- Domazet-Lošo T, Tautz D: An ancient evolutionary origin of genes associated with human genetic diseases. Mol Biol Evol. 2008, 25: 2699-2707. 10.1093/molbev/msn214.
- Domazet-Lošo T, Tautz D: Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa. BMC Biol. 2010, 8: 66. 10.1186/1741-7007-8-66.
- Domazet-Lošo T, Tautz D: A phylogenetically based transcriptome age index mirrors ontogenetic divergence patterns. Nature. 2010, 468: 815-818. 10.1038/nature09632.
- Quint M, Drost HG, Gabel A, Ullrich KK, Bönn M, Grosse I: A transcriptomic hourglass in plant embryogenesis. Nature. 2012, 490: 98-101. 10.1038/nature11394.
- Ohno S: Birth of a unique enzyme from an alternative reading frame of the preexisted, internally repetitious coding sequence. Proc Natl Acad Sci USA. 1984, 81: 2421-2425. 10.1073/pnas.81.8.2421.
- Keese PK, Gibbs A: Origins of genes - big-bang or continuous creation. Proc Natl Acad Sci USA. 1992, 89: 9489-9493. 10.1073/pnas.89.20.9489.
- Rancurel C, Khosravi M, Dunker AK, Romero PR, Karlin D: Overlapping genes produce proteins with unusual sequence properties and offer insight into de novo protein creation. J Virol. 2009, 83: 10719-10736. 10.1128/JVI.00595-09.
- Sabath N, Wagner A, Karlin D: Evolution of viral proteins originated de novo by overprinting. Mol Biol Evol. 2012, 29: 3767-3780. 10.1093/molbev/mss179.
- Klemke M, Kehlenbach RH, Huttner WB: Two overlapping reading frames in a single exon encode interacting proteins - a novel way of gene usage. EMBO J. 2001, 20: 3849-3860. 10.1093/emboj/20.14.3849.
- Nekrutenko A, Wadhawan S, Goetting-Minesky P, Makova KD: Oscillating evolution of a mammalian locus with overlapping reading frames: An XL alpha s/ALEX relay. PLoS Genet. 2005, 1: 197-204.
- Sherr CJ: Divorcing ARF and p53: an unsettled case. Nat Rev Cancer. 2006, 6: 663-673. 10.1038/nrc1954.
- Chung WY, Wadhawan S, Szklarczyk R, Pond SK, Nekrutenko A: A first look at ARFome: Dual-coding genes in mammalian genomes. PLoS Comp Biol. 2007, 3: 855-861.
- Alba MM, Castresana J: On homology searches by protein Blast and the characterization of the age of genes. BMC Evol Biol. 2007, 7: 53. 10.1186/1471-2148-7-53.
- Lipman DJ, Souvorov A, Koonin EV, Panchenko AR, Tatusova TA: The relationship of protein conservation and sequence length. BMC Evol Biol. 2002, 2: 20. 10.1186/1471-2148-2-20.
- Wolf YI, Novichkov PS, Karev GP, Koonin EV, Lipman DJ: The universal distribution of evolutionary rates of genes and distinct characteristics of eukaryotic genes of different apparent ages. Proc Natl Acad Sci USA. 2009, 106: 7273-7280. 10.1073/pnas.0901808106.
- Chothia C, Gough J: Genomic and structural aspects of protein evolution. Biochem J. 2009, 419: 15-28. 10.1042/BJ20090122.
- Buljan M, Frankish A, Bateman A: Quantifying the mechanisms of domain gain in animal proteins. Genome Biol. 2010, 11: R74. 10.1186/gb-2010-11-7-r74.
- Tu SC, Shin Y, Zago WM, States BA, Eroshkin A: Takusan: A large gene family that regulates synaptic activity. Neuron. 2007, 55: 69-85. 10.1016/j.neuron.2007.06.021.
- Dintilhac A, Bihan R, Guerrier D, Deschamps S, Pellerin I: A conserved non-homeodomain Hoxa9 isoform interacting with CBP is co-expressed with the ‘typical’ Hoxa9 protein during embryogenesis. Gene Expr Patterns. 2004, 4: 215-222. 10.1016/j.modgep.2003.08.006.
- Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK: InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 2011, 40: D306-D312.
- Pal LR, Guda C: Tracing the origin of functional and conserved domains in the human proteome: implications for protein evolution at the modular level. BMC Evol Biol. 2006, 6: 91. 10.1186/1471-2148-6-91.
- Moore AD, Bornberg-Bauer E: The dynamics and evolutionary potential of domain loss and emergence. Mol Biol Evol. 2012, 29: 787-796. 10.1093/molbev/msr250.
- Dyson HJ, Wright PE: Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005, 6: 197-208. 10.1038/nrm1589.
- Tompa P, Kovacs D: Intrinsically disordered chaperones in plants and animals. Biochem Cell Biol. 2010, 88: 167-174. 10.1139/O09-163.
- Schlessinger A, Schaefer C, Vicedo E, Schmidberger M, Punta M: Protein disorder - a breakthrough invention of evolution. Curr Opin Struct Biol. 2011, 21: 412-418. 10.1016/j.sbi.2011.03.014.
- Carninci P: RNA dust: where are the genes. DNA Res. 2010, 17: 51-59. 10.1093/dnares/dsq006.
- Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ: The reality of pervasive transcription. PLoS Biol. 2011, 9: e1000625. 10.1371/journal.pbio.1000625.
- Seila AC, Calabrese JM, Levine SS, Yeo GW, Rahl PB, Flynn RA, Young RA, Sharp PA: Divergent transcription from active promoters. Science. 2008, 322: 1849-1851. 10.1126/science.1162253.
- Seila AC, Core LJ, Lis JT, Sharp PA: Divergent transcription: a new feature of active promoters. Cell Cycle. 2009, 8: 2557-2564. 10.4161/cc.8.16.9305.
- Tautz D: Polycistronic peptide coding genes in eukaryotes - how widespread are they? Brief Funct Gen Proteom. 2008, 8: 68-74. 10.1093/bfgp/eln054.
- Ruf S, Symmons O, Uslu VV, Dolle D, Hot C: Large-scale analysis of the regulatory architecture of the mouse genome with a transposon-associated sensor. Nat Genet. 2011, 43: 379-341. 10.1038/ng.790.
- Toll-Riera M, Bosch N, Bellora N, Castelo R, Armengol L: Origin of primate orphan genes: a comparative genomics approach. Mol Biol Evol. 2009, 26: 603-612.
- Ekman D, Elofsson A: Identifying and quantifying orphan protein sequences in fungi. J Mol Biol. 2010, 396: 396-405. 10.1016/j.jmb.2009.11.053.
- Flicek P, Amode MR, Barrell D, Beal K, Brent S: Ensembl 2011. Nucleic Acids Res. 2011, 39: D800-D806. 10.1093/nar/gkq1064.
- Kinsella RJ, Kahari A, Haider S, Zamora J, Proctor G: Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database. 2011, bar030. 10.1093/database/bar030.
- R Core Team: R: A language and environment for statistical computing. 2012, Vienna, Austria: R Foundation for Statistical Computing.
- SPSS Inc: PASW Statistics for Windows, Version 18.0. 2009, Chicago: SPSS Inc.
- Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R: Circos: An information aesthetic for comparative genomics. Genome Res. 2009, 19: 1639-1645. 10.1101/gr.092759.109.
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006.
- Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D: The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 2011, 39: D876-D882. 10.1093/nar/gkq963.
- Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW: The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004, 32: D493-D496. 10.1093/nar/gkh103.
- Snyder M, Hardison R, Ren B, Gingeras T, Gilbert D, Groudine M, Bender M, Kaul R, Mouse ENCODE Consortium, Stamatoyannopoulos J: An encyclopedia of mouse DNA elements (Mouse ENCODE). Genome Biol. 2012, 13: 418.
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26: 841-842. 10.1093/bioinformatics/btq033.
- Zhang W, Morris Q, Chang R, Shai O, Bakowski M, Mitsakakis N, Mohammad N, Robinson M, Zirngibl R, Somogyi E: The functional landscape of mouse gene expression. J Biol. 2004, 3: 21. 10.1186/jbiol16.
- Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664.
- Rice P, Longden I, Bleasby A: EMBOSS: The European molecular biology open software suite. Trends Genet. 2000, 16: 276-277. 10.1016/S0168-9525(00)02024-2.
- Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R: EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009, 19: 327-335.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.