The trends described above provide new insights into the modes of gene emergence over time. Of the two models, de novo evolution and duplication-divergence, de novo evolution appears to be more compatible with these trends. Before turning to the interpretations, however, we first discuss the technical aspects of our approach.
We rely generally on blastp searches for assigning genes to phylostrata. Extensive simulation efforts have shown that this is an adequate procedure. However, if one added manual curation, including a combination of different search algorithms, one would indeed reassign a number of genes to older phylostrata. On the other hand, we are focusing here on general trends, not on absolute numbers. Given that most of these trends are robust, both under statistical testing and when confirmed in the much less well annotated fish genomes, we consider the possible misclassification problem to be small.
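The assignment rule behind such a classification can be illustrated with a minimal sketch. It assumes that phylostrata are numbered from 1 (oldest) to 20 (youngest, the focal lineage) and that a gene is placed in the oldest stratum in which any significant hit is found. The function name, the input layout and the E-value cutoff are hypothetical choices for illustration and are not taken from our pipeline.

```python
from typing import Iterable, Tuple

YOUNGEST_PS = 20      # assumption: ps20 is the youngest stratum, ps1 the oldest
EVALUE_CUTOFF = 1e-3  # hypothetical significance cutoff, chosen for illustration


def assign_phylostratum(
    hits: Iterable[Tuple[int, float]],
    youngest: int = YOUNGEST_PS,
    cutoff: float = EVALUE_CUTOFF,
) -> int:
    """Return the oldest phylostratum with a significant hit.

    `hits` is an iterable of (phylostratum, evalue) pairs, one per
    similarity-search hit. A gene without any significant hit is
    treated as an orphan and placed in the youngest stratum.
    """
    significant = [ps for ps, evalue in hits if evalue <= cutoff]
    # A smaller stratum number means an older stratum under this numbering.
    return min(significant, default=youngest)


if __name__ == "__main__":
    example_hits = [(20, 0.0), (12, 1e-10), (3, 0.5)]
    print(assign_phylostratum(example_hits))
```

Under this rule a missed distant homologue can only move a gene to a younger stratum, never to an older one, which is why more sensitive searches would shift some genes towards older phylostrata.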
We restrict our analysis to the currently annotated Ensembl reading frames, although these are in constant flux due to curation and further refinement of annotation procedures. In fact, it has already been noted that the currently available annotations underestimate the number of orphan genes, since finding a homologue for a gene is one accessory criterion for annotation. This mostly affects the genes from ps20, which are under-represented [3, 9], although they are the best candidates for ongoing de novo evolution. Hence, although some noise is expected in the data and in the assignment fidelity, it is very unlikely that a systematic artifact causes the observed trends.
De novo evolution versus duplication-divergence
The de novo emergence of a gene out of non-coding DNA requires only some form of transcription, simple signals that define its start and its end, possibly splice sites, and some open reading frame [3, 7]. Since all of these signals are rather short, they are expected to occur frequently even in random sequences. Genes emerging from such random combinations of signals have been called proto-genes [7, 20], and analysis of ribosome association profiles in yeast has suggested that they are abundantly translated [19, 20]. Accordingly, they could easily serve as a continuous source of short genes that are available for recruitment into functional pathways and can then become more complex over time. Hence, new genes that arise according to this model would initially be short, have few introns and domains, and would often be associated with existing regulatory elements. These are indeed the overall trends that we observe.
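The claim that short signals occur frequently by chance can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a random sequence with equal base frequencies and independent positions, which is a simplification and not a property of any real genome. Under that assumption a given position starts an ATG with probability 1/64, and each following codon is a non-stop codon with probability 61/64. The function names are hypothetical.

```python
def orf_survival_prob(n_codons: int) -> float:
    """Probability that n_codons consecutive random codons contain no stop.

    Assumes equal base frequencies, so 61 of the 64 codons are non-stop.
    """
    return (61 / 64) ** n_codons


def expected_orfs(seq_len: int, min_codons: int) -> float:
    """Approximate expected number of ATG starts followed by at least
    min_codons stop-free codons, counted on both strands.

    Edge effects at the sequence ends are ignored, and nested starts
    within the same open frame are counted separately.
    """
    starts_per_strand = seq_len / 64  # P(ATG) = 1/64 at each position
    return 2 * starts_per_strand * orf_survival_prob(min_codons)


if __name__ == "__main__":
    for n in (30, 100, 300):
        print(n, orf_survival_prob(n), expected_orfs(1_000_000, n))
```

For one megabase of such random sequence, this gives on the order of a few hundred start codons followed by at least 100 stop-free codons, while frames of 300 codons or more are expected to be very rare. This is in line with the expectation that newly emerging reading frames are initially short.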
The duplication-divergence model, on the other hand, seems much less compatible with these trends. Under this model, one would expect the new gene to inherit its gene structure from the parental gene. Since long and short genes should equally often be the source of new genes, and since duplications should happen at similar rates at all time horizons, one would not expect to see a dependence between gene age and length-related features.
Domain number is also highly correlated with age, with younger genes having far fewer domains. This is not a simple effect of the similarity searches that we have used, since the domain annotation in Interpro is based on a combination of different procedures that go beyond blastp matches. Hence, this observation confirms that not only new genes, but also new domains can arise over time [42, 43]. On the other hand, only half of the genes contain known domains, i.e. having a domain is not a prerequisite for protein function. In fact, many proteins are known to be intrinsically unstructured [44–46].
It is still unclear how a new gene acquires its regulatory elements. One possibility is that there are many cryptic transcriptional initiation sites throughout the genome. Indeed, it appears that most of the genome is transcribed at some time [47, 48]. However, much of this may be co-transcription or spurious initiation. Moreover, for a transcript to become functional (i.e. to become subject to positive selection), it requires some form of stable and heritable regulation. We have therefore evaluated the possibility that new genes make use of existing promoters. It is known that RNA polymerase II promoters have a general tendency for divergent transcription within the nucleosome-free region associated with most promoters [49, 50]. Indeed, we find an enrichment of general signatures of active promoters in association with the most recently evolved genes (ps20). This is mostly due to bidirectional promoters, where the general tendency of RNA PolII for bidirectional transcription may have been extended to form a new transcript. Intriguingly, the next phylostratum (ps19) shows an under-representation of genes at bidirectional promoters, which suggests that a new gene that has become functional can rather quickly gain its own independent promoter elements.
Another way of making use of an existing promoter is to develop an alternative reading frame within an existing gene. This can come about through the acquisition of an alternative splice variant in which the original start codon is retained (e.g. in Polr1d). Alternatively, a separate start codon may be used that initiates a different reading frame (e.g. Reep6). This has long been thought to be very unlikely, mostly because of the common notion that in eukaryotes only the first AUG of an mRNA serves as a start codon. However, polycistronic mRNAs are known to occur in eukaryotes as well, i.e. the use of additional start codons from the same transcript is not without precedent. The third possibility to initiate an alternative reading frame within an existing gene is a new upstream exon, driven by a new promoter, combined with alternative splicing. This has apparently happened in the case of the Hoxa9 gene. It is also the mechanism that was found for the previously well-studied example of overprinting in the Cdkn2a gene. This of course raises the question of how the new promoter for the new upstream exon has evolved. However, it has been shown that there is a widespread presence of long-range regulatory activities in the mouse genome, which can act on inserted promoters. Thus, it seems quite conceivable that random mutations in such potentially active regions might suffice to create a new regulated initiation site.
We expect that it should be possible to detect many more cases of overprinting if one extends the search beyond annotated reading frames, to which we have restricted ourselves here. For example, Chung et al. have identified 40 candidates for overprinting in humans using a probabilistic search strategy. With the much better genome sampling that is available nowadays, it should be possible to refine the searches even further.
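A search that goes beyond annotated reading frames could start from a simple frame scan such as the following sketch. It is neither the procedure used in this study nor the probabilistic strategy of Chung et al.; it merely lists ATG-initiated open reading frames on the forward strand of a transcript and keeps those that overlap the annotated coding region in a different frame. Splicing, the reverse strand and non-ATG starts are ignored, and the function names and the default length threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `start` is the index of the first ATG after the previous stop in
    that frame, `end` is the index just past the stop codon, and the
    length threshold counts codons from the ATG up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def overprinting_candidates(seq, annotated_start, annotated_end, min_codons=30):
    """Keep ORFs that overlap the annotated CDS but lie in another frame.

    `annotated_start` and `annotated_end` are 0-based, end-exclusive
    transcript coordinates of the annotated reading frame.
    """
    annotated_frame = annotated_start % 3
    return [
        orf for orf in find_orfs(seq, min_codons)
        if orf[2] != annotated_frame
        and orf[0] < annotated_end
        and orf[1] > annotated_start
    ]


if __name__ == "__main__":
    toy = "ATGCATGGCAGCAGCATAACCTAA"  # toy transcript, not a real gene
    print(overprinting_candidates(toy, 0, len(toy), min_codons=3))
```

Candidates from such a scan would still need evidence of translation and an age assignment for each frame before they could be considered cases of overprinting.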
Our search has specifically focused on cases where the overprinted reading frame has emerged later than the original one. Two of the previously well-studied genes fall into this class, and we have recovered both. Such secondarily evolved proteins give the strongest support for a de novo evolution mechanism, since the alternative reading frames of long-existing genes can be considered almost random sequences. Hence, the fact that new proteins can arise out of them is a strong argument for the reality of de novo evolution [26, 27, 33].