The large numbers of lineage-specific genes identified in other taxa, many with potentially important functions [36–41, 55, 60, 85, 86], motivated our genome-wide search for lineage-specific genes within zebrafish. Here, we used BLAST, the preferred method for detecting homologs, together with phylostratigraphy to identify two sets of lineage-specific genes within zebrafish. We then characterized these genes, predicted their functions ab initio, inferred their evolutionary origin, and analyzed their expression patterns, making this the most comprehensive study of lineage-specific genes within teleosts and zebrafish to date. The 135 CTSGs and 66 orphan genes obtained in this study are attractive targets for future experimental work, owing to their lineage specificity and to the fact that the majority encode proteins whose functions are yet to be determined (only one orphan gene and 42 CTSGs have a GO term accession). Compared with the lineage-specific genes identified in plants [36, 43, 44], the number of lineage-specific genes within zebrafish is much lower, which may reflect a basic difference between animals and plants, given the similarly small numbers of lineage-specific genes identified in primates and in insects [40, 41]. Although Yang et al. identified a relatively small number of lineage-specific genes in Arabidopsis, Oryza, and Populus, close to the numbers found in animals, their analysis differs from other studies in ways that make the validity of their results questionable. For example, they restricted their analysis to genes with expression evidence and employed a very relaxed criterion to define sequence conservation (e-value cutoff of 0.1) that has not been used in other studies.
Taken together, the dramatic difference in the number of lineage-specific genes between animal and plant genomes is unlikely to be an artifact of the method we used and may instead reflect a genuine genetic difference between animals and plants. In addition, this difference may suggest that genome doubling followed by sequence divergence occurred at a higher frequency in plants, which may explain to some extent why plant genomes contain many more lineage-specific genes.
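As a rough illustration of the BLAST-based classification logic discussed above, and of why the e-value cutoff matters, the following Python sketch assigns a gene to a class from the species in which it has significant hits. The function name, the input layout, the `TELEOSTS` set, and the default cutoff of `1e-5` are illustrative assumptions, not the parameters or pipeline used in this study; CTSGs are taken here to mean genes with homologs only in other teleosts.

```python
from typing import Iterable, Tuple

# Hypothetical example taxa; a real analysis would rely on a full taxonomy.
ZEBRAFISH = "Danio rerio"
TELEOSTS = {
    "Danio rerio",
    "Oryzias latipes",
    "Takifugu rubripes",
    "Gasterosteus aculeatus",
}


def classify_gene(hits: Iterable[Tuple[str, float]],
                  evalue_cutoff: float = 1e-5) -> str:
    """Classify one zebrafish gene from its BLAST hits.

    hits: (subject species, e-value) pairs for the gene's protein.
    evalue_cutoff: assumed significance threshold (not taken from the study).
    """
    # Species with at least one significant hit, ignoring zebrafish self-hits.
    species = {sp for sp, ev in hits
               if ev <= evalue_cutoff and sp != ZEBRAFISH}
    if not species:
        return "orphan"  # no significant homolog outside zebrafish
    if species <= TELEOSTS:
        return "CTSG"    # significant homologs restricted to teleosts
    return "EC"          # significant homologs beyond teleosts
```

Under these assumptions, a gene whose only hit outside zebrafish is a weak match (e-value 0.1) to a non-teleost is called an orphan at the default cutoff but evolutionarily conserved at a cutoff of 0.1, which shows how a relaxed threshold shrinks the lineage-specific set.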
Both the CTSGs and the orphan genes were shorter than the EC genes, probably owing to fewer exons per gene and a higher percentage of intronless genes among the lineage-specific genes. For example, nearly 28% of orphan genes contained only one exon, whereas only 6% of EC genes were single-exon genes. One reason for this difference may be that intronless genes can arise via retroposition, which has been confirmed to create a large number of new genes in the zebrafish genome. Alternatively, this difference may be a consequence of the “introns late” hypothesis, which assumes that intron accretion into protein-coding genes has been continuous throughout the evolution of eukaryotes. Thus, the younger a gene is, the fewer exons it has. Additionally, since orphan genes are species specific, they may have arisen relatively recently. Collectively, these reasons may partly explain why many young orphan genes contain a single exon and why lineage-specific genes are shorter than older, evolutionarily conserved genes.
Generally speaking, lineage-specific genes are thought to play significant roles in the evolution of lineage-specific phenotypes and adaptive innovation. Although the functions of many lineage-specific genes have not been characterized, and only one orphan gene and 31% of CTSGs have a GO term accession, we were still able to find five orphan genes and eight CTSGs whose functions are closely related to immunity. The significant enrichment of immunity proteins among the lineage-specific genes within zebrafish indicates that defense against pathogens may have been an important factor in the successful diversification of fishes. Fishes are an extremely diverse group of aquatic vertebrates that also occupy an enormous diversity of habitats. They live in almost every conceivable type of aquatic habitat, from elevations of up to 5,200 meters in Tibet to depths of 7,000 meters below the ocean surface, and some species even make short excursions onto land. Some fishes live in almost pure freshwater, while others reside in very salty lakes. They can tolerate temperatures ranging from as high as 42.5°C down to −2°C under the Antarctic ice sheet. Fishes are therefore likely to be confronted with a much more diverse range of pathogens, and lineage-specific genes involved in immunity may help them adapt to these pathogens and survive within their diverse habitats. In addition, gene function is usually predicted from homology to proteins with known function in other species. Because some lineage-specific genes lack homologs in other lineages, we predicted their functions ab initio. Interestingly, among the CTSGs, proteins involved in immune response were the most represented, accounting for 30.65% of CTSGs, which probably implies a larger expansion of these genes in teleosts.
Therefore, function assignment based on both homology and ab initio prediction showed a significant enrichment of proteins related to immune response, suggesting that the successful adaptation of teleosts may be explained in part by their conserved lineage-specific genes.
Variation in gene number among organisms suggests a general process of new gene origination. One basic question in biology concerns the molecular mechanisms involved in the creation of new genes. Several hypotheses have already been proposed regarding the origin of lineage-specific genes. However, determining the exact mechanisms by which lineage-specific genes originate depends on comparative genome analysis of taxonomically closely related species, which has so far been extremely difficult for fishes. Gene duplication followed by rapid sequence divergence in one of the paralogs is a well-explored source of lineage-specific genes [75, 91]. A simple way to identify such genes is to check whether any of the paralogs of a lineage-specific gene is widely evolutionarily conserved. Through this analysis, we found that a large number of lineage-specific genes were generated by gene duplication followed by rapid sequence divergence of one of the paralogs. This was further supported by the observation that the similarity between these genes and their evolutionarily conserved paralogs was lower than the similarity between the genes and their paralogs that are not evolutionarily conserved. We will explore the other mechanisms that form lineage-specific genes in the future, when the genome sequence of the silver carp (Hypophthalmichthys molitrix), a relatively close relative of zebrafish, is released.
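The paralog check described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name, the dictionary layout, and the gene identifiers in the usage note are hypothetical, and the paralog assignments and the set of evolutionarily conserved genes are assumed to have been computed beforehand.

```python
from typing import Dict, List, Set


def duplication_derived(lineage_specific: Set[str],
                        paralogs: Dict[str, List[str]],
                        conserved: Set[str]) -> Set[str]:
    """Return lineage-specific genes with at least one conserved paralog.

    lineage_specific: IDs of orphan genes or CTSGs.
    paralogs: maps a gene ID to the IDs of its paralogs (precomputed).
    conserved: IDs of evolutionarily conserved (EC) genes.
    """
    return {
        gene for gene in lineage_specific
        # A conserved paralog is taken as a sign of duplication followed
        # by rapid divergence of the lineage-specific copy.
        if any(p in conserved for p in paralogs.get(gene, []))
    }
```

For example, with hypothetical IDs, `duplication_derived({"g1", "g2"}, {"g1": ["g9"], "g2": ["g3"]}, {"g9"})` flags only `g1`, because `g2` has no conserved paralog.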
Previous studies have shown that young genes generated by various mechanisms tend to show testis-specific or testis-biased expression patterns. In accordance with this observation, a large number of new genes within zebrafish are expressed in the reproductive system, consistent with the expectation that the emergence of new, lineage-specific genes may accompany speciation or reproduction. This suggests that this expression pattern is a general phenomenon not only in mammals and Drosophila, but also in teleosts. Several hypotheses can explain this propensity. First, sex- and reproduction-related genes, particularly those involved in male reproduction, are generally recognized as a class of rapidly evolving genes that undergo adaptive evolution after speciation events. Furthermore, the testis is the most rapidly evolving organ, owing to the strong selective pressures to which it is subjected because of its important roles in sperm competition, sexual conflict, reproductive isolation, germline pathogens, and mutations causing segregation distortion in the male germline. Second, the “hypertranscription” state caused by chromatin remodelling and RNA polymerase II complexes in meiotic and postmeiotic spermatogenic cells would favor the initial, unprovoked transcription of newly arisen genes. For the CTSGs, however, no significant enrichment of reproductive expression was observed, which is consistent with the idea that only young genes are specifically expressed in the testis, since the CTSGs are older than the orphan genes.
Expression analyses of lineage-specific genes using ESTs or microarrays have shown that lineage-specific novel genes are preferentially expressed in specific tissues or organs, such as the testis or brain [39, 90]. Although EST data cover a large number of samples and can therefore be used to compare expression between samples, the coverage of individual genes is too low to quantify their expression levels. Microarrays also have limitations, such as cross-hybridization and signal saturation. Therefore, we used RNA-seq data from various developmental stages and tissues to quantify the expression of lineage-specific genes and to highlight two novel properties of these genes. First, in addition to being highly tissue-specific, lineage-specific gene expression was highly temporally restricted. Second, lineage-specific genes were preferentially expressed in later-stage embryos and early larval stages compared with early embryos. The higher expression level of lineage-specific genes after the mid-blastula transition (MBT) suggests that lineage-specific genes are important components of zygotic transcription. Maternally deposited mRNAs direct early development before zygotic transcription is initiated at the MBT. After the MBT, however, zygotic transcription plays a more important role in the regulation of development, since a high percentage of maternally stored mRNA is degraded during the post-MBT stages. In addition, it has been shown that all vertebrate embryos must converge towards a narrow point, called the phylotypic stage, at which all vertebrates show high morphogenetic resemblance, to acquire the basic scheme on which subsequent differences will emerge. The observation that more lineage-specific genes are expressed after the phylotypic stage may be linked to the acquisition of species-specific morphological traits.
All vertebrates resemble each other at the phylotypic stage, so the crucial steps that establish the morphological differences between species likely result from genes expressed after the phylotypic stage. Therefore, lineage-specific genes within zebrafish may be crucial for the considerable morphological diversity of teleosts. In addition, lineage-specific genes showed relatively higher expression levels during early larval stages, making them candidates for functions in specific tissues and organs during organogenesis. Expression analysis using RNA-seq data from different tissues and organs supported the observations from the EST data and further showed that orphan genes are preferentially expressed in reproductive tissues, which also supports potential roles for lineage-specific genes during organogenesis.