Zmat2 in mammals: conservation and diversification among genes and pseudogenes
BMC Genomics volume 21, Article number: 113 (2020)
Recent advances in genetics and genomics present unique opportunities for enhancing our understanding of mammalian biology and evolution through detailed multi-species comparative analysis of gene organization and expression. Yet, of the more than 20,000 protein coding genes found in mammalian genomes, fewer than 10% have been examined in any detail. Here we elucidate the power of data available in publicly-accessible genomic and genetic resources by querying them to evaluate Zmat2, a minimally studied gene whose human ortholog has been implicated in spliceosome function and in keratinocyte differentiation.
We find extensive conservation in coding regions and overall structure of Zmat2 in 18 mammals representing 13 orders and spanning ~ 165 million years of evolutionary development, and in their encoded proteins. We identify a tandem duplication in the Zmat2 gene and locus in opossum, but not in other marsupials, monotremes, or other mammals, indicating that this event occurred subsequent to the divergence of these species from one another. We also define a collection of Zmat2 pseudogenes in half of the mammals studied, and suggest based on phylogenetic analysis that they each arose independently in the recent evolutionary past.
Mammalian Zmat2 genes and ZMAT2 proteins illustrate conservation of structure and sequence, along with the development and diversification of pseudogenes in a large fraction of species. Collectively, these observations also illustrate how the focused identification and interpretation of data found in public genomic and gene expression resources can be leveraged to reveal new insights of potentially high biological significance.
Of the more than 20,000 protein coding genes found in human and in other mammalian genomes, fewer than 10% have been studied in any detail [1,2,3]. This is true despite the fact that ready access to public genomic and gene-expression databases means that nearly any gene is available for intensive analysis from the molecular and cellular to the individual and population levels [5,6,7,8,9,10]. Part of this disparity may reflect social or historical reasons, but it also is likely that direct association with human diseases and the ready availability of experimental models influence decisions to gravitate toward scientific areas that appear more amenable to higher profile publications or grant funding [2, 3].
ZMAT2 is an excellent example of a gene that had been essentially unstudied until late 2018. ZMAT2, which encodes a protein that contains a zinc finger domain, is part of a 5-gene family of limited intra-familial amino acid similarity except for the zinc finger region. The lack of interest in this gene is potentially surprising, since it is the ortholog of Snu23, a yeast protein that plays an important role in the spliceosome, an essential molecular machine in eukaryotes that removes introns from primary gene transcripts. Although human ZMAT2 also has been mapped to the spliceosome in structural biological studies, even this observation has not generated much interest in the protein.
Here, by using information extracted from public repositories, we have studied Zmat2 genes and proteins from a broad group of 18 mammalian species comprising 13 orders, and representing ~ 165 million years (Myr) of evolutionary diversification [15,16,17,18]. Our results show extensive conservation in coding regions of these genes and in their encoded proteins, define a collection of Zmat2 pseudogenes in half of the mammals studied, and identify one mammal in which Zmat2 has undergone a tandem duplication. Our observations provide an illustration of how the focused application and analysis of data found in publicly-available genomic and gene expression resources can be leveraged to reveal new insights of potentially high biological significance.
Mammalian ZMAT2/Zmat2 genes are poorly annotated in genomic databases
Human ZMAT2 is an ortholog of yeast Snu23, a zinc-finger-containing protein that is a key component of the spliceosome, the molecular machine responsible for the removal of introns from primary gene transcripts. The human ZMAT2 gene has been incompletely characterized in the Ensembl and UCSC genomic repositories. We thus mapped the gene and its transcripts and protein (Fig. 1, Baral K, Rotwein P: The story of ZMAT2: a highly conserved and understudied human gene, manuscript submitted). Based on these results, which also revealed that 6-exon human ZMAT2 and its encoded 199-residue protein were highly conserved among non-human primates (Baral K, Rotwein P: The story of ZMAT2: a highly conserved and understudied human gene, manuscript submitted), we now sought to extend knowledge about Zmat2 by defining it in other mammalian species.
A preliminary examination within Ensembl revealed that the assignments of mammalian Zmat2 genes were even more incomplete than was observed for human ZMAT2, not only for the 18 species chosen here to cover a range of mammalian orders, but also for most of the mammalian and non-mammalian vertebrates in which Zmat2 has been identified in their genomes in Ensembl. For example, 5′ untranslated regions (UTRs) in exon 1 were described in only 6 of 18 species, and 3′ UTRs in exon 6 in only 7 of 18 species (Table 1). We thus developed an iterative strategy to define these genes, in which mouse Zmat2 was initially characterized in detail. Its exons then were used to perform homology searches in other mammalian genomes. As needed, these queries were supplemented by individual comparisons with Zmat2 cDNAs when available in the National Center for Biotechnology Information (NCBI) nucleotide database (cDNAs were listed in this resource for only 6 different species; see Methods), and by secondary searches using Zmat2 gene segments from species that were evolutionarily more similar to specific target species (e.g., using koala exon 1 to identify opossum exon 1). Most importantly, a final series of studies used the resources of the NCBI Sequence Read Archive (SRA) to map the putative 5′ and 3′ ends of each gene by analysis of expressed transcripts [19, 20]. As described below, results revealed substantially higher levels of gene complexity and completeness than had been found in the data curated by Ensembl.
The mouse Zmat2 gene
A search of Ensembl revealed that mouse Zmat2 appeared to be a 6-exon gene on chromosome 18, and like human ZMAT2 was located adjacent to Hars2 in the same transcriptional orientation (compare Fig. 2a and Fig. 1a). Of two proposed mouse Zmat2 transcripts in Ensembl, only one was stated to include all 6 exons (Fig. 2b) and to encode a protein of 199 amino acids, while the other was thought to include parts of 3 exons and a retained intron (see: https://useast.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000001383;r=18:36793876-36799666). Inspection of the presumptive full-length Zmat2 transcript revealed a proposed 5′ UTR of 66 base pairs (Table 1) that could not be extended by comparison with Zmat2 cDNA NM_025594 from the NCBI nucleotide database (5′ UTR of 19 base pairs).
Direct analysis of mouse Zmat2 gene expression using RNA-sequencing libraries from liver and keratinocytes (Additional file 1: Table S1) revealed that transcripts containing Zmat2 exon 1 were expressed at low levels (read counts of no more than 2 sequences per probe, Fig. 2c). Nevertheless, examination of these libraries revealed that exon 1 was at least 96 nucleotides in length (Fig. 2c). However, no potential TATA boxes, which position RNA polymerase II at the start of transcription, or initiator elements, which function similarly, were found adjacent to this transcript. Thus, the 5′ end of the mouse Zmat2 gene remains tentatively mapped.
Similar studies using probes from different parts of exon 6 showed that this exon was 1774 nucleotides in length, and thus was ~ 14 nucleotides shorter than stated in Ensembl. The 3′ end of exon 6 contained an ‘AATAAA’ presumptive poly A recognition sequence, and a poly A addition site was mapped 7 base pairs further 3′ (Fig. 2c), thus supporting our analysis. Taken together, these results describe a 6-exon mouse Zmat2 gene of 5786 base pairs in length (Table 2) that is transcribed and processed into an mRNA of 2306 nucleotides (Fig. 2d), and that encodes a 199-amino acid ZMAT2 (Fig. 2e).
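The search for a presumptive poly A recognition sequence near the 3′ end of an exon can be illustrated with a minimal Python sketch. The sequence used here is a made-up toy string, not mouse Zmat2 exon 6, and the function name `find_polya_signals` is hypothetical.

```python
def find_polya_signals(seq, signal="AATAAA"):
    """Return the 0-based start position of every occurrence of `signal` in `seq`."""
    seq = seq.upper()
    positions = []
    start = seq.find(signal)
    while start != -1:
        positions.append(start)
        # Resume one base later so that overlapping occurrences are not missed.
        start = seq.find(signal, start + 1)
    return positions


# Toy 3' end (illustrative only): the signal begins at position 6.
toy_3prime_end = "GCTTGCAATAAAGTCTGACCATTT"
signal_positions = find_polya_signals(toy_3prime_end)
print(signal_positions)
```

In practice, a candidate signal is informative only when a poly A addition site is also supported by expressed reads a short distance downstream, as was done here with RNA-sequencing libraries.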
The Zmat2 gene in other mammals
By searching genome databases with mouse exons, the few homologous cDNAs, and in selected cases, exons from closely related species, Zmat2 was characterized in 17 other mammals representing 9 different orders, and spanning ~ 165 Myr of evolutionary history. These other mammalian Zmat2 genes also all appeared to consist of 6 exons (Fig. 3, Table 2), and when their 5′ and 3′ ends were mapped using species-homologous RNA-sequencing libraries (Additional file 1: Table S1, Additional file 2: Table S2, Additional file 3: Figure S1, Additional file 4: Figure S2 and Additional file 5: Figure S3), their overall structures closely resembled mouse Zmat2 (Fig. 3, Table 2). In particular, there was perfect congruence in the lengths of coding exons 2–5 (Table 2), and high levels of DNA sequence identity (84.3 to 97.8%, Table 3). Total gene sizes varied over a 2-fold range, from 5477 base pairs in megabat to > 10,457 base pairs in dog, with most of the differences attributable to longer or shorter 3′ UTRs in exon 6 and to some variation in intron lengths (Table 2).
DNA conservation also was relatively high for Zmat2 exon 1 among the mammals studied (87.1 to 96.8% identity, Table 3), even though it is composed primarily of 5′ UTR. The exception here is opossum (55.8 and 56.8% identity, Table 3 and see below). Exon 6 was more dissimilar among the different species (Table 3), particularly in the noncoding segments (e.g., no identity in Tasmanian devil or koala).
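As an illustration of how a percent identity value of this kind can be derived from a pairwise alignment, the sketch below counts matching columns between two pre-aligned sequences. It rests on an assumption: identity is taken as matches divided by all aligned columns (gap columns included), which may differ from the denominator used by the BLAST-based comparisons reported in Table 3. The function name and toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two equal-length, pre-aligned sequences.

    Columns in which both sequences have a gap are ignored; a column with a
    gap in only one sequence counts as a mismatch (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy example: 7 of 8 aligned positions match.
print(percent_identity("ACGTACGT", "ACGTACGA"))
```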
The opossum genome contains tandem Zmat2 genes
Initial screening of the opossum genome revealed several sets of DNA sequences with comparable levels of identity with mouse Zmat2 exons 2–5 (84.9 to 94.3%, Table 3). Two of these groups of DNA segments were distributed to adjacent locations in the opossum genome, and when compiled and evaluated in detail (including identifying exon 1 by using koala Zmat2 exon 1) consisted of tandem genes that were oriented ‘head-to-head’ in divergent transcriptional direction (Fig. 4a). Further analysis showed that the 5′ ends of exon 1 of both genes potentially overlapped (Fig. 4a, b), that exons 1 through 5 were 99.73% identical, and that the two copies of exon 6 matched each other in length and were 99.9% identical in DNA sequence (Fig. 4b and not shown). By using probes that differed by a single nucleotide (Additional file 2: Table S2) to screen an RNA-sequencing library, we found that both opossum Zmat2 genes were expressed, at least in liver, with transcripts for gene 1 being more abundant than those for gene 2 (Fig. 4c). Moreover, both opossum Zmat2 mRNAs were the same length (Fig. 4d), and they encoded proteins that varied by a single amino acid (valine at position 128 in protein 1, and methionine in protein 2; Fig. 4e).
Multiple Zmat2 pseudogenes arose independently in different mammalian genomes
Screening of different mammalian genomes with individual mouse Zmat2 exons led to the identification of additional related DNA sequences in nine species (rat, guinea pig, rabbit, dog, dolphin, microbat, megabat, opossum, and platypus; Table 4). The levels of similarity with mouse Zmat2 exons ranged from 80.1 to 93.4% identity (Table 4). In rat, rabbit, dog, dolphin, megabat, microbat, and opossum, paralogs of all 6 Zmat2 exons were detected and, except in rabbit, were composed of continuous DNA sequences (Table 4, Fig. 5). In rabbit, an unreadable DNA segment of ~ 406 nucleotides separated ‘exons’ 2 and 3. These ‘full-length’ DNAs thus appeared to be pseudogenes that resembled processed mRNAs, and that presumably were retro-transposed as DNA copies back into the respective genomes. In guinea pig, paralogs of only ‘exons’ 4 through 6 could be found; in platypus, individual representations of ‘exon 2’ and ‘exon 3’ mapped to different locations in the genome; and in rat, two copies of 461 base pairs of ‘exon 6’ were found in different parts of the X chromosome (87.4% identity with the corresponding portions of the mouse exon, Table 4). The two putative Zmat2 pseudogenes found in the microbat genome and the four located in the dolphin genome are depicted in Fig. 5. In microbat, one of these DNA sequences contained a continuous open reading frame of 199 codons, and its conceptual translation revealed marked similarity with the microbat ZMAT2 protein (183/199 identical residues, Fig. 5b). In dolphin, in which two of the four pseudogenes encoded 199-codon open reading frames (Fig. 5c), one was predicted to be identical to authentic ZMAT2, while the other matched it in 185/199 residues (Fig. 5d).
Previous studies have shown that some potential pseudogenes for human protein phosphatase 1 regulatory subunit 2 (PPP1R2) are transcribed, and thus are not actually pseudogenes, since they are expressed as RNAs. To determine whether or not any mammalian Zmat2 pseudogenes are functional, their gene expression was examined by querying RNA-sequencing libraries. As shown for rat, rabbit, guinea pig, dog, dolphin, megabat, and opossum, no transcripts could be detected in these libraries even though in all cases authentic Zmat2 mRNA was readily expressed (Fig. 6a-g; no microbat RNA sequencing library was available in the NCBI SRA).
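The conceptual translation of a candidate open reading frame, and the count of residues it shares with an authentic protein, can be sketched as follows. This is a minimal Python illustration that assumes the standard genetic code and two sequences of equal length; the sequences are short toy strings rather than Zmat2 data, and the function names are hypothetical (the translations in this study were made with Serial Cloner, see Methods).

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unreadable codon
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def identical_residues(protein_a, protein_b):
    """Count positions with the same residue in two equal-length proteins."""
    if len(protein_a) != len(protein_b):
        raise ValueError("proteins must have the same length")
    return sum(1 for x, y in zip(protein_a, protein_b) if x == y)


# Toy example: a single-nucleotide change (GTG to ATG) swaps valine for methionine.
gene_like = translate("ATGAAAGTGTAA")
pseudogene_like = translate("ATGAAAATGTAA")
print(gene_like, pseudogene_like, identical_residues(gene_like, pseudogene_like))
```

An open reading frame of the expected length does not by itself indicate function, which is why expression of these sequences was also examined.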
Phylogenetic analysis of all 13 ‘full-length’ Zmat2 pseudogenes from 7 different mammals (including marmoset (Baral K, Rotwein P: The story of ZMAT2: a highly conserved and understudied human gene, manuscript submitted), Table 4) demonstrated that the DNA sequence of each pseudogene was more closely related to the paralog or paralogs from the homologous species than to other Zmat2 pseudogenes (Fig. 5e), suggesting that these retro-transposition events each arose independently after the divergence of each species from their nearest mammalian ancestors.
ZMAT2 protein sequences are highly-conserved among mammals
ZMAT2 was identical to the mouse and human protein in ten species studied here (Table 5, Fig. 7a, b). In each of the other 8 species, only one or two amino acid substitutions were found, except for platypus, in which the NH2-terminus of the protein could not be established because of incomplete genomic sequence (Fig. 7). Phylogenetic mapping further showed that marsupial ZMAT2 proteins clustered together, as all were identical except for opossum 2 (Fig. 7b). Of note for all variant ZMAT2 proteins, the altered amino acids were located throughout the protein, but none were found in the zinc finger domain (Fig. 7a).
The focus of this study was to characterize Zmat2 genes in mammals by analyzing data available in genomic and gene expression repositories, and to place these findings in an evolutionary context. Prior to this and to our recent report (Baral K, Rotwein P: The story of ZMAT2: a highly conserved and understudied human gene, manuscript submitted), there had been no publications on ZMAT2/Zmat2 genes from any species, despite the significance of the protein in the fundamentals of eukaryotic pre-mRNA splicing [12, 14]. Our main observations here have included, first, demonstrating that 6-exon Zmat2 is a single-copy gene in all mammals studied, except for opossum, in which a gene duplication event occurring after the divergence of opossum from other marsupials ~ 80 Myr ago [15, 26] has led to paired tandem Zmat2 genes (Fig. 4). Second, we have elucidated the presence of Zmat2 pseudogenes in at least ten different mammalian species, have demonstrated that they are not transcribed in a context in which authentic Zmat2 is expressed (Table 4, Figs. 5 and 6), and have shown that they appear to have arisen recently in these genomes (Fig. 5e). Third, we have found that the ZMAT2 protein is highly conserved among mammals (Table 5, Fig. 7). Importantly, our data demonstrate that a strategy involving the focused and complementary examination of genomic and gene expression databases can lead to new insights about mammalian biology and gene evolution, and illustrate how investigating unstudied genes can lead to the development of new experimentally-testable hypotheses.
The Zmat2 gene and pseudogenes in mammals
The data described and examined here define Zmat2 as a 6-exon gene in 18 different mammalian species representing 9 orders (Tables 2, 3, Figs. 3, 4). They are thus very similar to their human and non-human primate orthologs in terms of both gene organization and the encoded ZMAT2 protein (Baral K, Rotwein P: The story of ZMAT2: a highly conserved and understudied human gene, manuscript submitted), supporting the idea that the protein plays a conserved and potentially essential role in pre-mRNA splicing and possibly in keratinocyte differentiation (see below).
Pseudogenes have been described in both prokaryotes and eukaryotes, and are fairly common in the human and in other mammalian genomes. Preliminary analysis of data generated by ENCODE, performed nearly a decade ago, had suggested that there are more than 10,000 pseudogenes in the human genome, comprising ~ 0.7% of the DNA sequence. Among these pseudogenes, ~ 77.5% were thought to represent processed mRNAs that had been retro-transposed as individual DNA copies into the genome, and the other ~ 22.5% were thought to be the result of gene duplication events.
Zmat2 pseudogenes could be identified in about half of the mammals studied here, and in all evaluable cases were not expressed in organs or tissues in which authentic Zmat2 could be detected readily (Fig. 6), thus marking them as ‘real’ pseudogenes, unlike what was shown recently for human PPP1R2, in which at least four previously identified pseudogenes were transcribed, and thus should be considered as genes. Remarkably, the number of Zmat2 pseudogenes varied among these species, ranging from 1 to 4 per mammal (Table 4, Fig. 5). In addition, although most Zmat2 pseudogenes contained components of all 6 Zmat2 exons, in the guinea pig genome, the pseudogene was composed of exons 4–6, and in platypus, copies of exon 2 and exon 3 were located on different genome segments (Table 4). In the rat genome, two partial copies of 461 nucleotides of Zmat2 exon 6 were found in different locations on the X chromosome, but these were not detected in any of the other mammals studied (Table 4). While the full-length pseudogenes seem likely to have arisen via retro-transposition of mRNAs as DNA copies back into the respective genomes, the origins of the partial Zmat2 gene sequences in guinea pig, platypus, and rat are unclear. Since Zmat2 pseudogenes were not identified in half of the mammals analyzed here, and since phylogenetic analysis of the ‘full-length’ pseudogenes indicated that they were more similar to their paralogs than to any orthologous DNA sequences in other mammals (Fig. 5e), it seems likely that they arose independently in each species subsequent to its evolutionary divergence from its closest ancestors.
ZMAT2 proteins are remarkably similar to one another in the mammalian species examined in this manuscript. Only 7 amino acid substitution variants were detected, with none found in the zinc finger domain. Including human and non-human primate ZMAT2, the protein was identical in 18/27 different mammals, and a variant protein in a given species contained at most 2 amino acid differences (Table 5, Fig. 7, and (Baral K, Rotwein P: The story of ZMAT2: a highly conserved and understudied human gene, manuscript submitted)), although, in platypus, the NH2-terminus of the protein could not be characterized because of poor quality genomic DNA sequence. In addition, we had shown recently that ZMAT2 is remarkably non-polymorphic in humans (Baral K, Rotwein P: The story of ZMAT2: a highly conserved and understudied human gene, manuscript submitted), with only 41 different potential codon changes identified that predicted amino acid substitutions in over 280,000 alleles found in the gnomAD project, corresponding to just 0.014% of the alleles in the entire study population (Baral K, Rotwein P: The story of ZMAT2: a highly conserved and understudied human gene, manuscript submitted). This level of variation in the human population is 6–90-fold lower than detected previously for at least 19 other human genes [30,31,32]. Moreover, and unlike these other genes [30,31,32], no frame shift or splicing site alterations were found in human ZMAT2 (Baral K, Rotwein P: The story of ZMAT2: a highly conserved and understudied human gene, manuscript submitted).
One possibility for the high level of conservation of ZMAT2 among mammals is that the protein plays a key role in pre-mRNA splicing. ZMAT2 and its yeast homolog Snu23 have been found in the spliceosome [12, 14], and based on structural data, the protein has been postulated to facilitate activation of the U6 snRNP at the 5′ splice site of the intron. Human ZMAT2 also may have a more specialized function, as it was described as a negative regulator of human keratinocyte differentiation, potentially by blocking the splicing of selected primary gene transcripts. Defining the specific functions of ZMAT2 by genetic or other approaches in one or more tractable organisms will be an important topic for future study.
Stitching together genes in pieces: improving the quality of genome resources
Publicly available genomic databases contain extensive information on genes from many species, and are valuable resources for the entire scientific community. Unfortunately, as shown here, the quality of available information in certain circumstances is very poor. In nearly two-thirds of the species studied here, the annotated Zmat2 gene in Ensembl lacked either 5′ or 3′ UTRs, or both (Table 1), and in some cases could be identified only by screening with exons from other mammals. These types of problems may be quite common, and appear to be the norm for Zmat2 genes from other mammalian and non-mammalian vertebrates in Ensembl. Poor annotation also has been described for several other genes in multiple species [19, 33]. Ideally, the data quality in these genomic repositories should be nearly perfect, not only to enhance the opportunity for future discoveries, but also to minimize the propagation of false information in scientific publications.
It has been estimated that only a tiny fraction of the ~ 20,000 human protein coding genes has been evaluated [1,2,3]. In fact, a recent report has suggested that ~ 90% of human genes are understudied, including several, such as ZMAT2, that have been the main topic of only a single publication. It is likely that these statistics are more dismal for genes in other mammals and in non-mammalian vertebrates, even including species such as mouse and zebrafish that are favorites of experimentalists [34, 35]. Certainly, a concerted effort to broaden discovery horizons by focusing on understudied and unstudied genes could lead to new insights of potentially high biological and biomedical significance.
Database searches and analyses
Genomic databases were accessed in the Ensembl Genome Browser (www.ensembl.org), initially by text search using ‘Zmat2’ as the query term (see Table 6 for species-specific data links). Additional searches were performed in Ensembl with BlastN under normal sensitivity (maximum e-value of 10; mis-match scores: 1, −3; gap penalties: opening 5, extension 2; filtered low complexity regions, and repeat sequences masked) using as queries mouse Zmat2 DNA fragments (Mus musculus, genome assembly GRCm38.p6). The following genome assemblies were examined: cat (Felis catus, Felis_catus_9.0), cow (Bos taurus, ARS-UCD1.2), dog (Canis lupus familiaris, CanFam3.1), dolphin (Tursiops truncatus, turTru1), elephant (Loxodonta africana, LoxAfr3.0), guinea pig (Cavia porcellus, cavpor3.0), goat (Capra hircus, ARS1), horse (Equus caballus, EquCab3.0), human (Homo sapiens, GRCh38.p12), koala (Phascolarctos cinereus, phaCin_unsw_v4.1), megabat (Pteropus vampyrus, pteVam1), microbat (Myotis lucifugus, Myoluc2.0), opossum (Monodelphis domestica, monDom5), pig (Sus scrofa, Sscrofa11.1), platypus (Ornithorhynchus anatinus, OANA5), rabbit (Oryctolagus cuniculus, OryCun2.0), rat (Rattus norvegicus, Rnor_6.0), sheep (Ovis aries, OAE_v3.1), and Tasmanian devil (Sarcophilus harrisii, Devil_ref v7.0). The highest scoring results in all cases mapped to the Zmat2 gene, or in several species, to Zmat2 and to Zmat2 pseudogenes. As many searches were incomplete, additional queries were conducted using species-homologous Zmat2 cDNAs when available to verify or extend initial results. The following Zmat2 cDNAs were obtained from the NCBI nucleotide database: cow (accession number: NM_001080343), horse (JL616468), koala (XM_021005188), mouse (NM_025594), rat (NM_001135582), and sheep (GAAI01003789).
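For readers who wish to approximate these searches outside the Ensembl web interface, the sketch below assembles an equivalent command for a local installation of NCBI BLAST+. This is an assumption-laden illustration and not the procedure used here: the searches in this study were run in Ensembl, the file and database names are hypothetical, and repeat masking is not reproduced by these options.

```python
def build_blastn_command(query_fasta, database, out_path):
    """Assemble a BLAST+ blastn argument list mirroring the stated search settings."""
    return [
        "blastn",
        "-task", "blastn",       # standard BlastN rather than the megablast default
        "-query", query_fasta,
        "-db", database,
        "-evalue", "10",         # maximum e-value of 10
        "-reward", "1",          # match score
        "-penalty", "-3",        # mismatch score
        "-gapopen", "5",         # gap opening penalty
        "-gapextend", "2",       # gap extension penalty
        "-dust", "yes",          # filter low-complexity regions
        "-outfmt", "6",          # tabular output
        "-out", out_path,
    ]


# Hypothetical file names, for illustration only.
command = build_blastn_command("mouse_zmat2_exon2.fa", "target_genome", "hits.tsv")
print(" ".join(command))
```

If BLAST+ is installed and a nucleotide database has been built for the target genome, the list can be passed to `subprocess.run(command, check=True)`.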
The Uniprot browser (http://www.uniprot.org/) was the source for ZMAT2 protein sequences (Additional file 6: Table S3); in the absence of primary protein data, DNA sequences of Zmat2 exons were translated using Serial Cloner 2.6 (see: http://serialbasics.free.fr/Serial_Cloner.html).
Mapping the 5′ and 3′ ends of Zmat2 genes
Inspection of ZMAT2 and its proposed mRNAs in the Ensembl genome database revealed for most species either a lack of 5′ or 3′ UTRs for Zmat2 mRNAs, or poorly-defined 5′ or 3′ UTRs. In a few cases, as in horse, koala and sheep, a cDNA in the NCBI nucleotide database could be used to extend the 3′ UTR. For all species for which they were available, RNA-sequencing libraries found in the NCBI SRA (www.ncbi.nlm.nih.gov/sra) were queried with multiple 60 base pair probes from genomic DNA corresponding to presumptive 5′ portions of exon 1, and from 3′ parts of exon 6, and read counts were analyzed. All queries used the Megablast option (optimized for highly similar sequences; maximum target sequences–10,000 (this parameter may be set from 50 to 20,000); expect threshold–10; word size–11; match/mismatch scores–2, − 3; gap costs–existence 5, extension 2; low-complexity regions filtered). The RNA-sequencing libraries are listed in Additional file 1: Table S1, and the probes in Additional file 2: Table S2.
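The logic of probing an RNA-sequencing library with fixed-length genomic segments and tallying read counts can be sketched in Python as follows. This is a deliberate simplification: it counts only reads that contain a probe exactly, on either strand, whereas the Megablast searches used here tolerate mismatches and partial overlaps. The function names, the short probe length, and the toy sequences are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def make_probes(region, length=60, step=60):
    """Cut a genomic region into probes of `length` nucleotides every `step` bases."""
    region = region.upper()
    return [region[i:i + length] for i in range(0, len(region) - length + 1, step)]


def count_reads_with_probe(probe, reads):
    """Count reads that contain the probe, or its reverse complement, exactly."""
    probe = probe.upper()
    probe_rc = reverse_complement(probe)
    return sum(1 for read in reads if probe in read.upper() or probe_rc in read.upper())


# Toy data with 4-nucleotide probes, for illustration only.
toy_reads = ["GGAAACGG", "CCGTTTCC", "TTTTTTTT"]
print(make_probes("ACGTACGTAC", length=4, step=3))
print(count_reads_with_probe("AAAC", toy_reads))
```

A run of consecutive probes with non-zero counts, followed by probes with no matching reads, marks the approximate boundary of the transcribed region.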
DNA and protein alignments and phylogenetic trees
Multiple sequence alignments were performed for Zmat2 pseudogenes from different species. DNA sequences were uploaded into the command line of Clustalw2 (https://www.ebi.ac.uk/Tools/msa/clustalw2/) in FASTA format. A similar approach was used with ZMAT2 proteins, except that amino acid sequences were uploaded into Clustalw2 in FASTA format. Output files were in GCG MSF (Genetics Computer Group multiple sequence file) format, and were used as input into a command line form of IQ-TREE (http://iqtree.cibiv.univie.ac.at/), a software tool that uses maximum likelihood to generate phylogenetic trees. IQ-TREE combines phylogenetic and combinatorial optimization techniques into a fast and effective tree search algorithm. The input sequence was bootstrapped 1000 times to get the optimal tree. The output file (with an extension of ‘.treefile’) became input into Interactive Tree of Life (iTOL; https://itol.embl.de/), to produce pictorial phylogenetic trees. Pairwise alignments comparing the two ZMAT2 proteins discovered in opossum, and comparing ZMAT2 proteins with predicted proteins from Zmat2 pseudogenes, were performed using Needle (EMBOSS; see https://www.ebi.ac.uk/Tools/psa/), which creates an optimal global alignment of two sequences using the Needleman-Wunsch algorithm.
Initial screening of several mammalian genomes revealed more than one group of DNA sequences with high levels of identity with different mouse Zmat2 exons, using the same BlastN criteria outlined above. In addition, when conceptually translated, many of these sequences resembled all or parts of ZMAT2 proteins (see Table 4). To determine if these DNA sequences were pseudogenes or actual genes, expression of transcripts was assessed in each species in which RNA-sequencing libraries were available, in parallel with authentic Zmat2 (see Fig. 6).
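To make the global alignment step concrete, the following Python sketch computes a Needleman-Wunsch alignment score by dynamic programming. It is a simplified illustration and not a reimplementation of EMBOSS Needle: it assumes fixed match and mismatch scores and a linear gap penalty, whereas Needle uses a substitution matrix with separate gap opening and extension penalties. The scoring values and function name are hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of `a` and `b` with a linear gap penalty."""
    # Row 0: aligning an empty prefix of `a` with the first j characters of `b`.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # column 0: the first i characters of `a` against nothing
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = previous[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = current[j - 1] + gap   # b[j-1] aligned to a gap
            current.append(max(diagonal, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]


# Toy sequences: one internal deletion costs a single gap penalty.
print(needleman_wunsch_score("ACGT", "AGT"))
```

Only the score is returned here; recovering the alignment itself requires keeping the full matrix and tracing back from the final cell.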
Availability of data and materials
See Table 6 for data links and see specific accession numbers in Methods section above.
NCBI: National Center for Biotechnology Information
SRA: Sequence Read Archive
Oprea TI, Bologa CG, Brunak S, et al. Unexplored therapeutic opportunities in the human genome. Nat Rev Drug Discov. 2018;17:317–32.
Haynes WA, Tomczak A, Khatri P. Gene annotation bias impedes biomedical research. Sci Rep. 2018;8:1362.
Stoeger T, Gerlach M, Morimoto RI, Nunes Amaral LA. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol. 2018;16:e2006643.
Manolio TA, Fowler DM, Starita LM, et al. Bedside back to bench: building bridges between basic and clinical genomic research. Cell. 2017;169:6–12.
Battle A, Brown CD, Engelhardt BE, Montgomery SB. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–13.
Soumillon M, Cacchiarelli D, Semrau S, van Oudenaarden A, Mikkelsen TS. Characterization of directed differentiation by high-throughput single-cell RNA-Seq. bioRxiv. 2014;1:1–13.
Vera M, Biswas J, Senecal A, Singer RH, Park HY. Single-cell and single-molecule analysis of gene expression regulation. Annu Rev Genet. 2016;50:267–91.
Katsanis N. The continuum of causality in human genetic disorders. Genome Biol. 2016;17:233–7.
Quintana-Murci L. Understanding rare and common diseases in the context of human evolution. Genome Biol. 2016;17:225–39.
Acuna-Hidalgo R, Veltman JA, Hoischen A. New insights into the generation and role of de novo mutations in health and disease. Genome Biol. 2016;17:241–60.
Tanis SEJ, Jansen PWTC, Zhou H, et al. Splicing and chromatin factors jointly regulate epidermal differentiation. Cell Rep. 2018;25:1292–1303.e5.
Plaschka C, Lin PC, Nagai K. Structure of a pre-catalytic spliceosome. Nature. 2017;546:617–21.
Papasaikas P, Valcarcel J. The spliceosome: the ultimate RNA chaperone and sculptor. Trends Biochem Sci. 2016;41:33–45.
Bertram K, Agafonov DE, Dybkov O, et al. Cryo-EM structure of a pre-catalytic human spliceosome primed for activation. Cell. 2017;170:701–713.e11.
Bininda-Emonds OR, Cardillo M, Jones KE, et al. The delayed rise of present-day mammals. Nature. 2007;446:507–12.
Nikolaev SI, Montoya-Burgos JI, Popadin K, Parand L, Margulies EH, Antonarakis SE. Life-history traits drive the evolutionary rates of mammalian coding and noncoding genomic elements. Proc Natl Acad Sci U S A. 2007;104:20443–8.
Asher RJ, Bennett N, Lehmann T. The new framework for understanding placental mammal evolution. Bioessays. 2009;31:853–64.
Liu L, Zhang J, Rheindt FE, et al. Genomic evidence reveals a radiation of placental mammals uninterrupted by the KPg boundary. Proc Natl Acad Sci U S A. 2017;114:E7282–90.
Rotwein P. The insulin-like growth factor 2 gene and locus in nonmammalian vertebrates: organizational simplicity with duplication but limited divergence in fish. J Biol Chem. 2018;293:15912–32.
Rotwein P. Quantifying promoter-specific insulin-like growth factor 1 gene expression by interrogating public databases. Physiol Rep. 2019;7:e13970.
Albright SR, Tjian R. TAFs revisited: more data reveal new twists and confirm old ideas. Gene. 2000;242:1–13.
Vo Ngoc L, Wang YL, Kassavetis GA, Kadonaga JT. The punctilious RNA polymerase II core promoter. Genes Dev. 2017;31:1289–301.
Proudfoot NJ. Ending the message: poly(A) signals then and now. Genes Dev. 2011;25:1770–82.
Weiner AM, Deininger PL, Efstratiadis A. Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information. Annu Rev Biochem. 1986;55:631–61.
Korrodi-Gregorio L, Abrantes J, Muller T, et al. Not so pseudo: the evolutionary history of protein phosphatase 1 regulatory subunit 2 and related pseudogenes. BMC Evol Biol. 2013;13:242.
Mitchell KJ, Pratt RC, Watson LN, et al. Molecular phylogeny, biogeography, and habitat preference evolution of marsupials. Mol Biol Evol. 2014;31:2322–30.
Mighell AJ, Smith NR, Robinson PA, Markham AF. Vertebrate pseudogenes. FEBS Lett. 2000;468:109–14.
Alexander RP, Fang G, Rozowsky J, Snyder M, Gerstein MB. Annotating non-coding regions of the genome. Nat Rev Genet. 2010;11:559–71.
Karczewski KJ, Francioli LC, Tiao G, et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv. 2019. https://doi.org/10.1101/531210
Rotwein P. Variation in Akt protein kinases in human populations. Am J Physiol Regul Integr Comp Physiol. 2017;313:R687–92.
Rotwein P. Large-scale analysis of variation in the insulin-like growth factor family in humans reveals rare disease links and common polymorphisms. J Biol Chem. 2017;292:9252–61.
Rotwein P. Variation in the repulsive guidance molecule family in human populations. Physiol Rep. 2019;7:e13959.
Rotwein P. Diversification of the insulin-like growth factor 1 gene in mammals. PLoS One. 2017;12:e0189642.
White BH. What genetic model organisms offer the study of behavior and neural circuits. J Neurogenet. 2016;30:54–61.
Kawakami K, Largaespada DA, Ivics Z. Transposons as tools for functional genomics in vertebrate models. Trends Genet. 2017;33:784–801.
Madeira F, Park YM, Lee J, et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Res. 2019;47:W636–41.
Trifinopoulos J, Nguyen LT, von Haeseler A, Minh BQ. W-IQ-TREE: a fast online phylogenetic tool for maximum likelihood analysis. Nucleic Acids Res. 2016;44:W232–5.
National Institutes of Health research grant, R01 DK042748–28 (to P. R.). The funding body played no role in the design of the study, in the collection, analysis, or interpretation of data, or in writing the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
RNA-sequencing libraries screened for gene expression.
Probes for screening RNA-sequencing libraries.
Characterizing 5’ ends of mammalian Zmat2 genes by analysis of RNA-sequencing libraries. Mapping putative 5’ ends of mammalian Zmat2 genes by examination of gene expression data from species-specific RNA-sequencing libraries, with 60 base pair genomic segments a-c, a-d, or a-e as probes. A. Rat; B. Guinea pig; C. Rabbit; D. Cow; E. Horse; F. Pig; G. Sheep; H. Goat; I. Megabat; J. Dog; K. Cat; L. Elephant; M. Dolphin; N. Tasmanian devil; O. Koala.
Characterizing 3’ ends of mammalian Zmat2 genes by analysis of RNA-sequencing libraries. Mapping putative 3’ ends of mammalian Zmat2 genes by examination of gene expression data from species-specific RNA-sequencing libraries, with 60 base pair genomic segments a-d, a-e, a-f, or a-g as probes. A. Rat; B. Guinea pig; C. Rabbit; D. Cow; E. Horse; F. Pig; G. Sheep; H. Goat. A vertical arrow denotes the possible 3’ end of Zmat2 transcripts.
Characterizing 3’ ends of mammalian Zmat2 genes by analysis of RNA-sequencing libraries. Mapping putative 3’ ends of mammalian Zmat2 genes by examination of gene expression data from species-specific RNA-sequencing libraries, with 60 base pair genomic segments a-d, a-e, or a-f as probes. A. Dog; B. Cat; C. Elephant; D. Dolphin; E. Megabat; F. Koala; G. Tasmanian devil. A vertical arrow denotes the possible 3’ end of Zmat2 transcripts, which could not be identified for dog or cat genes.
Mammalian ZMAT2 protein sequences from UniProt.
Rotwein, P., Baral, K. Zmat2 in mammals: conservation and diversification among genes and Pseudogenes. BMC Genomics 21, 113 (2020). https://doi.org/10.1186/s12864-020-6506-3