De novo assembly and annotation of the singing mouse genome

Background Developing genomic resources for a diverse range of species is an important step towards understanding the mechanisms underlying complex traits. Specifically, organisms that exhibit unique and accessible phenotypes-of-interest allow researchers to address questions that may be ill-suited to traditional model organisms. We sequenced the genome and transcriptome of Alston’s singing mouse (Scotinomys teguina), an emerging model for social cognition and vocal communication. In addition to producing advertisement songs used for mate attraction and male-male competition, these rodents are diurnal, live at high-altitudes, and are obligate insectivores, providing opportunities to explore diverse physiological, ecological, and evolutionary questions. Results Using PromethION, Illumina, and PacBio sequencing, we produced an annotated genome and transcriptome, which were validated using gene expression and functional enrichment analyses. To assess the usefulness of our assemblies, we performed single nuclei sequencing on cells of the orofacial motor cortex, a brain region implicated in song coordination, identifying 12 cell types. Conclusions These resources will provide the opportunity to identify the molecular basis of complex traits in singing mice as well as to contribute data that can be used for large-scale comparative analyses.

species, because they allow more detailed analysis of gene expression, sequence-driven manipulations of gene function, as well as comparative analysis of genome evolution [23].
Alston's singing mouse, Scotinomys teguina, produces a unique, easily quantifiable vocal display that makes it an excellent model for understanding the genomic mechanisms of complex, behavioral traits.These diurnal rodents live in the montane grasslands of central America and are obligate insectivores [24].Singing mice are named for the long, elaborate songs they use for mate attraction and male-male competition [24][25][26][27][28]. Their unique natural history as well as their complex social interactions make singing mice an excellent candidate for exploring the mechanisms and evolution of traits such as circadian rhythms, diet and energy balance, the challenges of thermoregulation or high-altitude living, dynamic vocal communication, and more.
Unlike model rodents such as house mice, singing mice produce highly structured advertisement songs (Fig. 1) which make them an emerging model for social cognition and vocal communication.These songs consist of rapidly repeated, frequency-modulated notes which span ~ 16 kHz in as little as 12 ms [29].Note amplitudes, frequencies, and repetition rates are modulated over the course of the song (Fig. 1).Singing mice have highly structured vocalizations that are rapidly exchanged with conspecifics (counter-singing) in a manner whose timescale resemble human conversational speech, a feature not found in house mouse communication [30].
In addition to variation among species [26,29], the advertisement song also varies between individuals [31] and populations of singing mice [29].Among individuals, both internal and external cues drive song differences.For example, androgen levels and adiposity signals such as circulating leptin are associated with differences in "song effort" measures (e.g., song length), but not spectral features (e.g., frequency bandwidth) which may be set by vocal anatomy [27,28,[31][32][33].The social environment also tunes vocal output.For example, when other males are present, male singing mice produce longer songs and rapidly turn-take during singing bouts, finely coordinating song onset, and offset [30].Together, these unique song features and the complexity of cues that impact song provide an exemplary opportunity to understand many fundamental questions such as how internal and external cues are integrated to modulate behavior, how elaborate vocalizations evolve, and more.The development of genomic resources for singing mice will provide opportunities to explore these questions as well as contribute to comparative work.
We sequenced the singing mouse genome and transcriptome using PromethION, Illumina, and PacBio technologies.We next assembled and annotated the genome and transcriptome and examined gene expression to assess the utility of our assemblies.Finally, we extracted and sequenced cell nuclei from the orofacial motor cortex (OMC) using 10X Genomics.The OMC was chosen because it plays an important role in the social modulation of singing behavior, namely counter singing between conspecifics [30].To identify cell-types within the OMC, we used gene expression analysis.Data associated with this project are deposited under NCBI Bio-Project PRJNA878522 and raw data are available on GEO (GSE212957 series) and the NCBI assembly database.The annotated genome and transcriptome data can be viewed on the UCSC genome browser (https:// genome.ucsc.edu/).A well-annotated genome and transcriptome will enable future work identifying the genomic substrates of a variety of physiological and behavioral adaptations.

DNA isolation
All animal procedures were approved by the University of Texas at Austin and New York University Grossman School of Medicine IACUC.For PacBio and Illumina sequencing, we sacrificed 1 adult, male singing mouse.Liver and brain were dissected out and flash-frozen immediately.GDNA was then extracted from tissues using a Qiagen DNeasy kit.We visualized DNA integrity on an agarose gel and quantified DNA quality using a nanodrop.For PromethION sequencing, we sacrificed an adult, male singing mouse, extracted its liver, and immediately froze the tissue.High molecular weight DNA was extracted from liver tissue using a Genomic-tip 20/G DNA kit (QIAGEN, 10223).

DNA sequencing and genome assembly
We did library preparations and sequencing using Pro-methION technology (Oxford Nanopore MinION) at the New York University Langone Medical Center.
PacBio library preparation and sequencing was done at Duke University using 6-8 kb insert sizes with subreads ranging from 2 KB-3 KB (PacBio RS platform).We did Illumina library preparation and sequencing at the University of Texas Austin Core facility.Two Illumina libraries were created: a fragmentation library consisting of 170, 400, and 900 bp segments (PE Barcode + 2 × 100, 3 lanes = 510 M reads requested) and a mate-pair library with a 3 KB insert size (PE Barcode + 2 × 100, 1 lane = 170 M reads requested).Libraries were sequenced on an Illumina HiSeq 2500.

RNA isolation and sequencing
RNA extraction and sequencing was done at UT Austin and NYULMC.All animals were sacrificed using isoflurane overdose.At UT Austin, forebrain, hindlimb skeletal muscle, gonads, and liver were dissected from 1 adult male singing mouse and immediately flash frozen.We extracted total RNA using a standard TRIzol method.RNA was then submitted to the UT Core facility for library preparation and Illumina sequencing.For RNAseq performed at NYU, we extracted RNA from freshly dissected tissue of two male and two female singing mice (liver and brain) using a Qiagen RNeasy Mini kit (Qiagen 74,104).We then homogenized the tissue using a rotor-stator homogenizer with disposable tips and did on-column DNAse digestion following manufacturer's instructions.An automated system performed poly-A library prep, and samples were run on a single-read Illumina HiSeq 4000 flowcell.

Transcriptome assembly
We assembled a de novo and reference guided transcriptome using Trinity (v2.8.4) [36,37].For the guided transcript assembly all the RNA-Seq reads were mapped to the assembled reference genome using STAR mapper (v2.5.0c) [38].Alignments were guided by a Gene Transfer Format (GTF) file.For quality control, we mapped the RNA reads to the assembled transcripts provided by Trinity [36,37].More than 83% of the reads mapped properly, suggesting a high-quality transcript assembly.To assess the completeness of the transcriptome and genome assembly, we used BUSCO (v.5.4.5)[39].

Functional enrichment
GO and KEGG analyses.To assess the accuracy of transcriptome assembly and annotation, GO MWU [47] and KEGG [48][49][50] analyses were used.The GO MWU method of gene ontology (GO) enrichment analysis uses a ranked list of genes to identify whether each GO category is significantly enriched with up-or downregulated genes [47,51].We did functional enrichment analysis of GO and KEGG Reactome pathways using g:Profiler (v.e101_eg48_p14_baf17f0) with a g:SCS significance threshold of 0.05 [52].Ordered gene lists for each tissue type included only those that had a |fold-change| of at least 2. We exported GO functional enrichment results from g:Profiler and created network pathways [53] using the EnrichmentMap application [54] in Cytoscape [55].Maps were created with FDR Q value < 0.01 and combined coefficients > 0.375 with a combined constant of 0.5.We used an expression file of normalized fold-change values to create heatmaps of genes enriched pathways.We then used AutoAnnotate to interpret the function of groups of nodes in the network.

Genome annotation
RNA-Seq reads from Illumina and the assembled reference genome were used to create transcript-backed and prediction-based annotations.We concatenated both the guided and de novo transcriptome assembly results and cleaned them using the PASA pipeline (v.2.5.3)[56] for UniVec vector sequences [57].Cufflinks [58][59][60][61] was used to make a GTF file for PASA pipeline and the tdn.accs file was made using the de novo assembly.We used Stringtie2 (v.2.2.0) [62] to make a another GTF file which was then passed to TransDecoder (v.5.5.0,Haas, BJ. https:// github.com/ Trans Decod er/ Trans Decod er) to identify coding regions within transcripts.We then used three different ab initio predictors, GlimmerHMM (v.3.0.4)[63], GeneMarkHMM [64], and Augustus (v.3.5.0)[65] and combined the resulting GFF3 files.Miniprot (v.0.12) [66] and Uniref100 [67] were used for protein alignment and prediction.Finally, all output files from PASA, TransDecoder, the three ab initio predictor programs, and miniprot were passed to EVi-denceModeler (v.2.1.0)[68,69] to generate a complete and comprehensive annotation file.The resulting GFF was converted to a GTF for downstream analyses (bulk RNA-Seq and snRNA-Seq).A cDNA fasta file was produced from the GTF and used as an input for BLAST [70].We blasted the cDNA file against the entire Uniprot database (all organisms; [71]).BLAST results were then used to annotate the GTF file with gene symbols.

Single nuclei sequencing
We tagged the orofacial motor cortex (OMC) area from one adult male singing mouse for extraction via stereotaxic injection of fluorescent dextran beads into the brain as previously described [30].Post injection, the mouse was immediately transcardially perfused with ice-cold artificial cerebrospinal fluid (aCSF).We sectioned the brain into 250 µm sections, located the dyed area under a dissecting microscope, and removed the region with a scalpel.Extracted tissue was immediately flash frozen in liquid nitrogen and stored overnight at -80 °C.We dissociated nuclei using a modified version of the Mccarroll lab protocol [72].FITC-tagged NeuN antibody (Sigma, MAB377) was prepared following manufacturer's instructions (Abcam, 188,285), and used to enrich for NeuN + , DAPI + nuclei on a MoFlo XDP flow cytometer (Beckman Coulter) with a 100 µm nozzle.We loaded 9000 sorted nuclei into GEMs on a 10X Genomics Chromium Controller (1 st generation, 10X v3 chemistry) using 3' v3 chemistry and recovered 3500 high-quality nuclei after standard analysis (10X Genomics CellRanger pipeline v. 3.1.0)[73].

Single nuclei gene expression analysis
Single-nuclei gene expression data were generated using the 10X Genetics Chromium system, following the manufacturer's instructions for sample and library prep.We aligned raw FASTQ files to the singing mouse transcriptome and then assigned reads to individual nuclei via the 10X CellRanger pipeline.The resulting gene expression matrix was analyzed using the standard Seurat package (v.3) [74] in RStudio (v.4.0.2).We excluded genes with expression in < 3 nuclei from the analysis.We filtered the expression data to only keep nuclei with fewer than 11,500 genes, and fewer than 40,000 molecules detected, excluding 14 likely doublet nuclei.
Singlet nuclei gene expression data were then log-normalized using the Seurat pipeline [74] and only the top 2,000 most variable genes were selected for downstream analysis.We ran PCA on the top 2,000 variable genes using standard Seurat settings and clustered nuclei via standard commands using the first 20 principal components.A resolution value of 0.1 was used to capture large, cell-type-level clusters of similar nuclei, resulting in 12 clusters that were categorized into major cell types based on known marker genes.We did dimensional reduction via two standard methods, tSNE (t-distributed Stochastic Neighbor Embedding) [75] and UMAP (Uniform Manifold Approximation and Projection) [76,77].UMAP better distinguished clusters and was used for downstream analyses.Marker genes for each cluster (genes significantly enriched) were identified using the standard threshold values of > 0.25 percent of nuclei expressing the gene and > 0.25 log-fold change.We plotted the top 10 markers genes for each cluster on a heatmap and compared these with known marker genes to determine what cell types are represented by each cluster.

Results
The annotated singing mouse genome and transcriptome can be viewed at the UCSC genome browser (https:// genome.ucsc.edu/).

Genome and transcriptome assembly and annotation
After assembly, the total genome size was 2.4 Gb.After scaffolding there were 7806 contigs.We assembled both a de novo and reference guided transcriptome and identified 754,907 transcripts.Assembly quality and completeness metrics can be found in Table 1.
We annotated the genome using the PASA pipeline and resulting GTF file can be accessed at the UCSC genome browser (https:// genome.ucsc.edu/).We annotated genes using blast and only included annotations for genes that had at least 80% sequence similarity to the reference gene (14,989 genes included).Our scaffolds have 92.5% of the complete, single-copy BUSCOs (Table 1).

Validation
We validated the quality of the transcriptome and annotations by doing gene expression analyses.Reads were normalized using DEseq2 [42] and samples were clustered by Euclidean distance (Fig. 2).As expected, samples clustered by tissue type.Principal components analysis revealed two components that distinguished tissue type (Fig. 3).PC1 separated brain gene expression from that of the liver, while PC2 distinguished brain and liver expression from that of the muscle and gonads.We then compared gene expression profiles between pairs of tissues.To validate that we accurately mapped transcripts to annotated genes, we did GO MWU [47] and KEGG [48][49][50] analyses.We found that the metabolic pathways KEGG term was the most significantly enriched among all annotated genes within the genome (Fig. 4).We then did GO MWU [47] on each tissue-type gene list which we ranked by fold-change.This analysis revealed enrichment of expected GO terms based on tissue type.For example, genes upregulated in the brain were enriched with terms related to synapse structure and function (Fig. 5a).To further assess whether we detect of appropriate, tissuespecific gene expression, we constructed network pathways from the brain GO enrichment results.The analysis determined 389 gene sets ("nodes") and 770 instances of overlap between gene sets ("edges") that were sorted into 146 clusters (Fig. 5b).We found that the network was annotated with functions consistent with first-principles predictions based on the focal tissue.For example, the brain functional GO network was annotated with functions such as "postsynaptic membrane component".

Single nuclei sequencing of the Orofacial Motor Cortex (OMC)
We assessed the quality of the data using cellranger (114 outliers removed, 3486 nuclei retained).For 3,486 nuclei that passed quality control, we did dimensional reductions (see Methods) and displayed them using t-distributed stochastic neighbor embedding (t-SNE; Fig. 6a) [75] and uniform manifold approximation and projection (UMAP; Fig. 6b) [76,77].Major brain cell types in 12 clusters were clearly identifiable based on canonical cell-type marker gene expression (Fig. 6c).We identified 5 clusters of nuclei as excitatory neurons, expressing high levels of Syt1 (synaptotagmin-1), two clusters as inhibitory neurons, which expressed high levels of Gad-2 (glutamate decarboxylase 2), one cluster of astrocytes, expressing Gfap (glial fibrillary protein), and one cluster of oligodendrocytes, which expressed Mbp (myelin basic protein) [78][79][80].Normalized expression of the top ten marker genes for each brain cell type clearly distinguished the 12 nuclei clusters using t-SNE and UMAP (Fig. 6d).

Discussion
We sequenced, assembled, and annotated a genome and transcriptome for Alston's singing mouse, a model for complex social behavior and vocal communication.Transcriptome and gene annotation quality were validated using gene expression and functional enrichment analyses.Finally, we did single-nuclei sequencing of cells of the orofacial motor cortex (OMC), a region involved in vocal turn-taking in singing mice [30] and identified major cell types.The annotated genome and transcriptome will be a valuable resource that will allow characterization of the genetic basis of complex traits in singing mice as well as be useful for comparative studies more broadly.

Genome and transcriptome assembly and annotation
By using three sequencing technologies, we were able to create a high quality de novo genome assembly.Short reads, like those generated by Illumina, provided the highest base-pair-level accuracy [81][82][83].Longer reads  generated by PacBio's SMRT sequencing [84,85], produced excellent contigs.Finally, contig scaffolding was facilitated by PromethION's nanopore sequencing [86,87], which can sequence the longest stretches of DNA [88,89].These sequencing efforts culminated in a 2.4 Gb genome.The size of the singing mouse draft genome is like that of other sequenced rodents such as house mice, Mus musculus, (strain: C57BL/6 J, genome size: 2.5 Gb) [90] and white-footed mice, Peromyscus leucopus (genome size: 2.45 Gb) [91].Using BUSCO [39], we found that our assembly captures much of the genome; our scaffolds contain 92.5% of the complete and singlecopy BUSCOs (annotated collection of ubiquitous mammalian genes).However, DNA was extracted from two different singing mice for sequencing.Tissue from one individual was used for long read PromethION sequencing while the other was used to generate PacBio and Illumina reads.Using DNA from two individuals in our assembly could impact future analyses of standing genetic variation.
We assembled 754,907 transcripts into a de novo transcriptome with a contig N50 of 826.When aligned to the reference genome, 83.41% of the transcriptome mapped, indicating a quality transcriptome assembly [36,37].As expected, the contig N50 based on only the longest isoform per gene was lower than that of all transcripts since including all transcript isoforms can exaggerate N50 values.
We did differential expression and functional enrichment analyses to test the quality of the transcriptome and annotations.In support of our expectations for a quality assembly and annotation, we found tissue-specific gene expression profiles.Two distinct clustering methods showed that samples of the same tissues type, both technical and biological replicates, had identical (technical replicates) or very similar (biological replicates) expression patterns that differed greatly from other tissues.Functional enrichment analysis identified the putative function of differentially expressed genes across tissue type.Within the brain, we found enrichment of expected pathways such as synapse-related GO terms.A network-based approach supported these results, clustering related nodes into larger functional groups with brain-relevant annotations such as "glutamate neurotransmitter receptor." This approach is useful because GO functional categories often share many genes and the results of GO analyses can often be redundant [53,55].A network approach clusters by gene overlap which allows the annotation of groups of similar gene sets, rather than annotating each gene set independently.The results of these analyses suggest that our transcriptome assembly Fig. 4 Barplot of KEGG terms for all annotated genes shows that metabolic pathways are enriched in this dataset [49][50][51] is of high quality and the annotations we created are accurate.For our functional enrichment analyses, we focused on brain gene expression because we are interested in understanding how the nervous system drives complex behavior.A quality transcriptome assembly and gene annotations allowed us to examine gene expression of single nuclei to identify specific brain cell types that may contribute to behavior (see Single-nuclei sequencing of the OMC).Single-nuclei sequencing of regions implicated in song production [30,92] can be paired with other approaches such as sequence-based interventions and epigenetic profiling to understand how the brain patterns vocal output.
The genome, transcriptome, and annotation GTF file can be viewed at the UCSC genome browser (https:// genome.ucsc.edu/) and raw data accessed on GEO under series GSE212957.We identified 14,989 genes that have at least 80% sequence similarity to reference genes.This is on the order of annotation efforts in other species [90,91,93,94].We combined the PASA with multiple ab initio gene finding, protein homology, and weighted consensus gene structure tools which generated a more comprehensive genome annotation.

Single-nuclei sequencing of the OMC
Dimensional reductions of the expression profiles of 3,486 OMC nuclei revealed 12 clusters that were categorized into major cell types based on marker genes.Most nuclei fell into 7 clusters that were categorized as neuronal, which we expected, since we used NeuN antibodies to enrich for neuronal nuclei prior to sequencing.Five of these clusters were identified as containing nuclei of excitatory neurons, expressing high levels of Syt1 (synaptotagmin-1), gene encoding a Ca2 + sensory for neurotransmitter release [80,[95][96][97].Two inhibitory interneuron clusters were identified by expression of Gad-2 (glutamate decarboxylase 2), which encodes an enzyme that catalyzes the synthesis of GABA, an inhibitory neurotransmitter [80,98].Despite selecting for neuronal nuclei, a few small clusters of other cell types were Fig. a GO tree of enriched terms for ranked brain gene expression (GO division: cellular compartment) using Fisher's exact test.Font indicates significance and the fraction before each term shows the number of genes annotated with the GO term relative to the total number of such genes in the dataset.(b, top) A network plot of brain GO enrichment results was made in Cytoscape using EnrichmentMap (FDR Q value < 0.01, combined coefficients > 0.375, combined constant 0.5).Most highly connected nodes (b, bottom) are annotated using AutoAnnotate.Each node is a gene set and the size of each node represents the number of genes in the gene set.Edges (lines between nodes) represent overlap between gene sets and their width refer to the number of genes that are shared by the nodes.The color of each node represents enrichment scores (q-value) also identified such as astrocytes (high Gfap) and oligodendrocytes (Mbp) [80].Clear identification of brain-cell types demonstrates the robustness of the singing mouse transcriptome and genome and demonstrates the broad applicability of 10X single cell/single nuclei technology to a nontraditional rodent species.The brain region we chose for single nuclei analysis is an important temporal regulator of the singing mouse advertisement song [30].By combining single-cell analysis of relevant brain regions with circuit-level studies [92], we can develop hypotheses about the role of each network node and the cellular mechanisms that underlie these functions.Together, these resources allow us to examine how the nervous system directs complex behavior.

Uses of these resources
The contribution of a high-quality singing mouse genome and transcriptome increases the diversity of Fig. 6 The S. teguina genome and transcriptome enable identification of brain cell types in a single-nuclei RNAseq dataset from brain area OMC.A tSNE dimensional reduction of nuclei isolated from brain area OMC.B UMAP dimensional reduction of same data C. Feature plots of normalized gene expression data for key brain cell type markers (Syt1: synaptotagmin-1, neuron; Gad2: glutamate decarboxylase 2, inhibitory neuron; Gfap: glial fibrillary protein, astrocyte; Mbp: myelin binding protein, oligodendrocyte).D Heatmap of normalized gene expression data for the top 10 marker genes for each brain cell type identified available model species and improves our ability to ask mechanistic questions.Singing mice are a particularly useful model for understanding how the brain drives complex behavior due to their unique, quantifiable phenotype, their tractability in the lab, and our ability to adapt existing neurobiological tools and resources for singing mice.The addition of genomic resources provides further opportunity to use singing mice to study novel questions in social cognition.For example, to explore the genomic basis of complex traits, we could use the genomic resources we have generated to examine gene regulation (e.g., ChIP-seq: [99]; ATACseq: [100]), sequence evolution (e.g., tests of selection: [101][102][103]), and more.These data also contribute to a library of resources that can be used for larger comparative analyses.Increasing the diversity of model systems, through the addition of species that are wellsuited to particular questions, is essential to understanding the mechanisms that drive complex traits.
• fast, convenient online submission • thorough peer review by experienced researchers in your field • rapid publication on acceptance • support for research data, including large and complex data types • gold Open Access which fosters wider collaboration and increased citations maximum visibility for your research: over 100M website views per year

•
At BMC, research is always in progress.

Learn more biomedcentral.com/submissions
Ready to submit your research Ready to submit your research ?Choose BMC and benefit from: ? Choose BMC and benefit from:

Fig. 1 (
Fig. 1 (Left) A singing mouse and (right) a spectrogram of a representative advertisement song.Insets below show how frequency bandwidth, note shape, and note length change over the course of a song.Each inset is 0.15 s.Photo: Long lab

Fig. 2 Fig. 3
Fig. 2 Euclidean-distance-based heatmap shows that samples of the same tissue type have the most similar gene expression.Lower values (darker blue) indicate more similarity

Table 1
Assembly characteristics