- Open Access
A chromosome-level genome assembly of Cairina moschata and comparative genomic analyses
BMC Genomics volume 22, Article number: 581 (2021)
The Muscovy duck (Cairina moschata) is an economically important duck species, with favourable growth and carcass composition parameters in comparison to other ducks. However, limited genomic resources for Muscovy duck hinder our understanding of its evolution and genetic diversity.
We combined linked-reads sequencing technology and reference-guided methods for de novo genome assembly. The final draft assembly was 1.12 Gbp with 29 autosomes, one sex chromosome and 4,583 unlocalized scaffolds with an N50 size of 77.35 Mb. Based on universal single-copy orthologues (BUSCO), the draft genome assembly completeness was estimated to be 93.30 %. Genome annotation identified 15,580 genes, with 15,537 (99.72 %) genes annotated in public databases. We conducted comparative genomic analyses and found that species-specific and rapidly expanding gene families (compared to other birds) in Muscovy duck are mainly involved in Calcium signaling, Adrenergic signaling in cardiomyocytes, and GnRH signaling pathways. In comparison to the common domestic duck (Anas platyrhynchos), we identified 104 genes exhibiting strong signals of adaptive evolution (Ka/Ks > 1). Most of these genes were associated with immune defence pathways (e.g. IFNAR1 and TLR5). This is indicative of the existence of differences in the immune responses between the two species. Additionally, we combined divergence and polymorphism data to demonstrate the “faster-Z effect” of chromosome evolution.
The chromosome-level genome assembly of Muscovy duck and comparative genomic analyses provide valuable resources for future molecular ecology studies, as well as the evolutionary arms race between the host and influenza viruses.
The Muscovy duck (Cairina moschata), one of the largest wood ducks, is more land-oriented than other ducks . Domestic Muscovy duck has become one of the most economically important poultry species in the world due to its unique taste, high breast meat production and low calories. Eleven breeds of Muscovy duck in Latin America, the Caribbean, Europe, Asia Pacific and Africa are registered in the Domestic Animal Diversity Information System . Muscovy duck was domesticated in Central or South America, spread around the world by European colonists, and eventually introduced to China [3, 4]. The breeding history of Muscovy duck in China has a reliable record of more than 250 years . Except for Muscovy duck, almost all varieties of domesticated ducks are descended from mallard (Anas platyrhynchos) . The genome of the common domestic duck has been characterized and published . The Muscovy duck genome (assembly version: CaiMos1.0) was sequenced in 2019 using second-generation sequencing data and a reference-guided method, but it lacks accuracy evaluation, and it remains unpublished (GenBank acc. no. GCA_009194515.1).
The modern technology of next-generation sequencing can generate billions of short-read fragments at a relatively low price with high accuracy . It is difficult to obtain long and continuous scaffolds using short reads, owing to repetitive or heterozygous structures, which makes de novo assembly challenging . Although the use of PacBio long-read sequencing, Bionano optical mapping, and Hi-C scaffolding can provide highly contiguous genome assemblies, those methods substantially increase the cost of sequencing projects. Linked-reads sequencing (10x Genomics)  has recently been shown to generate a high-quality, cost-effective de novo assembly in a non-model mammal . In addition, combining 10x genomics and a reference-guided method can achieve chromosome-scale assembly from a previous study  without the need for mate-pair reads with different insert sizes, or physical and genetic maps .
Here, we combined linked-reads sequencing (10x Genomics) and reference-guided approaches as a cost-effective strategy to enhance the conventional short-read and long-read-based methods, to obtain a draft assembly of the Muscovy duck genome. Comparative genomics analyses of the Muscovy duck, including orthology, species-specific, rapid expansion and identification of positively selected genes could deepen our understanding of the evolutionary relationships between different species of birds at the molecular level. The comparison of the evolution of sex chromosomes and autosomes could shed light on the fundamental evolutionary forces. Our draft assembly is available for public use, and genome analysis can assist future studies of evolution and ecology in birds.
Results and discussion
A total of ~ 128-fold (128 Gbp data) read coverage was obtained with paired-end 150 bp reads using 10x Chromium technology in this project. Using the Supernova  assembler, we produced a draft Muscovy duck genome of 1.12 Gbp. This is lower than the predicted 1.32 Gbp, based on kmer analysis (Additional file 1: Fig. S1). The assembly comprised 15,925 scaffolds > 1 kb with contig N50 of 219.51 kb and scaffold N50 of 2.27 Mb. The mean input DNA molecule length (~ 16.65 Kb) in Supernova assembler statistics output (Additional file 2: Table S1) was lower than the official recommendation (50–100 kb), which may greatly affect the scaffold N50 length. To further assemble these into a contiguous draft, we reused the 10x reads for kmer mapping against the Supernova assembler. Our results showed that ARKS software  further improved the connectivity of the Supernova genome (scaffold N50 = 5.22 Mb).
The reference-based scaffolder Chromosomer  was used to align previous scaffolds to the closely related A. platyrhynchos genome to construct a chromosome-level assembly. After reference-guided scaffolding, approximately 99.08 % of genome sequences assembled by ARKS were aligned to the A. platyrhynchos genome; 93.98 % of the sequences were anchored to 29 autosomes and sex chromosomes. GapCloser  was then used to fill the gaps in the pseudo-chromosomes, resulting in 13,924 gaps completed.
Finally, we assembled a chromosome-level Muscovy duck genome of 1.12 Gbp, including 4,613 scaffolds > 1 kb with scaffold N50 of 77.35 Mb. The total genome size is comparable to that of the A. platyrhynchos (~ 1.13 Gbp). We evaluated the completeness of the draft genome assembly by calculating coverage for a set of Single-Copy Orthologous genes in Aves using BUSCO . The result showed that the genome coverage rate is 93.30 %, slightly higher than A. platyrhynchos’s genome completeness (89.10 %) (Table 1).
Gene prediction and annotation
We predicted 15,580 protein-coding genes by integrating ab initio, homology- and transcript-based methods. The number and length of genes and exons predicted by each approach are listed in Additional file 3: Table S2. Subsequently, we aligned the protein sequences to the BUSCO database to evaluate the annotation quality, and found 90.8 % single copy BUSCO genes (Additional file 4: Table S3), suggesting a high degree of completeness in the predicted genes. For repetitive elements, the result showed that the level of repeats in the Muscovy duck genome (9.19 %, Additional file 5: Table S4) is similar to that of the chicken genome (9.45 %) and higher than that of the A. platyrhynchos genome (5.85 %) . In birds, transposable elements (TEs) usually account for 4–10 % of the total genome size . Finally, 99.72 % (15,537) of the predicted genes were functionally annotated using public databases (Additional file 6: Fig. S2). Taken together, these analyses suggested a satisfying level of completeness and accuracy of genome annotation.
Orthology and evolution
The OrthoFinder  analysis identified 22,701 orthogroups (gene family clusters), of which 16,783 comprised two or more species (Additional file 7: Table S5). Of the 13,500 orthogroups including one or more Muscovy duck sequences, 9,878 orthogroups (43.51 %) were present in all species (Fig. 1). Among these, 7,182 consisted entirely of single-copy genes with a one-to-one relationship in different genomes (Additional file 7: Table S5). Compared to other species, we found 762 Muscovy duck-specific genes, classified into 709 gene families (Additional file 8: Table S6). To further elucidate the biological relevance of these species-specific genes, we assigned Gene Ontology (GO) terms using WEGO 2.0 . GO analysis indicated that most of these genes were classified into the molecular function category, including functional subcategories such as ion binding, protein binding, heterocyclic compound binding, organic cyclic compound binding, etc. (Additional file 9: Fig. S3). In addition, we also performed a Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis  using these genes and found 24 mainly enriched pathways, including Calcium signaling pathway, Vascular smooth muscle contraction, Gap junction, Focal adhesion, Adherens junction, etc. (Additional file 10: Table S7). Furthermore, we also found that Muscovy-specific genes (GABRG3, MDGA1 and Shank3) were associated with the development of the brain and nervous system. Indeed, compared to common domestic duck, Muscovy duck had higher virus titers developed in vital organs, particularly in the brain . It is an interesting question to investigate Muscovy-A. platyrhynchos differences in brain morphology, worthy of further study.
The divergence time tree using 500 (6.96 %) single-copy genes dated the common ancestor of Muscovy and A. platyrhynchos to about 13.8 million years ago (MYA) (Fig. 1), which is within the range proposed by previous studies (9.0–17.9 MYA) [21,22,23]. Based on the results of gene cluster analysis in the previous step, we performed a computational analysis of gene family size to understand gene family expansion and contraction between the Muscovy duck and the other seven species included in the dataset for comparative analyses. We found that ~ 608 gene families had undergone expansion (517 in A. platyrhynchos) and ~ 1,587 gene families had contracted (1,915 in A. platyrhynchos) (Fig. 1). A total of 134 significantly expanded gene families and 529 contracted gene families were found in the Muscovy duck (P < 0.05) (Additional file 11: Table S8). KEGG analysis of 479 genes from these expanded gene families revealed that they were mainly classified as ABC transporters, Calcium signaling pathway, Adrenergic signaling in cardiomyocytes, and GnRH signaling pathway (Table 2). Intriguingly, some of these pathways overlapped with pathways that species-specific Muscovy duck genes were involved in, such as the Calcium signaling pathway, Adrenergic signaling in cardiomyocytes, and GnRH signaling pathway (Table 2 and Additional file 10: Table S7). In addition, we found that the A. platyrhynchos genome had 1060 genes corresponding to 529 gene families, while the Muscovy duck genome only existed 39 genes (Additional file 11: Table S8). These significantly contracted genes of Muscovy duck were involved in 4 pathways: Necroptosis, Histidine metabolism, β-Alanine metabolism, and Ascorbate and Aldarate metabolism (Table 2). We also found 39 genes were mainly annotated in olfactory receptor (Additional file 12: Table S9). Gene family expansion and contraction is often reflective of phenotypic adaptation during the evolutionary trajectory of species .
Since Ka/Ks values > 1 are considered to indicate the directional (Darwinian) evolution , we focused on 104 genes with Ka/Ks ratios greater than 1.0 in the Muscovy duck genome as further evidence of adaptive evolution (Additional file 13: Table S10). KEGG annotation results showed that rapidly evolving genes were enriched in the Toll-like receptor, Cytokine-cytokine receptor interaction, Necroptosis and Influenza A signaling pathways (Additional file 14: Table S11). We found that two interferon genes (IFNAR1 and IFNAR2) were simultaneously involved in these four pathways and that the tumor necrosis factor (TNF) gene and a TLR5 gene were independently involved in the Cytokine signaling pathway and the Toll-like receptor pathway, respectively (Additional file 15: Table S12). Gu et al.  also found that IFNAR1 and TLR5 exhibited Muscovy-specific SNPs. Signatures of positive selection and species-specific SNPs both indicated that these two Muscovy duck genes underwent adaptive evolution. All of these pathways play important roles in the innate immune response mechanisms [27, 28]. For example, the expression of Toll-like receptor (TLR) genes was mostly up-regulated in the brain after the duck plague virus infection (Anatid alphaherpesvirus 1), and expression levels of cytokine-related (IFNA, TNF) and TLR genes were significantly increased in the lungs after the avian influenza virus infection [6, 29]. We identified five TLRs (TLR3, TLR4, TLR5, TLR7, TLR15) in the Muscovy duck genome through annotation to A. platyrhynchos genome. These genes with Ka/Ks value less than 1 except for the TLR5, which indicates that most Toll-like-related genes experienced purifying selection in the Muscovy duck genome. Furthermore, a relatively high number of directional evolution genes related to immune response indicate that Muscovy duck might have a stronger immune system than A. platyrhynchos. However, contrary to this prediction, the Muscovy duck is more susceptible to avian influenza virus H5N1 in terms of disease development and mortality than A. platyrhynchos [20, 30]. Arguably, this one case-study may not be reflective of the overall immune system responses in these two species. Further studies of these genes involved in the innate immune response pathways may provide insights into the viral defense mechanisms in Muscovy ducks.
Syntenic relationship with the A. platyrhynchos genome
To assess the syntenic relationship between species, we aligned the draft Muscovy duck genome assemblies to the A. platyrhynchos genome, which is currently the closest species with available chromosome-level assembly. The dot plot showed that reference-based pseudochromosomes exhibited a highly collinear relationship with the A. platyrhynchos chromosomes (Fig. 2), which indicated a high quality of our genome assembly. Several large-scale rearrangements were observed between the two genomes, including interspecific inversions at the Chr18: 6.49–8.89 Mb, Chr25: 3.45–5.59 Mb, and ChrZ (the sex chromosome): 41.33–44.69 Mb. To evaluate the reliability of these inversions, we independently checked them using structural variation detection software. Several small inverted regions were detected by BreakDancer  (Additional file 16: Table S13) within the largest inversion detected by SyRI  (Fig. 1 and Additional file 17: Table S14). SyRI detected an inversion at the chromosome level, while Breakdancer detected an inversion in a short read sequence, which indirectly confirmed the SyRI results. The ChrZ inversion with 3.36 Mb was the largest one, and the number of inversion on the Z chromosome was significantly greater than in all autosomes when chromosome size was accounted for (Additional file 18: Fig. S4). This is probably indicative of the comparatively fast evolution of the Z chromosome, known as the “faster-Z effect” [33, 34].
Faster-Z effect and inversion polymorphisms
Several studies have found a greater divergence in coding sequences on sex chromosomes (ZZ/ZW) than on the autosomes in birds [1,2,3,4]. Nonsynonymous to synonymous substitutions ratio (dN/dS = ω) can provide an insight into the strength of purifying and directional selection [5, 6]. We obtained 7,516 Muscovy - A. platyrhynchos 1:1 orthologs, encompassing 12.66 Mb, and found significantly higher mean ω values in genes linked to the Z chromosome compared to autosomes (P = 2.20E-07, based on two-sided Wilcoxon tests; Table 3). In agreement with previous findings [35, 36], this was driven by an increase in dN (P = 1.55E-06), rather than a decrease in dS on the Z chromosome. Two alternative explanations are codon bias differences between sex chromosomes and autosomes , and weak selection at synonymous sites . The Faster-Z effect can be explained by the fact that the effective population size of Z chromosomes is less than that of autosomes, which in turn will lead to increased genetic drift [35, 39]. We performed a re-sequencing data analysis and identified 6,186,165 polymorphic loci (including 295,278 loci on the Z chromosome) from Muscovy duck populations and 23,124,940 polymorphic loci (including 1,089,826 on the Z chromosome) from mallard populations. Muscovy and mallard populations clustered as two genetically distinct groups in a phylogenetic tree (Additional file 19: Fig. S5). The FST values between Muscovy and mallard ducks were calculated for Z chromosomes and autosomes. This analysis revealed that divergence between the two species was higher for the Z chromosome (FST = 0.7838) than for autosomes (0.6286). Synonymous sites can be used to approximate the level of neutral polymorphism to assess the effective population size (Ne) . The synonymous nucleotide diversity (πs) values of Z-linked genes in the Muscovy duck and mallard were lower than those of autosomal genes (Table 4). This reflects a lower Ne on the Z chromosome than autosomes in both species, which in turn is expected to result in decreased strength of purifying selection on the Z chromosome than autosomes, as purifying selection is less powerful at lower effective population sizes . The average πsZ/πsA estimates for Muscovy duck (0.730) and mallard (0.7006) were slightly lower than the neutral expectation of 0.75. The loss of diversity could be explained by increased genetic drift on the Z chromosome .
We then detected Z chromosome inversion polymorphisms and found that the FST values around the two breakpoints of the largest inversion on the Z chromosome were significantly higher than in adjacent genomic regions (Fig. 3). Similar phenomena have been observed before , and they are believed to be a consequence of the fact that inversion inhibits chromosomal recombination, which in turn results in greater genetic differentiation in the vicinity of inverted segments [43, 44]. Similarly, we also observed high LD near the inversion breakpoints as a result of almost completely inhibited recombination in these two regions (Fig. 3). These are in agreement with the hypothesis that inversion polymorphisms can propel the sex evolution as a result of suppressed recombination [45, 46].
Using linked-reads sequencing data and RNA-seq data we have generated a high-quality draft genome assembly and annotation of the Muscovy duck genome. Synteny analysis showed that reference-based pseudochromosomes exhibited a highly collinear relationship with corresponding common domestic duck chromosomes. Comparative genome analysis of the Muscovy duck and common duck genomes, including orthology, species-specific, rapid expansion and positively selected analyses could deepen our understanding of the evolutionary relationship between these two closely related species at the molecular level. In addition, we found mixed evidence of rapid divergence on the Z chromosome relative to autosomes using divergence and polymorphism data. Strikingly, inversions were enriched on the Z chromosome compared to autosomes, suggesting that inversion polymorphisms propelled the evolution of sex chromosomes. In conclusion, these results deepen our understanding of the evolution and ecology of Muscovy duck.
Samples and sequencing
A male Muscovy duck was collected in Wuhan, China (Animal handling and experiments were approved by the Scientific Ethic Committee of Huazhong Agricultural University (Permission number: HZAUCA-2016-058)). The liver sample was stored at -80 °C. All experiments and methods were performed in accordance with the ARRIVE guidelines (Animal Research: Reporting of In Vivo Experiments) . We declared that all methods were carried out in accordance with the relevant guidelines and regulations.
The Chromium™ Genome Protocol was used to generate a high molecular weight (HMW) genomic DNA from fresh liver tissues. Sample indexing and partition barcoded libraries were conducted using the Chromium Genome Reagent Kit (10x Genomics)  (Novogene Company, Beijing, China) following the manufacturer’s protocol. DNA sequencing on the Illumina NovaSeq platform generated paired-end reads. After removing poly-N and low-quality sequences, 128 Gbp clean reads (~ 128x) with a length of 150 bp were generated, and used for the subsequent genome assembly. Furthermore, the heart, liver, spleen, lung, kidney, brain, chest muscles, lymph, oral mucosa, stomach, small intestine, large intestine and hair follicle tissues were collected from the same duck specimen and used for RNA extraction. The RNA extracted from each of these tissues was mixed (1 µg from each tissue), and used as input material for the library construction workflow, which included the isolation of polyadenylated RNA molecules using poly-T oligo-attached magnetic beads, enzymatic RNA fragmentation, cDNA synthesis, ligation of bar-coded adapters and PCR amplification. Then clustering of the index-coded samples was performed on a cBot Cluster Generation System using TruSeq PE Cluster Kit v3-cBot-HS (Illumina) and sequenced using an Illumina Novaseq platform. After removing reads containing adapter sequences, poly-N, and low-quality reads from raw data, we obtained ~ 20.8 Gbp clean reads with a length of 150 bp to assist genome annotation.
Genome assembly and reference-guided reconstruction
Approximately 20.8 Gbp clean reads were obtained and then the 16 bp 10x Genomics barcode was trimmed using Long Ranger v2.1 basic. Jellyfish v2.3.0  was used to count the frequency of kmer size of 21. The histogram of the kmer counting distribution was processed in GenomeScope (v-2.0) (kmer_max = 10,000)  to estimate the genome size, the abundance of repetitive elements, and heterozygosity.
The original 10x Genomics linked-reads were used as input for the Supernova (v-2.1.1)  assembler for de novo genome assembly with the maximum reads (-maxreads) parameter set at 858 million input reads with optimal raw coverage of 85-fold, greater than the 56-fold suggested in the Supernova protocol (Additional file 2: Table S1). For further assembly into contiguous drafts, we used ARKS v-1.0.4 (parameters: m = 20–20,000, a = 0.9) , which again utilized the original 10x reads for kmer mapping against the Supernova assembly. The companion LINKS  program applied a scaffold graph generated by ARKS to create a longer assembly. To remove the artifacts produced by the above step, we removed scaffolds composed entirely of N and stretches of Ns at the beginning and end of scaffolds.
Chromosomer (v-0.1.3)  then was used to construct large pseudomolecules corresponding to the chromosomes from the assembled contigs or scaffolds by using A. platyrhynchos (IASCAAS_PekingDuck_PBH1.5) as the reference genome. For the assembly process, the scaffold sequences were aligned to the genome of A. platyrhynchos genome using BLASTN v 2.6.0 (-outfmt 6 and -evalue 1E-10) . Chromosomer (fragmentmap -r 1.01) software utilizes the results of alignments to connect the mapping fragments with a gap length setting of 100 and anchor them to the reference genome chromosomes. The redundant scaffolds (< 2 Mb) removed by CD-HIT (-c 0.99) , unlocalized and unplaced scaffolds were also collected to produce the final assemblies.
Finally, GapCloser (v-1.12) (-l 155 -p 31)  was used to fill the gaps left in the assembly process. At each step of the assembly, a Benchmarking Universal Single-Copy Orthologs (BUSCO v-4.0.6)  analysis was applied to evaluate the completeness of the gene set in our draft genome with the library “aves_odb10”.
Gene prediction and annotation
We employed a combined approach utilizing ab initio, homology- and transcript-based strategies to predict protein-coding genes. GeneMark-ES  and Augustus (v-3.3)  were used for ab initio gene prediction. GeneMark-ES is an unsupervised training algorithm that identifies protein-coding genes in eukaryotic genomes. Augustus (--species = chicken) was trained to apply homologous protein sequences from the closely related species and our RNA-Seq data to improve the reliability of the de novo prediction. Based on the homology method, exonerate (v-2.2.0)  (--model protein2geneome --percent 50) was used to align the annotated gene sets from four closely related species, A. platyrhynchos, Gallus gallus, Meleagris gallopavo and Taeniopygia guttata, to our draft genome. Based on the transcript prediction, Hisat2 (v-2.1.0)  was used to align the transcriptome data from 13 tissues to the draft Muscovy duck genome, and Stringtie (v-2.1.3b)  was used to predict the gene information. To improve the accuracy of transcription prediction, we also employed the PASA (v-2.4.1) annotation  to reconstruct the transcripts, and then TransDecoder (v-5.5.0) (http://transdecoder.sf.net) was utilized to identify potential protein-coding regions. We used EvidenceModeler  to integrate the gene structures obtained from the above three methods and filter out genes that lacked homolog identification or RNA-Seq data support. Finally, we aligned the predicted gene structures to the BUSCO (-m prot) databases to evaluate the obtained gene set.
To identify repeat elements in our draft genome we used a combined approach of de novo and homology-based prediction. RepeatModeler (v-1.0.11) (http://www.repeatmasker.org/RepeatModeler/) was used to construct the de novo self-specificity repeat library, and RepeatMasker (v-4.1.0)  was applied to produce a homolog-based repeat library with default parameters.
For gene function annotation, we used the gene sequences to align NCBI non-redundant protein (NR), Nucleotide Sequence (NT) , SwissProt  with E-value cutoffs of 1E-5. Best-hit BLAST results were then used to determine gene functions. Furthermore, motifs and domains in the predicted gene sequences were annotated using InterProScan (v-5.45)  relying on publicly available databases: Gene3D, PRINTS, Pfam, CDD, SMART, and MobiDBLite. PROSITE and Gene Ontology (GO) terms for each gene were extracted from the corresponding InterProcan entries.
Orthology and evolution
Protein sequences from the entire genome data of 5 species of birds (Taeniopygia guttata, Anser cygnoides, Anas platyrhynchos, Meleagris gallopavo, Gallus gallus) and 2 species of mammals (Mus musculus, Homo sapiens) were downloaded from the NCBI’s public database (Additional file 20: Table S15) and used for comparative analyses and gene clustering analysis. To produce a single transcript for each protein set, we filtered redundant alternative splicing events. We then identified final orthologs, in-paralogs and co-orthologs for all protein-coding genes using OrthoFinder (v-2.3.12)  with default parameters. To reconstruct the phylogenetic relationships among these species, 500 single-copy orthologous protein sequences were selected to perform multiple alignments using MUSCLE (v-3.8.31) . We extracted conserved blocks using Gblocks (v-0.91b)  and concatenated them into 8 supergenes. The JTT model in PhyML-3.3  was used to construct a maximum likelihood phylogenetic tree (100 bootstrap replicates). In addition, the MCMCtree program (-clock 2 -alpha 0.5 -model 3, PAML 4.8 package) [67,68,69] was used to combine the known time-calibration data in TIMETREE (http://www.timetree.org/) to estimate divergence times among species. Based on the results of OrthoFinder and divergence time, we applied CAFE v4.1  with a p-value of 0.05 to identify gene families that underwent expansion and contraction in the Muscovy duck genome compared with the other species. Then, we used the KOBAS  program to infer functional information about expanded or contracted gene families. Protein alignments of single-copy Muscovy - A. platyrhynchos orthologues were performed using the MUSCLE . The alignments were translated into a codon alignment with the Perl script PAL2NAL . Non-synonymous (Ka) and synonymous (Ks) substitution rates were calculated using KaKs_Calculator 2.0  with default parameters.
Syntenic relationship with the A. platyrhynchos genome
To assess the syntenic relationship with the A. platyrhynchos genome, we used minimap2 (-x asm5)  to align the assembled pseudomolecules corresponding to the chromosomes to the A. platyrhynchos chromosomes. The result was then visualized as a dot plot using D-GENIES  to assess the synteny relationship, as well as identifying regions exhibiting collinearity and rearrangements between the two assemblies. We then used two different methods to verify these structural variations. First, Muscovy duck and A. platyrhynchos genomes were aligned to identify rearrangements using SyRI  with default parameters. Second, we aligned trimmed 10x Genomics barcode reads to the A. platyrhynchos genome using BWA (v-0.7.12) , after which the Breakdancer (v1.4.5)  software was used to detect structural variations.
Divergence and polymorphism analyses
We extracted 1:1 orthology of Muscovy-A. platyrhynchos orthologous genes from the OrthoFinder results. Multiple protein sequence alignments of orthologues were performed using MUSCLE . The codeml program from the PAML package  was used to estimate the non-synonymous substitution rates (dN) and the synonymous substitution rates (dS) for Z chromosomes and autosomes separately. The orthologous genes with abnormal dS values (< 0.001) were removed as too few synonymous sites/substitutions indicate that those genes were probably misaligned. Confidence intervals (95 %) were calculated using bootstrapping (1,000).
We retrieved re-sequencing data for 4 Chinese Muscovy ducks  and 6 wild mallards from a previous study , available from the NCBI (SRP144280). The sequencing depth for each specimen was more than 10x. Raw reads were aligned to the Muscovy genome using BWA-MEM  with default parameters. Samtools (v-1.3.1)  software was used to sort the alignment bam files and the repeated reads were removed by a Picard tools MarkDuplicates (v-1.108) . Polymorphic positions were called using Genome Analysis Toolkit (GATK, v-3.5) . Subsequently, we used it to filter these variants with the following parameters: “QUAL > 30.0, QD > 2.0, FS < 60.0, MQ > 40.0, MQRankSum > -12.5, ReadPosRankSum > -8.0, SOR > 10.0, clusterWindowSize = 10, --clusterSize = 3”. Furthermore, the plink (v-1.9) software  with a parameter (--geno 0.1) was used to filter the above SNPs again to remove false positives.
Based on the above SNPs (autosomes), a phylogenetic tree was constructed using the SNPhylo (v-20,140,701) software . SnpEff software (v-4.3t)  was used to identify variants located at protein-coding positions and whether they were synonymous or non-synonymous. Then, population fixation statistics (FST), non-synonymous (πn) and synonymous (πs) nucleotide diversity was computed for each coding gene using VCFtools software (v-0.1.15)  with the parameter (--site-pi). To screen for inversion polymorphisms, we calculated FST values using VCFtools, with windows of 40 kb in length sliding across the genome. Linkage disequilibrium (LD) values among SNPs were inferred using LDBlockShow (-MAF 0.4 -Het 0.8) .
Availability of data and materials
The Muscovy duck genome sequences in this study have been deposited with links to BioProject accession number PRJNA669953 in the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA669953). The transcriptome data for genome annotation is available with links to BioProject accession number PRJNA678766 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA678766). The Whole Genome project has been deposited at GenBank under the accession JADZGK000000000 (https://www.ncbi.nlm.nih.gov/assembly/JADZGK000000000).
Benchmarking Universal Single-Copy Orthologs
Kyoto Encyclopedia of Genes and Genomes
Million years ago
Single nucleotide polymorphism
Non-synonymous substitution rates
Synonymous substitution rates
Effective population size
Non-synonymous nucleotide diversity
Synonymous nucleotide diversity
Genome Analysis Toolkit
- FST :
Harun MA, Veeneklaas RJ, Van Kampen M, Mabasso M. Breeding biology of Muscovy duck Cairina moschata in natural incubation: the effect of nesting behavior on hatchability. Poult Sci. 1998;77(9):1280–1286.
Food and Agriculture Organization. Domestic animal diversity information system. FAO, United Nations. 2007. http://www.fao.org/dadis/.
Chen Y. The Chinese waterfowl. Agricultural Press, Beijing (in Chinese). 1988.
HM F. Breed characteristics of Muscovy duck and its utilizations. Pigs Poult 2:55–57 (in Chinese). 1994.
Veeramani P, Prabakaran R, Sivaselvam SN, Sivakumar T, Karthickeyan SMK. Phylogenetic Analysis of Six Duck Populations. Indian J Anim Res. 2016;50(4):626–628.
Huang Y, Li Y, Burt DW, Chen H, Zhang Y, Qian W et al. The duck genome and transcriptome provide insight into an avian influenza virus reservoir species. Nat Genet. 2013;45(7):776–783.
Levy SE, Myers RM. Advancements in Next-Generation Sequencing. Annu Rev Genomics Hum Genet. 2016;17(1):95–115.
Jang-Il S, Jin-Wu N. The present and future of de novo whole-genome assembly. Brief Bioinform. 2018;19(1):23–40.
Weisenfeld, Neil, I., Kumar, Vijay, Shah et al. Direct determination of diploid genome sequences. Genome Res. 2018;27:757–767.
Armstrong EE, Taylor RW, Stefan P, Peter B, Esther V, Hillary M et al. Cost-effective assembly of the African wild dog (Lycaon pictus) genome using linked reads. Gigascience. 2018;8(2):1246752.
Almeida P, Sandkam BA, Morris J, Darolti I, Breden F, Mank JE. Divergence and Remarkable Diversity of the Y Chromosome in Guppies. Mol Biol Evol. 2021;38(2):619–633.
Tamazian G, Dobrynin P, Krasheninnikova K, Komissarov A, Koepfli K-P, O’Brien SJ. Chromosomer: a reference-based genome arrangement tool for producing draft chromosome sequences. Gigascience. 2016;5(1):38.
Lauren C, Zhang J, Vandervalk BP, Chu J, Jackman SD, Inanc B et al. ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers. Bmc Bioinformatics. 2018;19(1):234.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1(1):18.
Waterhouse RM, Mathieu S, A SF, Mosè M, Panagiotis I, Guennadi K et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Mol Biol Evol. 2018;35(3):543–548.
Sotero-Caio CG, Platt RN, II, Suh A, Ray DA. Evolution and Diversity of Transposable Elements in Vertebrate Genomes. Genome Biol Evol. 2017;9(1):161–177.
Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16(157):157.
Ye J. WEGO: a web tool for plotting GO annotations. Nucleic Acids Res. 2006;34(Web Server issue):W293–7.
Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.
Phuong DQ, Dung NT, Jørgensen PH, Handberg KJ, Vinh NT, Christensen JP. Susceptibility of Muscovy (Cairina Moschata) and mallard ducks (Anas Platyrhynchos) to experimental infections by different genotypes of H5N1 avian influenza viruses. Vet Microbiol. 2011;148(2):168–174.
Laurin M, Lapauze O, Marjanović D. What do ossification sequences tell us about the origin of extant amphibians? bioRxiv. 2019;352609.
Jetz W, Thomas GH, Joy JB, Hartmann K, Mooers AO. The global diversity of birds in space and time. Nature. 2012;491(7424):444–448.
Mitchell KJ, Wood JR, Scofield RP, Llamas B, Cooper A. Ancient mitochondrial genome reveals unsuspected taxonomic affinity of the extinct Chatham duck (Pachyanas chathamica) and resolves divergence times for New Zealand and sub-Antarctic brown teals. Mol Phylogenet Evol. 2014;70:420–428.
Guo Y-L. Gene family evolution in green plants with emphasis on the origination and evolution of Arabidopsis thaliana genes. Plant J. 2013;73(6):941–951.
Starr TK, Jameson SC, Hogquist KA. Positive and negative selection of T cells. Annu Rev Immunol. 2003;21:139–176.
Gu L, Wang F, Lin Z, Xu T, Lin D, Xing M et al. Genetic characteristics of Jiaji Duck by whole genome re-sequencing. PLOS ONE. 2020;15(2):e0228964.
Aderem A, Ulevitch RJ. Toll-like receptors in the induction of the innate immune response. Nature. 2000;406(6797):782–787.
Spangler JB, Moraga I, Mendoza JL, Garcia KC. Insights into cytokine-receptor interactions from cytokine engineering. Annu Rev Immunol. 2015;33:139–167.
Li N, Hong T, Li R, Guo M, Wang Y, Zhang J et al. Pathogenicity of duck plague and innate immune responses of the Cherry Valley ducks to duck plague virus. Sci Rep. 2016;6:32183–32183.
Cagle C, To TL, Nguyen T, Wasilenko J, Adams SC, Cardona CJ et al. Pekin and Muscovy ducks respond differently to vaccination with a H5N1 highly pathogenic avian influenza (HPAI) commercial inactivated vaccine. Vaccine. 2011;29(38):6549–6557.
Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009;6(9):677–681.
Goel M, Sun H, Jiao W-B, Schneeberger K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 2019;20(1):277.
Haldane J. A mathematical theory of natural selection and artificial selection. Part 1. Cambr Phil Soc Trans. 1924;23:19–41.
Charlesworth B, Coyne JA, Barton NH. The relative rates of evolution of sex chromosomes and autosomes. Am Nat. 1987;130(1):113–146.
Mank JE, Nam K, Ellegren H. Faster-Z Evolution Is Predominantly Due to Genetic Drift. Mol Biol Evol. 2010;27(3):661–670.
Wright AE, Harrison PW, Zimmer F, Montgomery SH, Pointer MA, Mank JE. Variation in promiscuity and sexual selection drives avian rate of Faster-Z evolution. Mol Ecol. 2015;24(6):1218–1235.
McVEAN GA, Charlesworth B. A population genetic model for the evolution of synonymous codon usage: patterns and predictions. Genet Res (Camb). 1999;74(2):145–158.
Lu J, Wu C-I. Weak selection revealed by the whole-genome comparison of the X chromosome and autosomes of human and chimpanzee. Proc Natl Acad Sci U S A. 2005;102(11):4063–4067.
Ellegren H. The different levels of genetic diversity in sex chromosomes and autosomes. Trends Genet. 2009;25(6):278–284.
Vicoso B, Haddrill P, Charlesworth B. A multispecies approach for comparing sequence evolution of X-linked and autosomal sites in Drosophila. Genet Res (Camb). 2008;90:421–431.
Caballero A. On the effective size of populations with separate sexes, with particular reference to sex-linked genes. Genetics. 1995;139(2):1007–1011.
Lamichhaney S, Fan G, Widemo F, Gunnarsson U, Thalmann DS, Hoeppner MP et al. Structural genomic changes underlie alternative reproductive strategies in the ruff (Philomachus pugnax). Nat Genet. 2016;48(1):84–88.
Stefansson H, Helgason A, Thorleifsson G, Steinthorsdottir V, Masson G, Barnard J et al. A common inversion under selection in Europeans. Nat Genet. 2005;37(2):129–137.
Bansal V, Bashir A, Bafna V. Evidence for large inversion polymorphisms in the human genome from HapMap data. Genome Res. 2007;17(2):219–230.
Wilson MA, Makova KD. Genomic analyses of sex chromosome evolution. Annu Rev Genomics Hum Genet. 2009;10:333–354.
Wright AE, Harrison PW, Montgomery SH, Pointer MA, Mank JE. Independent stratum formation on the avian sex chromosomes reveals inter-chromosomal gene conversion and predominance of purifying selection on the W chromosome. Evolution. 2014;68(11):3281–3295.
Kilkenny C, Browne WJ, Cuthill IC, Emerson M, Altman DG. Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biol. 2010;8(6):e1000412.
Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764.
Vurture GW, Sedlazeck FJ, Maria N, Underwood CJ, Han F, James G et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33(14):2202–2204.
L. WR, Chen Y, Vandervalk BP, Bahar B, Albert L, Jones SJM et al. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. Gigascience. 2015;4(1):35.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–3402.
Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–682.
Alexandre L, Vardges TH, Chernoff YO, Mark B. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 2005;20:6494–506.
Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34(suppl_2):W435-W439.
Slater GS, Birney E. Automated generation of heuristics for biological sequence comparison. Bmc Bioinformatics. 2005;6(1):31.
Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357–360.
Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–295.
Campbell MA, Haas BJ, Hamilton JP, Mount SM, Buell CR. Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics. 2006;7(1):327.
Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008;9(1):R7.
Tarailo-Graovac M, Chen N. Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. Curr Protoc Bioinformatics. 2009;25(1):4–10.
Pruitt KD, Tatiana T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005;33(Database issue):D501-D504.
Brigitte B, Amos B, Rolf A, Marie-Claude B, Anne E, Elisabeth G et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31(1):365–370.
Hunter S, Apweiler R, Attwood TK, Bairoch A, Yeats C. InterPro: The integrative protein signature database. Nucleic Acids Res. 2008;37(Database issue):D211-215.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–1797.
Talavera G, Castresana J. Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments. Syst Biol. 2007;56(4):564–577.
Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O. New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0. Syst Biol. 2010;59(3):307–321.
Yang Z. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol Biol Evol. 2007;24(8):1586–1591.
Rannala B, Yang Z. Inferring Speciation Times under an Episodic Molecular Clock. Syst Biol. 2007;56(3):453–466.
Yang, Z. Bayesian Estimation of Species Divergence Times Under a Molecular Clock Using Multiple Fossil Calibrations with Soft Bounds. Mol Biol Evol. 2005;23(1):212–226.
De Bie T, Cristianini N, Demuth JP, Hahn MW. CAFE: a computational tool for the study of gene family evolution. Bioinformatics. 2006;22(10):1269–1271.
Xie C, Mao X, Huang J, Ding Y, Wu J, Dong S et al. KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic Acids Res. 2011;39(suppl_2):W316-W322.
Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006;34(suppl_2):W609-W612.
Wang D, Zhang Y, Zhang Z, Zhu J, Yu J. KaKs_Calculator 2.0: A Toolkit Incorporating Gamma-Series Methods and Sliding Window Strategies. Genomics Proteomics Bioinformatics. 2010;8(1):77–80.
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–3100.
Cabanettes F, Klopp C. D-GENIES: dot plot large genomes in an interactive, efficient and simple way. PeerJ. 2018;6:e4958-e4958.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25(14):1754–1760.
Jiang F, Lin R, Xiao C, Xie T, Jiang Y, Chen J et al. Analysis of whole-genome re-sequencing data of ducks reveals a diverse demographic history and extensive gene flow between Southeast/South Asian and Chinese populations. Genet Sel Evol. 2021;53(1):35–35.
Zhou Z, Li M, Cheng H, Fan W, Yuan Z, Gao Q et al. An intercross population study reveals genes associated with body size and plumage color in ducks. Nat Commun. 2018;9(1):2648.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079.
Wysokar A, Tibbetts K, McCown M, Homer N, Fennell T. Picard: A set of tools for working with next generation sequencing data in BAM format. 2014.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–1303.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–575.
Lee T-H, Guo H, Wang X, Kim C, Paterson AH. SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics. 2014;15(1):162.
Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80–92.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–2158.
Dong S-S, He W-M, Ji J-J, Zhang C, Guo Y, Yang T-L. LDBlockShow: a fast and convenient tool for visualizing linkage disequilibrium and haplotype blocks based on variant call format files. Brief Bioinform. 2020;22(4);bbaa227.
We are grateful to Mr Jingjin Li and Mr Jiawu Feng for their help with the experimental materials.
This project was supported by the National Nature Science Foundation of China (Grant no. 31702109), the Fundamental Research Funds for the Central Universities of China (Grant no. 2662017JC027), and the technology innovation program for improving efficiency of animal husbandry and developing natural green raising method in Hubei Province (Grant no. 2018skjcx05).
Ethics approval and consent to participate
All experiments and methods were performed in accordance with the ARRIVE guidelines (Animal Research: Reporting of In Vivo Experiments). We declared that all methods were carried out in accordance with the relevant guidelines and regulations. Animal handling and experiments were approved by the Scientific Ethics Committee of the Huazhong Agricultural University (Permission number: HZAUCA-2016-058).
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Histogram of the 21-mer depth distribution of the sequencing reads of Muscovy duck plotted in GenomeScope. The kmer with coverage of 50X has the largest peak (excluding the kmer with extremely low coverage), which was used to estimate the genome size.
Descriptive metrics, estimated by Supernova, of the input sequence data for the de novo genome assembly.
Summary statistics of ab-initio, homology-based and RNA-seq based gene prediction results.
Integrity assessment of the coding sequence of Muscovy duck (annotated 15,580 genes).
Summary of identified repeat elements in the Muscovy duck genome.
Venn diagram of functional annotation for the 15,580 protein-coding genes predicted in the Muscovy duck genome. The numbers indicate the numbers of genes in the Muscovy duck genome identified in different databases. NR: the non-redundant protein sequences database in NCBI. NT: the nucleotide database in NCBI.
OrthoFinder gene counts per orthogroup.
762 Muscovy duck-specific genes.
GO term annotation of Muscovy duck species-specific genes.
Functional annotation of Muscovy duck species-specific genes.
The list of 134 rapidly expanded gene families and 529 contracted gene families.
Functional annotation of 39 genes in Muscovy duck.
104 positively selected genes (PSGs) in Muscovy duck.
Functional annotation of positively selected genes in the Muscovy duck genome.
Genes involved in the innate immune response pathway.
Breakdancer software was used to verify structural variations.
SyRI software was used to verify inversion.
Structural variations detected using SyRI software. INV refers to inversions, TRANS refers to transpositions, SNP refers to single-nucleotide polymorphism, DEL refers to deletions. The shaded area is a 95% confidence interval.
Phylogenetic tree was constructed using whole genome SNP data. The red and blue are Muscovy and mallard populations, respectively.
The assembly accession of the seven species genome in the NCBI public database.
About this article
Cite this article
Jiang, F., Jiang, Y., Wang, W. et al. A chromosome-level genome assembly of Cairina moschata and comparative genomic analyses. BMC Genomics 22, 581 (2021). https://doi.org/10.1186/s12864-021-07897-4