Investigation of SNP markers for the melatonin production trait in the Hu sheep with bulked segregant analysis

Background As an important reproductive hormone, melatonin plays an important role in regulating the reproductive activities of sheep and other mammals. Hu sheep is a breed favoring for meat, with prolific traits. In order to explore the relationship between melatonin and reproductive function of Hu sheep, 7,694,759 SNPs were screened out through the whole genome sequencing analysis from high and low melatonin production Hu sheep. Results A total of 68,673 SNPs, involving in 1126 genes, were identified by ED association analysis. Correlation analysis of SNPs of AANAT/ASMT gene and MTNR1A/MTNR1B gene were carried out. The melatonin level of CG genotype 7,981,372 of AANAT, GA genotype 7,981,866 of ASMT and GG genotype 17,355,171 of MTNR1A were higher than the average melatonin level of 1.64 ng/mL. High melatonin Hu sheep appear to have better multiple reproductive performance. Conclusions By using different methods, three SNPs which are associated with high melatonin production trait have been identified in Hu sheep. These 3 SNPs are located in melatonin synthetase AANAT/ASMT and receptor MTNR1A, respectively. Considering the positive association between melatonin production and reproductive performance in ruminants, these three SNPs can be served as the potential molecular markers for breading Hu sheep with the desirable reproductive traits. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-023-09494-z.


Introduction
Small ruminants, particularly sheep significantly impact the daily life of a considerable part of human population socio-economically [1].These small ruminants provide meat, wool, skin and other products for our consuming.To select the breed with high quality meat [2], wool fineness [3], large number of offspring [4] and carcass weight [5] becomes the urgent agenda for researchers.Hu sheep (Ovis aries) is a unique sheep breed in China.This breed has the advantages of early sexual maturity, strong fertility with large litter size and easy adapts to coarse feeding materials [6].By analyzing its reproductive traits with genome-wide association study it was found that the polymorphisms of FecB, GDF9 and TGFBR2 were significantly associated with the fertility trait of Hu sheep [7] and these genes might be the potential marker sites for breeding selection.As the advancement of molecular breading technology, the combined trials with emphasis on administration and genetic progress to improve the animal breading outputs have shown their decisive significance [8].Specific to sheep, to improve productivity and reproductive performance of ewes with economical and biological efficiency has become the goal of sheep production enterprises [9].
Melatonin is an important reproductive hormone which is synthesized in many tissues and organs including the pineal gland [10] and reproductive system [11].Melatonin modulates the release of follicle-stimulating hormone (FSH) and luteinizing hormone (LH) through the hypothalamic-pituitary-gonadal axis thus regulating the seasonal reproductive activities of mammals [12,13].It also improves oocyte quality through its antioxidant function [14].It was found that before the MTNR1A expression peak prior to ovulation, melatonin application could improve luteal function, increase progesterone secretion level, pregnancy rate and litter size of animals [15].It appears that melatonin plays an important role in animal reproduction.
Single nucleotide polymorphism (SNP) mainly refers to DNA polymorphism at the genome level, which is characterized by large number, wide distribution and genetic stability [16].Currently, SNPs are widely used in genome-wide association, germplasm resource and genome selection signal analyses.Gene chips developed based on SNP typing technology have been mainly used for genetic resource mining and breeding selection [17].Bulked segregant analysis (BSA) method is also a method based on SNP analysis, which can rapidly locate candidate genes for extreme traits [18].Bulked segregant analysis (BSA) has been widely used in molecular marker screening and Quantitative trait locus (QTL) localization analysis of target traits in plants and animals [19,20].
In this study, BSA method will be used to analyze the genotypes of individual Hu sheep to match up their endogenous melatonin levels.The focus will be given to the SNP polymorphisms of AANAT/ASMT as well as MTNR1A/MTNR1B.Our purpose is to identify whether SNPs among these genes are associated with melatonin production in Hu sheep.

Sample collection
In the current study, Hu sheep were selected to investigate whether their excellent reproductive performance was associated with the melatonin production traits by the method of whole-genome SNP level.In order to analyze the melatonin production trait of Hu sheep, the whole genome SNP of Hu sheep who exhibited either high or low melatonin production trait were analyzed.The correlation between the SNPs of enzyme genes for melatonin synthesis, AANAT/ASMT, and genes of receptors, MTNR1A/MTNR1B, was also analyzed.If some of these SNPs of melatonin trait are strongly associated with reproductive performance of Hu sheep these SNPs may serve as the biomarkers for guiding the breeding selection of a new strain of Hu sheep.A total of 195 Hu sheep (Information as shown in Table S1) were selected from Inner Mongolia Golden Grassland Ecological Technology Group Co., LTD.For blood collections, the 5 mL of blood was collected from neck veins at eight o 'clock, centrifuged 3,000 rpm for 8 min, serum was taken and stored at -20℃ for future use.

Melatonin detection
Melatonin standard curve was prepared by serial dilutions of melatonin from 100, 50, 20, 10 and 5 ng/L respectively.Peak area was detected and standard concentration curve was drawn.The 200μL of serum was added with 800μL methanol, swirled for 30 min and centrifuged at 14,000 rpm for 20 min at 4 ℃.The supernatant was filtered by the 0.22 μm filter and cooled down to -20 ℃ then, was detected by Liquid chromatography tandem mass spectrometry (LC-MS/MS) (Agilent1290-G6470, Santa Clara, CA, USA) in the Central Laboratory of Institute of Animal Science, Beijing, Chinese Academy of Agricultural Sciences.The Symmetry C18 column was used to separate the melatonin from the samples.

Quality control analysis of Illumina HiSeq DNA resequencing data
Blood DNA was extracted following the instructions of the Kit (NEBNext ® Ultra ™ DNA Library Prep Kit for Illumina ® ) for library construction.A sample of 1 μg genomic DNA was randomly broken into fragments < 500 bp by ultrasonic treatment (Covaris S220).End repair was performed on the fragment with End Prep Enzyme Mix, appropriate ends were added to both ends with T-A ligase, then 5 'phosphorylation and da-tailed were performed.Adaptor-ligated DNA was size-screened using the AxyPrep Mag PCR cleanup (Axygen) to obtain fragments about 410 bp.Each sample was amplified by PCR using P5 and P7 primers for 8 cycles, respectively.Both P5 and P7 primers carry bridge PCR sequences that can be annealed by flow cytometers.The P7 primer carries a six-base index for multiplexing.PCR products were cleaned using AxyPrep Mag PCR (Axygen) and validated using Agilent 2100 Bioanalyzer (Agilent Technologies, PaloAlto, CA, USA).Quantification was performed using Qubit 2.0 fluorometer (Invitrogen, Carlsbad, CA, USA).According to the kit instructions (Illumina, San Diego, CA, USA), libraries with different indexes were multiplexed and loaded onto the Illumina HiSeq instrument.Sequencing was performed using the 2X150 paired terminal (PE) configuration.Using HiSeq control software (HCS) + OLB + gappeline-1.6(Illumina) for image analysis and basic invocation on HiSeq instrument.
The bcf2fastq (version 2.17.1.14)software was used for image Base calling of original image Data.Pass Filter Data (PF) in FASTQ (fq) format was obtained.Clean Data was obtained by using cutadapt (version 1.9.1) software to remove junctions and low-quality sequences from Pass Filter Data.Clean data was compared to reference Genome sequences using Dragen Genome Pipline.For the resulting BAM (binary SAM file) file, Picard and GATK were used to remove duplicate data of PCR and collect the local realignment, base quality recalibration and the corrected genome alignment results.The coverage and coverage depth of the genome were calculated according to the comparison results.When the sample genotype mutation frequency at a SNP site is ≥ 0.8 or ≤ 0.2, the site is a pure sum.If the mutation frequency is between 0.2 and 0.8 and each allele of the heterozygous genotype has at least 4 reads, the SNPs is considered as the heterozygous mutation; otherwise, the site is treated as deletion and listed as NA in the table.Finally, all SNPs of genotypes in the tested samples were summarized in the table for comparison.

SNP detection and annotation
Euclidean Distance (ED) algorithm was used to search for significantly different markers between pools.In the process of ED analysis, MMAPPR software package will used to filter SNP loci of non-secondary gene and SNP loci with sequencing depth less than 10X of wild type or mutant mixed pool.If the frequency of A, T, C and G in the wild type pool is greater than or equal to 95%, it will be filtered.The calculation formula of ED method is as follows: Note: Amut is the frequency of base A in mutant mixing pool, Awt is the frequency of base A in wild type mixing pool; Cmut is the frequency of C base in mutant mixing pool and Cwt is the frequency of C base in wild type mixing pool.Gmut is the frequency of G base in mutant pool and Gwt is the frequency of G base in wild type pool.Tmut is the frequency of T base in mutant mixing pool and Twt is the frequency of T base in wild type mixing pool.
In this project, the 4th power of the original ED served as the correlation value to eliminate background noise, and this value was used for the local polynomial regression fitting to obtain the fitting curve.The value of Median + 3SD of all locus fitting values was used as the association threshold for analysis.This association threshold was used to determine the correlation area.Thereafter, sites with mutation frequency > 0.75 and Euclidean distance > 0.5 were selected as candidate sites from the correlation area and the annotation results of ANNOVAR were also extracted from these candidate sites.

GO enrichment analysis
Based on the GO each term mapping database (http:// www.geneo ntolo gy.org/).for candidate genes the numbers of genes in each term were calculated to obtain the genes with a certain GO function and their statistical analysis.Then the GO entries with significantly enriched in the whole genome background were identified by hypergeometric test.

KEGG enrichment analysis
KEGG enrichment analysis was based on hypergeometric distribution principle [21][22][23].Pathway with Qvalue ≤ 0.05 was defined as significantly enriched in candidate genes after multiple test correction.The formula is as follows: Note: N is the number of genes with Pathway annotation in the whole genome; n is the number of candidate genes screened according to the selection signal index.M is the number of genes annotated for a specific Pathway among all genes; m is the number of candidate genes annotated for a specific Pathway.

SNP analysis of enzyme genes for melatonin synthesis and genes of melatonin receptors
PCR was carried out for SNP analysis.The related primers were shown in

Quality control analysis of high-melatonin Hu-sheep sequencing data
Among 195 Hu sheep tested, 100 (51.3%) had melatonin content between 0-0.49ng/mL, 42 (21.5%)had melatonin content between 0.5-1.0ng/mL and 53 (27.2%) had melatonin content above 1 ng/mL.Three individuals with highest and three individuals with lowest melatonin levels were selected for whole genome sequencing analysis (Table 2).The original Data (Pass Filter Data, PF) were shown in Table 3.The average base quality value of sequencing data Reads was concentrated in 36.After removing low-quality sequence results from the original data the Clean Data were obtained as shown in Table 4.The clean data after QC reached more than 99% in PF data, as shown in Table 5.The alignment ratio between clean data and reference genome sequence was 99.11% on average and the unique reads number of reference genome was 81.12%, as shown in Table 6.The average coverage rate reached 96.05% and the average coverage depth reached 28.32%.

Genome-wide SNP detection and annotation
Based on the comparison results, SNPs were detected using Dragen Genome Pipline.According to the functional changes caused by mutation sites, the proportion of non-synonymous mutations of SNP is higher than that of synonymous mutations (Table 7).SNP mutations mainly focus on C:G > T:A and A:T > C:G.The 7,694,759 SNPs were obtained by screening and filtering the SNPs in the samples.Functional annotation was carried out for the detected variation sites.The variation sites mainly occurred in the intronic and splicing regions of the      genome, with a small proportion in the UTR5、UTR3 and UTR5;UTR regions, as shown in Fig. 1.

Analysis of ED method correlation results
Based on ED association method, a total of 3,934,196 SNPs were selected from 7,694,759 SNPs (Table 8).
The distribution of these SNPs on each chromosome is shown in Fig. 2. Median + 3SD of all sites fitting values was taken as the association threshold for analysis, and 0.3443 was calculated.According to the correlation threshold, a total of 1111 regions with a total length of 34,805,411 bp were obtained.A total of 68,673 SNPs were selected from the associated regions with mutation frequency > 0.75 and Euclidene distance > 0.5.For the selected SNPs, the annotation results of ANNO-VAR were extracted, involving 1126 genes.

GO enrichment analysis
The GO functional annotation results of candidate genes were shown in Fig. 3.A total of 25 genes were enriched in terms of biological process, and 572 genes were enriched in cellular process.In terms of 8 terms enriched to Cellular Component, organelle was enriched to 341 genes at most.In terms of Molecular Function, a total of 10 terms are enriched, and the maximum number of binging genes is 403.The GO  function of candidate genes was significantly enriched and hypergeometric test was used to find out the significantly enriched GO entries.The pathways with the most differentially enriched genes in cell components are multivesicular body, internal vesicle and mitotic spindle assembly checkpoint MAD1-MAD2 complex.The path in which GO term accounts for the largest percentage of all differences is GO:0043005 neuron projection (Fig. 4).The TOP 20 pathways in molecular function have only one gene with a Rich Factor of 1, among which the pathway with GO term accounting for the largest percentage of all differences is GO:0016887ATPase (Fig. 5).The pathways with the most differentially enriched genes in biological processes were RNA repair and subpallium to the cortex and the pathway with GO term accounting for the most percentage of all differences was GO:0048666 neoron development (Fig. 6).

KEGG enrichment analysis
In vivo, different genes coordinate to perform biological functions through the same action pathway.KEGG enrichment analysis is helpful to interpret gene function.
Based on the hypergeometric distribution principle, the whole genome genes were taken as background genes, and the candidate genes were analyzed by KO enrichment.As shown in Fig. 7, Among them, Vascular smooth muscle contraction and Retrograde endocannabinoid signaling pathways had the largest number of enriched genes.The Pathways that accounted for the largest The ordinate is GO term and the abscissa is the percentage of all differences accounted for by this GO term.From the largest to the smallest, the top 20 are selected.The darker the color, the smaller the Q value.
The number of GO term and Q value are labeled above.B Bubble chart.Note: The ordinate is GO term, and the abscissa is enrichment factor (the number of differences in the GO term divided by all the numbers).From the largest to the smallest, the top 20 are selected.Size of bubble area: the number of genes belonging to this GO in the target gene set; Bubble color: enrichment significance, that is, the size of Q value; The size is the quantity, and the redder the color, the smaller the Q percentage of all differences were Alzheimer disease and pathways in cancer.

Correlation analysis of SNPs in melatonin synthetase and receptor genes
SNPs of enzyme genes for melatonin synthesis genes of melatonin receptors in high and low melatonin production Hu sheep were counted, among which 14 nonsynonymous sites in the exon region had amino acid mutations, as shown in Table 9. R software was used to analyze the correlation between SNP and melatonin concentration in 189 Hu sheep, and it was found that there were three sites with significant correlation with melatonin production (P < 0.05) (Table 10), which were ranked 7,981,372, 7,981,866 and 17,355,171, respectively.The levels of melatonin in CG genotype at 7,981,372 site, GA genotype at 7,981,866 site and GG genotype at 17,355,171 site were several folds higher than the average levels in the population (1.64 ng/mL) (Table 11).As shown in Fig. 8, the dominant genotype of SNP 7981372 was GG with a correlation of 89.4% and genotype CG with a correlation of 10.6%.The dominant genotype of SNP 7981866 was GG, and its correlation was 86.8% and 13.4% for GA.The dominant genotype of SNP 17355,171 was GG, and the correlation was 84.1% and 15.6% for GT.

Discussion
Due to the nutritional value and the rich taste, the demanding of mutton product has excessed its production globally.To increase mutton production and quality has become an urgent agenda not only for sheep husbandry but also for researchers.It appears that the traditional sheep breeding methods cannot satisfy consumers' demanding, therefore, the animal breeding techniques have shifted from traditional cross breeding to genome selection which can directly target the traits to be selected [24].In the current study, we analyzed SNPs related to the production traits of melatonin in Hu sheep.The rationale is that high melatonin content in sheep is often related to high breading ability of this species.It has been reported that melatonin regulates animal reproductive activity and improve oocyte [25] and embryo developmental quality [26].
The transgenic AANAT and ASMT overexpressed sheep has increased the number of their offspring [27,28].Melatonin is secreted mainly by mitochondria, virtually, all cells containing mitochondria have capacity to  synthesize melatonin [29,30].Melatonin has a wide range of biological functions.It not only regulates animal reproductive activities, but is also a strong antioxidant, which can delay reproductive aging and have anti-tumor activity [31,32].Jiang et al. had conducted whole gene sequencing correlation analysis on the body size of Hu sheep and found that 5 SNPs were correlated with body height traits and 4 SNPs were correlated with chest circumference traits [33].In this study, first, we evaluated the serum melatonin levels in 195 Hu sheep and to identify the distribution of high and low melatonin production individuals, then, the whole genome was sequenced to find the SNPs which are associated with melatonin production traits.Since melatonin is secreted rhythmically, and the levels of melatonin in individuals vary  [34].To avoid influences of these factors the environmental information of these sheep exposed have been recorded in detail in advance of a month before the study.It was found that among 195 Hu sheep, 53 individuals (27.2%) were distributed in the high melatonin population with more than 1 ng/mL melatonin and 100 (51.3%) were in low melatonin population with 0-0.5 ng/mL melatonin.The 7,694,759 SNPs were obtained from 3 individuals with highest and 3 individuals with lowest melatonin productions.SNPs mainly occur in the intronic and splicing regions of the genome.A total of 68,673 SNPs were selected by ED association analysis and 1126 genes were involved in these SNPs.GO and KEGG enrichment analyses showed that the SNPs associated with melatonin traits in Hu sheep at the genome-wide level were enriched into multiple genes in multiple pathways.
AANAT and ASMT are two key genes in melatonin synthesis and MTNR1A and MTNR1B are two key genes of melatonin receptors in mediation of melatonin biological functions [2].Therefore, the analysis was focused on SNPs in the exon region of these four genes and their mutations may lead to amino acid changes and affect melatonin production.Yao et al. have found 3 SNPs of AANAT and 4 SNPs of ASMT in Holstein dairy cows that can significantly affect the production of melatonin [35].Here, a total of 14 nonsynonymous mutation sites were identified in these four genes.Among them, three sites were found to be significantly correlated with melatonin production in Hu sheep, which were ranked 7,981,372, 7,981,866 and 17,355,171, respectively.The melatonin levels of CG genotype 7,981,372, GA genotype 7,981,866 and GG genotype 17,355,171 were higher (around 1.64 ng/ mL) than the average melatonin concentration of (0.5-1.0 ng/mL) in this population.These results indicate that mutations in the three SNPs significantly affect the expression level of melatonin.Melatonin can be considered as an important reproductive regulator.Via hypothalamus and pituitary axis, it can indirectly regulate the seasonal reproductive activities of seasonal breeders [36].The reproductive system locally produced melatonin including ovary, oocytes, embryo and placenta can directly facilitate the embryo development and increase the numbers of the offspring in many species [37].The 3 SNPs which occur in the AANAT/ASMT and MTNR1A may improve the enzyme activity to directly enhance melatonin production or promote receptor function which can positively feed-back melatonin production.Since melatonin level in animals is associated with their reproductive efficiency, these three SNPs can serve as the molecular markers to bread Hu sheep with improved reproductive activity.
According to the comparison results of Clean Reads in reference genome, the Haplotype Caller module of software GATK (version 4.0.4.0) was used for mutation detection.Filtration was done using the Variant Filtration module, Filter parameters for:-filter Expression "QD < 2.0 | | MQ < 40.0 | | FS > 60.0 | | SOR > 3.0 | | MQ Rank Sum < 12.5 | | Read Pos Rank Sum < -8.0".ANNOVAR software was used for the functional annotation of the detected gene variants.When the coverage depth of the sample at a SNP site is < 5X, the site is considered as missing and listed as NA in the table.

( 1 )
Sample: Name of sequencing sample (2) length: reads average length (3) Reads: Number of sequencing reads (4) Bases: Total number of bases (5) Q20 (%): The percentage of bases with Phred value greater than 20 in the total base was calculated respectively (6) Q30 (%): The percentage of bases with Phred value greater than 30 in the total base was calculated respectively (7) GC (%): Calculate the total number of bases G and C as a percentage of the total number of bases (8) N(ppm): The number of N bases per million that cannot be determined by sequencing

Fig. 2
Fig. 2 Distribution of ED association values across the genome.The x-coordinate is the name of chromosome, the dot plot represents the ED value of each SNP site, and the line plot represents the ED value after fitting.The higher the ED value, the better the correlation effect of the point, and the blue shadow represents the interval located

Table 1
Data analysisR software was used to analyze the association between SNPs and phenotypes.The analytic model was Yij = μ + Gi + Pj + eij, Yij was the observed value of character; μ is the total mean value of character.Gi is genotype effect; Pj is the age effect; eij is random error.

Table 1
Primer sequence table

Table 2
Sample of whole genome sequencing analysis

Table 3
PF data statistics

Table 4
Clean data Data statistics (6)Q30 (%): The percentage of bases with Phred value greater than 30 in the total base was calculated respectively (7) GC (%): Calculate the total number of bases G and C as a percentage of the total number of bases (8) N(ppm): The number of N bases per million that cannot be determined by sequencing

Table 5
Clean data ratio statistics

Table 6
Sequencing comparison statistics (1) Sample: Name of sequencing sample (2) Total Reads: The number of all sequencing reads (3) Mapped Reads: Number of reads with reference genomes for all comparisons (4) Mapped Reads Ratio (%): The ratio of reads in the reference genome was compared (5) Uniq Reads: The ratio of reads in the reference genome was compared (6) Uniq Reads Ratio (%): uniq reads as a percentage of all mapped reads (5) Mean depth: The average sequencing depth of the covered bases

Table 7
Statistics of SNP types in the whole genome (1) Sample: Name of sequencing sample (2) SNP: Total number of SNPs (3) Non-Synonymous: Total number of non-synonymous mutations (4) Synonymous: Total number of synonymous mutations (5) Stop gain: The number of terminating codons that are mutated (the mutation gives the gene a terminating codon) (6) Stop lost: The number of stop codon deletions (mutations that cause genes to lose stop codons)

Table 8
SNP site filtering statistics Total_number: The total number of SNPs Biallelic: The number of SNPs after filtering non-secondary genes Frequency: The number of SNPs after the frequency of filter bases is greater than or equal to 95% NA: The number of SNPs after the sequencing depth of wild type or mutant pool was lower than 10X

Table 9
Summary of non-synonymous mutations

Table 10
Significance test for mutation sites p < 0.05 indicates a significant correlation at the 0.05 level and p < 0.01 indicates a significant correlation at the 0.01 level

Table 11
Melatonin concentrations at significant loci genotypes Fig. 8 Genotype of significant locus.A 7981372 genotype, B 7981866 genotype, C 17355171 genotype in seasons, light exposure and age