Genome-wide selection signatures detection in Shanghai Holstein cattle population identified genes related to adaption, health and reproduction traits

Background Over several decades, a wide range of natural and artificial selection events in response to subtropical environments, intensive pasture and intensive feedlot systems have greatly changed the customary behaviour, appearance, and important economic traits of Shanghai Holstein cattle. In particular, the longevity of the Shanghai Holstein cattle population is generally short, approximately the 2nd to 3rd lactation. In this study, two complementary approaches, integrated haplotype score (iHS) and runs of homozygosity (ROH), were applied for the detection of selection signatures within the genome using genotyping by genome-reduced sequence data from 1092 cows. Results In total, 101 significant iHS genomic regions containing selection signatures encompassing a total of 256 candidate genes were detected. There were 27 significant |iHS| genomic regions with a mean |iHS| score > 2. The average number of ROH per individual was 42.15 ± 25.47, with an average size of 2.95 Mb. The length of 78 % of the detected ROH was within the range of 1–2 MB and 2–4 MB, and 99 % were shorter than 8 Mb. A total of 168 genes were detected in 18 ROH islands (top 1 %) across 16 autosomes, in which each SNP showed a percentage of occurrence > 30 %. There were 160 and 167 genes associated with the 52 candidate regions within health-related QTL intervals and 59 candidate regions within reproduction-related QTL intervals, respectively. Annotation of the regions harbouring clustered |iHS| signals and candidate regions for ROH revealed a panel of interesting candidate genes associated with adaptation and economic traits, such as IL22RA1, CALHM3, ITGA9, NDUFB3, RGS3, SOD2, SNRPA1, ST3GAL4, ALAD, EXOSC10, and MASP2. In a further step, a total of 1472 SNPs in 256 genes were matched with 352 cis-eQTLs in 21 tissues and 27 trans-eQTLs in 6 tissues. For SNPs located in candidate regions for ROH, a total of 108 cis-eQTLs in 13 tissues and 4 trans-eQTLs were found for 1092 SNPs. Eighty-one eGenes were significantly expressed in at least one tissue relevant to a trait (P value < 0.05) and matched the 256 genes detected by iHS. For the 168 significant genes detected by ROH, 47 gene-tissue pairs were significantly associated with at least one of the 37 traits. Conclusions We provide a comprehensive overview of selection signatures in Shanghai Holstein cattle genomes by combining iHS and ROH. Our study provides a list of genes associated with immunity, reproduction and adaptation. For functional annotation, the cGTEx resource was used to interpret SNP-trait associations. The results may facilitate the identification of genes relevant to important economic traits and can help us better understand the biological processes and mechanisms affected by strong ongoing natural or artificial selection in livestock populations. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-08042-x.


Conclusions:
We provide a comprehensive overview of selection signatures in Shanghai Holstein cattle genomes by combining iHS and ROH. Our study provides a list of genes associated with immunity, reproduction and adaptation. For functional annotation, the cGTEx resource was used to interpret SNP-trait associations. The results may facilitate the identification of genes relevant to important economic traits and can help us better understand the biological processes and mechanisms affected by strong ongoing natural or artificial selection in livestock populations.

Background
There are more than 85,000 cows and nearly 200 proven bulls in the Shanghai Holstein cattle population, which not only guarantees the basic needs of milk in Shanghai but also provides more than 3 million doses of sperm per year nationwide. Shanghai Province has long taken the leading position and become one of the largest breeding centres in China.
Holstein cattle have been extensively imported to China, mostly from Canada, the USA, France and northern Europe, since the 1940 s for use in crossbreeding aimed at improving the productivity of Chinese native cattle by combining the environmental adaptation features of the Chinese cattle with the high milk yield potential of foreign cattle [1]. The Holstein breed is categorized as a heat-sensitive cattle breed, and the most appropriate temperature for production is 5~20 degrees Celsius [2]. However, the Shanghai Holstein cattle population is exposed to long, hot and humid summers with abundant rain and temperatures of over 35 degrees Celsius. It is worth noting that Mao et al. (2015) found that the longevity of the Shanghai Holstein cattle population is generally short, approximately the 2nd to 3rd lactation [3], whereas the maximum milk yield is obtained in the fourth lactation in most Holstein populations [4].
Over several decades, a wide range of natural selection and artificial selection events in response to subtropical environments and intensive pasture and feedlot systems have greatly changed the customary behaviour, appearance, and important economic traits of Shanghai Holstein cattle. According to the theory of population genetics, the functional genes subject to selection will reveal characteristic patterns due to selection preference, and these patterns are known as "selective signatures" [5].
Recently, with the development and the prevalent application of high-throughput and cost-effective genotyping techniques, the power of detecting selection signatures at the molecular genetic level has experienced a major breakthrough. Compared to high-throughput single nucleotide polymorphism (SNP) chip technology, which has made it possible to uncover traces of positive selection and detect candidate genes based on linkage disequilibrium, next-generation sequencing (NGS) has been widely adopted by several platforms and has decreased the cost of DNA sequencing, which allows the systematic identification of selection signatures at a higher effective resolution and sensitivity [5]. Furthermore, studies based on sequence data do not suffer from SNP ascertainment bias, as do studies that are performed using commercially available SNP assays [6,7].
To date, various analytical methods have been proposed to detect different kinds of selection signatures. A considerable amount of research has been conducted to detect selection signatures in selected populations during the last decade by using various statistical approaches, including Tajima's D [8], extended haplotype homozygosity (EHH) statistic [9], integrated Haplotype Score (iHS) [10], crosspopulation Extended Haplotype Homozygosity (XPEHH) and cross-population composite likelihood ratio (XP-CLR) [11]. EHH is a popular approach that is known for reliably detecting ongoing selection and is a long haplotype-based test. In populations under positive selection, the mutation frequency will rapidly increase. Therefore, regions with extremely strong and long-range LD with high haplotype allele frequencies can be used to detect recent selection by EHH [12]. The iHS approach was developed by Voight et al. (2006); it is based on EHH and can overcome the influence of heterogeneous recombination rates across the genome. iHS can distinguish the ancestral and derived alleles of a polymorphic site, as a much larger EHH score for the derived allele than for the ancestral allele is expected to represent positive selection. iHS is usually sensitive for detecting positive selection signatures for intermediate frequency variants [13].
Signatures of selection could be observed in genomewide ROH scans in animals. It was suggested that genetic diversity is reduced under selective pressure; thus, ancestral genetic variations are often transformed into long stretches of consecutive homozygous genotypes across the genome [14]. The size and frequency of ROH vary according to population diversity and selection pressure. Analyses of ROH allow the identification of genomic regions with possible selection signatures for breed [15]. There are many studies with ROH for the identification of selection signatures in cattle [16,17], sheep [18,19], horses [20,21] and pigs [22,23]. To improve the power and spatial resolution for identifying selection signatures, combining multiple methods into composite tests is useful for mapping the comprehensive footprint of selection across the genome.
Genome-scale bioinformatics annotations are available from a number of sources, including the cattle Genotype-Tissue Expression (cGTEx, http://cgtex.roslin. ed.ac.uk/) atlas, which represents the most comprehensive reference resource of the cattle transcriptome to date and is based on 11,642 RNA sequences from publicly available datasets representing over 100 cattle tissues. This database provides a detailed characterization of the genetic control of gene regulation across 24 major tissues and provides novel biological insights into the molecular regulatory mechanisms underpinning agronomic traits in cattle by conducting a transcriptomewide association study (TWAS) linking gene expression in different tissues with 43 economically important traits [24]. Mapping expression quantitative trait loci (eQTLs) has provided great supporting evidence for important insights into the regulatory pathways involved in disease, which contain genetic variants associated with gene expression. We defined genes whose expression levels were significantly associated with SNPs as eGenes. eGenes in a disease-relevant tissue provide insight into gene regulation and are genes whose expression levels are associated with genetic variants. cGTEx data were used to estimate the variation in gene expression regulation between tissues by comparison with our data. If the selective signatures or the candidate genes identified are also eQTLs or eGenes, then there is strong evidence to further study these variants or genes.
However, little information is available from the scientific literature about the selection signatures and population structure of the Shanghai Holstein population. Therefore, the aims of this study were to 1, investigate the recent selective signatures based on next-generation sequencing data in the Shanghai Holstein population in response to local climatic conditions and artificial selection for economical purposes, and 2, explore the genes related to biological processes and traits of interest to identify important functional candidate genes undergoing positive selection in Shanghai Holstein cattle.

Results
After filtration, we utilized 164,312 high-quality SNPs for analysis of the genomic selection signature in our study. These SNPs covered 2.51 Gb of the cattle genome (UMD3.1), with an average distance of 15.56 kb between adjacent SNPs. The average distance between the adjacent SNPs across autosomes ranged from 7.43 kb on BTA25 to 22.49 kb on BTA6, and the standard deviation of the adjacent SNPs ranged from 18.86 on BTA25 to 46.70 on BTA1, which is presented in Additional file 1: Table S1. An overview of the relationships among these animals from 24 farms is presented in Additional file 1: Figure S1. We further estimated the inbreeding coefficient based on the genomic information for all Shanghai Holstein cattle. The average inbreeding coefficients estimated using GCTA (-ibc command) are shown in Additional file 1: Table S2. The inbreeding coefficient reflects the deviations in the observed inbreeding from the expected values in the current population, and the average inbreeding coefficient of this population was 0.36.

Detection of selection signature using iHS approach
The iHS test was used to detect strong footprints of recent selection within the Shanghai Holstein cattle population. We obtained a total of 62,140 SNPs with estimated |iHS| scores (Additional file 2). The chromosome-wide scans of iHS for the studied population are shown in Fig. 1A. The plots show clear evidence of selective forces in different regions of the genome. A total of 9304 regions were detected based on the single site |iHS| score. Based on the top 1 % of values, 101 candidate regions were identified. In total, 256 candidate genes overlapped with these significant iHS genomic regions according to the UMD 3.1 assembly (Additional file 3). Among the 101 candidate regions, there were 27 significant |iHS| genomic regions with a mean |iHS| score > 2, and 84 genes were located in these regions including ITGA9, MASP2, SLC39A11, SSTR2, TGFBR1,  TSPEAR, LRRC3, SNRPA1, PFKL, RGS3, SRSF10,  SUPT3H, FUCA1, CSPP1, IL22RA1, and AOX2. (Table 1). The region located on BTA6 (BTA6:78,000,000-78,500,000) had the highest |iHS| score (2.942), and no candidate gene overlapped with this region.

Genomic distribution of runs of homozygosity
In total, 44,509 ROH among 1092 samples were identified. The average number of ROH per individual was 42.15 ± 25.47, ranging from 1 to 121 ROH. The genomic distribution of ROH was nonuniform both in length and position across chromosomes. The average length of ROH was 2.95 Mb across all autosomes, but the total length of ROH per individual varied considerably from 1.51 Mb to 430 Mb. The longest segment was 18.82 Mb in length (381 SNPs) and was found on BTA6. Figure 2 shows the percentage of bovine chromosomes covered by ROH and the highest coverage by ROH was observed on BTA11 (7.8 % of chromosomal length), whereas the lowest was on BTA9 (2.24 % of chromosomal length). The number of ROH per chromosome was greatest for BTA11 (2976 segments) and lowest for BTA9 (800 segments) (Fig. 2).
In this study, we classified ROH into four different categories according to their physical length: 1 to < 2 Mb, 2 to < 4 Mb, 4 to < 8 Mb and > 8 Mb. Descriptive statistics of each length category are given in Table 2. Our results show that 99 % of the ROH were shorter than 8 Mb. The total length of ROH for the Shanghai Holstein cattle population was composed mostly of shorter segments (1-2 Mb and 2-4 Mb). These segments accounted for approximately 78 % of all ROH detected, which contributed to 0.79 % and 2.16 % of the cumulative length of ROH. ROH (1-2 Mb) were more abundant throughout the genome than ROH (4-8 Mb); however, the proportion of the genome covered by ROH (1-2 Mb) was much smaller than that covered by ROH (4)(5)(6)(7)(8).
To identify the genomic regions that were most commonly associated with ROH in all individuals, the top 1 % of SNPs observed in the ROH was selected, and the adjacent SNPs over this threshold were merged into genomic regions corresponding to ROH islands [25]. In the ROH islands detected here, each SNP showed a percentage of occurrence > 30 % (occurring in over 30 % of the samples) (Fig. 1B). This approach resulted in the identification of 18 ROH islands across 16 autosomes, with one on BTA1, 3, 4, 6, 9, 10, 11, 12, 13, 19, 20, 23, 25 and 29 and two on BTA2 and 21, and the length of these regions ranged from 679 kb on BTA11 to 2.79 Mb on BTA12 (Table 3). Among the described ROH islands, the strongest pattern was observed on BTA13: 39,852,196,648, with an overlapping ROH region present in 50 % of the samples.
Within all of the ROH islands reported here, a reasonable number of genes (n = 168) was observed ( Table 3). We found that some SNPs in ROHs were located in intergenic regions, and a few genes were detected in some identified regions. For example, the length of region BTA25:15,526,462-16,728,810 was 1.2 Mb, but only contained 2 annotated genes. Although the length of the region on BTA2 was 793 kb, no gene was detected. The possible reasons were either that the annotation of the cow reference genome is still incomplete or that the genomic region was positioned in a noncoding region.

Overlap between selection metrics
In total, 101 and 18 candidate regions were detected by the iHS approach and ROH estimation, respectively. All the candidate regions were spread across 27 of the 29 autosomal chromosomes of the bovine genome. Coincident signatures identified by both methods were found on chromosomes BTA12, BTA13, BTA21 and BTA23 (Table 4). There were 11 common genes revealed by both iHS and ROH analyses: MTIF3, UBD, OR12D2, KIZ, LNX2, USP12, POLR1D, GTF3A, GSX1, RPL21, and MIR12033. The majority of the 11 common genes detected by both methods were related to heat stress, reproduction, immunity, and mastitis.

Functional analyses of the candidate genes
A total of 256 and 168 genes were fully or partially contained within each selected region, as detected by iHS and ROH, and were subjected to GO annotation and KEGG pathway enrichment to further analyze their biological functions. Multiple categories were statistically significant (P value < 0.05). All the genes from the iHS method were grouped into several different annotation clusters, of which 18 GO terms and 5 KEGG pathways were significantly enriched (Additional file 1: Table S3). The enriched annotation terms from iHS analysis were associated with different molecular functions, biological processes, and cellular components. A gene list analysis revealed a high percentage of genes involved in intracellular part (GO:0044424), organic substance metabolic process (GO:0071704), cellular metabolic process (GO:0044237), primary metabolic process (GO:0044238), nitrogen compound metabolic process (GO:0006807) and membrane-bounded organelle (GO: 0043227) and organelle part (GO:0044422). Similarly, the 5 KEGG pathways included vitamin B6 metabolism, folate biosynthesis, tryptophan metabolism, metabolic pathways and valine, leucine and isoleucine degradation.
All the genes detected in all ROH were grouped into several different annotation clusters (Additional file 1: Table S4). 16 GO terms (2 molecular functions, 6 biological processes, and 8 cellular components) and 1 KEGG pathway were enriched. Biological terms were further labelled based on the following categories: positive regulation of sequence-

QTLs based on identified regions
QTLs and selection signatures in the same location indicate a precise relationship between the selection for traits and the effects of variation at a locus [12]. Thus, we investigated the candidate regions overlapping QTL regions extracted from the cattle QTLdb. Overall, 89 genomic  Furthermore, 160 and 167 genes were associated with the 52 and 59 candidate regions within the health-related and reproduction-related QTL intervals, respectively. The results also showed that a significant region might overlap with several QTLs associated with different traits. Among 89 candidate regions, 35.96 % (32 regions) were related to both health traits and reproductive traits. In total, 104 genes were detected in these 32 candidate regions and included 28 genes overlapping with the most significant genomic regions (mean |iHS| score > 2). These genes were CNR2, UBD, TSPEAR, UBE2G2, SUPT3H, FUCA1, IFNLR1, IL22RA1, PNRC2, SRSF10, EXOSC10, MASP2, SRM, TARDBP, MIR12033, OR12D2, CFAP410, FAM207A, ITGB2, LRRC3, PFKL, PTTG1IP, SUMO3, KRTAP10-2, KRTAP10-8, KRTAP12-2, LOC617218 and LOC780781. The 28 genes were used to determine whether other literature reported the function of these genes. Sixteen genes overlapped with candidate regions with a mean |iHS| >2 and were only related to health QTLs; 23 genes overlapped with candidate regions with a mean |iHS| >2 and were only related to reproductive QTLs (Table 5).

Discussion
In this study, we aimed to investigate genomic evidence of selection signatures in Shanghai Holstein cattle using GGRS data. Two complementary approaches were applied for the detection of selection signatures in the studied population, i.e., the ROH and iHS methods, which should boost the accuracy of detection and eliminate unknown bias [26][27][28]. Overall, 101 and 18 candidate regions under selection were detected by the iHS approach and ROH estimation, respectively. These signatures provided insight into the genes contributing to the diverse phenotypes of these animals. Our results revealed a series of well-known and novel genes, such as FAM135B, C2CD2, GOLGA4, ARFGEF1, CTDSPL, TSPE AR, SUPT3H, ATAD2B, KLHL29, FKBP2, STOML2 and ECI2, which are related to milk production; IL22RA1, CALHM3, SNW1, PLXNA4, ABCA9, DDOST, ATP1B3, ALDH6A1 and ADCK1, which are associated with clinical mastitis; SCS, ITGA9, FKBP1B, ACAT2, AMZ2, MRPL18, HDHD3, GNAS, VSX2, PLAC8, PXDC1, REG3G, DNAJB5 and PRDX5, which are involved in body temperature during heat stress; NDUFB3, RGS3, UBD, DIS3L2, NRXN2, PEX14, SPTLC2, AQP1, and PTPN9, which are involved in adaptation, especially climate adaptation, such as adaptation to tropical humidity and harsh environments; SOD2, SNRPA1, TGFBR1, According to the PCA based on genotypes, there was no obvious structure in the Shanghai Holstein cattle population. Kim et al. (2015) reported that modern dairy cattle populations are composed of both inbred (F~0.1) and outbred structures because of the intensive use of a small number of influential males selected for artificial insemination (AI) and mated to cows that probably originated from common ancestors born more than three generations ago, which creates complex pedigree structures consisting of multiple inbreeding loops [17].
The SNPs used to detect regions associated with ROH were obtained from GGRS, a high-throughput reduced representation sequencing method used in our study. Similar to the genotyping-by-sequencing approach, this method has been designed for higher coverage (5×), which makes it more appropriate for outbred animals; thus, it is suitable for the Shanghai Holstein cattle population. However, the distribution of markers is usually uneven. In the past few years, several studies have explored the selection signatures of cattle, pigs, sheep and horses by next-generation sequencing. However, the detection of ROH is sensitive to the parameters or thresholds used for sequencing and pruning SNPs [29,30]. The genotyping error in NGS data has an impact on ROH detection. Therefore, we allowed one heterozygous SNP per ROH to avoid losing particularly long ROH because of a single genotyping error; thus, the accuracy of detection in the NGS data was shown to be high after correction for bias by hidden errors in the genotyping data [21,31]. For the impact of the density of SNPs,  many studies have reported that more signatures of selection were able to be identified with NGS data than with SNP microarray data. NGS data facilitate the detection selection signatures at higher resolution than SNP array data. Moreover, NGS-based detection of ROH is more sensitive for short ROH that are typically missed using SNP array-derived genotypes [32].  In cattle, a low-density SNP microarray tends to overestimate the number of ROH that are shorter than 4 Mb, but using a dense chip leads to an underestimation of the number of long ROH ( ! 8 Mb) [33,34]. SNPs that are genotyped with low coverage can be used for detecting ROH. Compared with SNP array data, SNP genotyping with low coverage (4× on average) can achieve comparable detection of ROH by allowing a high number of heterozygous calls in the sliding window along the genome [29,35].
In the past few years, many studies have been performed in the fields of horse, human and cattle genetics by the iHS method and the detection of ROH within populations [36][37][38]. Saravanan et al. (2021) implemented two complementary approaches viz. iHS and ROH to detect selection signatures for intrapopulation analyses in Indian cattle [28].  reported that the frequencies of ROH across the genome are correlated with signals of recent positive selection [39]. iHS is a measure of positive selection based on haplotype patterns. The positive correlation between the frequency of ROH and iHS was confirmed, supporting a role for natural selection in shaping genomic patterns of ROH. Regions with homozygosity in the long haplotypes were created by recent selection events. Short and intermediate ROH created by older selection events either have weaker iHS signals or are partially diluted by the presence of short and intermediate ROH generated by other forces. Zhang et al. (2015) reported that short ROH were selected and derived from ancient haplotypes that became fixed in populations, while long ROH were the result of more recent inbreeding events based on next-generation sequencing data in cattle populations [38]. Nolte et al. (2019) took genes falling into ROH islands and overlapping with iHS signals as input for enrichment analyses and found several pathways when the selection signatures were investigated in 942 stallions [36].
In this study, we focused mainly on detecting the footprints of selection left in the Shanghai Holstein cattle population after the process of introducing cattle from Canada, the USA, France and northern Europe. Despite their common relevance to Holstein breeding, the current breeding focus of these countries differs with respect to local climate conditions and national strategic directions. In addition, historically, the Shanghai Holstein cattle population underwent different breeding policies regarding pure and cross breeding and different primary foci of utilization. Shanghai Holstein has more than 100 years of history and is the offspring of Chinese native cattle and introduced breeds from Canada, the USA, France and northern Europe [40]. The Shanghai Holstein population is under high susceptibility to diseases such as mastitis, abortions and still births due to the harsh environment in Shanghai. Mastitis can decrease the yield and quality of milk in cattle. To reduce the environmental stress put on animals, exploring the mechanism of adaptation of livestock breeds to local climatic conditions for contemporary agriculture is important. Thus, the production of animals adapted to local climatic conditions can increase and be more environmentally friendly [41]. The method used here is an effective way to identify the genes relevant to important economic traits by identifying selection signatures within the genome. This method can also help us better understand the biological processes and mechanisms affected by ongoing strong artificial selection in livestock populations [31].
We found a total of 101 candidate regions, and 27 candidate regions representing strong signals (mean |iHS| score > 2) were identified using the iHS method. Fiftynine significant regions containing 167 genes were located within the reproduction-related QTLs. Shanghai Holstein cows have only 2-3 parities over their entire life. The high proportion of candidate regions overlapping with reproduction-related QTLs indicated that this population was subjected to selection on reproductive traits, which was consistent with the fact that Shanghai Holstein has reproductive disorders [3]. Moreover, we found that 53 genes overlapped with candidate regions located within health-related or reproduction-related QTLs. Among these 53 genes, SGO2, PPIL3, ORC2, NIF3L1, CLK1, BZW1, AOX1, AOX2, ALAD, BSPRY, CDC26, CFLAR, COX4I1, CTDSPL, EMC8, HDHD3, IRF8, ITGA9, MIR2367, MIR26A-1, MIR26C, NDUFB3, POLE3, PRPF4, RGS3, RNF183, SLC31A1, SLC31A2, VILL and C8H9orf43 overlapped with health-related QTLs. ITGA9 is a heat shock response protein that is associated with body temperature during heat stress; it was discovered by conducting a genome-wide association study in a cattle population [42]. COX4I1 has been shown to be associated with protein yield in Jersey cattle [43]. NDUFB3 was identified to be involved in Mediterranean climate adaptations and morphology and stature [44], which supports the hypothesis that climate strongly influences body size because a smaller size is positively correlated with heat and aridity [45].  found that ALAD contributed to superior heat tolerance mechanisms in an African cattle population [46]. ALAD was also related to the inflammatory response [47]. IRF8 is present within footprints of selection for Iraqi breeds and is linked to the acquired immune response to protozoan and bacterial infections [48]. Ben-Jemaa et al.
(2020) provided an outline of potential selection signatures in North African cattle, and IRF8 was found [49]. RGS3 is an essential factor for the proper growth and development of calves [50]. Previous studies showed that RGS3 was suggested to be a prognostic biomarker for small yellow follicle development in chickens. In rats, RGS3 had negative regulatory functions in signalling. In humans, this gene was involved in GnRH responsiveness in granular cells [51]. In goats, CTDSPL was positively correlated with milk yield. Moreover, miR-26b and CTDSPL were significantly correlated with milk fat content [52]. TGFBR1, SVIL, SNRPA1, SSTR2, SLC39A11, GMPS, CSPP1, COPS5, ARFGEF1 and MYOM2 overlapped with reproduction-related QTLs, which overlapped with 8 candidate regions with mean |iHS| >2: BTA13:34,000,000-34,500,000, BTA14:30,750,000-31,250,000, BTA27:1,000,000-1,500,000, BTA21: 28,750,000-29,250,000, BTA13: 40,000,000-40,500,000, BTA8: 66,500,000-67,000,000, BTA1: 113,500,000-114,000,000, and BTA19: 59,750,000-60,250,000. Among these genes, SVIL overlapped with the candidate regions with the highest iHS score. This gene, which is a target of selection in seven Indian native cattle breeds, is associated with resistance to diseases/higher immunity [53]. SNRPA1 is related to reproductive traits in livestock [54]. A comparison of selective regions and published QTL data suggests that SLC39A11 is a candidate gene for reproductive traits. Association analysis demonstrates that SLC39A11 has a substantial effect on the calving interval in dairy cattle [55]. SLC39A11 was also shown to be associated with fore teat placement in the Chinese Holstein cattle population [56]. According to functional analysis, SLC39A11 was a candidate gene for immunity and defence [57]. CSPP1, which was associated with reproduction, was detected by identifying the regions of signatures of selection across the genome of five Swedish breeds [31]. It is worth noting that the CSPP1 gene has been described to play an important role in the productive life of Holstein cattle [58]. In the Shanghai Holstein cattle population, we observed quite high disease incidences for female fertility disorders. Many studies have found that there is a strong timelagged effect of these female fertility disorders on longevity traits [59]. Genomic scans for selection signatures revealed that ARFGEF1 is a hallmark gene for milk yield in cattle [28,60].
In particular, 12 genes overlapping with 4 candidate regions related to heat tolerance QTLs, ADCK1, GOLGA4, ITGA9, ECI2, PRPF4B, PXDC1, ADIPOR2, ALKBH1, DCP1B, SLIRP, SNW1 and SPTLC2, were detected in our study. Heat stress is an important problem compromising animal production and productivity in Shanghai, which is a subtropical region [61]. Heat stress induces heat shock, oxidative stress and osmotic stress, which are deleterious to normal cellular functions [62], such as reduced oocyte competence, thereby causing lower fertility [63]. Heat shock proteins play a crucial role in environmental stress adaptation and thermal balance [15]. PRPF4B and PXDC1, which are potential candidate genes affecting pregnancy establishment and maintenance in Chinese Holstein cattle, were identified by revealing the genetic and biological basis of the reproductive performance of dairy cows under heat stress [64][65][66]. The SNPs in GOLGA4 explained the greatest proportion of the genetic variation in the cow conception rate in Holstein cows [67,68]. GOLGA4 was found to be associated with milk yield in many dairy populations [69]. ITGA9 was identified in a large number of studies on the signature of selection and environmental adaptation to exposure to thermal stress in cattle [28,[70][71][72]. According to the study of Makina et al. (2015), ADIPOR2, which is related to reproductive performance, was identified by detecting signatures of selection within and between six cattle breeds in South Africa [73]. DCP1B was identified and considered a hub gene because it was involved in multiple pathways related to milk production [74]. SNW1 was considered to be a candidate gene associated with alternative subclinical mastitis traits [75].
We employed a strict criterion that the top 1 % of SNPs with the highest number of occurrences was chosen as an indication of a possible ROH island in the genome. The same threshold was reported in a study on cattle [25]. In our study, BTA21:31,435,280-33,563,201 was detected as a candidate region in the Shanghai Holstein cattle population. Yanhuang cattle are native Chinese cattle from South China. Small and medium ROH were found to be predominant in Yanhuang cattle, and three genes (TMEM266, ETFA and ISL2) at 32 Mb on BTA21 were also detected [76]. ETFA was explored on the cGTEx website and was reported to be significantly associated with 8 economically important traits, such as mastitis and overall conformation score. We detected candidate selection regions in the Shanghai Holstein cattle population by using ROH that harbour genes associated with milk production (ABCA9, ALPL, C2CD2, CDC42, and PER2), reproduction (ABCC1, AIRN, CDX2, DCPS, ESPNL, ISL2, and MSX1), resistance to diseases/ higher immunity (ABCC6, COMMD4, CSPG4, DDOST, KCNMB1, PLXNA4, RAP1GAP, SLIT3, and ST3GAL4), heat stress (ACAT2, AMZ2, GMPR2, MRPL18, PINK1, PKDCC, SOD2, and WTAP), and feed efficiency or lipid metabolism (AGPAT4, KIF17, NEIL1, PRKAR1A, SLC22A1, and THSD4).
There are 11 genes detected by both methods. The MTIF3 gene located on chromosome 12 has been found to inhibit the autophagy pathway in heat-shocked oocytes, affecting the mRNA abundance of MTIF3 and further disrupting oocyte energy balance, this impairing embryonic development [77]. UBD has potential roles in innate immunity [78]. Zare et al. suggested UBD and POLR1D as candidates for susceptibility to Mycobacterium avium subspecies paratuberculosis infection [79]. A search for known QTLs affecting traits related to adaptation in the QTL database showed that UBD is located in the region between 15 and 40 Mb on chromosome 23 and harbours putative QTLs affecting cattle cell-and antibody-mediated immune response, tick resistance, heat tolerance, and respiratory rate [80].
LNX2 was described earlier in the literature as a mastitis-related gene [81] and a candidate gene for residual body weight gain [82]. USP12 plays an important role in stabilizing and enhancing the cellular function of androgen receptors, which are very important in regulating various sperm functions and have also been reported as fertility-related protein-coding genes in crossed cattle breeds [83].
To make use of meaningful annotations rather than simple annotations, we also conducted secondary GO term and KEGG pathway enrichment analyses, which have been proven to successfully identify overrepresented terms. Population-level analysis further suggests the presence of selection for adaptation in the Shanghai Holstein cattle population. A total of 413 genes embedded in selected regions belong predominantly to categories that are related to adaptation, such as molecular adaptation under heat stress. For example, 'vitamin B6 metabolism' (3 genes: AOX1, AOX2 and AOX4; P value = 0.0045) is a growth-and development-related pathway that is under selection, as the metabolically active form of vitamin B6 aids in the synthesis of haemoglobin and enhances oxygen binding [84]. L-tryptophan (Trp) plays crucial roles in the balance between intestinal immune tolerance and gut microbiota maintenance. Tryptophan metabolism plays a pivotal role in human health [85]. Valine, leucine and isoleucine degradation and folate biosynthesis were enriched among genes harbouring SNPs associated with marbling score in Hanwoo cattle [86]. Abnormal folate metabolism has been causally linked with a myriad of diseases [87]. Purine metabolism (bta00230) was a unique KEGG pathway identified as significant in candidate regions for ROH. Purines and pyrimidines are vital constituents of DNA and RNA molecules as well as other substrates of lipid and carbohydrate metabolism. Additionally, purine metabolites are crucial for intracellular signalling and cellular energy, which can also act as cofactors to promote cell growth, proliferation and survival [88].

Conclusions
Using two complementary analyses based on GGRS data, we constructed a high-resolution map of selection signatures in the Shanghai Holstein cattle population, which increases the spectrum of selective signals in the cattle genome. The selective signals identified in this study clearly reflected the stronger adaptation to hot and humid environments in Shanghai. Some candidate genes that might underlie differences in adaptation to specific environments and production systems were identified in potentially selected regions, such as ITGA9, ACAT2, and PLAC8. Genes associated with milk production, resistance to diseases, and reproduction were also identified as candidate regions in iHS and ROH analyses. We used cGTEx further to understand the functional consequences of genetic variants on the transcriptome of the Shanghai Holstein cattle population. Our findings may contribute to promoting the understanding of the genome evolution of and selection mechanisms in cattle.

Genotypes
This study consisted of 1092 animals from the Shanghai Holstein population. The genotypes were sequenced using a genotyping by genome reducing and sequencing (GGRS) protocol (http://klab.sjtu.edu.cn/GGRS/) [89]. The data were extracted from our previous publication [90]. The raw reads with a base average quality score of at least 20 and a score of at least 30 in the first 65 bp aligned to the cow reference genome were retained. The filtered reads were aligned to the btau4.6 assembly of the cattle genome by using Burrows-Wheeler Aligner [91]. The SNP data were edited prior to statistical analyses. First, the SNPs that could not be mapped to the reference were removed. Afterwards, SNPs were excluded if they did not meet the following criteria: (1) calling quality greater than 20 (99 % accuracy); (2) sequencing depth on average greater than 5×; (3) call rate > 30 %; (4) minor allele frequencies (MAFs) ≥ 0.05; and (5) located on an autosome. Missing genotypes were imputed using iBLUP with the default parameters [92]. After quality control, the final data consisted of 164,312 autosomal SNPs. The mean distance between adjacent SNPs was 15.56 kb. Since the UMD3.1 assembly is used in most platforms, we converted SNP position information from btau4.6 to UMD3.1 using UCSC liftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver) [93], which is a tool for conversion between genome assemblies by coordinates.

Population structure analysis
To evaluate the population structure of the Shanghai Holstein cattle population, principal component analysis (PCA) was conducted using GCTA (http://cnsgenomics. com/software/gcta/pca.html) [94]. The first three significant components were visualized using the R package "scatterplot3d".

Inbreeding coefficient estimations
The inbreeding coefficient was estimated through the -ibc command implemented in GCTA software, which returns three different inbreeding estimations, F GRM , F HOM and F UNI . F GRM , calculated following VanRaden (2008), was equivalent to the estimate of an individual's additive relationship to itself; F HOM , calculated following Wright (1948), was estimated based on the excess homozygosity; and F UNI was calculated based on the correlation between uniting gametes, following Wright (1922).

Detection of selection signatures using iHS
The integrated haplotype score (iHS) test was performed to investigate selection signatures in the Shanghai Holstein population. Before computing iHS, the ancestral allele of all bovine SNPs was inferred as the most common allele in the entire dataset, as described by Bahbahani et al. (2015) and Bertolini et al. ( 2018). The iHS score was calculated for each autosomal SNP using R's "rehh" package [97], which uses a function to detect selection signatures in dense marker data using a test based on extended haplotype homozygosity (EHH) discussed above. The package was applied to estimate the statistics and generate plots to visualize and interpret the results.
The normal standardized iHS was calculated as where iHH A and iHH D represent the integrated EHH score for ancestral and derived core alleles, respectively.
i are the expectation and standard deviation in terms of frequency bin p. We divided the genome into 500-kb windows with a 250-kb overlap and used the averaged |iHS| value in each window as the test statistic. Windows at the top 1 % of the empirical distribution were considered to be candidate regions of positive selection.

Runs of homozygosity analysis
We used the PLINK toolset version 1.9 to identify long series of consecutive, homozygous SNPs [98]. ROH were discovered using a sliding window of 100 SNPs, allowing one possible heterozygous genotype (to account for potential errors in genotyping and imputation) and 5 missing SNPs per window. The minimum SNP density was 1 SNP every 50 kb to ensure that low SNP density did not increase the length of the ROH, and the maximum distance between two consecutive homozygous SNPs in a run was kept at the default value of 100 kb. Homozygous genotypes with a length of > 1 Mb were defined as ROH. The algorithm in PLINK takes a window from a certain SNP and slides this window across the genome. Whether this window is a ROH was determined according to the previous criteria. The proportion of ROH windows encompassing each SNP was calculated. The percentage of animals that had the region with the most overlapping ROH on each chromosome was plotted. Statistical thresholds were determined empirically by selecting the top 1 % of the SNPs most commonly observed in ROH [37]. The number of ROH, the average and maximum length in kilobases and the number of SNPs were calculated for all the animals for chromosomes 1 to 29.

Enrichment analyses of candidate genes under selection
Enrichment analysis is a promising strategy for increasing the likelihood of identifying biological processes that are highly related to the biological phenomena under study [99]. In this study, enrichment analyses for genes within each significant ROH island (top 1 %) and candidate region detected by the iHS method (the top 1 %) were performed. We used the database for annotation, visualization and integrated discovery (DAVID 6.8; https://david.ncifcrf.gov/summary.jsp) to extract biological features/meanings associated with the list of genes. Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database resource for understanding the high-level functions and utilities of biological systems, such as cells, organisms and ecosystems, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies [100]. We explored the potential functions of the genes in all samples using KEGG; thus, enriched pathways with an adjusted P value < 0.05 were reported in our study.
Aligning core regions to QTL database QTLs were used to determine the expression patterns of these differentially expressed genes. Cattle QTLdb (http://www.animalgenome.org/cgi-bin/QTLdb/BT/ index, updated Sep, 2014) contains 81,652 QTLs [101]. We used a Perl Script to conduct QTL-based annotation to identify all the significant candidate regions detected by the iHS method (top 1 %) contained or overlapped across the QTLs. The number and function of candidate regions was determined after annotation.

cGTEx gene expression data set
Understanding the functional consequences of genetic variants on the transcriptome of livestock is essential for interpreting the molecular mechanisms underlying traits of economic value. In this study, to further investigate the hypothesis that many of our leading variants are regulatory, we retrieved multiple tissue gene expression data from cGTEx and then compared the gene expression data with a list of genes within each significant ROH island (top 1 %) and candidate region detected by the iHS method (the top 1 %). First, we retrieved the Ensembl IDs of the genes that were detected by ROH analysis and iHS simultaneously based on the annotation for cow genes (UMD3.1) available from the Ensembl database (http://www.ensembl.org/index.html). We uploaded the Ensembl IDs of the genes to cGTEx and downloaded all cis-eQTLs and trans-eQTLs related to genes from the website. Notably, because all the genome annotation files are based on ARS-UCD1.2 (https:// www.ncbi.nlm.nih.gov/assembly/GCA_002263795.2; GenBank accession NKLS00000000.2) [102] for cGTEx, conversion of our SNPs located in the genes from UMD3.1 to ARS-UCD1.2 was performed using the UCSC liftOver tool [93]. Then, we used a Perl script to find SNPs shared between cGTEx and the SNP dataset in our study, thus identifying genetic variations that affect gene expression in specific tissues, which is a promising approach to finding functionally relevant pathways that contribute to traits.
TWAS integrates genome-wide association studies (GWAS) and gene expression datasets to identify genetrait associations. We downloaded all 496 gene-tissue pairs significantly associated with 43 economically important traits from cGTEx. The TWAS results were compared with our genes that overlapped with candidate regions detected by iHS and ROH to detect common genes and to identify the genes that are significantly expressed in tissues relevant to a trait (P value < 0.05).