Runs of homozygosity reveal signatures of positive selection for reproduction traits in breed and non-breed horses

Background Modern horses represent heterogeneous populations specifically selected for appearance and performance. Genomic regions under high selective pressure show characteristic runs of homozygosity (ROH) which represent a low genetic diversity. This study aims at detecting the number and functional distribution of ROHs in different horse populations using next generation sequencing data. Methods Next generation sequencing was performed for two Sorraia, one Dülmen Horse, one Arabian, one Saxon-Thuringian Heavy Warmblood, one Thoroughbred and four Hanoverian. After quality control reads were mapped to the reference genome EquCab2.70. ROH detection was performed using PLINK, version 1.07 for a trimmed dataset with 11,325,777 SNPs and a mean read depth of 12. Stretches with homozygous genotypes of >40 kb as well as >400 kb were defined as ROHs. SNPs within consensus ROHs were tested for neutrality. Functional classification was done for genes annotated within ROHs using PANTHER gene list analysis and functional variants were tested for their distribution among breed or non-breed groups. Results ROH detection was performed using whole genome sequences of ten horses of six populations representing various breed types and non-breed horses. In total, an average number of 3492 ROHs were detected in windows of a minimum of 50 consecutive homozygous SNPs and an average number of 292 ROHs in windows of 500 consecutive homozygous SNPs. Functional analyses of private ROHs in each horse revealed a high frequency of genes affecting cellular, metabolic, developmental, immune system and reproduction processes. In non-breed horses, 198 ROHs in 50-SNP windows and seven ROHs in 500-SNP windows showed an enrichment of genes involved in reproduction, embryonic development, energy metabolism, muscle and cardiac development whereas all seven breed horses revealed only three common ROHs in 50-SNP windows harboring the fertility-related gene YES1. In the Hanoverian, a total of 18 private ROHs could be shown to be located in the region of genes potentially involved in neurologic control, signaling, glycogen balance and reproduction. Comparative analysis of homozygous stretches common in all ten horses displayed three ROHs which were all located in the region of KITLG, the ligand of KIT known to be involved in melanogenesis, haematopoiesis and gametogenesis. Conclusions The results of this study give a comprehensive insight into the frequency and number of ROHs in various horses and their potential influence on population diversity and selection pressures. Comparisons of breed and non-breed horses suggest a significant artificial as well as natural selection pressure on reproduction performance in all types of horse populations. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1977-3) contains supplementary material, which is available to authorized users.


Background
The modern horse population represents a particularly heterogenous group influenced over the time by various selective pressures [1]. However, in studies on genetic diversity a contrasting homogeneity within breeds or non-breeds has been observed [2][3][4]. In particular breeds like the Arabian, Hanoverian and Saxon-Thuringian Heavy Warmblood have been shaped by intense human selection for specific abilities and characteristics that meet with requirements for optimal performance whereas environmental conditions have particularly influenced non-breed horses. The Dülmen Horse and the Sorraia can be characterized as non-breeds with a robust constitution for primitive living conditions not subjected to human selection criteria for specific breeding aims but to natural selection [5][6][7]. Strong selective pressures result in a reduction of genetic diversity which is characterized by long stretches of consecutive homozygous genotypes in the genome known as runs of homozygosity (ROH) [8][9][10][11]. Size and frequency of ROHs give evidence for relatedness within and in-between populations.
In horses, signals of selection have been investigated in 744 horses of 33 breeds using whole-genome single nucleotide polymorphism (SNP) array data [1]. Potential genomic targets of selection were observed within breeds by F ST -based statistics in 500-kb windows and revealed common haplotypes in the region of coat color genes, size and performance traits. The highest signature of selection was found in the Paint and Quarter Horse on ECA18 in the region of the myostatin gene (MSTN). Positively selected loci for performance have also been detected in a Thoroughbred population study based on microsatellite markers [12]. Candidate regions for exercise adaption including fatty acid oxidation, increased insulin sensitivity and muscle strength have been suggested as potential selection targets.
Signals of selection have also been investigated in other mammals including cattle, dog, pig and human, frequently scanning ROHs as diversity indices [1,[9][10][11]13]. In human, ROHs were considered valuable for population demographic analyses and allowed reliable differentiation of human indigenous populations from distinct continents [14]. Shorter homozygous stretches helped to characterize population specific properties whereas ROH longer than 0.5 Mb could be shown to be frequent in all populations [15,16]. It was proposed that ROHs longer than 1 Mb were quite more common in outbred individuals. In addition to population genetic applications, ROH detection was suggested to be valuable for mapping of causative mutations for recessive diseases [17,18]. A study for schizophrenia identified nine risk ROHs which were significantly more frequent in affected patients and harbored disease-associated genes [19].
In domestic animals, especially performance related phenotypes and breed specific characteristics were mainly in focus of ROH analyses [9,20]. A whole-genome comparative detection of ROHs in a sliding window approach was applied for pigs in wild and domesticated populations [9]. Two overlapping ROHs were identified in the European breeds harboring genes involved in cell differentiation. In Large White and Landrace pigs an exclusive ROH could be shown to be located in the region of the growth related pleiomorphic adenoma gene 1 (PLAG1). Further breed specific ROH analyses for Chinese and Western pigs revealed loci under selection important for highaltitude adaption in Tibetan pigs as well as a coat color locus in the region of endothelin receptor type B (EDNRB) in Chinese belted pigs [21].
Signatures of selection affecting coat color and body size traits could also be observed in genome-wide ROH scans for dogs [22]. It was suggested that ancestral genetic variations were transformed into specific characteristics of different dog breeds [13,22]. Next generation sequencing (NGS) data from dogs and wolves revealed regions of potential selection in domesticated dogs which affect metabolism and thus suggest a potential adaption to starch digestion [13,23]. In the Lundehund, fifteen regions with long-range haplotypes indicated potential signatures of positive selection for polydactyly, body size and male fertility [24]. In cattle, a large number of ROHs have been shown to be widely distributed among various breeds and demonstrated its utility for prediction of inbreeding coefficients and relatedness [10,25,26]. Haplotype-frequency based approaches revealed signatures of selection in the region of genes affecting reproduction and muscle formation [27]. A genome-wide scan in Holstein cattle identified milk yield, composition, reproduction and behavioral traits in potentially selected regions [28]. Similar observations were made in an U.S. Holstein cattle study which investigated the distribution of ROHs in different milk production groups [29]. Forty genomic regions in potential signatures of selection were identified in SNP array data harboring loci for milk, fat and protein yield. However, the use of SNP arrays for ROH detection was suggested to be limited mainly for low SNP density reasons [27,28,30]. Higher resolution genomic analyses on basis of wholegenome data enabled the use of 15 million SNPs from 43 Fleckvieh cattle for powerful detection of selected traits [20]. Candidate regions for coat color, neurobehavioral functioning and sensory perception were found in ROH regions suggesting domestication-related signatures of selection. The accuracy of ROH detection in NGS data was shown to be high if corrected for bias by hidden errors in genotyping data [31].
In this study, whole-genome sequences of ten horses were used for analysis of ROHs in a sliding window approach. 50-SNP and 500-SNP windows were chosen for reliable detection of ROHs of different sizes. ROHs exclusively found in individual horses or breeds were further investigated for their gene content potentially affected by targeted selection for specific appearance and function.

Sequencing and variant detection
Whole-genome sequences of a Dülmen Horse, two Sorraia, an Arabian, a Saxon-Thuringian Heavy Warmblood descendent from the Old-Oldenburg breed, a Thoroughbred and four Hanoverian were obtained using NGS. Mapping to the reference genome Equ-Cab2. 70

Sequence error estimation
We estimated sequence errors on the basis of SNP50 BeadChip data in five horses. The results of BeadChip analysis were assumed to be error free. The rate of falsenegative SNPs was calculated based on heterozygous SNPs in BeadChip data which were homozygous in NGS data. We detected false-negative rates of 0.24-0.26 in the four Hanoverian and 0.22 in the Arabian ( Table 1). The false-positive rate was at 3.8 x 10 −4 to 9.8 × 10 −4 using all SNP positions of the BeadChip data. More stringent error estimations in long SNPChip ROH regions of >10 Mb compared to filtered NGS sequence revealed even lower error rates at 3.1 × 10 −4 to 5.9 × 10 −5 .

ROH detection
An average number of 3492 ROHs was detected for the ten horses in windows of minimum amount of 50 homozygous SNPs and an average number of 292 ROHs in windows of 500 homozygous SNPs ( Table 2). The number of smaller ROHs of 40-59 kb was almost equally distributed in all ten horses, whereas ROHs >59 kb were comparatively high in the two Sorraia horses and in the thoroughbred (Fig. 1). ROH detection in larger windows of >400 kb revealed even a more distinct distribution, showing ROHs particularly frequent in the Sorraia and Thoroughbred but also in the Arabian. As indicated by the number and size of ROH, the total length of ROHs in sliding windows of at least 50 SNPs was notably high in the Thoroughbred (953 Mb) and the two Sorraia

Private ROHs and functional annotation
Functional annotation of genes located in private horse specific ROH regions, which could not be detected in one of the other horses under analysis, was performed in order to get an insight into biological processes affected by genes in horse specific homozygous segments. PAN-THER gene list analysis for 50-SNP as well as 500-SNP windows revealed a high percentage of genes involved in cellular processes (GO:0009987), metabolic processes (GO:0008152) as well as biological regulations (GO:006    Analysis of shared private ROHs in specific breed horses revealed 18 ROHs common in all four Hanoverian but not in the other analyzed horses in 50-SNP windows and no shared ROHs in 500-SNP windows ( Table 4). The 18 ROHs contained four novel genes and six genes known as dyslexia susceptibility 1 candidate 1 (DYX1C1), protein phosphatase 1, regulatory (inhibitor) subunit 14C (PPP1R14C), cilia and flagella associated protein 61 (CFAP61/C20orf26), cysteine sulfinic acid decarboxylase (CSAD), TBC1 domain family, member 30 (TBC1D30) and ALX homeobox 4 (ALX4), which were shown to be related by direct genetic interactions or coexpression (Fig. 2). A dense network of genetic interactions could also be found in-between genes located in private ROHs exclusively found in the non-breed horses Dülmen Horse and Sorraia ( Fig. 3 and 4). In total, 198 ROHs could be detected in 50-SNP windows covering 139 genes (Additional file 4). The largest ROHs for nonbreed horses of 324,707-163,116 base pairs were located in the region of the developmental and signaling genes secreted frizzled-related protein 2 (SFRP2), fraser extracellular matrix complex subunit 1 (FRAS1), interleukin-1 receptor-associated kinase 1 binding protein 1 (IRAK1BP1), pleckstrin homology domain interacting protein (PHIP), acyl-CoA synthetase short-chain family member 3 (ACSS3), protein tyrosine phosphatase, receptor type, f polypeptide, interacting protein (liprin), alpha 2 (PPFIA2) and did also cover a gene-rich region which included spermatogenesis associated 25 (SPATA25), acyl-CoA thioesterase 8 (ACOT8) and troponin C type 2 fast (TNNC2). ROH detection in 500-SNP windows revealed seven common ROH regions for non-breed horses. They were located on horse chromosomes 22 and 28 in or near by ROHs already found in 50-SNP windows. The largest region showed a size of 576,454 base pairs. In contrast to non-breeds, the whole group of breed horses (Hanoverian, Arabian, Saxon-Thuringian Heavy Warmblood and Thoroughbred) revealed only three common ROHs in 50-SNP windows and no common ROHs in 500-SNP windows (Additional file 5). The largest     proposed to be tolerated (SIFT score 0.32, 0.34, 0.12).
In contrast to non-breeds, breed horses harbored no variants with high or moderate effects in private ROHs. Nevertheless, the four Hanoverians could be shown to harbor four SNPs in their private ROH regions in the genes DYX1C1, CSAD and in the novel gene ENSECAG00000004438 (Additional file 9). These missense mutations showed no specific genotypes which could be exclusively found in the four Hanoverians. Furthermore, a closer examination of the consensus ROHs of all ten horses revealed a total of seven SNPs in the intronic region of KITLG but no variants with high or moderate effects.

Discussion
The detection of ROHs in ten horses of six different populations allowed us to estimate the genetic diversity in breeds or non-breeds and their signatures of potential selection. Smaller ROHs could be found in all horses to a very high number whereas ROHs of a larger size >59 kb and also longer stretches of consecutive homozygous genotypes >400 kb showed quite distinct distribution among different horse populations. Long homozygous stretches and consequently high inbreeding coefficients characterized the Sorraia and Thoroughbred, which were shown to be closed populations, as well as the Arabian derived from a relatively narrow genetic base [32][33][34].
Especially in the Thoroughbred the low genetic diversity was supposed to be a result of high selective pressures for specific traits of racing performance [12]. In contrast, the four Hanoverian sport horses in our study showed a low number of ROHs and relatively low values for F ROH indicative for inbreeding. Nevertheless, they shared 18 ROHs which harbored six genes potentially important for appearance and performance in sport horses. One of these genes, the homeodomain transcription factor coding gene ALX4 was proposed to play an essential role in the skeletal mineralization and epidermal development in human and mice [35,36]. Neurologic activity could be shown to be affected by CSAD, regulating intracellular calcium levels in neurons by its influence on taurine biosynthesis, and DYX1C1 involved in neuronal migration [37][38][39]. The candidate gene TBC1D30 has been characterized as a signal transducing peptide [40]. Comparative analyses of indicine and taurine cattle revealed signatures of selection and copy number variations in the region of TBC1D30 [41]. In KEPI (PPP1R14C)-knockout mice, a reduce response to repeated morphine injections suggested an important role of KEPI in the regulation of analgesic tolerance [42]. KEPI was shown to be expressed in brain regions of drug reward, locomotor control and nociception [42,43]. Furthermore, it was supposed to play an important role for the regulation of glycogen synthase by its inhibitive effect on protein phosphatase 1 (PP1) [44]. A significant impact on fertility could be observed in association with CFAP61 which was shown to affect cilia and flagella motility [45,46]. It can be assumed that these functional effects on neurologic control, signaling pathways, glycogen balance and reproduction might represent important targets of selection for the Hanoverian, which has become a specifically shaped breed into a modern sport horse type. In comparison to ROH analysis of all breed horses, the number of ROHs in the Hanoverian was relatively high probably as a result of breed specific similarities. However, despite significant differences in-between breeds, the whole group of breed horses revealed a region of potential selection harboring a fertility-related gene. YES1 could be shown to be an essential protein tyrosine kinase for selfdefensive mechanisms in spermatocytes [47]. During testicular heat stress a significantly upregulated expression of YES1 was supposed to antagonize apoptotic processes to maintain spermatogenic differentiation and male fertility. In addition to that, it was even more intriguing that ROH analyses in non-breed horses also suggested a high positive selection for reproduction in mainly naturally selected horses. One of the largest ROHs could be shown to harbor the fertility related gene SPATA25 which is known to be mainly expressed in testis in human. Studies of obstructive azoospermia revealed a significantly reduced expression level in affected patients in comparison to fertile persons [48]. Other candidate genes were proposed to be involved in embryonic development. Analyses of FRAS1 deficient mice revealed phenotypic defects affecting embryonic epithelial basement membranes and internal organs [49]. Furthermore a number of genes involved in energy metabolism (Acyl-coenzyme A synthetase 3, ACSS3; thioesterase 8, ACOT8) [50,51] and muscle development (NEURL2) [52,53] could be found in large non-breed specific ROHs. The differentiation and survival of cardiomyocytes was supposed to be affected by SFRP2 [54]. It was shown that SFRP2 plays an important role in myocardial survival and is involved in ischemic injury repair of cardiomyocytes. The assumption of a potential non-breed specific effect on myocardial regulation for greater endurance in free range conditions was supported through the detection of a functional variant with a possibly deleterious impact on HRC. It was suggested that different expression levels of HRC can affect CA 2+ homeostasis and contractile function of the heart [55,56]. In human and mice affected with heart failures, the HRC expression levels could be shown to be significantly decreased. In conclusion, we propose that non-breed horses underlie a selection mainly driven by nature which affects reproduction, embryonic development, energy metabolism and cardiac development traits. These results confirm the suggestion that metabolic processes and morphogenesis play an important role for survival and maintenance in non-breeds [57].
Despite the specific genetic features in non-breeds as well as breeds and the general differences in the number and length of ROHs in various horse breeds, a functional enrichment of genes affecting cellular, metabolic and developmental as well as immune system and reproduction processes could be shown in ROHs in all ten horses.
These results suggest that despite the low number of individuals in some breeds or non-breeds these ten horses presumably represent a general phenomenon in horse populations. We assume that regions of genes involved in fundamental processes essential for development and sustainment of individuals and populations underlie high selective pressures and accordingly limited variations. A main focus which could be found in all breeds, specific breeds (Hanoverian) and also in non-breeds was a potential selection for traits of reproduction. Essential genes for processes affecting fertility, embryonic development and birth varied among different horse populations but could be assumed to play a key role in artificial or natural selection as well. Reproduction performance has been shown to be of high economic importance in breeds and of vital importance for non-breeds to ensure survival in the wild [5,58]. Various studies in livestock came to the same conclusion and identified reproduction traits are essential targets of selection [9,27,28].
This assumption is supported by our detection of three consensus ROHs in all ten horses which harbor only one annotated gene, the KITLG, also known as Mast Cell Growth Factor, Stem Cell Factor or steel factor [59,60]. Scans for signatures of diversifying selection in pigs proposed the KITLG locus to be a breed-specific signature in the Berkshire [61]. Due to its complex functional capacity, KITLG has fundamental impact on various essential processes affecting melanogenesis, haematopoiesis and gametogenesis [59,62,63]. Mutations in KITLG and its receptor KIT were shown to affect multiple cell formation stages parallelly during embryonic development and in fully-grown mice [63,64]. The Steel Panda mutation at KITLG locus resulted in anemic black-eyed mice of white color with pigmented ears and scrotum and caused sterility in females. In human, a significant association for male infertility could be detected in KITLG affecting sperm count in patients [65]. In horses, the receptor of KITLG (KIT) was suggested to encode the dominant white (W) locus and to initiate severe disorders in haematopoietic system which might be responsible for the lethal consequences of homozygous W/W-genotype [66]. It was proposed that the dominant white phenotype is restricted in some breed registries due to the lethal effect of the homozygous dominant white mutation and also due to the risk of greater susceptibility to skin diseases. Therefore, we assume that the number of negative effects of KITLG mutations particularly affecting traits of reproduction and development have led to a strong positive selection of this region in horses that resulted in long ROHs.
The results of our study suggest that despite significant differences in-between breed and non-breed horses with regard to functional traits, all horse populations show strong signatures of selection in the region of genes affecting traits of reproduction.

Ethics statement
All animal work has been conducted according to the national and international guidelines for animal welfare. The

Samples and sequencing
Sequencing analysis was based on data from two Sorraia, one Dülmen Horse, one Arabian, one Saxon-Thuringian Heavy Warmblood, one Thoroughbred and four Hanoverian. Among these horses, six whole-genome sequences from two Hanoverian (SRX389480/SRX389477), one Arabian (SRX389472), one Sorraia (SRX389475), one Dülmen Horse (SRX384479) and one Thoroughbred (SRR1055837) were obtained from the Sequence Read Archive (NCBI). The remaining samples of a Sorraia mare, a Saxon-Thuringian Heavy Warmblood and two Hanoverian stallions were prepared for whole-genome sequencing. DNA was extracted from white blood cells derived from EDTA-blood sampling using Invisorb Spin Blood Mini kit according to the manufacturers' protocol (Stratec Biomedical, Birkenfeld, Germany). Paired-end libraries of the two Hanoverian were prepared using the Illumina DNA sample preparation kit (Illumina, San Diego, CA). DNA-samples were sheared on the Covaris (Covaris, Woburn, Massachusetts) and purified with Agencourt AMPure XP beads (Beckman Coulter, Krefeld, Germany). The remaining two samples (Sorraia and Saxon-Thuringian Heavy Warmblood) were prepared using the Illumina Nextera DNA Sample Prep Kit according to the manufacturers' protocol and purified with Agencourt AMPure XP beads as well. The whole genome of both Hanoverian was sequenced using an Illumina HiSeq2000 (Illumina) in paired-end mode (2 × 101 bp reads), whereas the Sorraia and Saxon-Thuringian Heavy Warmblood were run on an Illumina MiSeq with v2 Reagent Kits (2 x 250 bp reads) four times paired-end on a single lane flowcell to reach an adequate coverage for whole genome sequencing.

Runs of homozygosity
ROHs were detected using a trimmed dataset of 11,325,777 SNPs with a minimum read depth of 3, a maximum read depth of 60 and a minimum mean read depth of 12 for all ten samples. The X chromosome was omitted for this analysis. We defined ROHs as homozygous regions in sliding windows of 50 SNPs in a first run and 500 SNPs in a second approach using PLINK, version 1.07 (http://pngu.mgh.harvard.edu/purcell/plink/, [73]). Homozygous genotypes of >40 kb as well as >400 kb were defined as ROHs. The minimum distance of SNPs was estimated 0.8. This distance estimation was determined dividing the size of the genome covered with SNPs by the number of SNPs. No more than three SNPs with missing genotypes and three heterozygous SNPs were allowed in each window. The detected ROHs were categorized into small, medium and large ROHs and filtered for individual ROH regions for specific horses or breeds using SAS/Genetics, version 9.4. Private ROHs were determined by filtering out homozygous variants in ROHs in the horse of interest which could not be found in ROHs of other horses. Thus whole individual ROHs or individual parts of ROHs were detected as private ROHs for specific horses as well as for breeds or non-breeds. Consensus ROH regions were derived from intersections of homozygous variants in all ten horses. Furthermore inbreeding coefficients (F ROH ) were estimated for each horse dividing the size of ROHs in bp by the length of the genome (2,242,879,462 bp) covered with SNPs.
In addition to that, theta estimations and neutrality test statistics Tajima's D, Fu&Li F's, Fu&Li's D, Fay's H, Zeng's E were obtained using ANGSD version 0.902 [74]. Analyses were performed for all detected private ROHs in breed, non-breed and Hanoverian horses and for the consensus ROHs as well. Run parameters were adjusted to control for sequencing errors using a minimum quality value of 20 (−minQ 20) and filtering for a read depth of 3 to 60 (−geno_minDepth 3, −geno_-maxDepth 60). Sliding windows of 40 kb as well as 400 kb were chosen for analysis.

Sequence error detection by SNP50 BeadChip
In addition to whole-genome sequencing, two horses (Hanoverian) of a previous study [57] and three horses (two Hanoverian and one Arabian) of the current study were genotyped on the Illumina SNP50 BeadChip.
Sequence errors were estimated in comparison with BeadChip data identifying heterozygous SNPs in Bead-Chip data which were homozygous in NGS data as false-negative and homozygous SNPs in BeadChip data which were heterozygous in NGS data as false-positive. For a more robust estimation of average false-positive error rates, long ROHs >1 Mb in sliding windows of 20 SNPs and a minimum distance of 50 were detected in BeadChip data using PLINK. No heterozygous SNPs and two missing called were admitted. These long ROH were assumed to hold error free homozygous genotypes and therefore ensure more precise error estimation in comparison with NGS-SNPs. The falsepositive error rates were taken into account in the ROH detection admitting three heterozygous SNPs in each sliding window.

Functional annotation
Gene lists of horse specific ROH regions were obtained using SAS/Genetics for filtering PLINK summary files and Galaxy intersection tool (https://usegalaxy.org/) [75][76][77] for gene allocation to genomic regions. The chromosomal positions of ROHs were aligned with the refseq gene table from UCSC (Ensembl genes) in order to obtain all genes located in ROHs. To improve functional analysis, we converted these gene lists to human orthologous genes using g:Profiler [78,79]. PANTHER gene list analysis [80] was performed for functional classification of biological processes affected by genes in private ROH regions. In addition to these horse specific evaluations, further analyses for consensus ROHs in all ten horses and shared private ROHs in breed horses (Hanoverian, Arabian, Saxon-Thuringian Heavy Warmblood and Thoroughbred), non-breed horses (Dülmen Horse, Sorraia) and in the Hanoverian were performed. Gene names and its human orthologues were obtained using the Galaxy intersect function and g:Profiler as well. Genetic relations inbetween genes were obtained using GeneMANIA [81].

Functional variant detection
Functional variants with high or moderate effects were evaluated using SAS/Genetics for filtering SNPEff predictions categorized into high, moderate and low variant impacts. We determined the distribution of genotypes in relation to breed or non-breed groups and detected SIFT [82] prediction scores for functional effects using the Variant Effect Predictor [83].