Next generation sequencing gives an insight into the characteristics of highly selected breeds versus non-breed horses in the course of domestication

Background Domestication has shaped the horse and lead to a group of many different types. Some have been under strong human selection while others developed in close relationship with nature. The aim of our study was to perform next generation sequencing of breed and non-breed horses to provide an insight into genetic influences on selective forces. Results Whole genome sequencing of five horses of four different populations revealed 10,193,421 single nucleotide polymorphisms (SNPs) and 1,361,948 insertion/deletion polymorphisms (indels). In comparison to horse variant databases and previous reports, we were able to identify 3,394,883 novel SNPs and 868,525 novel indels. We analyzed the distribution of individual variants and found significant enrichment of private mutations in coding regions of genes involved in primary metabolic processes, anatomical structures, morphogenesis and cellular components in non-breed horses and in contrast to that private mutations in genes affecting cell communication, lipid metabolic process, neurological system process, muscle contraction, ion transport, developmental processes of the nervous system and ectoderm in breed horses. Conclusions Our next generation sequencing data constitute an important first step for the characterization of non-breed in comparison to breed horses and provide a large number of novel variants for future analyses. Functional annotations suggest specific variants that could play a role for the characterization of breed or non-breed horses. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-562) contains supplementary material, which is available to authorized users.

Conclusions: Our next generation sequencing data constitute an important first step for the characterization of non-breed in comparison to breed horses and provide a large number of novel variants for future analyses. Functional annotations suggest specific variants that could play a role for the characterization of breed or non-breed horses.

Background
The process of domestication has shaped the modern horse population and lead to an immense group of different types of breeds [1,2]. Various environmental as well as artificial factors affected the population structure and lead to the formation of more than 400 horse breeds today [2,3]. In order to create horses with a characteristic uniform appearance and function, some breeds have especially been under strong directional selection to a special breeding goal while other populations still underlie a greater natural selection and have kept their original properties to survive under harsh environment [2,4]. Despite human influences, the Duelmener horse as well as the Sorraia developed under quite natural circumstances as they are generally kept under free range conditions without specific human care for health. They show typical primitive markings and a robust constitution for the survival under harsh conditions [5][6][7]. Those horses which are less subjected to a breeding goal but to the preservation of this specific population can be grouped as non-breed horses [2]. These non-breeds developmentally lie inbetween the highly selected modern horse breeds and the Przewalski population which falls outside of the monophyletic group of domestic horses and represents the last survivor of wild horses [8]. In contrast to that the Hanoverian as well as the Arabian, one of the oldest recognized domestic breeds, have been subject to close breeding and intense human selection for their high ability in function and performance [9,10]. A comparative whole genome sequence analysis with the genomes of five modern domestic horses identified private loci selected for modern horses and suggested a continuous selection on the immune system and olfaction throughout horse evolution. Other genomic regions showed low levels of genetic variation compared to the Przewalski horse and were suggested to be potentially selected early during domestication [8]. Further specific modern domestic horse variants have been investigated by next generation sequencing of a Quarter horse mare that provided 2.8 million novel SNPs, 193 thousand indels and 282 CNVs and revealed an enrichment of biological processes involved in sensory perception, signal transduction, immunity and defense pathways [11]. The aim of this study was to investigate the influence of human selection on domestic horses by comparative analysis of non-breed to breed horses. Next generation sequencing was performed to characterize five horses of two different breeds and two different non-breeds representing the highly diverse horse population by the detection of novel genetic variations. The results show that the process of domestication as well as further artificial selection for specific breeding aims has influenced the development of specific biological processes in horse breeds.

Sequence analysis and variant detection
Whole genome sequencing was performed in two Hanoverians and one Arabian representing breed horses and one Duelmener and Sorraia representing non-breed horses using the Illumina HiSeq2000 (Illumina, San Diego, CA). Each horse was run on one lane except one of the Hanoverians (Hanoverian 1) which was run on two lanes due to its importance for the Hanoverian breed. After passing filters, reads were trimmed and aligned to the reference horse genome derived from the Thoroughbred mare Twilight and its half-brother Bravo (EquCab2.70, Additional file 1A) [12]. The alignment resulted in a mean coverage of 14.02X for the Duelmener, 12.21X for the Sorraia, 13 Considering the distribution of variants identified on the individual chromosomes of non-breed and breed horses, these groups showed an almost identical number of variants per chromosome (Additional file 2). The highest number of variants could be detected on chromosome (ECA) 1. Nevertheless, with regard to the chromosomal enrichment of detected variants accounted by dividing the number of variants by the length of the chromosomes (bp), we could show that especially ECA12 and ECA20 were over the threshold of the whole genome enrichment (Additional file 3). On the whole, 19 chromosomes of the analyzed 32 chromosomes and the mitochondrial DNA exceeded the threshold of the average number of variants per base pair of the whole genome.

Comparison with Equine SNP50 genotyping beadchip and public databases
Comparative analyses of the detected SNPs by next generation sequencing with Equine SNP50 genotyping beadchip data in the Arabian and the two Hanoverian showed that the variant detection was reliable for these data. In total, 95.92% of the SNPs from beadchip analysis could be confirmed by next generation sequencing for the Arabian and 97.42% for the Hanoverian (Hanoverian 2) run on one lane ( Figure 1). The Hanoverian which was run on two lanes (Hanoverian 1) even showed a concordance with beadchip data of 99.13%. Only 0.51% of the SNPs were different in genotype and 0.36% was not detected by next generation sequencing.
Further comparison was performed with SNP and indel data from open access data bases with the present variant detection results. A total of 907,776 SNPs of our analysis showed an overlap with dbSNP (dbSNP, ftp:// ftp.ncbi.nih.gov/snp/organisms/horse_9796/chr_rpts/), 907,864 SNPs with Ensembl (ftp://ftp.ensembl.org/pub/release-73/variation/gvf/equus_caballus/ Equus_caballus.gvf. gz) and 910,822 SNPs with Broad Institute (http://www. broadinstitute.org/ftp/distribution/horse_snp_release/v2/) data, while 9,281,631 SNPs could not be retrieved from these databases ( Figure 2). In comparison with NGS data from previous studies in a Quarter horse with a minimum read depth coverage of 10X [11] and in five domestic horses (Arabian, Icelandic, Norwegian fjord, Standardbred, Thoroughbred, 7.9-21.1X), one Przewalski (9.6X) and one donkey (16X) [8], we found 1,782,560 shared SNPs, 268,966 SNPs known from the Quarter horse data and 4,697,675 exclusively overlapping with SNPs derived from the five domestic horses, Przewalski or donkey ( Figure 3). In total 3,444,220 SNPs could not be found in these published whole genome sequence data. With regard to all previously published databases and sequence analyses we detected 3,394,883 novel SNPs. Comparative analyses of our indel data with previous studies revealed 236 indels shared in all projects, 18,800 indels known from the Quarter horse and 474,387 indels overlapped the study with the five domestic horses, a Przewalski and a donkey [8]. In total 868,525 indels were exclusive in our data ( Figure 4). With regard to the mutation type a huge number of 15,012 novel non-synonymous SNPs were predicted as well as SNPs affecting splice sites,    Table 1, Additional file 4). The novel indels in our analysis were proposed to have different effects like codon deletions, codon insertions and a large number of 4,916 frame shifts as well as four exon deletions could be found.

Functional annotation and characterization
We investigated the variants detected in our analysis for their effects on basis of the variant annotation and effect prediction tool SNPEff [13] and their distribution across breed and non-breed horses. Most of the variants revealed one or two predicted SNPEff-effects. Regarding the private SNPs that could only be found in one of the compared horses or groups, non-breed horses showed a larger number of effects by private SNPs in intergenic as well as genic regions in comparison with breed horses (Table 2). A total of 810,687 effects could be detected affected by private SNPs in the Duelmener and 824,072 effects in the Sorraia.
The Arabians showed 737,324 effects while the Hanoverian horses revealed an even lower number of effects (316,464) as a result of a reduced number of private SNPs due to the analysis of two horses. With regard to specific effects, 9,507 non-synonymous SNPs were predicted to be shared by all horses while 3,053 private SNPs in the Duelmener, 2,367 private SNPs in the Sorraia, 2,191 private SNPs in the Arabian and 878 private SNPs in the Hanoverian horses. Nevertheless, functional classification analysis of the distribution of genes affected by private nonsynonymous SNPs did not reveal any significant differences in these groups (Additional file 5). Other effects like splice site donors, splice site acceptors, start and stop codon changes were also predicted for shared and private variants. We further investigated the loss of the stop codons possibly affected by private SNPs and identified the genes ATP13A4 (probable cation-transporting ATPase 13A4) in the Duelmener, ENSECAG00000000628 (TRBV6-4, T cell  receptor beta variable 6-4) and ENSECAG00000007186 (OR2A2, olfactory receptor, family 2, subfamily A, member 2) in the Sorraia and RTDR1 (Rhabdoid tumor deletion region gene 1) and ENSECAG00000008382 in the Hanoverians but no loss of stop codons in the Arabian (Additional file 6). All private losses of stop mutations were heterozygous for the respective individuals. Analyses of the indels revealed 629,686 effects predicted for shared variants in the five horses. Especially codon changes, splice site modifications, frameshift mutations as well as one exon deletion became apparent in the detection of private indels (Additional file 6). The heterozygous 21 bp deletion detected exclusively in the Duelmener was predicted to affect the gene CNDP2 (dipeptidase 2, metallopeptidase M20 family) (Additional file 7). Further investigation of the predicted codon changes due to private indels revealed an increased occurrence of genes involved in immune system processes in breed horses (22.6%) in comparison to non-breed horses (6.7%, Additional file 7 and 8). The mean heterozygosity in immunity related regions could be shown to be considerably high in breed horses (0.30-0.33) in contrast to the Duelmener (0.24) and Sorraia (0.09). We performed an enrichment analysis for coding and regulative regions affected by SNPs for breed and non-breed horses. An analysis of coding regions in nonbreed horses revealed an enrichment of genes involved in primary metabolic processes as well as genes involved in anatomical structures, morphogenesis and cellular components ( Figure 5, Additional file 8). Breed horses showed an enrichment of genes in coding regions involved in cell communication, lipid metabolic process, neurological system process, muscle contraction, ion transport and developmental processes of the nervous system and ectoderm. Regulative regions with private non-breed SNPs were enriched with genes affecting proteolysis and fatty acid metabolic processes, while breed horses showed an enrichment of genes in system processes, exocytosis, developmental processes, cell communication, transport and sensory perception of sound (Additional file 9).
Further investigation of known variants revealed a confident detection of the basic coat color black/chestnut and bay phenotypes (Table 3) [14]. Analysis of coat color genotypes associated with a disease did not reveal any mutations. Further search for disease traits confirmed that the Arabian was heterozygous at the TOE1:g.2171G > A SNP which is associated with genetic carriers for cerebellar abiotrophy [15] (Table 4). Performance trait analysis further characterized the different horse breeds. The Duelmener, Sorraia and Arabian showed a T/T genotype for the polymorphism BIEC2-808543 which was proposed to affect the ligand dependent nuclear receptor corepressor-like (LCORL) as the main regulator for body size in horses [16][17][18]. The larger sized Hanoverians had a heterozygous C/T genotype (Table 5). Five SNPs affecting racing performance were also investigated and suggested the Arabian and Hanoverians to be the horses with good stamina in long distances. Especially one Hanoverian was heterozygous for all racing performance traits. The Duelmener showed a heterozygous genotype for the g.66493737C/T mutation in the myostatin gene (MSTN) which is associated with middle distance racing ability and two further mutations associated with good racing performance [19]. In contrast to the Duelmener, the Sorraia showed the C/C genotype in the MSTN gene which can be found in fast horses in short distances.

Discussion and conclusion
The objective of this study was to give an insight into the diversity of the horse population by comparative analysis of breed to non-breed horses using next generation sequencing. Both groups revealed a large number of potential novel SNPs and indels that could not be found in any horse database. Comparative analyses with BeadChip data and known variants causing coat color and disease phenotypes suggested that our sequencing data with a mean coverage of 11-25X provide a reliable basis for variant detection despite possible limitations due to lower sequence coverage in specific regions and errors in the reference genome that might simulate genetic variants [11]. Variant effect prediction by SNPEff was probably under the same limitations regarding the genome build EquCab2.70, but the use of this tool on basis of pre-build databases has been successfully applied for re-sequencing studies in horses and other mammals so far [41][42][43]. Our next generation sequencing data of one or two horses of different populations each gives us a broad idea of how diverse domestic horses really are. The analyzed horses are derived from different developmental groups of the horse population representing one of the oldest and strongly selected breeds (Arabian), one highly selected sport horse breed (Hanoverian) and two populations from different geographic backgrounds that underlie a strong natural selection (Duelmener, Sorraia). These horses were chosen as characteristic representatives for their population in order to reflect distinctive attributes of selectively bred horses in comparison to horse populations not bred for specific purposes. The use of the term "breed" has been discussed very differently. The main point for the differentiation of non-breed and breed was the question if the population was subject to controlled breeding and husbandry imposed by humans [44]. We suggest that the The presented results are classified by SNPEff terms for each breed.
Duelmener and Sorraia populations are not exposed to those strong human influences and should therefore be grouped as non-breed. In comparison to breed horses we were able to detect a larger number of private SNPs in non-breed horses. Nevertheless, despite the lower number of private SNPs, breed horses showed a large number of private indels with codon changing effects in genes involved in immune system processes. Some of these genes play a role in the cell surface receptor linked signal transduction (olfactory receptor gene 56A4, OR56A4), others affect antigen processing and presentation (SRSF protein kinase 2, SRPK2; CALPAIN-8, CAPN8) [45][46][47]. The heterozygosity of variants in these immunity-related regions could be shown to be high in comparison to the detected variants of nonbreed horses. Comparative analysis of modern domestic horses with a Przewalski horse suggested that regions of significant reduction of the genetic diversity in modern horses correspond to specifically selected loci while regions of high densities of mutations give evidence of continuous selection as it could be shown in immunityrelated and olfactory genes in modern horses [8]. Our data confirm this assumption which presumably reflects the importance for variability of the immune system especially in breed horses [48]. In contrast to that, the investigated non-breed horses showed an enrichment of private mutations in genes affecting metabolism, anatomical structures, morphogenesis and cellular components which might give us an idea of the genetic background of their characteristics. We can only speculate if private mutations detected in pheromone and odorant binding vomeronasal receptor 1 (EQUCABV1R928) or in the probable cation-transporting ATPase 13A4 (ATP13A4) as well as ATPase, class VI, type 11B (ATP11B), that play a role for chemoreception, might be involved in non-breed specific abilities. Nevertheless, despite similar characteristics to the Duelmener, the heterozygosity was inferior in the Sorraia horse in comparison with all other horses. We assume that this result is consistent with previous assumption that the Sorraia population shows losses in the level of heterozygosity as it has undergone a genetic bottleneck [49].
In conclusion, our analysis is provided to give an insight into possible interrelations between populations and specific characteristics. We suggest that this data of five horses form a basis for future verification studies in a larger number of horses that will build on these results and help to elucidate specific genetic features by further elimination of individual variants and by increased sequence  Silver coat color, congenital eye disease 6 73,665,304 PMEL17 C/T substitution: Arg/Cys C/C C/C C/C C/C C/C [27] Macchiato, hearing loss 16 20,103,081 MITF T/C substitution: Asn/Ser T/T T/T T/T T/T T/T [24] Glycogen branching enzyme deficiency 26 8,217,062 GBE1 C/A substitution: Tyr/stop C/C C/C C/C C/C C/C [29] Equine hyperkalemic periodic paralysis 11 15,500,439 SCN4A C/G substitution: Phe/Leu C/C C/C C/C C/C C/C [30] Polysaccharide storage myopathy (PSSM type 1) 10 18,940,324 GYS1 G/A substitution: Arg/His G/G G/G G/G G/G G/G [31] Malignant hyperthermia 10 9,554,699 RYR1 C/G substitution: Arg/Gly C/C C/C C/C C/C C/C [32] Hereditary equine regional dermal asthenia 1 128,056,748 PPIB G/A substitution: Gly/Arg G/G G/G G/G G/G G/G [33,34] Junctional epidermolysis bullosa 8 45,603,643 LAMC2 C insertion: frameshift/stop Not detected Not detected Not detected Not detected Not detected [35,36] Foal immunodeficiency syndrome 26 30,660,224 SLC5A3 C/T substitution: Pro/Leu C/C C/C C/C C/C C/C [37] coverage. Although we cannot exclude that some private mutations could be due to false detection or might be individual mutations for the analyzed horses we suppose that our data generally reflect the specific characteristics of breeds and non-breeds and give an idea of what the main developmental focus of each group is.

Ethics statement
All animal work has been conducted according to the national and international guidelines for animal welfare. Primary data analysis was carried out with the standard Illumina pipeline. Sequencing reads were trimmed from the end of the read until the first base over Q10 and reads shorter than 40 bp were discarded. The Genome Multitool (GEM) [50] mapper was used to identify all alignments to the reference (EquCab 2.70) with 4 or less mismatches which included up to one insertion or deletion of up to 25 bases. To map reads with higher divergence we additionally performed alignment with BFAST [51].

Variant detection
For the discovery of SNPs and indels, we applied the Genome Analysis Toolkit (GATK) version 2.7-2 [52,53]. Further basic statistics like the total read depth for each position, consensus quality, allele frequency and genotype quality for the variant in one sample were computed by SAMtools (Sequence Alignment/Map) and BCFtools from the SAMtools package [54]. The chromosomal enrichment of detected variants was accounted dividing the number of variants by the length of the chromosome (bp) Racing performance Racing performance Racing performance Racing performance 22 22684390 COX4I2 C/T substitution C/C C/T T/T C/T C/C [40] in order to compute the average number of variants per base pair.  [11]. A concordance of our detected variants with these data was calculated by the position of the variants in the genome.

Functional annotation and analysis
We performed functional annotation of the detected variants using the genetic variant annotation and effect prediction toolbox SNPEff [13], version 3.1 that provided lists of expected effects, their position and involved genes affected by these variants. The variants were categorized by their functional class and impact. Finally, the data of calculated effects produced in known genes were provided for further analysis in a VCF output file. We further categorized this data by their influence on coding or regulative regions in breed and non-breed horses. All genes were converted to human ortholog genes using g: Profiler [55,56] in order to improve the identification of gene ontology terms. The raw and Bonferroni corrected P-values for the enrichment analysis of genes involved in biological processes were computed using PANTHER (Protein ANalysis THrough Evolutionary Relationships, version 8.0) classification system [57,58]. Values for heterozygosities were defined for all 10,193,421 sites with SNPs. In addition, we calculated heterozygosities for regions with private variants in immunity related genes using SAS/Genetics. Submit your manuscript at www.biomedcentral.com/submit