The adaptation to new environments and artificial selection based on distinct breeding goals has shaped the phenotypic appearance of cattle since domestication. The Industrial Revolution with growing city populations stimulated the development of specialised dairy, beef and dual-purpose breeds at turn of the 18th to 19th century . This progressive breed differentiation was accelerated by introduction of breed standards, which accumulated specific phenotypes that visibly distinguished particular breeds. Parallel introduction of herdbooks reduced gene flow between breeds. The recent advances in quantitative genetics and reproductive techniques have enabled the application of enormous selection intensity within particular cattle breeds. Uncovering the genetic footprints of this strong recent artificial selection could give an insight into the mechanisms of selection in general and could, moreover, help to assign chromosomal regions related to important physiological and economical traits. With this aim, we performed a genome-wide scan of recent artificial selection in eight selected and two mainly artificially unselected cattle breeds using XP-EHH.
To avoid population stratifications within the sets of animals and to be able to assess the genetic diversity of the ten breeds, we performed PCA and pair-wise F
-estimations before conducting XP-EHH analyses. Taken together, the PCA- and F
-results imply that the two artificially unselected breeds are not well differentiated and still cover a considerable part of the original genetic diversity. In contrast, artificially selected breeds show significantly higher differentiation, e.g. the average F
-value for GLW (0.128) is around twice the average for IMB (0.065, Table 2). These results confirm findings of our previous diversity studies based on microsatellite markers [27, 28].
Using XP-EHH we were able to identify between 32 (IMB) and 229 (MWF) selection signatures per breed and some promising candidate genes. The overlap with other studies was only moderate. This moderate agreement was no surprise at all and might be due to a number of parameters. One is the discrepancy in the power of methods used to identify selection signatures in different designs. Since this is the first genome-wide study using XP-EHH in cattle it is not possible to compare the results to another study utilizing the same method. Some previous studies based on long haplotypes using methods like EHH  or iHS  were looking for selection within and not between breeds. The applied methods in these studies have low power or are even unable to detect signatures of selected alleles at high frequency (>0.8) . Moreover, the breeds investigated as well as the design of investigation (breeds separately vs. breeds with similar breeding purpose combined to one group) have major impact on the results. However, if we compare our results to that of Gautier and Naves , who detected 23 selection signatures in Creole Cattle using average ancestry methods, iHS and especially Rsb, which is similarly to XP-EHH a haplotype based method for across population study , we found considerable overlap. They detected 19 of the 23 signatures using the Rsb approach and were able to define candidate genes for 11 . About one third (PDE1B, KDR, SAMD12 and SLC7A5) were also declared as positional candidate genes in at least one of the investigated breeds in our study. The Creole cattle represents an admixture of European, African and zebu cattle kept in tropical environments. However, pure European cattle breeds investigated in this case are mainly kept in -and are therefore adapted to- a temperate climate. Considering these discrepancies, the overlap in question is quite remarkable. Moreover, corresponding signatures can be assumed to represent more adaptions to selective pressures concerning performance than environment.
Additional reasons for divergences from previous studies or our own expectations may be due to the discrepancies in SNP densities and animal numbers that might influence the accuracy of results. For example, the surrounding SNPs of MSTN and MC1R (two obvious targets of selection) were more than 520 kb (MSTN) respectively 490 kb (MC1R) apart. Such huge gaps can bias haplotyping and the XP-EHH estimation on both sides of the gap thus precluding an adequate signature mapping within the gap. Although both, MSTN and MC1R, are surrounded by significant SNPs it is obvious (Figure 2) that they seem to lie in a sXPEHH depression compared to a little more distant SNPs. Therefore, especially for less distinctive signatures a similar SNP distribution might prevent its detection or at least disjoint an otherwise consistent signature.
In general, a high level of neutral genetic diversity was assumed in the effectively large ancestral population of the wild progenitors . Domestication eventually caused some variants to become quite advantageous in a possibly very small founder population . Even if these variants had only reached the frequency of 0.01 in the effectively very large wild auerochs population the “derived” allele could have resided within different haplotypes. If more than one of those haplotypes passed the domestication bottleneck and survived until the period of breed formation it would no longer match the “classical selective sweep” model. This circumstance complicates their detection with diverse methods, especially if the initial allele frequency was already moderate . Our design is optimised to detect the more recently selected variants involved in breed specific phenotypes and improved performance of divergently selected breeds. Although genomic scans in our design can yield an unrepresentative subset of loci that contribute to genetic gains or adaptations  we do not expect an increase of false positives due to predominant recent positive selection on standing variation [23, 81].
The intensity of the artificial selection is inevitably associated with the number of parents of the next generation . This is reflected in both, current effective population size (e.g. N
) and number of positively selected sites (Table 4). Therefore, negative correlation between N
and the number as well as the extent of selection signatures is expected and cannot be considered as evidence for mapping of genomic regions due to drift alone. Admittedly, with further reduction in sample size the estimate of N
gets simultaneously smaller. This means that the varying sample sizes (31 for BBB up to 50 for DFV, FGV, BBV and RH) are potentially able to affect N
-results. However, sample sizes are not chosen freely but dependent on the degree of familial relationship within the sampled subpopulations as measured by UAR. Filtering subpopulations using UAR will reduce the sample of populations that have a smaller N
more than those with higher N
, since individuals in effectively small populations are in general closer related. The observed relationship between estimated N
and sample size might thus be a result of UAR filtering that determined sample size. Taken together, we conclude that the estimated N
is related to the true N
. This conclusion is strongly supported by the fact that the estimated N
–values match with our expectations especially regarding the relative sizes and recent demographical processes of the investigated breeds. Therefore, we assumed that different sample size or N
do not bias our design although great differences between compared breeds might not be advantageous. However, the final number of selection signatures after excluding all singletons was more strongly correlated with estimated N
and the estimated negative correlation was even stronger for N
. Since N
coincides with the breed founding period, here detected selection signatures reflect a combination of breed specific phenotypes and quantitative traits under partly divergent artificial selection in artificially isolated subpopulations.
Regardless of whether breed characteristics or quantitative traits are predominantly mapped, the main question is where to set the significance threshold in order to get as few as possible false positives due to genetic drift, while retaining as much as possible of the truly selected sites. Unlike for human evolution, a precise demographic model as a basis for the coalescence simulation is not available for the diversification of cattle breeds. But even for human data there is considerable uncertainty in the models simulated, which makes the usefulness of assigning formal p-values debatable [17, 36]. Therefore, many studies in both, human and animal data, set the significance threshold empirically (e.g. [12, 13, 17, 19, 20, 36, 83, 84]). For this, the thresholds are predominantly set freely without any relation to real or simulated data (distinct percentage of most extreme values in the empirical distribution). This is the reason why alternative approaches for the determination of the significance threshold are needed. For this purpose, we introduced the usage of artificially unselected breeds for the determination of this threshold. It is quite clear to us that the proposed thresholds have no formal statistical validity but in comparison to arbitrary empirical thresholds we consider our solution as an improvement for domesticated species without living wild progenitor. We tested this approach by detecting the selection signatures around three distinct phenotypes typical for one individual breed (polledness in GLW, double muscling in BBB and red coat colour in RH). At first sight it might be surprising that the selection signature in GLW was only found in comparison with two breeds (RH, BBV) since all investigated breeds, excluding GLW, are horned. However, we demonstrated in two previous studies that the polled mutation resides on the most frequent haplotype over all breeds [46, 85]. This reduced haplotype diversity around the polled locus might be due to an ancient selection signature prior to breed formation and prohibits a distinct signature mapping for at least some control breeds. Similar selective sweeps present in all investigated subpopulations might therefore be missed by XP-EHH. It thus affirms our approach that the polled signature in GLW could nonetheless be detected in two comparisons (RH and BBV). Although further improvements to our significance determination approach are possible, we believe that it is a promising step when it comes to keeping false-positive rates under control.
The approach of proposing candidate genes in the vicinity of the peak location within a selection signature represents a common way to identify a possible candidate gene (e.g. ). However, especially in quite extended signatures that incorporate a lot of genes and are therefore likely to occupy more than one target of positive selection, promising candidate genes might be lost. This is the main reason why we decided against this approach and reported all positional candidate genes (Additional file 4: Table S2, Additional file 5: Table S3, Additional file 6: Table S4, Additional file 7: Table S5, Additional file 8: Table S6, Additional file 9: Table S7, Additional file 10: Table S8, Additional file 11: Table S9, Additional file 12: Table S10 and Additional file 13: Table S11). The circumstance of a possible bimodal signature had already been mentioned when presenting the selection signature around MAP2K6 (Figure 3).
We also detected a number of signatures (480 over all ten breeds) that did not incorporate any gene at all. This is not new in genome-wide searches for signatures of selection and has been discussed in human studies  as well as in livestock animals [10, 83, 84, 86]. In some cases the selected gene might be just a few kb outside the determined signature, some might be false positives but for most of these regions the underlying selected structures might simply not be annotated or even not known as yet. There are highly significant and by multiple comparisons confirmed signatures without any known gene inside. For example, we detected a signature in the case breed GLW (BTA7: 79,494,846-80,987,053) only but it was confirmed five times (significant in comparison to five “control breeds”). Additionally, there are signatures detected at the same position in multiple case breeds (e.g. a common significant region for BBB, DFV and IMB on BTA9: 12,134,014-12,252,182). Such analogies within and between case breeds are unlikely to occur only by chance due to random drift. Instead, it is more likely these signatures harbour yet unknown genes. It is certainly intriguing that we found 97 of the 480 signatures without annotated genes to be overlapping between at least two breeds in 45 distinct regions.
In general, combining different methods of detecting selection signatures  or combining the results of selection signature and association studies  might improve resolution of quite extended signatures and reduce the number of false positives. However, the information gained by such combining methods will always depend on the existence and use of the correct phenotype shaped by the selective pressure. Setting up correct adaptation hypotheses for signatures without genes or signatures based on traits not yet phenotyped, will keep scientists engaged in the coming decade. Single broad selection signatures could include closely linked genes affecting, for instance, important yield traits, coat colour and resistances all differentially selected in the compared breeds. Successive association studies considering the appropriate traits in appropriate models could be able to decipher such a complex and broad selection signature signal into involved phenotypes.
Many differentially selected phenotypes within cattle breeds were the subject of selection for novelty. This practice was common due to the human tendency to select phenotypes visibly distinguishing particular breeds. We adapted a design so that it should be able to detect a large fraction of differentially selected breed characteristics as well as differentially selected (quantitative) traits. Nevertheless, a final validation for both the way of defining the significance thresholds based on artificially unselected breeds as well as for the found signatures of selection will only be possible when further studies are performed and some candidate genes are irrevocably identified. This proof of causality will be easier to achieve for less extended signatures that include only a few or just one candidate gene. The experiences gathered by resolving these “simpler” signatures might in turn lay the foundation for dissecting the more extended signatures.