### Animal Material

Figure 1 provides information on the origin of the nine different West African populations sampled in our study and representatives of the three main types of cattle found in this area. Longhorn taurines were represented by N'Dama individuals from two distinct origins (ND1 and ND2). ND1 individuals (n = 14) originated from the Gaoua herd constituted by acquisitions coming from different villages in the Pays Lobi (Burkina Faso). ND2 (n = 17) samples were collected in the Samandeni herd (Burkina Faso) which was constituted in the early eighties by acquisitions from the Marahoue ranch (Ivory Coast) and originated from northwestern villages from Ivory Coast where zebu had not been introduced [64]. Shorthorn taurines were represented by three different breeds: Baoulé (BAO) samples (n = 29) originated, as ND1, from the Gaoua herd [64], Somba (SOM) individuals (n = 44) were sampled in the breed birthplace across the border near Nadoba (Togo) and Boukoumbe (Benin), and Lagune (LAG) individuals (n = 44) were sampled in the Porto Novo district (Benin) [65]. West African zebus were represented by two populations: Sudanese Fulani or ZFU (n = 43) which was sampled in the Malanville region in Benin [65] and Choah zebus or ZCH (n = 59) which was sampled in the Bol district (Chad) [64]. This latter population was initially sub-divided into two breeds (M'Bororo and Choah Zebus), however, as previously shown [10] and confirmed in our analysis with a dense marker set, no clear differentiation appeared among these. Finally, two hybrid populations were considered: Borgou or BOR (n = 45) which is a stabilized crossbred between shorthorn taurines (LAG or BAO) and was sampled in its region of origin around Parakou in Benin [65] and Kuri or KUR (n = 47) which was sampled from different islands of Lake Chad around Bol [64]. This latter breed was sometimes referred to as a particular longhorn taurine [3], however, several molecular analyses showed a high level of Zebu admixture in it [4, 10]. Moreover, unlike West African taurines, KUR is known to be trypano-susceptible. Three other breeds were considered as outgroups for the detailed genetic structure analysis: ZMA (n = 35) sampled in the Madagascar Island [66] and representing pure zebu [1], AUB (n = 20) sampled in the birthplace of the breed [67] and representing European taurine, and OUL (n = 40) sampled in the North of Morocco to represent North African taurine cattle.

### Genotyping data

Individuals were genotyped on the Illumina BovineSNP50 chip assay [9] at the Centre National de Génotypage (CNG) platform (Evry, France) using standard procedures http://www.illumina.com. Among the 54,001 SNPs included in the chip, only the 51,581 mapping to a bovine autosome on the latest bovine genome assembly Btau_4.0 [68] were retained for further analysis. To limit ascertainment bias favouring SNPs from European origin, we subsequently discarded 13,786 SNPs (~25%) which were not polymorphic (MAF > 0.01) in at least one West African taurine (ND1, ND2, LAG, SOM or BAO) and one West African zebu (ZCH or ZFU). Among the remaining ones, 1,422 SNPs which were genotyped on less than 85% of the individuals from at least one of the nine West African breeds, were also eliminated. An exact test for Hardy-Weinberg Equilibrium (HWE) [69] was subsequently carried out within each breed separately. Based on the obtained p-values, q-values [70] were estimated for each SNP using the R package qvalue http://cran.r-project.org/web/packages/qvalue/index.html. Fifty three SNPs with a q-value < 0.01 in at least one breed were then discarded from further analysis. In total, 36,320 SNPs were finally considered for the study leading to an average marker density of 1 SNP every 70 kb over the genome (Table S8 in additional file 2). Moreover, as shown in Figure S6 (additional file 1) and detailed in Table S8 (additional file 2), the genome coverage was homogeneous with a median distance between consecutive SNPs equal to 47.54 kb. Few large gaps between SNPs were present since the 95^{th} (99^{th}) percentile of this distance was 189 kb (385 kb), the largest gap localized on BTA10 being 2 Mb long. Conversely, less than 0.5% of the distances between successive SNPs were shorter than 20 kb. Genotyping data are given in additional file 3.

### Analysis of Population Structure and characterization of the extent of LD

ASD were computed for each pair of individuals using all available SNP information by a simple counting algorithm: for a given pair of individuals *i* and *j*, ASD was defined as 1-*x*
_{
ij
} where *x*
_{
ij
} represents the proportion of alleles alike in state averaged over all genotyped SNPs. A neighbor-joining tree was computed based on the resulting distance matrix using the R package APE [71]. We subsequently performed a PCA based on all available SNP information using the SMARTPCA software package [14]. As suggested by Patterson et al. [14], we performed LD correction by replacing individual SNP values with the residuals from a multivariate regression without intercept on the two preceding SNPs on the map, providing they were less than 200 kb apart. Results were further visualized using functionalities available in the R package ade4 [72]. The global F-statistics *F*
_{
IT
}, *F*
_{
ST
} and *F*
_{
IS
} were estimated respectively in the form of *F*, *θ* and f [73] using the program GENEPOP 4.0 [74]. GENEPOP 4.0 was also used to estimate diversity for each locus and population both within individuals and among individuals within a population. The within breed *F*
_{
IS
} was derived from the average of these two quantities over all the SNPs. In order to evaluate the reliability of the *F*
_{
IS
} estimates we computed the mean and standard deviation over 10,000 samples of 5,000 randomly chosen SNPs. *F*
_{
ST
} statistics between populations were estimated using both SMARTPCA [14] and GENEPOP [74] which are based on two different models of population divergence.

In order to characterize the extent of LD, we computed the *r*
^{2} measure [23] between each marker pair within each breed separetely using Haploview 4.1 [75].

### Bayesian Model to analyze Differentiation among SNPs

Individual genotyping data were modelled according to the reparameterized extension, recently proposed by Riebler et al. [8], of the initial Bayesian hierarchical model developed by Beaumont and Balding [7]. Briefly and in the bi-allelic SNP case, let *a*
_{
ij
} be the observed reference allele (defined arbitrarily) count in population *j* = 1, ..., *J* at locus *i* = 1, ..., *I*. The conditional distribution given the true (unobserved) allele frequency *p*
_{
ij
} in that population at that locus is assumed to be binomial with parameters *n*
_{
ij
} (twice the number of genotyped individuals in population *j* at locus *i*) and *p*
_{
ij
}: *a*
_{
ij
} | *p*
_{
ij
}~*Bin*(*n*
_{
ij
}, *p*
_{
ij
}) (1). Note that 1) implicitly assumes populations are in HWE or their respective inbreeding coefficients (*F*
_{
IS
}) are null. Non null *F*
_{
IS
} could be taken into account in the model by considering instead that the three possible genotypes are drawn from a multinomial distribution with parameters corresponding to the number of individuals genotyped and genotype probabilities: (1-*F*
_{
IS
}
^{
j
})*p*
_{
ij
}
^{2} + *F*
_{
IS
}
^{
j
}
*p*
_{
ij
}; 2(1-*F*
_{
IS
}
^{
j
})*p*
_{
ij
}(1-*p*
_{
ij
}) and (1-*F*
_{
IS
}
^{
j
})(1-*p*
_{
ij
})^{2} + *F*
_{
IS
}
^{
j
}(1-*p*
_{
ij
}) [18]. Nevertheless, for co-dominant markers such as SNP and given the range of *F*
_{
IS
} values in our study (see Results), the binomial distribution seems to be reasonable (Figure S3 in additional file 1). The second step assumes that the true allele frequencies *p*
_{
ij
} are themselves sampled from a Beta distribution: *p*
_{
ij
} | *λ*
_{
ij
}, *x*
_{
i
}~*Beta*(*λ*
_{
ij
}
*x*
_{
i
}, *λ*
_{
ij
}(1-*x*
_{
i
})) (2). This distribution relies on an infinite Wright island model involving mutation, drift and migration at its equilibrium state [76, 77]. Under this model, *x*
_{
i
} might be interpreted as the frequency of the chosen reference allele at locus *i* in the gene pool from which each allele frequency *p*
_{
ij
} is sampled. The scaling parameter *λ*
_{
ij
} reflects the gene flow from the gene pool to population *j*. Under Wright's model, this parameter is specific to each population *j* but remains homogeneous over loci so that any departure from that property may indicate that locus *i* is no longer neutral. Note that under the Beta distribution, the first two moments can be simply expressed as *E(p*
_{
ij
}) = *x*
_{
i
}and *Var(p*
_{
ij
}) = *x*
_{
i
}(1-*x*
_{
i
})(1+*λ*
_{
ij
})^{-1}. Actually (1+*λ*
_{
ij
})^{-1} can be identified as a *F*
_{
ST
}
^{
ij
} coefficient [17]. The next level of the model consists of specifying the distributions of *λ*
_{
ij
} (or *F*
_{
ST
}
^{
ij
}) and of *x*
_{
i
}. These frequencies are nuisance parameters and uncertainty about them will be taken into account by assigning to them non informative prior such as a *Beta(1,1)* distribution (*e.g.* [78]). Following Beaumont and Balding [7], the *λ*
_{
ij
}'s are modelled via a linear model on the logistic transformation of the *F*
_{
ST
}
^{
ij
}. Since F_{
ST
}
^{
ij
}/(1-F_{
ST
}
^{
ij
}) = 1/*λ*
_{
ij
}, we can write this model in terms of *η*
_{
ij
} = *log(F*
_{
ST
}
^{
ij
}/(1-*F*
_{
ST
}
^{
ij
})) = -log(*λ*
_{
ij
}) as: *η*
_{
ij
} = *α*
_{
i
}+ *β*
_{
j
}+ *γ*
_{
ij
} where *α*
_{
i
} is a locus effect, *β*
_{
j
} a population effect and *γ*
_{
ij
} an error term corresponding to a departure of the logit of *F*
_{
ST
}
^{
ij
} from the additive decomposition. For theoretical and computational reasons, it is more efficient to consider the previous decomposition under a hierarchical Bayesian structure than under its basic form, and implement it as follows: *η*
_{
ij
} | *α*
_{
i
}, *β*
_{
j
}, *σ*
_{
γ
}
^{2}~_{
iid
}
*N*(*α*
_{
i
} + *β*
_{
j
}, *σ*
_{
γ
}
^{2}) with *β*
_{
j
} | *μ*
_{
β
}, *σ*
_{
β
}
^{2}~_{
iid
}
*N*(*μ*
_{
β
}, *σ*
_{
β
}
^{2}) and *α*
_{
i
} | *σ*
_{
α
}
^{2}~_{
iid
}
*N*(0, *σ*
_{
α
}
^{2}). Introduction of additional levels by defining priors on the variance components *σ*
_{
α
}
^{2}, *σ*
_{
β
}
^{2} and *σ*
_{
γ
}
^{2} is straightforward but adds marginal gain in the estimates and requires high supplementary computational costs (Gautier and Foulley, unpublished data). We thus gave for these components the known values proposed initially by Beaumont and Balding (2004): *σ*
_{
α
}
^{2} = 1, *σ*
_{
β
}
^{2} = 3.24 and *σ*
_{
γ
}
^{2} = 0.25. The prior mean *μ* for the *β*
_{
j
} was also taken as -2.0. Recently, Riebler *et al.* [8] introduced in the above logistic model an auxiliary indicator variable *δ*
_{
i
} attached to each locus specifying whether it can be regarded as selected (*δ*
_{
i
} = 1) or neutral (*δ*
_{
i
} = 0). They demonstrated the efficiency of this approach for improving the power of the statistical procedure, particularly for the detection of loci under balancing selection. Under this reparameterized model, the previous parameters *α*
_{
i
} are written as: *α*
_{
i
} = *δ*
_{
i
}
*α*
_{
i
} * with *α*
_{
i
} * | *σ*
_{
α
}*^{2}~_{
iid
}
*N*(0, *σ*
_{
α
}*^{2}). The model further assumes a Bernoulli distribution for the indicator *δ*
_{
i
} variable with parameter *P*: *δ*
_{
i
} | P ~*Bin(1, P)*. P is itself assumed to be *Beta* distributed to take into account uncertainty on this crucial parameter. Here we took *P ~Beta(0.2,1.8)* [8], thus assuming that a priori 10% of the loci are on average under selection, but within a very large range of possible values as the 95% credibility interval goes from almost 0 to 65%. Note that the value *σ*
_{
α
}*^{2} can be derived from *σ*
_{
α
}
^{2} since by construction *σ*
_{
α
}
^{2} = *E(P) σ*
_{
α
}*^{2} (hence *σ*
_{
α
}*^{2} = 10).

The posterior distributions of the different parameters of interest were estimated via MCMC procedures as previously described [8] from 10,000 post burn-in samples (with a burn-in period of 3,000 iterations) and a thinning interval of k = 25. Convergence was checked using standard criterion. Parameter estimates for the *F*
_{
ST
}
^{
ij
} were taken as the median of the posterior distribution.

### Decision rule to identify non neutral loci

In order to identify outlier loci, we decided to base the decision rule on a Bayes Factor

*BF*
_{
i
} defined as the ratio of the posterior to the prior odds that locus

*i* is selected:

As

*δ*
_{
i
} is an indicator variable, the prior and posterior probabilities that

*δ*
_{
i
} = 1 are easy to compute since they reduce to expectations as shown below:

where *π(P)* is the density of the distribution of *P* defined above. To make interpretation of the *BF*
_{
i
} easier, we expressed them in deciban units (dB) *ie dB*
_{
i
} = 10*log*
_{10}(*BF*
_{
i
}) so that 10 dB corresponds to an odds ratio of 10, 20 dB to an odds ratio of 100, 30 dB to an odds ratio of 1,000 and so forth. We then applied the Jeffreys' rule [79] which quantifies the strength of evidence (here to consider a SNP as being under selection) based on *BF* using the following scale: "strong evidence" when *10*<*dB*
_{
i
}<*15*, "very strong evidence" when *15*<*dB*
_{
i
}<*20* and "decisive evidence" when *dB*
_{
i
}>*20*. Notice that the BF captures the change in the odds in favor of *δ*
_{
i
} = 1 as we move from the prior to the posterior. In that way, it makes this criterion independent of the prior distribution of *δ*
_{
i
} and thus guarantees some robustness.

### Simulated data

Following Foll and Gaggiotti [20], four data sets each comprising 50,000 (unlinked) SNPs with respectively 9300 (P = 0.186), 5000 (*P = 0.1*), 500 (*P = 0.01*) and 50 (*P = 0.001*) under selection (*α*
_{
i
} ≠ *0*) were simulated under the inference model. To mimic characteristics of our samples, we considered 9 populations with parameters *β*
_{
j
} equal to the ones estimated on the real data (mean of the corresponding posterior distribution). For each pair of simulated locus/population, we subsequently simulated the *η*
_{
ij
} parameter by adding to the corresponding *β*
_{
j
}, a value of *γ*
_{
ij
} sampled from a Gaussian distribution (*γ*
_{
ij
}
*~N(0,0.25)*) and a value for *α*
_{
i
} equal to 0 if the locus was neutral or sampled from a Gaussian distribution (*α*
_{
i
}
*~N(0,10)*) if the locus was under selection. The frequency *x*
_{
i
}in the ancestral population (or migrant gene pool) was sampled from a Beta distribution with parameters equal to 0.7. This is equivalent to assume the ancestral population was at equilibrium with *4Nμ* = *0.7* where *N* is the effective population size and *μ* the mutation rate [80]. Based on these simulated values for *x*
_{
i
}and *η*
_{
ij
} we subsequently derived the *p*
_{
ij
} parameter of the binomial distribution from which the simulated observed *a*
_{
ij
} was sampled (taking *n*
_{
ij
} = 80 for all *i* and *j*). Only SNPs displaying MAF>0.01 in at least one simulated West-African taurine and one West-African zebu (identified according to the simulated *β*
_{
j
}) were conserved which led to the exclusion of few SNPs (displaying *α*
_{
i
} too high in general). To a first approximation, this also allowed the mimicking of the ascertainment bias introduced by the SNP selection criterion. The posterior distributions of the different parameters of interest were then estimated as described above from 1,000 post burn-in samples (with a burn-in period of 3,000 iterations) and a thinning interval of k = 25. *BF*
_{
i
} for each simulated loci were then computed (see above) and used to compute the (true) FDR and FNR for a given threshold. To investigate relationships between the coefficient of selection and the magnitude of the locus effect *α*
_{
i
}we performed simulations under a simplified version (without mutation) of the Wright-Fisher island model with selection proposed by Beaumont and Balding [7]. Briefly, 10 random mating populations with a constant size of 1,000 individuals and deriving from the same ancestral population were reproduced for 500 generations assuming a global *F*
_{
ST
} of 0.15. In total, 5,000 neutral SNPs and 500 SNPs for each investigated coefficient and mode of selection were simulated. Initial allele frequencies were drawn from a Beta distribution with both parameters equal to 0.7. At the end of the process, SNPs displaying a MAF <0.01 in at least 2 populations were discarded. Locus specific *F*
_{
ST
} were then computed following Weir and Cockerham [73]. Following Foll and Gaggiotti [20], for each type of selection considered, the corresponding average *F*
_{
ST
} was compared to the one obtained under neutrality (*α*
_{
i
} = *0*) to estimate the average SNP effect.

### Annotation of the SNPs

Because the annotation of the bovine genome is still sparse, the gene content information was derived from the TransMap cross-species alignments available in the UCSC Genome Browser http://genome.ucsc.edu/. For closer evolutionary distances, the alignments are created using syntenically filtered BLASTZ alignment chains, resulting in a prediction of the orthologous genes in cow. In total, 46,598 different RefSeq identifiers were anchored to the Btau_4.0 bovine genome assembly http://genome.ucsc.edu/. Considering that most consecutive SNPs on the map were separated by more than 20 kb and the relatively high level of LD at shorter distance (see Results and [11]) a SNP was considered as representative of a gene if it was localized within the boundary positions of the gene that extended by 15 kb upstream and downstream. According to this criterion, 15,360 out of the 36,320 SNPs were representative of 17,190 different TransMap RefSeq identifiers. Subsequent annotation and analyses were carried out with the Ingenuity Pathway Analysis (IPA) software v7.0 (Ingenuity Systems Inc., USA, http://www.ingenuity.com/). Among the 17,190 different TransMap RefSeq identifiers, 17,151 identifiers (99.7%) were represented in the Ingenuity Pathway Knowledge Base (IPKB) and corresponded to 7,177 different genes further considered as the reference set. Finally, 15,336 SNPs were located within a gene, leading to an average of 2.21 SNPs per annotated gene. Notice that 421 of these SNPs were representative of more than one gene.

### Functional and Network Analyses

Functional and Networks analyses were carried out using an approach similar to the one described in [81]. Based on the information contained in the IPKB, IPA allows performing both functional and network analyses. The functional analysis aims at identifying the most significant biological functions represented in the candidate gene list from the reference set. Note that among the genes under selection, only those associated with at least one functional annotation in IPKB were eligible for functional analysis. The most significant functions are then obtained by comparing functions associated with eligible genes against functions associated with all the genes in the reference set using the right-tailed Fisher's exact test. The network analysis aims at searching for interactions (known from the literature) between candidate genes and all other IPKB molecules (genes, gene products or small molecules) and result in the definitions of networks which contains at most 35 molecules (including candidate genes). For each network, a score *S* is computed based on a right-tailed Fisher exact test for the overrepresentation of candidate genes (*S* = -*log(p-value)*). In our study, networks were considered as significant when S>3 and their associated top functions and diseases were further reported.

### Identification of regions under selection

In order to identify regions under selection (with an unexpectedly high proportions of SNPs subjected to selection), we followed the locally adaptive procedure which allows to account for variations in distance between the different tested positions [81, 82]. Individual SNP *BF* values were first smoothed over each chromosome with a local variable bandwidth kernel estimator [83]. We further performed 250,000 permutations to estimate local p-values which were corrected for multiple testing by computing q-values (see above).