Data
Three different sets of genomic data were used for this study:
Set 1: Individual sequence data of 68 chickens from 68 different populations, sequenced within the scope of the EU project Innovative Management of Animal Genetic Resources (IMAGE; www.imageh2020.eu) [48]. These were complemented by 25 sequences (17 + 8) from two commercial white layer lines, 25 sequences (19 + 6) from two commercial brown layer lines, and 40 sequences (20 each) from two commercial broiler lines [23]. In total, this set comprised 158 sequences from 74 populations.
Set 2: Pooled sequence data from 37 populations (9–11 chickens per population) [17]. All of these chickens, except four from two populations, were also part of set 3.
Set 3: Genotypes of 1566 chickens from 74 populations, either genotyped with the Affymetrix Axiom™ 580 k Genome-Wide Chicken Array [15] (a sub-set of the Synbreed Chicken Diversity Panel, SCDP [49]) or complemented with genotypes derived from the sequences of set 1.
The overlap between the data sets is shown in Fig. 1, and accession information for the raw data of each sample can be found in Supplementary File 1. Each data set has its own characteristics: while individual sequences are considered the gold standard throughout this study, genotypes of the Affymetrix Axiom™ 580 k Genome-Wide Chicken Array [15] are biased towards variation that is common in the commercial chicken lines [16], and pooled sequences only allow an estimate of population allele frequencies and show a slight bias due to sample size and coverage (Supplementary File 2) [50, 51].
Calling of WGS SNPs and generation of genotype set
Alignment of the raw sequencing reads against the latest chicken reference genome GRCg6a [52] and SNP calling were conducted for the individually and pooled sequenced data following GATK best practices [53, 54]. As the Affymetrix Axiom™ 580 k Genome-Wide Chicken Array [15] does not contain enough SNPs on chromosomes 30–33 for imputation (and chromosome 29 is not annotated in the reference genome), only chromosomes 1–28 were used. This resulted in 20,829,081 biallelic SNPs on chromosomes 1–28, which were used in further analyses. Additionally, all individual sequences were genotyped at the positions of the Affymetrix Axiom™ 580 k Genome-Wide Chicken Array [15].
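The following is a minimal sketch of the alignment and per-sample calling steps, assuming BWA-MEM and GATK4; file names are placeholders and several best-practice steps (duplicate marking, base-quality recalibration, joint genotyping, SNP filtering) are only indicated in comments.

```python
import subprocess

# Placeholder file names; duplicate marking, base-quality recalibration and
# joint genotyping are omitted for brevity.
REF = "GRCg6a.fa"       # chicken reference genome GRCg6a
SAMPLE = "sample01"

def run(cmd):
    """Run a shell command and stop if it fails."""
    subprocess.run(cmd, shell=True, check=True)

# 1) Align the raw reads against GRCg6a and sort the alignment.
run(
    f"bwa mem -R '@RG\\tID:{SAMPLE}\\tSM:{SAMPLE}' {REF} "
    f"{SAMPLE}_R1.fastq.gz {SAMPLE}_R2.fastq.gz | "
    f"samtools sort -o {SAMPLE}.sorted.bam -"
)

# 2) Per-sample variant calling in GVCF mode.
run(
    f"gatk HaplotypeCaller -R {REF} -I {SAMPLE}.sorted.bam "
    f"-O {SAMPLE}.g.vcf.gz -ERC GVCF"
)

# 3) Joint genotyping (GenotypeGVCFs) and restriction to biallelic SNPs on
#    chromosomes 1-28 would follow here.
```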
To ensure compatibility between array and WGS data, the genotypes of the Synbreed Chicken Diversity Panel were lifted over from galGal5 to galGal6 and corrected for switches of reference and alternate alleles. Only SNPs with a known autosomal position, call rates > 0.95 and genotype recall rates > 0.95 were further considered. MAF filters were applied later when subsampling the different sets and were therefore not used in this step. Missing genotypes were then imputed using Beagle 5.0 [35] with ne = 1000 [47] and the genetic map taken from Groenen et al. [55]. This resulted in a final set of 1566 animals from 74 populations (18–37 animals per population) and 462,549 autosomal SNPs, further referred to as the genotype set.
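A minimal sketch of the Beagle 5.0 call used for this imputation step is given below; the jar and file names are placeholders, while ne = 1000 and the genetic map correspond to the settings described above (the window argument is only needed for the later imputation to sequence level).

```python
import subprocess

BEAGLE_JAR = "beagle.jar"   # placeholder for the Beagle 5.0 jar file

def run_beagle(gt_vcf, out_prefix, map_file, ref_vcf=None, window_cm=None):
    """Impute one data set with Beagle 5.0, optionally against a reference panel."""
    cmd = ["java", "-Xmx16g", "-jar", BEAGLE_JAR,
           f"gt={gt_vcf}", f"out={out_prefix}", f"map={map_file}", "ne=1000"]
    if ref_vcf is not None:
        cmd.append(f"ref={ref_vcf}")        # reference panel for imputation
    if window_cm is not None:
        cmd.append(f"window={window_cm}")   # window length in cM
    subprocess.run(cmd, check=True)

# Impute sporadically missing genotypes within the lifted-over array data.
run_beagle("scdp_galgal6_filtered.vcf.gz", "genotype_set",
           "groenen_et_al_galgal6.map")
```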
As Malomane et al. [17] described LD-based pruning as an effective filtering strategy to minimize the impact of ascertainment bias in SNP array data, the genotype set was additionally LD pruned using plink 1.9 [56] with the --indep 50 5 2 flag. This reduced the genotype set to 136,755 SNPs (30%), further referred to as the pruned genotype set.
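A sketch of the pruning step with plink 1.9 and the --indep 50 5 2 flag (a window of 50 SNPs, a step size of 5 SNPs and a VIF threshold of 2) could look as follows; the file prefixes and the chromosome declaration are placeholders.

```python
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# 1) Identify the SNPs to keep (plink writes genotype_set.prune.in / .prune.out).
run(["plink", "--bfile", "genotype_set", "--chr-set", "28",
     "--indep", "50", "5", "2", "--out", "genotype_set"])

# 2) Extract the retained SNPs into the pruned genotype set.
run(["plink", "--bfile", "genotype_set", "--chr-set", "28",
     "--extract", "genotype_set.prune.in",
     "--make-bed", "--out", "pruned_genotype_set"])
```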
A detailed description of the pipeline can be found in Supplementary File 2.
Analyses based on simulation of ascertainment bias within the genotype set
A first comparison was based solely on the 15,868 SNPs of chromosome 10 of the genotype set, which allowed for a high number of repetitions while still being based on a sufficiently sized chromosome. To simulate an ascertainment bias of known strength, an even more strongly biased array was designed in silico from the genotype set for each of the 74 populations (further called discovery populations) by using only SNPs with MAF > 0.05 within the respective discovery population. This simulates the limitation to common variants in the discovery samples, which is the main cause of ascertainment bias. Reference samples for imputation were then chosen in five different ways, with 10 different numbers of reference samples and three repetitions per sampling (a code sketch of these strategies follows the list):
1) allPop_74_740: Equally distributed across all populations by sampling one to 10 chickens per population (74–740 reference samples).
2) randSamp_5_50: 5, 10, …, 50 randomly sampled chickens (5–50 reference samples).
3) randPop_5_50: Five chickens from each of one to 10 randomly sampled populations (5–50 reference samples).
4) minPop_5_50: Five chickens from each of the one to 10 populations most closely related to the discovery population, based on Nei’s distance ([57]; 5–50 reference samples).
5) maxPop_5_50: Five chickens from each of the one to 10 populations most distantly related to the discovery population, based on Nei’s distance ([57]; 5–50 reference samples).
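The following sketch illustrates how the reference samples of one repetition could be drawn under the five strategies; the helper function, its data structures and the exclusion of the discovery population itself in minPop/maxPop are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def choose_reference(strategy, pops, samples_by_pop, nei_dist, discovery_pop, n):
    """
    Return the reference sample IDs for one repetition.
    strategy: 'allPop', 'randSamp', 'randPop', 'minPop' or 'maxPop'
    pops: population names; samples_by_pop: dict population -> sample IDs
    nei_dist: dict population -> Nei's distance to the discovery population
    n: chickens per population (allPop), total chickens (randSamp)
       or number of populations (randPop, minPop, maxPop)
    """
    if strategy == "allPop":        # n chickens from every population
        return [s for p in pops
                for s in rng.choice(samples_by_pop[p], n, replace=False)]
    if strategy == "randSamp":      # n chickens drawn at random over all populations
        all_samples = [s for p in pops for s in samples_by_pop[p]]
        return list(rng.choice(all_samples, n, replace=False))
    if strategy == "randPop":       # five chickens from each of n random populations
        chosen = rng.choice(pops, n, replace=False)
    else:                           # minPop / maxPop: n closest or most distant populations
        ranked = sorted((p for p in pops if p != discovery_pop),
                        key=lambda p: nei_dist[p])
        chosen = ranked[:n] if strategy == "minPop" else ranked[-n:]
    return [s for p in chosen
            for s in rng.choice(samples_by_pop[p], 5, replace=False)]
```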
This resulted in 2200 repetitions of in silico array development and re-imputation per sampling strategy. The reference set was formed by subsetting the full genotype matrix to the reference samples chosen via the above-mentioned strategies and to SNPs with MAF > 0.01 within those reference samples. Imputation of the in silico arrays to the reference set was performed by running Beagle 5.0 [35] with ne = 1000 [47], the genetic distances taken from Groenen et al. [55] and the corresponding reference set. The schematic workflow can be found in Fig. 2.
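The two MAF filters that define the in silico array and the corresponding reference set can be written compactly, assuming a 0/1/2-coded genotype matrix (variable names are hypothetical):

```python
import numpy as np

def maf(genos):
    """Minor allele frequency per SNP from a (samples x SNPs) 0/1/2 genotype matrix."""
    p = np.nanmean(genos, axis=0) / 2.0     # alternate allele frequency per SNP
    return np.minimum(p, 1.0 - p)

def in_silico_sets(geno, discovery_idx, reference_idx):
    """Boolean SNP masks for the simulated array and the imputation reference set."""
    array_snps = maf(geno[discovery_idx]) > 0.05   # simulated ascertainment bias
    reference_snps = maf(geno[reference_idx]) > 0.01
    return array_snps, reference_snps
```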
Analyses were then based on comparisons between the in silico ascertained and later imputed sets and the genotype set, which was considered as the ‘true’ set for those comparisons.
Imputation of genotype set to sequence level
After the initial tests of the imputation strategies by the in silico designed arrays, we imputed the complete genotype set to sequence level, using the available individual sequences as the reference panel. In the first run, one reference sample per sequenced population was chosen (74 reference samples; 74_1perLine) which is equivalent to the first scenario allPop_74 of the in silico array imputation. As we had more than one sequenced individual for the commercial lines, the number of reference samples for the commercial lines was subsequently increased to five reference samples per line (up to 98 reference samples; 98_5perLine). Finally, we used all available individually sequenced animals as reference samples (158 reference samples; 158_all), which resulted in a strong imbalance towards the two broiler lines (20 reference samples per broiler line).
For this step, the Beagle window parameter was increased to 200 cM to ensure sufficient overlap between reference and study SNPs, which was necessary because of the low assembly quality and insufficient array coverage of the small chromosomes. Analyses were then based on comparisons between the genotype set, the pruned set or the imputed sets and the gold standard, the WGS data.
Comparison of population genetic estimators
Ascertainment bias primarily affects the allele frequency spectrum. As populations are affected differently, we first concentrated on two heterozygosity estimates, expected (HE) and observed (HO) heterozygosity, which summarize per-population allele frequency spectra. We additionally included two allele-frequency-dependent distance measures: Wright’s fixation index (FST) [58] and Nei’s distance (D) [57].
HO, the proportion of heterozygous genotypes in a population, could only be calculated when the genotypic status of a population was known (individual sequences or genotypes). In contrast, HE could also be calculated from pooled sequences, which allow the estimation of allele frequencies (p). HO and HE (Eq. (1)) were calculated as averages over all loci (l = 1, …, L).
$$ {H}_E=\frac{\sum \limits_l2{p}_l\left(1-{p}_l\right)}{L} $$
(1)
As pooled sequence data come with a slight but systematic underestimation of HE ([50]; Supplementary File 2), HE for pooled sequences was multiplied by the correction factor \( \frac{n}{n-1} \), introduced by Futschik and Schlötterer [50], where n is the number of haplotypes in the pool. This partially corrected the HE estimates for the bias introduced by pooled sequencing (Supplementary File 2).
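A sketch of these heterozygosity estimates, assuming per-locus allele frequencies and a 0/1/2-coded genotype matrix, is given below (the n/(n − 1) correction is only applied to the pooled-sequence estimates):

```python
import numpy as np

def expected_heterozygosity(p):
    """H_E as in Eq. (1): average of 2*p*(1-p) over all loci."""
    return np.mean(2.0 * p * (1.0 - p))

def observed_heterozygosity(genos):
    """H_O: proportion of heterozygous (coded 1) genotypes, averaged over loci."""
    return np.mean(np.mean(genos == 1, axis=0))

def expected_heterozygosity_pooled(p_pool, n_haplotypes):
    """H_E from pooled allele frequencies with the n/(n-1) correction of [50]."""
    return expected_heterozygosity(p_pool) * n_haplotypes / (n_haplotypes - 1)
```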
D was calculated as given by Eq. (2), where Dxy denotes the genetic distance between populations X and Y, and xil and yil represent the frequency of the ith allele at the lth locus in populations X and Y, respectively.
$$ {D}_{xy}=-\ln \left(\frac{\sum \limits_l\sum \limits_i{x}_{il}{y}_{il}}{\sqrt{\sum \limits_l\sum \limits_i{x}_{il}^2\sum \limits_l\sum \limits_i{y}_{il}^2}}\right) $$
(2)
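For biallelic loci, Eq. (2) reduces to the following computation on per-locus alternate allele frequencies (a sketch; the input arrays are hypothetical):

```python
import numpy as np

def nei_distance(x, y):
    """Nei's standard genetic distance (Eq. (2)) between populations X and Y."""
    x_alleles = np.column_stack([x, 1.0 - x])   # per-locus frequencies of both alleles
    y_alleles = np.column_stack([y, 1.0 - y])
    jxy = np.sum(x_alleles * y_alleles)
    jx = np.sum(x_alleles ** 2)
    jy = np.sum(y_alleles ** 2)
    return -np.log(jxy / np.sqrt(jx * jy))
```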
Pairwise FST values between populations X and Y were estimated using Eq. (3), where HTl denotes the HE within the total population at locus l and \( {\overline{HS}}_l \) the mean HE within the two subpopulations at locus l [58].
$$ {F}_{ST}=\frac{\sum \limits_l\left({HT}_l-{\overline{HS}}_l\right)}{\sum \limits_l{HT}_l} $$
(3)
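Eq. (3) can be sketched analogously; equal weighting of the two populations when forming the total population is an assumption of this illustration:

```python
import numpy as np

def pairwise_fst(x, y):
    """Pairwise F_ST (Eq. (3)) from per-locus alternate allele frequencies x and y."""
    p_total = (x + y) / 2.0                                  # total population frequency
    ht = 2.0 * p_total * (1.0 - p_total)                     # H_E of the total population
    hs = (2.0 * x * (1.0 - x) + 2.0 * y * (1.0 - y)) / 2.0   # mean H_E of the subpopulations
    return np.sum(ht - hs) / np.sum(ht)
```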
When estimated from pooled data, both D and FST show a downward bias comparable to that of HE (Supplementary File 2). For D, the effect of ascertainment bias is much larger than the effect of pooling. FST, in contrast, is generally robust against ascertainment bias when a sufficiently large discovery panel was used for array development [10]; its underestimation due to pooled sequencing is therefore larger than the effect of ascertainment bias (Supplementary File 2). We could consequently not dissect the effects of the two biases in the comparisons on sequence level and did not include FST there.
In the absence of ascertainment bias, the estimates of a given set would lie on the line of identity (the diagonal) when regressed against the true values. The magnitude of the bias can thus be defined as the distance of the estimates from that line. We therefore regressed the estimates from biased data (yij) on the unbiased ones (xij), fitting group-specific intercepts (groupi), group-specific slopes (groupi × βi) and a random error (ϵij, \( \upepsilon \sim N\left(0,\mathbf{I}{\sigma}_e^2\right) \)) as in Eq. (4).
$$ {y}_{ij}={group}_i+{group}_i\times {\beta}_i{x}_{ij}+{\upepsilon}_{ij} $$
(4)
For within-population estimators (e.g. HE), the group describes whether a population was used for SNP discovery (discovery population), contributed samples to the reference set (reference population) or neither (application population). Note that in scenarios where reference individuals were present for every population, we only divided them into discovery and application populations. For between-population estimators (FST, D), a group describes the corresponding combination of the two involved population groups. Deviations of the estimated slopes from one and the within-group correlations between heterozygosity and distance estimates from the biased and true sets were used as indicators of the magnitude of bias and of random estimation error.
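The model in Eq. (4) corresponds to a linear regression with group-specific intercepts and slopes; a sketch with statsmodels and toy data (for illustration only) could look as follows:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data for illustration only: one row per population with the biased
# estimate (y), the 'true' estimate (x) and the population's group label.
rng = np.random.default_rng(0)
groups = np.repeat(["discovery", "reference", "application"], 20)
x = rng.uniform(0.1, 0.4, size=60)                       # 'true' H_E values
slopes = {"discovery": 0.8, "reference": 1.0, "application": 0.9}
y = np.array([slopes[g] for g in groups]) * x + rng.normal(0.0, 0.01, 60)
df = pd.DataFrame({"y": y, "x": x, "group": groups})

# '0 + group' fits one intercept per group, 'group:x' one slope per group,
# matching Eq. (4); slopes close to one indicate little bias.
fit = smf.ols("y ~ 0 + group + group:x", data=df).fit()
print(fit.params)
```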
To obtain a measure of the systematic (fixed) estimation error, we additionally calculated the mean overestimation across populations (j = 1, …, J) as in Eq. (5).
$$ \text{mean overestimation}=\frac{\sum \limits_j\frac{\text{biased estimate}_j-\text{true estimate}_j}{\text{true estimate}_j}}{J} $$
(5)
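Eq. (5) translates directly into the following short computation (a sketch on hypothetical vectors of estimates):

```python
import numpy as np

def mean_overestimation(biased, true):
    """Eq. (5): mean relative deviation of the biased estimates from the true ones."""
    biased, true = np.asarray(biased, float), np.asarray(true, float)
    return np.mean((biased - true) / true)
```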
Note that we had more than one (pooled) sequenced chicken for only 45 populations. Comparisons of population estimates on sequence level are therefore limited to these 45 of the 74 populations that were used as study and reference set for the imputation process.
Assessment of imputation accuracy
Imputation accuracy of the in silico designed arrays was assessed using the Pearson correlation (r) between true and imputed genotypes [45, 59]. The Pearson correlation puts a higher relative weight on imputation errors in rare alleles than plain allele or genotype concordance rates [59]. For the imputation to sequence level, we used leave-one-out validation to assess per-animal imputation accuracy. In our case, however, leave-one-out validation yields slightly downward biased accuracy estimates for the non-commercial samples (Figure S11, Supplementary File 2): as the test sample was the only sequenced sample of its population and had to be removed from the reference set, no closely related sample remained in the reference set and the accuracy was consequently underestimated. We additionally used Beagle's internal quality measure, the dosage r-squared (DR2) [60], to evaluate per-SNP imputation accuracy. This, however, only reflects the theoretical imputation accuracy and cannot capture biases due to biased reference sets.
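A sketch of the per-animal accuracy measure is given below, assuming genotype matrices with animals in rows and SNPs in columns (missing values coded as NaN); DR2 is taken directly from the Beagle output and is not reproduced here.

```python
import numpy as np

def per_animal_accuracy(true_genos, imputed_genos):
    """Pearson correlation between true and imputed genotypes, one value per animal."""
    accuracies = []
    for t, i in zip(true_genos, imputed_genos):     # iterate over animals (rows)
        mask = ~np.isnan(t) & ~np.isnan(i)          # ignore missing positions
        accuracies.append(np.corrcoef(t[mask], i[mask])[0, 1])
    return np.array(accuracies)
```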