Benchmarking phasing software with a whole-genome sequenced cattle pedigree

Oget-Ebrad, Claire; Kadri, Naveen Kumar; Moreira, Gabriel Costa Monteiro; Karim, Latifa; Coppieters, Wouter; Georges, Michel; Druet, Tom

doi:10.1186/s12864-022-08354-6

Published: 15 February 2022

BMC Genomics volume 23, Article number: 130 (2022)

Abstract

Background

Accurate haplotype reconstruction is required in many applications in quantitative and population genomics. Different phasing methods are available but their accuracy must be evaluated for samples with different properties (population structure, marker density, etc.). We herein took advantage of whole-genome sequence data available for a Holstein cattle pedigree containing 264 individuals, including 98 trios, to evaluate several population-based phasing methods. These data represent a typical example of a livestock population, with low effective population size, high levels of relatedness and long-range linkage disequilibrium.

Results

After stringent filtering of our sequence data, we evaluated several population-based phasing programs including one or more versions of AlphaPhase, ShapeIT, Beagle, Eagle and FImpute. To that end we used 98 individuals having both parents sequenced for validation. Their haplotypes reconstructed based on Mendelian segregation rules were considered the gold standard to assess the performance of population-based methods in two scenarios. In the first one, only these 98 individuals were phased, while in the second one, all the 264 sequenced individuals were phased simultaneously, ignoring the pedigree relationships. We assessed phasing accuracy based on switch error counts (SEC) and rates (SER), lengths of correctly phased haplotypes and the probability that there is no phasing error between a pair of SNPs as a function of their distance. For most evaluated metrics or scenarios, the best software was either ShapeIT4.1 or Beagle5.2, both methods resulting in particularly high phasing accuracies. For instance, ShapeIT4.1 achieved a median SEC of 50 per individual and a mean haplotype block length of 24.1 Mb (scenario 2). These statistics are remarkable since the methods were evaluated with a map of 8,400,000 SNPs, and this corresponds to only one switch error every 40,000 phased informative markers. When more relatives were included in the data (scenario 2), FImpute3.0 reconstructed extremely long segments without errors.

Conclusions

We report extremely high phasing accuracies in a typical livestock sample. ShapeIT4.1 and Beagle5.2 proved to be the most accurate, particularly for phasing long segments and in the first scenario. Nevertheless, most tools achieved high accuracy at short distances and would be suitable for applications requiring only local haplotypes.

Background

Haplotype phasing consists of reconstructing the haplotypes inherited from each parent. On autosomes, diploid individuals carry two alleles (possibly identical) at polymorphic sites, each allele being inherited from one of the two parents. The combination of alleles on one of the homologous chromosomes is called a haplotype, or phase. However, genotyping data obtained with genotyping arrays or from whole-genome sequencing experiments are typically unphased, the parental origin of each allele remaining unknown. Therefore, statistical phasing methods must be used to determine the set of alleles that belong to each homolog and were thus co-inherited.

Haplotype information can be used in many applications in quantitative and population genomics, including missing genotype imputation [1, 2], identification of identical-by-descent (IBD) segments in outbred or experimental populations [3,4,5], quantitative-trait locus (QTL) mapping [6], haplotype-based association studies [7,8,9] or genomic predictions [10,11,12], demographic inference [13, 14], identification of signatures of selection [15, 16], allele age estimation [17], estimation of linkage disequilibrium (LD)-based recombination maps [18] or identification of cross-over (CO) events in genotyped pedigrees [19, 20]. Haplotypes indeed show higher LD with underlying causative variants and allow the estimation of the length of shared IBD segments, a measure related to the number of generations to their common ancestor [21]. Haplotypic information is also required to understand interactions among tightly linked loci, for instance to study allele-specific expression, to identify deleterious compound heterozygotes or to determine how combinations of variants affect gene expression [22, 23].

Haplotype phasing methods can be divided into two main groups (as reviewed in Browning and Browning [24]): those relying on pedigree relationships (e.g., [25, 26]) and those that can be applied in samples of unrelated individuals by exploiting LD information, often referred to as population-based methods (e.g., [7, 27, 28]). Nevertheless, some methods present hybrid properties by exploiting both sources of information (e.g., [29,30,31]). Other methods apply heuristic rules by matching target haplotypes to libraries of reference haplotypes in windows (e.g., [30]). Long-range phasing methods use such heuristic approaches and rely also on the identification of surrogate parents [29, 32]. Most recent advances in phasing methods were related to their ability to handle huge data sets, including thousands of sequenced samples [33,34,35]. Phasing accuracy of these different approaches will impact the outcome of different haplotype-based applications and must be assessed. Although this is most often initially tested in human populations, it should ideally be realized in populations with different demographic histories and levels of relatedness.

Here we take advantage of a unique sequenced cattle pedigree to assess accuracy of several population-based phasing methods, including recent methods commonly used in livestock species. This sample is a typical example of a livestock population with reduced effective population size, high levels of relatedness and long-range LD, and containing 100 to 200 sequenced individuals. We show that such data can be phased with extremely high accuracy. In addition, we illustrate that the raw sequence data requires stringent filtering to obtain accurate haplotypes.

Results

Quality of whole-genome sequence genotype data after applications of different procedures

We assessed the quality of the genotyping data by comparing the number of CO identified with the evaluated data to the expected number of CO, obtained with a high-quality reference map validated with 115,967 genotyped individuals and 30,331 SNPs (see Methods). In both cases, CO were identified using a pedigree-based approach [36]. The ARS-UCD1.2 bovine genome assembly [37] was used as the starting point for the reference map. A total of 18 SNPs showing evidence of incorrect map positions were then discarded. Among those, 13 matched regions also flagged by Qanbari and Wittenburg [38] as potential errors in the genome build. After removal of these SNPs, we found no more evidence for map errors. Using this reference map and our sequenced pedigree containing 264 individuals, we found an average of 25 CO per individual, 26 and 23 in males and females, respectively.

When CO were estimated using the 15,327,429 SNPs that passed the variant quality score recalibration (VQSR) procedure (with the threshold set to 99.9), the average counts per meiosis were highly inflated, equal to 1416 (Fig. 1). When the threshold for the VQSR filtering was set to 97.5, resulting in the selection of 11,030,905 SNPs, the average number of CO dropped to 254 confirming that the data quality was improved. However, this value was 10 times larger than the expected values, clearly indicating that the sequence genotype data required further cleaning. Subsequent selection of a subset of 8,435,899 variants behaving like true Mendelian variants (see Methods), with genotype frequencies close to Hardy-Weinberg proportions, and with minor allele frequency higher than 0.01, resulted in the identification of 167 CO on average per meiosis, still six times above expectations. Refining genotype calls using Beagle4.1 [39] clearly improved the genotype quality, the number of identified CO being reduced by a factor 3 (51 CO per meiosis on average). We then removed regions presenting high coverage, excessive levels of recombination or of genotyping errors (see Methods), resulting in the removal of 18,220 additional SNPs. The total number of SNPs was then 8,417,679 ranging from 166,292 (chromosome 25) to 529,626 (chromosome 1) per chromosome (Table S1 in Additional file 1). This removal further improved the quality of our data as the number of identified CO dropped to 31 CO per meiosis on average. Finally, by setting remaining genotypes that were discordant in parent-offspring pairs to missing (e.g., opposite homozygous), the average number of identified CO was further reduced to 30 CO. Overall, the application of these different procedures allowed us to reduce the average number of CO from 1416 to 30 CO, closer to expectations. 
Nevertheless, we still identified on average 5 additional CO with the final sequence data compared to results obtained with the high-quality reference map. These 5 additional CO might be missed with the 50K map but could also correspond to errors, meaning that haplotypes obtained with a family-based approach might still contain a few errors.
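
The last cleaning step, setting discordant parent-offspring genotypes to missing, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the 0/1/2 genotype coding, the function name `mask_opposite_homozygotes` and the choice to mask both members of the pair are assumptions made for illustration, and only the opposite-homozygous case given as an example in the text is handled.

```python
from typing import List, Optional, Tuple

# Assumed coding: 0, 1, 2 = number of alternate alleles; None = missing genotype.
Genotype = Optional[int]


def mask_opposite_homozygotes(
    parent: List[Genotype], offspring: List[Genotype]
) -> Tuple[List[Genotype], List[Genotype], int]:
    """Set to missing every genotype pair that is opposite homozygous.

    A parent carrying 0/0 cannot have a 1/1 offspring (and vice versa), so
    such a pair signals a genotyping error in at least one of the two.
    Masking both genotypes is an assumption of this sketch.
    """
    if len(parent) != len(offspring):
        raise ValueError("genotype vectors must have the same length")
    parent_out, offspring_out = list(parent), list(offspring)
    n_masked = 0
    for i, (p, o) in enumerate(zip(parent, offspring)):
        if p is not None and o is not None and {p, o} == {0, 2}:
            parent_out[i] = None
            offspring_out[i] = None
            n_masked += 1
    return parent_out, offspring_out, n_masked


# Hypothetical genotypes at five SNPs for one parent-offspring pair.
parent_clean, offspring_clean, n_masked = mask_opposite_homozygotes(
    [0, 2, 1, 2, None], [2, 2, 1, 0, 0]
)
print(n_masked, parent_clean, offspring_clean)
```

Heterozygous genotypes can never be flagged by this rule, which is one reason why a few spurious CO may survive such filtering.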

Comparison of phasing quality achieved with different population-based phasing methods

Strategy

To assess the phasing quality from different LD-based methods we used the haplotypes from 98 sequenced individuals that had both their parents also sequenced (sequenced trios). The haplotypes of these sequenced offspring (validation individuals) were phased using Mendelian rules, which are exact in the absence of genotyping errors, to serve as the “true haplotypes”. Population-based approaches were then applied in two scenarios, either with only the 98 validation individuals (scenario 1), or with the full data set consisting of 264 individuals, but ignoring the pedigree relationships (scenario 2). Most phasing metrics are computed with respect to heterozygous markers phased in the true haplotypes (the gold standard) with Mendelian rules since these markers are informative. On average, each of the 98 individuals had 1,964,220 such informative markers in both scenarios (Table S1 in Additional file 1). The different metrics used to assess phasing quality are described in the Methods section and most of them are illustrated in Fig. 2.
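
As an illustration of the Mendelian rules used to build the gold standard, the following Python sketch phases a single biallelic SNP in a trio. It is a simplified reading of the principle, not the pedigree-based software used in the study: the 0/1/2 genotype coding and the function name `phase_snp_in_trio` are assumptions, and markers that cannot be resolved (both parents heterozygous or missing) or that are Mendelian-inconsistent are simply returned as unphased.

```python
from typing import Optional, Tuple

# Assumed coding: 0, 1, 2 = number of alternate alleles; None = missing genotype.
Genotype = Optional[int]


def phase_snp_in_trio(
    sire: Genotype, dam: Genotype, child: int
) -> Optional[Tuple[int, int]]:
    """Return (paternal allele, maternal allele) of the child, or None.

    Alleles are coded 0 (reference) and 1 (alternate).
    """
    if child != 1:
        # Homozygous child: phase is trivial and the marker is not informative.
        return (child // 2, child // 2)
    # A homozygous parent can only transmit one allele.
    from_sire = {0: 0, 2: 1}.get(sire)
    from_dam = {0: 0, 2: 1}.get(dam)
    if from_sire is None and from_dam is None:
        return None  # both parents heterozygous or missing: unresolved
    if from_sire is not None and from_dam is not None:
        if from_sire == from_dam:
            return None  # heterozygous child is Mendelian-inconsistent
        return (from_sire, from_dam)
    if from_sire is not None:
        return (from_sire, 1 - from_sire)
    return (1 - from_dam, from_dam)


print(phase_snp_in_trio(0, 1, 1))  # sire 0/0 fixes the paternal allele
print(phase_snp_in_trio(1, 1, 1))  # all heterozygous: not resolvable
```

Only heterozygous markers resolved in this way count as informative markers for the metrics that follow.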

Phasing yield

Most of the tested phasing methods achieved 100% phasing yield, and this was almost true also for FImpute3.0 that phased on average more than 99.99% of the heterozygous SNPs in both tested scenarios. Only AlphaPhase1.3 failed to phase all the SNPs, with on average 97.5 and 98.7% of phasing yield in the first and in the second scenarios respectively.

Switch error count (SEC) and rate (SER)

Median switch error count (SEC) and rate (SER) computed on the 98 validation individuals are provided for the main phasing algorithms and each scenario in Fig. 3 and Table 1. When only the 98 validation individuals were used for phasing, the median SEC was around 4500 to 5000 for a group of methods including AlphaPhase1.3, Eagle2.4 and FImpute3.0. These values correspond to a SER slightly below 0.25%, meaning that switch errors occur on average every 400 informative markers. Haplotypes obtained with Beagle4.1 had clearly lower SEC than these first methods, close to 2750. ShapeIT4.1 performed even better with median SEC and SER below 400 and 0.02% respectively, corresponding to one switch error every 5000 informative markers. Finally, Beagle5.2, relying on a new algorithm compared to earlier Beagle versions (until Beagle4.1), resulted in the lowest SEC and SER, outperforming all other methods, the median values were below 200 and 0.01% per individual, corresponding to one switch every 10,000 informative markers. This represents a reduction by a factor 20 compared to AlphaPhase1.3, Eagle2.4 and FImpute3.0, whereas ShapeIT4.1 generated ten times fewer switches than these methods. When methods are compared with other summary statistics related to SEC and SER such as the mean, minimal and maximal values or as their range (Fig. 3A and Tables S2-S3 in Additional file 1), the ranking of the method remained similar with Beagle5.2 performing best, followed by ShapeIT4.1. When comparing different versions of the same software in terms of SEC and SER (Figure S1A in Additional file 1), we observed that newer versions are more accurate as expected. In particular, AlphaPhase1.3 represents a major improvement with respect to AlphaPhase1.1. Regarding Beagle’s versions relying on a directed acyclic graph, Beagle3.3 and Beagle4.0 had close performances and Beagle4.1 appeared as an important improvement. 
Several of these versions of Beagle presented a lot of variation among individuals. Regarding the latest Beagle’s versions, relying on the Li and Stephens model [40], Beagle5.1 was slightly better than Beagle5.0 whereas Beagle5.2 represented a substantial improvement. Finally, ShapeIT2 and ShapeIT4.1 achieved similar performances.
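
To make the two statistics concrete, here is a minimal Python sketch that counts switch errors on one chromosome by comparing an inferred haplotype with the gold standard at informative markers. The input representation (the allele carried by the first haplotype at each informative marker) and the use of the number of adjacent marker pairs as the SER denominator are assumptions of this sketch; the exact definitions are those of the Methods section.

```python
from typing import Sequence, Tuple


def switch_errors(truth: Sequence[int], inferred: Sequence[int]) -> Tuple[int, float]:
    """Return (switch error count, switch error rate) for one chromosome.

    truth and inferred hold the allele (0/1) carried by the first haplotype
    at each informative (heterozygous) marker, in map order. A switch is
    counted whenever the inferred haplotype changes from matching the truth
    to matching its complement, or back, between two consecutive markers.
    """
    if len(truth) != len(inferred):
        raise ValueError("haplotypes must cover the same markers")
    same = [t == p for t, p in zip(truth, inferred)]
    sec = sum(1 for a, b in zip(same, same[1:]) if a != b)
    opportunities = max(len(same) - 1, 0)
    ser = sec / opportunities if opportunities else 0.0
    return sec, ser


# A single mis-phased marker (point switch) produces two consecutive switches.
print(switch_errors([0, 0, 0], [0, 1, 0]))

# Order of magnitude reported in scenario 2: 50 switches over roughly
# 1,964,220 informative markers, expressed in percent.
print(100 * 50 / 1_964_220)
```

Because the orientation of the first marker is arbitrary, an inferred haplotype that is the exact complement of the truth yields no switch error.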

Table 1 Results of different metrics used to assess phasing quality in both scenarios. Median values of switch error counts (SEC), switch error rates (SER, %), quality adjusted (QA) haplotype block length (bp), and QAN50 (bp), obtained with AlphaPhase1.3, Beagle4.1, Beagle5.2, Eagle2.4, FImpute3.0 and ShapeIT4.1, computed for the 98 validation individuals in each scenario (scenario 1: using the 98 validation individuals; scenario 2: using the 264 sequenced individuals)

When a larger data set was used for phasing, consisting of 264 individuals including the sequenced parents (scenario 2), the phasing accuracy improved for most of the methods (Fig. 3B and Table 1), with less variation among individuals. For AlphaPhase1.3, however, the SEC reduction remained modest and it consequently ranked last. For all the other phasing methods, the median SEC was below 1000, around 900 and 350 for Eagle2.4 and Beagle4.1, respectively. FImpute3.0 showed the largest improvement compared to scenario 1, the median SEC being reduced almost 40-fold and dropping to 115. However, Beagle5.2 and ShapeIT4.1 still performed best with median SEC values equal to 55 and 50, respectively. These values correspond to extremely low median SER, equal to 0.0027 and 0.0026%, respectively, and to one switch error every 40,000 informative SNPs. The ranking remains similar with other summary statistics (Fig. 3B and Tables S2-S3 in Additional file 1), except that FImpute3.0 presented the lowest minimum and maximum individual SEC. With FImpute3.0, the lowest value was equal to 2, indicating that almost all chromosomes were perfectly phased for that individual. With ShapeIT4.1 and Beagle5.2, the best phased individual had only 12 and 13 SEC, respectively. The median SEC dropped for all the different versions of the tested software, and the variation among individuals was strongly reduced, in particular for Beagle3.3, Beagle4.0, Beagle4.1 and Beagle5.1 (Figure S1B in Additional file 1). The ranking of these different versions, in terms of SEC or SER, was similar to the ranking observed in the first scenario, with the exception of ShapeIT2, which now presented higher SEC and variation levels than ShapeIT4.1. Differences between Beagle5.1 and Beagle5.2 were also smaller than in the first scenario.

Length of correctly phased haplotype blocks

Statistics relying on SEC do not fully describe how switch errors are distributed along the chromosomes, nor the resulting distribution of the lengths of correctly phased haplotype segments. Therefore, we also computed the quality adjusted (QA) haplotype block length and the QAN50 metrics, as described in the Methods section, in order to highlight the ability of a phasing tool to produce long correctly phased blocks within a chromosome, without switch error.

In the first scenario, median QA haplotype block lengths are equal respectively to 10 and 12 kb with AlphaPhase1.3 and Beagle5.2, and clearly lower with other methods (Table 1). The ranking of the methods based on the median QA haplotype block lengths is thus very different from comparisons based on SEC, with AlphaPhase1.3 ranking second. However, when mean values are used in comparisons (Table S4 in Additional file 1), the ranking follows results obtained with SEC. These mean lengths of correctly phased segments range from 500 kb with AlphaPhase1.3 to 7.5 Mb with Beagle5.2. Compared to AlphaPhase1.3, the median values were five times lower with ShapeIT4.1 but the haplotype block lengths were on average ten times longer. This indicates that some methods such as ShapeIT4.1 tend to produce a lot of small correctly phased segments (switch errors being close) in combination with very long correctly phased segments (up to 156.8 Mb with ShapeIT4.1, a full chromosome), whereas other methods such as AlphaPhase1.3 tend to provide more uniform distances between successive switch errors. This is confirmed in the distributions of QA haplotype block lengths (Figure S3 in Additional file 1) that are concentrated around short values with all methods, although long correctly phased segments are observed, in particular with ShapeIT4.1 and Beagle5.2. As a result, the mean length is higher with these two methods. Long segments capture large fractions of the genome and the QAN metrics provide complementary information by weighting the segments by their length (in Mb or in number of SNPs). For instance, the QAN50 metrics obtained for different methods (Fig. 4A and Table 1) indicate that with AlphaPhase1.3, 50% of the genome is included in correctly phased segments longer than 2.5 Mb. The QAN50 increases to 5.8 and 6.1 Mb with FImpute3.0 and Eagle2.4, respectively (approximately 20,000 SNPs, Fig. 4C) and to 18.2 Mb with Beagle4.1(approximately 60,000 SNPs). 
ShapeIT4.1 and Beagle5.2 performed best with a QAN50 close to 48 Mb corresponding to blocks of approximately 170,000 SNPs. Figure 4A provides the full distribution of QAN values (from 100 to 0%), with very similar curves for ShapeIT4.1 and Beagle5.2. It allows also to determine the percentage of the genome included in correctly phased segments longer than different thresholds as reported in Table 2. For instance, for applications such as imputation or haplotype-based association studies, phasing accuracy is important locally, at short range (< 1 Mb). The aptitude to produce long correctly phased segments (> 10 Mb) for most of the genomic positions is more important in applications relative to the age of young alleles, of recent IBD segments or recent selective sweeps. With Beagle5.2 for instance, 96.3, 93.6, 89.3 and 48.1% of the genome is included in correctly phased segments of at least 1, 5, 10 and 50 Mb, respectively. These values are close with ShapeIT4.1 and lower with the remaining methods, only 86.4, 54.0, 33.2 and 1.3% with Eagle2.4, for instance. Most recent versions of tested software performed better in terms of QA and QAN50 than older versions (Figure S2A, C and Table S4 in Additional file 1), with the exception of ShapeIT2 that had similar statistics as ShapeIT4.1, and Beagle4.1 that presented better results than Beagle5.0.
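
The difference between the median block length and the QAN50 can be seen on a toy example. The Python sketch below is only an approximation of the metrics defined in the Methods section: it assumes that a correctly phased block spans from the first to the last informative marker between two switch errors, and that QAN50 is the block length at which the blocks, taken from longest to shortest, cover half of the total block length. Function names and input layout are hypothetical.

```python
from typing import List, Sequence


def qa_block_lengths(positions: Sequence[int], same: Sequence[bool]) -> List[int]:
    """Lengths (bp) of segments without switch error on one chromosome.

    positions: sorted base-pair positions of the informative markers.
    same: for each marker, whether the inferred phase matches the truth.
    """
    if not positions:
        return []
    blocks = []
    start = positions[0]
    for i in range(1, len(positions)):
        if same[i] != same[i - 1]:  # switch error between markers i-1 and i
            blocks.append(positions[i - 1] - start)
            start = positions[i]
    blocks.append(positions[-1] - start)
    return blocks


def qan(lengths: Sequence[int], fraction: float = 0.5) -> int:
    """Length L such that blocks of length >= L cover `fraction` of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0


blocks = qa_block_lengths(
    [100, 200, 300, 400, 1000, 2000], [True, True, False, False, False, True]
)
print(blocks, qan(blocks))
```

In this toy case one long block dominates the QAN50 while the short blocks pull the median down, which is the pattern described above for ShapeIT4.1 in the first scenario.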

Table 2 Genome percentage included in correctly phased segments longer than different thresholds in both scenarios. Percentage of the genome covering quality adjusted (QA) haplotype blocks of minimal length of respectively 1, 5, 10 and 50 Mb, obtained with AlphaPhase1.3, Beagle4.1, Beagle5.2, Eagle2.4, FImpute3.0 and ShapeIT4.1, computed for the 98 validation individuals in each scenario (scenario 1: using the 98 validation individuals; scenario 2: using the 264 sequenced individuals)

In the second scenario including more individuals, mean QA haplotype block lengths (Table S4 in Additional file 1) increased for all methods, reaching 23.3 and 24.1 Mb with Beagle5.2 and ShapeIT4.1, respectively. The distributions of QA haplotype block lengths are clearly shifted towards longer segments with ShapeIT4.1 and Beagle5.2, and with FImpute3.0 to a lesser extent (Figure S3 in Additional file 1). Interestingly, the improvement is only modest with AlphaPhase1.3 whereas the mean QA haplotype block length increases from 500 kb to 12.5 Mb with FImpute3.0, a 25-fold change. FImpute3.0 is even the best method with respect to QAN50 (79.9 Mb), followed by ShapeIT4.1 (69.0 Mb) and Beagle5.2 (62.7 Mb) (Table 1). However, a larger proportion of the genome is included in correctly phased segments longer than 10 Mb with ShapeIT4.1 (95.1%) compared to FImpute3.0 (91.6%) (Fig. 4B, D and Table 2). ShapeIT4.1 performed slightly better than Beagle5.2 at different thresholds (see Fig. 4B, D). Comparisons of different versions of the tested software are in agreement with those made in the first scenario, although the differences between Beagle5.0, Beagle5.1 and Beagle5.2 are smaller (Figure S2B, D and Table S4 in Additional file 1).

Pairwise SNP phasing accuracy

Finally, we compared the methods in terms of pairwise SNP phasing accuracy. This metric represents the probability that there is no phasing error between two SNPs as a function of their distance. The results are reported in Fig. 5 and Table 3 and are in agreement with observations for other metrics such as QAN50. In the first scenario, these probabilities are above 0.95 and 0.92 at 10 and 100 kb, respectively, with Beagle4.1, Beagle5.2 and ShapeIT4.1 (Fig. 5A). Other methods presented values below 0.93 and 0.91, respectively. The probabilities dropped rapidly at longer distances, even at 1 Mb (around 0.80 with Beagle4.1 and even below 0.70 for the three less efficient methods). ShapeIT4.1 and Beagle5.2 performed best with probabilities still above 0.90 at 1 Mb, but only 0.78 and 0.66 at 5 and 10 Mb, respectively. At 50 Mb, the probabilities were almost null with all methods and only 0.16 and 0.17 with Beagle5.2 and ShapeIT4.1, respectively. In the second scenario, the probabilities are higher and drop less rapidly, presenting a plateau until a distance of almost 1 Mb (Fig. 5B). ShapeIT4.1 achieved the highest probabilities, equal to 0.95, 0.87 and 0.77 at 1, 5 and 10 Mb distance, respectively, but FImpute3.0 achieved almost identical results and was even better at very long distance (0.31 at 50 Mb vs 0.27 for ShapeIT4.1).
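
A possible way to estimate this metric is sketched below in Python. Several choices are assumptions rather than the procedure of the Methods section: a pair of informative SNPs is counted as correctly phased when no switch error lies between them (an alternative would be to require only that their relative phase is correct), pairs are selected within a relative tolerance around the target distance, and all pairs are enumerated, which is only practical for small inputs.

```python
from itertools import combinations
from typing import Optional, Sequence


def pairwise_accuracy(
    positions: Sequence[int],
    same: Sequence[bool],
    distance: float,
    tolerance: float = 0.1,
) -> Optional[float]:
    """Fraction of SNP pairs about `distance` bp apart with no switch between.

    positions: sorted base-pair positions of the informative markers.
    same: for each marker, whether the inferred phase matches the truth.
    Returns None when no pair falls in the distance window.
    """
    # Cumulative number of switch errors up to each marker.
    cum = [0] if positions else []
    for a, b in zip(same, same[1:]):
        cum.append(cum[-1] + (a != b))
    low, high = distance * (1 - tolerance), distance * (1 + tolerance)
    n_pairs = n_ok = 0
    for i, j in combinations(range(len(positions)), 2):
        if low <= positions[j] - positions[i] <= high:
            n_pairs += 1
            n_ok += cum[i] == cum[j]  # same count means no switch in between
    return n_ok / n_pairs if n_pairs else None


positions = [0, 100, 200, 300, 400]
same = [True, True, False, False, False]  # one switch between 100 and 200 bp
for d in (100, 200):
    print(d, pairwise_accuracy(positions, same, d))
```

With a single switch error, the accuracy decreases as the distance grows because a larger share of the pairs spans the error, which mirrors the decay of the curves with distance.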

Table 3 Pairwise SNP phasing accuracy at different distances in both scenarios. Pairwise SNP phasing accuracy at distances of respectively 0.01, 0.1, 1, 2, 5, 10 and 50 Mb, obtained with AlphaPhase1.3, Beagle4.1, Beagle5.2, Eagle2.4, FImpute3.0 and ShapeIT4.1, computed for the 98 validation individuals in each scenario (scenario 1: using the 98 validation individuals; scenario 2: using the 264 sequenced individuals)

Discussion

We herein compared accuracy of population-based phasing tools in a whole-genome sequenced cattle pedigree. To be able to measure the accuracy, it was essential to apply certain procedures to our whole-genome sequencing data. Indeed, after filtering variants based on a standard variant quality score recalibration procedure, the number of CO in our pedigree was still highly inflated, suggesting that these pedigree-based haplotypes contain many errors. We had to apply further filters to our data to remove additional low-quality markers or small genomic regions incorrectly mapped in the reference genome build. Refining genotype calling with Beagle4.1 [39] had a major impact, stressing the importance of such a procedure. Our final data presented still a few more CO than those obtained at lower density with a high-confidence map. This could be due to the CO missed at lower density or to the errors that remain in the sequence data. This would represent a maximum of 5 incorrect CO on average per meiosis, and these errors could result from a phasing error in the parent or in the offspring haplotypes (each of these haplotypes would thus have less than 5 errors). We could apply additional filters to further reduce the number of errors. For instance, we could remove SNPs located in copy number variants since they would generate spurious CO as genotype calling is more difficult at these positions [41]. Similarly, heterozygous genotypes in the middle of long homozygous-by-descent segments [42] are also probably errors and would generate incorrect CO. Nevertheless, our results illustrate that many errors are still present in whole-genome sequencing data, and that stringent filtering is required. The presence of these low-quality variants would not be a problem in genome-wide association studies because associations are tested independently for each SNP, and genomic prediction methods might be robust to this problem. 
It might even be key to keep as many variants as possible to have the causative variants in the data set. However, for applications relying on haplotypes and their length, stringent filtering is essential, in particular when long correctly phased segments are required. To illustrate the impact of filtering, we evaluated the methods before improving genotype quality with Beagle4.1 and observed that the phasing accuracy was strongly reduced (Table S5 and Table S6 in Additional file 1). With ShapeIT4.1, the SER were for instance 6 and 40 times higher, in the first and second scenarios respectively, compared to values obtained on a cleaned data set with the same software. Similarly, the QAN50 was divided approximately by ten in both scenarios when ShapeIT4.1 was applied to lower quality genotypes. Overall, evaluations on these data resulted in smaller differences between methods and scenarios, indicating that most of the phasing errors result from the presence of genotyping or map errors in the data rather than from differences between phasing approaches. Nevertheless, ShapeIT4.1 and Beagle5.2 remained overall the best, whereas the performance of FImpute3.0 was heavily impacted. This evaluation further stresses the importance of improving genotype quality as much as possible.

The quality of our final data set was high enough to evaluate the phasing methods, since we expect only a few errors (from 0 to 5) in our reference haplotypes. In our best scenario, the accuracy of the best method was impressive with a median of 50 SEC per individual, corresponding to approximately only two SEC per chromosome and to a SER of 0.003%. In the case of point switches, a phasing error at a single marker (or a small segment) that would cause two consecutive SEC (see example in Fig. 2) would represent only one such punctual error per chromosome. Some individuals presented only 2 SEC for their entire genome, and chromosomes were frequently phased without errors. These results are also confirmed with the metrics related to length of correctly phased segments. On average, ShapeIT4.1 had only one switch error every 40,000 informative markers. When only one hundred individuals were simultaneously phased, there were around 10 switches on average per chromosome with Beagle5.2 (one switch error every 10,000 SNPs corresponding to a SER of 0.01%). These are nevertheless excellent results given the small sample size. They are indeed better than those reported in human populations. For instance, Delaneau et al. [35] obtained a SER above 0.5% with ShapeIT4 and Beagle5 and with a reference panel of 20,000 individuals (at lower marker density). Loh et al. [33] also obtained higher SER using ShapeIT2 and Eagle2, whereas Choi et al. [23] estimated that the SER ranged from 0.8 to 1.5% for Eagle2, ShapeIT2 and Beagle4 for a reference ‘Genome-In-A-Bottle’ whole-genome phased individual, using a reference panel of 2500 individuals. Similarly, Song et al. [13] reported higher SER, above 2%, in human populations phased with ShapeIT2.
This higher accuracy in our cattle data set might be related to the lower effective population size (around 100 in the current population [43, 44]), the higher relatedness (see for instance Figure S4 in Additional file 1) and LD levels, particularly at long distance (> 0.1 when marker distance > 1 Mb [45]). We previously observed that population-based methods such as Beagle are very effective at indirectly exploiting the familial information through the presence of long-shared haplotypes (see also [24]). Similarly, methods such as AlphaPhase [29] or FImpute [30] can identify parents or surrogate parents without pedigree information. This was confirmed in the present study as increasing the sample size and including sequenced relatives clearly improved the accuracy, in particular for FImpute3.0, although the pedigree information was not explicitly used. The high observed accuracy might also result from the stringent rules applied to improve the quality of our data set. With less stringent rules, phasing accuracy dropped indeed significantly (see above).

We herein evaluated the phasing methods with default settings. However, the performance of most phasing methods could be further improved by optimizing their parameters. Settings from methods that were originally developed for human populations might indeed not be optimal for livestock populations. For instance, we observed that accuracy of ShapeIT4.1 could be slightly improved by increasing the value of the --pbwt-depth parameter in the second scenario (Table S7 and Table S8 in Additional file 1). This parameter defines the number of neighboring haplotypes selected as conditioning states for the Li and Stephens model [40]; higher values increase the accuracy but also the computational costs [35]. However, the optimal parameters for each method might depend on the population structure, the number of individuals, the marker density, etc. Therefore, it is difficult to select optimal values prior to the analysis, and we preferred to compare the phasing methods with their default settings.

In our study, ShapeIT4.1 and Beagle5.2 performed best for almost all evaluated metrics and for both scenarios. Their relative ranking varied however according to the metric and the scenario. Beagle5.2 achieved the best results mainly in the first scenario whereas ShapeIT4.1 was often the most accurate in the second scenario. When the parents were included in the data, FImpute3.0 accurately phased extremely long segments and the estimated SEC was as low as 2 for some individuals. Nevertheless, when the parents were not included in the sample, the accuracy of FImpute3.0 decreased although some full-sibs were present in the sample (Figure S4 in Additional file 1). Phasing accuracy also varies across versions of the same software. In our study, we observed that phasing accuracy improved as expected with newer versions of the software. As a result, comparisons of different methods might vary through time, according to the compared versions. For instance, until recently ShapeIT4.1 was in competition with Beagle5.1. In our comparisons, ShapeIT4.1 was most often better although Beagle5.1 performed extremely well. However, Beagle5.2, the new release, performed as well as ShapeIT4.1 (see above). Phasing accuracy will also change according to different elements such as marker density, level of relatedness and size of the population [24], and this might impact the ranking of the methods. For instance, in Choi et al. [23], Eagle2 performed better than ShapeIT2 and Beagle4 on human data. In a Holstein dairy cattle population genotyped with medium to high density genotyping arrays, Miar et al. [46] compared Beagle4.1, ShapeIT2 and FImpute based on SER. They estimated that Beagle4.1 was the most accurate whereas ShapeIT2 resulted in higher SER. However, their sample was much larger and more information from relatives was thus available. Consistent with our study, when one or two parents of the validation animals were added to the phased sample, FImpute became more accurate than Beagle4.1. Using simulated data mimicking a brown layer population, Frioni et al. [47] observed that haplotypes phased with Beagle4.1 had lower SEC than those obtained with FImpute when parents were not included. As in our study, inclusion of parents in the phased sample increased phasing accuracy for FImpute. Fewer comparisons in livestock species are available for ShapeIT4.1 or Beagle5.0 as these programs are more recent. In summary, our data set represents a typical example of a reference panel containing 100 to 200 whole-genome sequenced individuals in a livestock species with high levels of relatedness. In those conditions, ShapeIT4.1 and Beagle5.2 performed particularly well. We are not aware of the reasons why these two approaches present higher phasing accuracies. Both of them rely on a Li and Stephens model [40] that might prove more flexible and more accurate than methods relying on matching haplotypes of fixed length such as AlphaPhase or FImpute. Browning and Browning [24] previously reported that the original phasing algorithm from Beagle (implemented in Beagle3 or Beagle4) was less efficient with smaller samples (as in the present study). The implementation of the method, including parameter settings and fine-tuning, also impacts the phasing accuracy, as we observed that methods with the same global approach (e.g., the Li and Stephens model), or even successive versions of the same software, achieve different accuracies. Similarly, it is difficult to determine the cause of the poorer performance of AlphaPhase, although lower phasing accuracy compared to Eagle2 was already reported in the study presenting the software [48].
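To make the SEC and SER comparisons above concrete, the following Python sketch counts switch errors for one individual by comparing an estimated haplotype with a gold-standard haplotype at heterozygous markers. It is an illustration under stated assumptions, not the script used in this study: the function name, the 0/1 allele coding and the choice of the number of intervals between consecutive informative markers as the SER denominator are all assumptions.

```python
from typing import List, Sequence, Tuple


def switch_errors(truth_hap1: Sequence[int],
                  est_hap1: Sequence[int]) -> Tuple[int, float, List[int]]:
    """Count switch errors between a gold-standard and an estimated phase.

    Both inputs give the allele (0 or 1) carried by the first haplotype at
    each heterozygous marker, in map order. Returns the switch error count,
    the switch error rate (count divided by the number of intervals between
    consecutive markers, an assumed definition) and the indices i at which
    a switch occurs between marker i-1 and marker i.
    """
    if len(truth_hap1) != len(est_hap1):
        raise ValueError("haplotypes must cover the same markers")
    # True where the estimated first haplotype agrees with the gold standard.
    same = [t == e for t, e in zip(truth_hap1, est_hap1)]
    # A switch error is a change of orientation between adjacent markers.
    switches = [i for i in range(1, len(same)) if same[i] != same[i - 1]]
    n_intervals = max(len(same) - 1, 0)
    sec = len(switches)
    ser = sec / n_intervals if n_intervals else 0.0
    return sec, ser, switches


# Toy example with six heterozygous markers.
truth = [0, 1, 1, 0, 1, 0]
estimate = [0, 1, 0, 1, 0, 0]
sec, ser, where = switch_errors(truth, estimate)
print(sec, ser, where)  # 2 0.4 [2, 5]
```

Because only changes of orientation are counted, an estimated haplotype that is the exact complement of the gold standard yields zero switch errors, which is the intended behavior since haplotype labels are arbitrary.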

The choice of the phasing method might nevertheless depend on the availability of other options. For instance, Beagle4.0, FImpute3.0 and AlphaPhase1.3 can exploit the pedigree information, which might increase their phasing accuracy in certain conditions. When pedigree information was used in the first scenario (without the sequenced parents), phasing accuracy of AlphaPhase or FImpute was however not higher (Table S9 and Table S10 in Additional file 1), probably because that information was already captured through long haplotype sharing between individuals or because these approaches can identify parents or surrogate parents without the need of the pedigree information. The benefit of pedigree information is stronger when direct relatives such as sequenced parents or offspring are available, as in the second scenario, or at lower marker density when LD methods are less efficient. Familial information can also be integrated in some LD-based approaches with a two-step procedure in which haplotypes are first obtained based on familial information and unphased markers are subsequently phased by an LD-based approach [26]. Such an approach is possible with Beagle4.1 or ShapeIT4.1, which preserve pre-phasing information present in the VCF file. Phasing information coming from marker alleles present on the same sequenced reads can also be integrated with such an approach.
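The first step of such a two-step procedure, phasing from familial information, can be illustrated for a single marker in a trio. The Python sketch below is a minimal illustration of Mendelian segregation rules and not the implementation used in this study; the function names and the coding of genotypes as counts of the alternate allele (0, 1 or 2) are assumptions.

```python
from typing import Optional, Set, Tuple


def transmissible_alleles(genotype: int) -> Set[int]:
    """Alleles a parent can transmit, given its alternate-allele count."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def phase_child(child: int, sire: int, dam: int) -> Optional[Tuple[int, int]]:
    """Phase one marker of a child from the genotypes of both parents.

    Returns (paternal allele, maternal allele) when exactly one transmission
    is compatible with the three genotypes. Returns None when the marker is
    ambiguous (all three individuals heterozygous) or when the genotypes are
    not compatible with Mendelian inheritance.
    """
    compatible = [
        (paternal, maternal)
        for paternal in transmissible_alleles(sire)
        for maternal in transmissible_alleles(dam)
        if paternal + maternal == child
    ]
    return compatible[0] if len(compatible) == 1 else None


# Toy trio: one (child, sire, dam) genotype triplet per marker.
trio = [(1, 0, 1), (1, 1, 1), (1, 2, 0), (1, 0, 0), (2, 1, 2)]
phased = [phase_child(*genotypes) for genotypes in trio]
print(phased)  # [(0, 1), None, (1, 0), None, (1, 1)]
```

Markers returned as None would be left unphased in the first step and handed to the LD-based method in the second step.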

Finally, the importance of phasing accuracy will depend on the applications in which the haplotypes are used. For many applications, accurate phasing is only required at short range. For haplotype-based association studies, short 100-kb haplotypes would capture interactions among tightly-linked loci. We previously observed that improved long-range phasing accuracy did not result in higher imputation accuracy in a livestock population [49]. The presence of a few switch errors would not necessarily be a problem in haplotype-based GWAS or genomic selection, or in some QTL mapping approaches, as long as correctly phased segments are long enough to infer the IBD relationships around the tested position. For such applications, most of the tested methods would provide sufficient accuracy. The phasing accuracy will be more important in applications in which the length of shared haplotypes is used to estimate the age of alleles [17] or the age to a common ancestor, to identify signatures of selection [15, 16], or to determine relatedness between individuals based on the distribution of the length of shared IBD segments [4]. This accuracy will also be essential in studies on meiotic recombination based on the identification of crossovers (CO) in genotyped or sequenced pedigrees [20, 50].
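Whether correctly phased segments are long enough around a tested position can be checked directly from the locations of the switch errors. The Python sketch below is illustrative only: the function names, the input layout (positions of heterozygous markers in bp, and indices i marking a switch between marker i-1 and marker i) and the toy values are assumptions, not part of the analyses reported here.

```python
from typing import List, Sequence


def correct_segment_lengths(positions: Sequence[int],
                            switch_indices: Sequence[int]) -> List[int]:
    """Lengths (bp) of segments phased without switch error.

    positions: ordered bp positions of the heterozygous markers.
    switch_indices: sorted indices i such that a switch error occurs
    between marker i-1 and marker i.
    """
    if not positions:
        return []
    starts = [0] + list(switch_indices)
    ends = [i - 1 for i in switch_indices] + [len(positions) - 1]
    return [positions[end] - positions[start] for start, end in zip(starts, ends)]


def window_is_switch_free(positions: Sequence[int],
                          switch_indices: Sequence[int],
                          center: int, half_width: int) -> bool:
    """True if no switch interval overlaps [center - half_width, center + half_width]."""
    low, high = center - half_width, center + half_width
    return not any(
        positions[i - 1] < high and positions[i] > low for i in switch_indices
    )


# Toy example: six heterozygous markers and two switch errors.
positions = [100, 5_000, 20_000, 150_000, 400_000, 900_000]
switches = [2, 5]
print(correct_segment_lengths(positions, switches))  # [4900, 380000, 0]
print(window_is_switch_free(positions, switches, 200_000, 50_000))  # True
print(window_is_switch_free(positions, switches, 500_000, 150_000))  # False
```

With a half-width of 50 kb, the second function corresponds to asking whether a 100-kb haplotype centered on the tested position is free of switch errors.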

Methods

Sequencing data

The whole-genome sequence data used in the present work was obtained from 264 Holstein-Friesian individuals from the DAMONA pedigree designed to study germline mutation in cattle [51] and previously used and described [52, 53]. The individuals were sequenced at high coverage (mean coverage: 25.8X, ranging from 15.2X to 47.1X), and the data included 98 sequenced trios (Figure S3 in Additional file 1). Whole genome Illumina Nextera PCR free libraries (550 bp insert size) were sequenced on an Illumina HiSeq 2000 with a paired-end protocol (2 × 100 bp).

The sequencing data was re-aligned to the new ARS-UCD1.2 (BosTau9) bovine genome assembly [37] using the Burrows-Wheeler Aligner MEM algorithm (v0.7.5a) [54]. The SAM files were converted into BAM files with SAMtools (v1.9) [55]. The BAM files were sorted using Sambamba (v0.6.6) [56]. PCR duplicates were marked with the MarkDuplicates option of picard-tools (v2.7.1) [57]. The BAM files were then recalibrated using the BaseRecalibrator procedure of GATK (v4.1.7.0) [58,59,60], using the VCF provided by the 1000 Bull Genome project (http://www.1000bullgenomes.com/) as the database of known polymorphic sites. Individual GVCF files were obtained with HaplotypeCaller (GATK4) and were subsequently merged in a GenomicsDB (with GenomicsDBImport, GATK4) to perform joint genotyping with GenotypeGVCFs (GATK4). Variants from the resulting VCF file were then recalibrated using VariantRecalibrator (GATK4) by applying two thresholds (99.9 and 97.5) and using 1.2 M SNPs extracted from commercial chips [61] as truth and training sets, and 138 M SNPs provided by the 1000 Bull Genome project as the known set.