How to study runs of homozygosity using PLINK? A guide for analyzing medium density SNP data in livestock and pet species

Meyermans, R.; Gorssen, W.; Buys, N.; Janssens, S.

doi:10.1186/s12864-020-6463-x

Methodology article
Open access
Published: 29 January 2020

S. Janssens ORCID: orcid.org/0000-0002-5588-3889

BMC Genomics volume 21, Article number: 94 (2020)

Abstract

Background

PLINK is probably the most widely used program for analyzing SNP genotypes and runs of homozygosity (ROH), both in human and in animal populations. Over the last decade, ROH analyses have become the state-of-the-art method for inbreeding assessment. In PLINK, the --homozyg function is used to perform ROH analyses and relies on several input settings. These settings can have a large impact on the outcome, and default values are not always appropriate for medium density SNP array data. Guidelines for a robust and uniform ROH analysis in PLINK using medium density data are lacking, although such guidelines are vital for comparing different ROH studies. In this study, 8 populations of different livestock and pet species are used to demonstrate the importance of PLINK input settings. Moreover, the effects of pruning SNPs for low minor allele frequencies and linkage disequilibrium on ROH detection are shown.

Results

We introduce the genome coverage parameter to appropriately estimate F_ROH and to check the validity of ROH analyses. The effect of pruning for linkage disequilibrium and low minor allele frequencies on ROH analyses is highly population dependent and such pruning may result in missed ROH. PLINK’s minimal density requirement is crucial for medium density genotypes and if set too low, genome coverage of the ROH analysis is limited. Finally, we provide recommendations for the maximal gap, scanning window length and threshold settings.

Conclusions

In this study, we present guidelines for an adequate and robust ROH analysis in PLINK on medium density SNP data. Furthermore, we advise authors to report parameter settings in publications and to validate them prior to analysis. We also encourage authors to report genome coverage to reflect the validity of the ROH analysis. Implementing these guidelines will substantially improve the overall quality and uniformity of ROH analyses.

Background

Analysis of runs of homozygosity (ROH) is the state-of-the-art method for inbreeding analyses in livestock populations [1]. ROH are defined as long continuous homozygous stretches in the genome, which are, due to their length, assumed to arise from a common ancestor [2]. Whereas short ROH are indicators of distant inbreeding, long ROH suggest recent inbreeding [3]. ROH were first identified by Broman and Weber in the human genome, whereas Gibson et al. acknowledged their importance for inbreeding calculations [4, 5]. McQuillan et al. defined the genomic inbreeding coefficient based on ROH (F_ROH) [6].

PLINK [7, 8] is the most used program for ROH analyses in livestock populations [1]. ROH analyses are performed using the --homozyg function. The PLINK algorithm for ROH detection relies on a scanning window approach which roughly consists of three steps.

First, the scanning window is defined by a predefined number of SNPs (--homozyg-window-snp), a maximal number of heterozygous SNPs (--homozyg-window-het) and a maximal number of missing SNPs (--homozyg-window-missing). The window scans an individual’s genome stepwise and records, for each SNP, the proportion of windows containing that SNP that are homozygous.

Second, segments of homozygous SNPs are identified genome wide by using a threshold for these scores per SNP: the scanning window hit rate (--homozyg-window-threshold). For a window size of 100 SNPs and a threshold of 0.05, a SNP has to appear in at least five homozygous windows before it is identified as part of a segment. Note that such homozygous windows may contain missing or heterozygous SNPs, depending on scanning window settings.

Third, extra constraints are applied to these homozygous segments to identify the final ROH segments. The maximal interval between two SNPs in a segment is checked (--homozyg-gap), as well as the maximal number of heterozygous SNPs allowed in the final ROH segment (--homozyg-het). ROH segments that do not meet these two requirements are split and re-evaluated. This may lead to detecting ROH segments smaller than the scanning window size. Thereafter, the minimal SNP density (in kb/SNP) per segment is evaluated (--homozyg-density), as well as the minimal length and number of SNPs (--homozyg-kb and --homozyg-snp). ROH segments that fail any one of these three conditions are removed.
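The three steps described above can be sketched in a few lines of Python. This is a simplified illustration, not PLINK's actual implementation: the genotype coding, the function names and the toy values are assumptions made for the example, the score of a SNP is taken as the fraction of windows overlapping it that are homozygous, and the re-evaluation against --homozyg-het is left out.

```python
from typing import List, Optional, Tuple

# Assumed coding for this sketch: 0 = homozygous, 1 = heterozygous, None = missing.
Genotype = Optional[int]
Segment = Tuple[int, int]  # (first SNP index, last SNP index), inclusive


def snp_scores(genos: List[Genotype], window_snp: int,
               max_het: int, max_missing: int) -> List[float]:
    """Step 1: per SNP, the fraction of overlapping windows that count as homozygous."""
    n = len(genos)
    hits, total = [0] * n, [0] * n
    for start in range(n - window_snp + 1):
        window = genos[start:start + window_snp]
        n_het = sum(1 for g in window if g == 1)
        n_missing = sum(1 for g in window if g is None)
        homozygous = n_het <= max_het and n_missing <= max_missing
        for i in range(start, start + window_snp):
            total[i] += 1
            hits[i] += int(homozygous)
    return [h / t if t else 0.0 for h, t in zip(hits, total)]


def call_segments(scores: List[float], threshold: float) -> List[Segment]:
    """Step 2: maximal runs of consecutive SNPs whose score reaches the threshold."""
    segments: List[Segment] = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments


def split_and_filter(segments: List[Segment], pos_bp: List[int], max_gap_kb: float,
                     max_kb_per_snp: float, min_kb: float, min_snp: int) -> List[Segment]:
    """Step 3: split at large gaps, then drop pieces failing any of the three checks."""
    final: List[Segment] = []
    for first, last in segments:
        piece_start = first
        for i in range(first + 1, last + 2):
            # Close the current piece at the segment end or at a gap that is too large.
            if i > last or (pos_bp[i] - pos_bp[i - 1]) / 1000 > max_gap_kb:
                piece_end = i - 1
                n_snp = piece_end - piece_start + 1
                length_kb = (pos_bp[piece_end] - pos_bp[piece_start]) / 1000
                if (n_snp >= min_snp and length_kb >= min_kb
                        and length_kb / n_snp <= max_kb_per_snp):
                    final.append((piece_start, piece_end))
                piece_start = i
    return final


# Toy data: 20 SNPs, one heterozygous call at index 8, one 610 kb gap before index 14.
genos = [0] * 8 + [1] + [0] * 11
pos_bp = [i * 10_000 + (600_000 if i >= 14 else 0) for i in range(20)]
scores = snp_scores(genos, window_snp=4, max_het=0, max_missing=0)
segments = call_segments(scores, threshold=0.25)
roh = split_and_filter(segments, pos_bp, max_gap_kb=100,
                       max_kb_per_snp=50, min_kb=50, min_snp=6)
```

In this toy run the heterozygous SNP separates two segments, and the second segment is split at the gap, leaving one piece with too few SNPs that is discarded. Real PLINK output may differ in edge handling, so the sketch is meant only to make the order of the checks concrete.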

In the literature, there is no consensus on whether SNP data should be pruned for linkage disequilibrium (LD) and/or minor allele frequency (MAF) before ROH analysis. In Table 1 we provide an overview of recent ROH studies on medium density genotypes using PLINK. Most studies apply MAF pruning with a threshold between 0.01 and 0.05, and some studies also perform LD pruning. For example, Bjelland et al. and Zhang et al. prune all SNPs with R² > 0.5 (using bins of 50 SNPs), resulting in a reduced set of 7997 and 14,366 SNPs (unpruned > 50,000 SNPs), respectively [11, 15]. Hence, this LD pruning results in a SNP reduction of more than 70%.
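As a rough illustration of what such pruning looks like in practice, the sketch below assembles PLINK command lines for MAF filtering and pairwise LD pruning and checks the size of the SNP reduction quoted above. The helper names, the file prefixes and the step size of 5 SNPs are assumptions for the example; the cited studies may have used different options. The flags themselves (--maf, --indep-pairwise, --make-bed, --bfile, --out and species flags such as --cow) are standard PLINK 1.9 options.

```python
from typing import List


def maf_filter_cmd(bfile: str, out: str, maf: float, species_flag: str) -> List[str]:
    """PLINK call that keeps only SNPs with minor allele frequency >= maf."""
    return ["plink", "--bfile", bfile, species_flag,
            "--maf", str(maf), "--make-bed", "--out", out]


def ld_prune_cmd(bfile: str, out: str, r2: float, species_flag: str,
                 window_snps: int = 50, step_snps: int = 5) -> List[str]:
    """PLINK call that writes the SNPs to keep to <out>.prune.in.

    The step size of 5 SNPs is an assumed value, not taken from the cited studies.
    """
    return ["plink", "--bfile", bfile, species_flag,
            "--indep-pairwise", str(window_snps), str(step_snps), str(r2),
            "--out", out]


def snp_reduction(n_kept: int, n_total: int) -> float:
    """Fraction of SNPs removed by pruning."""
    return 1.0 - n_kept / n_total


# Hypothetical file prefixes; the commands are only built here, not executed.
maf_cmd = maf_filter_cmd("mydata", "mydata_maf", maf=0.01, species_flag="--cow")
ld_cmd = ld_prune_cmd("mydata", "mydata_ld", r2=0.5, species_flag="--cow")

# With 50,000 as a lower bound on the unpruned set, both reductions exceed 70%.
reduction_small = snp_reduction(7997, 50_000)
reduction_large = snp_reduction(14_366, 50_000)
```

Because the unpruned sets contained more than 50,000 SNPs, the computed fractions are lower bounds on the actual reduction.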

Table 1 Literature review of ROH analysis settings on livestock species using medium density genotypes

The effect of minimal ROH length, either by the minimal number of SNPs or minimal kb length, has been thoroughly studied by Purfield et al. and Ferenčaković et al. [33, 34]. Purfield et al. concluded that a 50 K SNP array is suitable for identifying ROH longer than 5 Mb, whereas Ferenčaković et al. reasoned that the minimal ROH length should be adapted to the SNP density. They also found that heterozygous calls should be tolerated depending on the ROH length and SNP density [34]. Note that when more than one heterozygous SNP is allowed in a scanning window, adjacent heterozygous SNPs may cause separate homozygous segments to merge into segments that are longer than the original ones.

Howrigan et al. simulated genotypes to test PLINK’s ROH detection ability and varied several PLINK detection settings (--homozyg-window-snp, --homozyg-window-het, --homozyg-window-missing, --homozyg-window-threshold, --homozyg-snp) [35]. They concluded that data should be pruned for LD and MAF prior to analysis. However, Howrigan and colleagues did not vary scanning window sizes, maximal gap sizes, minimal density requirements (in kb/SNP) nor final ROH length in kb, although these parameters can affect the outcome [35].

There is a large variation in parameter settings for the maximal gap, minimal density and scanning window size (Table 1). Moreover, studies often do not report density, gap and/or window size settings. Both Howrigan et al. and Peripolli et al. underlined a lack of consensus criteria for ROH analyses [1, 35]. This lack of consensus will lead to biased results and hinders the comparison of results across studies.

In this paper, we provide guidelines for choosing PLINK parameter settings that ensure a robust and reliable ROH analysis. We used medium density genotypes in eight different livestock and pet species (pigs, cattle, sheep, cats, horses, goats, dogs and chickens). First, we evaluated the effect of MAF and LD pruning on ROH analysis. Second, we investigated the effects of the minimal density (--homozyg-density), the maximal interval between two SNPs in a ROH (--homozyg-gap), the scanning window size (--homozyg-window-snp) and the scanning window hit rate (--homozyg-window-threshold). Third, we introduce the genome coverage parameter to evaluate the validity of the ROH analysis and to estimate inbreeding based on ROH more accurately. These guidelines facilitate an adequate and robust ROH analysis, resulting in a higher overall quality and uniformity across studies.
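To make the parameters under study concrete, the following sketch collects them in one place and builds the corresponding --homozyg command line. The class name, function name, file prefixes and all numeric values are illustrative placeholders, not the settings recommended in this paper; only the PLINK flags themselves are taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RohSettings:
    """One explicit, reportable set of PLINK ROH detection settings."""
    window_snp: int          # --homozyg-window-snp
    window_het: int          # --homozyg-window-het
    window_missing: int      # --homozyg-window-missing
    window_threshold: float  # --homozyg-window-threshold
    gap_kb: int              # --homozyg-gap
    density_kb_per_snp: int  # --homozyg-density
    min_kb: int              # --homozyg-kb
    min_snp: int             # --homozyg-snp


def homozyg_cmd(bfile: str, out: str, s: RohSettings, species_flag: str) -> List[str]:
    """Build (but do not run) a PLINK ROH detection command."""
    return ["plink", "--bfile", bfile, species_flag, "--homozyg",
            "--homozyg-window-snp", str(s.window_snp),
            "--homozyg-window-het", str(s.window_het),
            "--homozyg-window-missing", str(s.window_missing),
            "--homozyg-window-threshold", str(s.window_threshold),
            "--homozyg-gap", str(s.gap_kb),
            "--homozyg-density", str(s.density_kb_per_snp),
            "--homozyg-kb", str(s.min_kb),
            "--homozyg-snp", str(s.min_snp),
            "--out", out]


# Placeholder values chosen only to show the shape of a call.
example = RohSettings(window_snp=50, window_het=1, window_missing=1,
                      window_threshold=0.05, gap_kb=500, density_kb_per_snp=70,
                      min_kb=1000, min_snp=50)
cmd = homozyg_cmd("mydata", "mydata_roh", example, species_flag="--dog")
print(" ".join(cmd))
```

Keeping every setting explicit in one object makes it straightforward to report the full parameter set alongside the results. The list can be passed to `subprocess.run(cmd, check=True)` on a system where PLINK is installed.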

Results

All analyses were performed on the eight different livestock and pet breeds. Results and figures for PIT, BB, MER and BUR are provided in the main manuscript, whereas results for SAA, ICE, LAB and BAR can be found in Additional files 1, 2, 3, 4, 5, 6 and 7.

Pruning for linkage disequilibrium

The results of pruning for varying LD levels prior to ROH analysis for PIT, BB, MER and BUR are shown in Fig. 1; results for SAA, ICE, LAB and BAR are given in Additional file 2: Figure S1. The effect of LD pruning on the outcome of the ROH analysis was population dependent. Although maximal genome coverage was reached at R² > 0.25 in some populations (e.g. BB), not all ROH were detected and F_ROH estimates were lower than without pruning for LD. In PIT, maximal genome coverage was reached more slowly than in other populations (e.g. BB).

Pruning for minor allele frequency

In PIT and BUR, we observed that even mild MAF pruning (0.01) had an impact on ROH detection in several genomic regions. Figure 2 shows ROH incidence per SNP (in % of the total population) for both populations without MAF pruning (left) and with MAF pruning at 0.01 (right). For PIT, ROH islands were observed on SSC8 and SSC18, whereas for BUR, a change in observed ROH was found on e.g. B3, D1 and D3. These ROH in PIT and BUR would not have been detected if MAF pruning had been performed. For the six other populations, little difference was observed in genome coverage and F_ROH estimates when varying MAF pruning levels.
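ROH incidence per SNP, as plotted in Fig. 2, is the percentage of individuals for which a given SNP lies inside a detected ROH. A minimal sketch of that calculation follows. The input layout (one list of inclusive SNP index ranges per individual) and the function name are assumptions for the example, not the format PLINK writes.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (first SNP index, last SNP index), inclusive


def roh_incidence(n_snps: int, roh_by_individual: List[List[Segment]]) -> List[float]:
    """Percentage of individuals in which each SNP falls inside a ROH."""
    counts = [0] * n_snps
    for segments in roh_by_individual:
        covered = set()
        for first, last in segments:
            covered.update(range(first, last + 1))
        # Count each SNP at most once per individual, even if segments overlap.
        for i in covered:
            counts[i] += 1
    n_individuals = len(roh_by_individual)
    if n_individuals == 0:
        return [0.0] * n_snps
    return [100.0 * c / n_individuals for c in counts]


# Four hypothetical individuals over five SNPs; the third has no ROH at all.
incidence = roh_incidence(5, [[(0, 2)], [(1, 3)], [], [(1, 1)]])
```

Comparing such a profile before and after pruning shows where ROH disappear, which is how regional effects like those reported for PIT and BUR become visible.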

Minimal density requirement

Figure 3 presents the genome coverage (in %) and the estimated F_{ROH, aut} and F_{ROH, cov} by varying density for PIT, BB, MER and BUR (results for SAA, ICE, LAB and BAR are shown in Additional file 3: Figure S2). All investigated populations showed low genome coverage with a density below 40 kb/SNP. Starting from a mean density of 40 kb/SNP, genome coverage increased, and maximal coverage was reached between 60 and 75 kb/SNP.
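The relation between genome coverage and the two F_ROH estimates can be illustrated with a short calculation. The definitions are not spelled out in this section, so the sketch assumes that F_{ROH, aut} divides the summed ROH length by the total autosomal genome length, that F_{ROH, cov} divides it by the part of the genome the analysis can actually cover under the chosen settings, and that genome coverage is the ratio of the two denominators. Function names and numbers are hypothetical.

```python
from typing import List


def genome_coverage(covered_kb: float, autosome_kb: float) -> float:
    """Assumed definition: fraction of the autosomal genome in which ROH can be detected."""
    return covered_kb / autosome_kb


def f_roh_aut(roh_lengths_kb: List[float], autosome_kb: float) -> float:
    """Summed ROH length relative to the full autosomal genome length."""
    return sum(roh_lengths_kb) / autosome_kb


def f_roh_cov(roh_lengths_kb: List[float], covered_kb: float) -> float:
    """Summed ROH length relative to the genome length covered by the analysis."""
    return sum(roh_lengths_kb) / covered_kb


# Hypothetical individual: three ROH, a 2.5 Gb autosomal genome, 2.0 Gb covered.
roh_kb = [5_000.0, 12_000.0, 3_000.0]
coverage = genome_coverage(2_000_000.0, 2_500_000.0)
f_aut = f_roh_aut(roh_kb, 2_500_000.0)
f_cov = f_roh_cov(roh_kb, 2_000_000.0)
```

Under these assumptions, F_{ROH, aut} is lower than F_{ROH, cov} whenever coverage is below 100%, and the two coincide at full coverage.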

Maximal gap requirement

The results for varying maximal gap settings in ROH analyses for PIT, BB, MER and BUR are shown in Fig. 4; results for SAA, ICE, LAB and BAR are given in Additional file 4: Figure S3. All investigated populations reached maximal genome coverage using gap sizes around 500 kb. Below 500 kb, both genome coverage and F_{ROH, cov} estimates decreased. In general, F_{ROH, aut} decreased faster than F_{ROH, cov}.

Scanning window size and threshold

An increasing scanning window size led to a decrease in estimated F_ROH, where especially short ROH were no longer detected. Similarly, an increasing threshold resulted in a decreasing F_ROH. For both settings, genome coverage did not vary. Results are shown in Additional file 5: Figure S4 and Additional file 6: Figure S5.

Validation using a model based approach for ROH detection

In general, the model based approach (RZooRoH) yielded higher F_ROH estimates than the rule based approach (PLINK) (Fig. 1 vs Fig. 5). This can be attributed to the less stringent constraints of the model based approach (e.g. no minimal ROH length). Pearson correlations of individual F_ROH between PLINK and RZooRoH were high (r = 0.89–0.99) for all populations (no LD nor MAF pruning performed).
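The agreement between the two approaches is summarised by a Pearson correlation of per-individual F_ROH values. A self-contained sketch of that statistic is given below. The function name and the example values are hypothetical and are not taken from the study data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long series of F_ROH estimates."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-individual estimates: the second method is uniformly higher.
f_rule_based = [0.05, 0.10, 0.20]
f_model_based = [0.07, 0.12, 0.22]
r = pearson_r(f_rule_based, f_model_based)
```

A correlation close to 1 indicates that both methods rank individuals alike even when one yields systematically higher estimates, as in the constant offset used in this toy example.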

Results for varying LD levels prior to ROH analysis using a model based approach (RZooRoH) for PIT, BB, MER and BUR are shown in Fig. 5, while results for SAA, ICE, LAB and BAR are added in Additional file 7: Figure S6. MAF pruning using RZooRoH revealed the same results: in PIT and BUR, the same effects of even mild MAF pruning (0.01) on ROH detection were observed (Fig. 2 vs Fig. 6), whereas in the other six populations no substantial differences were apparent.

Discussion

To unravel the effects of PLINK parameter settings on ROH estimation using medium density SNP data, we analyzed these settings in eight different livestock and pet species. We examined the effects of pruning for LD and/or MAF on ROH detection and genome coverage. Next, we investigated the effect of the previously unstudied PLINK parameters.