Skip to main content


Weighted likelihood inference of genomic autozygosity patterns in dense genotype data



Genomic regions of autozygosity (ROA) arise when an individual is homozygous for haplotypes inherited identical-by-descent from ancestors shared by both parents. Over the past decade, they have gained importance for understanding evolutionary history and the genetic basis of complex diseases and traits. However, methods to infer ROA in dense genotype data have not evolved in step with advances in genome technology that now enable us to rapidly create large high-resolution genotype datasets, limiting our ability to investigate their constituent ROA patterns.


We report a weighted likelihood approach for inferring ROA in dense genotype data that accounts for autocorrelation among genotyped positions and the possibilities of unobserved mutation and recombination events, and variability in the confidence of individual genotype calls in whole genome sequence (WGS) data.


Forward-time genetic simulations under two demographic scenarios that reflect situations where inbreeding and its effect on fitness are of interest suggest this approach is better powered than existing state-of-the-art methods to infer ROA at marker densities consistent with WGS and popular microarray genotyping platforms used in human and non-human studies. Moreover, we present evidence that suggests this approach is able to distinguish ROA arising via consanguinity from ROA arising via endogamy. Using subsets of The 1000 Genomes Project Phase 3 data we show that, relative to WGS, intermediate and long ROA are captured robustly with popular microarray platforms, while detection of short ROA is more variable and improves with marker density. Worldwide ROA patterns inferred from WGS data are found to accord well with those previously reported on the basis of microarray genotype data. Finally, we highlight the potential of this approach to detect genomic regions enriched for autozygosity signals in one group relative to another based upon comparisons of per-individual autozygosity likelihoods instead of inferred ROA frequencies.


This weighted likelihood ROA inference approach can assist population- and disease-geneticists working with a wide variety of data types and species to explore ROA patterns and to identify genomic regions with differential ROA signals among groups, thereby advancing our understanding of evolutionary history and the role of recessive variation in phenotypic variation and disease.


Genomic regions of autozygosity (ROA) reflect homozygosity for haplotypes inherited identical-by-descent (IBD) from an ancestor shared by both maternal and paternal lines. Common ROA are a source of genetic variation among individuals that can provide invaluable insight into how population history, such as bottlenecks and isolation, and “sociogenetic” factors, such as frequency of consanguineous marriage, influence genomic variation patterns. Population-genetic studies in worldwide human populations over the past decade have found ROA ranging in size from tens of kb to multiple Mb to be ubiquitous and frequent even in ostensibly outbred populations [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28] and to have a non-uniform distribution across the genome [7, 10, 13, 18] that is correlated with spatially variable genomic properties [2,3,4, 18] creating autozygosity hotspots and coldspots [18]. ROA of different sizes have different continental patterns both with regards to their total lengths in individual genomes [12, 18, 22, 24, 26,27,28] and in their distribution across the genome [18] reflecting the distinct forces generating ROA of different lengths. Studies of ROA in the genomes of ancient hominins [29,30,31] and early Europeans [32] have provided unique insights into the mating patterns and effective population sizes of our early forbearers. In non-humans, ROA patterns have provided insights into the differential histories of woolly mammoth [33], great ape [34, 35], cat [36], canid [37,38,39,40,41,42,43], and avian [44] populations, while in livestock breeds they have contributed to our understanding of their origins, relationships, and recent management [42, 45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62] and the lasting effects of artificial section [59, 62,63,64,65,66,67,68,69,70,71,72,73,74,75], as well as informed the design of ongoing breeding [76, 77] and conservation [48, 58, 78] programs [79].

In contemporary human populations, increased risks for both monogenic [80,81,82,83,84] and complex [85,86,87,88,89,90,91,92] disorders as well as increased susceptibility to some infectious diseases [93,94,95] have been observed among individuals with higher levels of parental relatedness. While the association between parental relatedness and monogenic disease risk has been known for more than a century [96], associations with complex and infectious diseases potentially reflect elevated levels of autozygosity as a consequence of prescribed and unintentional inbreeding [97] that enrich individual genomes for deleterious variation carried in homozygous form [98, 99]. Indeed, genomic autozygosity levels have been reported to influence a number of complex traits, including height and weight [100,101,102,103], cognitive ability [103,104,105], blood pressure [106,107,108,109,110,111,112,113], and cholesterol levels [113], as well as risk for complex diseases such as cancer [86, 87, 114,115,116,117,118], coronary heart disease [86, 119,120,121], amyotrophic lateral sclerosis (ALS) [122], and mental disorders [123, 124]. These observations are consistent with the view that variants with individually small effect sizes associated with complex traits and diseases are more likely to be rare than to be common [125,126,127,128], are more likely to be distributed abundantly rather than sparsely across the genome [9, 129], and are more likely to be recessive than to be dominant [9, 130]. Recent studies investigating ROA and human disease risk have identified both known and novel loci associated with standing height [131], rheumatoid arthritis [132], early-onset Parkinson’s disease [133], Alzheimer’s disease [134, 135], ALS [122], schizophrenia [4, 136], thyroid cancer [118], and Hodgkin lymphoma [117, 137]. Thus, just as ROA sharing among affected individuals has facilitated our understanding of the genetic basis of monogenic disorders [138] in both inbred [139,140,141,142] and more outbred [143,144,145] families, it also represents a potentially powerful approach with which to further our understanding of the genetic etiology of complex disorders [146] of major public health concern worldwide.

In both population- and disease-genetic studies, ROA are frequently inferred from runs of homozygous genotypes (ROH) present in genome-wide single nucleotide polymorphism (SNP) data obtained using high-density microarray platforms [147]. A popular program for ROH identification is PLINK [148], which uses a sliding window framework to find stretches of contiguous homozygous genotypes spanning more than a certain number of SNPs and/or kb, allowing for a certain number of missing and heterozygous genotypes per window to account for possible genotyping errors. While a number of more advanced ROA identification approaches have been proposed [149, 150], a recent comparison found the PLINK method to outperform these alternatives [151]. We recently proposed to infer ROA using a sliding-window framework and a logarithm-of-the-odds (LOD) score measure of autozygosity [1, 152] that offers several key advantages over the PLINK method [18]. First, it is not reliant on fixed parameters for the number of heterozygous and missing genotypes when determining the autozygosity status of a window, instead incorporating an assumed genotyping error rate, making it more robust to missing data and genotyping errors. Second, it incorporates allele frequencies in the general population to provide a measure of the probability that a given window is homozygous by chance, allowing homozygous windows to be more readily distinguished from autozygous windows. These important advances would be expected to provide greater sensitivity and specificity for the detection of ROA in high-density SNP genotype data, particularly in the presence of the higher and more variable genotype error rates in next-generation sequence (NGS) data [153, 154].

A shortcoming of the LOD method is that correlations between SNPs within a window that occur as a consequence of linkage disequilibrium (LD) are ignored, leading to overestimation of the amount of information that is available in the data and potentially false-positive detection of autozygosity signals. In addition, the LOD method does not account for the possibility of recent recombination events onto very similar haplotype backgrounds that might give the appearance of autozygosity when paired with a non-recombined haplotype [155]. Such a scenario would, for example, arise when ROA are detected in microarray-based genotype data that comprises information at only a limited set of positions within a genomic interval and is therefore blind to unobserved genetic differences that make the apparently identical haplotypes distinct.

Here, we report an improved LOD-based ROA inference method that accounts for the non-independence between SNPs and the likelihoods of unobserved mutation and recombination events within a window. We compare the performance of this new method against the original LOD method as well as PLINK [148] and a recently reported method implemented in the BCFtools software package [156] in simulated genetic datasets. We then evaluate how ROA inference is influenced by the source and density of interrogated markers using the 26 human populations included in Phase 3 of The 1000 Genomes Project [157], considering the entire whole-genome sequence (WGS) dataset as well as subsets representing SNPs present in the exome and included on two commonly used Illumina BeadChips. We show that population differences in genome-wide ROA patterns inferred from WGS data using our improved LOD-based method recapitulate those observed in our earlier BeadChip-based study that used the original LOD method [18]. Finally, we highlight the unique ability of our improved LOD-based method to identify genomic regions enriched for autozygosity signals in one group relative to another without first inferring ROA through the direct comparison of weighted LOD scores, finding nine regions that significantly differ in the strength of their autozygosity signals between apparent subgroups within the Asian Indian Gujarati, Punjabi, and Telugu populations. Our improved ROA inference method will assist population- and disease-geneticists working with a wide variety of data types and species to explore ROA patterns and to identify genomic regions with differential ROA signals, thereby facilitating our understanding of the role of recessive variants in phenotypic variation and disease.


Weighted likelihood autozygosity estimator

We previously reported an ROA inference approach that was based on a number of earlier methods [1, 152] in which a likelihood-based autozygosity estimator is applied in a sliding window framework where window size is defined as a fixed number of SNPs [18]. In this approach, within window w in individual i from population j, the LOD score of autozygosity is calculated across the K SNP markers within window w, where we observe genotype G k at the k th SNP that has state X k , which equals 1 if the SNP is autozygous and 0 otherwise.

$$ LOD\left(w,i\right)=\sum \limits_{k=1}^K{\mathit{\log}}_{10}\left(\frac{\Pr \left(\left.{G}_k\right|{X}_k=1\right)}{\Pr \left(\left.{G}_k\right|{X}_k=0\right)}\right) $$

The per-SNP likelihoods of autozygosity and non-autozygosity are based on Hardy-Weinberg proportions (Table 1) and include population-specific allele frequencies and an assumed rate of genotyping errors and mutations ε. Missing genotypes are ignored in this algorithm; that is, they have a log-likelihood of zero. The log-likelihood of autozygosity for homozygous SNPs is positive and decreases exponentially as a function of allele frequency (Additional file 1: Figure S1A). The log-likelihood of autozygosity for heterozygous SNPs is instead negative and equal to log10 (ε), thus acting as a penalty for the presence of heterozygous genotypes within a window.

Table 1 Per-SNP likelihoods of autozygosity and non-autozygosity

To address the apparent shortcomings of the LOD score method, we developed a weighted LOD-based method (wLOD) that accounts for non-independence among SNPs and the probabilities of recombination and mutation within window w.

$$ wLOD\left(w,i\right)=\sum \limits_{k=1}^K\left[{\mathit{\log}}_{10}\left(\frac{\Pr \left(\left.{G}_k\right|{X}_k=1\right)}{\Pr \left(\left.{G}_k\right|{X}_k=0\right)}\right)\begin{array}{l}\times \mathrm{Corr}\left({p}_k,\left[{p}_1,{p}_K\right]\right)\\ {}\times \Pr \left(\left. no\ recombination\right|\left[{g}_{k-1},{g}_k\right]\right)\\ {}\times \Pr \left(\left. no\ mutation\right|\mu, \left[{p}_{k-1},{p}_k\right]\right)\end{array}\right] $$

Here, we adapt the approach of Chen et al. [158] to incorporate LD information into the wLOD(w, i) estimator, weighting the log-likelihood of SNP k by the reciprocal of the sum of pairwise LD between SNP k and all other SNPs within window w calculated as

$$ \mathrm{Corr}\left({p}_k,\left[{p}_1,{p}_K\right]\right)=\frac{1}{\sum_{l=1}^K{LD}_{k,l}} $$

and bounded in the interval [1/K,1]. An intuitive explanation for this correction is that when a number of SNPs are highly correlated they provide redundant information. By weighting the log-likelihood for SNP k as a function of its correlation with all other SNPs within window w it contributes only the unique autozygosity information it possesses to wLOD(w, i).

LD reflects historical recombination and mating patterns in a population and is largely insensitive to the effects of mating patterns within the last few generations that can, through recombination events onto very similar haplotype backgrounds, lead to false-positive autozygosity signals [155]. If a recombination event occurred in the recent past we would want to place the ROA boundary at that position. However, the direct detection of such events becomes increasingly challenging as the distance between genotyped positions increases, particularly when their genotype patterns are highly correlated. Thus, we also weight the log-likelihood of SNP k by the probability of no recombination events having occurred within the genomic interval bounded by SNP k − 1 and SNP k in the last M generations, calculated based upon their genetic map position g (in Morgans) as previously described [10, 159]:

$$ \Pr \left(\left. no\ recombination\right|\left[{g}_{k-1},{g}_k\right]\right)={e}^{-2M\left({g}_k-{g}_{k-1}\right)} $$

Here, M reflects the expected minimum number of meioses since the most recent common ancestor (MRCA) for a pair of IBD haplotypes [148], tuning the sensitivity of the wLOD estimator to the expected age of IBD haplotypes underlying ROA without limiting it to only ROA of that age. In a population-genetic context, M can be set based upon effective population size estimates and the probability that a pair of individuals share a common ancestor M generations in the past [160], while in a disease-genetic context M can instead be set based on known relationships between affected individuals.

Finally, we account for the potential presence of unobserved genetic differences within the genomic interval bounded by SNP k − 1 and SNP k by weighting the log-likelihood of SNP k by the probability of no unobserved mutation events having occurred within the genomic interval in the last M generations, calculated based upon their physical map position p (in bp) and a per-base mutation rate μ using an approach adapted from MacLeod et al. [161]:

$$ \Pr \left(\left. no\ mutation\right|\mu, \left[{p}_{k-1},{p}_k\right]\right)={e}^{-2 M\mu \left({p}_k-{p}_{k-1}\right)} $$

As evident in Figure S1B (Additional file 1), the recombination and mutation weightings reduce the log-likelihood of SNP k as a function of its distance from SNP k − 1, reflecting the increased likelihood of recombination and mutation events going undetected as the size of the uninterrogated region increases. At M = 7, the log-likelihood of SNP k is reduced by ~ 50% at an inter-marker distance of ~ 300 kb and ~ 85% at ~ 100 kb. When the wLOD method is applied to high-density SNP genotype datasets, such as that created by WGS, the recombination and mutation weighting will have a minimal effect on the wLOD score of window w. However, when applied to lower-density SNP genotype data, such as that created by genotyping microarrays―particularly those available for non-human species such as dogs and cows―the much larger inter-marker distances in these datasets will lead the recombination and mutation weighting to have a much larger effect on the wLOD score of window w.

It can also be seen that as M decreases the magnitude of the change in the weighting with increasing distance also decreases; thus, wLOD scores in populations with small effective population sizes or in disease studies where affected individuals share a more recent common ancestor (smaller M) will be adjusted to a lesser extent than those with larger effective population sizes or where affected individuals share a more distant common ancestor (larger M).

Properties of the wLOD estimator

To investigate how the LD, recombination, and mutation corrections implemented in the wLOD estimator influence per-window autozygosity likelihoods, we compared them with those of the LOD estimator using The 1000 Genomes Project Phase 3 dataset that contains phased genotypes for 84,801,880 genetic variants discovered using a low-coverage WGS approach in 2436 unrelated individuals from 26 human populations (Table 2) [157]. To approximate a typical microarray-based SNP genotyping study, we first developed a subset of this dataset that contained 2,166,414 autosomal SNPs that are present on the popular Illumina HumanOmni2.5-8 BeadChip (“Omni2.5 dataset” henceforth). In all analyses, μ was set to 1.18 × 10−8 [162] and ε was set to 4.71 × 10−4, the average rate of discordance across samples between genotypes in our Omni2.5 dataset and those obtained for 1693 of the 2436 individuals directly with the Illumina HumanOmni2.5 BeadChip [157]. Unless otherwise stated, M was set to seven, a conservative value broadly reflecting the average of effective population size estimates for populations included in The 1000 Genome Project [157, 160, 163]. Window size was varied in an arbitrary interval [K 0, K n ] in which K is increased in 10 SNP increments (i.e. K n  = K 0 + [10 × n]).

Table 2 Populations included in Phase 3 of The 1000 Genomes Project

The genome-wide distribution of wLOD scores for all windows in the Omni2.5 dataset is bimodal and centered around 0 (Fig. 1a), with wLOD scores under the left-hand mode favoring the hypothesis of non-autozygosity, whereas those under the right-hand mode favor the autozygosity hypothesis. The area under the right-hand mode decreases as a function of window size as ROA are progressively covered by fewer but longer windows. In addition, while the location of the right-hand mode does not change appreciably with window size, there is a noticeable shift toward lower wLOD scores in the left-hand mode with increasing window size, likely reflecting the larger number of heterozygous SNPs in non-autozygous compared with autozygous regions and their greater cumulative effect on wLOD scores with increasing window size. This shift progressively increases the distance between the non-autozygous and autozygous modes until either the autozygous mode disappears (Fig. 1b) or the intermodal distance begins to decrease instead (Additional file 1: Figure S2), both potentially reflecting the point above which window lengths exceed those of the majority of ROA in the sample set. In this scenario, as window size increases, autozygous windows increasingly overlap non-autozygous regions flanking shorter ROA leading them to encompass greater numbers of heterozygotes within these non-autozygous regions, deflating their wLOD scores. Whether the autozygous mode disappears or shifts toward lower wLOD scores is likely determined by the abundance of ROA and their levels of support in the sample set: sets with fewer ROA and ROA with generally lower wLOD scores lose their autozygous mode while those with large numbers and higher wLOD scores have it shift toward the non-autozygous mode. Nevertheless, the location of the minimum between the two modes does shift subtly toward higher wLOD scores with increasing window size, reflecting the expected increase in scores for autozygous windows as a function of the number of SNPs within the window. The periodicity observed in the genome-wide score distribution of the original LOD estimator [18] is absent with the wLOD estimator, indicating that this property was a reflection of LD among SNPs within the window.

Fig. 1

Distribution of genome-wide wLOD scores in European Americans. a Each line represents the Gaussian kernel density estimates of the pooled wLOD scores from all 97 individuals in the European American (CEU) population at window sizes between 40 and 200 SNPs in 10 SNP increments in the Omni2.5 dataset. The largest window size that produced a clear bimodal distribution (150 SNPs) is shown in black. b The change in intermodal distance with increasing window size in the CEU population. These patterns are representative of those observed in all other populations in the dataset

To evaluate how the improvements incorporated into the wLOD estimator (Eq. 2) influence per-window scores as compared to the original LOD estimator (Eq. 1), we compared wLOD and LOD scores in the Omni2.5 dataset with a window size of 150 SNPs (Additional file 1: Figure S3A), the largest value that produced a clear bimodal wLOD score distribution in all 26 populations (Additional file 1: Figure S4). Across populations, per-window wLOD scores differed from their corresponding LOD scores by between −103.87 and 454.07 (Additional file 1: Figure S3B) with the range and average of wLOD and LOD score differences increasing as a function of a population’s geographic distance from East Africa (ρ = 0.8460 with P = 5.029 × 10−6 and ρ = 0.8846 with P = 4.961 × 10−7, respectively), reflecting increasing LD [164, 165] and decreasing genetic diversity [97, 166,167,168,169]─leading to larger distances between polymorphic SNPs─with distance from Africa. Among the six admixed populations included in Phase 3 of The 1000 Genomes Project, those of mixed African and European ancestry (ACB and ASW) had smaller ranges and averages of wLOD and LOD score differences than those of mixed of Amerindian and European ancestry (CML, MXL, PUR, and PEL), consistent with the lower LD [170,171,172] and higher genetic diversity [169, 173] of admixed African-European populations compared with Amerindian-European populations.

Across populations, 5.15–47.93% of all windows in the right-hand “autozygous” mode with the LOD estimator were present in the left-hand “non-autozygous” mode with the wLOD estimator (Additional file 1: Figure S3C) potentially reflecting false-positive autozygosity signals reported by the LOD estimator as a consequence of non-independence among homozygous SNPs that cumulatively give the mistaken impression of autozygosity. The proportion of windows was lowest in African populations and highest in most European populations, increasing incrementally through Central/South Asian and East Asian populations. This pattern can be explained by population differences in the location of the autozygous mode and its shift toward lower scores with the wLOD estimator. Modal LOD and wLOD scores in the autozygous mode are generally smallest and most similar in European populations and highest and least similar in African populations (Additional file 1: Figure S5A). Thus, for a given unit decrease in score between the LOD and wLOD estimators, an autozygous LOD window has a greater chance of transitioning to the non-autozygous wLOD mode in Europeans populations than in African populations. Consistent with this hypothesis, the magnitude of the difference between modal LOD and wLOD scores in the autozygous mode and the location of the minima between the autozygous and non-autozygous modes is significantly negatively correlated with the proportion of autozygous LOD windows that transition to the non-autozygous wLOD mode (r = −0.8654, P = 1.156 × 10−8; Additional file 1: Figure S5B).

In contrast, across populations only 0.055–5.015% of all windows in the non-autozygous mode with the LOD estimator were present in the autozygous mode with the wLOD estimator (Additional file 1: Figure S3D), potentially reflecting false-negative autozygosity signals reported by the LOD estimator as a consequence of heterozygotes in high LD with a larger number of homozygotes that, in one possibility, might reflect genotyping errors. The proportion of windows was highest in most African populations and lowest in most European populations, with broadly similar values observed in Central/South and East Asian populations. This pattern is the opposite of that observed with the putative false-positive windows above, and can also be explained by population differences in the location of the autozygous mode and its shift toward lower scores with the wLOD estimator. The addition of a single heterozygote to an autozygous window in the European populations has a greater chance of transitioning it from the autozygous to non-autozygous mode than in the African populations since the autozygous mode is situated much closer to the minima between the two modes (Additional file 1: Figure S5).

Overall, the much larger numbers of windows transitioning from the autozygous to the non-autozygous mode than vice versa between the LOD and wLOD estimators accord with the expectation that the LOD estimator frequently overestimates the amount of information available in the data leading it to falsely report autozygosity signals particularly in genomic regions with higher levels of LD, while it underestimates the amount of information much less frequently.

ROA inference with the wLOD estimator

To infer ROA with our wLOD method, we must select an appropriate wLOD score threshold above which genomic windows are considered autozygous. A natural threshold above which to consider a window as autozygous is the location of the minimum between the non-autozygous and autozygous modes in its wLOD score distribution [18]. Sample size was not observed to appreciably influence the location of the minimum between the non-autozygous and autozygous modes (Additional file 1: Figure S6). However, across 100 random samples of individuals greater consistency in its determination was observed with increasing sample size, particularly compared with sample sizes of less than 10 individuals, indicating that 10 or more individuals should be used to ensure a robust estimate of the threshold is obtained. All windows with wLOD scores above threshold are considered autozygous [18], and overlapping autozygous windows are joined to define ROA.

As each SNP is included in multiple windows (i.e. a SNP is included in 50 different windows at a window size of 50), SNPs near the boundaries of true ROA will be included in both autozygous and non-autozygous windows as the sliding window enters and leaves the ROA. To improve the accuracy of ROA inferences when using a sliding-window approach, we require a SNP to be covered by a certain proportion of autozygous windows before it is placed within an ROA [148], with an overlap fraction of 0.25 previously recommended for use with the original LOD ROA inference method [174].

Accuracy of the wLOD estimator

To evaluate the sensitivity and specificity of the wLOD method to detect ROA in dense genotype data, we simulated 50 independent replicates of genetic data under two demographic scenarios that are broadly representative of situations in which inbreeding and its effect on fitness are of interest as previously described [175] except that we considered a non-uniform distribution of recombination rates across the simulated chromosomes and allowed all base pairs to be mutatable (see Methods). Scenario 1 considered a small partially isolated population of constant effective size (N e = 75) that receives approximately one migrant per generation, simulated for 150 generations (4350 years for a generation time of 29 years [176]). Scenario 2 considered a medium sized closed population (N e = 500 simulated for 100 generations [2900 years]). Each simulated dataset consisted of a single 250 Mb chromosome upon which ~750,000 polymorphic single-nucleotide variants (SNVs) segregate, consistent with the SNV density and length of chromosome 1 in The 1000 Genomes Project Phase 3 WGS data.

Separately for each simulated dataset, we applied the wLOD estimator considering windows of between 50 and 500 SNPs (in 10 SNP increments), and a proportion of overlapping autozygous windows used to construct ROA of between 0 and 50% (in 1% increments). We then calculated three measures of how well inferred ROA agreed with true ROA reported by the simulation program. First, we calculated the power of the wLOD method to detect true ROA, defined here as the total length of true ROA that is overlapped by inferred ROA divided by the total length of true ROA. Second, we calculated its false positive rate as the total length of inferred ROA that does not overlap with true ROA divided by the total length of inferred ROA. Finally, for all true ROA detected with the wLOD method, we calculated the ratio of inferred ROA length and true ROA length for all ROA. Here, ratios greater than one indicates a tendency to overcall ROA by falsely including non-autozygous regions near the boundaries of true ROA, while ratios below one indicate a tendency to instead undercall an ROA by falsely excluding true autozygous regions near the boundaries of true ROA [174].

As can be seen in Fig. 2a, large numbers of false positive ROA calls are made by the wLOD method with a window size of 50 SNPs, decreasing markedly as the window size and the proportion of overlapping windows required during ROA construction increases. These patterns are consistent with the observation that false positive ROA calls are very small―on average 16.97 kb (standard deviation [SD] = 3.85) with a window size of 50 SNPs―and therefore delineated by a few erroneous autozygous windows that progressively fail to meet the required threshold during ROA calling as the window overlap fraction increases. Once window size reaches ~90 SNPs, the wLOD estimator is able to distinguish autozygosity from homozygosity-by-chance with great precision. Conversely, numbers of false negative ROA calls increase as a function of window size and overlap fraction (Fig. 2b). These patterns are consistent with the expectation that as window size increases smaller ROA increasingly go undetected (Additional file 1: Figure S7A), likely as a result of them being spanned by progressively fewer but larger windows and their autozygosity signal being increasingly masked by the inclusion of non-autozygous flanking regions in the wLOD score calculation. Similarly, higher overlap fractions also lead to small ROA spanned by just a small number of autozygous windows increasingly going undetected (Additional file 1: Figure S7D) as they fail to meet the required threshold. Nevertheless, overall power to detect ROA with the wLOD method only decreases slightly as window size and overlap fraction increase (Fig. 2c), consistent with the expectation that at larger window sizes (Additional file 1: Figure S7B) and overlap fractions (Additional file 1: Figure S7E) the sliding window approach will have increasing difficulty in detecting smaller ROA but nonetheless retains high power to detect longer ROA. Finally, ratios of inferred to true ROA length increase as a function of window size and decrease as a function of overlap fraction (Fig. 2d), reflecting the tendency of the wLOD method to overcall the boundaries of smaller ROA at larger window sizes (Additional file 1: Figure S7C) and smaller overlap fractions (Additional file 1: Figure S7F) with those of longer ROA affected to a much lesser extent. All together, these patterns suggest that a suitable point within the parameter space at which to evaluate the sensitivity and specificity of the wLOD method will be the smallest window size and overlap fraction combination at which no false-positive ROA are inferred and the average ratio of inferred to true ROA length is approximately one (Additional file 2: Table S1), striking a balance between sensitivity to detect smaller ROA and the overall accuracy of ROA calls.

Fig. 2

Performance of the wLOD method across different window sizes and overlap fractions. For scenario 1 and the 750,000 polymorphic SNV datasets, a three-dimensional (3D) bar graph depicting the average number of falsely discovered ROA (a) as well as 3D scatterplots depicting the average number of false negative ROA (b), average power (c), and average ratio of inferred and true ROA lengths (d) reported by the wLOD method for each window size and overlap fraction across the 50 replicates are shown. In each graph, the point representing the smallest combination of window size and overlap fraction that had an average number of falsely discovered ROA of 0 and an average ratio of inferred and true ROA lengths of about 1 is shown in black

To evaluate how SNV density influences the sensitivity and accuracy of ROA inference with the wLOD method, we created three subsets of the simulated WGS datasets that reflect the SNV densities of commonly used human microarray-based genotyping platforms: Illumina’s HumanOmni2.5-8 (125,000 SNVs) and OmniExpress-24 (50,000 SNVs) BeadChips and Affymetrix’s Genome-Wide Human SNP 6.0 Microarray (80,000 SNVs). In addition, we included subsets with SNV densities consistent with the genotyping platforms used by ROA studies in cattle and dogs: Illumina’s Bovine HD (80,000 SNVs) and Canine HD (18,000 SNVs) BeadChips. ROA inference accuracy was evaluated exactly as described above for the WGS datasets at the optimal window size and overlap fraction determined separately for each SNV density and demographic scenario (Additional file 2: Table S1). Interestingly, optimal window size varied only slightly across the different SNV densities, lying between 60–130 SNPs and 70–120 SNPs for scenarios 1 and 2, respectively, but nevertheless increasing as a function of SNV density. The optimal window overlap fraction did however vary more widely, increasing as a function of SNV density and lying between 7–37% and 5–32% for scenarios 1 and 2, respectively.

As would be expected, the power of the wLOD method to detect ROA increases as a function of ROA length and the density of SNV in the genetic dataset (Fig. 3). While ROA longer than 1 Mb are captured extremely well (> 99.7%) at all SNV densities explored, the detection of ROA shorter than ~1 Mb decreases appreciably as a function of SNV density. Nevertheless, even with only ~ 18,000 SNVs (1 SNV every ~ 14 kb) the wLOD method is able to capture 96.3 and 89.0% of ROA under scenarios 1 and 2, respectively, with this increasing to 99.9% for both scenarios with 750,000 SNVs (1 SNV every ~ 333 bp). However, false discovery rates do increase dramatically with decreasing SNV density, particularly for smaller ROA (Fig. 3) where they jump from 0.0045 to 0.0069 with 750,000 SNVs to 0.0445 and 0.1362 with 18,000 SNVs for scenarios 1 and 2, respectively, while longer ROA are much less affected: 0.0010 and 0.0001 with 750,000 SNVs increasing to 0.0200 and 0.0495 with 18,000 SNVs for ROA ≥ 5 Mb, respectively. It should be noted that these false discovery rates are solely the result of overcalling true ROA and not erroneous ROA calls. This is reflected in the ratios of inferred to true ROA length (Fig. 3) that increase with decreasing SNV density, particularly for smaller ROA, and approach―but never quite reach―one with increasing ROA length.

Fig. 3

Performance of the wLOD method at different SNV densities. Line graphs showing how average power (top), false discovery rate (middle), and ratio of inferred and true ROA length (bottom) across 50 replicate genetic simulations change with increasing ROA length for each SNV subset under (a) scenario 1 and (b) scenario 2. Each comparison was performed at the optimal combination of window size and overlap fraction for that scenario and SNV subset (Additional file 2: Table S1). The grey vertical lines denote 500 kb (dashed) and 1.5 Mb (dotted), frequently applied length thresholds used to categorize ROA arising due to LD (< 500 kb) and inbreeding (> 1.5 Mb) in humans [9]. Note that in scenario1, power to detect ROA > 1 Mb with 18,000 SNVs surpasses that with 50–125,000 SNVs as a consequence of the optimal overlap fraction used: the overlap fraction of 0 used for the 18,000 SNV dataset is much lower than the 0.15–0.22 fractions used for the 50–125,000 SNV datasets. Consequently, greater power to detect ROA > 1 Mb is achieved with 18,000 SNVs than is possible with 50–125,000 SNVs through less stringent placement of ROA boundaries, but at the expense of more frequent overcalling of ROA (inflated false discovery rate)

Overall, these findings indicate that the wLOD method is well powered to detect ROA with high sensitivity and good specificity at a wide range of SNV densities that are consistent with WGS as well as popular microarray-based platforms that are commonly used in human and non-human studies of ROA, and in particular long ROA that are of interest in studies of Mendelian and complex diseases and traits. In the simulations, both the optimal window size and the optimal overlap fraction increased logarithmically as a function of SNV density (R 2 = 0.9814 and R 2 = 0.8868, respectively, when considering their averages across scenarios). Fitting these averages against the natural logarithm of average SNV density D across all 50 replicates of their respective SNV subset, this suggests that as a rule of thumb future studies apply the wLOD method at a window size equal to 16.400 × log e (D) + 218.020 and an overlap fraction equal to 0.0736 × log e (D) + 0.8063. Based upon these equations, and calculating SNV density as the number of autosomal SNPs on the microarray divided by the total length of the target species’ autosomal genome, guideline settings for window size and overlap fraction with the commonly used human and non-human genotyping microarrays are: 111 SNPs (33%), 103 SNPs (29%), 85 SNP (21%), 81 SNPs (19%), and 59 SNP (9%) for Illumina’s HumanOmni5, HumanOmni2.5, Bovine HD, OmniExpress, and Canine HD BeadChips, respectively, and 85 SNPs (21%) for the Affymetrix Genome-Wide Human SNP 6.0 Microarray. Considering the range of autosomal SNVs observed in the WGS data available for the 26 populations in Phase 3 of The 1000 Genomes Project (12–24 million SNV [157]) a window size of 128–140 SNPs and an overlap fraction of 40–45% would be recommended for WGS datasets. Nevertheless, the modest effect window size has on power to detect longer ROA across the simulated SNV densities (Additional file 1: Figure S8) would suggest that the use of more conservative (i.e. larger) window sizes will not greatly impact the ability of future studies to detect longer ROA of interest regardless of the source and density of the SNV data being analyzed. The window overlap fraction used in ROA construction can then be tailored to meet the needs to detect shorter ROA (Additional file 1: Figure S9) and to accurately place ROA boundaries (Additional file 1: Figure S10), where less restrictive (i.e. smaller) fractions can greatly improve the detection of shorter ROA without significantly impacting the accuracy of longer ROA inferences.

Performance of wLOD against existing ROA detection methods

We have shown the wLOD method to be well powered to detect ROA in genetic datasets consistent with WGS and microarray-based genotyping. We next evaluated how the power and false discovery rate of the wLOD method compared with those of three current methods designed specifically to detect ROA in dense genotype data using the datasets simulated above: the original LOD method, the naïve genotype counting method implemented in PLINK [148], and the recently reported hidden Markov model (HMM) method implemented in the RoH function of BCFtools [156]. We do not consider here the ROA inference methods of GERMLINE [150] and Beagle [177] as they have been previously shown to underperform compared with the methods implemented in PLINK [151] and BCFtools [156] in simulated data that most closely reflects genetic variation patterns found in Europeans; given the notably higher power of PLINK and BCFtools in these comparisons, we consider it unlikely that the outcome would be substantially different with simulated data consistent with non-European populations that have notably different LD patterns from European populations [164, 165]. Since the false discovery and boundary placement properties of the sliding-window-based LOD and PLINK methods would be expected to differ from those of the wLOD method due to their different underlying models, separately for each dataset we identified the optimal window size and overlap fraction for the LOD method (Additional file 2: Table S2) and PLINK (always a window size of 50 SNPs and overlap fraction of zero) as described above; the BCFtools method was applied using default settings. It should be noted that we apply PLINK here to the entire dataset to provide an “all else being equal” comparison with the other methods; however, it is generally recommended to apply it to minor allele frequency (MAF) and LD-pruned datasets to minimize the number of spurious ROH calls [148, 151]. Thus, its power and false-positive rates of ROA detection reported here are expected to be inflated and deflated, respectively, relative to those after MAF- and LD-pruning.

For both scenario 1 and 2, all four methods were able to detect >99.5% of ROA on average with 750,000 SNVs (Fig. 4a and d, respectively), representative of the density of SNVs observed in WGS data. Nevertheless, the wLOD method outperformed both the original LOD method as well as PLINK and BCFtools, particularly at shorter ROA lengths. Interestingly, power with BCFtools became increasingly erratic at longer ROA lengths, most noticeably in scenario 1 (small partially isolated populations), for reasons that remain enigmatic. However, while the wLOD method had a lower false discovery rate than the LOD method, it was notably higher than that of BCFtools and PLINK. Again, it should be noted that this elevated false positive rate solely reflects the overcalling true ROA due to the sliding-window approach employed and not erroneous ROA calls, with such overcalling easily reduced through the use of a more stringent overlap fraction but at the expense of power to detect short ROA. Nevertheless, average ratios of inferred to true ROA length were broadly similar across the wLOD, LOD, and BCFtools methods, where they are highest for extremely short ROA and decrease exponentially with increasing ROA length until they approach―but never quite reach―one, although ratios with BCFtools were marginally lower than those with the wLOD and LOD methods in scenario 2. Conversely, average ratios with PLINK decreased noticeably as a function of ROA length―reaching 0.47 in scenario 1 and 0.81 in scenario 2―consistent with the expectation that as a consequence of its naïve model, PLINK will have a tendency to undercall ROA or return fragmented ROH calls across their span as a function of the distribution of heterozygous genotypes within the ROA, which would be expected to be most numerous near its boundaries. Overall, these observations would suggest that model improvements implemented in the wLOD estimator (Eq. 2) that account for the confounding effects of LD, recombination, and mutation in the autozygosity likelihood calculation provide improved sensitivity and specificity in ROA calling over the original LOD estimator (Eq. 1). Additionally, they indicate that the wLOD method’s sliding window approach, which combines evidence for autozygosity across multiple SNVs, provides improved sensitivity to detect ROA compared with the HMM method of BCFtools, albeit with slightly decreased accuracy in ROA boundary placement.

Fig. 4

Performance of the wLOD method compared with existing methods. Line graphs showing for scenarios 1 (a–c) and 2 (d–f) and subsets consistent with WGS (750,000 SNV; a & d) and the Illumina HumanOmni2.5–8 (125,000 SNV; b & e) and HumanOmniExpress-24 (50,000 SNV; c & f) BeadChips how average power (top), false discovery rate (middle), and ratio of inferred and true ROA length (bottom) across 50 replicate genetic simulations change with increasing ROA length. The grey vertical lines denote 500 kb (dashed) and 1.5 Mb (dotted), frequently applied length thresholds used to categorize ROA arising due to LD (< 500 kb) and inbreeding (> 1.5 Mb) in humans [9]

When we consider simulated datasets consistent with those of genotyping microarrays we observe similar patterns to those observed with 750,000 SNVs (Fig. 4 and Figure S11 [Additional file 1]). For both scenarios 1 and 2, the wLOD method consistently outperforms the LOD method as well as BCFtools and PLINK in terms of power, particularly at shorter ROA lengths. False discovery rates with the wLOD method are consistently lower than those with the LOD method but remain slightly higher than those with BCFtools, while ratios of inferred to true ROA length remain similar across the wLOD and LOD methods and BCFtools. As SNV density decreases from 750,000 SNVs down to 18,000 SNVs several patterns emerge. First, the difference in power between the wLOD and LOD methods decreases as a function of SNV density (Additional file 1: Figure S11B and D), disappearing faster under scenario 2 (large closed populations) than under scenario 1 (small partially isolated populations). These patterns are consistent with the view that in datasets containing fewer SNVs, LD confounds the inference of ROA appreciably less than in datasets containing many SNVs. Consequently, the LD correction implemented in the wLOD estimator (Eq. 2) increasingly becomes less important as SNV density decreases, leading the LOD and wLOD estimators to provide broadly similar autozygosity likelihoods. Nevertheless, false discovery rates with the wLOD method are consistently lower than those with the LOD method, in agreement with the expectation that as SNV density decreases the probabilities of unobserved recombination and mutation events between genotyped SNVs increases, with the recombination and mutation corrections implemented in the wLOD estimator (Eq. 2) enabling it to better account for these events than the LOD estimator (Eq. 1). Second, ratios of inferred to true ROA length with the PLINK method become more similar to those of the other three methods with decreasing SNV density. This pattern is consistent with the expectation that as SNV density decreases, the number of heterozygous genotypes within ROH will also decrease, allowing PLINK to increasingly detect the entire ROA. Finally, the performance of BCFtools decreases as a function of SNV density, although an appreciable loss of power only manifests when we reach 18,000 SNVs and is more pronounced in scenario 2 than in scenario 1 (Additional file 1: Figure S11B and D), suggesting that its HMM is sensitive to the effects of extended LD among sparsely distributed SNVs, a situation frequently encountered in closed populations due to elevated levels of general inbreeding. It should be noted, however, that BCFtools was designed for next-generation whole-genome and -exome data analysis and not for sparser microarray-derived genotype datasets, so its decline in performance in such datasets is to be somewhat expected.

Contrary to expectations based on frequent discrepancies in the autozygosity status of windows with the wLOD and LOD estimators in The 1000 Genomes Project Phase 3 populations (Additional file 1: Figure S3), in our simulated datasets the wLOD method only provided modest improvements in power and false discovery rate over the original LOD method (Fig. 6 and Figure S11 [Additional file 1]). How can we reconcile the high similarity of ROA calls with the LOD and wLOD methods in the simulated datasets with the appreciable differences in per-window autozygosity inferences made by their underlying estimators in The 1000 Genomes Project Phase 3 data? Considering the simulated datasets containing ~125,000 SNVs, which have a comparable SNV density to that of The 1000 Genomes Project Phase 3 Omni2.5 dataset investigated in Figure S3 (Additional file 1), and the same window size of 150 SNPs, across the 50 replicates for scenario 1 0.519% (SD = 0.496) of windows were autozygous with the LOD estimator but not the wLOD estimator, while 2.808% (SD = 1.260) were autozygous with the wLOD estimator but not the LOD estimator; for scenario 2 the values were 0.153% (SD = 0.169) and 5.364% (SD = 1.594), respectively. While the proportion of windows autozygous with the wLOD estimator but not the LOD estimator in the simulated datasets is similar to that observed in The 1000 Genomes Project Phase 3 populations (Additional file 1: Figure S3D), the proportion of windows autozygous with the LOD estimator but not the wLOD estimator is about two orders of magnitude lower than the values observed in The 1000 Genomes Project Phase 3 populations (Additional file 1: Figure S3C). Thus, while we observe the expected gain in sensitivity through a reduction in the contribution of occasional heterozygotes within ROH with the wLOD estimator that enables improved detection of shorter ROA comprised of common haplotypes, we do not observe the expected inflation in LOD scores due to the confounding effects of LD among genotyped positions that leads to increased false positive ROA calls.

Based on their underlying models, we would expect the LOD (Eq. 1) and wLOD (Eq. 2) estimators to provide highly similar inferences in situations where autozygosity patterns align almost perfectly with LD patterns among genotyped SNVs and are investigated with a sufficiently high density of SNVs that the probabilities of unobserved mutation and recombination events are effectively zero. The most parsimonious explanation for the surprisingly high similarity of ROA calls made by the LOD and wLOD methods in the simulated datasets is therefore that LD patterns in these simulated datasets do not faithfully recapitulate the complexity of those found in real populations who have experienced much more complex histories than those simulated here, limiting the impact of the LD correction (Eq. 3) incorporated into the wLOD estimator. We therefore expect to observe appreciably greater improvements in the sensitivity and specificity of ROA calls with the wLOD method compared with the LOD method in real genetic data than in our simulated datasets.

Effect of genotyped SNV density on ROA inference in real data

We have shown the wLOD method to be well powered to detect ROA in genetic datasets consistent with WGS and microarray-based genotyping, and to outperform a number of existing methods in terms of power, although the overcalling of ROA due to its sliding window approach creates slightly higher rates of false discovery than a recently reported HMM model approach. While our simulations suggest that the wLOD method has >99.8% power to detect ROA longer than 1 Mb across SNV densities that are consistent with those frequently used in human population- and disease-genetic studies (Fig. 4), they do not capture the diversity of historical events and sociogenetic processes that have influenced genomic autozygosity patterns in contemporary worldwide human populations. Thus, we next sought to evaluate how robust ROA inferences are among genotype datasets created via WGS and whole-exome-sequencing (WES) as well as with the popular Illumina HumanOmni2.5-8 and OmniExpress-24 BeadChips using The 1000 Genomes Project Phase 3 data.

We first developed a WGS dataset comprised of all 75,071,695 SNVs that passed our quality control criteria (see Methods). Next, we developed a subset of the WGS dataset that was restricted to only the 1,830,512 SNVs that are located within the genomic regions captured by the Roche Nimblegen SeqCap EZ Human Exome Library v3.0 system to mimic a whole-exome-sequencing (WES) dataset (“WES dataset” henceforth). Finally, we developed a subset of the Omni2.5 dataset that was comprised of the 676,445 SNPs that are also present on the Illumina OmniExpress-24 BeadChip (“OmniExpress dataset” henceforth). As the wLOD method explicitly accounts for LD among genotyped positions within a given window (Eq. 3) we do not consider LD pruned datasets. Similarly, since homozygosity for minor alleles at low to rare frequencies in the population is most informative for autozygosity inference with the wLOD estimator (Additional file 1: Figure S1A), we also do not consider a minor allele frequency (MAF) pruned datasets.

For the WGS, Omni2.5 and OmniExpress datasets we applied the wLOD method at the window size and overlap fraction suggested by our simulation analyses given their average SNV density across populations: 125 SNPs (40%), 95 SNPs (25%), and 80 SNPs (18%), respectively. As the SNV density of the WES dataset closely resembles that of the WGS dataset in the genomic regions it covers, we used the same window size and overlap fraction settings in both the WES and WGS datasets. For all datasets, μ was set to 1.18 × 10−8 [162] and M was set to seven, a conservative value broadly reflecting the average of effective population size estimates for populations included in The 1000 Genome Project [157, 160, 163]. For the Omni2.5 and OmniExpress datasets ε was set to 4.71 × 10−4, the average rate of discordance across samples between genotypes in our Omni2.5 dataset and those obtained for 1693 of the 2436 individuals directly with the Illumina HumanOmni2.5 BeadChip [157], while in the WGS and WES datasets ε was instead set separately for each genotype as one minus its reported likelihood. This has the potential to improve the accuracy of ROA calls in NGS datasets by incorporating the uncertainty of each genotype call into the wLOD score calculation, an important potential source of erroneous ROA calls in the context of their often higher and more variable per-genotype error rates compared with microarray-derived datasets [153, 154]. As such, autozygous windows comprised of SNVs with low quality genotypes have a greater chance of being false-positive signals than those with higher quality genotypes, while low quality heterozygous genotypes―that in one possibility may be genotype calling errors―located in runs of higher quality homozygous genotypes have the potential to mask true autozygous signals.

Comparing ROA inferred in the WGS and Omni2.5 datasets, we find Omni2.5 ROA to be frequently longer than their corresponding WGS ROA and in most cases to completely encompass the WGS ROA (Fig. 5a). The magnitude of their length discrepancies decreases with increasing ROA length, consistent with the expected effects of decreased SNV density on the accuracy of inferred ROA boundaries. In addition, while all Omni2.5 ROA are present in the set of WGS ROA, the reverse is not true (Fig. 5b). Many short ROA (< 500 kb) inferred in the WGS dataset are not found in the Omni2.5 dataset, with the fraction of missing ROA decreasing with increasing distance from Africa, reflecting the effect of increasing LD [164, 165] on our ability to infer shorter ROA with the sparser set of SNVs in the Omni2.5 dataset. Concordance between the WGS and Omni2.5 datasets for intermediate (500 kb to 1.5 Mb) and long (> 1.5 Mb) ROA is generally high, although in many populations the fraction of WGS ROA missing in the set of Omni2.5 ROA remains nontrivial. These fractions generally increase as a function of distance from Africa, likely reflecting the reduction in haplotype diversity with decreasing genetic diversity [97, 166,167,168,169] decreasing our ability to distinguish autozygosity from homozygosity-by-chance, particularly over extended genomic regions when genotypes are only available for a fixed set of SNVs that were selected for their generally high level of polymorphism worldwide.

Fig. 5

Concordance of ROA inferred in the WGS and Omni2.5 datasets. a A scatterplot comparing the length of each WGS ROA with that of its corresponding Omni2.5 ROA in the European American (CEU) population. Each point is shaded according to the proportion of the WGS ROA that overlaps the Omni2.5 ROA. b Bar plots representing the proportions of short (< 500 kb; shown in red), intermediate (500 kb to 1.5 Mb; shown in green), and long (> 1.5 Mb; shown in blue) ROA in the WGS (upper) and Omni2.5 (lower) datasets that overlap (darkest shade) or are absent from (lightest shade) the other dataset in each population

Similar patterns are observed when we compare ROA inferred in the Omni2.5 and OmniExp datasets, where almost all OmniExp ROA are present in the set of Omni2.5 ROA (Additional file 1: Figure S12B) and encompass their generally shorter corresponding Omni2.5 ROA (Additional file 1: Figure S12A). While many short ROA inferred in the Omni2.5 dataset are not found in the OmniExp dataset, both intermediate and long ROA are captured extremely consistently between the two datasets despite their different SNV densities. Likewise, when we compare ROA inferred in the WGS and WES datasets, almost all WES ROA are present in the set of WGS ROA (Additional file 1: Figure S13B) and tend to encompass their generally shorter corresponding WGS ROA (Additional file 1: Figure S13A). However, while numbers of short and intermediate ROA inferred in the WGS dataset but not the WES dataset are much higher than in the same comparison between the WGS and Omni2.5 datasets (Fig. 5), the numbers of long ROA inferred in the WGS dataset but not the WES dataset are instead similar. This indicates that the non-uniform and often sparse distribution of SNVs in the WES dataset does not impact the inference of long ROA more than would be expected following a general reduction in SNV density.

Overall, these findings are consistent with the higher density of SNVs in the WGS dataset and the presence of many more rare and low-frequency SNVs detected by NGS compared with microarray-based genotyping platforms―which are particularly informative about autozygosity under our likelihood model (Additional file 1: Figure S1A)―greatly improving our ability to infer ROA. Nevertheless, many long ROA that are of interest in Mendelian and complex disease studies are well captured by the sets of SNVs included on Illumina’s HumanOmni2.5-8 and OmniExpress-24 BeadChips. However, the sparse and non-uniform genomic distribution of SNVs in the WES dataset creates difficulties when inferring short and intermediate ROA with the wLOD method, despite the presence of rare and low-frequency SNVs, while long ROA are instead captured almost as well as with genotyping microarrays. We therefore do not recommend using the wLOD method to infer ROA in WES datasets generated by future studies.

Classification of ROA

ROA of different lengths reflect homozygosity for haplotypes inherited IBD from common ancestors at different depths in an individual’s genealogy: longer ROA most likely arise due to recent ancestors and shorter ROA due to more distant ancestors. We previously advocated that ROA be classified into G length-based classes using a Gaussian mixture model approach applied on their physical map lengths (in bp) that groups ROA based upon their supposed ages [18]: (A) short ROA that measure tens of kilobases and that are of the length at which baseline patterns in LD in a population produce autozygosity through the pairing of two copies of the same ancient haplotype, (B) intermediate length ROA that measure hundreds of kilobases to several Mb and that are likely the result of background relatedness─recent but unknown kinship between parents due to limited effective population sizes─and (C) long ROA that measure multiple megabases and are likely the result of recent parental relatedness (e.g. consanguinity). The choice of G = 3 was motivated by the observation that at G > 3, the additional classes were not discrete; that is, they were encompassed by one of the existing classes (Additional file 1: Figure S14A and C).

This classification approach is limited by the imperfect correlation between physical map lengths and genetic map lengths (Additional file 1: Figure S15), a more accurate representation of the relationship between ROA length and age [178, 179] that is not confounded by the non-uniform genomic distribution of recombination rates [180]. If we instead classify ROA based on their genetic map length (in cM) using a Gaussian mixture model we find that regardless of the number of classes considered they are always discrete (Additional file 1: Figure S14B and D). This would suggest that the original loss of discreteness when classifying based upon physical map length may reflect the confounding effects of physically long but genetically short (and vice versa) ROA on the overall length distribution. Nevertheless, regardless of whether physical or genetic map lengths are used the overall pattern of fit with increasing class number remains highly similar (Additional file 1: Figure S14A and B, respectively), where Bayesian Information Criterion (BIC) likelihoods plateau at around G = 5 with the WGS and Omni2.5 datasets and at around G = 4 classes with the OmniExpress dataset (not shown). The smaller class number for the OmniExpress dataset compared with the WGS and Omni2.5 datasets is consistent with the expectation that smaller ROA will be poorly captured by its sparser set of SNVs, ultimately leading to the loss of the shortest ROA class detected in the WGS and Omni2.5 datasets. Note that for all populations the maximum BIC likelihood is reached at G > 5. Future studies investigating fine scale ROA patterns may wish to consider values of G at which BIC is maximized, however for illustrative purposes we consider G = 5 here since the increase in BIC at G > 5 is small.

When considering a five-class classification scheme, the longest class (G = 5) contains ROA that likely arise from recent parental relatedness and the penultimate longest class (G = 4) contains ROA that likely arise from recent population processes, while the shortest classes (G = 1–3) contain ROA arising through the pairing of two copies of much older haplotypes that have common ancestors at different times in the distant past. Sample size was observed to have a greater effect on ROA classification (Additional file 1: Figure S16) than on wLOD score threshold (Additional file 1: Figure S6), with the proportion of ROA whose classification differed from that assigned when all available individuals are used decreasing as a function of sample size. Importantly, the proportion of misclassified ROA decreases with increasing ROA class, with those in the longest class (G = 5) infrequently misclassified (mean = 0.052 with SD = 0.029 across all 26 populations at a sample size of 25) while those in shorter classes were more frequently affected (mean = 0.092 with SD = 0.046, mean = 0.091 with SD = 0.045, mean = 0.083 with SD = 0.045, and mean = 0.068 with SD = 0.042, for G = 4 to 1, respectively). These observations indicate that sample size is an important factor when classifying ROA using a Gaussian mixture model, but in general samples sizes of at least 25 individuals should provide reasonably robust classification of ROA using this approach, particularly longer ROA that are of interest in genetic studies on Mendelian and complex diseases.

Geographic properties of the wLOD method

We have shown the wLOD method to be well powered to detect ROA in genetic datasets consistent with WGS and microarray-based genotyping, while our investigation of a Gaussian mixture model approach for ROA classification based upon their genetic map lengths indicates the presence of five ROA classes in The 1000 Genomes Project Phase 3 populations, a higher number than was used in our earlier study of the Human Genome Diversity Panel (HGDP) and International HapMap Project (Phase 3) populations that used a microarray-derived dataset and classified ROA based upon their physical map lengths [18]. We thus next explored the population-genetic properties of the wLOD estimator and its inferred ROA.

Evidence of separate endogamic and consanguinity autozygosity signals in Asian Indians

In four of the five Asian Indian populations─Gujarati (GIH), Telugu (ITU), Punjabi (PJL), and Sri Lankan Tamil (STU)─as well as in the East Asian Dai (CDX) population, as window size increased a third mode appeared in their wLOD score distribution that divided the right-hand autozygous mode in two (Fig. 6a). While an apparent third mode also appeared in the wLOD score distribution of the Bengali (BEB) Asian Indian population, it was not as well defined as those of the other populations. As window size increased further, the area under both autozygous modes decreased until the left-hand autozygous mode disappeared followed sometime later by the right-hand autozygous mode. Notably, the distributions of all other populations in our dataset did not develop this third mode, and trimodality was not observed in the distribution of LOD scores for any of the 26 populations.

Fig. 6

Influence of cultural processes on the distribution of wLOD scores. a Gaussian kernel density estimates of the pooled wLOD scores from all individuals in the Asian Indian Gujarati (GIH) and Telugu (ITU) populations at window sizes 200 and 220 SNPs, respectively. These patterns are representative of those observed in the Asian Indian Punjabi (PJL) and Sri Lankan Tamil (STU) populations as well as the East Asian Dai (CDX) population, which are excluded from the plot for clarity as their traces overlap significantly those of the GIH and ITU. b Gaussian kernel density estimates of the proportion of windows comprising each inferred ROA that are present in the right-most autozygosity mode in the Asian Indian GIH, ITU, PJL, and STU populations. ROA in the CDX population are almost exclusively in the left-most mode and it was excluded for clarity. The Asian Indian Bengali (BEB) population was excluded as we could not robustly distinguish between the two autozygous modes

The appearance of a trimodal distribution in these six populations potentially reflects the effects of two distinct cultural processes that occur in India and among the Dai: consanguinity [181, 182] and endogamy [183, 184]─the restriction of marriages to within a predefined group of lineages or villages. In this scenario, the right-hand autozygous mode represents ROA due to consanguinity that are enriched for alleles rare in the general population that segregate within inbred families, while the left-hand autozygous mode represents ROA due to endogamy that are enriched for alleles present at low frequency in the general population that segregate within specific endogamic groups. Compatible with this hypothesis, the three populations with the strongest trimodal pattern (STU, ITU, and DAI) have higher reported frequencies of consanguinity (38.2% [182], 30.8% [182], and 21.3% [181]) than those with weaker trimodal patterns (BEB, 5.0% [182]; GIH, 4.9% [182]; PJL, 0.9% [182]). For example, the consanguinity-associated mode of the ITU is much larger than the endogamy-associated mode, while the reverse is true for the GIH (Fig. 6a), consistent with consanguinity being the primary force generating ROA in the ITU while endogamy is the dominant force in the GIH. To the best of our knowledge, none of the other populations included in Phase 3 of The 1000 Genomes Project practise endogamy; consequently, we do not observe the emergence of a separate endogamy-associated autozygous mode in their wLOD score distributions.

If trimodal distributions are indeed a reflection of the wLOD method being able to disentangle autozygosity signals arising from endogamy and consanguinity processes we would expect inferred ROA to be delineated predominantly by windows from only one of the two autozygous modes. Conversely, if the trimodal distribution is just an idiosyncrasy of the wLOD estimator we would instead expect ROA to be delineated by a random mix of windows drawn from the two autozygous modes. To investigate how windows in the putative endogamy- and consanguinity-associated modes cluster to form inferred ROA, separately for each population exhibiting a clear trimodal wLOD score distribution, we constructed ROA from windows with wLOD scores above the minimum between the non-autozygous and left-most autozygous modes in their wLOD score distribution [18]. Next, for each inferred ROA, we calculated the proportion of their underlying autozygous windows that had wLOD scores within the right-most putative consanguinity-associated mode (i.e. above the minimum between the two autozygous modes).

Inferred ROA were found to frequently be delineated by windows drawn predominantly from one of the two autozygous modes (Fig. 6b). A large well-defined peak is observed at low proportions, representing those ROA comprised of > 90% of windows drawn from the left-hand endogamy-associated mode. A more diffuse peak is observed at higher proportions, representing those ROA comprised of > 80% of windows drawn from the right-hand consanguinity-associated mode. The dispersed appearance of the peak representing putative consanguinity-associated ROA can be explained as a reflection of the fact that the two autozygous modes are not distinct. At the ends of ROA arising via consanguinity, the wLOD scores of windows will naturally decrease as they increasingly span non-autozygous regions and overall support for autozygosity declines, leading them to increasingly fall within the endogamy-associated mode. Consequently, we would expect ROA arising via consanguinity to contain a small proportion of windows in the endogamy-associated mode, with the proportion varying based upon the overall strength of the autozygous signal (i.e. ROA conferring generally higher wLOD scores will have lower proportions of windows in the endogamy-associated mode). Nevertheless, across populations, 68.9% (PJL) to 84.5% (CDX) of all ROA had >80% of their component windows drawn from a single autozygous mode.

Additional support for trimodality in the wLOD score distribution reflecting distinct autozygosity signals arising from endogamy and consanguinity processes is provided by a comparison of how the proportion of windows drawn from the consanguinity-associated mode changes with ROA length (Additional file 1: Figure S17). Almost all ROA longer than 5 Mb are comprised predominantly of windows drawn from the consanguinity-associated mode (>90%), while proportions among ROA shorter than 5 Mb are much more variable. This pattern is consistent with the expectation that ROA arising via consanguinity will in general be much longer than those arising via endogamy.

Overall, the properties of ROA constructed from the trimodal wLOD score distributions present in the Asian Indian and East Asian Dai populations are compatible with the wLOD method being capable of disentangling autozygosity signals that arise from different cultural processes at sufficiently large window sizes. However, further work in well-defined populations that practise both endogamy and consanguinity will be required to fully evaluate this apparent property of the wLOD method.

Population patterns in ROA

To evaluate how genome-wide patterns in ROA inferred with the wLOD method and classified into five classes via a Gaussian mixture model applied to their genetic map lengths accord with those of earlier studies, we next performed the first high-resolution survey of ROA patterns in The 1000 Genomes Project Phase 3 populations based upon ROA inferred in the WGS dataset as described above.

Consistent with previous studies [12, 18, 22], ROA of different lengths have different continental patterns among the 26 populations included in Phase 3 of The 1000 Genomes Project, both with regards to their total lengths (Additional file 1: Figure S18) in individual genomes as well as in their non-uniform distributions across the genome (Additional file 1: Figure S19) that are correlated with spatially variable genomic properties such as recombination rate (Additional file 1: Figure S20) and signals of natural selection (Additional file 1: Figure S21), reflecting the distinct forces generating ROA of different lengths. Total lengths and numbers of ROA in the shortest (G = 1–3) and to some extent intermediate (G = 4) classes increase with distance from Africa, rising in a stepwise fashion in successive continental groups (Additional file 1: Figure S18), in agreement with the observed reduction in haplotype diversity with increasing distance from Africa [164, 185,186,187]. Those of the longest class (G = 5) do not show a similar stepwise pattern, instead exhibiting higher and more variable values in populations where consanguinity in more frequent (Table 2) and inbreeding coefficient estimates are generally higher [188]. Notably, the East Asian Dai have remarkably high total lengths of short ROA (G = 1–3), potentially reflecting their small population size―~1.2 million in Yunnan province, China [189], where The 1000 Genomes Project samples were collected―and complex evolutionary history [190, 191].

Recombination and natural selection

The strength of the correlation between the genomic distribution of ROA and recombination rate decreases with increasing ROA class (Additional file 1: Figure S20), consistent with the expectation that the patterns of genetically shorter ROA will be determined by recombination to a greater extent than longer ROA, which due to their more recent origins have had fewer opportunities for recombination events to systematically influence their patterns. Conversely, the correlation between ROA patterns and signatures of natural selection is strongest for class 2–3 ROA, and to some extent intermediate class 4 ROA, while it is very weak for the shortest (G = 1) and longest (G = 5) ROA classes (Additional file 1: Figure S21). These patterns are compatible with natural selection having primarily influenced genomic diversity patterns in the distant past, with autozygosity for the relics of haplotypes that arose during those events manifesting as class 2–4 ROA, dependent upon how long ago the event occurred.

The long-term effects of natural selection on patterns of ROA might be expected to be most evident in genomic regions encompassing genes implicated in one or more Mendelian diseases, where purifying selection acting on strongly deleterious alleles, which may occur more frequently in such genes due to their apparent importance for human health, would be expected to increase levels of autozygosity relative to genes much less frequently subjected to purifying selection. Using the union of two previously reported lists of genes associated with autosomal dominant (669) and recessive (1130) diseases in the Online Mendelian Inheritance of Man (OMIM) database [192,193,194], we created a list containing genes not associated with autosomal dominant or recessive diseases (24,260; “non-OMIM” henceforth); genes associated with both autosomal dominant and recessive diseases were ignored. For each individual, we then calculated the fraction of the total lengths of all autosomal dominant, autosomal recessive, or non-OMIM transcribed regions that are overlapped by ROA based on their genomic positions in build hg19 of the University of California – Santa Cruz (UCSC) reference genome assembly. Strikingly, regardless of the ROA length class considered, the fraction for OMIM dominant genes was almost always higher than that of non-OMIM genes (P < 10−16 in all comparisons; Wilcoxon signed rank test), while the opposite was true for OMIM recessive genes (P < 10−16 in all comparisons; Additional file 1: Figure S22). Nevertheless, the pattern is strongest for intermediate length ROA classes (G = 2–4) and weakest for the shortest (G = 1) and longest (G = 5) classes. Together, these results are compatible with deleterious alleles occurring less frequently in non-OMIM genes than in OMIM dominant genes, where they are efficiently removed from the population via purifying selection acting on both their homozygous and heterozygous forms, creating increased autozygosity at lengths consistent with population-level processes rather than inbreeding. One possible explanation for the decreased autozygosity around OMIM recessive genes compared with non-OMIM genes would be increased embryonic lethality and/or childhood mortality with individuals homozygous for deleterious recessive mutations in OMIM recessive genes, leading to reduced autozygosity in genomic regions encompassing them in the extant population.

Genes that have been the target of positive selection might be expected to reside within genomic regions that are more frequently autozygous in the general population than those harboring genes that have not. Considering the fraction of each gene’s transcribed region that is in a ROA in each individual’s genome, we compared their median fraction across individuals in each population (Additional file 1: Figure S23). While most genes have a median fraction of about zero, a number of genes that lie within genomic regions spanned by ROA in more than 90% of individuals in a population. Across populations, we observe 54 such instances with long class 5 ROA that represent seven distinct genomic regions (Additional file 2: Table S3), 159 with intermediate length class 4 ROA (22 distinct regions; Additional file 2: Table S4), and 31 (nine distinct regions; Additional file 2: Table S5), seven (five distinct regions; Additional file 2: Table S6), and 480 (46 distinct regions; Additional file 2: Table S7) with short class 1–3 ROA, respectively. While most genes in these regions fall within the non-OMIM group, two of the genes enriched for class 4 ROA (CFC1 and SMN1) and nine of the genes enriched for class 1 ROA (SLC25A20, NDUFAF3, LAMB2, GPX1, NPRL2, ACY1, MRPS16, LCAT, and COX4I2) are from the OMIM recessive group, while one gene enriched for class 1 ROA is from the OMIM dominant group (THAP1). Future investigation of genes that are unusually frequently overlapped by ROA in the general population may provide new insights into the role of recessive variation in human phenotypic diversity and common disease risk as well as the genes within which such variation acts.

Genomic distribution

Genomic distributions of shorter ROA (G = 1–4) are similar among populations from the same geographic region (Additional file 1: Figure S24B–E) and closely mirror the patterns of pairwise F ST among populations (Additional file 1: Figure S24A; Procrustes similarity statistic t 0  > 0.803), while those of the longest ROA class (G = 5) vary more widely among populations (Additional file 1: Figure S24F; t 0  = 0.466). Overall, these patterns are consistent with the interpretation that shorter ROA (G = 1–4), for which neighboring populations have similar patterns, reflect autozygosity that arises through population processes on different evolutionary timescales, while longer ROA (G = 5), for which neighboring populations do not necessarily have similar patterns, reflect autozygosity that instead arises through more recent cultural processes such as inbreeding [18].

Autozygosity hotspots

The non-uniform genomic distribution of the different ROA classes and their variability among populations creates autozygosity hotspots that are in some instances shared among subsets of the populations. For example, there is a hotspot for class 4 ROA on the q-arm of chromosome 2 that is common to three of the five European populations and encompasses the human lactase gene (LCT; Fig. 7) that was not detected in our original study of the HGDP and HapMap populations that included 10 from Europe [18]. In this genomic region, we observe high frequencies of intermediate length class 4 ROA in the Northern European FIN and GBR populations as well as the European American (CEU) group, but not in the Southern European TSI and IBS populations or any other population in the dataset. The presence and absence of this hotspot broadly reflects worldwide patterns in lactase persistence frequency [195, 196]. Lactase persistence is most frequent in Northwestern Europe [197, 198] where it is caused primarily by a single mutation in LCT that rose to high frequency as a consequence of natural selection in response to the rise of milk consumption and pastoralism [196, 199, 200]. It decreases in frequency through Eastern and Southern Europe and Central/South Asia reaching near-zero frequencies in East Asia and the Americas [195, 198, 201,202,203], while it is present to varying degrees in admixed Mestizo [204,205,206] and African American [201, 206] populations as a consequence of their recent European ancestry. Thus, we observe high levels of autozygosity around LCT in the GBR, FIN, and CEU populations and markedly lower but noticeable levels in the IBS, but no observable signal in the TSI or any of the Asian or admixed populations. While lactase persistence is present at moderately high frequency in sub-Saharan Africa it is caused by several different mutations [196, 207] and the African populations included in The 1000 Genomes Project are located predominantly in historically non-milking areas of the continent [199]. Consequently, we do not observe a similar autozygosity signal in the African populations as we do in the Northern European populations.

Fig. 7

Per-population ROA frequencies within a ROA hotspot on chromosome 2. For each ROA class, for each population, the average proportion of individuals in that population who have an ROA overlapping SNVs within non-overlapping 50 kb windows from 132,500,000 to 140,200,000 bp on the q-arm of chromosome 2 is shown. Each row represents a population, and each column represents a window. Populations are ordered from top to bottom by geographic affiliation, as indicated by the color of their label, and within regions from top to bottom by increasing geographic distance from Addis Ababa (in the same order as in Figure S18 [Additional file 1]). The intensity of a point increases with increasing average ROA frequency, as indicated by the color scale below the figure. The SNV density of each window and an ideogram of chromosome banding are shown in the bottom tracks, with average recombination rate in each window represented by a vertical black line below the ideogram, where line heights proportional to average recombination rate. The black vertical box demarks the location of the LCT gene, while the vertical grey box demarks the location of the class 5 ROA hotspot in the CEU and GBR

Interestingly, we also observe a hotspot for the longest ROA class (G = 5) at the same location in the Northern European CEU and GBR populations ~770 kb downstream of the LCT gene (Fig. 7), while a weaker spike in class 5 ROA frequency is seen in the FIN population. This hotspot encompasses four genes within its core region (chr2:135,375,000–135,775,000) that encode a transmembrane protein (TMEM163), an aminocarboxymuconate semialdehyde decarboxylase (ACMSD), cyclin T2 (CCNT2), and a mitogen-activated protein kinase kinase kinase (MAP3K19). The maximum normalized haplotype-based selection statistic nS L [208] score observed in the CEU, GBR, and FIN populations within the core region is 4.980, 4.818, and 4.962, respectively, suggesting that this ROA hotspot potentially reflects the outcome of recent positive selection. However, none of the genes within this hotspot are known to have functional consequences when mutated, leaving the cause of this ROA hotspot and its putative signals of positive selection enigmatic.

Overall, frequency patterns in this genomic region of the different ROA classes in the Northern European CEU, GBR, and FIN populations are consistent with positive selection having occurred at two different time-points. The extended haplotypes created by historical positive selection acting on the single LCT mutation that arose in ancestral Northern Europeans have, over subsequent generations, decreased appreciably in length, but due to the marked reduction in haplotype diversity in the surrounding region commonly create intermediate length class 4 ROA through background population processes. Conversely, the presence of extended IBD haplotypes creating longer class 5 ROA in a genomic region ~770 kb away from LCT would be compatible with positive selection acting much more recently, in agreement with the atypically high nS L scores observed within this region in these populations.

Statistical inference of enrichment of autozygosity signals between groups

A unique feature of the wLOD ROA detection approach is the availability of log-likelihoods of autozygosity for each window in each individual examined. It is therefore possible to directly compare the strength of autozygosity signals between two or more groups of individuals to identify those windows that have significantly greater evidence for shared autozygosity signals in one group compared with the others [152]. In one possibility, such an approach could be used to identify genomic regions that have stronger signals of autozygosity in affected versus unaffected individuals and thus may harbor disease-associated mutations. Similarly, genomic regions with significantly stronger signals of autozygosity in one subset of a population compared to another other may reflect founder effects if there is limited gene flow between them or the presence of adaptive alleles in one subset but not the other that have risen to high frequency.

We demonstrate the principle of this approach using three of the five Central/South Asian groups included in Phase 3 of The 1000 Genomes Project who represent subpopulations within the larger Indian population: BEB, GIH, ITU, PJL, and STU. Genetic diversity patterns in these five groups support the presence of two genetically distinguishable clusters within the GIH, ITU, and PJL (Additional file 1: Figure S25). When instead compared pairwise, the larger of the two ITU clusters lies intermediate between the smaller ITU cluster and the larger of the two GIH, PJL, or STU clusters, while the largest of the PJL clusters overlaps significantly with the smaller GIH cluster (not shown). The GIH individuals were sampled in Houston, TX, while the BEB, ITU, PJL, and STU individuals were all sampled in the UK. Given the intermediate locations of the larger ITU and PJL clusters in the pairwise comparisons, they may potentially reflect admixed individuals within these sample sets. However, both clusters are tightly bunched arguing against this possibility given the normal dispersion of admixed individuals in such analyses owing to their continuum of admixture levels [209, 210]. In another possibility, these distinct clusters might represent the unintentional sampling of distinct endogamic communities whose restrictive marital practices under the long-established Indian caste system has made them distinguishable genetically [211].

Because we would expect differential autozygosity signals among groups to have arisen relatively recently through population or cultural processes, window size is not constrained by our power to detect shorter, more ancient, ROA. A natural window size to use when searching for differential autozygosity signals between groups is therefore the one whose wLOD score distribution can best discriminate between autozygous and non-autozygous windows. In one possibility, this can be defined as the window size that maximizes the distance between the autozygous and non-autozygous modes―measured here as the distance between the modal score in each mode (Fig. 1b and Figure S2 [Additional file 1]). Using the WGS dataset and optimal window sizes of 450, 580, and 610 SNVs for the GIH, PJL, and ITU, respectively, we compared the wLOD scores of individuals present in each of their two clusters (Additional file 1: Figure S25) and evaluated the significance of their observed differences with the permutation-based approach described in Wang et al. [152] except that here we use a Wilcoxon rank-sum test instead of the two sample t-test suggested by Wang et al. as it is much less sensitive to the presence of outliers but has similar power to detect a location shift [212]. Briefly, separately for each group, we first create a distribution of test statistics under the null hypothesis of no difference in wLOD scores between clusters using 1000 permutations of cluster labels, recording for each permutation the maximum observed test statistic across all windows genome-wide. Next, separately for each window, a genome-wide adjusted P-value for the significance of the observed differences in wLOD scores between clusters is then calculated as the proportion of the maximum genome-wide test statistics observed in the 1000 permutations that exceeded the test statistic obtained with the true labels for that window. Finally, for each cluster, genomic regions enriched for autozygosity signals in that cluster compared with the other were defined by joining together overlapping windows with a permutation P-value (P perm) ≤ 0.05.

Intriguingly, while we would not a priori expect to observe significant differences in the strength of autozygosity signals between the two apparent clusters within the GIH, ITU, and PJL sample sets, we did identify one genomic region significantly enriched for autozygosity signals in cluster A compared with cluster B in both the ITU and PJL (Fig. 8; Table 3); no regions were identified in the GIH (Additional file 1: Figure S26). The genomic region in the ITU lies within the transcription elongation regulator 1 like (TCERG1L) gene that has been associated with regulation of plasma levels of the adipokine adiponectin [213], a modulator of glucose regulation and fatty acid oxidation [214] implicated in obesity, diabetes, coronary artery disease and Crohn’s disease risk [215,216,217]. The genomic region in the PJL encompasses the transmembrane phosphoinositide 3-phosphatase and tensin homolog 2 (TPTE2) gene, a paralog of the phosphatase and tensin homolog (PTEN) tumor suppressor [218] implicated in hepatic carcinogenesis [219] that has been found to harbor SNPs with significant allele frequency differences between males and females in European and African populations [220]. While the underlying basis for these differential autozygosity signals remains enigmatic in the absence of more detailed information on these individuals, their identification highlights the potential of our approach to identify genomic regions with differential autozygosity signals between groups that may reflect the presence of variants that have experienced differential selection histories or that influence differences in their predisposition to disease. Moreover, these findings highlight the need for further investigations among well-defined endogamic groups from India to facilitate our understanding of the genomic consequences of the long-established caste system.

Fig. 8

Distribution of differential ROA signals between subgroups in the ITU and PJL. Manhattan plots showing for each window the log 10 (P) of pairwise comparisons of per-individual wLOD scores in the two subgroups present in the (a) ITU (580 SNV window) and (b) PJL (610 SNV window). In each plot, P-values for the comparison testing whether wLOD scores in cluster A are greater than those in cluster B (Additional file 1: Figure S25) are shown on top with P-values for the reverse comparison shown below. P-values represent the proportion of genome-wide maximum Wilcoxon rank-sum test statistics observed in 1000 permutations of group labels that exceed the Wilcoxon rank-sum test statistic obtained with the true labels [152]. Windows with P > 0.05 are shown in black and those with P ≤ 0.05 are shown in orange. The horizontal grey dashed line denotes P = 0.05 and while the grey dotted line denotes P = 0.01

Table 3 Genomic regions enriched for autozygosity signals in the ITU and PJL subgroups


We have reported an improved likelihood-based estimator for the detection of ROA in genome-wide SNV genotype data derived from either microarray platforms or WGS that accounts for autocorrelation among genotyped positions and variability in the confidence of individual genotype calls as well as the probabilities of unobserved mutation and recombination events. Fully accounting for LD among SNVs in a given window is important, because in genomic regions of high LD many pairs of individuals will share common haplotypes that are homozygous identical-by-state but not ROA in the sense defined here (i.e., inherited IBD from a common ancestor). Thus, including such spurious windows would add noise when looking for ROA for the purpose of autozygosity mapping. The incorporation of LD in our model reduces false-positive ROA detection, affording us the ability to identify smaller ROA segments with greater fidelity. An alternative approach to accounting for LD is to prune the dataset prior to its analysis. However, since MAF strongly influences pairwise LD estimates [221] leading to a floor effect beyond short distances when SNP pairs have large MAF differences, such an approach commonly requires those SNV with MAF less than 5% to be removed [151], which would significantly reduce the power of the wLOD method to detect ROA by removing those low-frequency and rare variants whose homozygosity is most indicative of autozygosity under its likelihood model (Additional file 1: Figure S1A). Further, such pruning cannot completely remove LD from the dataset being analyzed, with a pairwise r 2 threshold of 0.5 typically applied [151]. The incorporation of LD into the model therefore better controls for the autocorrelation of autozygosity signals among nearby SNV than is attainable with LD pruning, thereby improving the specificity of the ROA it detects particularly in regions of moderate to high LD.

Similarly, accounting for the probabilities of unobserved recombination and mutation events in the genomic interval spanned by the window becomes increasingly important as a function of inter-marker distance, particularly in situations where these probabilities become nontrivial such as in lower-density microarray-derived genotype datasets. By modeling these probabilities based on an assumed minimum number of generations since the MRCA of the apparent autozygous haplotypes, which we have set here based on the reported effective sizes of the populations included in The 1000 Genomes Project [157, 160, 163], we minimize the number of false positive ROA that can be erroneously inferred when recombination and mutations events onto very similar haplotype backgrounds give the appearance of autozygosity when paired with a non-recombined haplotype. An alternative approach would be to set an arbitrary maximum inter-marker distance allowed when calling ROA; dividing into two any inferred ROA that spans an inter-marker interval greater than that maximum. However, this has the potential to erroneously break-up long ROA, potentially impacting downstream analyses that use ROA length one of their filtering criteria. By incorporating mutation and recombination weightings into the wLOD model we therefore take a more informed and less-biased approach to this issue, thereby improving the inference of longer ROA particularly in datasets containing sparser sets of SNVs.

We have shown the wLOD ROA inference method to be well-powered to infer ROA in genetic datasets consistent with those generated by WGS and microarray-based genotyping. We recommend using this method together with a model-based ROA classification approach [18] based on genetic map lengths to distinguish ROA arising from population-level LD patterns on different evolutionary timescales (classes G = 1–4) from those arising from more recent cultural processes such as inbreeding (class G = 5). Our findings suggest that our inference approach is robust for analyses of as few as 10 individuals. However, model-based classification requires at least 25 individuals to provide a robust classification solution. Moreover, to ensure allele frequency and LD estimates used with the wLOD estimator are close to their true value in the population, at least 30 unrelated individuals should ideally be used in their estimation [222, 223]. Intriguingly, our observation of trimodal wLOD score distributions for a subset of the 26 populations analyzed here, all known to practise both endogamy and consanguinity to varying degrees, suggests that this method may be able to distinguish autozygosity arising from different cultural processes that act on different time scales. Future work within well-defined endogamic and non-endogamic groups that practice consanguinity, as well as within simulated datasets exploring the breadth of possible isolation and inbreeding parameters observed in human populations, will be required to clarify this apparent property of the wLOD method and evaluate its potential human genetics applications.

Comparisons of the ROA inferred using the wLOD method on different microarray-derived and NGS datasets created from The 1000 Genomes Project Phase 3 WGS data suggest that long and to some extent intermediate length ROA are captured consistently by WGS and microarray-derived datasets. However, inference of shorter ROA does vary substantially among the different datasets as a consequence of the decreasing resolution and sensitivity attainable as the genome-wide density of genotyped positions decreases. An observation reflected in the notable lack of consistency between ROA inferred in the WES dataset and those identified in the WGS dataset. Nevertheless, population-genetic analyses of genomic ROA patterns among the 26 populations included in The 1000 Genomes Project on the basis of WGS data are consistent with our previous findings in the 64 worldwide populations included in the HGDP [224, 225] and International HapMap Project [226] on the basis of ~600,000 microarray-derived SNP genotypes [18]. These observations would therefore suggest that ROA studies using microarray-derived genotype data have similar power to detect genomic ROA patterns, and in particular those of longer ROA that are of interest to the disease genetic community due to their enrichment of deleterious variation carried in homozygous form [98, 99], as those using WGS data.

We have compared the wLOD method against a commonly used naïve genotype counting method implemented in the software PLINK, as well as the recently reported HMM method of the BCFtools software package, under two demographic scenarios in which ROA will be of interest in population- and disease-genetic studies. In our genetic simulations, the PLINK approach performed surprisingly well, potentially reflecting their relatively short duration that limited the opportunities for new mutations to arise on the IBD haplotypes that ultimately underlied ROA in the final generation. Indeed, only ~ 4.01% and ~ 14.36% of SNVs in our simulated datasets were de novo mutations not present in the founder individuals under scenarios 1 and 2, respectively, while just ~ 2.14% and ~ 2.91% of SNVs had MAF < 5%. Conversely, across the 26 populations in The 1000 Genomes Project Phase 3 WGS data on average 56% of SNVs had MAF < 5%. Nevertheless, the wLOD method had greater power to detect ROA versus PLINK across all SNV densities considered here. This difference reflects the very limited ability of the PLINK approach, which allows for only occasional missing or heterozygous genotypes when determining the status of a window to account for possible genotyping errors and mutations, to distinguish genomic regions that are homozygous-by-chance from those that are autozygous. In contrast, the wLOD method incorporates population allele frequency and LD estimates and an assumed genotyping error rate as well as accounts for the probabilities of unobserved mutations and recombination events when inferring the autozygosity status of a window, enabling more rigorous assessments of the possibility of genotyping errors and the loss of information caused by missing data. In addition, it provides a more precise measure of the probability that a given window is truly autozygous rather than simply homozygous by chance. Thus, the greater power of the wLOD method compared with PLINK reflects the greater number of false negative ROA expected under the naïve autozygosity model implemented in PLINK.

Comparisons of the wLOD method with the recently reported RoH function of BCFtools have consistently shown it to have improved power to detect ROA, and smaller ROA in particular, across all SNV densities considered here, which are representative of WGS and microarray-based genotyping platforms. However, false discovery rates of the wLOD method are slightly higher than those of BCFtools, wholly reflecting a more permissive placement of ROA boundaries marginally outside of their true locations as a consequence of the sliding window approach employed. While the underlying likelihood models of the wLOD and BCFtools approaches are similar, there are two aspects of the wLOD method that explain its higher power. First, by summing over all SNVs within a given window, the wLOD method is better able to detect the autozygosity signals of ROA comprised of older (shorter) haplotypes whose constituent SNVs individually provide only weak to modest autozygosity support than the pointwise HMM employed by BCFtools. Second, the wLOD method adjusts each SNV’s log-likelihood by the probabilities that no unobserved recombination and mutation events have occurred in the interval between it and the preceding SNV in the last M generations (Eq. 2), where M is set based on the expected time since the most recent common ancestor in an individual’s maternal and paternal lineages given the effective size of the population. BCFtools does not account for unobserved mutations in its inference model, and only allows for up to a single recombination event to have occurred within a given interval [156]. Thus, for longer ROA and those comprised of older haplotypes inherited IBD from an ancient ancestor, we would a priori expect BCFtools to have greater difficulty in making inferences as it will underestimate the number of recombination events that may have occurred as these haplotypes segregate in the general population. This may potentially underlie the noticeably erratic patterns observed with its power to detect ROA greater than 1.5 Mb in the higher SNV density simulated datasets (Fig. 4).

Finally, the wLOD method distinguishes itself from BCFtools and PLINK through its ability to directly detect genomic regions enriched for autozygosity signals in one population or group compared with one or more others without requiring the inference of ROA first. We have applied this approach within the Gujarati (GIH), Punjabi (PJL), and Telugu (ITU) Asian Indian groups, comparing wLOD scores in two distinct clusters of individuals identified via multidimensional scaling of allele sharing dissimilarities (Additional file 1: Figure S25). We identified two genomic regions enriched for autozygosity signals in one of the two clusters, one in the ITU and another in the PJL, that contain genes implicated in the regulation of metabolism and the risk for developing liver cancer, respectively (Table 3). If we instead set a more permissive threshold of P perm ≤ 0.1 when defining enriched regions, we identify an additional seven genomic regions marginally enriched for autozygosity in one cluster compared with the other (Additional file 2: Table S8). One of the seven regions was identified on chromosome 2 in ITU cluster A and contains two genes: G6PC2, a pancreatic glucose-6-phosphatase implicated in the modulation of fasting plasma glucose levels [227] that is a major target of cell-mediated autoimmunity in diabetes [228], and the ATP-binding cassette transporter gene ABCB11, mutations in which cause autosomal recessive progressive familial intrahepatic cholestasis [229, 230]. In addition, a region on chromosome 17 also identified in ITU cluster A contains seven genes that include USH1G, mutations in which cause autosomal recessive deafness in both humans [231, 232] and mice [233, 234]. Finally, a region on chromosome 16 identified in PJL cluster A contains four genes including the mechanically-activated ion channel gene PIEZO1, mutations in which cause autosomal recessive generalized lymphatic dysplasia [235, 236] as well as autosomal dominant hemolytic anemia [237, 238].

The presence of genes that cause autosomal recessive diseases in three of the seven marginally significant regions―a highly unlikely observation (P < 0.008 across 1000 random draws of genomic regions of equivalent size)―suggests the intriguing possibility that, if these clusters do indeed represent distinct endogamic communities, they may be the hallmark of cultural and natural selection processes related to the differential presence of deleterious genetic variants in these genes. Future comparative autozygosity analyses of well-defined endogamic communities within the different subpopulations of India considering much larger sample sizes than were available here will facilitate our understanding of the genomic consequences of the long-established caste system and further clarify its potential role in contributing to genetic predisposition in complex disease risk and negative health outcomes.


To facilitate community adoption of the wLOD ROA inference method as well as classification based on genetic map length via a Gaussian mixture model, we have implemented these approaches in the software GARLIC (Genomic Autozygosity Regions Likelihood-based Inference and Classification) [174] that can be downloaded at As a guide, analysis of the 97 individuals in the CEU population on a Dell Precision T7600 workstation running RedHat Enterprise Linux (v.7.3) with multi-threading support enabled (16× 2.60 GHz threads total) took ~ 2½ minutes for the OmniExp dataset, ~ 6½ minutes for the Omni2.5 dataset, and ~40 min for the WGS dataset, and occupied at most ~ 3 Gb, ~ 7 Gb, and ~ 20 Gb of RAM, respectively. Future enhancements planned for GARLIC’s core engine are expected to significantly reduce its runtime and memory usage. We also provide a searchable online database of ROA inferred in The 1000 Genomes Project Phase 3 populations and a ROA genome browser based on the JBrowse browser interface [239] in which to explore their genomic distribution with respect to various genomic features and properties available at


Genotype datasets

Release v5a of Phase 3 of The 1000 Genomes Project (accessed March 29th, 2015) provides phased genotypes at 84,801,880 genetic variants in 2504 individuals from 26 worldwide human populations discovered using a low-coverage WGS approach [157]. During the genotype phasing, occasional positions with missing genotypes were imputed; consequently, our datasets contain no missing data. We first developed a subset of this WGS dataset in which to perform individual-level quality control prior to developing different subsets in which to evaluate the performance of the wLOD method. In all subsets we applied a common set of quality-control procedures described in Pemberton et al. [240] to remove low-quality variants (Additional file 1: Figure S27).

Individual-level quality control

To independently verify the putative unrelatedness and population labeling of individuals reported by The 1000 Genomes Project Consortium, we developed a preliminary Omni dataset comprised of the 2,165,831 autosomal, 48,458 X-chromosomal, and 543 Y-chromosomal SNPs in The 1000 Genomes Project data that are present on the Illumina HumanOmni2.5-8 BeadChip (stage 1; Additional file 1: Figure S27). Across the 1693 individuals for which genotypes derived using the HumanOmni2.5-8 BeadChip were also available, genotype concordance between the WGS- and BeadChip-derived genotypes lay between 0.99431 and 0.99986 (mean = 0.99953, SD = 0.00041). We identified intra- and inter-population pairs of individuals related closer than first cousins as well as those individuals whose reported sex or population labels were likely to be erroneous as described in Pemberton et al. [240]. Using these approaches, we identified six individuals whose reported sex is likely to be erroneous, 47 individuals who did not cluster genetically with other individuals sharing the same population label, and 14 intra-population and one inter-population pairs of close relatives (Additional file 2: Table S9).

Preparation of final datasets

Removing one individual from each intra-population relative pair, both individuals from the inter-population relative pair, and the 53 individuals whose reported sex or population labels were suspected to be erroneous (68 total individuals; Additional file 2: Table S9), we developed four subsets of The 1000 Genomes Project data that were restricted to the 2436 unrelated individuals and autosomal biallelic variants (stage 2; Additional file 1: Figure S27).

First, we developed a WGS dataset comprised of 75,071,695 SNVs. Second, we developed a WES dataset comprised of the 1,830,512 SNVs that are present within the regions captured by the Roche Nimblegen SeqCap EZ Human Exome Library v3.0 system. Third, we developed an Omni2.5 dataset comprised of the 2,166,414 SNPs that are present on the Illumina HumanOmni2.5-8 BeadChip. Fourth, as ~96% of all markers present on the Illumina HumanOmniExpress-24 BeadChip are also present on the HumanOmni2.5-8 BeadChip, we developed an OmniExpress dataset comprised of the 676,445 SNPs in the Omni2.5 dataset that are present on the HumanOmniExpress-24 BeadChip.

Geographic distances

The geographic distance of each population from Addis Ababa, Ethiopia, was calculated as in Rosenberg et al. [241] with the use of waypoint routes, based on the sampling location reported by The 1000 Genomes Project [157].

Simulation of genetic and true ROA datasets

Simulation procedure

For two demographic scenarios, we generated 50 independent replicates of genetic datasets using a forward-in-time process as previously described [175]. In their original approach, prior to performing the simulation steps Kardos et al. placed N predetermined polymorphic SNV onto the chromosome’s genetic map by randomly sampling N unique genetic map positions in the range 0 to g max (the user-defined genetic map length of the simulated genome), only converting genetic map positions to physical map positions based upon a fixed user-defined recombination rate to physical map distance relationship when writing the simulated datasets to file. Here, we modified their approach to instead create a non-uniform distribution of recombination rates across the simulated chromosome and allow any base pair to mutate during the simulation.

If we let g p represent the genetic map position assigned to physical map position p, which is equal to the base pair count from the beginning of the chromosome. Based on the user-defined values for g max and recombination rate θ, all values of g lie within the interval [0, g max ] and all values of p lie within the interval [1. . (g max /θ) × 1, 000, 000]. To begin, we created a backbone of genetic and physical map positions onto which we will place all other positions, randomly drawing (g max /θ) + 1 values of g and assigning them in increasing order to p in the range [1. . (g max /θ)] (i.e. every Mb). Next, we randomly chose N values of p to be predetermined polymorphic SNVs, and then randomly assigned each a value of g based upon the backbone interval in which it was located, again ensuring that values of g always increase as a function of p. Finally, all values of p that were not among the set of predetermined SNVs were assigned a value of g through interpolation onto the construct created by the values of p and g assigned to the predetermined SNVs. This approach created a non-uniform relationship between physical and genetic map distance along the simulated chromosome that is similar to that observed on real human chromosomes (not shown).

To extend the method of Kardos et al. to enable any base pair on the simulated chromosome to mutate, for each individual in each generation, the number of mutations that occur during each meiosis was drawn from a Poisson distribution with mean μ × [(g max /θ) × 1, 000, 000], where μ is mutation rate. The base pairs to be mutated were then chosen at random from all (g max /θ) × 1, 000, 000 possible positions without replacement. Mutations were tracked and then incorporated into the genotypes of individuals in the analyzed dataset; all monomorphic positions were removed during dataset construction.

In all simulations, we set g max to 325 cM, θ to 1.3 cM/Mb [242], and μ to 1.18 × 10−8 [162], and scaled θ and μ by a factor of 10 to increase genetic diversity in the final generation [243]. N was chosen separately for each simulated scenario such that the final number of polymorphic SNVs in the dataset (both predetermined and de novo) was ~750,000; N = 725,000 for scenario 1 and N = 650,000 for scenario 2. Because predetermined polymorphic SNVs can become fixed over the course of the simulation, their numbers in the analyzed datasets lay between 679,256 and 717,855 (25,788–31,503 de novo SNVs) for scenario 1 and between 633,582 and 638,675 (103,871–110,077 de novo SNVs) for scenario 2. The simulated WGS datasets used in the analyses contained 50 individuals randomly chosen from among the 75 present in the final generation with genotypes for 709,862–746,963 polymorphic SNVs in scenario 1 and 737,957–748,572 polymorphic SNVs in scenario 2. The 125K, 80K, 50K, and 18K subsets of the WGS datasets contained between 117,113–123,766, 74,953–79,211, 46,846–49,507, and 16,865–17,823 polymorphic SNVs, respectively, for scenario 1, and between 121,833–122,815, 77,973–78,602, 48,733–49,126, and 17,544–17,686 polymorphic SNVs for scenario 2.

To better mimic real genetic datasets, we randomly introduced genotyping errors separately into each simulated dataset at a rate of 0.001, a conservative value that is similar to but slightly higher than the average rate of genotype discordance across 1693 individuals between genotypes in their WGS data and those obtained at the exact same SNVs with the Illumina HumanOmni2.5 BeadChip [157], and we set ε to this value in all analyses. Analysis of the simulated pedigrees found the parents of individuals in the final generation to have a common ancestor on average three generations in the past for scenario 1 (all between 1 to 5 generations) and four generations in the past for scenario 2 (all between 1 to 7 generations) and M was set to these average values when analyzing their respective datasets.

Parameters used in the application of PLINK, LOD, and BCFtools methods

For PLINK, we allowed at most 2% of SNPs to have heterozygous genotypes and 5% of SNPs to have missing genotypes for a window to be inferred to be autozygous [151]. The LOD method and BCFtools were applied using the same allele frequency estimates and error rate ε as the wLOD method, while BCFtools additionally incorporated genetic map positions and performed Viterbi training with initial transition probabilities between autozygous and non-autozygous states and vice versa of 6.6 × 10−8 and 5.0 × 10−9, respectively, to optimize its underlying model prior to ROA inference [156].

Application of the wLOD estimator to real data

To minimize the number of variables that varied in within-dataset comparisons, we used a single set of allele frequencies when calculating wLOD scores at all window sizes considered. To account for sample-size differences among populations, we used a resampling procedure to estimate the allele frequencies, sampling 100 non-missing alleles with replacement and calculating allele frequencies from these 100 alleles. As a consequence of the resampling procedure, it was possible for an individual to possess an allelic type whose frequency was estimated to be 0 in the sample of 100 alleles. SNV positions at which this scenario was encountered were treated as missing when calculating wLOD scores for all windows containing the positions in individuals that had the allelic type of frequency 0.

As our datasets contained phased genotypes, in the LD correction applied in the wLOD estimator (Eq. 3) LD was estimated with the correlation coefficient r 2 [244] using a resampling procedure to account for the possible influence of sample size on homozygosity-based LD statistics [222]. For each pair of SNPs, we randomly sampled 55 individuals─the smallest population sample size in our dataset (Table 2)─without replacement and the LD computation was performed using those 55 individuals. Note that we used a single set of LD estimates when calculating wLOD scores at all window sizes considered.

In the recombination rate correction applied in the wLOD estimator (Eq. 4), the genetic map position of each marker in the Omni2.5 dataset and its subsets were downloaded from the Laboratory of Computational Genetics at Rutgers University ( The genetic map position of each marker in the WES and WGS datasets was determined by interpolation onto the Rutgers linkage-physical map [245] based on their UCSC Build hg19 physical map position.

Due to computer memory requirements for Gaussian kernel density estimation, the wLOD score distributions used to determine the autozygosity score thresholds in the WGS dataset considered only twenty individuals chosen at random. Based on our investigation into the effect of sample size on score threshold (Additional file 1: Figure S6), we do not expect this approach to have biased our detection of ROA in the WGS dataset. All genome-wide windows were, however, considered when determining optimal window sizes in the Omni2.5, OmniExpress, and WES datasets.

Classification of ROA

We ran unsupervised Gaussian fitting of the ROA length distribution using the mclust package (v.5.2) [246] in the R statistical software (v.3.3.3) [247], allowing component magnitudes, means, and variances to be free parameters. BIC likelihoods with increasing number of components (G) were calculated using the function mclustBIC, while final classification under the five component model was performed using the function Mclust. Violin plots [248] of total lengths of ROA in individual genomes were produced separately for each length class using the vioplot function from the vioplot package in R.

Genomic distribution and geographic patterns of ROA

The frequency at which each SNV was present in ROA in each population was calculated as described in Pemberton et al. [18]. To compare the genomic distribution of ROA across populations, we calculated mean ROA frequencies in non-overlapping 50 kb windows across all SNVs polymorphic in that population that were within the window, and excluding windows that lay within the centromere and telomeres. To evaluate the similarity of ROA frequency patterns among populations, we performed classical (metric) multidimensional scaling (MDS) separately for each ROA length class based on a matrix of ROA frequency dissimilarities between all pairs of populations, calculated as one minus the Pearson correlation coefficient (r) of their mean ROA frequencies across windows. We then applied MDS to this matrix using cmdscale in R.

We compared population patterns in the MDS based on ROA frequencies to an MDS based on a matrix of pairwise F ST among populations calculated with our WGS dataset and the method of Hudson et al. [249] according to the recommendations of Bhatia et al. [250]. The similarity of patterns in our MDS of ROA dissimilarities and those in the MDS of F ST was evaluated with the Procrustes method [251].

Relationship between ROA and genomic variables

For each ROA length class, we investigated recombination rate and haplotype-based nS L selection scores [208] for correlations with ROA frequency across the autosomes. Population-based recombination-rate estimates were obtained from Phase 3 of The 1000 Genomes Project [157] (downloaded July14th, 2014), and nS L values for each of the 26 populations were calculated in the WGS dataset considering only SNVs with MAF > 0.05 and normalization of unstandardized scores in 100 genome-wide frequency bins with selscan [252]. Comparisons between ROA frequency and recombination rate and nS L were performed as described in Pemberton et al. [18] considering the mean value of each variable in non-overlapping 50 kb windows, excluding windows within the centromere and telomeres, calculated across all SNV within the window for which the variable was available. Admixed Afro-European (ASW and ACB) and Mestizo (CLM, MXL, PEL, and PUR) populations and the geographically imprecise CEU (Utah residents of Northwestern European ancestry) group were omitted from geographic analyses but were included in the scatterplots.





African American








European American


Northern Han


Southern Han
















Human Genome Diversity Panel


Hidden Markov model


Identical by descent




Insertion/deletion variant










Linkage disequilibrium


Logarithm of the odds




Minor allele frequency






Mexican American


Next-generation sequencing


Online Mendelian Inheritance of Man






Puerto Rican


Random access memory


Regions of autozygosity


Runs of homozygosity


Single-nucleotide polymorphism


Single-nucleotide variant


Sri Lankan Tamil




Whole-exome sequencing


Whole-genome sequencing

wLOD :

Weighted logarithm of the odds




  1. 1.

    Broman KW, Weber JL. Long homozygous chromosomal segments in reference families from the centre d’Etude du polymorphisme humain. Am J Hum Genet. 1999;65(6):1493–500.

  2. 2.

    Gibson J, Morton NE, Collins A. Extended tracts of homozygosity in outbred human populations. Hum Mol Genet. 2006;15(5):789–95.

  3. 3.

    Li LH, Ho SF, Chen CH, Wei CY, Wong WC, Li LY, Hung SI, Chung WH, Pan WH, Lee MT, Tsai FJ, Chang CF, Wu JY, Chen YT. Long contiguous stretches of homozygosity in the human genome. Hum Mutat. 2006;27(11):1115–21.

  4. 4.

    Lencz T, Lambert C, DeRosse P, Burdick KE, Morgan TV, Kane JM, Kucherlapati R, Malhotra AK. Runs of homozygosity reveal highly penetrant recessive loci in schizophrenia. Proc Natl Acad Sci U S A. 2007;104(50):19942–7.

  5. 5.

    The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449(7164):851–61.

  6. 6.

    Simon-Sanchez J, Scholz S, Fung HC, Matarin M, Hernandez D, Gibbs JR, Britton A, de Vrieze FW, Peckham E, Gwinn-Hardy K, Crawley A, Keen JC, Nash J, Borgaonkar D, Hardy J, Singleton A. Genome-wide SNP assay reveals structural genomic variation, extended homozygosity and cell-line induced alterations in normal individuals. Hum Mol Genet. 2007;16(1):1–14.

  7. 7.

    Curtis D, Vine AE, Knight J. Study of regions of extended homozygosity provides a powerful method to explore haplotype structure of human populations. Ann Hum Genet. 2008;72(Pt 2):261–78.

  8. 8.

    Jakkula E, Rehnstrom K, Varilo T, Pietilainen OP, Paunio T, Pedersen NL, Defaire U, Jarvelin MR, Saharinen J, Freimer N, Ripatti S, Purcell S, Collins A, Daly MJ, Palotie A, Peltonen L. The genome-wide patterns of variation expose significant substructure in a founder population. Am J Hum Genet. 2008;83(6):787–94.

  9. 9.

    McQuillan R, Leutenegger AL, Abdel-Rahman R, Franklin CS, Pericic M, Barac-Lauc L, Smolej-Narancic N, Janicijevic B, Polasek O, Tenesa A, Macleod AK, Farrington SM, Rudan P, Hayward C, Vitart V, Rudan I, Wild SH, Dunlop MG, Wright AF, Campbell H, Wilson JF. Runs of homozygosity in European populations. Am J Hum Genet. 2008;83(3):359–72.

  10. 10.

    Auton A, Bryc K, Boyko AR, Lohmueller KE, Novembre J, Reynolds A, Indap A, Wright MH, Degenhardt JD, Gutenkunst RN, King KS, Nelson MR, Bustamante CD. Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res. 2009;19(5):795–803.

  11. 11.

    Nalls MA, Simon-Sanchez J, Gibbs JR, Paisan-Ruiz C, Bras JT, Tanaka T, Matarin M, Scholz S, Weitz C, Harris TB, Ferrucci L, Hardy J, Singleton AB. Measures of autozygosity in decline: globalization, urbanization, and its implications for medical genetics. PLoS Genet. 2009;5(3):e1000415.

  12. 12.

    Kirin M, McQuillan R, Franklin CS, Campbell H, McKeigue PM, Wilson JF. Genomic runs of homozygosity record population history and consanguinity. PLoS One. 2010;5(11):e13996.

  13. 13.

    Nothnagel M, Lu TT, Kayser M, Krawczak M. Genomic and geographic distribution of SNP-defined runs of homozygosity in Europeans. Hum Mol Genet. 2010;19(15):2927–35.

  14. 14.

    O’Dushlaine CT, Morris D, Moskvina V, Kirov G, International Schizophrenia Consortium, Gill M, Corvin A, Wilson JF, Cavalleri GL. Population structure and genome-wide patterns of variation in Ireland and Britain. Eur J Hum Genet. 2010;18(11):1248–54.

  15. 15.

    Gross A, Tonjes A, Kovacs P, Veeramah KR, Ahnert P, Roshyara NR, Gieger C, Rueckert IM, Loeffler M, Stoneking M, Wichmann HE, Novembre J, Stumvoll M, Scholz M. Population-genetic comparison of the Sorbian isolate population in Germany with the German KORA population using genome-wide SNP arrays. BMC Genet. 2011;12:67.

  16. 16.

    Roy-Gagnon MH, Moreau C, Bherer C, St-Onge P, Sinnett D, Laprise C, Vezina H, Labuda D. Genomic and genealogical investigation of the French Canadian founder population structure. Hum Genet. 2011;129(5):521–31.

  17. 17.

    Teo SM, Ku CS, Naidoo N, Hall P, Chia KS, Salim A, Pawitan Y. A population-based study of copy number variants and regions of homozygosity in healthy Swedish individuals. J Hum Genet. 2011;56(7):524–33.

  18. 18.

    Pemberton TJ, Absher D, Feldman MW, Myers RM, Rosenberg NA, Li JZ. Genomic patterns of homozygosity in worldwide human populations. Am J Hum Genet. 2012;91(2):275–92.

  19. 19.

    Teo SM, Ku CS, Salim A, Naidoo N, Chia KS, Pawitan Y. Regions of homozygosity in three southeast Asian populations. J Hum Genet. 2012;57(2):101–8.

  20. 20.

    Di Gaetano C, Fiorito G, Ortu MF, Rosa F, Guarrera S, Pardini B, Cusi D, Frau F, Barlassina C, Troffa C, Argiolas G, Zaninello R, Fresu G, Glorioso N, Piazza A, Matullo G. Sardinians genetic background explained by runs of homozygosity and genomic regions under positive selection. PLoS One. 2014;9(3):e91237.

  21. 21.

    Aghakhanian F, Yunus Y, Naidu R, Jinam T, Manica A, Hoh BP, Phipps ME. Unravelling the genetic history of Negritos and indigenous populations of Southeast Asia. Genome Biol Evol. 2015;7(5):1206–15.

  22. 22.

    Karafet TM, Bulayeva KB, Bulayev OA, Gurgenova F, Omarova J, Yepiskoposyan L, Savina OV, Veeramah KR, Hammer MF. Extensive genome-wide autozygosity in the population isolates of Daghestan. Eur J Hum Genet. 2015;23(10):1405–12.

  23. 23.

    Zhai G, Zhou J, Woods MO, Green JS, Parfrey P, Rahman P, Green RC. Genetic structure of the Newfoundland and Labrador population: founder effects modulate variability. Eur J Hum Genet. 2015;24(7):1063–70.

  24. 24.

    Scott EM, Halees A, Itan Y, Spencer EG, He Y, Azab MA, Gabriel SB, Belkadi A, Boisson B, Abel L, Clark AG, Greater Middle East Variome Consortium, Alkuraya FS, Casanova JL, Gleeson JG. Characterization of greater middle eastern genetic variation for enhanced disease gene discovery. Nat Genet. 2016;48(9):1071–6.

  25. 25.

    Arauna LR, Mendoza-Revilla J, Mas-Sandoval A, Izaabel H, Bekada A, Benhamamouch S, Fadhlaoui-Zid K, Zalloua P, Hellenthal G, Comas D. Recent historical migrations have shaped the gene pool of Arabs and Berbers in North Africa. Mol Biol Evol. 2017;34(2):318–29.

  26. 26.

    Cole AM, Cox S, Jeong C, Petousi N, Aryal DR, Droma Y, Hanaoka M, Ota M, Kobayashi N, Gasparini P, Montgomery H, Robbins P, Di Rienzo A, Cavalleri GL. Genetic structure in the Sherpa and neighboring Nepalese populations. BMC Genomics. 2017;18(1):102.

  27. 27.

    Gilbert E, Carmi S, Ennis S, Wilson JF, Cavalleri GL. Genomic insights into the population structure and history of the Irish Travellers. Sci Rep. 2017;7:42187.

  28. 28.

    Somers M, Olde Loohuis LM, Aukes MF, Pasaniuc B, de Visser KCL, Kahn RS, Sommer IE, Ophoff RA. A genetic population isolate in The Netherlands showing extensive Haplotype sharing and long regions of Homozygosity. Genes (Basel). 2017;8(5):E133.

  29. 29.

    Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, Schraiber JG, Jay F, Prufer K, de Filippo C, Sudmant PH, Alkan C, Fu Q, Do R, Rohland N, Tandon A, Siebauer M, Green RE, Bryc K, Briggs AW, Stenzel U, Dabney J, Shendure J, Kitzman J, Hammer MF, Shunkov MV, Derevianko AP, Patterson N, Andres AM, Eichler EE, et al. A high-coverage genome sequence from an archaic Denisovan individual. Science. 2012;338(6104):222–6.

  30. 30.

    Castellano S, Parra G, Sanchez-Quinto FA, Racimo F, Kuhlwilm M, Kircher M, Sawyer S, Fu Q, Heinze A, Nickel B, Dabney J, Siebauer M, White L, Burbano HA, Renaud G, Stenzel U, Lalueza-Fox C, de la Rasilla M, Rosas A, Rudan P, Brajkovic D, Kucan Z, Gusic I, Shunkov MV, Derevianko AP, Viola B, Meyer M, Kelso J, Andres AM, Paabo S. Patterns of coding variation in the complete exomes of three Neandertals. Proc Natl Acad Sci U S A. 2014;111(18):6666–71.

  31. 31.

    Prufer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S, Heinze A, Renaud G, Sudmant PH, de Filippo C, Li H, Mallick S, Dannemann M, Fu Q, Kircher M, Kuhlwilm M, Lachmann M, Meyer M, Ongyerth M, Siebauer M, Theunert C, Tandon A, Moorjani P, Pickrell J, Mullikin JC, Vohr SH, Green RE, Hellmann I, Johnson PL, Blanche H, et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature. 2014;505(7481):43–9.

  32. 32.

    Gamba C, Jones ER, Teasdale MD, McLaughlin RL, Gonzalez-Fortes G, Mattiangeli V, Domboroczki L, Kovari I, Pap I, Anders A, Whittle A, Dani J, Raczky P, Higham TF, Hofreiter M, Bradley DG, Pinhasi R. Genome flux and stasis in a five millennium transect of European prehistory. Nat Commun. 2014;5:5257.

  33. 33.

    Palkopoulou E, Mallick S, Skoglund P, Enk J, Rohland N, Li H, Omrak A, Vartanyan S, Poinar H, Gotherstrom A, Reich D, Dalen L. Complete genomes reveal signatures of demographic and genetic declines in the woolly mammoth. Curr Biol. 2015;25(10):1395–400.

  34. 34.

    Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, Woerner AE, O’Connor TD, Santpere G, Cagan A, Theunert C, Casals F, Laayouni H, Munch K, Hobolth A, Halager AE, Malig M, Hernandez-Rodriguez J, Hernando-Herraez I, Prufer K, Pybus M, Johnstone L, Lachmann M, Alkan C, Twigg D, Petit N, Baker C, Hormozdiari F, Fernandez-Callejo M, et al. Great ape genetic diversity and population history. Nature. 2013;499(7459):471–5.

  35. 35.

    Xue Y, Prado-Martinez J, Sudmant PH, Narasimhan V, Ayub Q, Szpak M, Frandsen P, Chen Y, Yngvadottir B, Cooper DN, de Manuel M, Hernandez-Rodriguez J, Lobon I, Siegismund HR, Pagani L, Quail MA, Hvilsom C, Mudakikwa A, Eichler EE, Cranfield MR, Marques-Bonet T, Tyler-Smith C, Scally A. Mountain gorilla genomes reveal the impact of long-term population decline and inbreeding. Science. 2015;348(6231):242–5.

  36. 36.

    Bertolini F, Gandolfi B, Kim ES, Haase B, Lyons LA, Rothschild MF. Evidence of selection signatures that shape the Persian cat breed. Mamm Genome. 2016;27(3–4):144–55.

  37. 37.

    Boyko AR, Quignon P, Li L, Schoenebeck JJ, Degenhardt JD, Lohmueller KE, Zhao K, Brisbin A, Parker HG, vonHoldt BM, Cargill M, Auton A, Reynolds A, Elkahloun AG, Castelhano M, Mosher DS, Sutter NB, Johnson GS, Novembre J, Hubisz MJ, Siepel A, Wayne RK, Bustamante CD, Ostrander EA. A simple genetic architecture underlies morphological variation in dogs. PLoS Biol. 2010;8(8):e1000451.

  38. 38.

    vonHoldt BM, Pollinger JP, Earl DA, Knowles JC, Boyko AR, Parker H, Geffen E, Pilot M, Jedrzejewski W, Jedrzejewska B, Sidorovich V, Greco C, Randi E, Musiani M, Kays R, Bustamante CD, Ostrander EA, Novembre J, Wayne RK. A genome-wide perspective on the evolutionary history of enigmatic wolf-like canids. Genome Res. 2011;21(8):1294–305.

  39. 39.

    Pilot M, Dabrowski MJ, Hayrapetyan V, Yavruyan EG, Kopaliani N, Tsingarska E, Bujalska B, Kaminski S, Bogdanowicz W. Genetic variability of the grey wolf Canis Lupus in the Caucasus in comparison with Europe and the Middle East: distinct or intermediary population? PLoS One. 2014;9(4):e93828.

  40. 40.

    Friedenberg SG, Meurs KM, Mackay TF. Evaluation of artificial selection in standard poodles using whole-genome sequencing. Mamm Genome. 2016;27(11–12):599–609.

  41. 41.

    Metzger J, Pfahler S, Distl O. Variant detection and runs of homozygosity in next generation sequencing data elucidate the genetic background of Lundehund syndrome. BMC Genomics. 2016;17(1):535.

  42. 42.

    Pedersen NC, Pooch AS, Liu H. A genetic assessment of the English bulldog. Canine Genet Epidemiol. 2016;3:6.

  43. 43.

    Wiener P, Sanchez-Molano E, Clements DN, Woolliams JA, Haskell MJ, Blott SC. Genomic data illuminates demography, genetic structure and selection of a popular dog breed. BMC Genomics. 2017;18(1):609.

  44. 44.

    Kardos M, Qvarnstrom A, Ellegren H. Inferring individual inbreeding and demographic history from segments of identity by descent in Ficedula flycatcher genome sequences. Genetics. 2017;205(3):1319–34.

  45. 45.

    Purfield DC, Berry DP, McParland S, Bradley DG. Runs of homozygosity and population history in cattle. BMC Genet. 2012;13:70.

  46. 46.

    Ai H, Huang L, Ren J. Genetic diversity, linkage disequilibrium and selection signatures in chinese and western pigs revealed by genome-wide SNP markers. PLoS One. 2013;8(2):e56001.

  47. 47.

    Ferencakovic M, Hamzic E, Gredler B, Solberg TR, Klemetsdal G, Curik I, Solkner J. Estimates of autozygosity derived from runs of homozygosity: empirical evidence from selected cattle populations. J Anim Breed Genet. 2013;130(4):286–93.

  48. 48.

    Herrero-Medrano JM, Megens HJ, Groenen MA, Ramis G, Bosse M, Perez-Enciso M, Crooijmans RP. Conservation genomic analysis of domestic and wild pig populations from the Iberian peninsula. BMC Genet. 2013;14:106.

  49. 49.

    Pertoldi C, Purfield DC, Berg P, Jensen TH, Bach OS, Vingborg R, Kristensen TN. Genetic characterization of a herd of the endangered Danish Jutland cattle. J Anim Sci. 2014;92(6):2372–6.

  50. 50.

    Gomez-Raya L, Rodriguez C, Barragan C, Silio L. Genomic inbreeding coefficients based on the distribution of the length of runs of homozygosity in a closed line of Iberian pigs. Genet Sel Evol. 2015;47(1):81.

  51. 51.

    Howard JT, Maltecca C, Haile-Mariam M, Hayes BJ, Pryce JE. Characterizing homozygosity across United States, New Zealand and Australian Jersey cow and bull populations. BMC Genomics. 2015;16:187.

  52. 52.

    Marras G, Gaspa G, Sorbolini S, Dimauro C, Ajmone-Marsan P, Valentini A, Williams JL, Macciotta NP. Analysis of runs of homozygosity and their relationship with inbreeding in five cattle breeds farmed in Italy. Anim Genet. 2015;46(2):110–21.

  53. 53.

    Rodriguez-Ramilo ST, Fernandez J, Toro MA, Hernandez D, Villanueva B. Genome-wide estimates of coancestry, inbreeding and effective population size in the Spanish Holstein population. PLoS One. 2015;10(4):e0124157.

  54. 54.

    Zavarez LB, Utsunomiya YT, Carmo AS, Neves HH, Carvalheiro R, Ferencakovic M, Perez O’Brien AM, Curik I, Cole JB, Van Tassell CP, da Silva MV, Sonstegard TS, Solkner J, Garcia JF. Assessment of autozygosity in Nellore cows (Bos Indicus) through high-density SNP genotypes. Front Genet. 2015;6:5.

  55. 55.

    Iacolina L, Scandura M, Goedbloed DJ, Alexandri P, Crooijmans RP, Larson G, Archibald A, Apollonio M, Schook LB, Groenen MA, Megens HJ. Genomic diversity and differentiation of a managed island wild boar population. Heredity (Edinb). 2015;116(1):60–7.

  56. 56.

    Mastrangelo S, Tolone M, Di Gerlando R, Fontanesi L, Sardina MT, Portolano B. Genomic inbreeding estimation in small populations: evaluation of runs of homozygosity in three local dairy cattle breeds. Animal. 2016;10(5):746–54.

  57. 57.

    Burren A, Neuditschko M, Signer-Hasler H, Frischknecht M, Reber I, Menzi F, Drogemuller C, Flury C. Genetic diversity analyses reveal first insights into breed-specific selection signatures within Swiss goat breeds. Anim Genet. 2016;47(6):727–39.

  58. 58.

    Mastrangelo S, Portolano B, Di Gerlando R, Ciampolini R, Tolone M, Sardina MT, International Sheep Genomics Consortium. Genome-wide analysis in endangered populations: a case study in Barbaresca sheep. Animal. 2017;11(7):1107–16.

  59. 59.

    Msalya G, Kim ES, Laisser EL, Kipanyula MJ, Karimuribo ED, Kusiluka LJ, Chenyambuga SW, Rothschild MF. Determination of genetic structure and signatures of selection in three strains of Tanzania shorthorn zebu, Boran and Friesian cattle by genome-wide SNP analyses. PLoS One. 2017;12(1):e0171088.

  60. 60.

    Grossi DA, Jafarikia M, Brito LF, Buzanskas ME, Sargolzaei M, Schenkel FS. Genetic diversity, extent of linkage disequilibrium and persistence of gametic phase in Canadian pigs. BMC Genet. 2017;18(1):6.

  61. 61.

    Brito LF, Kijas JW, Ventura RV, Sargolzaei M, Porto-Neto LR, Canovas A, Feng Z, Jafarikia M, Schenkel FS. Genetic diversity and signatures of selection in various goat breeds revealed by genome-wide SNP markers. BMC Genomics. 2017;18(1):229.

  62. 62.

    Purfield DC, McParland S, Wall E, Berry DP. The distribution of runs of homozygosity and selection signatures in six commercial meat sheep breeds. PLoS One. 2017;12(5):e0176780.

  63. 63.

    Bosse M, Megens HJ, Madsen O, Paudel Y, Frantz LA, Schook LB, Crooijmans RP, Groenen MA. Regions of homozygosity in the porcine genome: consequence of demography and the recombination landscape. PLoS Genet. 2012;8(11):e1003100.

  64. 64.

    Kim ES, Cole JB, Huson H, Wiggans GR, Van Tassell CP, Crooker BA, Liu G, Da Y, Sonstegard TS. Effect of artificial selection on runs of homozygosity in U.S. Holstein cattle. PLoS One. 2013;8(11):e80813.

  65. 65.

    Silio L, Rodriguez MC, Fernandez A, Barragan C, Benitez R, Ovilo C, Fernandez AI. Measuring inbreeding and inbreeding depression on pig growth from pedigree or SNP-derived metrics. J Anim Breed Genet. 2013;130(5):349–60.

  66. 66.

    Pryce JE, Haile-Mariam M, Goddard ME, Hayes BJ. Identification of genomic regions associated with inbreeding depression in Holstein and Jersey dairy cattle. Genet Sel Evol. 2014;46:71.

  67. 67.

    Kim ES, Sonstegard TS, Rothschild MF. Recent artificial selection in U.S. Jersey cattle impacts autozygosity levels of specific genomic regions. BMC Genomics. 2015;16:302.

  68. 68.

    Kim ES, Sonstegard TS, Van Tassell CP, Wiggans G, Rothschild MF. The relationship between runs of homozygosity and inbreeding in Jersey cattle under selection. PLoS One. 2015;10(7):e0129967.

  69. 69.

    Metzger J, Karwath M, Tonda R, Beltran S, Agueda L, Gut M, Gut IG, Distl O. Runs of homozygosity reveal signatures of positive selection for reproduction traits in breed and non-breed horses. BMC Genomics. 2015;16(1):764.

  70. 70.

    Saura M, Fernandez A, Varona L, Fernandez AI, de Cara MA, Barragan C, Villanueva B. Detecting inbreeding depression for reproductive traits in Iberian pigs using genome-wide data. Genet Sel Evol. 2015;47:1.

  71. 71.

    Zhang Q, Guldbrandtsen B, Bosse M, Lund MS, Sahana G. Runs of homozygosity and distribution of functional variants in the cattle genome. BMC Genomics. 2015;16:542.

  72. 72.

    Zanella R, Peixoto JO, Cardoso FF, Cardoso LL, Biegelmeyer P, Cantao ME, Otaviano A, Freitas MS, Caetano AR, Ledur MC. Genetic diversity analysis of two commercial breeds of pigs using genomic and pedigree data. Genet Sel Evol. 2016;48(1):24.

  73. 73.

    Andrea T, Bertolini F, Pagnacco G, Pilla F, Ajmone-Marsan P, Rothschild MF, Crepaldi P, Italian Goat Consortium. The Valdostana goat: a genome-wide investigation of the distinctiveness of its selective sweep regions. Mamm Genome. 2017;28(3–4):114–28.

  74. 74.

    Ferencakovic M, Solkner J, Kaps M, Curik I. Genome-wide mapping and estimation of inbreeding depression of semen quality traits in a cattle population. J Dairy Sci. 2017;100(6):4721–30.

  75. 75.

    Chitneedi PK, Arranz JJ, Suarez-Vega A, Garcia-Gamez E, Gutierrez-Gil B. Estimations of linkage disequilibrium, effective population size and ROH-based inbreeding coefficients in Spanish Churra sheep using imputed high-density SNP genotypes. Anim Genet. 2017;48(4):436–46.

  76. 76.

    Luan T, Yu X, Dolezal M, Bagnato A, Meuwissen TH. Genomic prediction based on runs of homozygosity. Genet Sel Evol. 2014;46:64.

  77. 77.

    Gurgul A, Szmatola T, Topolski P, Jasielczuk I, Zukowski K, Bugno-Poniewierska M. The use of runs of homozygosity for estimation of recent inbreeding in Holstein cattle. J Appl Genet. 2016;57(4):527–30.

  78. 78.

    Meszaros G, Boison SA, Perez O’Brien AM, Ferencakovic M, Curik I, Da Silva MV, Utsunomiya YT, Garcia JF, Solkner J. Genomic analysis for managing small and endangered populations: a case study in Tyrol Grey cattle. Front Genet. 2015;6:173.

  79. 79.

    Howard JT, Tiezzi F, Huang Y, Gray KA, Maltecca C. Characterization and management of long runs of homozygosity in parental nucleus lines and their associated crossbred progeny. Genet Sel Evol. 2016;48(1):91.

  80. 80.

    Maatouk F, Laamiri D, Argoubi K, Ghedira H. Dental manifestations of inbreeding. J Clin Pediatr Dent. 1995;19(4):305–6.

  81. 81.

    Zlotogora J. Genetic disorders among Palestinian Arabs: 1. Effects of consanguinity. Am J Med Genet. 1997;68(4):472–5.

  82. 82.

    Stoll C, Alembik Y, Roth MP, Dott B. Parental consanguinity as a cause for increased incidence of births defects in a study of 238,942 consecutive births. Ann Genet. 1999;42(3):133–9.

  83. 83.

    Modell B, Darr A. Science and society: genetic counselling and customary consanguineous marriage. Nat Rev Genet. 2002;3(3):225–9.

  84. 84.

    Habeb AM, Flanagan SE, Deeb A, Al-Alwan I, Alawneh H, Balafrej AAL, Mutair A, Hattersley AT, Hussain K, Ellard S. Permanent neonatal diabetes: different aetiology in Arabs compared to Europeans. Arch Dis Child. 2012;97(8):721–3.

  85. 85.

    Morton NE. Effect of inbreeding on IQ and mental retardation. Proc Natl Acad Sci U S A. 1978;75(8):3906–8.

  86. 86.

    Shami SA, Qaisar R, Bittles AH. Consanguinity and adult morbidity in Pakistan. Lancet. 1991;338(8772):954.

  87. 87.

    Rudan I. Inbreeding and cancer incidence in human isolates. Hum Biol. 1999;71(2):173–87.

  88. 88.

    Mani A, Meraji SM, Houshyar R, Radhakrishnan J, Ahangar M, Rezaie TM, Taghavinejad MA, Broumand B, Zhao H, Nelson-Williams C, Lifton RP. Finding genetic contributions to sporadic disease: a recessive locus at 12q24 commonly contributes to patent ductus arteriosus. Proc Natl Acad Sci U S A. 2002;99(23):15054–9.

  89. 89.

    Rudan I, Rudan D, Campbell H, Carothers A, Wright A, Smolej-Narancic N, Janicijevic B, Jin L, Chakraborty R, Deka R, Rudan P. Inbreeding and risk of late onset complex disease. J Med Genet. 2003;40(12):925–32.

  90. 90.

    Kanaan ZM, Mahfouz R, Tamim H. The prevalence of consanguineous marriages in an underserved area in Lebanon and its association with congenital anomalies. Genet Test. 2008;12(3):367–72.

  91. 91.

    Shieh JTC, Bittles AH, Hudgins L. Consanguinity and the risk of congenital heart disease. Am J Med Genet A. 2012;158A(5):1236–41.

  92. 92.

    Fareed M, Afzal M. Increased cardiovascular risks associated with familial inbreeding: a population-based study of adolescent cohort. Ann Epidemiol. 2016;26(4):283–92.

  93. 93.

    AHA e-O, Singh M, Giasuddin ASM. Conjugal leprosy among Libyan patients. Dermatology. 1998;196(2):271–2.

  94. 94.

    Lyons EJ, Frodsham AJ, Zhang L, Hill AV, Amos W. Consanguinity and susceptibility to infectious diseases in humans. Biol Lett. 2009;5(4):574–6.

  95. 95.

    Lyons EJ, Amos W, Berkley JA, Mwangi I, Shafi M, Williams TN, Newton CR, Peshu N, Marsh K, Scott JAG, Hill AVS. Homozygosity and risk of childhood death due to invasive bacterial disease. BMC Med Genet. 2009;10:55.

  96. 96.

    Garrod AE. The incidence of alkaptonuria: a study in chemical individuality. Lancet. 1902;ii:1616–20.

  97. 97.

    Pemberton TJ, Rosenberg NA. Population-genetic influences on genomic estimates of the inbreeding coefficient: a global perspective. Hum Hered. 2014;77(1–4):37–48.

  98. 98.

    Szpiech ZA, Xu J, Pemberton TJ, Peng W, Zollner S, Rosenberg NA, Li JZ. Long runs of homozygosity are enriched for deleterious variation. Am J Hum Genet. 2013;93(1):90–102.

  99. 99.

    Mezzavilla M, Vozzi D, Badii R, Alkowari MK, Abdulhadi K, Girotto G, Gasparini P. Increased rate of deleterious variants in long runs of homozygosity of an inbred population from Qatar. Hum Hered. 2015;79(1):14–9.

  100. 100.

    McQuillan R, Eklund N, Pirastu N, Kuningas M, McEvoy BP, Esko T, Corre T, Davies G, Kaakinen M, Lyytikainen LP, Kristiansson K, Havulinna AS, Gogele M, Vitart V, Tenesa A, Aulchenko Y, Hayward C, Johansson A, Boban M, Ulivi S, Robino A, Boraska V, Igl W, Wild SH, Zgaga L, Amin N, Theodoratou E, Polasek O, Girotto G, Lopez LM, et al. Evidence of inbreeding depression on human height. PLoS Genet. 2012;8(7):e1002655.

  101. 101.

    Fareed M, Afzal M. Evidence of inbreeding depression on height, weight, and body mass index: a population-based child cohort study. Am J Hum Biol. 2014;26(6):784–95.

  102. 102.

    Verweij KJ, Abdellaoui A, Veijola J, Sebert S, Koiranen M, Keller MC, Jarvelin MR, Zietsch BP. The association of genotype-based inbreeding coefficient with a range of physical and psychological human traits. PLoS One. 2014;9(7):e103102.

  103. 103.

    Joshi PK, Esko T, Mattsson H, Eklund N, Gandin I, Nutile T, Jackson AU, Schurmann C, Smith AV, Zhang W, Okada Y, Stancakova A, Faul JD, Zhao W, Bartz TM, Concas MP, Franceschini N, Enroth S, Vitart V, Trompet S, Guo X, Chasman DI, O’Connel JR, Corre T, Nongmaithem SS, Chen Y, Mangino M, Ruggiero D, Traglia M, Farmaki AE, et al. Directional dominance on stature and cognition in diverse human populations. Nature. 2015;523(7561):459–62.

  104. 104.

    Power RA, Nagoshi C, DeFries JC, Plomin R, Wellcome Trust Case Control Consortium 2. Genome-wide estimates of inbreeding in unrelated individuals and their association with cognitive ability. Eur J Hum Genet. 2014;22(3):386–90.

  105. 105.

    Howrigan DP, Simonson MA, Davies G, Harris SE, Tenesa A, Starr JM, Liewald DC, Deary IJ, McRae A, Wright MJ, Montgomery GW, Hansell N, Martin NG, Payton A, Horan M, Ollier WE, Abdellaoui A, Boomsma DI, DeRosse P, Knowles EE, Glahn DC, Djurovic S, Melle I, Andreassen OA, Christoforou A, Steen VM, Hellard SL, Sundet K, Reinvang I, Espeseth T, et al. Genome-wide autozygosity is associated with lower general cognitive ability. Mol Psychiatry. 2015;21(6):837–43.

  106. 106.

    Krieger H. Inbreeding effects on metrical traits in northeastern Brazil. Am J Hum Genet. 1969;21(6):537–46.

  107. 107.

    Martin AO, Kurczynski TW, Steinberg AG. Familial studies of medical and anthropometric variables in a human isolate. Am J Hum Genet. 1973;25(6):581–93.

  108. 108.

    Hurwich BJ, Nubani N. Blood pressures in a highly inbred community--Abu Ghosh, Israel. 1. Original survey. Isr J Med Sci. 1978;14(9):962–9.

  109. 109.

    Halberstein RA. Blood pressure in the Caribbean. Hum Biol. 1999;71(4):659–84.

  110. 110.

    Saleh EA, Mahfouz AA, Tayel KY, Naguib MK, Bin-al-Shaikh NM. Hypertension and its determinants among primary-school children in Kuwait: an epidemiological study. East Mediterr Health J. 2000;6(2–3):333–7.

  111. 111.

    Rudan I, Smolej-Narancic N, Campbell H, Carothers A, Wright A, Janicijevic B, Rudan P. Inbreeding and the genetic complexity of human hypertension. Genetics. 2003;163(3):1011–21.

  112. 112.

    Badaruddoza. Inbreeding effects on metrical phenotypes among north Indian children. Coll Antropol. 2004;28 Suppl 2:311–9.

  113. 113.

    Campbell H, Carothers AD, Rudan I, Hayward C, Biloglav Z, Barac L, Pericic M, Janicijevic B, Smolej-Narancic N, Polasek O, Kolcic I, Weber JL, Hastie ND, Rudan P, Wright AF. Effects of genome-wide heterozygosity on a range of biomedically relevant human quantitative traits. Hum Mol Genet. 2007;16(2):233–41.

  114. 114.

    Simpson JL, Martin AO, Elias S, Sarto GE, Dunn JK. Cancers of the breast and female genital system: search for recessive genetic factors through analysis of human isolate. Am J Obstet Gynecol. 1981;141(6):629–36.

  115. 115.

    Lebel RR, Gallagher WB. Wisconsin consanguinity studies. II: Familial adenocarcinomatosis. Am J Med Genet. 1989;33(1):1–6.

  116. 116.

    Bacolod MD, Schemmann GS, Wang S, Shattock R, Giardina SF, Zeng Z, Shia J, Stengel RF, Gerry N, Hoh J, Kirchhoff T, Gold B, Christman MF, Offit K, Gerald WL, Notterman DA, Ott J, Paty PB, Barany F. The signatures of autozygosity among patients with colorectal cancer. Cancer Res. 2008;68(8):2610–21.

  117. 117.

    Thomsen H, Inacio da Silva Filho M, Fuchs M, Ponader S, Pogge von Strandmann E, Eisele L, Herms S, Hoffmann P, Engert A, Hemminki K, Forsti A. Evidence of inbreeding in Hodgkin lymphoma. PLoS One. 2016;11(4):e0154259.

  118. 118.

    Thomsen H, Chen B, Figlioli G, Elisei R, Romei C, Cipollini M, Cristaudo A, Bambi F, Hoffmann P, Herms S, Landi S, Hemminki K, Gemignani F, Forsti A. Runs of homozygosity and inbreeding in thyroid cancer. BMC Cancer. 2016;16(1):227.

  119. 119.

    Puzyrev VP, Lemza SV, Nazarenko LP, Panphilov VI. Influence of genetic and demographic factors on etiology and pathogenesis of chronic disease in north Siberian aborigines. Arctic Med Res. 1992;51(3):136–42.

  120. 120.

    Ismail J, Jafar TH, Jafary FH, White F, Faruqui AM, Chaturvedi N. Risk factors for non-fatal myocardial infarction in young south Asian adults. Heart. 2004;90(3):259–63.

  121. 121.

    Christofidou P, Nelson CP, Nikpay M, Qu L, Li M, Loley C, Debiec R, Braund PS, Denniff M, Charchar FJ, Arjo AR, Tregouet DA, Goodall AH, Cambien F, Ouwehand WH, Roberts R, Schunkert H, Hengstenberg C, Reilly MP, Erdmann J, McPherson R, Konig IR, Thompson JR, Samani NJ, Tomaszewski M. Runs of Homozygosity: association with coronary artery disease and gene expression in Monocytes and macrophages. Am J Hum Genet. 2015;97(2):228–37.

  122. 122.

    McLaughlin RL, Kenna KP, Vajda A, Heverin M, Byrne S, Donaghy CG, Cronin S, Bradley DG, Hardiman O. Homozygosity mapping in an Irish ALS case-control cohort describes local demographic phenomena and points towards potential recessive risk loci. Genomics. 2015;105(4):237–41.

  123. 123.

    Keller MC, Miller G. Resolving the paradox of common, harmful, heritable mental disorders: which evolutionary genetic models work best? Behav Brain Sci. 2006;29(4):385–404. discussion 405-352

  124. 124.

    Keller MC, Simonson MA, Ripke S, Neale BM, Gejman PV, Howrigan DP, Lee SH, Lencz T, Levinson DF, Sullivan PF, Schizophrenia Psychiatric Genome-Wide Association Study Consortium. Runs of homozygosity implicate autozygosity as a schizophrenia risk factor. PLoS Genet. 2012;8(4):e1002656.

  125. 125.

    Carlson CS, Eberle MA, Kruglyak L, Nickerson DA. Mapping complex disease loci in whole-genome association studies. Nature. 2004;429(6990):446–52.

  126. 126.

    Freimer N, Sabatti C. The use of pedigree, sib-pair and association studies of common diseases for genetic mapping and epidemiology. Nat Genet. 2004;36(10):1045–51.

  127. 127.

    Pritchard JK. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet. 2001;69(1):124–37.

  128. 128.

    Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease-common variant...Or not? Hum Mol Genet. 2002;11(20):2417–23.

  129. 129.

    Wright A, Charlesworth B, Rudan I, Carothers A, Campbell H. A polygenic basis for late-onset disease. Trends Genet. 2003;19(2):97–106.

  130. 130.

    Reich DE, Lander ES. On the allelic spectrum of human disease. Trends Genet. 2001;17(9):502–10.

  131. 131.

    Yang TL, Guo Y, Zhang LS, Tian Q, Yan H, Papasian CJ, Recker RR, Deng HW. Runs of homozygosity identify a recessive locus 12q21.31 for human adult height. J Clin Endocrinol Metab. 2010;95(8):3777–82.

  132. 132.

    Yang HC, Chang LC, Liang YJ, Lin CH, Wang PL. A genome-wide homozygosity association study identifies runs of homozygosity associated with rheumatoid arthritis in the human major histocompatibility complex. PLoS One. 2012;7(4):e34840.

  133. 133.

    Simon-Sanchez J, Kilarski LL, Nalls MA, Martinez M, Schulte C, Holmans P, International Parkinson’s Disease Genomics Consortium, Wellcome Trust Case Control Consortium, Gasser T, Hardy J, Singleton AB, Wood NW, Brice A, Heutink P, Williams N, Morris HR. Cooperative genome-wide analysis shows increased homozygosity in early onset Parkinson’s disease. PLoS One. 2012;7(3):e28787.

  134. 134.

    Ghani M, Sato C, Lee JH, Reitz C, Moreno D, Mayeux R, St George-Hyslop P, Rogaeva E. Evidence of recessive Alzheimer disease loci in a Caribbean Hispanic data set: genome-wide survey of runs of homozygosity. JAMA Neurol. 2013;70(10):1261–7.

  135. 135.

    Ghani M, Reitz C, Cheng R, Vardarajan BN, Jun G, Sato C, Naj A, Rajbhandary R, Wang LS, Valladares O, Lin CF, Larson EB, Graff-Radford NR, Evans D, De Jager PL, Crane PK, Buxbaum JD, Murrell JR, Raj T, Ertekin-Taner N, Logue M, Baldwin CT, Green RC, Barnes LL, Cantwell LB, Fallin MD, Go RC, Griffith PA, Obisesan TO, Manly JJ, et al. Association of Long Runs of Homozygosity with Alzheimer disease among African American individuals. JAMA Neurol. 2015;72(11):1313–23.

  136. 136.

    Mukherjee S, Guha S, Ikeda M, Iwata N, Malhotra AK, Pe’er I, Darvasi A, Lencz T. Excess of homozygosity in the major histocompatibility complex in schizophrenia. Hum Mol Genet. 2014;23(22):6088–95.

  137. 137.

    Sud A, Cooke R, Swerdlow AJ, Houlston RS. Genome-wide homozygosity signature and risk of Hodgkin lymphoma. Sci Rep. 2015;5:14315.

  138. 138.

    Lander ES, Botstein D. Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children. Science. 1987;236(4808):1567–70.

  139. 139.

    Ben Hamida C, Doerflinger N, Belal S, Linder C, Reutenauer L, Dib C, Gyapay G, Vignal A, Le Paslier D, Cohen D, et al. Localization of Friedreich ataxia phenotype with selective vitamin E deficiency to chromosome 8q by homozygosity mapping. Nat Genet. 1993;5(2):195–200.

  140. 140.

    Kwitek-Black AE, Carmi R, Duyk GM, Buetow KH, Elbedour K, Parvari R, Yandava CN, Stone EM, Sheffield VC. Linkage of Bardet-Biedl syndrome to chromosome 16q and evidence for non-allelic genetic heterogeneity. Nat Genet. 1993;5(4):392–6.

  141. 141.

    Pollak MR, Chou YH, Cerda JJ, Steinmann B, La Du BN, Seidman JG, Seidman CE. Homozygosity mapping of the gene for alkaptonuria to chromosome 3q2. Nat Genet. 1993;5(2):201–4.

  142. 142.

    Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003;33 Suppl:228–37.

  143. 143.

    Hildebrandt F, Heeringa SF, Ruschendorf F, Attanasio M, Nurnberg G, Becker C, Seelow D, Huebner N, Chernin G, Vlangos CN, Zhou W, O’Toole JF, Hoskins BE, Wolf MT, Hinkes BG, Chaib H, Ashraf S, Schoeb DS, Ovunc B, Allen SJ, Vega-Warner V, Wise E, Harville HM, Lyons RH, Washburn J, Macdonald J, Nurnberg P, Otto EA. A systematic approach to mapping recessive disease genes in individuals from outbred populations. PLoS Genet. 2009;5(1):e1000353.

  144. 144.

    Collin RW, van den Born LI, Klevering BJ, de Castro-Miro M, Littink KW, Arimadyo K, Azam M, Yazar V, Zonneveld MN, Paun CC, Siemiatkowska AM, Strom TM, Hehir-Kwa JY, Kroes HY, de Faber JT, van Schooneveld MJ, Heckenlively JR, Hoyng CB, den Hollander AI, Cremers FP. High-resolution homozygosity mapping is a powerful tool to detect novel mutations causative of autosomal recessive RP in the Dutch population. Invest Ophthalmol Vis Sci. 2011;52(5):2227–39.

  145. 145.

    Hagiwara K, Morino H, Shiihara J, Tanaka T, Miyazawa H, Suzuki T, Kohda M, Okazaki Y, Seyama K, Kawakami H. Homozygosity mapping on homozygosity haplotype analysis to detect recessive disease-causing genes from a small number of unrelated, outbred patients. PLoS One. 2011;6(9):e25059.

  146. 146.

    Zhu Z, Bakshi A, Vinkhuyzen AA, Hemani G, Lee SH, Nolte IM, van Vliet-Ostaptchouk JV, Snieder H, LifeLines Cohort S, Esko T, Milani L, Magi R, Metspalu A, Hill WG, Weir BS, Goddard ME, Visscher PM, Yang J. Dominance genetic variation contributes little to the missing heritability for human complex traits. Am J Hum Genet. 2015;96(3):377–85.

  147. 147.

    Gibbs JR, Singleton A. Application of genome-wide single nucleotide polymorphism typing: simple association and beyond. PLoS Genet. 2006;2(10):e150.

  148. 148.

    Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.

  149. 149.

    Browning SR, Browning BL. High-resolution detection of identity by descent in unrelated individuals. Am J Hum Genet. 2010;86(4):526–39.

  150. 150.

    Gusev A, Lowe JK, Stoffel M, Daly MJ, Altshuler D, Breslow JL, Friedman JM, Pe’er I. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009;19(2):318–26.

  151. 151.

    Howrigan DP, Simonson MA, Keller MC. Detecting autozygosity through runs of homozygosity: a comparison of three autozygosity detection algorithms. BMC Genomics. 2011;12:460.

  152. 152.

    Wang S, Haynes C, Barany F, Ott J. Genome-wide autozygosity mapping in human populations. Genet Epidemiol. 2009;33(2):172–80.

  153. 153.

    The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–73.

  154. 154.

    Wall JD, Tang LF, Zerbe B, Kvale MN, Kwok PY, Schaefer C, Risch N. Estimating genotype error rates from high-coverage next-generation sequence data. Genome Res. 2014;24(11):1734–9.

  155. 155.

    Clark AG. The size distribution of homozygous segments in the human genome. Am J Hum Genet. 1999;65(6):1489–92.

  156. 156.

    Narasimhan V, Danecek P, Scally A, Xue Y, Tyler-Smith C, Durbin R. BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics. 2016;32(11):1749–51.

  157. 157.

    The 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.

  158. 158.

    Chen H, Patterson N, Reich D. Population differentiation as a test for selective sweeps. Genome Res. 2010;20(3):393–402.

  159. 159.

    Albrechtsen A, Sand Korneliussen T, Moltke I, van Overseem Hansen T, Nielsen FC, Nielsen R. Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genet Epidemiol. 2009;33(3):266–74.

  160. 160.

    Keller MC, Visscher PM, Goddard ME. Quantification of inbreeding due to distant ancestors and its detection using dense single nucleotide polymorphism data. Genetics. 2011;189(1):237–49.

  161. 161.

    MacLeod IM, Larkin DM, Lewin HA, Hayes BJ, Goddard ME. Inferring demography from runs of homozygosity in whole-genome sequence, with correction for sequence errors. Mol Biol Evol. 2013;30(9):2209–23.

  162. 162.

    Conrad DF, Keebler JE, DePristo MA, Lindsay SJ, Zhang Y, Casals F, Idaghdour Y, Hartl CL, Torroja C, Garimella KV, Zilversmit M, Cartwright R, Rouleau GA, Daly M, Stone EA, Hurles ME, Awadalla P. Variation in genome-wide mutation rates within and between human families. Nat Genet. 2011;43(7):712–4.

  163. 163.

    Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014;46(8):919–25.

  164. 164.

    Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung HC, Szpiech ZA, Degnan JH, Wang K, Guerreiro R, Bras JM, Schymick JC, Hernandez DG, Traynor BJ, Simon-Sanchez J, Matarin M, Britton A, van de Leemput J, Rafferty I, Bucan M, Cann HM, Hardy JA, Rosenberg NA, Singleton AB. Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008;451(7181):998–1003.

  165. 165.

    Pemberton TJ, Jakobsson M, Conrad DF, Coop G, Wall JD, Pritchard JK, Patel PI, Rosenberg NA. Using population mixtures to optimize the utility of genomic databases: linkage disequilibrium and association study design in India. Ann Hum Genet. 2008;72(Pt 4):535–46.

  166. 166.

    Prugnolle F, Manica A, Balloux F. Geography predicts neutral genetic diversity of human populations. Curr Biol. 2005;15(5):R159–60.

  167. 167.

    Ramachandran S, Deshpande O, Roseman CC, Rosenberg NA, Feldman MW, Cavalli-Sforza LL. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci U S A. 2005;102(44):15942–7.

  168. 168.

    DeGiorgio M, Jakobsson M, Rosenberg NA. Out of Africa: modern human origins special feature: explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa. Proc Natl Acad Sci U S A. 2009;106(38):16057–62.

  169. 169.

    Pemberton TJ, DeGiorgio M, Rosenberg NA. Population structure in a comprehensive genomic data set on human microsatellite variation. G3 (Bethesda). 2013;3(5):891–907.

  170. 170.

    Xu S, Huang W, Wang H, He Y, Wang Y, Wang Y, Qian J, Xiong M, Jin L. Dissecting linkage disequilibrium in African-American genomes: roles of markers and individuals. Mol Biol Evol. 2007;24(9):2049–58.

  171. 171.

    Silva-Zolezzi I, Hidalgo-Miranda A, Estrada-Gil J, Fernandez-Lopez JC, Uribe-Figueroa L, Contreras A, Balam-Ortiz E, del Bosque-Plata L, Velazquez-Fernandez D, Lara C, Goya R, Hernandez-Lemus E, Davila C, Barrientos E, March S, Jimenez-Sanchez G. Analysis of genomic diversity in Mexican Mestizo populations to develop genomic medicine in Mexico. Proc Natl Acad Sci U S A. 2009;106(21):8611–6.

  172. 172.

    Bryc K, Velez C, Karafet T, Moreno-Estrada A, Reynolds A, Auton A, Hammer M, Bustamante CD, Ostrer H. Genome-wide patterns of population structure and admixture among Hispanic/Latino populations. Proc Natl Acad Sci U S A. 2010;107 Suppl 2:8954–61.

  173. 173.

    Wang S, Ray N, Rojas W, Parra MV, Bedoya G, Gallo C, Poletti G, Mazzotti G, Hill K, Hurtado AM, Camrena B, Nicolini H, Klitz W, Barrantes R, Molina JA, Freimer NB, Bortolini MC, Salzano FM, Petzl-Erler ML, Tsuneto LT, Dipierri JE, Alfaro EL, Bailliet G, Bianchi NO, Llop E, Rothhammer F, Excoffier L, Ruiz-Linares A. Geographic patterns of genome admixture in Latin American Mestizos. PLoS Genet. 2008;4(3):e1000037.

  174. 174.

    Szpiech ZA, Blant A, Pemberton TJ. GARLIC: genomic Autozygosity regions likelihood-based inference and classification. Bioinformatics. 2017;33(13):2059–62.

  175. 175.

    Kardos M, Luikart G, Allendorf FW. Measuring individual inbreeding in the age of genomics: marker-based measures are better than pedigrees. Heredity (Edinb). 2015;115(1):63–72.

  176. 176.

    Fenner JN. Cross-cultural estimation of the human generation interval for use in genetics-based population divergence studies. Am J Phys Anthropol. 2005;128(2):415–23.

  177. 177.

    Browning BL, Browning SR. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics. 2013;194(2):459–71.

  178. 178.

    Thomas A, Skolnick MH, Lewis CM. Genomic mismatch scanning in pedigrees. IMA J Math Appl Med Biol. 1994;11(1):1–16.

  179. 179.

    Thomas A, Camp NJ, Farnham JM, Allen-Brady K, Cannon-Albright LA. Shared genomic segment analysis. Mapping disease predisposition genes in extended pedigrees using SNP genotype assays. Ann Hum Genet. 2008;72(Pt 2):279–87.

  180. 180.

    Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G, Shlien A, Palsson ST, Frigge ML, Thorgeirsson TE, Gulcher JR, Stefansson K. A high-resolution recombination map of the human genome. Nat Genet. 2002;31(3):241–7.

  181. 181.

    Wu L. Investigation of consanguineous marriages among 30 Chinese ethnic groups. Hered Dis. 1987;4:163–6.

  182. 182.

    Bittles AH. Consangunity in context. Cambridge, UK: Cambridge University Press; 2012.

  183. 183.

    Zhusheng W. Dai. In: Encyclopedia.Com; 1996.

  184. 184.

    Gadgil M, Joshi NV, Prasad UVS, Manoharan S, Patil S. Peopling of India. In: Balasubramanian D, Rao NA, editors. The Indian human heritage. Hyderabad, India: Universities Press; 1998. p. 100–29.

  185. 185.

    Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg NA, Pritchard JK. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet. 2006;38(11):1251–60.

  186. 186.

    Huang L, Jakobsson M, Pemberton TJ, Ibrahim M, Nyambo T, Omar S, Pritchard JK, Tishkoff SA, Rosenberg NA. Haplotype variation and genotype imputation in African populations. Genet Epidemiol. 2011;35(8):766–80.

  187. 187.

    Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319(5866):1100–4.

  188. 188.

    Gazal S, Sahbatou M, Babron MC, Genin E, Leutenegger AL. High level of inbreeding in final phase of 1000 genomes project. Sci Rep. 2015;5:17453.

  189. 189.

    Shen Z, Duan L, Yang H, Yuan L, Huang Y, Li L, Xu B. Genetic variation of 17 STR loci in Dai population in mainland China. Forensic Sci Int Genet. 2015;19:37–8.

  190. 190.

    Sun H, Zhou C, Huang X, Lin K, Shi L, Yu L, Liu S, Chu J, Yang Z. Autosomal STRs provide genetic evidence for the hypothesis that tai people originate from southern China. PLoS One. 2013;8(4):e60822.

  191. 191.

    Li YC, Huang W, Tian JY, Chen XQ, Kong QP. Exploring the maternal history of the tai people. J Hum Genet. 2016;61(8):721–9.

  192. 192.

    Blekhman R, Man O, Herrmann L, Boyko AR, Indap A, Kosiol C, Bustamante CD, Teshima KM, Przeworski M. Natural selection on genes that underlie human disease susceptibility. Curr Biol. 2008;18(12):883–9.

  193. 193.

    Berg JS, Adams M, Nassar N, Bizon C, Lee K, Schmitt CP, Wilhelmsen KC, Evans JP. An informatics approach to analyzing the incidentalome. Genet Med. 2013;15(1):36–44.

  194. 194.

    Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O’Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan LE, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Berghout J, Cooper DN, Deflaux N, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285–91.

  195. 195.

    Itan Y, Jones BL, Ingram CJ, Swallow DM, Thomas MG. A worldwide correlation of lactase persistence phenotype and genotypes. BMC Evol Biol. 2010;10:36.

  196. 196.

    Gerbault P, Liebert A, Itan Y, Powell A, Currat M, Burger J, Swallow DM, Thomas MG. Evolution of lactase persistence: an example of human niche construction. Philos Trans R Soc Lond Ser B Biol Sci. 2011;366(1566):863–77.

  197. 197.

    Smith GD, Lawlor DA, Timpson NJ, Baban J, Kiessling M, Day IN, Ebrahim S. Lactase persistence-related genetic variant: population substructure and health outcomes. Eur J Hum Genet. 2009;17(3):357–67.

  198. 198.

    Ingram CJ, Mulcare CA, Itan Y, Thomas MG, Swallow DM. Lactose digestion and the evolutionary genetics of lactase persistence. Hum Genet. 2009;124(6):579–91.

  199. 199.

    Simoons FJ. Primary adult lactose intolerance and the milking habit: a problem in biologic and cultural interrelations. II. A culture historical hypothesis. Am J Dig Dis. 1970;15(8):695–710.

  200. 200.

    McCracken RD. Lactase deficiency: an example of dietary evolution. Curr Anthropol. 1971;12(4/5):479–517.

  201. 201.

    Kretchmer N. Lactose and lactase. Sci Am. 1972;227(4):71–8.

  202. 202.

    Wang YG, Yan YS, Xu JJ, Du RF, Flatz SD, Kuhnau W, Flatz G. Prevalence of primary adult lactose malabsorption in three populations of northern China. Hum Genet. 1984;67(1):103–6.

  203. 203.

    Gallego Romero I, Basu Mallick C, Liebert A, Crivellaro F, Chaubey G, Itan Y, Metspalu M, Eaaswarkhanth M, Pitchappan R, Villems R, Reich D, Singh L, Thangaraj K, Thomas MG, Swallow DM, Mirazon Lahr M, Kivisild T. Herders of Indian and European cattle share their predominant allele for lactase persistence. Mol Biol Evol. 2012;29(1):249–60.

  204. 204.

    Woteki CE, Weser E, Young EA. Lactose malabsorption in Mexican-American children. Am J Clin Nutr. 1976;29(1):19–24.

  205. 205.

    Sahi T. Genetics and epidemiology of adult-type hypolactasia. Scand J Gastroenterol Suppl. 1994;202:7–20.

  206. 206.

    Enattah NS, Sahi T, Savilahti E, Terwilliger JD, Peltonen L, Jarvela I. Identification of a variant associated with adult-type hypolactasia. Nat Genet. 2002;30(2):233–7.

  207. 207.

    Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, Ibrahim M, Omar SA, Lema G, Nyambo TB, Ghori J, Bumpstead S, Pritchard JK, Wray GA, Deloukas P. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 2007;39(1):31–40.

  208. 208.

    Ferrer-Admetlla A, Liang M, Korneliussen T, Nielsen R. On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol Biol Evol. 2014;31(5):1275–91.

  209. 209.

    McVean G. A genealogical interpretation of principal components analysis. PLoS Genet. 2009;5(10):e1000686.

  210. 210.

    Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2(12):e190.

  211. 211.

    Pemberton TJ, Li FY, Hanson EK, Mehta NU, Choi S, Ballantyne J, Belmont JW, Rosenberg NA, Tyler-Smith C, Patel PI. Impact of restricted marital practices on genetic variation in an endogamous Gujarati group. Am J Phys Anthropol. 2012;149(1):92–103.

  212. 212.

    Hodges JL, Lehmann EL. The efficiency of some nonparametric competitors of the t-test. Ann Math Stat. 1956;27(2):324–35.

  213. 213.

    Chen G, Bentley A, Adeyemo A, Shriner D, Zhou J, Doumatey A, Huang H, Ramos E, Erdos M, Gerry N, Herbert A, Christman M, Rotimi C. Genome-wide association study identifies novel loci association with fasting insulin and insulin resistance in African Americans. Hum Mol Genet. 2012;21(20):4530–6.

  214. 214.

    Maeda N, Shimomura I, Kishida K, Nishizawa H, Matsuda M, Nagaretani H, Furuyama N, Kondo H, Takahashi M, Arita Y, Komuro R, Ouchi N, Kihara S, Tochino Y, Okutomi K, Horie M, Takeda S, Aoyama T, Funahashi T, Matsuzawa Y. Diet-induced insulin resistance in mice lacking adiponectin/ACRP30. Nat Med. 2002;8(7):731–7.

  215. 215.

    Diez JJ, Iglesias P. The role of the novel adipocyte-derived hormone adiponectin in human disease. Eur J Endocrinol. 2003;148(3):293–300.

  216. 216.

    Sim X, Ong RT, Suo C, Tay WT, Liu J, Ng DP, Boehnke M, Chia KS, Wong TY, Seielstad M, Teo YY, Tai ES. Transferability of type 2 diabetes implicated loci in multi-ethnic cohorts from Southeast Asia. PLoS Genet. 2011;7(4):e1001363.

  217. 217.

    Kenny EE, Pe’er I, Karban A, Ozelius L, Mitchell AA, Ng SM, Erazo M, Ostrer H, Abraham C, Abreu MT, Atzmon G, Barzilai N, Brant SR, Bressman S, Burns ER, Chowers Y, Clark LN, Darvasi A, Doheny D, Duerr RH, Eliakim R, Giladi N, Gregersen PK, Hakonarson H, Jones MR, Marder K, McGovern DP, Mulle J, Orr-Urtreger A, Proctor DD, et al. A genome-wide scan of Ashkenazi Jewish Crohn’s disease suggests novel susceptibility loci. PLoS Genet. 2012;8(3):e1002559.

  218. 218.

    Walker SM, Downes CP, Leslie NR. TPIP: a novel phosphoinositide 3-phosphatase. Biochem J. 2001;360(Pt 2):277–83.

  219. 219.

    Clifford RJ, Zhang J, Meerzaman DM, Lyu MS, Hu Y, Cultraro CM, Finney RP, Kelley JM, Efroni S, Greenblum SI, Nguyen CV, Rowe WL, Sharma S, Wu G, Yan C, Zhang H, Chung YH, Kim JA, Park NH, Song IH, Buetow KH. Genetic variations at loci involved in the immune response are risk factors for hepatocellular carcinoma. Hepatology. 2010;52(6):2034–43.

  220. 220.

    Zuo L, Wang T, Lin X, Wang J, Tan Y, Wang X, Yu X, Luo X. Sex difference of autosomal alleles in populations of European and African descent. Genes Genomics. 2015;37(12):1007–16.