World diversity panel SNP ascertainment and genotyping
We first identified genomic regions of ~50 kb in size, each centered around a single AluY insertion, with no other young AluY insertions or known genes in the region. Nearly all of these less-diverged elements are of the AluY subfamily, so we restrict our analysis to AluY elements. We selected 19 such AluY-containing regions based on previous characterization [28, 29]. Fourteen of these are defined by a fixed but recently inserted AluY element, and five are defined by AluY insertions that are still polymorphic for presence or absence in humans. We then identified a total of 206 SNPs in these regions by searching public databases and by resequencing 1 kb stretches of DNA at ~5 kb intervals flanking the AluY insertion loci in seven individuals of African, European, and East Asian ancestry (21 total). This ascertainment design delivers evenly spaced SNPs and reduces the ascertainment bias for common European SNPs that pervades most large publicly available data sets, such as the HapMap data. We then genotyped these SNPs in 347 individuals sampled from Sub-Saharan Africa (152), Europe (118), and East Asia (77) (additional population details can be found in [30]).
Sequencing of both strands in the targeted loci on the ascertainment panel of 21 individuals was carried out in our laboratory using an ABI 3100 sequencer (Applied Biosystems) and primer designs based on the human reference genome. SNPs were identified using PolyPhred [31], and those with MAF > 5% in the 21 individuals were selected for genotyping in the 347 individuals in our sample using standard methods (SNaPshot®, Applied Biosystems).
Recombination rate estimation and model variables
For each AluY region-by-continent data set, we used the default recombination rate model in PHASE 2.1 (MR0; [32]) with a segment size of 12 markers, 200 burn-in iterations followed by 100 sampling iterations, with a 10-times longer final run (using all loci instead of segments) for better sampling of the recombination rate estimates. We ran the entire estimation process five times and used results from the run with the best average goodness-of-fit. Our estimate of ρ for each interval is the median of the final sampled values produced by PHASE for that interval, as recommended by the authors.
Where parent-offspring trios were available, we used them to obtain better haplotype reconstructions and recombination rate estimates. Within trios, genotypes that were incompatible with Mendelian inheritance were treated as missing data. Data subsets (genotypes for all SNPs in an AluY region, in all individuals in the continental population sample) that exhibited more than 10 Mendelian inheritance conflicts were not analyzed. On a region-by-region basis, any individual with more than 50% missing data was removed (along with its entire trio, where applicable).
We estimated the rescaled recombination rate parameter, ρ, for every inter-SNP interval in each AluY region from the genotypes of SNPs in each region using PHASE 2.1 [32]. Recombination rates were estimated separately for each continental group (Africa, Europe, or Asia).
We then used stepwise-fitted linear regression (stepwisefit in Matlab; [33]) to test for an effect of AluY insertions on recombination rates while controlling for other factors that might also influence recombination rates. Our regression model predicts ρ for each inter-SNP interval (across all AluY regions) as a function of seven variables:
(1) The length of the inter-SNP interval (log10 scale), which allows us to factor out and test for a potentially confounding relationship between interval length, recombination rate, and AluY presence or absence.
(2) The mean ρ of all intervals in the AluY region other than the interval containing the AluY (see Figure 1). This essentially uses the inter-SNP intervals surrounding the AluY-containing interval as matched controls (see Figure 1) and will factor out broad-scale variation in the recombination rate, regardless of the cause. In particular, this should account for biases that might have been introduced by the procedure that we used to select AluY regions (such as slightly lower AluY density or slightly higher SNP heterozygosity compared to genome-wide averages).
(3) The G+C base pair composition of the interval. This is known to be correlated with recombination rates at broad scales [14, 15, 34] and might well have a short-range effect.
(4) The number of copies per interval of a 7-bp GC-rich "core" motif (CCTCCCT) that is associated with recombination hot spots [16].
(5) The number of copies per interval of an "extended" degenerate motif (CCNCCNTNNCCNC) that is associated with recombination hot spots [20]. The core and extended motifs are sometimes found in AluY copies, and their potential recombinogenic effects might explain some or all of any effect that AluY insertions might have on the recombination rate. Instances of each motif and its reverse complement were identified and counted in the UCSC hg18 reference human genome sequence.
(6) An indicator variable, indicating whether or not an interval overlaps a hot spot (0 or 1, respectively; hot spot locations as identified in [16]). Although hot spots are not DNA sequence features per se, they do correspond to small (mostly <10 kb) regions where recombination rates several times higher than in the surrounding regions. In effect, we use the presence of a hot spot as a proxy for the unknown local genomic features that presumably cause hot spots.
(7) Lastly, the variable of interest in this work: an indicator of the presence (1) or absence (0) of an AluY insertion in the interval.
In an alternative test of the independence of the effects of AluY insertions and the recombinogenic motifs that sometimes occur in them, we generated subsets of the HapMap data by removing all AluY regions where the focal AluY contained either recombinogenic motif. Our findings remained essentially unaltered, indicating that the effects of the motifs were well accounted for in the linear regression models.
Although it was possible to identify many ~50 kb regions in the human genome that contained exactly one young AluY insertion, it was not possible to find such regions that were also free of copies of other repeat sequence families, because those are too common. If copies of other families affect recombination and are strongly correlated with young AluY insertions at the ~4 kb scale we used, those effects could cancel out or be confounded. To test this possibility for LINE L1, HERV, Alu repeats other than AluY, and DNA repeats (separately), we constructed additional variables that indicate whether an inter-SNP interval overlaps a repeat of that class or not (1 or 0; similar to the hot spot variable). Those four classes account for the vast majority of non-AluY repeats in the genome. Adding these new variables into our linear regression models caused only trivial and statistically non-significant changes in the size of the effect of AluY repeats on recombination: the effect size coefficients changed by no more than 4% of the original estimate at most, and by only 0.7% on average. Thus other repeat sequences can be safely treated as uncorrelated background effects.
"Edge" effects on recombination rate estimation
PHASE relies on the pattern of linkage disequilibrium between SNPs to infer the historical rates of recombination in the genomic intervals defined by those SNPs. For intervals at the end of a region of neighboring SNPs, however, there are no further SNPs on one side. This could limit the ability of PHASE to detect evidence of past recombination events, which could result in a downward bias in recombination rate estimates for those intervals. The AluY insertions in our data are nearly always in the central interval of each region, where information about recombination events should be sufficient for unbiased estimates. Lower recombination rate estimates in terminal intervals due to an "edge effect" could cause recombination rates at internal intervals (in particular, the AluY-containing intervals) to appear significantly higher than the local average.
We examined our data for edge effects by including additional indicator variables in the linear regression model described above. These variables indicated whether or not (1 or 0) and interval was terminal, sub-terminal, sub-sub-terminal, and so on, in its region. In the HapMap data sets, where each region has eleven intervals, this requires five variables. In the large data sets for fixed AluY in the HapMap YRI and CEU samples (comprising 6,235 and 5,344 regions, respectively), terminal intervals had a modestly but significantly lowered recombination rate (effect coefficients -0.018 and -0.025, p < 10-9 and p < 10-4, for YRI and CEU data sets respectively). Sub-terminal intervals exhibited a similar effect in the YRI data set only (effect size -0.26, p < 10-7). Other internal intervals (apart from the central AluY-containing interval) showed no position effects. The remaining data sets derived from our world diversity panel are all much smaller, and no edge effect was detected for the terminal or sub-terminal intervals there.
In order to eliminate edge effects from our data, we removed the terminal and sub-terminal intervals from all linear regression analyses on the HapMap-derived data sets (for fixed AluY in the YRI and CEU samples, and for polymorphic AluY in the YRI sample). In the smaller data sets derived from our world diversity panel, the AluY are not as tightly correlated with the central interval in a region; in one case, the AluY is in a sub-terminal interval. For these data sets, we nonetheless dropped all terminal intervals from all analyses. The exclusion of terminal and/or sub-terminal intervals has some quantitative impacts, but does not qualitatively affect the results of any of our analyses.
HapMap data set construction
To extend our analysis to a different set of individuals and a larger set of loci, we turned to the publicly available HapMap data set [23]. We analyzed data from the parent-child trios of European (CEPH) and Yoruban ancestry (YRI; thirty trios each; [23]. Since we are interested in the effects of typical AluY insertions, we searched the UCSC Genome Browser [35] "rmsk" database table (itself generated using RepeatMasker[36]) for all AluY insertions in the human genome reference sequence that met the following criteria: (1) the insertions have been mapped to a specific location in an autosome, (2) they are no more than 10% diverged from their respective AluY subfamily consensus sequences, (3) they are between 250 and 350 bp in length, (4) no more than 10% of any insertion consists of inserted non-AluY sequence, (5) no more than 10% of the subfamily consensus sequence is missing from the insertion. We identified 113,852 such AluY insertions.
We then emulated the AluY region and SNP ascertainment strategies that we used for our own genotyping project ("World diversity panel SNP ascertainment and genotyping," above) to select comparable AluY regions and evenly-spaced SNPs from the HapMap data. Out of the AluY insertions identified as above, we selected only those that had no other AluY fragments (or known polymorphic Alu insertion loci in dbRIP; [24]) within 25 kb of them, in order to simplify analysis and eliminate potentially complicating interactions between neighboring AluY elements. We further winnowed this set down to those AluY insertions that had at least twelve SNPs with MAF>0.1 (in the population being analyzed) within 25 kb of the AluY locus. We used the more demanding 10% MAF threshold (instead of 5%) in order to obtain better recombination rate estimates. Among SNPs that met those criteria, we chose sets that maximized the uniformity of the inter-SNP interval lengths (as our own ascertainment procedure above did). We then ranked the HapMap AluY regions according to the evenness of SNP spacing and selected regions with the highest evenness for further analysis. AluY regions and SNPs were ascertained separately for the HapMap YRI and CEU data sets. For the analysis of polymorphic AluY insertions, AluY regions with 12 SNPs each were selected and constructed as above, after using liftOver [37] to translate dbRIP chromosomal positions from UCSC hg17 to hg18 positions.
The SNPs discovered in this work have been submitted to dbSNP. All the data sets used in this work (AluY regions, SNP loci, genotypes, recombination rate estimates and all other predictor variables computed on inter-SNP intervals) are available from the authors on request.