Identifying branch-specific positive selection throughout the regulatory genome using an appropriate proxy neutral.

BACKGROUND
Adaptive changes in cis-regulatory elements are an essential component of evolution by natural selection. Identifying adaptive and functional noncoding DNA elements throughout the genome is therefore crucial for understanding the relationship between phenotype and genotype.


RESULTS
We used ENCODE annotations to identify appropriate proxy neutral sequences and demonstrate that the conservativeness of the test can be modulated during the filtration of reference alignments. We applied the method to noncoding Human Accelerated Elements as well as open chromatin elements previously identified in 125 human tissues and cell lines to demonstrate its utility. Then, we evaluated the impact of query region length, proxy neutral sequence length, and branch count on test sensitivity and specificity. We found that the length of the query alignment can vary between 150 bp and 1 kb without affecting the estimation of selection, while for the reference alignment, we found that a length of 3 kb is adequate for proper testing. We also simulated sequence alignments under different classes of evolution and validated our ability to distinguish positive selection from relaxation of constraint and neutral evolution. Finally, we re-confirmed that a quarter of all non-coding Human Accelerated Elements are evolving by positive selection.


CONCLUSION
Here, we introduce a method we called adaptiPhy, which adds significant improvements to our earlier method that tests for branch-specific directional selection in noncoding sequences. The motivation for these improvements is to provide a more sensitive and better targeted characterization of directional selection and neutral evolution across the genome.


Background
An accurate and comprehensive characterization of the genomic distribution of adaptive substitutions is essential for understanding the genetic basis for trait divergence between species [1][2][3][4][5]. Tests for positive selection at the interspecies scale developed during the 1980s focused on ω, the ratio of nonsynonymous to synonymous substitution rates in protein coding regions [6,7].
These methods were first applied at a whole genome scale soon after the release of reference genome assemblies for human, chimpanzee, and macaque [8][9][10][11], and provided the earliest relatively unbiased views of positive selection on protein-coding regions. At the same time, a growing appreciation for the contribution of regulatory mutations to adaptation [12][13][14] prompted the development of methods to test for positive selection in noncoding regions.
Two general approaches were devised to test for positive selection in the absence of a genetic code. One seeks regions that contain many substitutions along the human branch (since the most recent common ancestor with chimpanzees) but are otherwise highly conserved among mammals or vertebrates [15][16][17][18][19]. The other seeks an elevated rate of substitution along the human lineage in a query region hypothesized to contain regulatory elements relative to a nearby reference (proxy neutral) region thought to contain few functional elements [20,21]. Both approaches test for branch-specific accelerated substitution, but differ in the reference point against which they assess acceleration: the first tests for accelerated substitution within otherwise conserved regions against a putatively neutral region that is usually obtained from local Ancient Repeats (ARs) or fourfold degenerate sites (4D) [15,[17][18][19], while the second employs a putatively non-functional local intron of the genome as a neutral reference against which to identify branch-specific accelerated substitution [20,21]. Wong & Nielsen [20] defined the parameter ζ as the ratio of substitution rates in the query region to those in the associated neutral region; ζ is thus analogous to ω. To detect significance in the departures from neutrality, both approaches typically use maximum likelihood estimation and likelihood ratio tests (LRTs) that compare a null model allowing neutrality against an alternative model that additionally allows for positive selection.
These two general approaches have complementary strengths and weaknesses. The first approach is less sensitive, in that it does not make use of an appropriate proxy for neutral regions along the human lineage. This approach allowed the discovery of Human Accelerated Regions (HARs) [16]. However, there is no reason to suppose adaptive evolution along the human lineage has been confined to regions under purifying selection in most or all other species, particularly since such regions constitute a small fraction of the genome (Fig. 1a). Additionally, this method may fail to distinguish regions evolving under relaxation of constraint from those that have experienced positive selection. The first method has been implemented in phyloP [22], which is straightforward to execute and allows running selection tests using different approaches, such as LRT, SPH, Score and Genomic Evolutionary Rate Profiling (GERP) [16,[23][24][25]. phyloP has been extensively used for more than a decade, and has been applied to conserved DNA regions using neutral proxies based on four-fold degenerate (4D) sites e.g., [26] or local ancient repeats (ARs) e.g., [27,28]. The second method runs in HyPhy [29] and requires more computational and analytical effort than phyloP. On the other hand, it is more broadly applicable because it can query any genomic region regardless of whether that region was previously under functional constraint.
At the time these approaches were first developed, very little information existed about the location of functional elements in any genome. This limited the ability to identify suitable proxy neutral regions, i.e., those likely to be free from either purifying or positive selection. Inadvertently using constrained or accelerated regions as neutral proxies can potentially introduce artificial adaptive signals or reduce sensitivity, respectively. In addition, not knowing the location of regulatory elements meant that testing for positive selection at a genome-wide scale was intractable due to the need for massive correction for multiple testing. Prior to the invention of functional genomic assays for chromatin status, the best method for identifying putative regulatory elements was sequence conservation [15].
The ENCODE project [30,31] and other efforts [32] to identify regulatory elements throughout the human genome mean that it is now possible to focus tests for positive selection on likely functional noncoding elements and to identify appropriate proxy neutral regions. Highly conserved noncoding regions overlap with only about 1.09% of the~3 million known DNase I Hypersensitive Sites (DHSs) in the human genome [33] and just 1.06% of~1.7 million enhancers and promoters from the HoneyBadger2 regions published by the ENCODE and Roadmap Epigenomics projects (Fig. 1a). Accordingly, it seems likely that a substantial fraction of functional DNA elements do not occur in regions of strong conservation. At the same time,~18.3% of the human genome currently has no known regulatory or functional annotation, despite extensive study, providing a principled basis for choosing proxy neutral sequences that may be superior to 4D sites and ARs.
Here we introduce adaptiPhy, an improved analytical method that can test for branch-specific directional selection on any collection of query segments based on accurate alignments from three or more species. We implemented a series of technical and computational modifications to our previously published method [21] using openly available software from PHAST [22,34] and functional genomic datasets from ENCODE [35]. We tested the performance of adaptiPhy in regions that have been already tested (i.e., published Human Accelerated Regions, or HARs) and among simulated sequences evolving at neutral rates, positive selection in one branch of the tree, or two branches of the tree, and relaxation of constraint. Significant improvements include: 1) better genome-wide representation of putatively neutral proxies based on functional annotations; 2) ability to test for selection in nearly any noncoding region of the genome, including functionally dense regions; 3) the ability to increase stringency by filtering reference sequences; and 4) improved understanding of how branch number and the size of query and reference regions impact test sensitivity. We demonstrate that this approach can be applied productively to focal collections of genomic regions commonly encountered in contemporary genomics and

Results
Global nonfunctional sequences provide appropriate neutral reference sequences Thus, identifying sufficient ARs to use as a local reference for each DHS is also difficult. Next, we asked whether local ARs, local NFRs, or global NFRs can be used to build an appropriate reference for testing positive selection. To build a local reference region, we concatenated all the NFRs or ARs within 100 kb of a given query region. Next, we computed the substitution rate of each concatenated sequence of local ARs, local NFRs, and NFRs across the genome. For a set of query regions in our sample, we found a wide distribution of substitution rates among concatenated local references including local NFRs, global NFRs, and ARs (Fig. 2b).
We thus sought to test whether filtering global NFRs by their relative substitution rate over the entire tree can provide an improved neutral proxy for estimating positive selection. We filtered out global NFRs representing the top and bottom quartiles of relative substitution rates (Fig. S1), then concatenated 10 NFRs per query. When we compared the substitution rates of this new set of putative neutral references, we found that the distribution of substitution rates of the filtered global NFRs is narrower and its median is skewed to the right (Fig. 2c). This suggests that global NFRs can provide a set of putatively neutral elements if appropriately filtered. This approach also allows conservativeness and sensitivity to be modulated by tuning the filtration step accordingly. Moreover, given that we use relative values of substitution rate, this filtering step can be applied to any region of the genome regardless of the amount of functional annotation.
To assess the impact of using local versus global NFRs on testing for positive selection, we sampled three sets of queries: (1) widespread and specific DHSs (open in > 124 and exactly 1 ENCODE cell types, respectively); (2) a set of ncHAEs to be used as positive controls; and (3) a set of putatively non-functional DNA elements to be used as negative controls. The correlation in P-values is high among the 3531 DHSs that could be analyzed using both local and global neutral proxies (Spearman's Rank test ρ = 0.80; P < 2.2 × 10 − 16 ; Fig. 3a). Of these, only 2.63% scored high for positive selection (P < 0.05) for global proxies, while 5.12% scored high for positive using the local proxy alone. Likewise, the correlation of P-(See figure on previous page.) Fig. 1 Overlaps between functional regions of the human genome and the evolutionary model. a. Left, Venn diagram showing percentage of the human genome that overlaps with known non-coding functional annotations and conserved regions of the human genome. Right, scaled Venn diagram showing the proportion of conserved regions with respect to known functional annotations and gapped DNA representing telomeric and centromeric sequences (white). b. Graphic summary of our improved method. Left panel, the evolutionary ratio "ζ" is computed as the ratio of the substitution rate in a query (K query ) with respect to the substitution rate in a reference (K reference ) region. Queries can be obtained from functional annotations such as ATAC-seq or ChIP-seq peaks (red box), while reference alignments can either be taken by sampling local nonfunctional elements in the vicinity of the query or from a genome-wide random sampling of non-functional and putatively neutral regions of the genome (green boxes). Right panel, to test for positive selection in a query region on a foreground branch (red) of the tree, we fit, via maximum likelihood, both a null model and an alternative model to the alignments of the reference and query regions. In both models, on all branches, all sites in the reference region evolve neutrally. In both models, on the background branches, a fraction b1 > = 0 of sites in the query region evolve under purifying selection, at rate ζ 1 < 1 relative to sites in the reference region, and a fraction b2 = 1 -b1 of sites in the query region evolve neutrally, at relative rate ζ 2 = 1. In the null model, evolution on the foreground branch is the same as on the background branches, except a fraction Δ > = 0 of sites in the query region that evolve under purifying selection on the background branches may evolve neutrally on the foreground branch, that is, the model allows for relaxation of constraint on the foreground branch. In the alternative model, fractions Δ 1 > = 0 and Δ 2 > = 0 of sites in the query region that evolve under purifying selection and neutrally, respectively, on the background branches may evolve under positive selection on the foreground branch, at rate ζ 3 > 1. A likelihood-ratio test indicates whether the alternative model fits the alignments significantly better than the null model. As explained under "Materials and methods", we conservatively approximate this test as a chisquared test with one degree of freedom values in the global and local sets is high among the 1291 ncHAEs that could be tested using both local and global proxies (Spearman's Rank test ρ = 0.86; P < 2.2 × 10 − 16 ). Of these, only 25.33% of the ncHAE regions tested positive globally, while 39.04% tested positive for selection using the local tests (Fig. 3b). Thus, local proxies in general identify more putative cases of positive selection but have limited applicability within functiondense regions of the genome while global proxies can be used to test any query region but possibly with lower sensitivity.
Sensitivity is high given practical query length, reference length, and branch number Earlier, we observed that the distribution of substitution rates among all global reference regions used in this study is narrow and high (Fig. 2b). More specifically, we observed an average human branch length of 0.0072 substitutions per site using global neutral proxies, which is appreciably faster than the average substitution rates of local elements (average branch length = 0.0055). When we evaluated the effect of query and neutral proxy length on the sensitivity of the estimation of positive selection using empirical data, we measured the effect of reference length on substitution rate; we tested reference alignments of 300 bp, 900 bp, 3 kb, 9 kb, and 30 kb (Fig. 4a). As expected, the median of the branch lengths of the global references does not increase or decrease as they get longer; rather, they reach an equilibrium at 0.00725 with reduced variation (Fig. S5).
Functional genomic approaches such as ChIP-seq and ATAC-seq identify putative regulatory regions with window lengths that are usually between 150 and 800 bp and skewed towards shorter lengths [33,[36][37][38][39][40]. In order to assess the ability of adaptiPhy to identify positive selection throughout the biologically meaningful range of putative regulatory element sizes, we tested the effect of query lengths. Importantly, the ability to detect selection is not strongly affected by differences in query length, and remains similar down to~150 bp (Fig. 4b).
This finding suggests that our set of reference sequences is able to detect signatures of positive selection in regions where the query is longer than the actual functional element under selection, and in particular across most of the size range of known regulatory elements and open chromatin regions in the human genome.
Finally, we tested the impact of using three or five species to detect positive selection, as more branches might be expected to provide a better estimation of the background substitution rate in the reference. We found that adding one or two species above the minimum of three (two ingroup and one outgroup) provides only a negligible improvement in sensitivity of adaptiPhy (Fig. 4c). In contrast, the sensitivity of phyloP is more dependent on the number of species, improving markedly with additional taxa (Fig. 4c). Thus, adaptiPhy may be preferable in situations where the minimum number of reference genome assemblies is available.

The test discriminates between four different types of evolutionary scenarios
To determine whether adaptiPhy can correctly detect selection under different evolutionary scenarios, we simulated reference alignments evolving neutrally and query alignments evolving under four selection regimes, namely neutral on both background branches (BG) and a foreground branch (FG) of a five-species tree, purifying selection on the BG and neutral on the FG (i.e., relaxation of constraint), neutral on the BG but positive selection on the FG, and positive selection in the FG and one BG species, while the other BG species remain neutral. We found that adaptiPhy accurately discriminates positive selection from relaxation of constraint and neutral evolution (Fig. 5). Of the simulated neutral regions, 1.8% were incorrectly identified as positive selection using adaptiPhy compared to 2.8% using phyloP. In addition, phyloP fails to distinguish positive selection from relaxation of constraint in almost half of the simulated cases, while our approach fails in only 2.2% of cases. In contrast, of the simulated regions under positive selection, adaptiPhy and phyloP identified 99.7 and 99.4% of sequences simulated to be under positive selection, respectively. When positive selection occurs in the FG and in one of the BG species, adaptiPhy and phyloP identified 99.2 and 83.5% of the cases, respectively (Fig. 5).
Across all the evolutionary scenarios tested, the sensitivity of adaptiPhy to detect positive selection is 0.99 and specificity is 0.98, while the sensitivity of phyloP is 0.91 and specificity is 0.74. While it is difficult to evaluate whether the same would occur with real data, these results from simulations suggest that adaptiPhy may produce a lower false positive rate than phyloP, particularly for instances of relaxed selection.
The test reconfirms many previously identified human accelerated elements To test whether adaptiPhy and phyloP replicate results from previous scans for positive selection in noncoding regions using our global NFRs [15,[17][18][19], we queried a consolidated set of 2649 ncHAEs [41]. Since our findings indicate that our test does not dilute signals of positive selection when the query region is between 150 bp and 1 kb, and given that most ncHAEs are relatively short (67% are shorter than 300 bp), we normalized query length by capturing sequence up to 300 bp centered on each ncHAE. At a significance level of 0.05, we confirmed 25.8% of previously reported ncHAEs using adaptiPhy (Fig. S4A), while confirming 59% using phy-loP (Fig. S4B). More broadly, the distribution of P-values for previously reported ncHAEs is skewed toward 0; the peak at the lower end of the distribution is pronounced (Fig. S4C). As expected, this distribution is more skewed towards 0 using phyloP and our set of neutral references (Fig. S4D). Indeed, these results are consistent with rapid acceleration in the human lineage among ncHAEs when using our proxy neutral with both phyloP and adaptiPhy.
Interestingly, there is a substantially higher degree of overlap between the regions we replicated in the consolidated set of ncHAEs and those in each of the published studies than between any two of them (Fig. S6). For instance, we validated nearly 55% of the ncHAEs identified by Bush and Lahn [18], a significantly higher fraction than any of the other studies were able to validate (Fisher's Exact Test, Two-sided, P = 0.0007). Of the other methods that were used to identify human accelerated elements, Bush and Lahn's is the most methodologically similar to ours, although they used ancient repeats within 750 kb surrounding each ncHAE as the neutral proxy. Only four loci were identified as ncHAEs in all four prior studies, and all four were replicated here (Table S1 and Fig. S7). One of these is near BNC2, a gene that may be responsible for skin pigmentation differences between humans and other primates [42,43].
We also scanned a region of 100 kb containing one of the genes that was significant for positive selection in the published studies [15,[17][18][19] and the present study using a sliding window. We found additional signals of positive selection around the NEIL3 locus that were as strong as the single ncHAE element that was originally identified (Fig. 6). The striking clustering of signals of positive selection around this locus suggests that transcriptional regulation of NEIL3 changed extensively during human origins and highlights the ability of adaptiPhy to identify signatures of positive selection that other methods may miss.

Discussion
Rigorous testing for selection depends upon accurate identification of reference regions that represent neutral rates of evolution. Prior studies have used four-fold degenerate (4D) sites from coding sequences [15], surrounding non-coding sequences [27], concatenated conserved LINE elements (ancient repeats; ARs) [28], and non-first intron sequences [21] as neutral references. However, regulatory elements are often located in regions that are dense with functional annotations and thus potentially constrained or near sites evolving under weak purifying selection (Fig. 2a). Consequently, identifying sufficient non-functional regions to use as a neutral proxy in the vicinity of most open chromatin sites is simply not possible in primate genomes and this problem is even more acute in organisms with higher gene (See figure on previous page.) Fig. 4 Sensitivity test and the effect of reference and query length, and number of species in the estimation of selection in a subset of 1000 random DHSs. a. Effect of reference length on the power to identify selection at different significance levels. b. Effect of query length on the power to identify selection at different significance levels. c. Percentage of significant tests in a sample of 1000 DHSs that were analyzed with phyloP and adaptiPhy. Bars labeled as 5 sp. depict tests done for five branches, while 3 sp. labels depict three-species tests density. To address this limitation, we used the dense functional annotation of the human genome to identify a set of putatively non-functional regions (based on absence of any functional annotation) that can be used as an appropriate neutral proxy (see Methods).
We then filtered this set of putatively non-functional regions (NFRs) to address the challenges of rate heterogeneity and lineage sorting. For the entire set of DHS elements in our sample, we found a wide distribution of substitution rates among concatenated local samples including local NFRs, prefiltered global NFRs and ARs (Fig. 2b). This means that there are many such regions in which the substitution rate on the human branch is substantially slower or faster than average and thus potentially confounding when used to test for positive selection. These observations reflect generally less availability of non-functional elements in regions where DHSs and ncHAEs are more common and slower rates of evolution nearby each functional element, but also the possibility that some ancient LINE elements in the vicinity of DHSs are under functional constraint or evolving under high mutation rates. Indeed, many studies have shown that ARs can become co-opted as local regulatory elements e.g., [44][45][46][47]. Consistent with this interpretation, we found that the proportion of reference ARs with at least one multitissue eQTL or brain specific eQTL is almost three times as large as in a set of global non-functional sequences (Fig. S3). The relative absence of local non-functional elements may also increase the effect of incomplete lineage sorting in the reference sequence, producing an increase of false positives and negatives for query elements if the reference foreground sequence is evolving at slower or faster rates than the background.
Together, these results suggest that using local nonfunctional elements and ARs can both contribute to misestimation of selection. At the same time, ARs have the tendency to overestimate the rates of evolution of query alignments compared to a set of neutral references built from global NFRs (Fig. S2). Our findings also suggest that the local tests have a tendency to overestimate positive selection, probably because they contain unknown functional sites and thus underestimate the neutral rate. As expected, most putative non-functional elements appear neutral in tests for selection, with only one testing positive for selection using both local and global references (Fig. 3a). Overall, our results suggest that global proxies are capable of testing more genomic regions, with an important gain in accuracy as we sample putatively neutral elements based on their distribution pattern relative to the tree rates. As a consequence, every query should be tested against a reference region in which the rate on the foreground branch is approximately equal to the rate on the background branches. Local reference regions can yield higher sensitivity to positive selection but also more false positives due to unrecognized purifying selection on parts of the reference region. Since incomplete lineage sorting or variation in the time to coalescence can be a potential confounding factor for both local and global neutral proxies [48], our filtering of global sequences based on relative branch lengths and concatenation of nonfunctional elements provides a useful strategy to control these issues [49][50][51]. Moreover, coalescence times for human polymorphisms are generally in the 10 3 -10 5 year range, which is shorter than the time of the betweenspecies branches. Cases of lineage sorting between ape species certainly occur but are rare [52]. Nevertheless, we suggest caution when analyzing query regions scoring high for positive selection; this is particularly important for genomes with relatively short branch lengths. Despite these recommendations, there are situations where local references may be useful alone or as a complement to global references. In particular, local references may be useful in regions of the genome with a lower density of functional elements, because they may more accurately reflect the local substitution rate. Indeed, the implementation of adaptiPhy allows the user to choose which reference to use (or both).
We also observed that among DHSs and nonfunctional DNA elements, the distribution of P-values tends towards the upper limit (P = 1) (Fig. S4C). This means that most open-chromatin regions are probably evolving neutrally or under purifying selection because the maximum likelihood estimate of the null and the alternative are exactly the same when P = 1 (Fig. S4C). In contrast, most tests on ncHAEs show a distribution of P-values strongly skewed towards zero with a small but important density peak at 1, suggesting that at least 92 out 2416 regions originally identified as ncHAEs appear to be evolving  (Fig. S6). The location of this ncHAE is highlighted in pink, and the red dots represent windows of 300 bp where ζ scored significant for positive selection (P < 0.05) neutrally or by purifying selection in the human branch when using global proxy (Fig. S4C). Some of these may reflect errors in the early reference assemblies that were used in prior studies. Overall, however, our results are generally consistent with earlier findings [15,[17][18][19], identifying many of the same regions as being under positive selection. Moreover, our results suggest that the distribution of P-values is strongly dependent on the genomic partition: DHSs and our set of putatively nonfunctional regions scored nearly 2.5 and 0.02% of sites under positive selection respectively, confirming the expectation that DHSs in general are more often subject to positive selection than putatively nonfunctional DNA regions. This fraction is higher than previously reported [27,28]. The likely reason is that these prior studies underestimated the proportion of sites under selection because the AR elements used as neutral proxies are evolving under higher substitution rates than the rest of the genome (Fig. S2).
By testing reference alignments of different lengths, we show a diminishing added return as these neutral proxies get longer. We recommend using reference alignments between 3 and 9 kb (Fig. 4a), as longer alignments require more computing power while providing only minimal additional sensitivity. Reference alignments shorter than 3 kb introduce more variation in estimation of ζ and thus increase the risk of false positives and negatives. We also tested the effect of query region length and found little difference in sensitivity between regions of 150 bp and 1000 bp, which means our approach can be applied to the vast majority of putative regulatory elements. We recommend caution in extending query region length indefinitely because this risks combining multiple functional elements with neutral and non-neutral substitution rates.
Our framework can be used to detect sequence outliers under a high substitution regime while controlling for relaxation of constraint. We propose that most signals that are detected reflect true instances of positive selection. Locally elevated mutation rate is a potential confound [53], however. Some evidence suggests that mutation rate correlates with substitution rates between species within long segments of the genome (~50 kb) [54], perhaps due to mutation pressure. However, the situation remains unclear, as other studies found that mutation rate does not explain variation in divergence between species [55,56]. In addition, regulatory elements are much shorter (generally 100-350 bp), and our sliding window scans reveal highly localized peaks of elevated substitution rather than the broad regions analyzed in the studies just mentioned. To investigate whether local differences in mutation or recombination rate might produce a false positive, we recommend investigating the amount of common and rare variation present in a query region by consulting dbSNP or 1000 genomes databases. Rare variants are often used as a proxy for mutation rate [56][57][58], making this a straightforward way to flag queries that score high for positive selection but are under a higher or lower mutational regime. Moreover, finding an excess of common variants under high-to-intermediate frequencies in the Site-Frequency Spectrum (SFS) in a given region that scores high for positive selection is potentially a useful way to identify previously unknown regions under persistent balancing selection [59]. Indeed, we found that the most variable element under positive selection is located in the MHC region (Supplementary Data), a region known to be under balancing selection.
Another potential confound of our approach is variation in recombination rate [48], which can cause discordance in the phylogenetic inference of substitution rates. Again, however, our query regions are very short (100 s of bp) while values for recombination rates in the genome are calculated over much longer scales (~100 kb) and are still rather imprecise [60,61]. Furthermore, the effects of a selective sweep are more evident at the within-than between species scale. Some tests for selection designed to work within species measure reduced polymorphism around the selected site and later a signal of lowfrequency derived mutations in linkage disequilibrium (LD) with it. These signals only persist for tens of thousands of years in human populations, since over time recombination erodes LD and drift eliminates most lowfrequency alleles [62]. As a result, it is not possible to test for selection across deeper evolutionary time scales using signals like the SFS or LD. Divergence times between great apes range from~7-16 million years, much too long for the LD effects of a selective sweep to be evident and leaving only a signature of elevated substitution.
If selective sweeps did influence tests for positive selection at the between-species level, we would expect the same query regions to be flagged by tests at the withinand between-species levels. This is not what we observe. As a specific example, we analyzed the region that includes SNPs responsible for lactase persistence in some human populations. The region around these SNPs shows one of the strongest selective sweeps known from the entire human genome, making it a good test case. Of 1847 windows tested around the LCT locus, only 9 (0.48%) scored significant for positive selection using adaptiPhy (Fig. S7), which is about the percentage among non-functional regions distributed randomly across the genome. Further, the specific area around the lactase persistence SNP that recently swept in humans is not positive for selection by our test for selection between species.
A final potential confound is the idea that reference genome sequences do not capture all the variation that is present at the population level [63]. In this study, we used the human (hg19), chimpanzee (panTro4), gorilla (gorGor3), orangutan (ponAbe2), and macaque (rhe-Mac3) reference assemblies. None of these reflect the ancestral state for their respective species, and are instead based on observed sequences from at least one diploid individual. As a result, some regions may contain concentrations of rare or common derived alleles simply by chance. This will artificially increase the apparent interspecies substitution rate. To control for this potential confound, we suggest re-running tests for selection for any query regions of particular interest using local alignments that represent the reconstructed ancestral state in order to correct for intraspecific variation.

Conclusion
Together, the improvements introduced here increase the ability to identify proxy neutral regions and to modulate the sensitivity and conservativeness of branchspecific tests for positive selection in noncoding regions. The test is sensitive across nearly the entire range of annotated functional regulatory elements, dropping only for elements < 150 bp in length. It is possible to apply these tests to nearly any noncoding region of the genome, even those in functionally dense locations.

Methods
Testing for branch-specific positive selection using our approach requires at least one reference alignment in which all branches of the tree are evolving at putatively neutral rates, and one query alignment, which can be obtained from any genomic region of interest, such as putative open chromatin regions or segments of a GWAS peak (Fig. 1b). First, we downloaded the 100-way multiple alignment from the University of California Santa Cruz (UCSC) website (http://hgdownload.soe.ucsc. edu/goldenPath/hg19/multiz100way/) and several annotations of functional DNA elements from the ENCODE Project at UCSC, including: 5′ and 3′ UTRs, total human mRNA, lincRNAs, microRNAs, sncRNAs, short repeats, CpG islands, etc. (supplementary Table 2). We also enriched this list of functional elements with a set of HoneyBadger2-intersect promoters and enhancers from reg2map annotated by the Broad Institute and Epigenomics Roadmap project. Then, using maf_parse implemented in PHAST [22], we transformed the 100-way alignment into a smaller 5-way genome-wide alignment in MAF format that included only our focal species human (hg19), chimpanzee (panTro4), gorilla (gorGor3), orangutan (ponAbe2), and rhesus macaque (rheMac3). To draw reference alignments and non-functional regions, we generated a masked 5-way MAF alignment with a BED file containing all known functional DNA regions using maf_parse with the optional command --mask-features. To mask the genome, we used a merged BED file that included 5′ and 3′ exons, all coding and non-coding RNAs, vista enhancers, roadmap and EN-CODE regulatory elements and promoters, CpG repeats, microsatellite sequences and simple repeats. Interestingly, the remaining non-functional fraction of the genome covered only~20.5% of the genome. We used msa_split [22] to draw alignments from non-coding human accelerated elements, ncHAEs [15,[17][18][19]64], a random subset of non-functional regions (as defined below), and DNA Hypersensitive Sites (DHSs) from 125 human cell types and tissues [33].
To select non-functional regions (NFRs) for our analyses of positive selection, we randomly chose around two million (1,893,795) non-overlapping segments of 300 bp from the non-functional fraction of the genome described above, collectively amounting to 18.3% of the genome. Subsequently, we excluded all the alignments containing any masked regions. Next, we computed branch lengths of each of the tree branches using the tool phyloFit available in PHAST [22]. phyloFit computes a tree with branch lengths and a substitution rate matrix by fitting a tree model to a multiple sequence alignment using maximum likelihood for a total of 92,160 regions. We noted a large peak of substitution rates near 0, and to avoid this bias we removed sites with a relative substitution rate < 0.001 for any of the sites. Within the remaining sample of 52,879 alignments from the last filtration step, we excluded any alignments in which the human branch was evolving too quickly or too slowly with respect to the total tree by using a custom R script; in this step, we omitted all trees with a relative branch length within the top and bottom 25% of this distribution (Fig. S1). This step reduced the pool of global NFRs to 26,426 FASTA alignments in which the relative human branch length ranged between the lower and the upper quartiles. The genome locations of these NFRs are available in the supplement (NFR.sorted.bed.csv).
To run our tests of selection, we fitted the null and alternative models for each DNA-element alignment using the batch scripts written by Haygood and collaborators [21] that run under the program HyPhy [29]. After all tests were completed, we extracted the best maximum likelihood estimates from twenty fittings to allow for stochasticity. Then, we obtained P-values from each likelihood ratio test (LRT) using the Chi 2 distribution tool (pchisq) with one degree of freedom, implemented in R [65]. Consistent with previous findings [66], we observed that all P-value distributions were non-uniform and highly skewed to 1, therefore we considered our test to be conservative (Fig. 3b). Consequently, we decided to use nominal P-values smaller than 0.05 to name regions scoring high for positive selection, instead of correcting for multiple testing to avoid violating the assumption of uniform P-value distribution underlying the False Discovery Rate methodology [67,68,69]. We recognize that each query is independent of the rest, and thus the problem of multiple testing is real. We thus recommend that any query region where the nominal P-value indicates positive selection should be re-tested against different reference regions. In this case, we recommend adjusting P-values when the same query region is tested multiple times or when testing enrichment between multiple groups. All data generated during this study are included in the supplementary information files as comma delimeted files (NFR.data, DHS.data and ncHAE.data).
Evaluation of the effect of query and reference length on sensitivity using empirical data To examine the sensitivity of our framework to detect positive selection under varying lengths of the query and reference, we obtained seven sets of 1000 queries of DHS alignments from the center of the peak position up to a total of 150 bp, 250 bp, 300 bp, 400 bp, 500 bp, 600 bp and 1000 bp on both sides. For each of these alignments, we also generated 10 reference alignments in order to account for the stochasticity of the evolutionary processes by concatenating 10 alignments from our set of 26 K non-functional 300 bp elements. Likewise, to identify the effects of variation in reference length in our queries, we also ran each query alignment of 300 bp against each of ten reference alignments varying from 300 bp to 30,000 bp. We also investigated the effects of including five species in our alignments rather than the minimum of three species considered by Haygood and collaborators [21]. There are now many genomewide assemblies that are available for many species, so we decided to employ additional species to test the effect of removing two branches in the tree. To do this, we extracted a random pool of 1000 queries with both three (human, chimp and macaque) and five species (human, chimp, gorilla, orangutan and macaque) from the DHSs prepared by Thurman and collaborators [33].
Evaluation of the effect of evolutionary state on specificity and sensitivity using simulated data One typical concern of any test of directional selection is the fact that it may deliver a signal of positive selection if reference sequences are under unrecognized purifying selection or if mutation rates are increased in the query region [66]. To test for these effects, we used the distribution of both relative and absolute branch lengths among non-functional sequences to simulate the distribution of trees under different classes of evolution (i.e. relaxation of constraint, increased mutation rate, positive selection or neutrality). To do this, we assumed that the alignments in the second quartile were putatively neutral, while those in the lower quartile were constrained, and those beyond the highest value of the upper quartile were under positive selection. Subsequently, we used these 'neutral' distributions to generate random sets of simulated trees, which were executed in the program seq-gen [70] to simulate sequence alignments in FASTA format. Consequently, we generated four sets of 1000 query alignments in which 100% of the alignments were evolving by different classes of evolution: i, neutral in the foreground and in the background; ii, neutral in the foreground but constrained in the background by scaling down the substitution rates in the background by a factor α = 0.1; iii, scaling up the neutral rate in the foreground by a factor α = 7; and iv, scaling up the neutral rate in the foreground by a factor of α = 7 and scaling up the neutral rate in one of the background species (i.e. gorilla) by a factor of α = 10. Next, we used these simulated alignments to explore the power of adaptiPhy to distinguish positive selection from relaxation of constraint and other types of evolution at the phylogenetic scale. Finally, we compared the results obtained from adaptiPhy with phyloP. This computational tool estimates a null distribution of the number of substitutions from the reference model, and computes the number of actual substitutions that occur in the query alignment, then it estimates P-values in the branch of interest given the total tree using a LRT method [16]. Here, for each query region used before, we tested the human subtree using the SPH model against the putatively neutral reference alignment obtained from phyloFit. Then, we parsed the P-value of acceleration in subtree given total tree using custom bash scripts.

Evaluation of sensitivity using known non-coding human accelerated elements
To test for selection in regions of the genome that have been previously investigated, we obtained a consolidated set of positive control queries of 2649 ncHAEs studied by Capra (2013) [41]. To compare the fraction of positive selection among ncHAEs with other sets of known regulatory elements, we used a subset of DHSs previously published by Thurman and collaborators (2012) [33]. To sample the DHSs in our test, we defined the "Ubiquity Score" as the fraction of cell types a given open chromatin site was open among the total number of tissues or cell types surveyed, thereby identifying a different set of queries to run our LRT method on known regulatory elements. To do this, we selected 4216 DHSs that were open in at least 124 of 125 cell types, which we term 'widespread sites' (Ubiquity score > 0.98), and we selected 7433 DHSs that were open in at most 2 of 125 cell types, which we term 'specific sites' (Ubiquity score < 0.02). Although these numbers seem arbitrary, they are a consequence of losing sequence alignments due to missing sequences from any of the branches or high frequency of gaps.
Funding Alejandro Berrio was supported by a Hargitt Fellowship from Duke University and NIH grant 1R01-TW010870 during the design of the study, data collection, analysis, interpretation of data, and during the writing of this manuscript.

Availability of data and materials
Genomewide aligment of the 100 vertebrate alignment used in this study is available at the UCSC genome browser (http://hgdownload.soe.ucsc.edu/ goldenPath/hg19/multiz100way/). Functional genomic data used in this manuscript to mask functional regions of the genome are available at UCSC genome data browser (https://genome.ucsc.edu/cgi-bin/hgTables), ENCODE (https://www.encodeproject.org/data/annotations/) and HoneyBadger2 (https://personal.broadinstitute.org/meuleman/reg2map/), see Supplementary Table 2 for direct web links. Scripts and software for running tests of selection using adatiPhy are available in Github (https://github.com/ wodanaz/adaptiPhy). This GitHub repository also contains a Docker file with the minimal requirements to run adaptiPhy in a sample dataset. Datasets containing results of selection scores and neutral sequence regions used and/or analyzed during the current study are available from the supplementary information. Other datasets and R scripts used in the current study are available from the corresponding author on reasonable request.
Ethics approval and consent to participate Not applicable.