Identifying branch-specific positive selection throughout the regulatory genome using an appropriate neutral proxy

Adaptive changes in cis-regulatory elements are an essential component of evolution by natural selection. Identifying adaptive and functional noncoding DNA elements throughout the genome is therefore crucial for understanding the relationship between phenotype and genotype. Here, we introduce a method we called adaptyPhy, which adds significant improvements to our earlier method that tests for branch-specific directional selection in noncoding sequences. The motivation for these improvements is to provide a more sensitive and better targeted characterization of directional selection and neutral evolution across the genome. We use ENCODE annotations to identify appropriate proxy neutral sequences and demonstrate that the conservativeness of the test can be modulated during the filtration of reference alignments. We apply the method to noncoding Human Accelerated Elements as well as open chromatin elements previously identified in 125 human tissues and cell lines to demonstrate its utility. We also simulate sequence alignments under different classes of evolution in order to validate the ability of adaptiPhy to distinguish positive selection from relaxation of constraint and neutral evolution. Finally, we evaluate the impact of query region length, proxy neutral sequence length, and branch count on test sensitivity.


INTRODUCTION 30
An accurate and comprehensive characterization of the genomic distribution of adaptive 31 substitutions is essential for understanding the genetic basis for trait divergence between species 32 (Wayne and Simonsen 1998;Yang and Bielawski 2000;Nadeau and Jiggins 2010;Pardo-Diaz et 33 al. 2015;Reilly and Noonan 2016). Tests for positive selection at the interspecies scale 34 developed during the 1980s focused on ω , the ratio of nonsynonymous to synonymous 35 substitution rates in protein coding regions (Li et al. 1985;Nei and Gojobori 1986). These 36 methods were first applied at a whole genome scale soon after the release of reference genome 37 assemblies for human, chimpanzee, and macaque (Lander et al. 2001;Clark et al. 2003;38 Iacobuzio-Donahue et al. 2003;Rhesus Macaque Genome Sequencing and Analysis Consortium 39 et al. 2007), and provided the earliest relatively unbiased views of positive selection on protein-40 coding regions. At the same time, a growing appreciation for the contribution of regulatory 41 mutations to adaptation (Hedrick and McDonald 1980;Prud'homme et al. 2006;Wray 2007) 42 prompted the development of methods to test for positive selection in noncoding regions. 43 Two general approaches were devised to test for positive selection in the absence of a 44 genetic code. One seeks regions that contain many substitutions along the human branch (since 45 the most recent common ancestor with chimpanzees) but are otherwise highly conserved among mammals or vertebrates (Pollard, Salama, King, et al. 2006;Siepel et al. 2006;Bird et al. 2007;47 Bush and Lahn 2008;Prabhakar et al. 2008). The other seeks an elevated rate of substitution 48 along the human lineage in a query region hypothesized to contain regulatory elements relative 49 to a nearby reference (proxy neutral) region likely to contain few functional elements (Wong and 50 Nielsen 2004;Haygood et al. 2007). Both approaches test for branch-specific accelerated 51 substitution, but differ in the reference point against which they assess acceleration: the first tests 52 for accelerated substitution within otherwise conserved regions against a putatively neutral 53 region that is usually obtained from local Ancient Repeats (ARs) or fourfold degenerate sites 54 (4D) (Pollard, Salama, King, et al. 2006;Bird et al. 2007;Bush and Lahn 2008;Prabhakar et al. 55 2008), while the second employs a putatively non-functional local intron of the genome as a 56 neutral reference against which to identify branch-specific accelerated substitution (Wong and 57 Nielsen 2004;Haygood et al. 2007). Wong & Nielsen (2004) defined the parameter ζ as the ratio 58 of substitution rates in the query region to those in the associated neutral region; ζ is thus 59 analogous to ω . To detect significance in the departures from neutrality, both approaches 60 typically use maximum likelihood estimation and likelihood ratio tests (LRTs) that compare a 61 We thus sought to test whether filtering global NFRs by their relative substitution rate 139 respect to the entire tree can provide an improved neutral proxy for estimating positive selection. 140 We filtered out global NFRs representing the top and bottom quartiles of relative substitution 141 rates ( Figure S1), then concatenated 10 NFRs per query. When we compared the substitution 142 rates of this new set of putative neutral references, we found that the distribution of substitution 143 rates of the filtered global NFRs is narrower and its median is skewed to the right ( Figure 2C). 144 This suggests that global NFRs can provide a set of putatively neutral elements if appropriately 145 filtered. This approach also allows conservativeness and sensitivity to be modulated by tuning 146 the filtration step accordingly. Moreover, given that we use relative values of substitution rate, 147 this filtering step can applied to any region of the genome regardless of the amount of functional 148 annotation. 149 To validate the effect of local and global NFRs when testing for positive selection, we 150 sampled three sets of queries: (1) widespread and specific DHSs (open in >124 and exactly 1 151 ENCODE cell types, respectively); (2) a set of ncHAEs to be used as positive controls; and (3) a 152 set of putatively non-functional DNA elements to be used as negative controls. The correlation in 153 P-values is high among the 3,531 DHSs that could be compared using both local and global 154 neutral proxies (Spearman's Rank test rho = 0.80; P < 2.2x10 -16 ; Figure 3A). Of these, only 155 2.63% scored high for positive selection (P < 0.05) for global proxies, while 5.12% scored high 156 for positive using the local proxy alone. Likewise, the correlation of P-values in the global and 157 local sets is high among the 1,291 ncHAEs that could be tested using both local and global 158 proxies (Spearman's Rank test rho = 0.86; P < 2.2x10 -16 ). Of these, only 25.33% of the ncHAE 159 regions tested positive globally, while 39.04% tested positive for selection using the local tests 160 ( Figure 3B). Thus, local proxies in general identify more putative cases of positive selection but 161 Berrio et al. 2019 8 have limited applicability within function-dense regions of the genome while global proxies can 162 be used to test any query region but possibly with lower sensitivity. 163 164 Sensitivity is high given practical query length, reference length, and branch number 165 Earlier, we observed that the distribution of substitution rates among all global reference 166 regions used in this study is narrow and high ( Figure 2B). More specifically, we observed an 167 average human branch length of 0.0072 substitutions per site using global neutral proxies, which 168 is appreciably faster than the average substitution rates of local elements (average branch length 169 = 0.0055). We evaluated the effect of query and neutral proxy length in the estimation of positive 170 selection, using both real and simulated data. To evaluate the effect of reference length on 171 sensitivity, we tested reference alignments of 300 bp, 900 bp, 3 kb, 9 kb, and 30 kb ( Figure 4A). 172 As expected, the median of the branch lengths of the global references does not increase or 173 decrease as they get longer; rather, they reach an equilibrium at 0.00725 with reduced variation 174 ( Figure S5). 175 Functional genomic approaches such as ChIP-seq and ATAC-seq identify putative 176 regulatory regions with window lengths that are usually between 150-800 bp and skewed 177 towards shorter lengths (Crawford 2005;Birney et al. 2007;Song and Crawford 2010;Shibata et 178 al. 2012;Thurman et al. 2012;Bryois et al. 2017). In order to assess the ability of adaptiPhy to 179 identify positive selection throughout the biologically meaningful range of putative relative 180 element sizes, we tested the effect of query lengths. Importantly, the ability to detect selection is 181 not strongly affected by differences in query length, and remains similar down to ~150 bp 182 ( Figure 4B). This finding suggests that our set of reference sequences is able to detect signatures 183 of positive selection in regions where the query has been predicted to be longer than the actual 184 functional element under selection, and in particular across most of the size range of known 185 regulatory elements and open chromatin regions in the human genome. 186 Finally, we tested the impact of using three or five species to detect positive selection, as 187 more branches might be expected to provide a better estimation of the background substitution 188 rate in the reference. We found that adding one or two species above the minimum of three (two 189 ingroup and one outgroup) provides only a negligible improvement in sensitivity of adaptiPhy 190 ( Figure 4C). In contrast, the sensitivity of phyloP is more dependent on the number of species, 191 improving markedly with additional taxa ( Figure 4C). Thus, adaptiPhy may be preferable in 192 situations where the minimum number of reference genome assemblies is available. 193 194 The test discriminates between three different types of selection 195 To determine whether adaptiPhy can correctly detect selection under different 196 evolutionary scenarios, we simulated both query and reference alignments undergoing different 197 types of evolution, including neutral in both the background (BG) and the foreground (FG), 198 constraint in the BG but neutral in the FG (i.e., relaxation of constraint), and neutral in the BG 199 but positive selection in the FG. We found that adaptiPhy accurately discriminates positive 200 selection from relaxation of constraint and neutral evolution ( Figure 5). Of the simulated neutral 201 regions, 2.3% were incorrectly identified as positive selection using adaptiPhy compared to 202 4.1% using phyloP. In addition, phyloP fails to distinguish positive selection from relaxation of 203 constraint in almost half of the simulated cases, while our approach fails in only 2.7% of cases. 204 In contrast, of the simulated regions under positive selection, adaptiPhy and phyloP identified 205 99.6% and 98.8% of sequences simulated to be under positive selection, respectively ( Figure 5). 206 Moreover, the sensitivity of adaptiPhy to detect positive selection is 1 and specificity is 0.97, 207 thus potentially constrained or near sites evolving under weak selection ( Figure 2A). 254 Consequently, identifying sufficient non-functional regions to use as a neutral proxy in the 255 vicinity of most open chromatin sites is simply not possible in primate genomes and the problem 256 is even more acute in organisms with higher gene density. To address this limitation, we used the 257 dense functional annotation of the human genome to identify a set of putatively non-functional 258 regions that can be used as an appropriate neutral proxy. 259 We then filtered this set of putatively non-functional regions to address the challenge of 260 rate heterogeneity. For the entire set of DHS elements in our sample, we found a wide 261 distribution of substitution rates among concatenated local samples including local NFRs, raw 262 global NFRs and ARs ( Figure 2B). This means that there are many such regions in which the 263 substitution rate on the human branch is substantially slower or faster than average and thus 264 potentially confounding when used to test for positive selection. These observations reflect 265 generally less availability of non-functional elements in regions where DHSs and ncHAEs are 266 more common and slower rates of evolution nearby each functional element, but also the 267 possibility that some ancient LINE elements in the vicinity of DHSs are under functional 268 constraint or evolving under high mutation rates. Indeed, many studies have shown that ARs can 269 become co-opted as local regulatory elements (Greene et al. 2005;Kamal et al. 2006;Maumus 270 and Quesneville 2014;Zeng et al. 2018). Consistent with this interpretation, we found that the 271 proportion of reference ARs with at least one multitissue eQTL or brain specific eQTL is almost 272 three times as large as in a set of global non-functional sequences ( Figure S3). The relative 273 absence of local non-functional elements may also increase the effect of incomplete lineage 274 sorting in the reference sequence, producing an increase of false positives and negatives for 275 query elements if the reference foreground sequence is evolving at slower or faster rates than the 276 background. 277 Together, these results suggest that using local non-functional elements and ARs can both 278 contribute to misestimation of selection. At the same time, ARs have the tendency to 279 overestimate the rates of evolution of query alignments compared to a set of neutral references 280 built from global NFRs ( Figure S2). Our findings also suggest that the local tests have a 281 tendency to overestimate positive selection, probably because they contain unknown functional 282 sites and thus underestimate the neutral rate. As expected, most putative non-functional elements 283 appear neutral in tests for selection, with only one testing positive for selection using both local 284 and global references. Overall, these results suggest that global proxies are capable of testing 285 more genomic regions, with an important gain in accuracy as we sample neutral regions based on 286 their distribution pattern relative to the tree rates. As a consequence, every query will be tested 287 against a reference region in which the foreground rates are not evolving faster or slower than the 288 background branches of the tree. Local proxies can be more sensitive by increasing the number 289 of sites that are under positive selection, some of which may be falsely called by introducing 290 reference sequences that are in constraint and are thus evolving at slower than neutral rates. 291 Since incomplete lineage sorting can be a potential confounding factor for both local and global 292 neutral proxies, our filtering of global sequences based on relative branch lengths and 293 concatenation of non-functional elements provides a useful strategy to control this issue (Tonini 294 et al. 2015). 295 We also observed that among DHSs and nonfunctional DNA elements, the distribution of 296 P-values tends towards the upper limit (P=1). This means that most open-chromatin regions are 297 probably evolving neutrally or under purifying selection because the maximum likelihood estimate of the null and the alternative are exactly the same when P=1 (Figure S4 C). In contrast, 299 most tests on ncHAEs show a distribution of P-values strongly skewed towards zero with a small 300 but important density peak at 1, suggesting that at least 92 out 2,416 regions originally identified 301 as ncHAEs appear to be evolving neutrally or by purifying selection in the human branch when 302 using global proxy. Some of these may reflect errors in the early reference assemblies that were 303 used in prior studies. And indeed, our results are generally consistent with their findings (Pollard, 304 Salama, King, et al. 2006;Bird et al. 2007;Bush and Lahn 2008;Prabhakar et al. 2008), 305 identifying many of the same regions as being under positive selection. Moreover, our results 306 suggest that the distribution of P-value is strongly dependent on the genomic partition: DHSs and 307 our set of non-functional elements scored nearly 2.5% and 0.02% of sites under positive 308 selection respectively, confirming the expectation that DHSs in general are more often subject to 309 positive selection than nonfunctional DNA elements. This fraction is higher than previously 310 reported (Gittelman et al. 2015;Dong et al. 2016). The likely reason is that these prior studies 311 underestimated the proportion of sites under selection because the AR elements used as neutral 312 proxies could be evolving under higher substitution rates than the rest of the genome ( Figure S2). 313 By testing reference alignments of different lengths, we show a diminishing return as 314 these proxies get longer. We recommend using reference alignments between 3 and 9 kb ( Figure  315 4A), as longer alignments require more computing power while providing only minimal 316 additional sensitivity. Reference alignments shorter than 3 kb introduce more variation in 317 estimation of ζ and thus increase the risk of false positives and negatives. We also tested the 318 effect of query region length and found little difference in sensitivity between regions of 150 bp 319 and 1000 bp, which means our approach can be applied to the vast majority of putative regulatory elements. We recommend caution in extending query region length indefinitely 321 because this risks combining multiple functional elements with distinct rates of substitution. 322 Because differences in the rate of substitution in the reference has a strong influence on 323 sensitivity, using global references can be valuable, particularly in regions of the genome that are 324 densely populated with genes or other functional elements. In addition, a strong correlation exists 325 between the distributions of P-values found in our test of DHSs (Spearman's Rank test rho = 326 0.78; P < 2.2x10 -16 ) and ncHAEs (Spearman's Rank test rho = 0.86; P < 2.2x10 -16 ), and their 327 corresponding rates of evolution (ζ) ( Figure S4). This result suggests that both P-value and Our framework can be used to detect sequence outliers under a high substitution regime 333 while controlling for relaxation of constraint. We propose that most signals that are detected 334 reflect true instances of positive selection. However, locally elevated mutation rate is a potential 335 confound. To investigate whether local differences in mutation might produce a false positive, 336 we recommend investigating the amount of common and rare variation present in a query region 337 by consulting dbSNP or 1000 genomes databases. Rare variants are often used as a proxy for 338 both mutation rate and recombination rate (Nei 1977;Slatkin 1985;Nachman 2001;Nachman 339 2002), making this a straightforward way to flag queries that score high for positive selection but 340 are under a higher or lower mutational regime. Moreover, finding an excess of common variants 341 under high-to-intermediate frequencies in a given region that scores high for positive selection is 342 potentially a useful way to identify previously unknown regions under persistent balancing selection. Indeed, we found that the most variable element under positive selection is located in 344 the MHC region (Supplementary Data), a region known to be under balancing selection 345 (Takahata et al. 1992). 346 In this study, we used the human (hg19), chimpanzee (panTro4), gorilla (gorGor3), 347 orangutan (ponAbe2), and macaque (rheMac) reference assemblies. It is highly likely that none 348 of these reflect the ancestral state for their respective species. Depending of the number of 349 individual genomes that were assembled into the reference, some regions may contain 350 concentrations of rare or common derived alleles. This will artificially increase the apparent 351 interspecies substitution rate. To control for this potential confound, we suggest including both 352 common and rare variation that is known for the branch of interest and build local alignments 353 with reference genomes that have used ancestral state data to correct for intraspecific variation. 354 Together, the improvements introduced here improve the ability to identify proxy neutral 355 regions and increase the ability to modulate the sensitivity and conservativeness of branch-356 specific tests for positive selection in noncoding regions. The test is sensitive across nearly the 357 entire range of annotated functional regulatory elements, dropping only for elements <150 bp in 358 length. It is possible to apply these tests to nearly any noncoding region of the genome, even 359 those in functionally dense locations. 360

MATERIALS AND METHODS 362 363
Testing for branch-specific positive selection using our approach requires at least one 364 reference alignment in which all branches of the tree are evolving at putatively neutral rates, and 365 one query alignment, which can be obtained from any genomic region of interest, such as putative open chromatin regions or segments of a GWAS peak ( Figure 1B). First, we 367 downloaded the 100-way multiple alignment from the University of California Santa Cruz 368 (UCSC) website (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/) and several 369 annotations of functional DNA elements from the ENCODE Project at UCSC, including: 5' and 370 3' UTRs, total human mRNA, lincRNAs, microRNAs, sncRNAs, short repeats, CpG islands, etc. 371 We also enriched this list of functional elements with a set of HoneyBadger2-intersect promoters 372 and enhancers from reg2map annotated by the Broad Institute and Epigenomics Roadmap 373 project. Then, using maf_parse implemented in PHAST (Hubisz et al. 2011), we transformed the 374 100-way alignment into a smaller 5-way genome-wide alignment in MAF format that included 375 only our focal species human (hg19), chimpanzee (panTro4), gorilla (gorGor3), orangutan 376 (ponAbe2), and rhesus macaque (rheMac3). To draw reference alignments and non-functional 377 regions, we generated a masked 5-way MAF alignment with a BED file containing all known 378 functional DNA regions using maf_parse with the optional command --mask-features. To mask 379 the genome, we used a merged BED file that included 5' and 3' exons, all coding and non-380 coding RNAs, vista enhancers, roadmap and ENCODE regulatory elements and promoters, CpG 381 repeats, microsatellite sequences and simple repeats. Interestingly, the remaining non-functional 382 fraction of the genome covered only ~20.5% of the genome. To draw query alignments, we used 383 To select non-functional regions (NFRs) for our analyses of positive selection, we 390 randomly chose around two million (1,893,795) non-overlapping segments of 300 bp or longer 391 from the non-functional fraction of the genome described above, collectively amounting to 392 18.3% of the genome. Subsequently, we excluded all the alignments containing any masked 393 regions with functional sequences, CpG islands, repetitive elements, assembly gaps and 394 heterochromatin sequences (masked as Ns), and/or missing sequences (masked as asterisks or 395 dashes.) Next, we computed branch lengths of each of the tree branches using the tool phyloFit 396 available in PHAST (Hubisz et al. 2011), phyloFit computes a tree with branch lengths and a 397 substitution rate matrix by fitting a tree model into a multiple sequence alignment using 398 maximum likelihood. Within the remaining sample of 92,160 alignments from the last filtration 399 step, we excluded any alignments in which the human branch was evolving too quickly or too 400 slowly with respect to the total tree by using a custom R script; in this step, we omitted all trees 401 with a relative branch length within the top and bottom 25% of this distribution ( Figure S1). This 402 step reduced the pool of global NFRs to 26,426 FASTA alignments in which the relative human 403 branch length ranged between the lower and the upper quartiles. 404

405
To run our tests of selection, we fitted the null and alternative models for each DNA-406 element alignment using the batch scripts written by Haygood and collaborators (2007) that run 407 under the program HyPhy (Pond et al. 2005). After all tests were completed, we extracted the 408 best maximum likelihood estimates from twenty fittings to allow for stochasticity. Then, we 409 obtained P-values from each likelihood ratio test (LRT) using the Chi 2 distribution tool (pchisq) 410 with one degree of freedom, implemented in R (R Team 2015). Interestingly, we observed that 411 all P-value distributions were non-uniform and highly skewed to 1, therefore we considered our test to be conservative ( Figure 3B). Consequently, we decided to use nominal P-values smaller 413 than 0.05 to name regions scoring high for positive selection, instead of correcting for multiple 414 testing to avoid violating the assumption of uniform distribution of the False Discovery Rate 415 methodology (Benjamini and Hochberg 1995;Yekutieli and Benjamini 1999). 416

417
Evaluation of the effect of query and reference length on sensitivity. 418

419
To examine the sensitivity of our framework to detect positive selection under varying 420 lengths of the query and reference, we obtained seven sets of 1000 queries of DHS alignments 421 from the center of the peak position up to a total of 150 bp, 250 bp, 300 bp, 400bp, 500 bp, 600 422 bp and 1000 bp were reached in both sides. For each of these alignments, we also generated 10 423 reference alignments in order to account for the stochasticity of the evolutionary processes by 424 concatenating 10 alignments from our set of 26K non-functional 300bp elements. Likewise, to 425 identify the effects of variation in reference length in our queries, we also ran each query 426 alignment of 300 bp against each of ten reference alignments varying from 300 bp to 30,000 bp. 427 We also investigated the effects of including five species in our alignments rather than the 428 minimum of three species considered by Haygood et al. (2007). There are now many 429 genomewide assemblies that are available for many species, so we decided to employ additional 430 species to test the effect of removing two branches in the tree. To do this, we extracted a random 431 pool of 1,000 queries with both three (human, chimp and macaque) and five species (human, 432 chimp, gorilla, orangutan and macaque) from the DHSs prepared by Thurman and collaborators 433 (Thurman et al. 2012). 434 One typical concern of our method is the fact that it may be sensitive to deliver signals of 436 selection if the reference sequences consist of negative selection, and if mutation rates are 437 increased along all the branches of a tree. To test for these effects, we used the distribution of 438 both relative and absolute branch lengths among non-functional sequences to simulate the 439 distribution of trees under different classes of evolution (i.e. relaxation of constraint, increased 440 mutation rate, positive selection or neutrality). To do this, we assumed that the alignments in the 441 second quartile were putatively neutral, while those in the lower quartile were constrained, and 442 those beyond the highest value of the upper quartile were under positive selection. Subsequently, 443 we used these 'neutral' distributions to generate random sets of simulated trees, which were 444 executed in the program seq-gen (Rambaut and Grassly 1997) to simulate sequence alignments 445 in FASTA format. Consequently, we generated four sets of 1,000 query alignments in which 446   P  o  l  l  a  r  d  K  S  ,  H  u  b  i  s  z  M  J  ,  R  o  s  e  n  b  l  o  o  m  K  R  ,  S  i  e  p  e  l  A  .  2  0  1  0  .  D  e  t  e  c  t  i  o  n  o  f  n  o  n  n  e  u  t  r  a  l   618   s  u  b  s  t  i  t  u  t  i  o  n  r  a  t  e  s  o  n  m  a  m  m  a  l  i  a  n  p  h  y  l  o  g  e  n  i  e  s  .  G  e  n  o  m  e  R  e  s  .  2  0  :  1  1  0  -1    four of the cited studies ( Figure S7). The location of this ncHAE is highlighted in pink, and the 787 red dots represent windows of 300 bp were ζ scored significant for positive selection (P < 0.05).