Relative effects of mutability and selection on single nucleotide polymorphisms in transcribed regions of the human genome

Motivation Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation in humans. However, the factors that affect SNP density are poorly understood. The goal of this study was to estimate the relative effects of mutability and selection on SNP density in transcribed regions of human genes. It is important for prediction of the regions that harbor functional polymorphisms. Results We used frequency-validated SNPs resulting from single-nucleotide substitutions. SNPs were subdivided into five functional categories: (i) 5' untranslated region (UTR) SNPs, (ii) 3' UTR SNPs, (iii) synonymous SNPs, (iv) SNPs producing conservative missense mutations, and (v) SNPs producing radical missense mutations. Each of these categories was further subdivided into nine mutational categories on the basis of the single-nucleotide substitution type. Thus, 45 functional/mutational categories were analyzed. The relative mutation rate in each mutational category was estimated on the basis of published data. The proportion of segregating sites (PSSs) for each functional/mutational category was estimated by dividing the observed number of SNPs by the number of potential sites in the genome for a given functional/mutational category. By analyzing each functional group separately, we found significant positive correlations between PSSs and relative mutation rates (Spearman's correlation coefficient, at least r = 0.96, df = 9, P < 0.001). We adjusted the PSSs for the mutation rate and found that the functional category had a significant effect on SNP density (F = 5.9, df = 4, P = 0.001), suggesting that selection affects SNP density in transcribed regions of the genome. We used analyses of variance and covariance to estimate the relative effects of selection (functional category) and mutability (relative mutation rate) on the PSSs and found that approximately 87% of variation in PSS was due to variation in the mutation rate and approximately 13% was due to selection, suggesting that the probability that a site located in a transcribed region of a gene is polymorphic mostly depends on the mutability of the site.


Background
Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation in humans, accounting for approximately 90% of genetic variations and occurring every 400 nucleotides on average [1,2]. SNPs are believed to be the major contributors to interindividual genetic variation in susceptibility to common human diseases [3][4][5][6]. They are distributed nonrandomly: in some regions, their density is high, and in others, it is low [2,[7][8][9]. Because the exonic regions are most likely to carry functional polymorphisms [10], it is important to identify factors that affect SNP density in exons.
The results of several studies suggest that purifying selection affects SNP density [11,12]. A comparison of the fixation rates for synonymous, nonsynonymous, and disease-associated mutations reveals that negative selection operates against nonsynonymous SNPs [13][14][15]. Different SNP types are expected to vary in the intensity of negative selection against them; for example, missense mutations are expected to be more deleterious than are silent mutations [16,17].
A few studies have demonstrated that the mutation rate also can affect SNP density. Schmegner et al. [18] compared SNP densities in G+C-poor and G+C-rich regions and found a higher SNP density in G+C-rich (i.e., more mutable) regions, suggesting that SNP density is correlated with mutation rate. Horvath et al. [19] found that G > A transitions that occur at CpG sites and produce silent mutations were the most common substitution, suggesting that both mutability and selection play roles in SNP density in the coding regions of the human genome. However, the relative effects of these two factors have never been evaluated on a genome-wide scale. The goal of this analysis was to estimate the relative effects of mutability and selection on SNP density. Here we estimate the proportion of segregating (polymorphic) sites in different regions. The effect of selection is estimated by evaluating the differences in the proportion of segregating sites between synonymous, nonsynonymous SNP, as well as 5' UTR and 3' UTR potential sites. The effect of mutation rate is estimated by looking at difference in the proportions of segregating sites for single nucleotide substitutions that differ by mutation rate (e.g. CpG vs. non-CpG sites).

SNP data retrieval
We searched the dbSNP database BUILD128 [20][21][22] to identify SNPs. The database was accessed February, 2008. To reduce discovery errors, only validated SNPs (i.e., those submitted by at least two independent researchers, with at least one submission validated by a noncomputational method) were used in this study.
SNPs were stratified into five functional categories: (i) 5' untranslated region (UTR) SNPs, (ii) 3' UTR SNPs, (iii) synonymous SNPs, (iv) SNPs producing conservative missense mutations, and (v) SNPs producing radical missense substitutions. Missense substitutions were further stratified as radical or conservative according to criteria suggested by Zhang [23]. Briefly, all amino acids were subdivided into three groups according to their charge: positive (R, H, K), negative (D, E) and uncharged (A, N, C,  Q, G, I, L, M, F, P, S, T, W, Y, V). The amino acids were further subdivided by volume and polarity: special (C), neutral and small (A, G, P, S, T), polar and relatively small (N, D, Q, E), polar and relatively large (R, H, K), non-polar and relatively small (I, L, M, V), and non-polar and relatively large (F, W, Y). We considered radical missense mutations to be those that change amino acid categories (e.g. R→L) and conservative missense mutations to be those that do not change amino acid category (e.g. L→V).

Accounting for overlapping genes
Approximately 3% of genes in the human genome overlap [24], suggesting that some SNPs can be categorized differently depending on the overlapping gene analyzed. In this study, we identified 96, or 0.4% of the total number of SNPs used in the analysis that were located in overlapping genes. Overlapping SNPs were counted separately for each overlapping gene. For example, rs3736360 is a radical missense mutation in the HSPG2 gene and it is a 3' UTR SNP in the context of the overlapping gene LDLRAD2. We have conducted analyses both including and excluding the overlapping SNPs. The results of the analyses were essentially the same, because of the very low proportion of excluded SNPs. Because multiple counting of the same SNPs violates the assumption of independent data points, we excluded SNPs located in overlapping genes from our final analyses.

Ancestral and derived alleles
We used data on ancestral and derived alleles from the dbSNP database [22,25] and the Haplotter database [26]. The requirements that SNPs should be frequency-validated and that their ancestral state should be known limited the number of SNPs that could be used in the analysis. The number of analyzed SNPs, subdivided by categories, were 1424 for 5' UTR SNPs, 7705 for 3' UTR SNPs, 6972 for synonymous SNPs, 2567 for conservative missense mutations, and 3528 for radical missense mutations.
Knowing the ancestral and derived alleles does not tell us on which DNA strand the substitution has occurred. For example, for a C/T SNP with ancestral allele C, the ancestral pair of nucleotides is C:G and the derived pair of nucleotides is T:A. Two substitutions are possible in this case, C > T or G > A. Because it is impossible to tell these two substitutions apart, we analyzed them jointly. We separately analyzed substitutions located in CpG sites. Therefore, the total number of mutational categories analyzed in this study was nine:

Estimating relative mutation rates
To estimate the relative mutation rates, we used mutation rates obtained by an analysis of processed pseudogenes [27,28] and direct estimates of mutation rates derived from an analysis of mutations causing Mendelian diseases in humans [29]. The results of these studies agree closely: the Pearson's correlation coefficient between the two estimates is 0.997 (Additional file 1). We assumed that the mean mutation rate for transversions in nonCpG sites (as the lowest mutation rate) equals one and computed the relative mutation rates for other types of SNPs. To estimate the transition rates in CpG sites, we used the ratio of CpG to nonCpG transitions determined by [28]. We used a similar method to estimate the relative mutation rates for CpG and nonCpG C > A, C > G, G > C, and G > T transversions ( Table 1). The original data and computations are presented in Additional file 1. Relative mutation rates were estimated by assuming that mutation rates for C > A and G > T transversions are equal one each; then, given two transversions, it is two for the pair of the transversions. Relative mutation rates for all other substitutions were computed based on the published estimates of the mutation rates (see Additional table 2 for details). All these numbers can be scaled by dividing by two.

Estimating number of potential sites for each functional/ mutational category
A potential site was defined as a single nucleotide substitution (SNS) that would produce a SNP of a specific functional/mutational category. In each nucleotide position, three SNSs are possible, which correspond with three potential sites; therefore, the total number of potential sites per codon is nine.
To estimate the number of potential sites in the coding, 3' UTR, and 5' UTR regions, we used data from the codon usage database [30]. We considered nine possible SNSs in each of 64 codons, yelding a total of 576 wild type-mutant codon pairs. The pairs were classified as radical nonsynonymous, conservative nonsynonymous, synonymous, or other (nonsense mutations and mutations that change a stop codon into an amino acid-encoding codon [elongating]). These types of codon pairs were not used in this analysis because corresponding SNPs are relatively rare and have a high discovery-error rate [31]. To estimate the numbers of potential sites in these regions, we used data on the nucleotide content of 5' UTR and 3' UTR regions [32,33].
The National Center for Biotechnology Information Entrez Gene database (accessed February, 2008) contains about 23,000 known genes. On the basis of the estimated mean size of the 5' UTR region, 300 bp, the total number of potential sites in 5' UTR region is approximately 20,700,000 (23,000 × 300 × 3). Similarly, on the basis of the estimated size of the 3' UTR region, 770 bp, the total number of potential sites in the human genome for the 3' UTR region is 53,130,000.
On the basis of the mean number of exons in a gene (8.8), the mean size of the exon (145 nucleotides) [34,35], and the number of known genes ~23,000 (based on the NCBI Entrez Gene accessed April 30, 2008), the total number of potential sites in the coding region is 88,044,000.

Estimating number of potential sites in CpGs
Because the mutation rate in CpG dinucleotides is higher than that in nonCpGs [36][37][38][39][40], they were analyzed separately. To estimate the number of potential sites located in CpGs, we first estimated the proportions of synonymous, radical, and conservative missense mutations resulting from SNSs in CpG-containing codons. To account for CpGs on codon boundaries, we flagged codons that started with G or ended with C. The proportion of Gs in the first codon position was 0.31, and the proportion of Cs among the third positions was 0.30. Assuming a random combination of codons, the probability that the first G nucleotide in a codon would be in a CpG site was 0.30, and the probability that the last C in a codon would be in a CpG site was 0.31. The product of these frequencies and the frequency of the corresponding codon gives the expected proportion of boundary-located CpGs. To account for possible violation of the assumption on the random combination of nucleotides on codon boundaries we compared the expected and the observed number of CpG on codon boundaries in 56 Seattle genes (see section 2.7 for the list of the genes). The expected proportion of the CpGs on codon boundaries was 0.072 and the observed 0.023 suggesting that there is the observed number of CpGs on codon boundaries is ~3.1 times lower compared to the expected one. We correspondingly reduced the expected number of the boundary-located CpGs by the factor 3.1.

Statistical analysis
The proportion of segregating sites (PSSs) was estimated as the ratio of frequency-validated SNPs of a specific functional/mutational category to the number of potential sites for that category in the human genome. An analysis of variance (ANOVA) was used to estimate the effect of functional categories. An analysis of covariance (ANCOVA) was used to estimate the effect of the mutation rate on the PSS, controlling for functional category. All statistical analyses were performed using Statistica (StatSoft, Inc., Tulsa, OK, USA). Table 3 shows the estimated absolute numbers of potential sites. The total number of potential sites for SNPs in coding regions was estimated to be 9.8 × 10 7 . That is less than the expected number (1.02 × 10 8 ), which was based on the size of the coding region of the human genome, 34 × 10 6 nucleotides [35]. This inconsistency occurs because SNPs that produce nonsense and protein-elongating mutations were excluded from the analysis. There was nearly 40-fold variation in the number of potential sites for the various functional/mutational categories. The smallest number of potential sites was observed in CpG dinucleotides in the 3' UTR region; this number was small simply because the frequency of CpG sites is low in this region. The maximal number of potential sites was for nonCpG A:T > C:G substitutions producing radical missense mutations. Table 4 shows the observed number of SNPs for the functional/mutational categories. The lowest number of SNPs was found for C:G > A:T substitution in CpGs producing conservative missense mutations, and the highest number was found for A:T > G:C substitutions in the 3' UTR region. The difference between the lowest and highest numbers of observed SNPs was 200-fold. Table 5 shows data on the proportions of segregating sites (PSSs) in 45 functional/mutational categories. The mean PSS was approximately 2.8 × 10 -6 . The highest PSS detected among the potential sites (4.8 × 10 -3 ) was for CpG C:G > T:A substitutions in the 3' UTR region; the lowest (1.4 × 10 -5 ) was for nonCpG A:T > T:A substitutions producing radical missense mutations. To evaluate how PSS depends on the relative mutation rate, we estimated the correlation between PSSs and relative mutation rates. An analysis conducted within functional categories yielded Pearson's correlation coefficients that varied from 0.95 (for radical missense mutations) to 0.99 (for SNPs located in the 3' UTR region). All correlation coefficients were significant (P < 0.0001). To remove the effect of the mutation rate, we divided PSSs by the corresponding mutation rate (Table 6). After this adjustment, the variation in PSSs was considerably lower. We used ANOVA to estimate the effect of functional categories on adjusted PSSs and found that the effect of the functional category was significant (F = 5.9, df = 4, P = 0.001).  Figure 1A shows the PSSs for nine mutational categories after the adjustment for the relative mutation rate. We did not observe significant differences in the adjusted PSSs between different mutational categories. The lowest average adjusted PSS was found for radical missense mutations (1.1 × 10 -5 ), and the highest -for 3' UTR region (3.4 × 10 -5 ) ( Figure 1B).

Results
To estimate the relative effects of mutation rate and selection on PSS, we compared a linear model that used both the mutation rate and functional type as predictors (model 1) with two others that used only the mutation rate (model 2) or functional type (model 3) as the predicting variable. Because PSS data were not normally distributed, we used log transformation to normalize data before conducting statistical tests (an analysis of nontransformed data produced similar results, data not shown). We used ANCOVA for model 1, with PSSs as the outcome variable, mutation rate as the continuous predictor, and functional type as the categorical predictor. The whole-model P value was <0.001, with R 2 = 0.53 (F = 8.9, df = 5, P < 0.001    number of potential sites for each category is shown in Table 7. Figure 2 shows the adjusted proportions of the segregating sites in the six functional categories of SNPs. Table 8 shows the adjusted PSSs for the SeatlleSNPs data. The adjusted PSSs for SNPs in the coding was significantly lower compared to the 3' UTR and intronic SNPs with Pvalues less than 0.01. No difference in the adjusted PSSs between SNPs in the coding regions and SNPs from the 5' UTR region were found probably because of the small sample size: the number of the 5' UTR SNPs was 11.

Discussion
The absolute number of potential sites in the coding region was estimated on the basis of the frequency of codons from which a given type of substitution originated and on the size of the coding region. We further subdivided the functional categories into nine mutation types on the basis of ancestral and derived alleles. This allowed us to estimate the effects of mutability (relative mutation rates) and selection (functional categories) on SNP density.
We found that the mutation rate was strongly correlated with PSS. We assumed that different functional categories are under varying selection pressures; for example, radical missense mutations are under stronger purifying selection than are conservative and synonymous mutations. After adjusting PSSs by mutation rates, we found that functional type had a significant effect on SNP density (P = 0.02), suggesting that selection also plays a role in SNP density. This result is consistent with those of other studies [11,14,17,19,42].
To estimate the relative effect of mutability and selection on the proportion of segregating sites we compared the model in which both mutability and selection were used as predictors with the models in which either mutability or functional category alone were the predicting variables and found that approximately 87% of PSS variation was due to variation in the mutation rate and 13% was explained by the variation in selection intensity (functional categories). The finding that mutation rate rather than selection plays a major role in SNP density may be important for the design of association studies. Case-control association studies are widely used to identify genetic variants that affect susceptibility to common human diseases [43][44][45][46]. An association study usually identifies the candidate gene (or region) and may require resequencing of the candidate region to identify functional SNPs. Resequencing of the whole candidate region might be too expensive; therefore, it is important to identify regions that are most likely to contain functional polymorphisms.
Our results show that the adjusted PSSs in the 5' UTRs are similar to, or even lower than, the PSSs for radical missense mutations.
Analysis of the SeattleSNPs data produced results that support the idea that 5' UTR regions in the human genome are subject to strong purifying selection. The adjusted PSS for 5' UTR SNPs was similar to that for radical missense mutations and was significantly lower compared to the intronic and 3' UTR SNPs. In contrast to analysis of the dbSNP database, we found no differences between synonymous, conservative and radical substitutions. This may be a result of a much smaller sample size and therefore insufficient power to detect the differences. It is also possible that the analysis based on the SeattleSNP data is biased. The genes for the SeattleSNPs project were selected based on the two criteria: i) the gene should be functionally important with strong evidence for its involvement in common human diseases; and ii) the gene should be relatively small to allow the direct sequencing. It is possible, therefore, that the Seattle genes are under a stronger pressure of purifying selection compared to an average gene in the human genome. This also could explain our finding that based on the dbSNP data, the ratio of the adjusted PSS for 3' UTR to the PSS for radical missense mutations was 3.4, but it was 18.5 based on the SeattleSNPs data.

Limitations of the analysis
Our analysis provides a bird eye view of the control of PSS in exonic regions of the human genome. Exonic regions are more likely to be targeted for SNP discovery compared to the intronic and intergenic regions [47,48], the reason why we excluded intronic and intergenic regions from our dbSNP-based analysis. We did not take into account the fact that intensity of purifying selection can be different for different genes or different regions in the gene. The reason why we did not analyze more detailed functional categories is that the density of SNPs in the human genome is currently not sufficient to conduct the gene-centered analysis.
Our analysis provides only rough estimates of the effect of mutability. Though we took into account major sources of PSSs adjusted for mutation rates (dbSNP data) Figure 1 PSSs adjusted for mutation rates (dbSNP data).A) PSSs (× 10 6 ) for the 45 functional/mutational categories. B) Mean PSSs among five functional categories.
variation in the mutation rates -transitional or transversional type of substitution and CpG sites [there was almost 50-fold difference in the relative mutations rate in our study (Table 1)], it still may be not sufficiently detailed because there is growing evidence that local differences in nucleotide composition can lead to up to 4fold differences in the mutation rate [49][50][51]. We limited our analysis to the 9 basic mutational categories because further categorization of sites by mutation rate would unlikely provide statistically robust estimates.
Ancestral allele information in our analysis was derived from the comparison of human and chimpanzee sequences. It is based on the assumption that the outgroup sequence is identical to the ancestral sequence.
Multiple mutations during species divergence may violate this assumption leading to misidentification of the ancestral allele [52]. About 2.6% SNPs in the coding regions of the human genome may be misidentified [53]. Misidentification is expected to be highest for sites with the highest mutation rate and because the derived alleles tend to be rare, the misidentification will mislabel rare allele as common leading to inflation of non-neutrality tests, especially those designed to detect positive selection. It is hard to estimate the effect of misidentification of ancestral alleles on the proportion of segregating sites.
PSSs adjusted for mutation rates for 45 functional/mutational categories (SeattleSNP data) Figure 2 PSSs adjusted for mutation rates for 45 functional/mutational categories (SeattleSNP data). We estimated the number of the potential sites in CpG dinucleotides by summarizing the CpGs in codons and CpGs located on codon boundaries. To our knowledge, there are only two studies that address the CpG frequencies at codon boundaries. Analysis of hemoglobin genes [54] demonstrated almost two-fold excess of CpGs at codon boundaries. On the contrary, a more recent study of 369 genes (author did not indicate criteria were used for the gene selection) from tomato demonstrated a more than two-fold deficit of CpGs compared to the expected based on the frequency of Cs at third and frequency of Gs on the first codon position [55]. Analysis of CpGs in Seattle genes demonstrated a 3-fold deficit of the CpG dinucleotides at codon boundaries. After the adjustment for the deficit of CpG at codon boundaries, the proportion of CpG dinucleotides was ~0.03 in our analysis, which is close to the proportion of CpGs -0.028 estimated based on the analysis of genes from the Santa Cruz human genome assembly (hg16) [56].

Conclusion
In conclusion, our results suggest that 5' UTRs are under purifying selection pressure as strong as that affecting radical missense mutations. This is consistent with the results reported by Osada et al. [57], who compared 169 human and macaque gene sequences and detected a much lower rate of substitution in 5' UTRs than the rate of synonymous substitutions. The authors concluded that SNPs in 5' UTR regions are subject to purifying selection. These findings and ours suggest that 5' UTRs are important sites to identify functional disease-associated polymorphisms.

Additional material
Additional file 2