In this work, we used the latest release of dbSNP, including the 1000 Genomes Project data, to perform a genome-wide scan of human variation within validated and predicted miR binding sites. Our hypothesis was that genetic variants at miR binding sites are functional and are important contributors to phenotypic variation and disease susceptibility. We took careful measures when assigning SNPs as creating or altering miR-mRNA interactions. We identified 5797 instances of a SNP falling within a conserved predicted MRESS, based on stringent filtering of the conservation and interaction scores predicted by Betel et al. Interestingly, 38% of these predicted disruptions were identified in 8mer target predictions. 8mer target sites have been shown to have the highest efficacy of target repression and are therefore considered higher-priority predictions than those with lesser complementarity. Overall, we estimate that 3% of predicted conserved MRESSs contain SNPs. Our analysis also identified 49407 instances of a SNP creating an MRESS. Given that no conservation constraint was applied when identifying these CNM SNPs, we must be cautious, as many of them are likely to be false positives. We also determined that 87 of the MRESS and CNM SNPs identified are in LD with SNPs identified in GWAS. We demonstrate that 2.2% of GWAS SNPs are, or are in LD with, MRESS or CNM SNPs. However, this may be a conservative estimate, given that 1) we limited our SNP selection based on conservation and other strict cutoffs, 2) the catalog of GWAS SNPs investigated is not all-encompassing, 3) these GWAS studies do not consider gene-by-environment interactions, and 4) LD estimates only cover SNPs up to the 1000 Genomes Project pilot study 1 data.
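To make the distinction between a SNP that disrupts, creates, or alters a seed match concrete, the following is a minimal Python sketch. It is not the prediction pipeline used in this study, which relied on the scores of Betel et al. It assumes the commonly used canonical site classes (6mer, 7mer-A1, 7mer-m8, 8mer), exact Watson-Crick matching of miR positions 2-8, and no conservation or interaction-score filtering. The function names, the example miR sequence, and the toy 3'UTR are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: classify a SNP by its effect on a canonical seed match.
# Assumptions (not from the study): exact seed pairing, canonical site classes,
# 0-based SNP position within a 3'UTR given 5'->3'.

RANK = {None: 0, "6mer": 1, "7mer-A1": 2, "7mer-m8": 3, "8mer": 4}
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(seq):
    """Upper-case the sequence and convert DNA letters to RNA."""
    return seq.upper().replace("T", "U")


def revcomp(rna):
    """Reverse complement of an RNA string."""
    return rna.translate(_COMPLEMENT)[::-1]


def window_type(window, mirna):
    """Classify an 8-nt mRNA window (5'->3') against a miR (5'->3')."""
    m = to_rna(mirna)
    seed_match = revcomp(m[1:7])  # pairs with miR positions 7..2
    m8_match = revcomp(m[7])      # pairs with miR position 8
    if window[1:7] != seed_match:
        return None
    has_m8 = window[0] == m8_match
    has_a1 = window[7] == "A"     # adenosine opposite miR position 1
    if has_m8 and has_a1:
        return "8mer"
    if has_m8:
        return "7mer-m8"
    if has_a1:
        return "7mer-A1"
    return "6mer"


def strongest_site(utr, pos, mirna):
    """Strongest site class among all 8-nt windows that contain `pos`."""
    best = None
    for start in range(max(0, pos - 7), min(pos, len(utr) - 8) + 1):
        site = window_type(utr[start:start + 8], mirna)
        if RANK[site] > RANK[best]:
            best = site
    return best


def classify_snp(utr, pos, alt, mirna):
    """Return (reference site, alternate site, effect) for one SNP."""
    ref_utr = to_rna(utr)
    alt_utr = ref_utr[:pos] + to_rna(alt) + ref_utr[pos + 1:]
    ref_site = strongest_site(ref_utr, pos, mirna)
    alt_site = strongest_site(alt_utr, pos, mirna)
    if ref_site and not alt_site:
        effect = "disrupts"
    elif alt_site and not ref_site:
        effect = "creates"
    elif ref_site != alt_site:
        effect = "alters"
    else:
        effect = "none"
    return ref_site, alt_site, effect


# Hypothetical example inputs.
EXAMPLE_MIR = "UGAGGUAGUAGGUUGUAUAGUU"
EXAMPLE_UTR = "AAGGCTACCTCAGGAA"
result = classify_snp(EXAMPLE_UTR, 7, "G", EXAMPLE_MIR)
```

In this toy example the reference allele sits inside an 8mer match and the alternate allele removes it, so the SNP is labeled as disrupting. A real analysis would additionally need strand handling, genomic coordinates, and the conservation and score filters described above.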
Recently, it has been demonstrated that SNPs previously identified in GWAS are in LD with SNPs found in enhancer motifs regulating gene expression. Furthermore, other studies have shown that SNPs falling in gene regulatory motifs, and not found on commercial SNP arrays, are in LD with top-scoring GWAS hits [12, 13]. In a similar fashion, we suggest that many of the SNPs found in this study to be in LD with GWAS SNPs may have functional significance. To explore this possibility further, we used several publicly available data sets and tools and showed that 39 of these 87 variants have evidence of co-expression of the target mRNA and the predicted miR. We also found that four SNPs from this list have supporting eQTL data demonstrating variation in transcript levels between alleles.
Our analyses identified four SNPs predicted to modulate allele-specific miR-mRNA interactions that are supported by co-expression and eQTL data. The rs907091 SNP falls in the IKZF3 transcript and is in LD (r2 > 0.90) with eight SNPs associated with increased risk for a variety of autoimmune diseases. IKZF3 is a transcription factor important for B-cell activation, and mice lacking this gene develop a lupus-like syndrome, suggesting a role for IKZF3 in autoimmunity. The rs907091 minor T allele is predicted to create a CNM for miR-326. There is evidence for expression of miR-326 and IKZF3 in human B-lymphocytes. Interestingly, miR-326 is important for T-cell differentiation and has been implicated in the pathogenesis of autoimmune multiple sclerosis. A study investigating transcript levels between the T and C alleles of rs907091 in a lymphoblastoid cell line (LCL) demonstrated significantly lower levels of IKZF3 in subjects carrying the T allele. These data suggest that carriers of the T allele may have reduced levels of IKZF3, in part through miR-326.
In addition, the minor allele of rs3810291 is predicted to create an MRE for miR-502-3p within the ZC3H4 transcript and is associated with BMI. ZC3H4 is a poorly characterized zinc finger protein. There is eQTL evidence supporting this prediction: in adipose tissue, minor allele carriers have reduced ZC3H4 expression compared to non-carriers. Both miR-502-3p and ZC3H4 are expressed in adipose tissue. The rs2245717 SNP, predicted to create an MRE for miR-155 in the SYS1 transcript, is in perfect LD with rs1008953, a SNP associated with psoriasis. The MRE-creating allele in SYS1 is also associated with lower SYS1 transcript levels in LCLs. While a role for SYS1 in immune function could not be found in the literature, it is known that miR-155 is involved in the immune response. The rs12449157 SNP is found in the poorly characterized glucose-fructose oxidoreductase domain containing 2 (GFOD2) transcript and shows association with HDL-C. Our analysis predicts that the minor allele of rs12449157 creates a CNM for miR-125a-3p and that it is associated with reduced GFOD2 levels. Interestingly, both RNAs are expressed in adipose tissue. Further, we identify rs12449157 as an FST outlier, suggesting that this SNP may be undergoing population-specific selection.
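For readers less familiar with the LD measure quoted above (r2 > 0.90, and "perfect LD" corresponding to r2 = 1), the following minimal Python sketch computes r2 between two biallelic SNPs from phased haplotypes. It assumes phase is already known and that alleles are coded as 0 and 1. It is an illustration of the statistic only, not the procedure used to obtain the LD estimates in this study, and the function name and toy haplotypes are hypothetical.

```python
import math

# Illustrative sketch only: r^2 between two biallelic SNPs from phased haplotypes.
# Each haplotype is a pair (allele at SNP A, allele at SNP B), coded 0 or 1.


def r_squared(haplotypes):
    """Return r^2 = D^2 / (pA(1-pA) pB(1-pB)), or NaN if a SNP is monomorphic."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return math.nan  # r^2 is undefined when either SNP does not vary
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / denom


# Toy data: the two minor alleles always travel together (perfect LD).
perfect = [(0, 0), (0, 0), (1, 1), (1, 1)]
# Toy data: alleles are combined at random (no LD).
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Toy data: mostly concordant haplotypes with two recombinants.
mostly_linked = [(1, 1)] * 9 + [(0, 0)] * 9 + [(1, 0), (0, 1)]

r2_perfect = r_squared(perfect)
r2_independent = r_squared(independent)
r2_mostly_linked = r_squared(mostly_linked)
```

With unphased genotype data, haplotype frequencies would first have to be estimated, which this sketch does not attempt.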
In addition to these four SNPs, we identified 39 others with data indicating co-expression of the predicted miR and its target mRNA, and these should be considered as candidates for functional studies. Of these 39 candidates, a SNP within the HOXB2 locus shows eQTL peaks identified in lymphoblastoid cell lines [28, 46]. While our analysis has generated many MRESS and CNM SNP predictions for which no miR expression data are available, it is likely that many of these SNPs will prove functionally relevant as more miR expression and eQTL data become accessible, particularly for different cell types and specific conditions. Recent data indicate that some miRs may act intercellularly, carried by HDL particles to recipient cells. Therefore, co-expression may not be essential for all predicted miR-mRNA interactions.
As new variants arise in a population and are exposed to different environmental conditions, those variants may be subject to forces of selection. Moreover, if these SNPs alter gene expression, they may modulate the individual's response to the environment and potentially the risk for particular disease states. Based on this, we hypothesized that SNPs predicted to modulate allele-specific miR-mRNA interactions would show a greater level of selection than SNPs not classified as MRESS SNPs. We show that, as a group, predicted MRESS and CNM SNPs have a significantly higher mean FST than SNPs that do not create or disrupt a predicted MRESS. We identify those MRESS and CNM SNPs showing the highest degree of population subdivision and suggest these SNPs, and the interactions they are predicted to modulate, as candidates for functional studies.
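The group comparison described above can be sketched in Python as follows. This is a minimal illustration and not the estimator used in this study: it assumes Wright's FST for a biallelic SNP, computed from subpopulation allele frequencies with equal population weights and no sample-size correction. All function names and allele frequencies below are hypothetical.

```python
import math

# Illustrative sketch only: Wright's FST = (HT - HS) / HT for one biallelic SNP,
# given the frequency of one allele in each subpopulation (equal weights assumed).


def fst(freqs):
    """Return FST for one SNP, or NaN if the SNP is monomorphic overall."""
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled population
    if h_t == 0:
        return math.nan
    # Mean expected heterozygosity within subpopulations.
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t


def mean_fst(snps):
    """Mean FST over a group of SNPs, skipping undefined values."""
    values = [v for v in (fst(f) for f in snps) if not math.isnan(v)]
    return sum(values) / len(values) if values else math.nan


# Hypothetical allele frequencies per SNP across populations.
site_snps = [[0.1, 0.6, 0.9], [0.2, 0.8]]      # stand-in for MRESS/CNM SNPs
other_snps = [[0.40, 0.50, 0.45], [0.3, 0.3]]  # stand-in for all other SNPs

site_mean = mean_fst(site_snps)
other_mean = mean_fst(other_snps)
```

A difference in group means alone does not establish significance; a formal test, for example a permutation of group labels, would still be required.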
We show that the density of MRESS SNPs in validated MREs (5.5 SNPs/kb) is lower than in surrounding regions, which supports prior work showing a higher degree of negative selection on MRESSs [3, 10]. Although the level of variation within this region is lower, we do show that variation across validated MRESSs is not rare (~5%). Supporting the notion that miR SNPs are high-priority candidates for functional consequence, we show that 22% of SNPs falling within validated MRESSs have reported associations with a disease phenotype or risk. Of note, our results differ somewhat from the MRESS SNPs reported in Saunders et al. This is most likely because we used a more current database of validated MRE targets and required functional evidence of an MRESS for inclusion.
Several web-based MRE SNP prediction databases are available to query a SNP for creation or disruption of an MRESS; however, these tools incorporate a relatively limited amount of functional annotation (GWAS, co-expression and eQTL data) for identifying the most promising MRESS SNPs [48–50]. SNPinfo is a web tool that calculates LD between query SNPs and GWAS SNPs and also predicts whether these SNPs abrogate or create potential MRESSs. Approximately 70% of the SNPs in Tables S1 and S2 are also identified by the SNPinfo web portal as being in LD with a GWAS SNP and as predicted MRESS SNPs. Importantly, our work differs from SNPinfo and other resources in that we present a more comprehensive summary of potential MRESS SNPs and are the first to investigate 1000 Genomes data for MRESS SNPs. Furthermore, we combine this MRESS SNP information with a variety of publicly available web tools and data sets (including co-expression and eQTL data) not currently incorporated in other resources, to determine which of these SNPs are most likely functional. Our data demonstrate the utility of using multiple publicly available data sets and resources to identify functional candidates.
In summary, we have surveyed the most current human SNP data and identified variants that provide functional hypotheses for observed GWAS associations. Our work also suggests that a considerable number of SNPs create or abrogate MREs in the human genome. Our results further suggest that MRE SNPs that modulate gene expression are likely to be under selective pressure. With relevance to human disease, we show that publicly available resources can be used to identify high-priority candidate SNPs for functional studies.