Disease-associated variants in different categories of disease located in distinct regulatory elements

Background The invention of high throughput sequencing technologies has led to the discoveries of hundreds of thousands of genetic variants associated with thousands of human diseases. Many of these genetic variants are located outside the protein coding regions, and as such, it is challenging to interpret the function of these genetic variants by traditional genetic approaches. Recent genome-wide functional genomics studies, such as FANTOM5 and ENCODE have uncovered a large number of regulatory elements across hundreds of different tissues or cell lines in the human genome. These findings provide an opportunity to study the interaction between regulatory elements and disease-associated genetic variants. Identifying these diseased-related regulatory elements will shed light on understanding the mechanisms of how these variants regulate gene expression and ultimately result in disease formation and progression. Results In this study, we curated and categorized 27,558 Mendelian disease variants, 20,964 complex disease variants, 5,809 cancer predisposing germline variants, and 43,364 recurrent cancer somatic mutations. Compared against nine different types of regulatory regions from FANTOM5 and ENCODE projects, we found that different types of disease variants show distinctive propensity for particular regulatory elements. Mendelian disease variants and recurrent cancer somatic mutations are 22-fold and 10- fold significantly enriched in promoter regions respectively (q<0.001), compared with allele-frequency-matched genomic background. Separate from these two categories, cancer predisposing germline variants are 27-fold enriched in histone modification regions (q<0.001), 10-fold enriched in chromatin physical interaction regions (q<0.001), and 6-fold enriched in transcription promoters (q<0.001). Furthermore, Mendelian disease variants and recurrent cancer somatic mutations share very similar distribution across types of functional effects. We further found that regulatory regions are located within over 50% coding exon regions. Transcription promoters, methylation regions, and transcription insulators have the highest density of disease variants, with 472, 239, and 72 disease variants per one million base pairs, respectively. Conclusions Disease-associated variants in different disease categories are preferentially located in particular regulatory elements. These results will be useful for an overall understanding about the differences among the pathogenic mechanisms of various disease-associated variants.


Regulatory regions from FANTOM and ENCODE
This study focuses on the pathogenic mechanism study of disease-associated variants in different categories of disease at regulatory level. All datum used in this study are based on Human GRCH37/hg19. All variants and regulatory regions were extracted from multiple data sources before May 2014. We collected various regulatory regions from FANTOM5 and ENCODE project. Transcription promoter data of FANTOM5 can be downloaded from http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/ (filename is hg19.cage_peak_ann.txt.gz); Transcription enhancer data of FANTOM5 can be downloaded from http://fantom.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/ (filename is hg19_enhancers.bed.gz) . We downloaded regulatory regions identified by the ENCODE project through Table Browser of UCSC ( http://genome.ucsc.edu ) and the detailed information are in the following. Regulatory regions identified by DNase-seq was download from "DNase Clusters" track (table name is wgEncodeRegDnaseClusteredV2). DNA Binding sites of proteins by ChIP-seq were downloaded from "Txn Factor ChIP" track (table name is wgEncodeRegTfbsClusteredV3). Insulator regions are the binding sites of CTCF, which were also extracted from "Txn Factor ChIP" track. Regulatory regions identified by FAIRE-seq were downloaded from "UNC FAIRE" track in 36 cell lines. Chromatin physical interaction regions by CHIA-PET were downloaded from "GIS ChIA-PET" track in five different human cancer cell lines. Histone modification regions were downloaded from "Broad Histone" track. CpG methylation regions by Methyl 450K Bead Arrays were downloaded from "HAIB Methyl450" with score>200.

Disease-associated Variants
We collected four types of disease-associated variants including Mendelian disease variants, complex disease variants, cancer predisposing germline mutations, and recurrent cancer somatic mutations. We counted the number of associated diseases/phenotypes/traits and the genes where disease variants are located within. All transcripts of the genes are extracted from UCSC "GENCODE Genes V19" track (filename is wgEncodeGencodeBasicV19). We extended each transcript with the flanking 500bp. Mendelian disease variants were gotten from OMIM and ClinVar. We downloaded and parsed a compressed file of OMIM site from http://www.omim.org/downloads (filename is omim.txt.Z) and extracted all OMIM disease variants. All pathogenic variants of ClinVar can be obtained from ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/ (filename is variant_summary.txt.gz). We collected 27,558 Mendelian disease variants finally. Complex disease variants were obtained from NHGRI GWAS catalog (http://www.genome.gov/admin/gwascatalog.txt ) and VarDi. VarDi is a proprietary database of disease-associated variants built through a combination of Hadoop-based text mining tools and manual curation. We collected 20,964 complex disease variants under p value < 10E-8. Considering the fact that most complex disease variants are mark SNPs, and not necessarily the disease causal variants, we filtered complex disease variants using stricter criterion. We accept such hypothesis that if a complex disease SNP is replicated in different ethnicities, then this SNP is more likely to be causal than markers. Those complex disease variants replicated in at least two different ethnicities, were adopted as complex disease causal variants. At last 5,549 complex disease causal variants were collected. Cancer germline mutations were downloaded from HGMD Professional database (http://www.biobase-international.com/product/hgmd ). We collected 5,809 cancer predisposing germline variants. Recurrent cancer somatic mutations were obtained from COSMIC ( https://cancer.sanger.ac.uk/files/cosmic/ , filename is CosmicCompleteexport_v68_040214). Only mutations which are with "Confirmed somatic variant" status, and recurrent in at least two different samples, will be adopted. At last 43,364 recurrent cancer somatic mutations were collected. All SNPs of dbSNP137 were downloaded from "ALL SNPs (137)

Regulatory regions can locate within coding and noncoding regions
We extracted upstream and downstream 2000bp of genes, coding exons, 5' and 3' UTR, and introns, and intergenic regions through "GENCODE Genes V19" track (filename is wgEncodeGencodeBasicV19). Then we counted the length of human genome regions overlapping with nine types of regulatory regions using BEDTools utilities Intersect and merge (https://code.google.com/p/bedtools/ ).

Functional effect annotation of disease-associated variants
There are 34 consequences of genetic variants in Sequences Ontology (http://www.sequenceontology.org/ ), and Ensembl rank these consequences in the order of severity. This ordering is necessarily subjective. Everyone may always extract the full set of consequences for each allele and make their own severity judgment. We applied Ensembl Variant Effect Predictor web interface (http://grch37.ensembl.org/Homo_sapiens/Tools/VEP ) to annotate the four types of disease-associated variants based on Ensembl transcripts.

Predicting scores for disease-associated variants
Genome Wide Annotation of Variants (GWAVA) is from the Wellcome Trust Sanger Institute and the European Bioinformatics Institute. GWAVA aims to predict functionality of noncoding variants based on a wide range of annotations of regulatory elements, along with genome-wide properties such as evolutionary conservation and GC-content. Corresponding with three different control groups adopted in GWAVA, there are three kinds GWAVA scores: Region score, TSS score and Unmatched score, which are all in the range of 0-1. A high GWAVA score means more active functionality with respect to a low GWAVA score. . The GWAVA score can be computed in this website https://www.sanger.ac.uk/sanger/StatGen_Gwava. Mutation Assessor predicts the functional impact of coding variants on proteins based on sequence evolutionary conservation. Usually functional coding variants have higher Mutation Assessor score than non-functional coding variants. Mutation Assessor score threshold of 1.9 was used to discriminate disease-associated variants with medium or high functionality; if the Mutation Assessor score of a variants is smaller than 1.9, then this variant can be considered to be low functional or neutral. The Mutation Assessor score can be computed in this website http://mutationassessor.org/.
Combined Annotation-Dependent Depletion (CADD) is from the University of Washington and HudsonAlpha Institute for Biotechnology. CADD framework incorporates a support vector machine in its workflow. CADD integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations in order to score the deleteriousness of coding or noncoding variants in human genome. A high CADD score typically suggests more severe deleteriousness compared to a low CADD score. There are two distinct CADD score forms, namely "Raw" and "Scaled". "Raw" CADD scores come straight from the SVM and have relative meaning, with higher values indicating that a variant is more likely to have deleterious effects. "Scaled" score is as transformation of "Raw" score by ranking all variants. For example, reference genome single nucleotide variants at the top 10% of CADD scores are assigned to CADD-10. In this study, we adopted CADD "Raw" score. Pre-computed CADD score of all possible SNVs of hg19 can be downloaded from http://cadd.gs.washington.edu/download (filename is whole_genome_SNVs.tsv.gz).
GERP produces position-specific estimates of evolutionary constraint using maximum likelihood evolutionary rate estimation. Negative GERP scores indicate that a site is probably evolving neutrally. Positive scores indicate that a site may be under evolutionary constraint. Positive scores scale with the level of constraint, such that the greater the score, the greater the level of evolutionary constraint. GERP code can be downloaded from http://mendel.stanford.edu/SidowLab/downloads/gerp/.

Enrichment analysis of disease-associated variants within regulatory regions
The enrichment of disease-associated variants can be measured using odds ratio (see Methods section).
Two different strategies about control group generation were adopted for enrichment analysis of disease-associated variants within regulatory regions.
The first strategy is to make the human genetic variant background as control group for each type of disease-associated variants. We subtracted disease-associated variants from all SNPs of dbSNP database and made remain SNPs as the genome variant background control group.
The second strategy is to generate 1000 equal size specific control groups for each type of disease-associated variants. Based on the allele frequency distribution of each type of disease-associated variants, we generated 1000 equal size control groups for disease variants.