GACT: a Genome build and Allele definition Conversion Tool for SNP imputation and meta-analysis in genetic association studies
© Sulovari and Li; licensee BioMed Central Ltd. 2014
Received: 14 April 2014
Accepted: 10 July 2014
Published: 19 July 2014
Genome-wide association studies (GWAS) have successfully identified genes associated with complex human diseases. Although much of the heritability remains unexplained, combining single nucleotide polymorphism (SNP) genotypes from multiple studies for meta-analysis will increase the statistical power to identify new disease-associated variants. Meta-analysis requires the same allele definition (nomenclature) and genome build across the individual studies. Imputation, which is commonly used prior to meta-analysis, requires the same consistency. However, the genotypes from various GWAS are generated using different genotyping platforms, arrays or SNP-calling approaches, resulting in the use of different genome builds and allele definitions. Incorrectly assuming identical allele definitions among combined GWAS leads to a large portion of discarded genotypes or to incorrect association findings. There is no published tool that predicts and converts among all major allele definitions.
In this study, we have developed a tool, GACT, which stands for Genome build and Allele definition Conversion Tool, that predicts and inter-converts between any of the common SNP allele definitions and between the major genome builds. In addition, we assessed several factors that may affect imputation quality, and our results indicated that inclusion of singletons in the reference had detrimental effects while ambiguous SNPs had no measurable effect. Unexpectedly, exclusion of genotypes with missing rate > 0.001 (40% of study SNPs) showed no significant decrease of imputation quality (even significantly higher when compared to the imputation with singletons in the reference), especially for rare SNPs.
GACT is a new, powerful, and user-friendly tool with both command-line and interactive online versions that can accurately predict and convert between any of the common allele definitions and between genome builds for genome-wide meta-analysis and imputation of genotypes from SNP arrays or deep sequencing, particularly for data from dbGaP and other public databases.
Genome-wide association studies (GWASs) and next-generation deep sequencing studies have successfully identified genes associated with human diseases and traits, yet the identified variants cumulatively explain only a small percentage of the estimated inherited risk of developing these diseases. Combining samples from multiple GWASs or deep sequencing datasets of the same phenotype for large-scale meta-analyses will increase the statistical power to identify new or rare associated variants, particularly for complex traits where the disease variants may have moderate effect sizes, which may account for some of the missing heritability. However, the raw single nucleotide polymorphism (SNP) genotype datasets might have been generated using different genotyping or sequencing platforms, array types or SNP calling procedures, resulting in the use of different genome builds or allele definitions (nomenclatures). Thus, combining multiple GWASs or deep sequencing studies (e.g. the 1000 Genomes Project) requires conversions of inconsistent allele definitions and genome builds between the datasets, as demonstrated in a large number of NHGRI (http://www.genome.gov) GWAS meta-analyses. Likewise, imputation, one of the commonly used approaches to predict the genotypes for un-assayed loci, requires the same consistency between the study and reference datasets. For example, imputation has been applied to almost half of the GWASs in the NHGRI GWAS Catalog.
Four common nomenclatures exist for reporting biallelic SNPs: probe/target or A/B, Plus (+)/Minus (−), TOP/BOT, and Forward/Reverse. The genotype data from different studies are often not consistent or matched for genome builds or allele definitions, and thus genotype and build conversions are required if an investigator combines multiple GWASs or imputes a reference dataset (e.g., the 1000 Genomes data) into a study GWAS. For example, different genome builds, primarily build 36 (b36) and b37, and various allele definitions were adopted in the 15,541 NHGRI GWAS Catalog datasets. The solutions that disregard mismatched SNPs, i.e., direct allele-flipping or removal of mismatches, will lead to undesirable consequences. For example, an allele flip (i.e., from A1 to A2 and vice versa) ignores the allele frequencies of the study population and may make the downstream analyses of the flipped SNPs irrelevant to the sample population, and genotype removal may significantly lower the SNP density of relevant regions. Thus, the build of the human genome that was used to call the study SNPs (or true-genotypes) and the allele definition have to be determined and converted where necessary prior to imputation and meta-analysis.
Genotype mismatches between the GWAS and 1000 Genomes datasets
FLIP & CSF
62,793 (78.3) (81.7)†
74,256 (92.6) (96.7)†
Imputation is often desirable before combining multiple genotype datasets from different resources for meta-analysis. Our imputation analysis revealed higher quality for imputed SNPs when GACT was used, compared to when mismatched SNPs were excluded (Additional file 1: Table S1). While GACT aims to convert between allele definitions and maximize the number of correctly matched alleles to a reference, there are many other factors that can affect imputation quality. Hence, we measured the effects of selected variant types (such as singletons (i.e. SNPs with only one copy of the minor allele among all samples), monomorphic SNPs, and ambiguous SNPs) and GWAS quality control procedures (such as genotype missing rate) on imputation quality. We found that the exclusion of singletons and monomorphic SNPs from the reference improved imputation quality of rare SNPs with minor allele frequency (MAF) < 0.005 (the mean quality score increased from 0.52 to 0.57, which was the highest increase across all MAF ranges) but had no effect on SNPs with MAF > 0.005 (the mean score remained 0.91). The ambiguous SNPs had no measurable effect on imputation, while imputation quality decreased as the genotype missing thresholds became more conservative. Surprisingly, for imputed common SNPs (MAF > 0.1), the decrease in imputation quality started to emerge only under very stringent genotype missing thresholds (0.004-0.001, instead of the commonly used 0.05). By comparison, the imputation of relatively rare SNPs (MAF < 0.1) was even more robust: the decrease was not significant until the missing threshold reached the more stringent value of 0.0005 (corresponding to removal of 61.4% of the genotypes). Moreover, the physical locations of the SNPs that were excluded under these missing thresholds were distributed uniformly across the chromosomes. Our analyses provide novel insight into imputation insensitivity to genotype missingness, particularly for rare SNPs.
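The variant types named above can be made concrete with a short sketch. It is an illustration under stated assumptions, not part of GACT: the function name `classify_snp`, the two-character genotype strings, and the use of "00" for a missing call are all choices made here for the example.

```python
from collections import Counter


def classify_snp(genotypes, missing_code="00"):
    """Classify one biallelic SNP and return (category, maf).

    `genotypes` holds one 2-character string per sample, e.g. "AG".
    The "00" missing code is an assumption made for this sketch.
    """
    called = [g for g in genotypes if g != missing_code]
    allele_counts = Counter("".join(called))
    total = sum(allele_counts.values())
    if total == 0:
        return "uncalled", 0.0
    if len(allele_counts) == 1:
        # Only one allele observed among all called samples.
        return "monomorphic", 0.0
    minor_count = min(allele_counts.values())
    maf = minor_count / total
    # A singleton carries exactly one copy of the minor allele.
    category = "singleton" if minor_count == 1 else "polymorphic"
    return category, maf


# Illustrative genotypes for three samples plus one missing call.
category, maf = classify_snp(["AA", "AA", "AG", "00"])
print(category, round(maf, 3))
```

Applying such a classifier to the reference panel would give the list of singleton and monomorphic SNPs to exclude before imputation.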
Subjects and genotype data
A cohort of 3,096 subjects of Ashkenazi Jewish ethnicity were genotyped using the Illumina Human Omni 1 Quad arrays. The GWAS genotype data were obtained through the NIH dbGaP [phs000448].
GACT was designed for matching allele definitions between the study GWAS and reference data before imputation, or for merging multiple genome-wide genotype datasets before meta-analysis, where the genotypes were generated from SNP arrays or deep-sequencing platforms (Figure 1). Figure 2 shows the study design and GACT pipeline, which can be directly connected to other commonly used methods, including genotype phasing of GWAS (or deep sequencing) data, imputation, data merging, and meta-analysis (Figure 1). Running GACT from the command line requires PLINK, GenGen, and the genotyping array annotation files to be in the same directory; these can be downloaded from our website. The command line follows this syntax (example): ./gact b36 b37 ab plus o1qd map_file_name. The arguments represent the current genome build (b36), desired genome build (b37), current allele definition (ab), desired allele definition (plus), annotation file of the SNP genotyping array (o1qd = Human Omni 1 Quad Duo), and input map file name, respectively. The input file should be in the same format as the PLINK binary map file, containing the chromosome location and reference alleles of each SNP. The web version accesses the same command-line options on the server end after the user uploads the input file, a PLINK format map file, and chooses the preferred options on the web interface. Moreover, the web tool allows the user to view a log of every step in the conversion process in real time. The command-line version has no pre-defined limit on the input file size, while the web tool has a limit of 40 megabytes (MB), which is sufficient for most SNP arrays (e.g., the entire map file of the Illumina Human Omni 1 Quad array is < 30 MB).
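For scripted use, the six positional arguments can be assembled programmatically. The following is a minimal sketch that only builds the argument list in the order given above; the helper names `build_gact_command` and `fits_web_limit` are hypothetical, and interpreting the 40 MB web limit as 40 × 1024 × 1024 bytes is an assumption.

```python
# Assumption: the 40 MB web limit is read here as binary megabytes.
WEB_UPLOAD_LIMIT_BYTES = 40 * 1024 * 1024


def build_gact_command(current_build, target_build, current_def, target_def,
                       array_code, map_file, executable="./gact"):
    """Return the argument list in the positional order described in the text."""
    args = [current_build, target_build, current_def, target_def,
            array_code, map_file]
    if any(not arg for arg in args):
        raise ValueError("all six positional arguments are required")
    return [executable] + args


def fits_web_limit(size_bytes):
    """True if a map file of this size could be uploaded to the web tool."""
    return size_bytes <= WEB_UPLOAD_LIMIT_BYTES


# Reproduces the example from the text: ./gact b36 b37 ab plus o1qd map_file_name
cmd = build_gact_command("b36", "b37", "ab", "plus", "o1qd", "map_file_name")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from the directory that holds GACT, PLINK, GenGen, and the annotation files.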
Imputation quality assessment
The GWAS genotype data of the 3,096 Ashkenazi Jewish samples were in the b36 genome build and A/B allele definition. GACT was used to convert the genome build and allele definition to b37 and Plus, respectively, to keep them consistent with the 1000 Genomes panel. The genotype match rates between the study and reference datasets and the imputation quality scores were used as primary measurements to assess the conversion quality of GACT. After converting the genome builds and allele definitions in the map files using GACT, we recoded all the genotypes of the GWAS data using PLINK. The genotype phasing and imputation were carried out using SHAPEIT and Impute2, respectively. The latest phased 1000 Genomes genotypes of the European population (Phase 1 integrated release version 3) were used as the imputation reference. Imputation quality was assessed using the Impute2 information scores of the reference SNPs. The scores (equivalent to the r-squared metric reported by MaCH and BEAGLE) vary between 0 and 1, where values closer to 1 represent imputation with high certainty. The mean and standard deviation of these scores were used as measures of overall imputation quality of SNPs at specific MAF ranges. To compare the imputation quality between different MAFs, we used the Welch two-sample t-test. All the statistical analyses and graphs were generated using the latest version of R (version 3.0.2), and the imputations were conducted using the multi-core cluster at the Vermont Advanced Computing Center.
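The Welch two-sample t-test does not assume equal variances in the two groups. As a sketch of the statistic only (the analyses themselves were run in R), the test statistic and the Welch-Satterthwaite degrees of freedom can be computed as follows; the function names and the example scores are illustrative, not data from this study.

```python
import math
from statistics import mean, stdev, variance


def summarize(scores):
    """Mean and sample standard deviation of a set of information scores."""
    return mean(scores), stdev(scores)


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical information scores for two MAF ranges.
rare = [0.48, 0.55, 0.61, 0.50, 0.57]
common = [0.90, 0.93, 0.89, 0.92, 0.91]
print(summarize(rare), summarize(common))
print(welch_t(rare, common))
```

A p-value would then be read from the t distribution with `df` degrees of freedom, which is what `t.test` in R reports by default.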
GACT prediction of genome build and allele definition
We measured the frequencies of all 16 possible genotype patterns under three allele definitions: Plus (+)/Minus (−), Forward/Reverse, and TOP/BOT (the A/B or probe/target definition is coded differently). The distributions (Figure 3) were clearly distinguishable and were thus used to predict all four designations. We observed the enrichment of two patterns (A/G and G/A) for TOP/BOT, two patterns (A/G and C/T) for Forward/Reverse, and four patterns (A/G, G/A, C/T and T/C) for Plus/Minus. The prediction model matches the relative ratios of the input genotypes to the expected ratios in each definition by measuring the proportions of CT, TC and GA alleles present. These three values acted as the input neurons of a multilayer perceptron that classified the input map file into one of the four SNP definitions (Additional file 2: Figure S1). Thus, for users who have no knowledge of the allele definitions and/or genome build, GACT will first notify the user of the predicted definition and build of the input SNPs prior to the actual conversion. The prediction module is particularly useful when the datasets are obtained from public genotype repositories, such as dbGaP.
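The feature extraction step described above can be sketched as follows. This is an illustration under assumptions rather than the GACT source: it reads the two allele columns of a PLINK .bim-style map file (six whitespace-separated columns, alleles in the fifth and sixth), tabulates the 16 ordered allele patterns, and returns the three proportions used as classifier inputs. The function names are hypothetical, and the trained multilayer perceptron itself is not reproduced.

```python
from collections import Counter
from itertools import product

BASES = ("A", "C", "G", "T")
# The 16 ordered allele patterns, e.g. "A/G" and "G/A" are counted separately.
ALL_PATTERNS = [a + "/" + b for a, b in product(BASES, repeat=2)]


def allele_pairs_from_bim(lines):
    """Yield (allele1, allele2) from .bim-style lines, skipping non-SNP rows."""
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        a1, a2 = fields[4].upper(), fields[5].upper()
        if a1 in BASES and a2 in BASES:
            yield a1, a2


def pattern_frequencies(pairs):
    """Relative frequency of each of the 16 allele patterns."""
    counts = Counter(a1 + "/" + a2 for a1, a2 in pairs)
    total = sum(counts.values())
    return {p: (counts[p] / total if total else 0.0) for p in ALL_PATTERNS}


def prediction_features(freqs):
    """The three proportions (CT, TC, GA) named in the text."""
    return freqs["C/T"], freqs["T/C"], freqs["G/A"]


# Illustrative map lines: chromosome, SNP id, cM, bp position, allele 1, allele 2.
example = [
    "1 snp1 0 1000 A G",
    "1 snp2 0 2000 C T",
    "1 snp3 0 3000 C T",
    "1 snp4 0 4000 G A",
]
freqs = pattern_frequencies(allele_pairs_from_bim(example))
print(prediction_features(freqs))
```

In the described model, these three values would be passed to the trained classifier, which assigns the map file to one of the four allele definitions.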
GACT conversion of genome build and allele definition
GACT has been demonstrated to identify and clean all the convertible allele mismatches. Table 1 shows the numbers of genotypes that would have to be discarded if we incorrectly assumed, versus correctly converted, the allele definitions between our GWAS data and the 1000 Genomes data (Plus/Minus) during imputation. For instance, if we incorrectly converted our GWAS genotypes to the “Forward/Reverse” or “TOP/BOT” definition and imputed with the 1000 Genomes data, we had to discard 21.7% and 51.5% of the genotypes, respectively, due to mismatch. By comparison, if we correctly converted our genotypes to “Plus/Minus” by using GACT, only 7% needed to be discarded across all the chromosomes (Table 1). Moreover, since 3,344 SNPs existed in our data but not in the reference, when only the SNPs that existed in both datasets were used in the calculation, the discarded genotypes accounted for only 3.3%, which was significantly lower than commonly observed mismatch rates in the literature. The reasons for the 3.3% mismatches are described in the discussion.
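To illustrate what a mismatch against the reference means at the level of a single SNP, the sketch below compares a study allele pair with a reference allele pair using generic strand-complement logic. It is not the annotation-based conversion that GACT performs, and the category labels and function names are assumptions made for this example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_ambiguous(a1, a2):
    """A/T and C/G SNPs look identical on both strands."""
    return COMPLEMENT[a1] == a2


def compare_alleles(study, reference):
    """Classify a study (a1, a2) pair relative to a reference (a1, a2) pair."""
    s, r = tuple(study), tuple(reference)
    if is_ambiguous(*s):
        # Strand cannot be inferred from the alleles alone.
        return "ambiguous"
    if s == r:
        return "match"
    if s == r[::-1]:
        return "swapped"
    comp = tuple(COMPLEMENT[a] for a in s)
    if comp == r:
        return "strand_flip"
    if comp == r[::-1]:
        return "strand_flip_swapped"
    return "mismatch"


def match_rate(pairs):
    """Fraction of (study, reference) pairs classified as an exact match."""
    results = [compare_alleles(s, r) for s, r in pairs]
    return results.count("match") / len(results) if results else 0.0


print(compare_alleles(("A", "G"), ("T", "C")))
```

Ambiguous SNPs are reported separately here because allele letters alone cannot distinguish a strand difference from a true match for A/T and C/G sites.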
Quality scores of the imputed (I) and study (S) SNPs for each MAF category
Both genome builds and allele definitions should be well matched before combining one genotype dataset with another or imputing one against another. In this study, we have developed a new, powerful, and user-friendly tool that can predict and convert the genome builds and allele definitions simultaneously between multiple GWAS or deep sequencing genotype datasets for meta-analyses, imputations or both. Our GWAS data demonstrated the accuracy of predictions and performance of conversions. Our further imputations showed that the inclusion of singletons in the reference panel significantly decreased imputation quality. However, the exclusion of SNPs with missing rate > 0.001 led to imputation quality comparable to that obtained with the commonly used threshold of 0.05 for rare SNPs (Table 2 and Figure 5 and Additional file 6: Figure S5), which implied that approximately 600,000 well-typed SNPs were likely to be sufficient for high quality genome-wide imputation of rare SNPs in our GWAS data.
With GACT, as few as 3.3% of genotypes were discarded (Table 1), which was significantly lower than commonly observed mismatch rates. It should be noted that genotype mismatches are always observed in real datasets, particularly when one dataset is from a microarray-based study and the other is from a deep-sequencing-based study, as is the case in Table 1. This is likely to be attributed to various factors, such as different experimental protocols, genotyping error rates, and disease statuses of research subjects. Interestingly, the genotype mismatch rates between different platforms are not significantly higher than those within the same platform. For instance, a recent study showed a 0.6-1.6% genotype mismatch rate between two deep-sequencing studies (Li et al.’s data and the 1000 Genomes); by comparison, the 3.3% mismatch rate between two different platforms/samples is reasonably low. All these results demonstrate that allele definitions must be correctly converted prior to imputation or meta-analysis.
Comparisons of tools for genome build and allele definition conversions. Features compared: allele definition prediction; uninformed strand/allele flip; informed allele conversion; automatic allele conversion; genome build prediction; genome build conversion; interactive web interface.
Imputation after GACT Conversion
Imputation before combining GWAS datasets is desirable because of 1) increased power for identifying disease-associated variants, e.g. by more than 10% as suggested previously; 2) higher SNP coverage for fine-mapping disease genes; 3) additional rare SNPs and applicability to other variants such as copy number variations or classical leukocyte antigen alleles; and 4) cost- and time-efficiency compared with molecular genotyping or sequencing experiments. Various studies have been carried out to evaluate or identify the factors that might affect imputation quality [12, 17], including ambiguous, monomorphic, and singleton SNPs. Phasing of singletons is known to be challenging, and imputation becomes faster with no burden in the downstream association tests when singletons are removed from the reference. We additionally found that the removal of either ambiguous or monomorphic SNPs alone from the study data prior to phasing and imputation had no detectable effect on imputation. However, the exclusion of monomorphic and singleton SNPs from the reference increased imputation quality, which is in accordance with previous studies [12, 17]. We further found that SNPs with very low MAF (0.001-0.005) showed the most significant increase in imputation quality compared with the other MAF ranges (Table 2). This finding is particularly important for rare variants, which are of increasing interest in genetic studies of complex diseases and traits.
Balancing genotype quality against genome coverage is important for imputation. Genotype missing thresholds of 0.05 to 0.02 are generally recommended for quality control in GWAS. However, no published studies have explicitly evaluated the effects of more conservative missing thresholds (than the commonly used values) on imputation quality. Our assessments might provide a new perspective on the selection of genotype missing thresholds in imputation. Based on our GWAS data, approximately 600,000 well-typed SNPs are likely to be sufficient for high quality genome-wide imputation of rare SNPs (high quality assayed SNPs may compensate for low true-genotype density). However, further analyses are warranted to replicate the findings in additional arrays. It should be noted that only the data on chromosome 1 were used for most of the analyses, based on our observation of similar genotype missing patterns or comparable results across all the chromosomes (Additional file 6: Figure S5 and Additional file 7: Figure S6).
Ignoring inconsistent allele definitions and genome builds, or converting them incorrectly, leads to incorrect genetic association “findings”. In this study, we developed a comprehensive tool, GACT, with both powerful command-line and user-friendly web interface versions, to predict and convert both genome builds and allele definitions between multiple GWAS (or deep sequencing) genotype datasets, which is required for all imputations and genome-wide meta-analyses. GACT will facilitate and ease the broad use of GWAS data from dbGaP and other publicly available genotype repositories for large-scale secondary analyses and multi-laboratory collaborations in genetic association studies of human diseases.
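The effect of a missing threshold on SNP density can be examined with a small filter such as the one below. It is a sketch under assumptions: the function names, the "00" missing code, and the SNP identifiers are illustrative, and SNPs are kept when their missing rate is at or below the threshold, mirroring the exclusion of SNPs with missing rate above it.

```python
def missing_rate(genotypes, missing_code="00"):
    """Fraction of samples without a genotype call for one SNP."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    missing = sum(1 for g in genotypes if g == missing_code)
    return missing / len(genotypes)


def filter_by_missing_rate(snp_genotypes, threshold):
    """Keep SNPs with missing rate <= threshold; also report fraction removed."""
    kept = {
        snp: calls
        for snp, calls in snp_genotypes.items()
        if missing_rate(calls) <= threshold
    }
    removed_fraction = (
        1 - len(kept) / len(snp_genotypes) if snp_genotypes else 0.0
    )
    return kept, removed_fraction


# Illustrative data: two SNPs typed in four samples.
data = {
    "snp_a": ["AA", "AG", "GG", "AA"],
    "snp_b": ["CC", "00", "CT", "CC"],
}
for threshold in (0.5, 0.05, 0.001):
    kept, removed = filter_by_missing_rate(data, threshold)
    print(threshold, sorted(kept), removed)
```

Running the filter over a range of thresholds gives the number of remaining well-typed SNPs at each level, which can then be set against the imputation quality obtained with that SNP set.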
Availability and requirements
Project name: GACT: Genome build and Allele definition Conversion Tool
Project homepage: http://www.uvm.edu/genomics/software/gact
Operating system(s): Linux, UNIX (for the command-line version) and Windows (for the interactive web version)
Programming language: Python, Ruby, Hypertext Preprocessor (PHP), and Bash scripts
Availability: GACT (both command-line and web versions), including source code, documentation, and examples, is freely available for non-commercial use with no restrictions at http://www.uvm.edu/genomics/software/gact and http://asulovar.w3.uvm.edu/gact.
This work was supported by the start-up fund from the University of Vermont, USA. The GWAS data from Ashkenazi Jewish subjects described in this study were obtained from the database of Genotypes and Phenotypes (dbGaP) through accession number phs000448. Funding support for the GWAS was provided through the NIH RC2MH089964. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank Robert Howe, a research student in our laboratory, for providing technical assistance in completing the web application of GACT. The authors acknowledge the Vermont Advanced Computing Core, which is supported by NASA (NNX 06AC88G), at the University of Vermont for providing high performance computing resources that have contributed to the research results reported within this paper. We also thank the reviewers for their helpful suggestions and comments.
- Panagiotou OA, Willer CJ, Hirschhorn JN, Ioannidis JP: The power of meta-analysis in genome-wide association studies. Annu Rev Genomics Hum Genet. 2013, 14: 441-465. 10.1146/annurev-genom-091212-153520.
- Panagiotou OA, Evangelou E, Ioannidis JP: Genome-wide significant associations for variants with minor allele frequency of 5% or less–an overview: a HuGE review. Am J Epidemiol. 2010, 172 (8): 869-889. 10.1093/aje/kwq234.
- Nicolazzi EL, Picciolini M, Strozzi F, Schnabel RD, Lawley C, Pirani A, Brew F, Stella A: SNPchiMp: a database to disentangle the SNPchip jungle in bovine livestock. BMC Genomics. 2014, 15: 123. 10.1186/1471-2164-15-123.
- Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA, 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.
- Nelson SC, Doheny KF, Laurie CC, Mirel DB: Is ‘forward’ the same as ‘plus’?…and other adventures in SNP allele nomenclature. Trends Genet. 2012, 28 (8): 361-363. 10.1016/j.tig.2012.05.002.
- Marchini J, Howie B: Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010, 11 (7): 499-511. 10.1038/nrg2796.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81 (3): 559-575. 10.1086/519795.
- Delaneau O, Zagury JF, Marchini J: Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods. 2013, 10 (1): 5-6. 10.1038/nchembio.1414.
- Howie BN, Donnelly P, Marchini J: A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009, 5 (6): e1000529. 10.1371/journal.pgen.1000529.
- Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR: MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010, 34 (8): 816-834. 10.1002/gepi.20533.
- Browning SR, Browning BL: Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007, 81 (5): 1084-1097. 10.1086/521987.
- Liu EY, Buyske S, Aragaki AK, Peters U, Boerwinkle E, Carlson C, Carty C, Crawford DC, Haessler J, Hindorff LA, Marchand LL, Manolio TA, Matise T, Wang W, Kooperberg C, North KE, Li Y: Genotype imputation of Metabochip SNPs using a study-specific reference panel of ~4,000 haplotypes in African Americans from the Women’s Health Initiative. Genet Epidemiol. 2012, 36 (2): 107-117. 10.1002/gepi.21603.
- Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR: Low-coverage sequencing: implications for design of complex trait association studies. Genome Res. 2011, 21 (6): 940-951. 10.1101/gr.117259.110.
- Magi R, Morris AP: GWAMA: software for genome-wide association meta-analysis. BMC Bioinformatics. 2010, 11: 288. 10.1186/1471-2105-11-288.
- Willer CJ, Li Y, Abecasis GR: METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010, 26 (17): 2190-2191. 10.1093/bioinformatics/btq340.
- Spencer CC, Su Z, Donnelly P, Marchini J: Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 2009, 5 (5): e1000477. 10.1371/journal.pgen.1000477.
- Turner S, Armstrong LL, Bradford Y, Carlson CS, Crawford DC, Crenshaw AT, de Andrade M, Doheny KF, Haines JL, Hayes G, Jarvik G, Jiang L, Kullo IJ, Li R, Ling H, Manolio TA, Matsumoto M, McCarty CA, McDavid AN, Mirel DB, Paschall JE, Pugh EW, Rasmussen LV, Wilke RA, Zuvich RL, Ritchie MD, et al: Quality control procedures for genome-wide association studies. Current protocols in human genetics/editorial board. Edited by: Haines JL. 2011, Chapter 1: Unit 1.19.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.