GWASeq: targeted re-sequencing follow up to GWAS
- Matthew P. Salomon1, 2Email author,
- Wai Lok Sibon Li1,
- Christopher K. Edlund1,
- John Morrison1,
- Barbara K. Fortini1,
- Aung Ko Win3,
- David V. Conti1,
- Duncan C. Thomas1,
- David Duggan4,
- Daniel D. Buchanan3, 5,
- Mark A. Jenkins3,
- John L. Hopper3,
- Steven Gallinger6,
- Loïc Le Marchand7,
- Polly A. Newcomb8,
- Graham Casey1 and
- Paul Marjoram1
© Salomon et al. 2016
Received: 22 January 2015
Accepted: 9 February 2016
Published: 3 March 2016
For the last decade the conceptual framework of the Genome-Wide Association Study (GWAS) has dominated the investigation of human disease and other complex traits. While GWAS have been successful in identifying a large number of variants associated with various phenotypes, the overall amount of heritability explained by these variants remains small. This raises the question of how best to follow up on a GWAS, localize causal variants accounting for GWAS hits, and as a consequence explain more of the so-called “missing” heritability. Advances in high throughput sequencing technologies now allow for the efficient and cost-effective collection of vast amounts of fine-scale genomic data to complement GWAS.
We investigate these issues using a colon cancer dataset. After QC, our data consisted of 1993 cases, 899 controls. Using marginal tests of associations, we identify 10 variants distributed among six targeted regions that are significantly associated with colorectal cancer, with eight of the variants being novel to this study. Additionally, we perform so-called ‘SNP-set’ tests of association and identify two sets of variants that implicate both common and rare variants in the etiology of colorectal cancer.
Here we present a large-scale targeted re-sequencing resource focusing on genomic regions implicated in colorectal cancer susceptibility previously identified in several GWAS, which aims to 1) provide fine-scale targeted sequencing data for fine-mapping and 2) provide data resources to address methodological questions regarding the design of sequencing-based follow-up studies to GWAS. Additionally, we show that this strategy successfully identifies novel variants associated with colorectal cancer susceptibility and can implicate both common and rare variants.
We live in the era of the Genome-wide Association Study [GWAS]. Large numbers of samples have been collected and genotyped in a bid to associate Single Nucleotide Polymorphisms [SNPs] with phenotypic variation. In the context of human disease, the design of such studies and, in particular, the so-called SNP-chip technology that underpins them, has aimed to exploit the common disease, common variant hypothesis (e.g.), [1, 2]. This assumes that common diseases will frequently be associated with common (>1–5 % frequency) variants.
There is now a long history of GWAS studies, and large numbers of variants have been found to be associated with disease . However, such studies do not come without financial cost, and there has also been a lively discussions regarding whether such a track record should be regarded as a success or failure [4, 5]. Our purpose here is not to add to that discussion, but rather to focus on what will often be a frequent ‘next step’ in such studies.
While it is undeniable that GWAS has uncovered large numbers of variants that are associated with disease, it has also become clear that, while these variants do appear to be associated with disease, they can only explain a fraction of the phenotypic variation that is observed. Unfortunately, this fraction is general very low (e.g.), ; but see, also, . Such demonstrations of missing heritability have lead to some skepticism about the common disease, common variant hypothesis [e.g., 5].
There are many possible explanations for missing heritability (see, e.g.), [4, 5] for discussions. These include rare variants, complex genetic architectures, structural variation such as copy number variation, the joint effects of large numbers of variants each of small effect marginally, and so-called phantom heritability (e.g.), [5, 7, 8]. In this paper we focus on the first of those hypotheses: the discovery of nearby, and possibly rare variants that drive GWAS signals.
Because of their focus on common variants, SNP-chip platforms were not well placed to discover associations between disease and rare genetic variants. Theoretically, discovery through so-called synthetic associations with common variants is possible , although we note that there is some discussion regarding whether such phenomena are likely to explain most GWAS signals (e.g.), [10, 11]. However, given this possibility, combined with the recognition that an initial GWAS may well be finding SNPs that are not causative in themselves, but are instead linked with nearby causative polymorphisms, there has been a move towards following-up GWAS studies by sequencing studies (e.g.), . Here, the hope is that a signal of association that has been found in GWAS can be refined, and strengthened, by sequencing the region of the genome that surrounds the focal SNP (the SNP that was observed to have a small p-value in the original GWAS). Alternative strategies, that are not the subject of the present paper, include whole-genome or whole-exome sequencing (e.g.) .
However, before such a sequencing study can be conducted, several design questions must be resolved (where to sequence, at what depth, etc.). With this in mind, NIH formed the GWASeq consortium, in which multiple groups were funded to conduct sequence-based follow-up to GWAS, and thereby create a pool of publically-available data that could both a) provide the potential for refinement of GWAS signal for the phenotypes of interest, and b) provide a publically-available resource that the wider community might use to help guide their approach to such design questions for their own studies. The study we describe in this paper is one member of the GWASeq consortium. As such, the data are in the process of being made publically available through dbGaP, the NCBI’s repository for data that attempt to relate genotype to phenotype (http://www.ncbi.nlm.nih.gov/gap).
It should be noted, of course, that a large number of studies outside the GWASeq consortium are also attempting to follow-up GWAS hits using NGS technology, and examples are beginning to appear. An early example of this is Nejentsev et al. , in which the authors sequenced exons and splice-sites for ten candidate genes that contained previously associated common SNPs for type-1 diabetes in order to identify rare functional variants. Likewise, there are also a growing number of exome-sequencing studies, e.g. Liu, et al. , that focus on testing for rare functional variants.
Our study focuses on colorectal cancer. Colorectal cancer is the fourth-most common cancer and the second-most common cause of cancer death in the United States, with approximately 148,810 new cases and 49,960 deaths estimated in 2008 . There is known to be a strong genetic component to CRC risk, and individuals with a family history of colorectal cancer are at increased risk of the disease. For example, having a first-degree relative with CRC roughly doubles the risk, . Further evidence of heritability is seen in twin studies. For example, in a large twin study, up to 35 % (95 % CI: 10 % to 48 %) of CRC risk could be explained by inherited factors . GWAS hits have been found in a number of regions: 8q23.3 (rs16892766), 8q24 (rs6983267, rs7014346, rs10505477) [18–22], 9p24 (rs719725) [19, 20], 8q23.3 (rs16892766, EIF3H) and 10p14 (rs10795668) , 11q23 (rs3802824) , 12q13.13 (rs7136702), 14q22.2 (rs4444235), 15q13.3 (rs4779584) , 18q21 (rs4939827, SMAD7) [22, 24], and 20q13.33 (rs4925386). It is these regions that form the basis for follow-up in our experimental design.
Sample information for all samples sequenced in this study
Summary of 11 genomic regions sequenced
Total sequenced (bp)
% of target with > = 30X
Uncorrected p-value (0 PCs)
Uncorrected p-value (2 PCs)
Overall call rates and concordance between genotypes identified in the sequence data as compared to genotypes called from previously collected SNP array data (where they exist) was high. We identified 33 samples where the concordance between the sequence-based genotype calls and one or more array-based genotype calls was low and therefore we removed those samples from the analysis. Given the high concordance between our sequence-based genotype calls and the array-based genotype calls we believe that the number of mis-identified or mis-labeled samples in our final data set is negligible.
The logistical constraints of this study, in which data was shipped from a variety of centers at a variety of times, and sequencing was necessarily performed in batches at a third-party center, meant that we were not able to explicitly design the study to guard against batch or center effects during sequencing. However, we carefully examined the sequence data for batch and center effects and found none. We saw no evidence of any clustering by center in PC plots, nor variation in coverage by center. We also examined SNP density, \( \pi \), for each center and means were very similar (ranging from 0.00100 to 0.00106).
In this paper we focus upon testing for association in the population-based samples that passed the preceding QC checks (1993 cases, 899 controls). Analysis of the family-based data is ongoing and will be presented in a separate paper.
Population structure and candidate region studies
It is traditional to control for population structure in a GWAS context. In order to do this, global genome-wide structure and relatedness between samples is evaluated using a genome-wide set of (roughly unlinked) markers. In the present study, which amounts to a study of 11 candidate regions, this is impossible. While we chose to calculate PCs as part of the QC process, to uncover obvious outliers, it is entirely unclear that such PCs will reliably capture genome-wide patterns of relatedness. Neither do we have ‘SNP-chip’ data for every sample in our data. We have a total of just ~ 5.8 MB of data per sample, divided into 11 short (~500 KB) regions. As such, we believe it is likely that PCs calculated form this data may capture local, rather than global structure. Indeed we see no correspondence between PCs calculated on the samples retained for the association analysis and ethnicity or center of collection. We also note that the first PC explains <4 % of the variation in the sample. This lack of structure is consistent with the vast majority (>97 %) of our samples being comprised of Caucasian individuals, and is further consistent with the structure that is observed in earlier GWAS analyses using CCFR samples and common SNPs.
For this reason, while for comparison’s sake, we present results for analyses that include both 0 and 2 PCs, we propose to focus on the results for the analysis containing 0 PCs. We return to this point in the next section.
(Non-rare) variant associations
Most significantly associated SNPs identified in the PLINK analysis
rs ID number
MAF: Cases (Controls)
Gene (distance to nearest genes)
Uncorrected p-value (0PCs)
FDR corrected (0PCs)
FDR corrected (2PCs)
A to G
T to G
A to G
A to T
A to T
A to C
A to C
T to C
C to T
A to G
It is of particular interest to examine the strength of association found with each of the ‘focal’ SNPs around which the 11 regions we defined. This is recorded in the final two columns of Table 2. (Here, we report uncorrected p-value, as if one were conducting a validation study of that SNP alone). We note that, with the exception of rs4444235 and rs4779584 we see a strong tendency for these tests of association to return small p-value. This is, at the very least, encouraging regarding the veracity of those original signals.
Rare variant associations
Composition of the significantly associated SNP sets identified in the SKAT combined analysis
rs ID number
MAF: Cases (Controls)
p-value for SNP set
Uncorrected 3.17e–004 (1.79e–004)
FDR corrected 0.0486 (0.0275)
Uncorrected 1.04e–005 (1.83e–005)
FDR corrected 0.0032 (0.0056)
Composition of the significantly associated SNP sets identified in the SKAT combined analysis
rs ID number
MAF: Cases (Controls)
p-value for SNP set
Uncorrected 8.08e–005 (7.59e–005)
FDR corrected 0.0055 (0.0052)
Which samples should be sequenced?
Which regions should be sequenced?
What depth of coverage should be used and how far around the focal SNP should we sequence?
To what extent can we rely upon imputation?
What designs are more efficient for variant discovery and testing associations?
There are at least two ways one could try to answer design questions such as this. The first is to conduct a large simulation study. Here, data are simulated under a variety of possible disease models, and for a variety of population models and designs for GWAS and subsequent sequencing study. A very large number of variables are at play here, but the advantage of a simulation-based study is that it is possible, at least in principle, to simulate a wide variety of possibilities. The disadvantage, of course, is that the conclusions one draws may or may not be robust to inaccuracies in the underlying simulation model. As was famously noted by George Box, “All models are wrong; some are useful” . The hope is that the conclusions drawn from such an analysis will be useful despite their inevitable (and admitted) inaccuracies. Such studies are extremely computationally intensive, which limits the range of model and design parameters that might be considered, but examples do exist. For example,  conducted such an analysis based upon simulating populations of 10,000 genotypes designed to mimic a breast cancer study . They demonstrated that informative sampling based on disease and phenotype status jointly, could improve power, as could incorporating phenotype data from extended pedigree information, in family-based studies.
The second approach to resolving these design questions is data- rather than simulation-based. Here the goal is to collect data in which sequencing has been used to follow-up GWAS hits, and to attempt to draw robust conclusions from those data. Now, as Box might say, the ‘model’ is correct. The data got there however disease data got there (i.e., the model is reality itself). However, the price we pay here is lack of replication - we have a relatively small number of such datasets. Therefore, the challenge will be in drawing robust conclusions from these studies. Our hope is that the data resource described by this paper, and other members of the GWASeq consortium, will help begin to provide those guidelines.
One perspective of the GWASeq consortium is to provide data that are rich enough to enable investigators to assess the effectiveness of alternative designs by subsetting the available data. Given this view, the goal we followed when designing the study was to be conservative in the sense of sequencing at greater depth, for wider regions and more samples, than might otherwise have been the case. This provides maximum scope for assessment of efficiency of alternative designs.
For the 4,052 individuals included in this study we were able to successfully sequence the majority (~80 %) of the intended target region surrounding the GWAS implicated SNP of interest to a sufficient depth to allow us to accurately genotype previously unknown variants in the region. Of the variants we identified, the vast majority (~90 %) of SNPs in our data set are comprised of rare variants with a MAF of less than 0.01. This abundance of rare variants is consistent with the findings of other large-scale sequencing projects that have shown very high levels of genetic diversity present in the human population, driven by recent and rapid population expansion .
Of course, a primary interest when collecting data such as these is to determine whether stronger genotype-phenotype associations will be found near the focal SNPs from prior GWAS. Here, we employed two strategies to detect associations and identify variants that confer risk for colorectal cancer susceptibility. The first strategy focused on more “common” variants to determine if any of the higher allele frequency novel variants identified in this data set could be associated with disease susceptibility. This strategy identified 10 new variants, none of which have been previously associated with colorectal cancer. The second strategy we employed was to look at the combined effect of rare and common variants. Here we uncovered associations with distinct variant sets containing both common and rare variants in the 3′UTR of the gene C11orf53, and the 5′UTR of the gene ATF1.
Previous work to identify the functional risk variants for colorectal cancer in the 11q23.1 region has implicated several genes including C11orf53 as likely factors in colorectal cancer etiology [31, 32]. Furthermore, Pittman et al.  found a variant (rs3087967) in the 3′UTR of C11orf53 to be in high LD with SNP rs3802842, which was the focal GWAS SNP that our sequencing region was designed around. While we did not detect a significant association with rs3802842, the SKAT combined test did identify three other variants in the same 190 bp region that comprises the 3′UTR of C11orf53. The 3′UTR contributes to post transcriptional gene regulation through the regulatory actions of miRNAs. If differences in C11orf53 expression are involved in colorectal cancer susceptibility, then mutations in the 3′UTR might lead to changes in miRNA binding affinity and thus lead to changes in the expression of that gene. In fact, rs3087967 is directly adjacent to a miR-9 binding sequence (microRNA.org).
Additionally, we detected a significant (p = 0.0230, Fisher’s exact test) enrichment of novel variants in cases as compared to controls within the 5′UTR region of the activating transcription factor 1 (ATF1) gene, an important cAMP-responsive transcription factor. Two of these novel variant positions lie within an upstream open reading frame (uORF), and three of them lie within the internal ribosome entry site (IRES) . Both uORFs and IRES elements contribute to overall gene expression levels via translation control [35, 36]. In 2012, Huang, et al.  showed that expression levels of ATF1 are positively correlated with survival in colorectal cancer patients.
This study is likely to be one of a large number of studies that perform targeted sequencing in order to follow-up hits from earlier GWAS. The jury is still out regarding how likely it is that stronger associations will be uncovered by such a strategy, but heritability estimates for many diseases indicate that this is a reasonable hope. In our own study we do find such associations in a number of the regions that we sequenced. However, it is also the case that in a number of the regions no significant signal was found. Regardless, by placing such data in the public domain we hope to enable other groups to better design their own follow-up studies, and thereby increase their own chances of successful discovery.
The study samples
The samples used in this study were taken from the Colon Cancer Family Registry. Informed consent was obtained from all study participants and the study protocol was approved at each center. The overwhelming majority of samples were from DNA extracted from blood samples (~93 %), with the remaining being from buccal cells (~7 %). Each CCFR center individually extracted total genomic DNA and shipped the extracted DNA to the Baylor College of Medicine for sequencing. All study protocols were approved by the USC Health Sciences Institutional Review Board.
A total of 11 genomic regions were selected for targeted re-sequencing (Table 2). For the purpose of this study we define a genomic region as the flanking sequence on either side of a focal GWAS SNP. The amount of flanking sequence surrounding each focal SNP was determined by the local LD structure around the focal SNP, to ensure that the sequenced area extended beyond the LD block containing the focal SNP. The targeted regions were isolated from total genomic DNA using custom designed NimbleGen Sequence Capture Microarrays (following the manufacturer’s protocols). Individual sample libraries were multiplexed together and sequenced on an Illumina HiSeq 2000 at the Baylor College of Medicine. A total of ~5.8 MB of genic and intergenic sequence was collected from each individual sample.
Sequence reads were mapped to the 1000 Genomes (b37) build of the human genome using BWA (version 0.6.2-r126)  with default settings. The resulting alignments were further processed using the GATK (version 2.3–9)  base quality score recalibration, indel realignment, duplicate removal (picardtools, version 1.84), and read-reduction functions in accordance with the GATK Best Practices recommendations .
Variant detection was performed for polymorphism discovery and genotyping across all 4,052 samples simultaneously using the GATK UnifiedGenotyper (version 2.7–4). We applied an additional mapping quality (MQ) filter of 50 during variant calling to remove false positive variants that result from poor mapping (Additional file 4). The raw variant calls were further refined using the variant quality score recalibration (VQSR) methods according to the GATK Best Practices recommendations [40, 41]. The final variant call set was checked for concordance with both dbSNP (version 137) and 1000 Genomes SNP calls using the GATK AnnotateEval tool and functional annotations were performed using the ANNOVAR annotation pipeline following the authors’ recommendations .
Given the number of samples, repository centers, and individuals involved in the generation of the sequence data, the possibility of some samples becoming mis-labeled is a valid concern. Therefore, for those samples that had existing SNP array data (~2300 samples), we compared the genotype calls from the sequence data to genotype calls from existing SNP array genotyping data generated from the same samples to confirm the identity of as many of the sequenced samples as possible.
Additionally, we tested difference in coverage levels between cases and controls in any of the sequenced regions. Such a difference could induce biases, particularly in the rare variant tests. We found no evidence of any statistically significant difference in coverage between cases and controls in any of the regions (see Additional file 1).
Subset of samples used for associations
A subset of the 4,052 samples that were sequenced consisted of pedigree-based samples (~1,000 samples) - these were removed from the analysis presented in this paper. We also removed 33 samples that were suspected of being mis-labeled (see Sample QC), as well as any samples that lacked full covariate, or phenotypic data. We then conducted a PC analysis, using a set of unlinked variants that covered each of our sequenced regions, using the SNPRelate package in R . This resulted in our removing two samples that represented obvious outliers. PC axes were then recalculated for possible inclusion in association tests. The final data set comprised of 2,892 population-based samples with 1993 cases, 899 controls.
Marginal tests for association were preformed using the PLINK software package . We used a logistic model and included age at disease diagnoses, sex, CCFR center. Only variants with a genotyping rate > = 95 %, and with MAF > 0.005, were included in the analysis. To correct for multiple testing we applied a false discovery rate (FDR)  correction based on the total number of variants tested as implemented in the R function p.adjust.
In order to test for associations with rare variants we employed a sequence kernel association test (SKAT) using the SKAT package in R . We employed the combined SKAT test in order to test the combined effect of both common and rare variants . Variant sets comprised of variants annotated in UTRs, exons, introns, within 1 KB of a known gene, and the intergenic sequence between two known genes for a total of 307 variant sets that were included in the analysis. To correct for multiple testing we applied a FRD correction to the raw p-value generated by the combined SKAT test.
Availability of supporting data
All data used in this article are in the process of being deposited in dbGaP. Readers are encouraged to contact the authors for further details.
We gratefully acknowledge the help of the Baylor College of Medicine Sequencing Center, numerous helpful conversations with other members of the GWASeq consortium, and, in particular, Lisa Brooks for leading the consortium. We thank the CCFR for providing the samples used in this study and for providing technical assistance. In particular, we thank Allyson Templeton for her unstinting efforts in helping navigate the CCFR system, and the PIs of each of the CCFR centers. We are also grateful to the reviewers for helpful comments on an earlier version of the manuscript. This work was funded primarily funded by NIH award HG005927. Additional funding was provided by award GM103804. Data collection for this work was supported by grant UM1 CA167551 from the National Cancer Institute and through cooperative agreements with the following CCFR centers:
Australasian Colorectal Cancer Family Registry (U01 CA074778 and U01/U24 CA097735), Ontario Familial Colorectal Cancer Registry (U01/U24 CA074783), Seattle Colorectal Cancer Family Registry (U01/U24 CA074794), University of Hawaii Colorectal Cancer Family Registry (U01/U24 CA074806) and USC Consortium Colorectal Cancer Family Registry U01/U24 CA074799).
The content of this manuscript does not necessarily reflect the views or policies of the National Cancer Institute, National Institute of Health, or any of the collaborating centers in the Colon Cancer Family Registry (CCFR).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease - common variant … or not? Hum Mol Genet. 2002;11:2417–23.View ArticlePubMedGoogle Scholar
- Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003;33:228–37.View ArticlePubMedGoogle Scholar
- Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42:D1001–6.Google Scholar
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TFC, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461:747–53.Google Scholar
- Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet. 2011;13:135–45.View ArticlePubMed CentralGoogle Scholar
- Allen HL, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, Willer CJ, Jackson AU, Vedantam S, Raychaudhuri S, Ferreira T, Wood AR, Weyant RJ, Segre AV, Speliotes EK, Wheeler E, Soranzo N, Park J-H, Yang J, Gudbjartsson D, Heard-Costa NL, Randall JC, Qi L, Smith AV, Maegi R, Pastinen T, Liang L, Heid IM, Luan J, Thorleifsson G, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010;467:832–8.View ArticleGoogle Scholar
- Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42:565–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Acad Sci U S A. 2012;109:1193–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB. Rare Variants Create Synthetic Genome-Wide Associations. PLoS Biol. 2010;8(1):e1000294.Google Scholar
- Anderson CA, Soranzo N, Zeggini E, Barrett JC. Synthetic associations are unlikely to account for many common disease genome-wide association signals. PLoS Biol. 2011;9:e1000580.View ArticlePubMedPubMed CentralGoogle Scholar
- Wray NR, Purcell SM, Visscher PM. Synthetic Associations Created by Rare Variants Do Not Explain Most GWAS Results. PLoS Biol. 2011;9(1): e1000579.Google Scholar
- Cirulli ET, Goldstein DB. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet. 2010;11:415–25.View ArticlePubMedGoogle Scholar
- Liu L, Sabo A, Neale BM, Nagaswamy U, Stevens C, Lim E, Bodea CA, Muzny D, Reid JG, Banks E, Coon H, Depristo M, Dinh H, Fennel T, Flannick J, Gabriel S, Garimella K, Gross S, Hawes A, Lewis L, Makarov V, Maguire J, Newsham I, Poplin R, Ripke S, Shakir K, Samocha KE, Wu Y, Boerwinkle E, Buxbaum JD, et al. Analysis of rare, exonic variation amongst subjects with autism spectrum disorders and population controls. PLoS Genet. 2013;9:e1003443.View ArticlePubMedPubMed CentralGoogle Scholar
- Nejentsev S, Walker N, Riches D, Egholm M, Todd JA. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009;324:387–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Jemal A, Murray T, Ward E, Samuels A, Tiwari RC, Ghafoor A, Feuer EJ, Thun MJ. Cancer statistics, 2005. Ca-a Cancer J Clins. 2005;55:10–30.View ArticleGoogle Scholar
- Macklin MT. Inheritance of cancer of the stomach and large intestine in man. J Natl Cancer Inst. 1960;24:551–71.PubMedGoogle Scholar
- Gardner EJ. A genetic and clinical study of intestinal polyposis, a predisposing factor for carcinoma of the colon and rectum. Am J Hum Genet. 1951;3:167–76.PubMedPubMed CentralGoogle Scholar
- Tomlinson I, Webb E, Carvajal-Carmona L, Broderick P, Kemp Z, Spain S, Penegar S, Chandler I, Gorman M, Wood W, Barclay E, Lubbe S, Martin L, Sellick G, Jaeger E, Hubner R, Wild R, Rowan A, Fielding S, Howarth K, Silver A, Atkin W, Muir K, Logan R, Kerr D, Johnstone E, Sieber O, Gray R, Thomas H, Peto J, et al. A genome-wide association scan of tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21. Nat Genet. 2007;39:984–8.View ArticlePubMedGoogle Scholar
- Zanke BW, Greenwood CMT, Rangrej J, Kustra R, Tenesa A, Farrington SM, Prendergast J, Olschwang S, Chiang T, Crowdy E, Ferretti V, Laflamme P, Sundararajan S, Roumy S, Olivier J-F, Robidoux F, Sladek R, Montpetit A, Campbell P, Bezieau S, O'Shea AM, Zogopoulos G, Cotterchio M, Newcomb P, McLaughlin J, Younghusband B, Green R, Green J, Porteous MEM, Campbell H, et al. Genome-wide association scan identifies a colorectal cancer susceptibility locus on chromosome 8q24. Nat Genet. 2007;39:989–94.Google Scholar
- Poynter JN, Figueiredo JC, Conti DV, Kennedy K, Gallinger S, Siegnumd KD, Casey G, Thibodeau SN, Jenkins MA, Hopper JL, Byrnes GB, Baron JA, Goode EL, Tiirikainen M, Lindor N, Grove J, Newcomb P, Jass J, Young J, Potter JD, Haile RW, Duggan DJ, Le Marchand L. Variants on 9p24 and 8q24 are associated with risk of colorectal cancer: Results from the colon cancer family registry. Cancer Res. 2007;67:11128–32.View ArticlePubMedGoogle Scholar
- Haiman CA, Le Marchand L, Yamamato J, Stram DO, Sheng X, Kolonel LN, Wu AH, Reich D, Henderson BE. A common genetic risk factor for colorectal and prostate cancer. Nat Genet. 2007;39:954–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Tenesa A, Farrington SM, Prendergast JGD, Porteous ME, Walker M, Haq N, Barnetson RA, Theodoratou E, Cetnarskyj R, Cartwright N, Semple C, Clark AJ, Reid FJL, Smith LA, Kavoussanakis K, Koessler T, Pharoah PDP, Buch S, Schafmayer C, Tepel J, Schreiber S, Voelzke H, Schmidt CO, Hampe J, Chang-Claude J, Hoffmeister M, Brenner H, Wilkening S, Canzian F, Capella G, et al. Genome-wide association scan identifies a colorectal cancer susceptibility locus on 11q23 and replicates risk loci at 8q24 and 18q21. Nat Genet. 2008;40:631–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Jaeger E, Webb E, Howarth K, Carvajal-Carmona L, Rowan A, Broderick P, Walther A, Spain S, Pittman A, Kemp Z, Sullivan K, Heinimann K, Lubbe S, Domingo E, Barclay E, Martin L, Gorman M, Chandler I, Vijayakrishnan J, Wood W, Papaemmanuil E, Penegar S, Qureshi M, CORGI Consortium, Farrington S, Tenesa A, Cazier J-B, Kerr D, Gray R, Peto J, et al. Common genetic variants at the CRAC1 (HMPS) locus on chromosome 15q13.3 influence colorectal cancer risk. Nat Genet. 2008;40:26–8.View ArticlePubMedGoogle Scholar
- Broderick P, Carvajal-Carmona L, Pittman AM, Webb E, Howarth K, Rowan A, Lubbe S, Spain S, Sullivan K, Fielding S, Jaeger E, Vijayakrishnan J, Kemp Z, Gorman M, Chandler I, Papaemmanuil E, Penegar S, Wood W, Sellick G, Qureshi M, Teixeira A, Domingo E, Barclay E, Martin L, Sieber O, CORGI Consortium, Kerr D, Gray R, Peto J, Cazier J-B, et al. A genome-wide association study shows that common alleles of SMAD7 influence colorectal cancer risk. Nat Genet. 2007;39:1315–7.View ArticlePubMedGoogle Scholar
- Newcomb PA, Baron J, Cotterchio M, Gallinger S, Grove J, Haile R, Hall D, Hopper JL, Jass J, Le Marchand L, Lindor N, Potter JD, Templeton AS, Seminara D, Thibodeau S for the Colon Cancer Family Registry. Colon Cancer Family Registry: An International Resource for Studies of the Genetic Epidemiology of Colon Cancer. CEBP. 2007;16(11):2331–43.Google Scholar
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89:82–93.View ArticlePubMedPubMed CentralGoogle Scholar
- Box GEP, Draper NR. Empirical Model-Building and Response Surfaces. Hoboken, New Jersey: John Wiley & Sons Inc; 1987.Google Scholar
- Thomas DC, Yang Z, Yang F. Two-phase and family-based designs for next-generation sequencing studies. Front Genet. 2013;4:276.View ArticlePubMedPubMed CentralGoogle Scholar
- Bernstein JL, Langholz B, Haile RW, Bernstein L, Thomas DC, Stovall M, Malone KE, Lynch CF, Olsen JH, Anton-Culver H, Shore RE, Boice JD, Berkowitz GS, Gatti RA, Teitelbaum SL, Smith SA, Rosenstein BS, Børresen-Dale A-L, Concannon P, Thompson WD, WECARE study. Study design: evaluating gene-environment interactions in the etiology of breast cancer - the WECARE study. Breast Cancer Res. 2004;6:R199–214.View ArticlePubMedPubMed CentralGoogle Scholar
- Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, Kang HM, Jordan D, Leal SM, Gabriel S, Rieder MJ, Abecasis G, Altshuler D, Nickerson DA, Boerwinkle E, Sunyaev S, Bustamante CD, Bamshad MJ, Akey JM, GO B, GO S, Project NES. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337:64–9.Google Scholar
- Biancolella M, Fortini BK, Tring S, Plummer SJ, Mendoza-Fandino GA, Hartiala J, Hitchler MJ, Yan C, Schumacher FR, Conti DV, Edlund CK, Noushmehr H, Coetzee SG, Bresalier RS, Ahnen DJ, Barry EL, Berman BP, Rice JC, Coetzee GA, Casey G. Identification and characterization of functional risk variants for colorectal cancer mapping to chromosome 11q23.1. Hum Mol Genet. 2014;23:2198–209.View ArticlePubMedPubMed CentralGoogle Scholar
- Peltekova VD, Lemire M, Qazi AM, Zaidi SHE, Trinh QM, Bielecki R, Rogers M, Hodgson L, Wang M, D'Souza DJA, Zandi S, Chong T, Kwan JYY, Kozak K, De Borja R, Timms L, Rangrej J, Volar M, Chan-Seng-Yue M, Beck T, Ash C, Lee S, Wang J, Boutros PC, Stein LD, Dick JE, Gryfe R, McPherson JD, Zanke BW, Pollett A, et al. Identification of genes expressed by immune cells of the colon that are regulated by colorectal cancer- associated variants. Int J Cancer. 2014;134:2330–41.Google Scholar
- Pittman AM, Webb E, Carvajal-Carmona L, Howarth K, Di Bernardo MC, Broderick P, Spain S, Walther A, Price A, Sullivan K, Twiss P, Fielding S, Rowan A, Jaeger E, Vijayakrishnan J, Chandler I, Penegar S, Qureshi M, Lubbe S, Domingo E, Kemp Z, Barclay E, Wood W, Martin L, Gorman M, Thomas H, Peto J, Bishop T, Gray R, Maher ER, et al. Refinement of the basis and impact of common 11q23.1 variation to the risk of developing colorectal cancer. Hum Mol Genet. 2008;17:3720–7.View ArticlePubMedGoogle Scholar
- Grillo G, Turi A, Licciulli F, Mignone F, Liuni S, Banfi S, Gennarino VA, Horner DS, Pavesi G, Picardi E, Pesole G. UTRdb and UTRsite (RELEASE 2010): a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. 2010;38(Database issue):D75–80.View ArticlePubMedPubMed CentralGoogle Scholar
- Somers J, Poeyry T, Willis AE. A perspective on mammalian upstream open reading frame function. Int J Biochem Cell Biol. 2013;45:1690–700.View ArticlePubMedGoogle Scholar
- Faye MD and Holcik M. The role of IRES trans-acting factors in carcinogenesis, Biochim. Biophys. Acta. (2014). http://dx.doi.org/10.1016/j.bbagrm.2014.09.012.
- Huang G-L, Guo H-Q, Yang F, Liu O-F, Li B-B, Liu X-Y, Lu Y, He Z-W. Activating transcription factor 1 is a prognostic marker of colorectal cancer. Asian Pac J Cancer Prev. 2012;13:1053–7.View ArticlePubMedGoogle Scholar
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.View ArticlePubMedPubMed CentralGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.View ArticlePubMedPubMed CentralGoogle Scholar
- Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella KV, Altshuler D, Gabriel S, DePristo MA. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;11:11.10.1–11.10.33.Google Scholar
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, Del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164.View ArticlePubMedPubMed CentralGoogle Scholar
- Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28:3326–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.View ArticlePubMedPubMed CentralGoogle Scholar
- Benjamini Y, Hochberg Y. Controlling the false discovery rate - a practical and powerful approach to multiple testing. J Royal Stat Soc Series B-Methodological. 1995;57:289–300.Google Scholar
- Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet. 2013;92:841–53.View ArticlePubMedPubMed CentralGoogle Scholar