Exome sequencing generates high quality data in non-target regions

Background Exome sequencing using next-generation sequencing technologies is a cost efficient approach to selectively sequencing coding regions of human genome for detection of disease variants. A significant amount of DNA fragments from the capture process fall outside target regions, and sequence data for positions outside target regions have been mostly ignored after alignment. Result We performed whole exome sequencing on 22 subjects using Agilent SureSelect capture reagent and 6 subjects using Illumina TrueSeq capture reagent. We also downloaded sequencing data for 6 subjects from the 1000 Genomes Project Pilot 3 study. Using these data, we examined the quality of SNPs detected outside target regions by computing consistency rate with genotypes obtained from SNP chips or the Hapmap database, transition-transversion (Ti/Tv) ratio, and percentage of SNPs inside dbSNP. For all three platforms, we obtained high-quality SNPs outside target regions, and some far from target regions. In our Agilent SureSelect data, we obtained 84,049 high-quality SNPs outside target regions compared to 65,231 SNPs inside target regions (a 129% increase). For our Illumina TrueSeq data, we obtained 222,171 high-quality SNPs outside target regions compared to 95,818 SNPs inside target regions (a 232% increase). For the data from the 1000 Genomes Project, we obtained 7,139 high-quality SNPs outside target regions compared to 1,548 SNPs inside target regions (a 461% increase). Conclusions These results demonstrate that a significant amount of high quality genotypes outside target regions can be obtained from exome sequencing data. These data should not be ignored in genetic epidemiology studies.


Background
Next-generation sequencing technologies have substantially decreased the cost of sequencing large genomic regions. It is still financially prohibitive, however, to perform whole genome sequencing for a large number of subjects, especially for large scale genetic epidemiology association studies, at a sufficient depth for accurate genotype calls. The exome represents about 1% of the human genome with approximately 30 million base pairs [1] but accounts for over 85% of all mutations identified in Mendelian disorders [2]. As a result, exome sequencing is currently an attractive and practical approach for investigation of coding variations.
While exome sequencing primarily targets exons, noncoding regions such as introns, intron-exon boundary regions, UTRs, and intergenic regions can also be sequenced as a byproduct. It is well known that SNPs located in promoter and UTR regions may regulate gene expression. It was traditionally believed that introns were not as important as exons, but many studies have now established some functional significance for introns. For example, Rearick et al. [3] suggested that some introns can encode specific proteins and can be processed after splicing to form noncoding RNA molecules. Yi et al. [4] found a pair of intronic SNPs from the EPAS1 gene that show strongly elevated allele frequencies in Tibetans compared with Han Chinese. Furthermore, intergenic regions comprise around 70% of the human genome. Many GWAS have established strong associations between intergenic SNPs and diseases [5][6][7], and many of the results have been replicated in independent datasets. Based on the results from the 1000 Genomes Project [8], among the 1,227 SNPs associated with complex traits identified by GWAS (www.genome.gov/gwastudies), under 30% are either annotated as a nonsynonymous variant or in substantial LD (r 2 ≥ 0.5) with a non-synonymous variant.
Exome and targeted sequencing requires capture of DNA fragments that overlap with target regions. Although a captured library is greatly enriched for target regions, a significant fraction of DNA fragments still fall outside target regions. The fraction varies depending on capture efficiency. For example, Yi et al. [4] (using Nim-bleGen 2.1 M kit) reported having 64.5% of sequenced bases outside target regions and 31.9% more than 500 bp away from target regions; Ng et al. [2] (using Agilent 244 K microarrays for target enrichment) reported over 50% of sequenced bases outside target regions. Our data obtained with Agilent SureSelect v1 capture kit also had nearly 50% of sequenced bases outside target regions. As a result, a significant fraction of intronic and intergenic regions may have been sequenced, which may include promoters, conserved non-coding sequences, untranslated regions, miRNA sites, and other potentially functional regions. It has been a common practice to ignore data outside target regions and focus only on the bases inside target regions [2,4,[9][10][11]. While justified, this approach is inefficient as it overlooks a large amount of data that would otherwise be useful in genetics studies.
In this paper, we systematically evaluate the quality of data outside target regions using our exome sequencing data and a subset of the Pilot 3 data from the 1000 Genomes Project [12]. The Agilent SureSelect v1 capture kit (38 million bases) and Illumina TrueSeq capture kit (62 million bases) used in our exome study are two of the most widely used kits for exome sequencing studies. The 1000 Genomes Project Pilot 3 data focused on the exons of a thousand randomly selected genes as target regions and used Agilent's array enrichment method (2.1 million bases) [8]. For both our data and the data from the 1000 Genomes Project, we will describe coverage, data quality, SNP distribution, and Ti/Tv ratio for the SNPs outside target regions.

Data description
We studied whole exome sequencing data for 28 breast cancer patients recruited to the Shanghai Breast Cancer Study (SBCS). The SBCS is a large, population-based case-control study of women in urban Shanghai, the details of which have been previously described [13]. All patients had very early-onset (22-32 years old) breast cancer or early-onset (38-41 years old) plus first-degree family history of breast cancer. Genomic DNA from buffy coat was extracted using QIAmp DNA kit (Qiagen, Valencia, CA) following the manufacture's protocol.
Approval of the study was granted by the relevant institutional review boards in both China and the United States. All of our samples have been genotyped using the Affymetrix 6.0 array in a previous genome wide association study [13].
We studied 22 exomes captured using the Agilent Sur-eSelect kit and 6 exomes captured using the Illumina TrueSeq kit. Our Agilent SureSelect exome data were 72-base paired-end reads generated from Illumina GA IIx machines. Each sample was run on a single lane of a flowcell. DNA enrichment was done using Agilent Sure-Select Human All Exon kit v1 which was designed to target 165,637 genomic regions (37.8 million bases; 71.6% inside exons; average length 228 bp). Our Illumina TrueSeq exome data were 100-base paired-end reads. The six samples were barcoded and sequenced on Illumina HiSeq2000 (five samples per lane). The TrueSeq capture kit targets 201,071 regions (62.1 million bases; 49.3% inside exons; average length 309 bp). The consensus coding sequences database (CCDS) [14] has 27.8 million bases, 98.3% of which are covered by the SureSelect target regions and 96.5% by the TrueSeq target regions. In summary, both kits targeted more than just exon regions but they also did not have 100% coverage of known exons.
We shifted the Illumina base quality scores (Phred + 64) to the Sanger scale (Phred + 33) [15] and performed initial alignment to the NCBI human reference genome (version 36 for SureSelect data and 37 for TrueSeq data) using the program Burrows-Wheeler Aligner (BWA) [16]. We then marked duplicates with Picard [17] and carried out regional realignment and quality score recalibration using Genome Analysis Toolkit (GATK) [18]. For variant calling, we only used reads with a mapping quality score (MAPQ) ≥20 (i.e., ≤1% probability of being wrong) and bases with base quality score (BQ) ≥20. We used GATK's Unified Genotyper to call SNPs simultaneously on all SureSelect samples and then on all True-Seq samples.
Our SNP filtering criteria were determined through the following steps: 1) Plot Ti/Tv ratio versus genotype quality score (GQ), depth, allelic balance, BQ, and MAPQ for all SNPs and separately for SNPs in dbSNP and novel SNPs. 2) Draw the above plots only for SNPs inside target regions. 3) Calculate overall and heterozygote consistency rates with GWAS data for SNPs overlapping both platforms, filtered by GQ, depth, BQ, and MAPQ. 4) Identify the best filtering criteria that resulted in both high genotype consistency rates and Ti/Tv ratios close to expected values (see next section for details). We found that the two most effective factors affecting sequencing data genotype quality were GQ and depth, and used both GQ ≥ 20 and depth ≥ 5 as genotype filters throughout the study unless otherwise specified.
In addition, we selected 6 subjects from the Pilot 3 study of the 1000 Genomes Project, which designed a capture assay for the exons of 1000 genes (8,496 target regions, 1.4 million bases, average length 169 bp) [8]. Currently, only SNP calls for positions inside the target regions were reported. We selected subjects that were also in the HapMap II so that we could evaluate SNP call quality of sequencing data by comparing them with HapMap genotypes. We focused on Caucasian subjects as a complement to our Asian samples. Two sequencing platforms, Illumina and 454 Life Sciences, were used for those subjects. We focused on the Illumina platform because it used shorter DNA fragments, which tend to overlap more with the typically short target regions. We selected subjects with paired-end data and no flags of "failing" or "withdrawn." Among those who qualified the above criteria, we selected two subjects from each sequencing center with the highest number of reads: NA12043 and NA12144 from the Sanger Center (SC), which used Nimblegen 385-K array hybridization method; NA12249 and NA12716 from the Washington University Genome Sequencing Center (WUGSC), which used a biotinylated capture library generated by PCR in the presence of biotinylated CTP from a pool of 190 bp synthesized oligos; and NA12154 and NA12005 from the Broad Institute (BI), which used RNA baits transcribed in the presence of biotinylated UTP from primers cleaved from an Agilent microarray. Subject NA12005 had the third highest number of reads from BI; the subject with the second highest number of reads, NA12892, had a very high fraction of non-aligned reads (~10% for one of its two FASTQ files in contrast to~2% for the other subjects). We downloaded the FASTQ files that were used in the Pilot 3 study and performed the same data processing procedure as that for our samples. For estimation of genotype consistency rate, we downloaded the HapMap II genotype data for these subjects.

Data quality measurement
One measurement of quality for sequence-based SNP calls is whether they can be validated using an alternative genotyping platform. We thus calculated genotype consistency rate between sequence-based and Affymetrix chip-based SNP calls [13] for SNPs overlapping the two platforms for our subjects, and between sequence-based SNP calls and HapMap genotypes for the subjects from the 1000 Genomes Project. We calculated two types of consistency rate: overall consistency and heterozygous SNP consistency, which probably is more informative on the true error rate. Heterozygote consistency rate was computed as the number of heterozygous genotypes consistent between SNP chip and sequencing divided by the number of heterozygous genotypes on SNP chip that had sequence-based calls with GQ ≥ 20. Furthermore, we calculated Ti/Tv ratio as another measure of data quality. The Ti/Tv ratio is around 3.0 for SNPs inside exons and about 2.0 elsewhere [19]; it also differs between synonymous and non-synonymous SNPs [20]. Since the target regions of exome capture kits often cover more than just exons, the Ti/Tv ratio for SNPs inside target regions is between 2.0 and 3.0 with the value depending on the fraction of exons inside target regions. We also compared the Ti/Tv ratios between novel SNPs and the SNPs reported in dbSNP.
For clarity of presentation, we classified bases into three categories: inside a target region (denoted "inside TR"), outside target regions but within 200 bp from the nearest target region (denoted "outside ≤200 bp"), and outside target regions with >200 bp distance from the nearest target region (denoted "outside >200 bp"). The choice of 200 bp was because the insert sizes of the data we analyzed were mostly between 150 and 200; we also used 100 bp as a threshold and similar patterns of results were observed (data not shown). Reads were classified similarly: a read was "inside TR" if at least half of its bases were inside a target region; similarly, a read was "outside ≤ 200 bp" if at least half the bases were within 200 bp from the nearest target region. The length of a read was determined after applying soft clips according to its CIGAR string information in the BAM file [21]. We studied SNP quality and distribution in the three categories defined above. Furthermore, we annotated SNPs "outside > 200 bp" using the functional variant annotation tool ANNOVAR [22].

Quality of SNP chip genotypes
The 22 breast cancer patients sequenced with the Agilent SureSelect capture kit and the 6 samples sequenced with the Illumina TrueSeq capture kit were part of 2776 patients that were genotyped using the Affymetrix SNP 6.0 array in a genome-wide association study; detailed genotyping methods and stringent QC criteria were described in Zheng et al. [13]. The original scan included three quality control samples in each 96-well plate, and the SNP calls showed a very high concordance rate (mean 99.9%; median 100%) for the quality control samples. In addition, 742 SNPs were genotyped using alternative genotyping platforms for a subset of subjects, and they also had a high concordance rate with genotypes obtained from the SNP chip (mean 99.1%; median 99.8%). The SNP chip call rate for the 28 samples investigated here ranged from 97.83% to 99.99%.
As expected, the depth of coverage was the highest for "inside TR" and lowest for "outside >200 bp". For bases around the boundaries of target regions, the average depth follows a clear decreasing pattern as the position moves away from target regions (Figure 1). This is true for all three data sets we analyzed. However, when we focused on reads with MAPQ ≥ 20, the distributions of MAPQ and BQ scores were similar across all three categories of regions we defined (Figure 2). This suggests that after filtering most data outside target regions were as good as those inside target regions. The zigzag pattern in Figure 2 panels (b), (e), and (h) was an artifact of the MAPQ computation algorithm in BWA, which was used to align all the data investigated here.
In our SureSelect data, on average, 97.0% (range 96.5-97.2%) of the sites "inside TR" and 43.8% (38.0-47.7%) of the sites "outside ≤ 200 bp" were covered with at least one read. Of the nearly 3 billion sites "outside >200 bp", an average of 25.5% (14.1-26.1%) were covered with at least one read. Among the sites that had ≥5x depth of coverage, a significant portion still fell outside target regions (Figure 3a): 25.7 million (28.9%) were outside ≤200 bp and 28.2 million (31.7%) were outside >200 bp. The results were similar for our TrueSeq data and the data from the 1000 Genomes Project (Figure 3b,c).
We also examined the GC content of the sequence data we collected. For our SureSelect target regions, the GC content for "inside TR" was 50.6% and dropped to 42.0% for the "outside ≤ 200 bp" regions. For bases "outside > 200 bp" with depth of coverage ≥ 10, the GC content was around 46% (Table 2). Similar patterns were observed for the data sequenced with the Illumina TrueSeq capture kit and by the 1000 Genomes Project ( Table 2). The higher GC content inside target regions is a reflection of the known higher GC content of coding regions.

SNPs inside and outside target regions
We used GQ ≥ 20 and depth ≥ 5 as the thresholds for SNP filtering. For our SureSelect data, we identified 65,231 SNPs inside the target regions, with Ti/Tv ratio 2.81 and heterozygote consistency rate 99.2%. In addition, a total of 84,049 high quality SNPs were identified outside the target regions (Figure 4a,b,c). For "outside ≤ 200 bp" regions, at depth ≥ 5, we observed an average of 44,854 SNPs per subject, with Ti/Tv ratio 2.26; 77.1% of the SNPs were in dbSNP131. The overall consistency rate with array-based SNP calls was 99.5% on  Table 3.
For the six samples sequenced with the Illumina True-Seq capture kit, we identified 95,818 SNPs inside target regions (Ti/Tv ratio 2.30; heterozygote consistency rate 99.6%) and a similar pattern of high quality SNPs outside the target regions (Figure 4d,e,f ). For outside ≤ 200 bp regions, at depth ≥ 5, we observed an average of 117,866 SNPs per subject (Ti/Tv ratio 2.01; 87.4% in dbSNP132; overall consistency rate 99.6%; heterozygote consistency rate 99.6%). For outside >200 bp regions, at depth ≥ 5, we observed an average of 104,305 SNPs per subject (Ti/Tv ratio 1.89; 77.3% in dbSNP132; overall consistency rate 99.2%; heterozygote consistency rate 99.1%).
For the six samples sequenced by the 1000 Genomes Project, we identified 1,548 SNPs inside their target regions (Ti/Tv ratio 3.23; heterozygote consistency rate 99.3%) and many high quality SNPs outside the target regions ( Figure  4g,h,i). For outside ≤ 200 bp regions, at depth ≥ 5, we observed an average of 3,011 SNPs per subject (Ti/Tv ratio 2.23; 92.5% in dbSNP131; overall consistency rate 98.6%; heterozygote consistency rate 99.6%). For outside >200 bp, at depth ≥ 5, we observed an average of 4,128 SNPs per subject (Ti/Tv ratio 1.80; 85.3% in dbSNP131; overall consistency rate 97.2%; heterozygote consistency rate 98.7%). At depth ≥ 20, we observed 1,883 SNPs with Ti/Tv ratio 1.96 and >98% consistency with HapMap data.
We used ANNOVAR to annotate the SNPs identified more than 200 bp away from target regions. The characteristics of those SNPs are summarized in Table 4. For sites with average depth ≥ 5 in our SureSelect data, we observed 6,194 SNPs in introns, 35 within 2 bp of a splicing junction, 4,873 in noncoding RNAs (ncRNAs), and 33,569 intergenic. We were also able to observe many exonic SNPs that were outside the target regions of the SureSelect or TrueSeq capture kit used for generating our data, including 981 non-synonymous SNPs and 25 stop-gain and 4 stop-loss mutations. These potentially functional SNPs could be missed if the investigator would only look at results inside target regions. We also summarized the results for sites with average depth ≥10, and repeated this procedure for our TrueSeq data and the data generated by the 1000 Genomes Project (Table 4).

Discussion
Our samples were blood samples from breast cancer patients. Though they were not a random sample, we expect this to have little impact on the generalizability of the phenomena we observed. We did not evaluate any data generated by sequencers from 454 Life Sciences or Applied Biosystems. However, we expect similar results for these platforms because the SureSelect whole exome capture kit we used was designed to be compatible with all three major platforms of sequencing technology.
We examined if SNP quality could be influenced by artifacts. One artifact of exome capturing is strand imbalance; Figure 2 Distributions of depth, mapping quality score, and base quality score for "Inside TR", "Outside ≤200 bp", and "Outside > 200 bp". that is, the distribution of forward and reverse strands can be heavily uneven at many positions, especially those close to the boundaries of target regions. This phenomenon exists for positions both inside and outside target regions, although it is more extreme outside target regions. In extreme situations, all reads can be on the same strand (examples in Additional file 1: Table S9). Using our Agilent SureSelect data, we selected all positions where the depth was ≥5, all reads were on one strand, and GWAS genotypes were available; 46,069 positions only had forward reads, and 43,145 positions only had reverse reads. We computed genotype consistency rates with GWAS  data for those positions. Both overall and heterozygous consistency rates were above 98.5%. Thus, the strand imbalance had little effect on genotype quality. Another artifact we observed with exome sequencing is that some regions far away from target regions had abnormally high coverage, with depth in the hundreds. We suspected that this high coverage might be a result of strong homology between intergenic regions and exome regions, causing nonspecific binding during the capture process. Using our Agilent SureSelect data we picked 10 regions with very high depth (Additional file10: Table S10) and examined them against the pseudo gene lists from Yale and UCSC (www.pseugenes.com). Only 2 regions overlapped with pseudo genes. We also BLASTed [17] these 10 intervals against the SureSelect target regions and found that most of these intervals had strong homology with sequences inside the target regions (Additional file 1: Table S10). Our analysis results on GC content also suggest that many of the high depth regions "outside > 200 bp" have strong homology with coding regions and thus a higher GC content than introns. Further study is needed to understand the full extent of sequence homology in the human genome.
In addition to SNPs, we found many indels outside target regions that had high quality scores. For our data sequenced with the Agilent SureSelect capture kit, filtered with QUAL Score (reported by GATK's Unified Genotyper) ≥ 1000, 2,383 indels were detected by GATK's Unified Genotyper inside the target regions, 5,344 were outside ≤ 200 bp, and 4,158 were outside > 200 bp. The numbers were 7,211, 9,456, and 7,426 for the six TrueSeq samples, and 10, 273, and 207 for the six samples sequenced by the 1000 Genomes Project. Systematic validation of these indel calls would take a significant amount of effort and is beyond the scope of this paper.
The amount of information outside target regions depends on capture efficiency, which varies across different capturing technologies. With Agilent's SureSelect v1 capturing kit, only 55% of the reads in our data were aligned within target regions; it was about 50% for Illumina's TrueSeq capture kit. In general, older capture kits were less efficient compared to the newer kits. But even the most current capture kit today can only claim to have around 80% capture efficiency [23], while the real capture efficiency is probably lower in practice. We have shown little quality deterioration outside target regions after filtering. Thus for exome and targeted sequencing data, instead of inefficiently limiting ourselves to variants inside target regions, we should also analyze data in any regions with high-quality SNP calls.
Furthermore, exome sequencing has also been used with non-human subjects. Regardless of the organism being sequenced, our findings about data outside target regions should apply. By including SNPs outside target regions, the genomic coverage is effectively improved and the chance of identifying quantitative trait loci is increased.

Conclusion
We studied the quality of SNP calls for positions outside target regions using our whole exome sequencing data and the 1000 Genomes Project Pilot 3 data. These datasets were generated using five different capture kits and three different sets of target regions. The sequencing was also performed at five different facilities. Despite these differences, we observed reliable genotypes for many SNPs outside target regions, some far outside target regions, in all datasets. By analyzing all available sequencing data and applying stringent filtering criteria, we more than doubled the number of high-quality SNP calls in comparison to what we would have if we had just focused on target regions. Given the amount of significant discovery researchers have made in the noncoding regions, these extra SNPs residing outside target regions should not be ignored in data analysis, especially given the fact that sequencing is still relatively expensive.