Performance comparison of four exome capture systems for deep sequencing
© Chilamakuri et al.; licensee BioMed Central Ltd. 2014
Received: 15 November 2013
Accepted: 27 May 2014
Published: 9 June 2014
Recent developments in deep (next-generation) sequencing technologies are significantly impacting medical research. The global analysis of protein coding regions in genomes of interest by whole exome sequencing is a widely used application. Many technologies for exome capture are commercially available; here we compare the performance of four of them: NimbleGen’s SeqCap EZ v3.0, Agilent’s SureSelect v4.0, Illumina’s TruSeq Exome, and Illumina’s Nextera Exome, all applied to the same human tumor DNA sample.
Each capture technology was evaluated for its coverage of different exome databases, target coverage efficiency, GC bias, sensitivity in single nucleotide variant detection, sensitivity in small indel detection, and technical reproducibility. In general, all technologies performed well; however, our data demonstrated small, but consistent differences between the four capture technologies. Illumina technologies cover more bases in coding and untranslated regions. Furthermore, whereas most of the technologies provide reduced coverage in regions with low or high GC content, the Nextera technology tends to bias towards target regions with high GC content.
We show key differences in performance between the four technologies. Our data should help researchers who are planning exome sequencing to select appropriate exome capture technology for their particular application.
Keywords: Exome capture technology; Next-generation sequencing; Coverage efficiency; Enrichment efficiency; GC bias; Single nucleotide variant; Indel
Exome capture technology designs (table; cell values not preserved). Rows: bait length range (bp), median bait length (bp), number of baits, total bait length (Mb), target length range (bp), median target length (bp), number of targets, total target length (Mb). Surviving cell values: "Human, mouse, 3 plant species"; "Human, mouse, 14 other species custom".
Clark et al. compared three capture technologies and showed that NimbleGen technology required the fewest reads to sensitively detect small variants, whereas Agilent and Illumina technologies appeared to detect a higher total number of variants with additional reads. In another study, Sulonen et al. compared NimbleGen and Agilent technologies, and showed that there were no major differences between the two, except that NimbleGen showed greater efficiency in covering the exome with a minimum of 20x coverage. Asan et al. compared NimbleGen Sequence Capture Array, NimbleGen SeqCap EZ, and Agilent SureSelect, and showed that all three technologies achieved a similar accuracy of genotype assignment and single nucleotide polymorphism (SNP) detection, and had similar levels of reproducibility and GC bias. In another exome capture comparison study, Parla et al. showed that NimbleGen SeqCap EZ Exome Library SR and Agilent SureSelect All Exon were similar to each other in performance and able to capture most of the human exons targeted by their probe sets. However, they failed to cover a noteworthy percentage of the exons in the consensus coding sequence database (CCDS).
During the past few years, substantial updates have been made to the different capture technologies, including new content and improved probe design. For instance, NimbleGen’s SeqCap EZ exome library v2.0 targets approximately 44 Mb of the genome, whereas the next version, EZ exome library v3.0, targets 64.1 Mb. To the best of our knowledge, the new Illumina Nextera capture technology has not been tested extensively vis-à-vis other technologies.
The lack of a clear consensus from previous studies, updates in three major capture technologies, and the important new Illumina Nextera capture technology, using an entirely different strategy, motivated us to perform a detailed comparative analysis before initiating a major exome sequencing project.
We, therefore, systematically compared four exome capture technologies, NimbleGen’s SeqCap EZ exome library v3.0, Agilent SureSelect Human all exon V4, Illumina TruSeq and Illumina Nextera, with respect to features such as design differences relative to coverage efficiency, GC bias, and variant discovery.
Distinctive features of four exome capture technologies
Many different RNA databases are available, such as RefSeq and Ensembl, which differ in the number of non-coding RNAs and total number of exons reported, as well as the start and end coordinates of exons. Significant portions of the sequences are common among the different databases (Figure 1B). CCDS contains protein-coding sequences with high-quality annotations. RefSeq and CCDS share a greater proportion of bases with each other, whereas Ensembl possesses more unique bases (2.19 million) than the other two databases. We investigated the coverage of RefSeq (coding and UTR), Ensembl (coding) and CCDS (coding). Illumina covers a greater portion of coding exon bases across all the databases, followed by NimbleGen and Agilent (Figure 1C–E). There are 32.11 Mb common across the three databases, but only about 24 Mb are covered by all four technologies. The majority of Illumina-specific bases (22.5 Mb) target untranslated regions (UTRs) (Figure 1F), whereas NimbleGen and Agilent target UTRs at 9.5 Mb and 5.6 Mb, respectively.
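The database coverage figures above come down to intersecting two sets of genomic intervals: a technology's target regions and a database's exons. The following is a minimal sketch of that calculation for a single chromosome, assuming half-open BED-style coordinates; the function names and the toy intervals are illustrative and are not taken from the study's own scripts.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED files


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals on one chromosome."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: Iterable[Interval]) -> int:
    """Number of distinct bases spanned by the intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def covered_bases(targets: Iterable[Interval], exons: Iterable[Interval]) -> int:
    """Number of exon bases that fall inside at least one target interval."""
    t = merge_intervals(targets)
    e = merge_intervals(exons)
    i = j = covered = 0
    while i < len(t) and j < len(e):
        lo = max(t[i][0], e[j][0])
        hi = min(t[i][1], e[j][1])
        if lo < hi:
            covered += hi - lo
        # Advance whichever interval ends first.
        if t[i][1] < e[j][1]:
            i += 1
        else:
            j += 1
    return covered


# Toy example (hypothetical coordinates, not real targets or exons).
targets = [(100, 200), (150, 300)]
exons = [(50, 120), (250, 400)]
fraction = covered_bases(targets, exons) / total_length(exons)
```

For genome-wide use the same routine would be applied per chromosome and the counts summed; merging first avoids double counting bases that appear in overlapping exon annotations.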
Sequencing, sequence alignment, and read filtering
Target coverage efficiency differs among four technologies
Influence of GC content on coverage
Ability to detect SNVs
We also investigated whether capture technologies showed bias in substitution detection, but none of the technologies showed bias towards specific nucleotide substitutions (Additional file 4: Figure S4 and Additional file 5: Figure S5). Transitions were expected to occur twice as frequently as transversions. The transition-transversion (ts/tv) ratio is a metric for assessing the specificity of new SNP calls. We assessed the ts/tv ratio on the respective target regions of each technology (including non-exonic segments), and it ranged from 2.215 in Nextera to 2.257 in Agilent (Additional file 4: Figure S4). Previous studies have shown ts/tv ratios of ≈ 2.0–2.1 for whole genome datasets. The Nextera and TruSeq technologies showed very similar ts/tv ratios, most likely because of their identical target regions. Agilent and NimbleGen also had very similar ts/tv ratios. The difference in ts/tv ratios between Illumina technologies (TruSeq and Nextera) and non-Illumina technologies (Agilent and NimbleGen) may arise because Illumina technologies target a significantly higher number of UTRs than the other technologies. We also determined the ts/tv ratio in CCDS coding exons (Additional file 5: Figure S5). The ts/tv ratio on CCDS ranges from 3.054 in Nextera to 3.109 in NimbleGen. It has been previously shown that the ts/tv ratio is ≈ 3.0–3.3 for exonic variation.
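To make the ts/tv metric concrete, the sketch below counts transitions (purine to purine or pyrimidine to pyrimidine: A<->G, C<->T) and transversions (all other single-base changes) in a list of reference/alternate allele pairs. The function name and the example calls are hypothetical; this is an illustration of the definition, not the procedure used to produce the ratios reported here.

```python
from typing import Iterable, Tuple

BASES = {"A", "C", "G", "T"}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Return transitions / transversions for (ref, alt) single-base pairs.

    Pairs that are not a substitution between two distinct A/C/G/T bases
    are skipped. Raises ValueError if no transversion is present.
    """
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ts / tv


# Hypothetical calls: three transitions and one transversion.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
ratio = ts_tv_ratio(example)
```

Because each base has one possible transition and two possible transversions, purely random substitutions would give a ratio near 0.5, so a false-positive-rich call set tends to pull the observed ratio down.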
Detection of insertions and deletions
Indel detection in the regions covered by exome databases was also studied (Figure 7C–E). The number of indels detected in exons was significantly lower than the number detected in the respective technology target regions and UTRs. We observed more indels of three or six bases (Additional file 6: Figure S6B), probably because of negative selection in coding sequences against indel sizes that are not multiples of three bases, as these cause deleterious frameshift mutations.
When compared between replicates, both SNVs (Additional file 7: Figure S7 and Additional file 8: Figure S8) and indels (Additional file 9: Figure S9) showed similar trends in the total number of variants detected and very high overlap in newly detected variants.
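The link between indel size and reading frame can be illustrated with a short sketch that classifies indels as in-frame (length a multiple of three) or frameshift, and tabulates their sizes. It assumes VCF-style ref/alt allele strings; the function names and the example alleles are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def indel_length(ref: str, alt: str) -> int:
    """Net number of bases inserted or deleted."""
    return abs(len(alt) - len(ref))


def is_in_frame(ref: str, alt: str) -> bool:
    """True if the indel length is a non-zero multiple of three."""
    n = indel_length(ref, alt)
    return n > 0 and n % 3 == 0


def size_histogram(indels: Iterable[Tuple[str, str]]) -> Counter:
    """Count indels by net length, ignoring same-length substitutions."""
    sizes = (indel_length(ref, alt) for ref, alt in indels)
    return Counter(n for n in sizes if n > 0)


# Hypothetical alleles: a 3-base insertion, a 2-base deletion,
# and a 6-base insertion.
calls = [("A", "ATTT"), ("ACG", "A"), ("A", "AGGGCCC")]
histogram = size_histogram(calls)
in_frame = [is_in_frame(ref, alt) for ref, alt in calls]
```

An excess of the 3 and 6 bins in such a histogram, restricted to coding exons, is the pattern described above.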
Continuous advancement in sequencing technologies increases the throughput of DNA sequencing while sharply decreasing its cost. Although sequencing costs have fallen, whole genome sequencing is still quite expensive, and data interpretation remains challenging. Therefore, whole genome sequencing is not the most appropriate choice for all investigations. The ability to target certain regions of the genome, such as protein- and/or RNA-coding exons, is an attractive alternative for many experiments. In recent times, target enrichment by hybridization has progressed rapidly in both development and usage by the research and diagnostic community.
We present a comparative study of four whole exome capture technologies from three manufacturers, designed to reveal important performance aspects of the technologies. To this end, we studied six parameters for each technology: the portion of target bases representing different exome databases, target coverage efficiency, GC bias, sensitivity in SNV detection, sensitivity in small indel detection, and reproducibility. Although all four exome capture technologies show very high target enrichment efficiency and cover large portions of the exome, only a small portion of the CCDS exome is uniquely covered by each technology (Figure 1C). Therefore, a researcher who is planning exome sequencing should assess which technology best covers the regions of interest to the investigation. Agilent targets the smallest part of the genome with 51.1 Mb, followed by Illumina technologies with 62.08 Mb, and NimbleGen with 64.1 Mb. There are 26.2 Mb of the human genome shared by all four technologies, the majority of which falls in CCDS exonic regions. Illumina not only encompasses far more UTRs, but also shows a higher coverage of RefSeq, CCDS, and Ensembl exome databases, followed by NimbleGen and Agilent.
Target coverage efficiency differs between the four technologies. Using pass-filter reads, Agilent shows higher coverage efficiency than the other technologies, which may be partially explained by the smaller targeted region (51.1 Mb) compared with 64.1 Mb and 62.08 Mb for NimbleGen and Illumina respectively. Among the Illumina technologies, TruSeq gave a more uniform coverage than Nextera, but both had inferior efficiency compared with Agilent. Agilent gives the highest percentage of usable reads (pass-filter reads) (71.7%), closely followed by NimbleGen.
Regardless of high or low target region GC content, there was a negative correlation between sequencing coverage and extreme GC content. Preference for transposon targets with high GC content can help explain non-uniform coverage for the Nextera technology.
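One way to examine the relationship between GC content and coverage is to bin target regions by GC percentage and average the read depth within each bin. The sketch below assumes that each target's sequence and mean depth are already available; the function names, the 10% bin width, and the toy values are illustrative assumptions rather than the analysis performed in this study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def gc_counts(seq: str) -> Tuple[int, int]:
    """Return (number of G/C bases, number of A/C/G/T bases)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return gc, total


def gc_fraction(seq: str) -> float:
    gc, total = gc_counts(seq)
    return gc / total


def mean_depth_by_gc_bin(
    targets: Iterable[Tuple[str, float]], bin_width: int = 10
) -> Dict[int, float]:
    """Average depth per GC bin; keys are the lower bin edge in percent.

    `targets` yields (sequence, mean_depth) pairs. Integer arithmetic is
    used for binning so that boundaries such as 30% are assigned exactly.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, depth in targets:
        gc, total = gc_counts(seq)
        bin_start = (100 * gc) // (bin_width * total) * bin_width
        bin_start = min(bin_start, 100 - bin_width)  # put 100% GC in the top bin
        sums[bin_start] += depth
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Toy targets (hypothetical sequences and depths).
toy = [("AATT", 40.0), ("ATGC", 100.0), ("GGCC", 30.0), ("AGCT", 80.0)]
profile = mean_depth_by_gc_bin(toy)
```

Plotting such a profile per technology would show whether depth falls off at both GC extremes or, as described for Nextera, rises at high GC content.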
Most researchers aiming for exome sequencing, especially in the medical sciences, focus on protein-coding regions. Therefore, the ability to identify SNVs and indels in coding regions is critical to many applications. NimbleGen detects the highest total number of SNVs, followed by the Illumina technologies and Agilent, which correlates with the target size of each technology. However, the number of bases sequenced also has cost and capacity implications. Our results suggest that the Illumina technologies detect more SNVs than the other technologies in the CCDS and RefSeq exomes, owing to their higher coverage of these regions, whereas Agilent was better at detecting indels. We also observed that Nextera shows a clear edge over other technologies in the CCDS and RefSeq exomes, because it covers a larger fraction of these sequences.
We did not observe significant differences in technical reproducibility between the four technologies. However, by comparing the variation between replicates with the differences between technologies observed above, we could conclude that although some differences in SNV and indel detection were due to random experimental error, the major effect appears to be due to technological biases.
Since the comparison is based on a tumor sample, which may contain genomic aberrations that could differentially affect the performance of each technology, we investigated the coverage differences in COSMIC cancer genes. No significant deviation in coverage was observed when compared with global coverage (Figure 3 and Additional file 10: Figure S10).
Another important consideration is that exome capture technologies evolve rapidly. For instance, Agilent recently released the next version of its exome capture, SureSelect Human All Exon V5. Although these versions do differ with regard to the genomic regions they target, about 84% of target region bases overlap. Illumina also has a new version with a smaller targeted panel covering only exons, called Nextera Rapid Capture Exome (37 Mb), while the larger panel version is now named Nextera Expanded Exome (62 Mb). Illumina has also improved the Nextera protocol with the Nextera Rapid kit; this improvement may reduce the GC bias observed here.
In total, our data suggest that all four technologies offer comparable performance. Other factors, such as the DNA content of the targeted regions, the amount of input DNA required, the extent of automation in library construction, and the cost of reagents to reach a certain depth of coverage, need to be considered before selecting the exome capture technology most appropriate for your particular application.
Readers should keep in mind that this study is based on one biological sample with two replicates. The observed technical reproducibility is very high; variability may be higher when two biological replicates are compared.
We systematically evaluated the performance of four whole exome capture technologies, and show that all the exome capture technologies perform well, but do exhibit consistent differences. Illumina covers a greater portion of coding exon bases across all the databases, followed by NimbleGen and Agilent. All the technologies give high coverage of their respective target regions, with the Agilent technology giving the highest coverage (99.8%), followed by Nextera (98.2%), TruSeq (96.9%), and NimbleGen (96.5%) of the intended targets. Nextera shows a sharp increase in read depth for GC content of 60% or higher compared with the other technologies. In common regions covered by all four technologies, Agilent detects a slightly higher number of SNVs, followed by Nextera, TruSeq and NimbleGen. At all read counts, very few indels were common across the four technologies. All technologies give high technical reproducibility. One major limitation is that none of the capture technologies is able to cover all of the exons of the CCDS, RefSeq or Ensembl databases. Our study should help researchers who are planning exome sequencing experiments select the most appropriate technology for their study, without having to perform expensive and time-consuming comparisons.
Sample collection and library preparation
One human osteosarcoma was selected from a tumor collection at the Department of Tumor Biology at the Norwegian Radium Hospital. The tumor was collected immediately after surgery, following written informed consent, cut into small pieces, frozen in liquid nitrogen and stored at −70°C until use.
High quality genomic DNA was isolated using the Promega Wizard Genomic DNA Purification Kit. One μg of genomic DNA was used to produce each exome captured sequencing library for four different technologies: NimbleGen SeqCap EZ v3.0, Agilent SureSelect XT2 Human All Exon v4.0, Illumina TruSeq Exome Enrichment kit and Illumina Nextera Exome Enrichment kit. The exome captured library preparation for the last three technologies was done following the manufacturers’ protocols, applying pre-capture multiplexing. The protocol for NimbleGen SeqCap EZ was adapted from the company’s application note (http://www.nimblegen.com/products/lit/NimbleGen_SeqCap_EZ_SR_Pre-Captured_Multiplexing.pdf). The exome captured sequencing libraries were quality-controlled using an Agilent 2100 Bioanalyzer, and quantified using the Agilent QPCR NGS Library Quantification Kit (Illumina GA) prior to cluster generation on an Illumina cBot.
The human reference genome (hg19), RefSeq, CCDS, and Ensembl databases were downloaded from the UCSC genome table browser (http://genome.ucsc.edu/).
Because of Norwegian legal regulations, the ethical approval for this study and the consent signed by the patient, we are not able to deposit our dataset in a public repository. We will provide access to the data if requested.
Sequencing and bioinformatics data analysis
Briefly, initial FASTQ files were subjected to quality control with the FastQC tool (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Raw reads from each capture library were aligned to the human reference genome (hg19) with Novoalign (http://novocraft.com/), using default parameters. If more than one pair (PE sequencing) had identical start and end coordinates, they were considered PCR duplicates and were removed using in-house scripts. Filtered read counts were normalized to 50 M reads between all four exome capture sequencing experiments by randomly selecting 50 M reads from each filtered read set. These randomly selected sets were further used to select 5–50 M reads, using an increment of 5 M reads.
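The duplicate filtering and read-count normalization steps described above can be sketched as follows. This is not the in-house script used in the study: the record layout, the choice to keep the first pair seen at each coordinate, the independent draw of each subset from the normalized set, and the function names are all assumptions made for illustration, and small counts stand in for the 5 M to 50 M read sets.

```python
import random
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical record for an aligned read pair:
# (chromosome, fragment_start, fragment_end, read_id)
Pair = Tuple[str, int, int, str]


def remove_pcr_duplicates(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep one pair per (chromosome, start, end); drop the rest."""
    seen = set()
    kept: List[Pair] = []
    for chrom, start, end, read_id in pairs:
        key = (chrom, start, end)
        if key not in seen:  # assumption: the first pair seen is retained
            seen.add(key)
            kept.append((chrom, start, end, read_id))
    return kept


def subsample_series(
    reads: Sequence, sizes: Iterable[int], seed: int = 0
) -> Dict[int, list]:
    """Draw a normalized base set, then one random subset per size from it."""
    sizes = list(sizes)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    base = rng.sample(list(reads), max(sizes))
    return {n: rng.sample(base, n) for n in sizes}


# Toy data: two pairs share identical coordinates on chr1.
pairs = [
    ("chr1", 100, 350, "r1"),
    ("chr1", 100, 350, "r2"),
    ("chr1", 120, 350, "r3"),
    ("chr2", 100, 350, "r4"),
]
filtered = remove_pcr_duplicates(pairs)

# Stand-in for 5 M to 50 M reads in 5 M steps: 10 to 50 reads in steps of 10.
subsets = subsample_series(list(range(100)), range(10, 51, 10), seed=1)
```

Drawing every subset from the same normalized set keeps the read-depth series comparable across technologies, since each technology starts from an equally sized pool.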
SNVs and indels were called with GATK. The GATK pipeline was independently run on each data set. We followed the procedure recommended by the GATK documentation. Reads around indels were realigned. To remove systematic biases in quality scores, base quality score recalibration was done. The UnifiedGenotyper algorithm was run using a stand_emit_conf of 10.0 and stand_call_conf of 30.0. All variants with a Phred-based quality score <30.0 were called low quality and ignored.
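The final quality cut-off can be illustrated independently of GATK itself. The sketch below assumes the variant calls are available as tab-delimited VCF text lines, in which the sixth column holds the Phred-scaled QUAL value, and keeps only records with QUAL of at least 30.0. The function names and example records are hypothetical, and this is a post hoc filter written for illustration, not part of the GATK pipeline.

```python
from typing import Iterable, Iterator


def passes_quality(vcf_line: str, min_qual: float = 30.0) -> bool:
    """True if the record's QUAL column (6th field) is >= min_qual."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    if qual == ".":  # missing quality is treated as low quality here
        return False
    return float(qual) >= min_qual


def filter_vcf_lines(lines: Iterable[str], min_qual: float = 30.0) -> Iterator[str]:
    """Yield header lines unchanged and records that pass the threshold."""
    for line in lines:
        if line.startswith("#") or passes_quality(line, min_qual):
            yield line


# Hypothetical records for illustration only.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t45.2\t.\t.",
    "chr1\t200\t.\tC\tT\t12.0\t.\t.",
    "chr1\t300\t.\tG\tGA\t30.0\t.\t.",
    "chr1\t400\t.\tT\tC\t.\t.\t.",
]
kept = list(filter_vcf_lines(vcf))
```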
CCDS: Consensus coding sequence
GATK: Genome analysis toolkit
SNP: Single nucleotide polymorphism
SNV: Single nucleotide variant
PCR: Polymerase chain reaction
We would like to thank Jan Christian Bryn and Russell Castro for help with manuscript preparation. The Research Council of Norway (grant no. 218241/Hl0) and Radiumhospitalets legacy financially supported this study.
- Luikart G, England PR, Tallmon D, Jordan S, Taberlet P: The power and promise of population genomics: from genotyping to genome typing. Nat Rev Genet. 2003, 4: 981-994.
- Yu TW, Chahrour MH, Coulter ME, Jiralerspong S, Okamura-Ikeda K, Ataman B, Schmitz-Abe K, Harmin DA, Adli M, Malik AN, D'Gama AM, Lim ET, Sanders SJ, Mochida GH, Partlow JN, Sunu CM, Felie JM, Rodriguez J, Nasir RH, Ware J, Joseph RM, Hill RS, Kwan BY, Al-Saffar M, Mukaddes NM, Hashmi A, Balkhy S, Gascon GG, Hisama FM, LeClair E, et al: Using whole-exome sequencing to identify inherited causes of autism. Neuron. 2013, 77: 259-273.
- Schuster B, Knies K, Stoepker C, Velleuer E, Friedl R, Gottwald-Muhlhauser B, de Winter JP, Schindler D: Whole exome sequencing reveals uncommon mutations in the recently identified Fanconi anemia gene SLX4/FANCP. Hum Mutat. 2013, 34: 93-96.
- Kalsoom UE, Klopocki E, Wasif N, Tariq M, Khan S, Hecht J, Krawitz P, Mundlos S, Ahmad W: Whole exome sequencing identified a novel zinc-finger gene ZNF141 associated with autosomal recessive postaxial polydactyly type A. J Med Genet. 2013, 50: 47-53.
- Izumi R, Niihori T, Aoki Y, Suzuki N, Kato M, Warita H, Takahashi T, Tateyama M, Nagashima T, Funayama R, Abe K, Nakayama K, Aoki M, Matsubara Y: Exome sequencing identifies a novel TTN mutation in a family with hereditary myopathy with early respiratory failure. J Hum Genet. 2013, 58: 259-266.
- Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA, Middle CM, Rodesch MJ, Packard CJ, Weinstock GM, Gibbs RA: Direct selection of human genomic loci by microarray hybridization. Nat Methods. 2007, 4: 903-905.
- Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, Middle CM, Rodesch MJ, Albert TJ, Hannon GJ, McCombie WR: Genome-wide in situ exon capture for selective resequencing. Nat Genet. 2007, 39: 1522-1527.
- Bainbridge MN, Wang M, Burgess DL, Kovar C, Rodesch MJ, D’Ascenzo M, Kitzman J, Wu YQ, Newsham I, Richmond TA, Jeddeloh JA, Muzny D, Albert TJ: Whole exome capture in solution with 3 Gbp of data. Genome Biol. 2010, 11: R62.
- Marine R, Polson SW, Ravel J, Hatfull G, Russell D, Sullivan M, Syed F, Dumas M, Wommack KE: Evaluation of a transposase protocol for rapid generation of shotgun high-throughput sequencing libraries from nanogram quantities of DNA. Appl Environ Microbiol. 2011, 77: 8071-8079.
- Clark MJ, Chen R, Lam HY, Karczewski KJ, Chen R, Euskirchen G, Butte AJ, Snyder M: Performance comparison of exome DNA sequencing technologies. Nat Biotechnol. 2011, 29: 908-914.
- Sulonen AM, Ellonen P, Almusa H, Lepisto M, Eldfors S, Hannula S, Miettinen T, Tyynismaa H, Salo P, Heckman C, Joensuu H, Raivio T, Suomalainen A, Saarela J: Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol. 2011, 12: R94.
- Asan, Xu Y, Jiang H, Tyler-Smith C, Xue Y, Jiang T, Wang J, Wu M, Liu X, Tian G, Wang J, Wang J, Yang H, Zhang X: Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol. 2011, 12: R95.
- Parla JS, Iossifov I, Grabill I, Spector MS, Kramer M, McCombie WR: A comparative analysis of exome capture. Genome Biol. 2011, 12: R97.
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35: D61-D65.
- Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, Gordon L, Hendrix M, Hourlier T, Johnson N, Kähäri AK, Keefe D, Keenan S, Kinsella R, Komorowska M, Koscielny G, Kulesha E, Larsson P, Longden I, McLaren W, Muffato M, Overduin B, Pignatelli M, Pritchard B, Riat HS, et al: Ensembl 2012. Nucleic Acids Res. 2012, 40: D84-D90.
- Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, et al: The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009, 19: 1316-1323.
- Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009, 27: 182-189.
- Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12: R18.
- Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 2000, 28: 4552-4557.
- Green B, Bouchier C, Fairhead C, Craig NL, Cormack BP: Insertion site preference of Mu, Tn5, and Tn7 transposons. Mob DNA. 2012, 3: 3.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20: 1297-1303.
- Ebersberger I, Metzler D, Schwarz C, Paabo S: Genomewide comparison of DNA sequences between humans and chimpanzees. Am J Hum Genet. 2002, 70: 1490-1497.
- Freudenberg-Hua Y, Freudenberg J, Kluck N, Cichon S, Propping P, Nothen MM: Single nucleotide variation analysis in 65 candidate genes for CNS disorders in a representative sample of the European population. Genome Res. 2003, 13: 2271-2276.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.