A systematic evaluation of hybridization-based mouse exome capture system
© Gao et al.; licensee BioMed Central Ltd. 2013
Received: 30 January 2013
Accepted: 19 July 2013
Published: 21 July 2013
Exome sequencing is increasingly used to search for phenotypically-relevant sequence variants in the mouse genome. All of the current hybridization-based mouse exome capture systems are designed based on the genome reference sequences of the C57BL/6 J strain. Given that substantial sequence divergence exists between C57BL/6 J and other distantly-related strains, the impact of sequence divergence on the efficiency of such capture systems needs to be systematically evaluated before they can be widely applied to the study of those strains.
Using the Agilent SureSelect mouse exome capture system, we performed exome sequencing on F1 generation hybrid mice that were derived by crossing two divergent strains, C57BL/6 J and SPRET/EiJ. Our results showed that the C57BL/6 J-based probes captured the sequences derived from C57BL/6 J alleles more efficiently and that the bias was higher for the target regions with greater sequence divergence. At low sequencing depths, the bias also affected the efficiency of variant detection. However, the effects became negligible when sufficient sequencing depth was achieved.
When the C57BL/6 J-based Agilent SureSelect exome capture system is used, sufficient sequencing depth needs to be planned to match the sequence divergence between C57BL/6 J and the strain to be studied.
Keywords: Mouse exome capture system; Sequence divergence; Capture bias; Efficiency of variant detection
Massively parallel sequencing has revolutionized the search for sequence variants in the human genome as well as in those of model organisms. Compared with the Sanger method, these so-called next-generation sequencing platforms can sequence DNA much faster and at a much lower cost [1, 2]. The resequencing of an entire human genome can now be achieved within days at the cost of a few thousand dollars. In spite of such dramatically improved performance, the current technology does not allow routine screening of the complete genome for a large number of samples in an economically efficient manner. Even when the cost of whole genome sequencing breaks the much-anticipated $1000 threshold, the necessary computational workload could still remain burdensome and limit its widespread implementation outside major genome centres. In comparison, targeted sequencing is less expensive and generates datasets several orders of magnitude smaller [4–8]. In recent years, exome sequencing (i.e. the sequencing of the full complement of protein-coding exons in the genome) has therefore become a favoured approach in identifying sequence variants causing rare as well as common human diseases [9–11].
Over the last few decades, the mouse has emerged as a preeminent model organism for exploring human biology. Since over 90% of known mouse genes have an orthologue in the human genome, identification of genetic defects responsible for certain phenotypes in mice can directly indicate the genes involved in human diseases. Recently, given the success of exome sequencing in identifying human disease-causing variants, hybridization-based exome capture systems have been developed for the mouse and are now commercially available [12, 13]. The current design of mouse exome capture probes is based on the genome reference sequences of the C57BL/6 J strain, and the robustness of applying such systems to other strains has been shown in a study that mapped putative N-ethyl-N-nitrosourea (ENU)-induced mutations in four different inbred Mus musculus strains. However, the marginal differences between the genomes of C57BL/6 J and the other strains used in that study do not allow for a systematic evaluation of the effects of sequence divergence on the efficiency of sequence capture and variant detection. In addition, the effect of sequence divergence could be even greater in a mixed genetic background.
The genome of the SPRET/EiJ strain has recently been sequenced. Compared with that of C57BL/6 J, the genome of SPRET/EiJ contains about 35.4 million single nucleotide variants (SNVs) and 4.5 million insertions and deletions (indels). In this study, to gain a better understanding of how sequence divergence could affect the efficiency of the current exome capture system, especially in a mixed genetic background, we performed exome sequencing in an F1 hybrid mouse generated by crossing these two highly divergent strains. After mapping the sequencing reads separately to the genomes of the two parental strains, we observed that the probes captured the sequences derived from C57BL/6 J alleles more efficiently and that the bias was higher for target regions with greater sequence divergence. Such bias also reduced the efficiency of variant detection, which could be counteracted by higher sequencing depth.
Results and discussion
Summary of mapping results from exome sequencing, WGS and simulated exome sequencing
Columns (read counts in millions): Exome 1; Exome 2; Exome 3
Rows:
- No. of total reads
- No. of reads uniquely mapped to either genome
- No. of non-redundant reads mapped to the targets on either genome
- No. of non-redundant reads mapped only to the targets from C57BL/6 J
- No. of non-redundant reads mapped only to the targets from SPRET/EiJ
Two limitations of this study should be noted. First, to avoid the effect of mapping bias, we focused our analysis only on the targets containing no more than 6 SNVs and 4 indels. As shown in the previous section, the greater the sequence divergence, the higher the capture bias observed. It is therefore conceivable that the efficiency of variant detection could be even lower for targets with higher sequence divergence. Second, we used the capture system to detect heterozygous variants in an F1 hybrid mouse. For the many research projects that work on pure inbred strains and/or search only for homozygous variants, the effect of sequence divergence might be more subtle than what has been demonstrated here.
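The interplay between capture bias and sequencing depth described above can be illustrated with a deliberately simplified model. The sketch below is an assumption for illustration only and is not the variant-calling procedure used in this study: it treats the reads covering a heterozygous site as independent draws, with `alt_fraction` being the share of captured reads that carry the non-reference allele (0.5 without capture bias, lower when the C57BL/6 J allele is captured preferentially), and it calls a variant "detectable" when at least `min_alt_reads` reads support it. Both parameter names and the threshold rule are hypothetical.

```python
from math import comb


def detection_probability(depth: int, alt_fraction: float, min_alt_reads: int) -> float:
    """Probability that at least `min_alt_reads` of `depth` reads carry the
    variant allele, assuming reads are independent Bernoulli draws with
    success probability `alt_fraction` (a simplifying assumption)."""
    # Probability of observing fewer than min_alt_reads variant-supporting reads.
    p_too_few = sum(
        comb(depth, k) * alt_fraction ** k * (1.0 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few


if __name__ == "__main__":
    # Compare an unbiased site with a site where capture favours the reference allele.
    for depth in (10, 20, 40, 60):
        unbiased = detection_probability(depth, 0.5, 3)
        biased = detection_probability(depth, 0.3, 3)
        print(f"depth={depth:3d}  unbiased={unbiased:.4f}  biased={biased:.4f}")
```

Under this toy model a lower `alt_fraction` reduces the detection probability at a given depth, and the gap narrows as depth increases, which is qualitatively consistent with the observation that the effect of capture bias on variant detection can be offset by deeper sequencing.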
With the enormous progress in the field of human genetics, exome capture systems have been developed to search for phenotypically-relevant mutations in mice. Unlike the human genome, in which sequence differences among individuals are rather limited, the sequence divergence between different mouse strains can be substantial. However, it had not been extensively explored whether this sequence divergence could affect the efficiency of hybridization-based capture systems designed on a reference genome (C57BL/6 J) when they are applied to the study of distantly-related strains. In this study, we performed exome capture and sequencing on an F1 hybrid mouse generated by crossing the C57BL/6 J and SPRET/EiJ strains. Our results clearly demonstrated that the probes captured the sequences derived from C57BL/6 J alleles more efficiently and that the bias increased for the target regions with higher sequence divergence. This bias also affected the efficiency of variant detection. The effect, however, could be counteracted by increasing sequencing depth. For example, to achieve a 99% detection sensitivity, an average sequencing depth of 50 or 60 would be required for the regions containing one or four SNVs, respectively. Therefore, in the design of exome sequencing in different mouse strain backgrounds, sufficient sequencing depth needs to be planned to match the sequence divergence between the strain and C57BL/6 J.
Exome capture and sequencing
Whole exome sequencing libraries were prepared using an Agilent SureSelect XT Mouse All Exon Kit. Genomic DNA samples extracted from the liver of a female F1 hybrid mouse (derived from crossing C57BL/6 J and SPRET/EiJ) were used to generate Illumina paired-end precapture libraries according to the manufacturer’s protocol (Agilent). After determining concentration and quality, 3 μg genomic DNA was sheared into fragments with an average length between 150 and 200 bp using a Covaris S2 system (Covaris). The fragmented DNA was end-repaired in 100 μl total reaction volume containing 48 μl sheared DNA, 10 μl 10X buffer, 1.6 μl dNTP, 1 μl T4 DNA polymerase, 2 μl Klenow DNA polymerase and 2.2 μl T4 Polynucleotide Kinase at 20°C for 30 minutes. A-tailing was performed in a total reaction volume of 50 μl containing end-repaired DNA, 5 μl 10X buffer, 3 μl Klenow Fragment, 1 μl dATP and 11 μl water at 37°C for 30 minutes. Illumina adapter ligation was performed in a total reaction volume of 50 μl containing 10 μl 5X buffer, 10 μl ligase and 10 μl adaptor oligo mix at room temperature for 30 minutes. After ligation, PCR with Illumina PE 1.0 and SureSelect GA Indexing Pre Capture PCR Reverse Primer was performed in 50 μl reactions containing 10 μl 5X Herculase II Rxn Buffer, adaptor-ligated DNA, 1.25 μl of each primer, 0.5 μl 100 mM dNTP mix and 1 μl Herculase II Fusion DNA Polymerase. The standard thermocycling for PCR was 2 minutes at 98°C for the initial denaturation, followed by 6 cycles of 30 seconds at 98°C, 30 seconds at 65°C and 60 seconds at 72°C, and a final extension for 10 minutes at 72°C. Agencourt® XP® Beads (Beckman Coulter Genomics) were used to purify DNA after each enzymatic reaction. After bead purification, the amount and size of PCR products were determined using an Agilent 2100 Bioanalyzer (Agilent) and Qubit (Life Technologies).
Then, 500 ng of Illumina paired-end precapture library DNA was hybridized to Agilent SureSelect mouse exome capture probes according to the manufacturer’s specifications. After assessing the quality of the capture libraries using an Agilent 2100 Bioanalyzer, the captured library was sequenced using an Illumina HiSeq 2000 system according to the manufacturer’s protocol. Each library was sequenced in a separate lane in a 101 nt single-end sequencing format.
Whole genome sequencing (WGS)
An Illumina paired-end sequencing library was generated from the same genomic DNA used for exome sequencing (see above) according to the manufacturer’s protocol (Illumina). The procedure was the same as that used to construct the precapture library described above. The library was sequenced in seven lanes in a 2 x 100 nt paired-end sequencing format using an Illumina HiSeq 2000 system according to the manufacturer’s protocol. To keep the data comparable with the single-end format of exome sequencing, we analyzed only the first reads from WGS.
Mapping orthologous target regions in the SPRET/EiJ genome
The Agilent SureSelect XT Mouse All Exon Kit was designed based on the C57BL/6 J genome (UCSC mm9). The BED file containing all the target regions on the C57BL/6 J genome was downloaded from the Agilent website (http://www.genomics.agilent.com). The C57BL/6 J sequences of all the targets, except those from chrY and the mitochondrial genome, were downloaded from the UCSC genome browser (mm9). The SPRET/EiJ genome sequences were downloaded from the Sanger Institute (http://www.sanger.ac.uk/resources/mouse/genomes/). The C57BL/6 J sequences were then aligned to the SPRET/EiJ genome using the BLAST tool with default parameters. Only the targets that could be uniquely mapped to the same chromosome as in the C57BL/6 J genome were retained. In addition, we excluded the targets that did not lie in the same 5’-to-3’ order relative to their neighbouring targets as in the C57BL/6 J genome. Each of the remaining targets was then extended by 100 nt at both the 5’ and 3’ ends. Targets whose extension reached regions without reference sequence (shown as ‘N’s in the genome reference sequences) in either genome were discarded.
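The last two filtering steps (extension by 100 nt and removal of targets that reach unsequenced regions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the `Target` type, the function names, the use of 0-based half-open BED-style coordinates, and the in-memory `genome` dictionary (chromosome name to sequence string) are all hypothetical, and discarding targets whose extension would run past a chromosome end is an added assumption.

```python
from typing import Dict, List, NamedTuple, Optional

FLANK = 100  # nt added at both the 5' and 3' ends, as described in the text


class Target(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention, an assumption)
    end: int    # 0-based, exclusive


def extend_target(target: Target, chrom_length: int, flank: int = FLANK) -> Optional[Target]:
    """Extend a target by `flank` nt on each side.

    Returns None if the extension would run past a chromosome end
    (an assumption; the text only mentions regions shown as 'N')."""
    start = target.start - flank
    end = target.end + flank
    if start < 0 or end > chrom_length:
        return None
    return Target(target.chrom, start, end)


def keep_extended_targets(targets: List[Target], genome: Dict[str, str]) -> List[Target]:
    """Extend every target and drop those whose extended region contains 'N'."""
    kept = []
    for target in targets:
        sequence = genome[target.chrom]
        extended = extend_target(target, len(sequence))
        if extended is None:
            continue
        if "N" in sequence[extended.start:extended.end].upper():
            continue
        kept.append(extended)
    return kept
```

Because a target is discarded when its extension reaches ‘N’s in either genome, such a filter would be applied once per genome with the corresponding coordinates, keeping only the targets that pass in both.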
Simulation of exome sequencing
1) Randomly select one target i. Orthologous targets from the C57BL/6 J and SPRET/EiJ genomes were treated as distinct targets. The probability of choosing target i was its length divided by the cumulative length of all 809,984 (2 × 404,992) targets.
2) Randomly select a start position j of a sequencing read within target i. Each position between (i_start − 100) and i_end (i_start, the start position of target i; i_end, the end position of target i) was chosen with equal probability. The end position of the sequencing read was then j + 100.
3) Randomly select the strand. The forward or reverse strand was chosen with equal probability.
4) Extract the sequence of the read from either the C57BL/6 J or the SPRET/EiJ genome according to 2) and 3). The quality scores for all bases were set to 40 (the maximum in the Sanger scale).
5) Repeat steps 1 to 4 75 million times to generate 75 million sequencing reads.
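The sampling procedure above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the original simulation code: the `Target` type, the genome labels, the nested `genomes` dictionary and the function names are hypothetical, coordinates are taken as 0-based inclusive, and every target is assumed to have at least 100 nt of sequence on both sides (as guaranteed by the target filtering described earlier).

```python
import random
from typing import Dict, List, NamedTuple, Tuple

READ_LENGTH = 101   # a read spans positions j .. j + 100
FLANK = 100         # reads may start up to 100 nt upstream of the target
BASE_QUALITY = 40   # Phred score assigned to every simulated base


class Target(NamedTuple):
    genome: str  # hypothetical label, e.g. "B6" or "SPRET"
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, inclusive


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_reads(targets: List[Target],
                   genomes: Dict[str, Dict[str, str]],
                   n_reads: int,
                   rng: random.Random) -> List[Tuple[str, str]]:
    """Return (sequence, quality string) pairs sampled as in steps 1 to 5."""
    # Step 1: targets are drawn with probability proportional to their length.
    weights = [t.end - t.start + 1 for t in targets]
    chosen = rng.choices(targets, weights=weights, k=n_reads)
    reads = []
    for target in chosen:
        # Step 2: uniform start position between (start - 100) and end, inclusive.
        j = rng.randint(target.start - FLANK, target.end)
        # Step 4: extract the read from the genome the target belongs to.
        seq = genomes[target.genome][target.chrom][j:j + READ_LENGTH]
        # Step 3: forward or reverse strand with equal probability.
        if rng.random() < 0.5:
            seq = reverse_complement(seq)
        # Phred 40 in Sanger FASTQ encoding is chr(33 + 40), i.e. 'I'.
        reads.append((seq, chr(33 + BASE_QUALITY) * len(seq)))
    return reads
```

Passing a seeded `random.Random` instance keeps the simulated data set reproducible; writing the pairs out as FASTQ records would only require adding a read identifier line and a `+` separator line.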
Evaluation of allelic bias in capture efficiency
Raw sequencing reads in FASTQ format from exome sequencing, WGS or the simulated datasets were aligned separately to both the C57BL/6 J and SPRET/EiJ genomes with the Burrows-Wheeler Aligner (BWA) (Additional file 4). Reads that could be mapped to multiple positions in one or both genomes were discarded. Reads that could be mapped to both genomes with the same edit distance were also excluded from the analysis of allelic bias in capture efficiency, since their allelic origins could not be determined. PCR duplicates were removed with Picard MarkDuplicates (http://picard.sourceforge.net) (Additional file 4). Of the remaining reads, if a read could be mapped to only one genome, that genome was taken as its allelic origin; if a read could be mapped to both genomes with different edit distances, the strain with the smaller edit distance was assigned as the allelic origin. To evaluate allelic bias in capture efficiency, for each target, the number of sequencing reads overlapping the target on the C57BL/6 J genome was compared with the number of overlapping sequencing reads derived from the SPRET/EiJ genome. The sequencing depth of each nucleotide position was defined as the number of sequencing reads overlapping that position. The sequencing depth of a target region was calculated as the mean depth of all nucleotide positions within the target.
Raw sequencing reads from exome sequencing, WGS or simulation were aligned to the C57BL/6 J genome using BWA with the same parameters as described in “Evaluation of allelic bias in capture efficiency”. PCR duplicates were removed with Picard MarkDuplicates with the same parameters as described in “Evaluation of allelic bias in capture efficiency”. The Genome Analysis Toolkit (GATK) variant calling pipeline (version 2.2.15) was then used to identify SNVs in each data set separately. In brief, the RealignerTargetCreator and IndelRealigner tools were used for local realignment around indels, the BaseRecalibrator and PrintReads tools were used for quality score recalibration, the UnifiedGenotyper tool was used for variant calling, and the VariantRecalibrator and ApplyRecalibration tools were used for variant quality score recalibration. All the commands are listed in Additional file 4.
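The rule for assigning allelic origin and the definition of target depth can be written down compactly. The sketch below is an illustration under assumptions rather than the original analysis code: it presumes that multi-mapping reads and PCR duplicates have already been removed, that the edit distance of each unique alignment is available as an integer (or `None` when the read did not map to that genome), and that intervals are 0-based and half-open. The function names and strain labels are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def assign_allelic_origin(edit_b6: Optional[int], edit_spret: Optional[int]) -> Optional[str]:
    """Assign a read to a parental genome from its two edit distances.

    Returns "B6", "SPRET", or None when the origin cannot be determined
    (unmapped in both genomes, or equal edit distance in both)."""
    if edit_b6 is None and edit_spret is None:
        return None
    if edit_spret is None:
        return "B6"          # mapped only to the C57BL/6 J genome
    if edit_b6 is None:
        return "SPRET"       # mapped only to the SPRET/EiJ genome
    if edit_b6 == edit_spret:
        return None          # ambiguous: excluded from the bias analysis
    return "B6" if edit_b6 < edit_spret else "SPRET"


def mean_target_depth(read_intervals: Iterable[Tuple[int, int]],
                      target_start: int, target_end: int) -> float:
    """Mean per-base depth over the target [target_start, target_end).

    Summing the overlap of each read with the target equals summing the
    per-position depths, so dividing by the target length gives the mean."""
    covered_bases = 0
    for read_start, read_end in read_intervals:
        overlap = min(read_end, target_end) - max(read_start, target_start)
        covered_bases += max(0, overlap)
    return covered_bases / (target_end - target_start)
```

For example, a read aligned with edit distance 0 to C57BL/6 J and 3 to SPRET/EiJ would be assigned to C57BL/6 J, whereas a read with edit distance 2 in both genomes would be left out of the per-target allele counts.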
Animal ethics statement
Mice were housed and maintained in a temperature-controlled environment with a 12-hour light/dark cycle and ad libitum access to regular chow, in accordance with the requirements established by the Landesamt für Gesundheit und Soziales (Lageso). All experimental procedures were approved under protocol T 0436/08.
Availability of supporting data
The data sets supporting the results of this article are available in the European Nucleotide Archive (accession number ERP002193).
SNV: Single nucleotide variant
Indel: Insertion and deletion
WGS: Whole genome sequencing
We thank Mirjam Feldkamp and Claudia Langnick for their excellent technical assistance. We thank Dr. Andreas Gogol-Döring for helpful discussions, and Dr. Matthew Poy and Jennifer Stewart for critically reading the manuscript. We thank Dr. Jean Jaubert and Dr. Xavier Montagutelli from the Pasteur Institute for providing F1 hybrid mice. As part of the Berlin Institute for Medical Systems Biology at the MDC, the research group of Wei Chen is funded by the Federal Ministry for Education and Research (BMBF) and the Senate of Berlin, Berlin, Germany (BIMSB 0315362A, 0315362C). Q.G. and W.S. are supported by the Chinese Scholarships Council (CSC).
- Mardis ER: Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008, 9: 387-402. 10.1146/annurev.genom.9.081307.164359.
- Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol. 2008, 26: 1135-1145. 10.1038/nbt1486.
- Olson M: Enrichment of super-sized resequencing targets from the human genome. Nat Methods. 2007, 4: 891-892. 10.1038/nmeth1107-891.
- Summerer D: Enabling technologies of genomic-scale sequence enrichment for targeted high-throughput sequencing. Genomics. 2009, 94: 363-368. 10.1016/j.ygeno.2009.08.012.
- Najmabadi H, Hu H, Garshasbi M, Zemojtel T, Abedini SS, Chen W, Hosseini M, Behjati F, Haas S, Jamali P: Deep sequencing reveals 50 novel genes for recessive cognitive disorders. Nature. 2011, 478: 57-63. 10.1038/nature10423.
- Hu H, Wrogemann K, Kalscheuer V, Tzschach A, Richard H, Haas SA, Menzel C, Bienek M, Froyen G, Raynaud M: Mutation screening in 86 known X-linked mental retardation genes by droplet-based multiplex PCR and massive parallel sequencing. HUGO J. 2009, 3: 41-49. 10.1007/s11568-010-9137-y.
- Hu H, Eggers K, Chen W, Garshasbi M, Motazacker MM, Wrogemann K, Kahrizi K, Tzschach A, Hosseini M, Bahman I: ST3GAL3 mutations impair the development of higher cognitive functions. Am J Hum Genet. 2011, 89: 407-414. 10.1016/j.ajhg.2011.08.008.
- Kahrizi K, Hu CH, Garshasbi M, Abedini SS, Ghadami S, Kariminejad R, Ullmann R, Chen W, Ropers HH, Kuss AW: Next generation sequencing in a family with autosomal recessive Kahrizi syndrome (OMIM 612713) reveals a homozygous frameshift mutation in SRD5A3. Eur J Hum Genet. 2011, 19: 115-117. 10.1038/ejhg.2010.132.
- Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J: Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011, 12: 745-755. 10.1038/nrg3031.
- Gilissen C, Hoischen A, Brunner HG, Veltman JA: Unlocking Mendelian disease using exome sequencing. Genome Biol. 2011, 12: 228. 10.1186/gb-2011-12-9-228.
- Kiezun A, Garimella K, Do R, Stitziel NO, Neale BM, McLaren PJ, Gupta N, Sklar P, Sullivan PF, Moran JL: Exome sequencing and the genetic basis of complex traits. Nat Genet. 2012, 44: 623-630. 10.1038/ng.2303.
- Fairfield H, Gilbert GJ, Barter M, Corrigan RR, Curtain M, Ding YM, D’Ascenzo M, Gerhardt DJ, He C, Huang WH: Mutation discovery in mice by whole exome sequencing. Genome Biol. 2011, 12: R86. 10.1186/gb-2011-12-9-r86.
- Agilent SureSelect Mouse All Exon Kits-Details and Specifications. [http://www.genomics.agilent.com/article.jsp?crumbAction=push&pageId=3102]
- Keane TM, Goodstadt L, Danecek P, White MA, Wong K, Yalcin B, Heger A, Agam A, Slater G, Goodson M: Mouse genomic variation and its effect on phenotypes and gene regulation. Nature. 2011, 477: 289-294. 10.1038/nature10413.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20: 1297-1303. 10.1101/gr.107524.110.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.