- Research article
- Open Access
Replicate exome-sequencing in a multiple-generation family: improved interpretation of next-generation sequencing data
© Cherukuri et al. 2015
- Received: 6 October 2014
- Accepted: 16 October 2015
- Published: 25 November 2015
Whole-exome sequencing (WES) is rapidly evolving into a tool of choice for rapid, and inexpensive identification of molecular genetic lesions within targeted regions of the human genome. While biases in WES coverage of nucleotides in targeted regions are recognized, it is not well understood how repetition of WES improves the interpretation of sequencing results in a clinical diagnostic setting.
To address this, we compared independently generated exome-capture of six individuals from three-generations sequenced in triplicate. This generated between 48x-86x mean target depth of high-quality mapped bases (>Q20) for each technical replicate library. Cumulatively, we achieved 179 - 208x average target coverage for each individual in the pedigree. Using this experimental design, we evaluated stochastics in WES interpretation, genotyping sensitivity, and accuracy to detect de novo variants.
In this study, we show that repetition of WES improved the interpretation of the capture target regions after aggregating the data (93.5 - 93.9 %). Compared to 81.2 - 89.6 % (50.2-55.4 Mb of 61.7 M) coverage of targeted bases at ≥20x in the individual technical replicates, the aggregated data covered 93.5 - 93.9 % of targeted bases (57.7 – 58.0 of 61.7 M) at ≥20x threshold, suggesting a 4.3 – 12.7 % improvement in coverage. Each individual’s aggregate dataset recovered 3.4 – 6.4 million bases within variable targeted regions. We uncovered technical variability (2-5 %) inherent to WES technique. We also show improved interpretation in assessing clinically important regions that lack interpretation under current conditions, affecting 12–16 of the 56 genes recommended for secondary analysis by American College of Medical Genetics (ACMG). We demonstrate that comparing technical replicate WES datasets and their derived aggregate data can effectively address overall WES genotyping discrepancies.
We describe a method to evaluate the reproducibility and stochastics in exome library preparation, and delineate the advantages of aggregating the data derived from technical replicates. The implications of this study are directly applicable to improved experimental design and provide an opportunity to rapidly, efficiently, and accurately arrive at reliable candidate nucleotide variants.
- Exome Sequencing
- Local Polynomial Regression
- Exome Data
- Input Sequence Data
- Clinical Exome Sequencing
Whole-exome sequencing (WES) is becoming a rapid and cost-effective molecular diagnostic tool in individuals with genetic diseases [1–5]. Recent reports demonstrate WES’ utility in both clinical [6–8] as well as basic genetics research [9–11]. With growing demand for WES and drop in costs of next-generation sequencing (NGS), WES as a technique requires greater understanding of how experimental design can improve data interpretation and thereby biological outcomes.
Inherent within WES and NGS, however, is much heterogeneity and bias in mean number of times a targeted nucleotide base is sequenced . This heterogeneity in sequencing depth arises due to numerous factors, such as target-enrichment kit used [13, 14], target sequence GC bias , PCR amplification bias , repeats and pseudo-genes , and other experimental design variables. These factors directly and systematically influence sensitivity of WES [14, 17, 18]. While evolving versions of exome-enrichment kits continue to address these biases, effects of technical replicate experimental design in pedigree-based WES are poorly understood.
Pedigree-based WES approaches facilitate the discovery of not only de novo variants [19–21], but also multiple-inherited variants [22, 23]. Overall, clinical exome sequencing studies report significantly higher molecular diagnostic yield for pedigree-based approaches compared to single-proband sequencing  stressing the importance of understanding WES’s performance in multiple-generation families.
Here, we present an investigation on estimating the proportion of WES target sequence coverage biases that can be eliminated by repetition of the procedure in all individuals in a multiple-generation family. Specifically, we directly compare independent exome-capture libraries generating 18 technical replicates in 6 members of a multiple-generation family (3 per individual). In this study, we evaluated variability and interpreted targeted sequencing within targeted exome regions. Overall, our work reveals the advantages of technical replicate pedigree-based WES in multiple generations, specifically in relation to interpretation of WES derived genotypes [7, 24, 25].
Direct comparisons of exome-capture samples in triplicate
Whole exome sequencing mean target coverage depth and percent target coverage statistics of replicate and aggregate data
Mean target depth
% Target ≥1x
% Target ≥10x
% Target ≥20x
Mean ± SE
96.3 ± 0.1
91.9 ± 0.4
87.4 ± 0.8
Mean ± SE
96.4 ± 0.1
91.8 ± 0.4
86.7 ± 1.0
Mean ± SE
96.3 ± 0.3
91.7 ± 1.2
86.8 ± 3.0
Mean ± SE
96.4 ± 0.2
91.6 ± 1.3
86.1 ± 3.2
Mean ± SE
96.2 ± 0.2
91.2 ± 0.8
85.4 ± 2.9
Mean ± SE
96.2 ± 0.2
91.0 ± 1.0
84.2 ± 3.3
Technical replicate data were compared directly for each individual (Fig. 1a). The mean target-depth of sequencing varied linearly with the input of total sequence data and was evident for all technical replicates derived from the six individuals (Fig. 1b; Table 1). The most variable technical replicate depth of sequencing results was from individual ID4385 (51x-86x) and the least variable was from individual ID3866 (69x-71x). Upon aggregation of data from all technical replicates for each individual, the depths of coverage were 179 - 208x for targeted regions (Fig. 1b).
Stochastics in capture and sequencing can be estimated by replicate libraries
Because the above results for coverage at ≥20x were theoretically dependent on sequencing input quantity, we repeated the analysis at near ‘predicted’ saturation of capture and sequencability (≥1x coverage). Only 1.9 % of targeted bases were variable among the technical replicates. Specifically, 58,994,725 bases (95 %) were sequenced at ≥1x coverage, whereas 1,203,131 bases (1.9 %) were sequenced in at least one but not all three technical replicates (not significant) (Fig. 2c). This suggested that stochastic variability among these technical replicates contributes little to overall sequencability (≥1x) although it had an appreciable affect on usable (≥20x coverage) sequence data.
Cumulative technical replicate sequencing improves targeted sequence interpretation
Of the variability within WES target capture regions, 2-3 % arose within protein-coding regions at ≥20x depth of sequencing threshold . To understand whether or not deep sequencing addresses stochastic variability and benefits achieve theoretical maximum coverage (>95 % of targeted bases), we merged the three technical replicate bam files from each subject to generate a single bam file (Additional file 1: Table S1). For each aggregate data set, 20–22 Gb high quality reads covered targeted WES regions; each targeted base had an average of 179 - 208x coverage. Compared to 81.2 - 89.6 % (50.2-55.4 Mb of 61.7 M) coverage of targeted bases at ≥20x in the individual technical replicates, the merged data covered 93.5 - 93.9 % of targeted bases (57.7 – 58.0 of 61.7 M) at ≥20x, suggesting a 4.3 – 12.7 % improvement in coverage (Fig. 3a, Table 1). Each individual’s aggregate dataset recovered 3.4 – 6.4 million bases of variable targeted region.
The distribution of consecutive targeted bases recovered to ≥20x sequencing depth followed a power-law distribution (Fig. 3b). Aggregate data recovered 117,913-170,667 singleton target-base positions (Fig. 3b); the average-size of consecutive bases recovered was ~50 bp (Fig. 3b). In each individual, we identified 17,132 - 40,726 segments greater than 50 bp. We then intersected regions greater than 50 bp with UCSC known-gene protein-coding exons. Intersecting UCSC known-gene coding bases with recovered regions revealed that 8,156 - 22,868 regions (0.5 – 2 Mb) overlapped protein-coding regions, including 12–16 of the 56 genes that the American College of Medical Genetics (ACMG) recommended for return of incidental findings in clinical sequencing . Figure 3c illustrates that current depths of sequencing consistently fail to meet 20x coverage at clinically important sites. For example c.16C > T variant (p.Q6X; chr1: 156,084,725; hg19) in LMNA, a cause of autosomal dominant Emery-Dreifuss muscular dystrophy (EMD) . This illustrates how aggregate deep sequencing may help recover variable regions to 20x or greater depth of coverage. Taken together, this analysis not only revealed the advantage of technical replicate sequencing to determine exact targeted regions affected by stochastics under current exome sequencing standards but also demonstrated the utility of merging the technical replicate data to permit interpretation of regions with coverage that is otherwise too shallow.
Genotyping sensitivity and accuracy to detect de novo variants improves with cumulative replicate sequencing
We investigated the effect of stochastic variation on genotyping of variants among technical replicate data sets of the same individual. We included all 18 technical replicates for this analysis. Since depth of sequencing and the relative proportion of representation of the alternate allele play a key role in genotype calling , we delineated genotype discordances among technical replicates at varying depths of sequencing. To address this question systematically, we binned each targeted position into 10-19x, 20-29x, and ≥30x bins. We evaluated sites within each of these bins where genotype calls disagreed between technical replicates of the same individual. In total we evaluated 19,424,806, 11,888,469, and 75,145,195 positions in the 10-19x, 20-29x, and ≥30x bins, respectively, and found 65 differences. 62 of these differences were in the 10-19x bin (between technical replicate genotype discordance: 3.2x10−6), 3 in the 20-29x bin (between technical replicate genotype discordance: 2.5x10−6) and none in the ≥30x bin.
To evaluate the accuracy of WES genotyping at the 65 genotype-discordant sites, we performed Sanger dideoxy chain-termination sequencing. Of the 54 differences for which we could design functional primers (Additional File 1: Table S3), Sanger sequencing showed that 20 (37 %) were heterozygous; 29 (54 %) were homozygous reference, and 5 (9 %) were homozygous non-reference. This demonstrated that 20 heterozygotes were not called in at least one technical replicate in an individual (disagreement rate: 0.64x10−6; 20 in 31,313,275 sites tested), while 29 variant sites were falsely called as heterozygotes in at least one technical replicate. The sequence surrounding 13 of the 29 sites mapped to multiple regions of the human reference sequence suggesting that the differences arose from mis-mapping (Additional file 1: Table S3).
Next, we called de novo variants that arose in second and third generations of the family. In technical replicate data, for each trio, we found 3–11 (mean: 6.2 ± 3) and 1–8 (mean: 3.3 ± 2) de novos at 10x and 20x thresholds, respectively. In aggregated data, we found 4–7 (mean: 6 ± 1.7) and 3–4 (mean: 4 ± 2.3) de novos at 10x and 20x thresholds, respectively. Of the 30 sites for which we could design effective primers, Sanger sequencing showed that 4 were true de novo variants in technical replicates (Additional file 1: Table S4). Two of the 6 de novo variants identified in the aggregate data were missed in technical replicate data. The technical replicate and aggregate variants that did not validate by Sanger sequencing were in regions with problematic GC-content or mappability (Additional file 1: Figure S5 and Table S5) . Taken together, technical replicate data along with aggregate data for a given individual improved the interpretation of NGS genotype calls compared to current WES standards.
In this study, we investigated the utility of independent generation of exome capture libraries for the purpose of interlibrary comparison, the effects of stochastics on targeted genomic region capture, the advantages of aggregate data on targeted resequencing, and finally the effect of the overall process on genotyping and de novo detection.
We showed that current accepted exome-sequencing threshold of ≥ 20x is unsaturated at average 50x-100x WES target coverage, since interpretation of the targeted bases varied significantly among technical replicates at a ≥ 20x threshold and not significantly at ≥ 1x threshold. Our results are consistent with current understanding that capture sequencing data interpretation is heavily dependent on amount of exome data generated [14, 22, 28]. Certain targeted regions had variable coverage only as a consequence of input data since aggregate data met the genotyping threshold of ≥ 20x coverage. Controlled analysis done by limiting the effect of input sequence data on overall measurement of stochastics showed variability between technical replicates at the genotypeability threshold of ≥ 20x and this is addressed by aggregating data to higher depths. A recent report by Redin and colleagues  support our conclusion that aggregate data is beneficial to the overall interpretation of exome data. They showed that un-interpretable regions within targeted portions can be as low as 3.9 kb (1.8 kb protein-coding) given deep-sequencing (mean coverage: >350x) compared to our observation that it (mean coverage: 40-50x) can be as high as 0.5-2 Mb. Therefore deeper exome sequencing may have potential to improve diagnostic yield for unselected patients, which for rare disorders is currently 25 % (95 % CI: 20–31) in clinical laboratories .
Our approach is unique and novel because it addresses the potential library generation-specific sequencing biases that may propagate through the sequencing process and when evaluated, appear as true single nucleotide variants. Our study design comparing technical replicate library data from the same individual provides an added advantage to detect genotyping anomalies that would otherwise be undetectable. This approach however raises the question of whether funding would be better spent on two additional technical replicates rather than on single library preparation and generating additional sequence data. At the time when this study was designed and performed, it cost $300 per replicate library preparation (Illumina TruSeq kit; see Additional file 1: Table S6). However, due to the rapid drop in prices and technological advancements, cost estimates for replicate library preparations currently are around $124 per replicate library (Illumina Nextera; see Additional file 1: Table S6). Using the triplicate library approach would therefore equate to an extra cost of $248 per individual sequenced, when compared to single library preparation coupled with deep sequencing approach (same sequence data output). Estimating current sequencing costs at $0.1 per million bases sequenced (http://www.genome.gov), we assess that $248 would theoretically allow for purchase of 2.48 Gb of additional sequence data per sample. Given 70-85 % of 2.48 Gb would pass post-alignment and quality-filter analysis, 1.7-2.1 Gb mappable data would be available for interpretation. Given our observations of target-base coverage saturation at ≥ 20x threshold for 20–23 Gb aggregate user-quality data, we conclude that an extra 1.7-2.1 Gb per sample would minimally alter the overall interpretation of targeted regions under evaluation (see Additional file 1: Figure S6).
Finally, we argue that our approach and findings are consistent with other studies that note benefits of replicate exome comparisons for variant detection and replicated exome merging for variant calling accuracy . Benefits of this approach may also minimize the stochastic branching process of allele-distribution in exome datasets derived from a single library generation process and may additionally help mitigate library specific amplification biases .
We describe a method to evaluate the reproducibility and stochastics in exome library preparation, and delineate the advantages of aggregating the data derived from technical replicates. The implications of this study are directly applicable and provide an opportunity to rapidly, efficiently, and accurately diagnose patients.
Patients accepted into the NIH Undiagnosed Diseases Program (UDP) were enrolled in clinical protocol 76-HG-0238 approved by the Institutional Review Board (IRB) of the National Human Genome Research Institute. The individuals or their guardians gave written, informed consent.
Genomic DNA extraction
Genomic DNA was extracted from peripheral whole blood using the Gentra Puregene Blood kit (Qiagen, Inc.), which employs modified salting-out precipitation according to the manufacturer’s protocol as previously described . DNA was eluted in 250 – 1000 μL at a concentration of up to 35 μg/mL.
Independent exome-library preparation and capture
Independent exome-libraries were generated in triplicate (3 technical replicates) for each subject (Fig. 1b) using the Illumina TruSeq DNA Sample Preparation kit (version 2) according to the manufacturer’s protocol. Pre-enrichment, all independent libraries were pooled and multiplexed up to 6-samples. Pooled libraries were captured in-solution for isolating exonic regions of interest in the human genome using hybrid selection with TruSeq Exome Enrichment kit, version 2 (Illumina, Inc.) as per the manufacturer’s protocol. The kit targeted 62 Mb of the human genome using 95mer probes that selected target libraries of 300–400 bp and enriched 265–465 bases centered symmetrically on the midpoint of the probe.
Exome sequencing and data processing
Paired-end sequencing was performed on the Illumina HiSeq 2000 instrument generating 100-bp reads. The output reads from the Illumina Genome HiSeq 2000 were mapped to the reference haploid human-genome sequence (Genome Reference Consortium human genome build 37; human genome 19) with the use of the eland (Illumina, Inc) generating per-sequencing-lane bamfiles. Fastq files derived from per-sequencing-lane bamfiles were re-aligned to the reference sequence using the Novoalign program (http://www.novocraft.com) with default parameters; data for this stage were grouped by technical replicate samples. SAMtools  was used to identify and remove PCR duplicates. Each exome sequencing result was either saved as a technical replicate bamfile or a merged bamfile (aggregate data) derived by merging all three technical replicates of a given individual. Genotypes were called at all positions where there were high-quality sequence bases (Phred-like Q20 or greater) using a Bayesian algorithm (Most Probable Genotype – MPG) . Nucleotide coverage queries and read depth analyses were performed using SAMtools . The depth of coverage was calculated from bamfiles with SAMtools and custom PERL scripts. Descriptive statistics for sequencing data analysis were obtained using R statistical software (version 3.1.2).
De novo mutation analysis
For de novo mutation (DNM) detection analysis, all family members’ pedigree information was formatted into trio data. Each individual’s data were organized into technical replicate bamfiles. For each technical replicate bamfile, genotypes were called in targeted regions and these data were binned by target region coverage: 10-19x, 20-29x, and ≥30x bins. Variants were classified as DNMs in scenarios that did not fit Mendelian inheritance patterns for each trio, similar to methodologies established in previous publications [34, 35]. We iterated this procedure for all possible trios and all bins. Technical replicate DNM concordance was computed using PERL scripts.
Bioinformatics data analysis
The data were chiefly analyzed and annotated by means of a bioinformatics pipeline that was developed in-house, mainly consisting of PERL scripts. Data were formatted for parallel computational processing using GNU Bash scripting (http://www.gnu.org/software/bash) on a Linux operating system (http://linux.org). Parallel and batch compute job tasks were submitted, queued, and managed by Portable Batch System (PBS) computer software in a Linux cluster environment. Job tasks were processed on National Institutes of Health (NIH) Biowulf system via login node (biowulf.nih.gov). All nodes within this system are connected to a 1 Gb/s switched Ethernet network, while sub-sets of nodes are on high-performance Infinipath or Infiniband networks (16 Gb/s bandwidth with very low latency). Job tasks were mostly processed on 8 x 2.67 GHz Intel X5550, 32 x 72 GB nodes (352 nodes) in this system. Data were accessed from RAID-6 file systems mounted over Network File System (NFS; Sun Microsystems) or General Parallel File System (GPFS; IBM).
R statistical software (version 3.1.2) (www.r-project.org) was used for statistical data analysis and plotting figures. Non-parametric regression method, Local Polynomial Regression Fitting (loess) was used for local polynomial fitting of percentage of exome target region sequenced data (span = 0.75). Fitting was locally controlled by alpha parameter (span (α)), which determined the size of the regression neighborhood. To test for stochastics in target bases sequenced, we used two independent proportion test (prop.test). This tested the null that the probabilities of success (coverage of same sequence bases) in several groups are the same (conf.level = 0.95).
For Sanger sequencing verification of genotypes called by NGS, oligo-nucleotide primers were synthesized by Integrated DNA Technologies. The regions containing the suspected single nucleotide variants (SNVs) were amplified by polymerase chain reaction (PCR) using 50 ng of genomic DNA derived from patient peripheral blood, the listed primers and Qiagen HotStar Taq Plus under the following conditions: 95 °C x 5 min denaturation followed by 40 cycles of 95 °C x 30 s, 55 °C x 30 s, 72 °C x 30 s. Residual primers and nucleotides were removed with ExoSAP-IT reagent (USB, Cleveland, OH, USA). The amplicons were then sequenced using BigDye® terminator chemistry by Macrogen (Rockville, MD) and compared to Human Genome reference sequence (GRCh37; assembly hg19) using Sequencher (GeneCodes, Ann Arbor, MI, USA).
The authors thank Dr. Camilo Toro, Dr. Thomas Markello, Lynn Wolfe, and clinical staff for thoughtful discussions and feedback. We are extremely grateful to NIH Intramural Sequencing Center (NISC) for NGS library preparation, sequencing and data generation. We are extremely grateful to Drs. Ben Solomon, Thierry Vilboux, and Ram Iyer for thoughtful comments and discussion. We are extremely grateful to the participants of the Undiagnosed Diseases Program (UDP). Extensive use was made of the NIH Biowulf cluster, a GNU/Linux parallel processing system for computational analysis; we acknowledge the NIH Helix Systems Staff for management of this system. This study is supported by funds from the Intramural Research Program of the National Institutes of Health.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Linderman MD, Brandt T, Edelmann L, Jabado O, Kasai Y, Kornreich R, et al. Analytical validation of whole exome and whole genome sequencing for clinical applications. BMC Med Genomics. 2014;7:20.PubMed CentralView ArticlePubMedGoogle Scholar
- Gilissen C, Hoischen A, Brunner HG, Veltman JA. Disease gene identification strategies for exome sequencing. Eur J Hum Genet. 2012;20(5):490–7.PubMed CentralView ArticlePubMedGoogle Scholar
- Gahl WA, Markello TC, Toro C, Fajardo KF, Sincan M, Gill F, et al. The National Institutes of Health Undiagnosed Diseases Program: insights into rare diseases. Genet Med. 2012;14(1):51–9.PubMed CentralView ArticlePubMedGoogle Scholar
- Ombrello MJ, Sikora KA, Kastner DL. Genetics, genomics, and their relevance to pathology and therapy. Best Pract Res Clin Rheumatol. 2014;28(2):175–89.PubMed CentralView ArticlePubMedGoogle Scholar
- Lohmann K, Klein C. Next Generation Sequencing and the Future of Genetic Diagnosis. Neurotherapeutics. 2014;11(4):699–707.PubMed CentralView ArticlePubMedGoogle Scholar
- Ku CS, Cooper DN, Polychronakos C, Naidoo N, Wu M, Soong R. Exome sequencing: dual role as a discovery and diagnostic tool. Ann Neurol. 2012;71(1):5–14.View ArticlePubMedGoogle Scholar
- Biesecker LG, Green RC. Diagnostic clinical genome and exome sequencing. N Engl J Med. 2014;370(25):2418–25.View ArticlePubMedGoogle Scholar
- Dias C, Sincan M, Cherukuri PF, Rupps R, Huang Y, Briemberg H, et al. An analysis of exome sequencing for diagnostic testing of the genes associated with muscle disease and spastic paraplegia. Hum Mutat. 2012;33(4):614–26.PubMed CentralView ArticlePubMedGoogle Scholar
- Redin C, Gérard B, Lauer J, Herenger Y, Muller J, Quartier A, et al. Efficient strategy for the molecular diagnosis of intellectual disability using targeted high-throughput sequencing. J Med Genet. 2014;51(11):724–36.PubMed CentralView ArticlePubMedGoogle Scholar
- Ryan S, Willer J, Marjoram L, Bagwell J, Mankiewicz J, Leshchiner I, et al. Rapid identification of kidney cyst mutations by whole exome sequencing in zebrafish. Development. 2013;140(21):4445–51.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang X. Exome sequencing greatly expedites the progressive research of Mendelian diseases. Front Med. 2014;8(1):42–57.View ArticlePubMedGoogle Scholar
- Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, et al. Characterizing and measuring bias in sequence data. Genome Biol. 2013;14(5):R51.PubMed CentralView ArticlePubMedGoogle Scholar
- Clark MJ, Chen R, Lam HY, Karczewski KJ, Chen R, Euskirchen G, et al. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol. 2011;29(10):908–14.PubMed CentralView ArticlePubMedGoogle Scholar
- Parla JS, Iossifov I, Grabill I, Spector MS, Kramer M, McCombie WR. A comparative analysis of exome capture. Genome Biol. 2011;12(9):R97.PubMed CentralView ArticlePubMedGoogle Scholar
- Asan, Xu Y, Jiang H, Tyler-Smith C, Xue Y, Jiang T, et al. Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol. 2011;12(9):R95.Google Scholar
- Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011;12(2):R18.PubMed CentralView ArticlePubMedGoogle Scholar
- Meynert AM, Bicknell LS, Hurles ME, Jackson AP, Taylor MS. Quantifying single nucleotide variant detection sensitivity in exome sequencing. BMC Bioinformatics. 2013;14:195.PubMed CentralView ArticlePubMedGoogle Scholar
- Heinrich V, Kamphans T, Stange J, Parkhomchuk D, Hecht J, Dickhaus T, et al. Estimating exome genotyping accuracy by comparing to data from large scale sequencing projects. Genome Med. 2013;5(7):69.PubMed CentralView ArticlePubMedGoogle Scholar
- Ku CS, Polychronakos C, Tan EK, Naidoo N, Pawitan Y, Roukos DH, et al. A new paradigm emerges from the study of de novo mutations in the context of neurodevelopmental disease. Mol Psychiatry. 2013;18(2):141–53.View ArticlePubMedGoogle Scholar
- Ku CS, Tan EK, Cooper DN. From the periphery to centre stage: de novo single nucleotide variants play a key role in human genetic disease. J Med Genet. 2013;50(4):203–11.View ArticlePubMedGoogle Scholar
- Lee H, Deignan JL, Dorrani N, Strom SP, Kantarci S, Quintero-Rivera F, et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA. 2014;312(18):1880–7.PubMed CentralView ArticlePubMedGoogle Scholar
- An JY, Cristino AS, Zhao Q, Edson J, Williams SM, Ravine D, et al. Towards a molecular characterization of autism spectrum disorders: an exome sequencing and systems approach. Transl Psychiatry. 2014;4:e394.PubMed CentralView ArticlePubMedGoogle Scholar
- Sambuughin N, Zvaritch E, Kraeva N, Sizova O, Sivak E, Dickson K, et al. Exome analysis identifies Brody myopathy in a family diagnosed with malignant hyperthermia susceptibility. Mol Genet Genomic Med. 2014;2(6):472–83.PubMed CentralView ArticlePubMedGoogle Scholar
- Biesecker LG. Opportunities and challenges for the integration of massively parallel genomic sequencing into clinical practice: lessons from the ClinSeq project. Genet Med. 2012;14(4):393–8.PubMed CentralView ArticlePubMedGoogle Scholar
- Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med. 2013;15(7):565–74.PubMed CentralView ArticlePubMedGoogle Scholar
- Epi4K Consortium; Epilepsy Phenome/Genome Project, Allen AS, Berkovic SF, Cossette P, Delanty N, et al. De novo mutations in epileptic encephalopathies. Nature. 2013;501(7466):217–21.Google Scholar
- Bonne G, Di Barletta MR, Varnous S, Bécane HM, Hammouda EH, Merlini L, et al. Mutations in the gene encoding lamin A/C cause autosomal dominant Emery-Dreifuss muscular dystrophy. Nat Genet. 1999;21(3):285–8.View ArticlePubMedGoogle Scholar
- Toma C, Torrico B, Hervás A, Valdés-Mas R, Tristán-Noguero A, Padillo V, et al. Exome sequencing in multiplex autism families suggests a major role for heterozygous truncating mutations. Mol Psychiatry. 2014;19(7):784–90.View ArticlePubMedGoogle Scholar
- Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012;40(10), e72.PubMed CentralView ArticlePubMedGoogle Scholar
- Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N Engl J Med. 2013;369(16):1502–11.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Y, Li B, Li C, Cai Q, Zheng W, Long J. Improved variant calling accuracy by merging replicates in whole-exome sequencing studies. Biomed Res Int. 2014;2014:319534.PubMed CentralPubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.PubMed CentralView ArticlePubMedGoogle Scholar
- Teer JK, Bonnycastle LL, Chines PS, Hansen NF, Aoyama N, Swift AJ, et al. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res. 2010;20(10):1420–31.PubMed CentralView ArticlePubMedGoogle Scholar
- Li B, Chen W, Zhan X, Busonero F, Sanna S, Sidore C, et al. A likelihood-based framework for variant calling and de novo mutation detection in families. PLoS Genet. 2012;8(10), e1002944.PubMed CentralView ArticlePubMedGoogle Scholar
- Ramu A, Noordam MJ, Schwartz RS, Wuster A, Hurles ME, Cartwright RA, et al. DeNovoGear: de novo indel and point mutation discovery and phasing. Nat Methods. 2013;10(10):985–7.PubMed CentralView ArticlePubMedGoogle Scholar