Sample data
Tables 1 and 2 provide references for previously collected data and published methods, respectively. Informed consent was obtained for individual HS1011 under protocol H-29697, which is approved by the Institutional Review Board at Baylor College of Medicine. This protocol provides consent to publish the detailed genomic information contained in this manuscript. Sequence data for HS1011 (BioSample SAMN00009513) can be obtained via the SRA database (accession numbers SRX286419, SRX852867, SRX852868, and SRX852869).
Illumina Nextera
The WGS Illumina Nextera data is 100 × 100 bp mate pair with an average fragment size of 6.5 kbp providing 71X clone coverage and approximately 2X read coverage.
Pacific Biosciences
Large-insert PacBio library preparation was conducted by following the User Bulletin - Guidelines for Preparing 20 kbp SMRTbell™ Templates (version 2) and Procedure & Checklist - 20 kbp Template Preparation Using BluePippin Size-Selection (version 3) listed in the website (http://www.pacificbiosciences.com/support/pubmap/documentation.html). In brief, a total of 120 ug genomic HS1011 DNA was sheared into 20 kbp targeted size by using Covaris g-TUBEs (Cat.# 520079, Covaris) on Eppendorf 5424 centrifuge. Each shearing processed 10 ug input DNA and a total of 12 shearings were performed. The sheared genomic DNA was examined by Agilent 2100 Bioanalyzer DNA12000 Chip (Cat.# 5067–1508, Agilent Technologies Inc.) for size distribution and underwent DNA damage repair/end repair, blunt-end adaptor ligation followed by exonuclease digestion. The purified digestion products were loaded onto pre-cast 0.75% agarose cassettes (Cat.# BHZ7510, Sage Science) for 7–50 kbp size selection using BluePippin Size Selection System (Cat.# BLU0001, Sage Science), and the recovered size-selected library products were purified using 0.5x pre-washed Agencourt AMPure XP beads (A63880, Beckman Coulter). The final libraries were examined by Agilent 2100 Bioanalyzer DNA12000 Chip for size distribution and the library concentrations were determined by Qubit 2.0 Fluorometer (Cat.# Q32866, Life Technologies).
BioNano Irys
Cells were washed with PBS, resuspended in cell resuspension buffer, and embedded in gel plugs (BioRad #170-3592). Plugs were incubated with lysis buffer and proteinase K for four hours at 50°C. The plugs were washed and then solubilized with GELase (Epicentre). The purified DNA was subjected to four hours of drop dialysis. It was quantified using Nanodrop 1000 (Thermal Fisher Scientific) and/or Quant-iT dsDNA Assay Kit (Invitrogen/Molecular Probes), and the quality was assessed using pulsed-field gel electrophoresis.
DNA was labeled according to commercial protocols using the IrysPrep Reagent Kit (BioNano Genomics, Inc.). Specifically, 300 ng of purified genomic DNA were nicked with 7U nicking endonuclease Nt.BspQI (New England BioLabs, NEB) at 37°C for two hours in NEB Buffer 3. The nicked DNA was labeled with a fluorescent-dUTP nucleotide analog using Taq polymerase (NEB) for one hour at 72°C. After labeling, the nicks were ligated with Taq ligase (NEB) in the presence of dNTPs. The backbone of fluorescently labeled DNA was stained with YOYO-1 (Invitrogen).
The DNA was loaded onto the nanochannel array of the BioNano Genomics IrysChip by electrophoresis of DNA. Linearized DNA molecules were then imaged automatically followed by repeated cycles of DNA loading using the BioNano Genomics Irys system.
The DNA molecules backbones (YOYO-1 stained) and locations of fluorescent labels along each molecule were detected using the in-house software package, IrysView. The set of label locations of each DNA molecule defines an individual single-molecule map.
Single-molecule maps were assembled de novo into consensus maps using tools developed at BioNano Genomics. Briefly, the assembler is a custom implementation of the overlap-layout-consensus paradigm with a maximum likelihood model. An overlap graph was generated based on pairwise comparison of all molecules as input. Redundant and spurious edges were removed. The assembler outputs the longest path in the graph and consensus maps were derived. Consensus maps are further refined by mapping single molecule maps to the consensus maps and label positions are recalculated. Refined consensus maps are extended by mapping single molecules to the ends of the consensus and calculating label positions beyond the initial maps. After merging of overlapping maps, a final set of consensus maps was output and used for subsequent analysis (Additional file 2: Figure S3).
Alignments between consensus maps were obtained using a dynamic programming approach where the scoring function was the likelihood of a pair of intervals being similar. Likelihood is calculated based on a noise model which takes into account fixed sizing error, sizing error which scales linearly with the interval size, misaligned sites (false positives and false negatives), and optical resolution. An interval or range of intervals whose cumulative likelihood is worse than 0.01 percent is classified as an outlier region. If such regions occur between highly scoring regions, an insertion or deletion call is made in the outlier region, depending on the relative size of the region on the query and reference maps.
At present, the BioNano assembly approach is agnostic to allelic bias when calling SVs, and all Irys SV calls are presumed to the homozygous. The BNG de novo assembly approach chooses a single haploid when representing a flattened reference model (which is the current BNG standard), presumably the allele that is present in the majority of molecules.
Array comparative genomic hybridization
Genomic DNAs of all four members of the family were utilized to perform aCGH using a variety of high-resolution platforms to detect CNVs in the family quartet. NA10851 DNA obtained from a cell line from Coriell Cell Repositories (http://ccr.coriell.org) was used as control for the comparative genomic hybridization for all individuals and platforms.
Agilent 1 M whole-genome aCGH
Array comparative genomic hybridization (aCGH) using Agilent’s 1 Million whole-genome high-density oligonucleotide microarrays containing one million probes across the genome was performed in the four members of the quartet and two additional siblings. Briefly, samples and control DNAs (2500 ng) were digested with the enzymes AluI and RsaI. Following digestion, the sample DNAs were labeled with Cy5-dCTP and the control DNAs were labeled with Cy3-dCTP using the BioPrime Array CGH genomic labeling kit (Invitrogen Corporation, Carlsbad, CA, USA). Purification of and quatitation of the labeled genomic DNA was performed and samples and controls were matched accordingly. Sample plus control labeled DNAs were mixed with human Cot-I DNA for blocking unspecific hybridization and mixed in blocking and hybridization buffers according to the manufacturer’s protocol. After pre-hybridization incubation, the labeled DNAs were deposited on the 1 M array slide for competitive hybridization to take place for 40 hours at 65°C. Washing, scanning, and data feature extraction were conducted according to the manufacturer’s protocol.
NimbleGen 4.2 M whole-genome aCGH
NimbleGen’s 4.2 Million whole-genome array platform was also used on the four members of the quartet. Briefly, genomic DNAs for samples and control (0.5 ug) were labeled using the manufacturer’s Cy3 (test sample) or Cy5 (control) labeled random nonamers. Labeled products were precipitated, purified and combined (sample + control) for competitive hybridization on the array slide at 42°C for 72 hours. After hybridization, washing, scanning, image processing and data extraction were conducted according to the manufacturer’s protocol and software.
PCR amplification and sequencing of breakpoints
Specific PCR primers based on the aCGH and SOLiD sequencing CNV calls were designed. Standard end-point and long-range PCR reactions were performed in order to amplify the specific CNV breakpoints. Sanger sequencing was done on all of the successfully amplified PCR products in order to elucidate the specific sequence and coordinates where the breakpoints occurred.
Software
Structural Variation detection by STAck and Tail (SV-STAT) is a reference-guided assembler that detects and ranks SVs at nucleotide resolution. First the algorithm catalogs candidate breakpoints, the genomic coordinates and orientations of which are determined by recurrent partial alignments, or “stacks.” Next, SV-STAT generates a fasta-formatted library of candidate junctions by concatenating breakpoint regions in orders and orientations consistent with otherwise discordant read-pairs. The algorithm’s metric for a candidate is a function of the difference between the scores of alignments A and B, where A is the alignment between a “stacked” read and the reference, and B is a re-alignment of the same read to the candidate junction. Full details will be reported elsewhere. For the purposes of this study, SV-STAT used the predictions of BreakDancer to determine the paired genomic regions in which to search. In this way, SV-STAT provided a ranking of BreakDancer predictions according to the support available at nucleotide resolution. The source code for SV-STAT is publically available (https://gitorious.org/svstat).
Structural Variation Assessment of CHRomosomal Aberrations (SVachra) is a breakpoint-calling program that uses discordant mate pair reads consisting of both inward and outward facing read types, for example, the data delivered by Illumina mate pair and Nextera Tagmentation sequencing libraries. The SVachra program calculates the distributions of the inward and outward facing mate pair types and applies independent clustering of the inward and outward facing discordant mapped reads to call chromosomal aberrations. Both inward and outward facing reads contribute to the calling of SV, reporting. SVachra calls large insertions-deletions, inversions, inter- and intra-chromosomal translocations, reporting breakpoints in the inward facing orientation thereby eliminating the contradictory outward facing read orientations. SVachra Source code is available at http://github.com/oliverhampton/SVachra.
Tiresias identifies mobile element insertions using clusters of improperly mapped read pairs comprising one read that maps uniquely to the genome and one that maps to a set of element-specific consensus sequences. Breakpoints consistent with each cluster are then identified as local genomic positions with multiple termini of soft-clipped reads.
Data mapping and alignment of the original SOLiD WGS sequence data were performed using Life Technologies’ (former Applied Biosystems) SOLiD Software and Corona Lite suite. Illumina WGS data were mapped and aligned using the BCM HGSC Mercury pipeline (See Appendix 3). All other additional analyses were performed using custom Perl scripts for data parsing, comparison, extraction and intersection.
Standardization
To compare the SV calls made by each program, all calls are first reduced to one of three types: deletion, insertion, and mismatch. Deletions correspond to any regions in the sample that are missing sequence that is locally present in the reference, insertions are regions with more sequence than the reference, and mismatches contain different sequence than the reference (e.g., inversions). Reduced results are stored in a VCF-like format (http://sourceforge.net/projects/parliamentsv/), and these files are then loaded into a SQLite database and clustered in a method- and data-sensitive manner.
Events with reference sequence spans (deletions and mismatches) are clustered using source-specific minimum reciprocal overlap thresholds. For example, the Illumina-based BreakDancer and CNVnator programs both generate calls of similar precision and such calls are grouped if they possess >50% reciprocal overlap. However, Bionano Irys calls contain only outer boundaries, while Crest calls contain exact breakpoints, so the Irys-Crest threshold is 20% reciprocal overlap. Exact breakpoint resolution of zero reference-span events can be complicated by genomic repeats and microhomology of the inserted sequence, resulting in non-overlapping insertion calls of the same event. To account for this ambiguity, insertion events are mean-shift clustered at several scales, with calls from more precise programs requiring clustering at smaller scales. All clustering and merging parameters can be adjusted by the user. A full list of the default parameters can be found in Supplemental Methods.
The BioNano-Irys platform provides outer-boundaries of reference spans with cumulative sequence-length differences between the sample compared to the reference. This differs from other SV detection methods in that a single Irys call may represent multiple individual events. For example, a 10 kbp span in the reference with a sequence-length difference of 1 kbp could represent an insertion event of 2 kbp and a deletion event of 3 kbp. Irys insertion calls have a mean and median span to sequence-length difference of 12.4 kbp and 7.7 kbp. Similarly, deletion calls have a mean and median difference of 18.8 kbp and 13.3 kbp. In order to appropriately incorporate these broad outer-boundaries and nuanced SV definition, a standard reciprocal overlap or mean-shift threshold is not an optimal use of the data. Therefore, we manually inspected the 852 Irys calls against our merged SVs in order to annotate the SVs as having additional support provided by Irys.
Spiral Genetics’ Anchored Assembly performs whole read overlap assembly on corrected, unmapped reads to detect SNVs, indels, and structural variants. Sequencing errors are corrected by generating k-mers from reads and giving each unique k-mer a quality score. Low-scoring k-mers are discarded as erroneous. The set of high scoring, or true k-mers is used to construct a de Bruijn graph representing an error-free reconstruction of the true read sequences. Each read is corrected by finding the globally optimum base substitution(s) so that it aligns to the graph with no mismatches and differs by the smallest base quality score from the original read. Of these corrected reads, those that do not match the reference exactly are assembled into a discontiguous read overlap graph to capture sequence variation from the reference. Variants are mapped to human reference coordinates (GHCr37.p7) by walking the read overlap graph in both directions until an “anchor” read, where a continuous 65 bp matches the reference, denotes the beginning and end of each variant. Where a variant has more than one anchor, pairing information is used to determine the correct location of the anchor. This analysis includes only variants where a variant was classified as a deletion, insertion, tandem repeat, or inversion and anchors on both ends mapped uniquely to the reference. Other variants detected using Anchored Assembly were not included.
Annotation
Parliament annotates each putative variant with Ensembl gene boundaries, UCSC gene features (e.g., exon, intron, UTR), hg19 gap features (telomeres and centromeres), known variants from DGV, and known repeats from the UCSC repeat masker track. Known variants are matched to putative sites if they have at least 50% reciprocal overlap.
Hybrid assembly and force calling
The Illumina WGS and PacBio data within 2,000 bp of each variant locus is extracted and locally assembled with PHRAP. After mapping the resulting contigs back to the reference with Blasr, we determine whether the remapped sequence is consistent with the size and type of the corresponding predicted SV event. We classify such matches as “valid” SV events. However, local assembly does not always yield contigs and nor does it always produce variant alleles. Thus, we also “force call” events at every variant locus using the PacBio data, requiring only one PacBio read to be consistent with the predicted SV event.
SNP concordance
If a deletion occurs within a region that is unique in the genome, we expect all the SNPs in the deleted region to be homozygous. For each deletion locus, we calculate the fraction of homozygous SNVs in a predicted deletion region (homozScore). We include only deletions that have at least 5 SNVs in the region (1,633/17,665). We use the average coverage in the region of <25X (1/2 of the average HS1011 Illumina coverage) to focus on deletion regions likely to be unique in the HS1011 genome: average coverage >50X might indicate a paralogous region in HS1011 genome. Since reads from paralogs sometimes map to the same reference region, they would result in heterozygous SNVs even if the deletion is present. We use homozScore of 0.8 instead of 1 to account for SNV genotyping errors and potential imprecision of SV breakpoints.