The identification of genomic SV is an integral part of understanding human genetic diversity, gene and genome variation, evolution, and disease etiology. Numerous genetic diseases, most notably cancer, have been associated with genomic SV of somatic cells [1] whilst germline genomic rearrangements cause a wide range of genomic disorders by diverse structural variation mutagenesis mechanisms [2, 3]. Historically, clone-based methods, such as end sequence profiling of Bacterial Artificial Chromosome and fosmid clones [4, 5] have provided paired-end sequencing of DNA fragments containing aberrant junctions, necessary to identify genomic structural rearrangements. Unfortunately, these clone-based methods proved too laborious, time consuming, and costly for routine chromosome aberration analyses. Alternatively, the adoption of massively parallel sequencing delivering hundreds of millions of short fragment paired-end reads has accelerated the study of genomics. While the quantity of reads produced by massively parallel sequencing is impressive, the standard paired-end sequencing libraries have sacrificed fragment sizes compared to clone-based methods, thereby limiting the detection of large complex genomic SV [6]. To alleviate the shortcomings of small fragment paired-end libraries, multi-kilobase mate pair and Nextera Tagmentation sequencing libraries have been introduced. For these long-range mate pair libraries, fragments (usually on the order of 3-10Kb) are isolated, end labeled with biotin and circularized. The circularized molecules are sheared and the library is enriched for biotin labeled junction fragments. Sequencing of these biotinylated fragments generates ‘outward-facing’ paired reads, meaning they align to the reference sequence in an outward facing direction from each other and at a distance in line with the selected long-range fragment size. The main restriction of such sequencing libraries is the contamination of inward facing reads, unbiotinylated fragments which map in an opposite orientation and smaller fragment size (usually between 200 and 300 bp) that confound the calling of chromosomal rearrangements by introducing contradictory discordant read information.
To improve the utility of multi-kilobase mate pair and Nextera Tagmentation sequencing libraries for genomic SV detection, we introduce a new freely accessible program called SVachra (Structural Variation Assessment of CHRomosomal Aberrations) that uses discordant mate pair reads consisting of both inward and outward facing read types. The SVachra program calculates the distributions of the inward and outward facing mate pair types and applies independent clustering of the inward and outward facing discordant mapped reads to call chromosomal structural variants. Both inward and outward facing reads contribute to SV calling, thereby improving the accuracy of the breakpoint location. SVachra identifies large insertions-deletions, inversions, inter- and intra-chromosomal translocations. Compared to other SV prediction tools that utilize mate pair mapping data; such as BreakDancer [7], SVDetect [8], and others (see review [9]), the utility and novelty of SVachra consists in its ability to (i) automatically segregate outward and inward facing reads based on K-Means clustering of fragment size distributions without a priori sequencing library information, (ii) independently cluster and merge discordant outward and inward facing reads to more accurately predict breakpoint locations, (iii) screen out user-defined segments of the reference for breakpoint consideration, such as centromeric and telomeric regions responsible for high false positive rates and heavy computation, and (iv) create various output file formats for visual assessment of SV and downstream processes, such as breakpoint spanning primer design.
Methods
SVachra takes a .bam file of mapped mate pair reads and filters out unmapped, duplicate, and low quality reads based on a user defined minimum mapping quality parameter. Mapping quality is based on the supplied .bam file MAPQ value that equals -10log10 times the probability{that the mapping position is wrong}, rounded to the nearest integer (default = 0). In addition, the user may select the “-u” option that will consider only uniquely mapped reads during computation. Finally, the user may supply a screen out .bed file that will filter reads based on overlap with undesirable reference segments such as centromeres and telomeres that are responsible for false positive structural variant calls. From the remaining high quality mapped reads, SVachra calculates the inward and outward facing fragment distributions. Based on the fragment size distributions, SVachra segregates paired reads using K-Means clustering; classifying paired reads as either outward or inward facing, and calculating the upper and lower bound fragment size thresholds for the inward and outward facing read sets. Discordant read pairs, those that map with anomalous order, orientation, and/or fragment size, are utilized for putative SV calling. Discordant read pairs and/or split read alignments are clustered based on matching orientation and overlapping fragments that span the same putative breakpoint. The minimum number of breakpoint spanning read pairs or split read alignments contributing to any given cluster is a user-defined parameter (default = 2). The allowed spanning overlap distance for clustering any two paired alignment sets is constrained to two times the maximum fragment size of the appropriate read distribution, i.e. the inward or outward fragment size thresholds. Given the fragment size thresholds, there is a possibility that more than one inward or outward cluster could be generated for any single chromosomal aberration. Such an occurrence would manifest as overlapping clusters with identical read pair orientations and represent a redundant structural variant call. Cluster quality control filtering may be selected using the “-s” option that flags (and reports) redundant clusters and retains only the highest scoring cluster in an overlapping set for subsequent processing. Putative breakpoints called independently by both outward and inward facing clusters are merged, improving the accuracy of those breakpoint locations, especially in cases of low coverage sequencing data sets. Clusters are annotated into independent chromosomal aberrations and classified as insertions or deletions (indels), inversions, or translocations. Sizes of indels are calculated from the average discordant distance disparity from the expected distribution mean. Each putative SV call reports a score based on the number of contributing spanning read pairs and split read alignments for that rearrangement. And while both inward and outward facing reads types inform the assessment of structural variants, reported SVs are orientated to the inward facing read orientations allowing easy comparison and integration of SV calls with the more common paired end sequencing data. The SVachra programmatic workflow is outlined in Fig. 1a.
SVachra provides numerous outputs for subsequent SV analysis; including: binned fragment lengths for plotting the mate pair insert distributions, lists of annotated SV calls in both .svp and .lff formats (see Additional file 1: S2 for output format definitions), .bed file of SV calls for intersection analysis, and Circos [10] input files for SV visualization. Two Circos input files are provided: a Circos link input file contains the SVachra chromosomal aberrations for the inter-chromosomal (purple) and intra-chromosomal (green) translocations, and a Circos tile input file contains the deletions (red), insertions (blue), and inversions (orange).