breseqoverview
breseq (pronounced \brēz-ˈsēk\) is an integrated computational pipeline intended for predicting mutations in haploid genomes of less than approximately 20 Mb in total length [21] (Additional file 1). It is implemented in C++ and designed to run as a command-line tool on Unix-like platforms, including Mac OS X and Cygwin. breseq requires Samtools (version 0.1.18 is bundled with the code), the read mapper/aligner Bowtie 2 (version 2.1.0 or later; 2.1.0 was used here), and the R statistical computing environment (version 2.14 or later; 2.15.3 was used here). All of these programs are freely downloadable as either source code or compiled executables. breseq is invoked using a single command that takes as input FASTQ formatted DNA sequencing reads and reference sequences in GenBank, GFF3, or FASTA format. Providing feature annotations (e.g., locations and names of genes) in the reference files is optional. When present, they are used to optimize the placement of junction breakpoints with respect to mobile elements and to output additional information about the molecular effects of mutations on specific genes.
The breseq pipeline uses a reference-sequence-based mapping strategy, which includes evaluating new sequence junctions supported by split-read alignments and tracking multiply-mapped reads, to predict point mutations and structural mutations from short-read DNA resequencing data. As output, breseq produces an HTML archive with human-readable summary tables and the relevant evidence for each predicted mutation, including pileups of read alignments and graphs of read-depth coverage. Output is also provided in a machine-readable Genome Diff flat file format that contains entries for predicted mutations (e.g., deletion, base substitution, indel, mobile element insertion) linked to evidence supporting each genome sequence change (e.g., missing sequencing coverage, new sequence junctions, mismatches within aligned reads). The gdtools program packaged with breseq can be used to manipulate Genome Diff files for further analysis. For example, it can compare mutations predicted in multiple samples or apply changes in a Genome Diff file to generate a mutated version of the genome. Aligned reads and mutation predictions are also output in community formats (e.g., BAM [25] and VCF [26]) that can be directly input into other tools or visualization environments such as Tablet [27] or the Integrative Genomics Viewer [28].
The following sections describe the specific steps of the breseq pipeline used to predict new sequence junctions and genomic regions that are missing sequencing coverage in a sample and to infer several types of structural variants from this evidence (Figure 1). Then, we evaluate the performance of breseq for detecting these types of mutations in simulated and experimental short-read genome resequencing data sets. The methods that breseq uses to predict point mutations and small indels from read alignment evidence are described in the online breseq documentation and elsewhere [22, 29], and they are not benchmarked here.
Mapping reads
breseq uses Bowtie2 to map short sequence read data to the reference genome because it performs gapped alignment, finds local matches in the read, and allows exhaustive reporting of all alignments between a read and the reference sequence [30]. Bowtie2 options are used where alignment scores are set to be equal to the number of base matches minus three times the number of base mismatches (down-weighted if they are lower quality base calls), with a gap open penalty of 2 and a gap extension penalty of 3. A staged alignment procedure is employed to accelerate read mapping and alignment. The first, “stringent” phase reports reads with near-perfect matches to the reference genome by using long and sparse seed substrings (seed length of 0.5 times the average read length, constrained to a minimum of 9 and a maximum of 31; seed spacing equal to 1 plus 0.25 times the square root of the read length, rounded down) and a stringent alignment score threshold (0.9 times the read length). Then, remaining unmapped reads are aligned to the reference in a “relaxed” phase with parameters that enable split-read matches to be reported. This stage uses shorter seed substring requirements (seed length of 5 bases plus 0.1 times the average read length, rounded down and constrained to a minimum of 9 and a maximum of 31; seeds spaced by 1 plus 0.25 times the square root of the read length) and a much lower alignment score threshold (6 plus 0.2 times the read length). In both stages, every valid alignment to the reference genome is reported, including all information about how reads map to repeat sequences.
Depending on whether an alignment to the reference genome matches far enough past a large indel for an extended alignment including both sides to achieve a higher score, different reads that overlap the same genomic site may be reported as two separate alignments or a single alignment with gaps. To treat these cases uniformly, breseq further splits all initially reported read matches into multiple separate alignments at sites with indels of a certain length or longer (default: 3). In this step, there are cases where, due to inserted bases in the sample that exactly repeat existing reference bases (short duplications), single alignments are rewritten as two overlapping read alignments, rather than simply divided in half, so that the single-alignment mapping results exactly match what would have been output in the two-alignment case.
Identifying junction candidates
Next, breseq creates a list of potential new sequence junctions, which may indicate that distant genomic sites in the reference sequence are juxtaposed in the sample, from the lists of split-read alignments. For all reads where no one alignment spans 90% of the total read length, breseq examines all pairs of alignments reported between the read and the reference genome. Each alignment pair uniquely specifies a junction candidate: the new sequence that would exist in the sample DNA if two discontinuous regions of the reference genome were joined (Figures 2 and 3). This approach assumes that cases where individual reads from a sample genuinely map to three or more distinct locations in the reference genome will be extremely rare, which will generally be true for short-read data from samples that have not greatly diverged from the reference genome.
A junction candidate is defined by the endpoints of the first and second matches of the alignment pair to the reference sequence that are nearest the breakpoint (a2 and b1) and the directions each sequence extends away from the breakpoint in reference sequence coordinates (whether a2 > a1 and whether b2 > b1). Additionally, there may be overlap between the two alignments (V) because bases at the breakpoint in the read could be assigned to either location in the reference genome without changing the junction sequence (Figure 2a). If there are base mismatches or indels in the initial alignment of this overlap region to either reference location, then each reference alignment is trimmed back until it consists of only perfect matches to the reference genome in the overlap region (Figure 2b). Alternatively, there may be bases at the breakpoint that are unique to the read (U). That is, they are not contained within the alignment of either side of the read to the reference genome (Figure 3a). In summary, any junction candidate sequence can be fully specified by six junction description parameters: two reference positions, two orientations with respect to the reference sequence, the number of overlap bases, and the identities of any inserted bases that are in the read and not in the reference (Figures 2c and 3b).
To reduce the number of possible new junction candidates generated, breseq only considers pairs of split-read alignments that meet all of the following criteria:
-
One alignment must start at the first base of the read (a1 = r1).
-
If the length of the read is 50 bases or less, the other alignment must end on the final base of the read. For longer reads, the number of unaligned bases at the end of the read must be no more than 10% of the number of bases by which the total read length exceeds 50 bases (if L ≤50, then s2 = L; otherwise s2 ≥ L – 0.1 (L – 50)).
-
The portion of each alignment that does not overlap the other must span a number of bases in the read at least equal to 20% of its total length (min(s1, r2) – r1 + 1 ≥ 0.2 L and s2 – max (s1, r2) +1 ≥ 0.2 L).
-
The number of read bases in the overlap between the alignments must be no more than 12 plus 40% of the number of bases by which the read length exceeds 12 (if r2 > s1 + 1, then V = s1 – r2 – 1 ≤ 12 + 0.4 [L – 12]).
-
The number of bases unique to the read between the matches to reference sequence must be no more than 12 plus 40% of the number of bases by which the read length exceeds 12 bases (if s1 > r2 + 1, then U = s1 – r2 – 1 ≤ 12 + 0.4 [L – 12]).
-
Only alignment pairs with the maximum value of the number of bases spanned in the read by the two alignments (s2 – r1 + 1) that is observed over all alignment pairs for a given read are eligible to construct junction candidates.
For all alignment pairs for a read that pass these guards, a junction candidate sequence is constructed by taking the corresponding regions of the reference genome and joining them together, accounting for any overlap between the alignments or for any intermediate bases that are in the read but not in the reference genome. The junction candidate sequence is extended on each end by adding enough bases to ensure that if the longest read in the input data set mapped to the junction across its breakpoint it would have a better alignment score than the best match to anywhere in the reference genome. This step requires adding a number of bases from the reference to each side equal to one fewer than the maximum read length in the sample, minus the overlap (V) or read-only sequence (U) size.
While processing all reads with suitable split mappings to the reference sequence in this manner, the lists of predicted junctions with the exact same junction sequence and junction description parameters are merged into a running list of possibilities. So that two equivalent junction candidates will be merged together regardless of which DNA strand the original reads matched, junction sequences are reverse complemented, if necessary, so that the first side (left in the orientation of the junction sequence) is always described by the first reference sequence matched of those provided as input (in alphabetical sort order by sequence ID) or by the lowest reference coordinate matched if both sides align to the same reference sequence.
Because breseq uses read mapping options that report all alignments to sequence repeats in the reference genome, pairs of alignments for the same read that match different reference locations may yield equivalent junction sequences but different junction descriptions. For example, reads that map to the boundary of a new insertion of a mobile element that is multicopy in the original genome will have multiple alignments to the reference sequence for the portion overlapping the mobile element. Due to differences in the flanking bases at the boundaries of different copies of the multicopy repeat sequence, the same junction may also be described with a different number of overlapping bases or inserted read-only bases at the breakpoint for each of these possibilities. Therefore, breseq next collapses the list of junction predictions further: to the set of the shortest junction candidate sequences that are subsequences of other sequences or their reverse complements, effectively favoring those junctions with the least overlap or smallest number of unique read bases. In this merging step, it also prefers to keep junctions in which both ends map to the same reference sequence fragment (e.g., chromosome), rather than different ones, and in which the two ends match nearby coordinates if they are on the same reference sequence.
To determine the best merged junction candidates to further test, breseq next assigns a coverage evenness score to each one (Figure 4). This score is equal to the number of distinct start positions for alignments of reads that extend across the breakpoint far enough to unambiguously support the junction and not the reference sequence. That is, they must span any overlap or read-only bases in the junction sequence. If the junction is a short deletion of a few bases in the reference sequence, then it may be required to extend additional bases that are not accounted for in these values — a continuation length — in order to unambiguously support the junction and count toward the actual or possible coverage evenness score (Figure 5). Each read is counted as starting at the position in the reference genome where its first base matches, so alignments with the same start and end coordinates, but on opposite strands will each count toward this evenness score once.Reads overlapping a typical position in the reference genome will have start sites that are relatively evenly distributed before and after that position for reads that align to the top and bottom strands, respectively (Figure 4a). New junctions that are the result of mutations should similarly be supported by reads that cross the breakpoint in many different registers. In contrast, junction candidates resulting from sequencing artifacts tend to be supported by reads that align unevenly across a junction and may have low evenness scores even when they have very high coverage (Figure 4b). This situation can result from reads with systematic errors, like homopolymeric base calls at their ends, or from reads in the sample that are so significantly different from the reference sequence that they are no longer accurately mapped.
Finally, the list of junction candidates is sorted from high to low by this coverage evenness score and only the top candidates are saved according to an iterative decision procedure. At each step, all junctions with the next-highest score in the list are considered for inclusion. If the new total number of junction candidates accepted to this point, including these, would exceed some maximum (default: 5000) or the cumulative length of junction candidate sequences to this point would exceed a threshold value (default: 10% of the reference genome length), then these and all lower-scoring junctions are not retained, as long as some minimum number of junction candidates has already been accepted at this point (default: 100). In addition to these criteria, a junction candidate must have at least some minimum evenness score to be considered (default: 2).
Evaluating new junction evidence
Some sequencing reads may map reasonably well to the original reference genome but actually align better to a junction candidate sequence. This situation can arise, for example, when the junction sequence spans an insertion or deletion of several bases in the sample relative to the reference. For junctions between sequences that were distant in the original reference genome, this situation can also occur when a read is not initially mappable to both sides of a junction because it does not extend far enough past the breakpoint to seed and detect a match to one side or the other. To resolve and properly count these cases when evaluating junctions, breseq re-maps all input reads to the set of candidate junction sequences using Bowtie2 with the stringent criteria for reporting alignments that were used in the first stage of mapping to the reference genome. For each read, breseq compares the alignment scores of the best matches to junction sequences and the reference sequence. If a read matches the reference genome better than all junction candidates, then it is immediately assigned to that location and not considered further with respect to junctions. If it matches a junction or multiple junctions better than or equally as well as the reference genome, then the read is temporarily saved with those junctions in an unresolved state until evidence from all reads has been compiled.
To evaluate whether to call a genomic variant in the sample, one would like to determine whether the final collection of reads that align to a candidate junction resembles those found at a typical site in the reference genome. To this end, the coverage evenness score — equal to the number of unique reference positions matched by the first bases of reads extending past the breakpoint and any overlap or read-only junction bases (Figure 4) — is recalculated based on the re-aligned reads that match best to each junction. This score is now used to rule out junction candidates with unusually low coverage or coverage biased toward certain registers with respect to the breakpoint that may be the result of mapping artifacts, contaminating sequences, or failed reads. This general approach has been used to manually screen out false-positive predictions of gene fusions [31] and incorporated into the scoring scheme used by TopHat2 [12]. We score the evenness of read coverage across a predicted breakpoint to evaluate the support for a new sequence junction in the context of a statistical model of read coverage across a haploid genome.
First, to calibrate whether any given coverage evenness score is unusually low with respect to the expectation for a typical position in the reference genome, breseq fits the distribution of coverage read depth across the genome and determines the average chance that at least one read starts at a given position of the reference sequence. These parameters are estimated from the initial best mappings of reads to the reference genome before resolving candidate junctions. For short-read resequencing data, the depth of read coverage at different positions in the reference genome is fit well by an overdispersed Poisson (negative binomial) distribution [29]. breseq fits this distribution using unique-only reference genome positions (those not matched by any read that maps to multiple reference locations equally well) (Figure 6a). Before fitting, this data is left-censored at half the average coverage, to account for positions that are truly deleted in the sample but may have a small amount of residual coverage due to incorrect mapping of reads with errors or cross-contamination from sequencing similar genome samples without the deletion at the same time. The data is also right-censored at 1.5 times the average coverage so that fitting will be more robust against cases where failed sequencing reads may spuriously map to a small number of genomic locations, creating anomalously high coverage, and to cases where there are increases in copy-number in a sample relative to the reference across a significant portion of the genome. The negative binomial distribution is described by the mean coverage (μ
cov
) and a size parameter (α
cov
) reflecting the overdispersion. As coverage is tabulated, the number of unique-only positions with no reads beginning there that match a given strand (forward or reverse) is also tracked for each reference sequence. Dividing this total count by twice the number of unique-only positions in a reference sequence gives , the average chance that no read will be found to start at a given position extending across a breakpoint, i.e., the chance that a possible position where a read could have started will contribute to the evenness score for a junction.
After creating this model of the statistics of the evenness score, breseq iterates through all junction candidates, beginning with the one with the highest coverage evenness score, and calculates how extreme this score is relative to expectations to accept or reject the prediction. In cases where reads are currently unresolved because they matched equally well to more than one junction sequence, these reads are temporarily assigned to the current junction and included in its recalculated evenness score at this point. The possible values that the coverage evenness score can take for a junction are determined by the maximum read length, but when reads with different lengths are present in a data set, the average read length gives a better idea of what values of the evenness score to expect. In addition, junctions with overlap bases or unique read-only bases will have fewer possibilities for reads to be positioned such that they align better to the junction sequence than to the reference genome, and therefore contribute to the coverage evenness score (Figure 4). As described above, it is also possible – particularly for deletions or duplications of short sequence repeats – that a read aligning from one side of the junction would not be distinguishable as matching the junction better than the reference unless it aligned a certain number of additional bases past the breakpoint (Figure 5). To account for both of these considerations, the expected maximum possible evenness score (S
max
) for a particular junction candidate is revised downward by taking into account the overlap (a2 − b1 + 1) and the numbers of ambiguous continuation bases on side one (C1) and side two (C2):
(1)
The chance of observing a given evenness score at a typical reference position is calculated by integrating the chances that a certain depth of read coverage would occur and the chance that one would observe the given evenness score at that read depth given the chance that a position would not have any reads with alignments starting at that position on a given strand and would, thus, not contribute to the score. Therefore, breseq calculates a p-value for rejecting the null hypothesis that the start position and strand distribution of reads across each junction is typical of the rest of the genome according to the following equation:
(2)
where NegBinom (x, μ, α) is the probability of observing x events from the negative binomial distribution with mean μ and size parameter α, and BinomCDF (x, p, n) is the chance of x of fewer successes in n draws with probability p of a success each time. The probability of success in this case is the chance that a read will be observed beginning at a given position and on a given strand at an arbitrary position in the reference genome as a whole, estimated as described above. This calculation uses a value for this chance that scales with the current coverage in the summation relative to the average coverage of the reference sequence. If the two sides of the junction match two different reference sequences (which may have different coverage-evenness properties), then a p-value is calculated for each one and the junction is assigned the least-significant value.
The “evenness skew” for the junction candidate is defined as the negative base-10 logarithm of this p-value, and breseq rejects predictions with skew scores that exceed some threshold (default: 3.0). For accepted junctions, any unresolved read alignments that mapped equally well to other junctions are claimed by the accepted junction prediction and are not counted toward the score of alternative junctions they matched equally well when they are considered later. The top-scoring rejected junctions are saved for reporting as marginal predictions in the output.
For accepted junctions, ambiguous placement of the breakpoint with respect to the two disjoint locations of matches in the reference sequence due to overlap is assigned to one side or the other according to the following rules. First, if reference sequence annotations of type “repeat sequence” or “mobile element” are present, and the breakpoint is within a certain distance of the end of one of these features (default: 20 bases), then the breakpoint is shifted to exactly align to the end of the annotated feature. This may involve assigning all of the overlap bases near the breakpoint to one side, or some of these bases to both sides of the junction. Second, if one side of the junction was marked as having only repeat sequence alignments during the phase when candidate junction sequences were constructed, then all of the ambiguous bases at the breakpoint are assigned to the side that had only unique alignments. If both sides of the junction had unique alignments, then the side with the highest priority when sorting by reference sequence ID and then by coordinate, is assigned all overlap bases. Reads mapping across successful junctions are split at the breakpoint and the two pieces are added to the final file of alignments to the reference genome as separate matches so that they are counted correctly toward predictions of other mutations, such as base substitutions.
Evaluating missing coverage evidence
In addition to compiling new junction evidence for structural variants as detailed above, breseq includes a step where it examines read coverage for evidence of deleted chromosomal regions. To do this, it first identifies reference positions where there is zero coverage of both uniquely-mapped and multiply-mapped reads. These seed intervals are propagated outward and joined through areas where the total of the unique-only read coverage is below a cutoff value. This threshold is calculated from the negative binomial fit to the coverage distribution (discussed above) by choosing a coverage value that yields a left-tail probability of 0.05 divided by the square root of the length of the current reference sequence (Figure 6a). Note that this procedure may lead to extending and joining putative missing coverage intervals through regions where multiply-mapped reads align (Figure 6b). When these occur between areas where there is unique-only coverage below a certain threshold or no coverage at all, it is assumed that those regions are also part of a deletion because the reads mapping there match equally well to a region elsewhere in the genome that was not deleted. When a region with multiply-mapped reads occurs on the margin of a region of missing coverage, the endpoint of the predicted deletion on that side is considered ambiguous, unless resolved by integrating this evidence with junction predictions as described below.
Mutation annotation
breseq predicts and annotates several types of mutations that produce structural variation by considering new junction (JC) and missing coverage (MC) evidence together with reference sequence annotations of the locations of mobile elements (Figure 7a). This automated integration of information results in precise (i.e., down to the exact nucleotide) and biologically meaningful (e.g., insertion of a new copy of the transposable element IS150 with duplication of three target site base pairs) predictions of how a sequence is altered in the sample relative to the reference that would be laborious to reconstruct from the output of other programs that could potentially be used to predict SV in microbial genomes. The predictions by breseq include:
-
1.
Large deletions creating unique junctions (JC + MC evidence) – When the endpoints of a missing coverage evidence item exactly match the breakpoint in a junction evidence item with two uniquely aligned sides, the region between them is predicted as a deletion. Deletions of this kind can be caused by illegitimate recombination.
-
2.
Deletions between by two copies of a repeat element (MC evidence) – When the ends of a missing coverage evidence item both fall within regions of multiply-mapped read coverage that are annotated in the reference genome as copies of the same repeat element in the same orientation, then a deletion is predicted that includes the intervening region between the repeats and the second of the two repeat copies. These types of deletions may occur via homologous recombination.
-
3.
Short deletions, insertions, substitutions, and tandem amplifications (JC evidence) – When the two sides of a new junction evidence item are uniquely aligned and the breakpoint positions are located within a distance of each other in the reference genome that is smaller than the average read length, then they are resolved to predict the corresponding change. Note that deletions are predicted even without missing coverage evidence in this case because it is possible for one or a few spurious reads or read alignments to prevent the prediction of a very small missing coverage evidence item.
-
4.
New mobile element insertions (JC + JC evidence) – When two junction evidence items have unique endpoints on one side that are located within 20 base pairs of one another and match opposite sides of the same repeat element family annotated in the reference genome, then they are resolved to predict a new insertion of a mobile element. A target site duplication of several bases typically occurs when a new copy of a bacterial transposable element inserts into a genome. The size of this duplication is predicted from the locations of the junctions. Additional deletions of a few bases from the ends of the newly inserted element relative to its consensus sequence or insertions of a few bases within the target site duplication may be necessary to reconcile the exact breakpoint sequences in some cases (Figure 7b).
-
5.
Deletions mediated by mobile elements (JC + MC evidence) – These mutations are predicted when a missing coverage evidence item exists with one side overlapping a repeat sequence in the reference genome and a new junction evidence item matches the unique side of the missing coverage interval and the proximal side of any copy of the corresponding repeat family in the genome. In this case, a deletion is predicted that extends from the unique breakpoint on one side of the repeat element up to its boundary.
Missing coverage and accepted new junctions that are not resolved into predictions of precise mutational events are displayed as “unassigned evidence” in the HTML output for further evaluation by the user. For example, breseq currently does not attempt to resolve new junction evidence that could support tandem sequence duplications greater than the read length. It also does not attempt to use new junction evidence items in which both sides fall in repeated reference genome sequences to predict mutations. These cases and other more complex types of events can potentially be resolved by manually examining unassigned evidence items, read-depth coverage graphs, and the relative locations of repeated sequences in the genome to predict the most likely mutational event [21]. In addition, the top-scoring junctions that were rejected according to the statistical tests described above are listed on a separate page that contains “marginal evidence” to potentially aid in resolving other unassigned evidence items.