DB^{2}: a probabilistic approach for accurate detection of tandem duplication breakpoints using pairedend reads
 Gökhan Yavaş^{1},
 Mehmet Koyutürk^{2, 4},
 Meetha P Gould^{3},
 Sarah McMahon^{3} and
 Thomas LaFramboise^{3, 4, 5}Email author
DOI: 10.1186/1471216415175
© Yavaş et al.; licensee BioMed Central Ltd. 2014
Received: 12 June 2013
Accepted: 18 February 2014
Published: 5 March 2014
Abstract
Background
With the advent of pairedend high throughput sequencing, it is now possible to identify various types of structural variation on a genomewide scale. Although many methods have been proposed for structural variation detection, most do not provide precise boundaries for identified variants. In this paper, we propose a new method, Distribution Based detection of Duplication Boundaries (DB^{2}), for accurate detection of tandem duplication breakpoints, an important class of structural variation, with high precision and recall.
Results
Our computational experiments on simulated data show that DB^{2} outperforms stateoftheart methods in terms of finding breakpoints of tandem duplications, with a higher positive predictive value (precision) in calling the duplications’ presence. In particular, DB^{2}’s prediction of tandem duplications is correct 99% of the time even for very noisy data, while narrowing down the space of possible breakpoints within a margin of 15 to 20 bps on the average. Most of the existing methods provide boundaries in ranges that extend to hundreds of bases with lower precision values. Our method is also highly robust to varying properties of the sequencing library and to the sizes of the tandem duplications, as shown by its stable precision, recall and mean boundary mismatch performance. We demonstrate our method’s efficacy using both simulated pairedend reads, and those generated from a melanoma sample and two ovarian cancer samples. Newly discovered tandem duplications are validated using PCR and Sanger sequencing.
Conclusions
Our method, DB^{2}, uses discordantly aligned reads, taking into account the distribution of fragment length to predict tandem duplications along with their breakpoints on a donor genome. The proposed method fine tunes the breakpoint calls by applying a novel probabilistic framework that incorporates the empirical fragment length distribution to score each feasible breakpoint. DB^{2} is implemented in Java programming language and is freely available at http://mendel.gene.cwru.edu/laframboiselab/software.php.
Background
Structural variation is a class of genetic variation that includes insertions, inversions, translocations, deletions, and duplications of segments of DNA. Tandem duplications are serially repeated segments of the human genome which may have repeat units several hundred kilobases in size. Many studies have implicated tandem duplications in a variety of diseases. In one such study [1], it was shown that a subset of ovarian cancers share a marked tandem duplication phenotype with triplenegative breast cancers. An internal tandem duplication of the FLT3 gene (FLT3/ITD) is recurrent in acute myeloid leukemia (AML) and myelodysplastic syndrome (MDS) with frequencies of 20 and 315%, respectively [2, 3]. Additionally, 5% to 10% of patients with AML possess the rearrangement of the mixedlineage leukemia (MLL, also known as ALL1 or HRX) gene as the result of a partial tandem duplication (PTD) [4]. Germline tandem duplications have also been associated with human disease. In one recent study [5], it was shown that a patient and his halfsister with extensive polysyndactyly of the hands and feet, and craniofacial abnormalities carried identical 900kb tandem duplications of the Indian hedgehog (IHH) locus. Another study [6] reported a father and daughter, both with a history of compulsive overeating in childhood, carrying a small tandem duplication within exon 1 of the SNURF/SNRPN gene on chromosome 15. These studies underscore the need for computational methods for identifying tandem duplications.
Nextgeneration sequencing (NGS) technology was first used to detect structural variations by Korbel et al.[7]. In that study, the pairedend sequences of two samples' genomes were generated and the read pairs with discordant pairedend orientation and mapped distance were used to find basic structural variations. Subsequently, [8] used NGS to discover genome rearrangements in tumor DNA. The first genome that was wholly sequenced by a NGS platform was presented in [9], which reported several structural variations.
NGS data provides several sources of information from which methods may detect structural variation, including read depth, pairedend orientation, distance between mapped ends, and pairs where one end is “split” mapped or “oneend anchored” (i.e., its mate is not mapped). PEMer [10], BreakDancer [11], VariationHunter [12, 13], GASV [14], and GASVPro [15] use the orientation and the mapped distance between the read pairs to detect insertions, deletions, inversions, and/or translocations. CREST [16] is another method that utilizes split mapped reads as well as pairedend read orientation. The problem of finding novel insertions was also addressed using oneend anchored read pairs in another recent study [17]. In addition, EWT [18] and SegSeq [19] were developed for detecting the genomic regions that differ in copy number between individuals using the depth of single reads in sequence data. Currently, the most wellknown methods for detecting the tandem duplications (along with other types of variations) using just the pairedend NGS data include SVDetect [20], CNVer [21], SPANNER [22], inGAPsv [23], BreakDancer [11], GASV [14] and CREST [16].
For methods that use pairedend reads, an important factor is fragment length, since the two sequenced ends of each fragment will be separated by this length. However, the length of each fragment is not known precisely. Although many of the existing methods assume that fragment length is within a certain range for all fragments [12–14], they do not make use of important information contained in the distribution of these lengths when prioritizing among the predicted breakpoints of the structural variations. If the length of each fragment were known, one could use this information to precisely detect the boundaries of duplications. While precise lengths are not generally available, their general distribution can be derived empirically from concordantly mapped reads. Here a read pair is said to be concordantly mapped to the reference genome when the end with a lower mapping coordinate is aligned to the forward strand, the end with the higher mapping coordinate is aligned to the reverse strand (i.e., FR read pairs, where F and R refer to forward and reverse strands, respectively) and the distance between the mapped ends is within an expected range.
Motivated by this insight, here we propose a method Distribution Based detection of Duplication Boundaries (DB^{2}) that characterizes the distribution of fragment length empirically and utilizes this empirical fragment length distribution to predict the breakpoints of the tandem duplications at a very high resolution with high accuracy and low false positive rate. To the best of our knowledge, none of the existing methods developed for detecting any kind of structural variations utilizes this valuable information for predicting the breakpoints of detected variations. Although we focus on tandem duplications in this manuscript, the proposed framework can easily be extended to detect the boundaries of other structural variations as well.
DB^{2} uses the read pairs that are reported to be concordant by the alignment tool to deduce the empirical fragment length distribution, and the RF read pairs for discovering the tandem duplications along with their putative genomic breakpoint coordinates. To identify the tandem duplications, DB^{2} adopts the geometric representation of the putative breakpoints of a tandem duplication that induces a discordant read pair, which was first proposed in the design of GASV [14]. Our method then groups the RF read pairs that are likely to be induced by the same tandem duplication and uses the information extracted out of multiple read pairs along with the empirical fragment length distribution to precisely infer the putative breakpoints of the tandem duplications.
As a final step, we resolve the conflicts among the tandem duplications, which are caused by multiple distinct tandem duplications having overlapping boundaries, by applying an algorithm that relies on the maximum parsimony principle. After the most likely false positive tandem duplications are eliminated in this step, the set of conflictfree duplications are reported to the user. As we show via systematic computational experiments in the Results section, incorporation of fragment length distribution greatly improves our method's ability in fine tuning the breakpoints of identified duplications.
Results and discussion
Simulation procedure
For simulation testing, we have implemented an artificial pairedend read generator using the February 2009 assembly (Hg19) of the human reference genome. Our simulator generates pairedend read sequences that are similar to those of the Illumina/Solexa platform (see Materials and Methods section for details). To evaluate the performance of the proposed method, for each experiment, we inserted 1000 tandem duplications whose lengths (in bases) were drawn from a normal distribution, with a default standard deviation of 100 bp and default mean of 10 Kbp, into the reference genome. For the experimental evaluation of our algorithm, we used four criteria; precision, recall, F_{1}score and mean breakpoint mismatch. Precision is defined as the fraction of the number of true tandem duplications (true positives) among all tandem duplications identified by our algorithm (true positives and false positives). In order for a predicted (by our method or other methods) tandem duplication to be considered as a true positive, we required at least 50% mutual overlap of the real and the predicted tandem duplications. Recall is defined as the fraction of true positives among all tandem duplications in the donor genome (true positives and false negatives). F_{1}score is a commonly used aggregate metric in information retrieval that considers both precision and recall. It is defined as the harmonic mean of precision and recall. Mean breakpoint mismatch is defined as the average of total distances (in bp) between the predicted and the real start and end positions of the inserted tandem duplications.
Other methods used for comparison
We compared the performance of our algorithm with that of five other software packages designed to detect structural variations from pairedend NGS data: SVDetect [20], CNVer [21], Breakdancer [11], GASV [14] and CREST [16]. Note that the more recent version of GASV, GASVPro, is not included in the compared methods because it does not support the identification of the tandem duplications. Although SPANNER [22] and inGAPsv [23] are also able to detect tandem duplications, both of these methods were excluded from the experimental evaluation since SPANNER was not publicly available and inGAPsv was significantly outperformed by the other methods. For all the methods, we aligned the generated read pair sequences with BWA using the default parameters. The default parameters for CNVer, Breakdancer, CREST and GASV were used, whereas the default values of window_size and step_length parameters had to be slightly modified in SVdetect to obtain the best performance with the simulation data. We set these two parameters to 1000 and 500, respectively.
Several factors can affect any method’s ability to detect a tandem duplication: the average depth coverage of the experiment, the base call error rate, characteristics of the tandem duplications in the donor genome (such as the size of the tandem duplications), properties of the read library (including the distribution of the fragment lengths), and read length. For this reason, we tested the algorithms across various values of six parameters as discussed in the following sections.
Effect of base calling error rate on performance
The recall of our method and SVDetect are almost identical (Figure 2B), whereas CNVer, GASV, Breakdancer and CREST have drastically declining performances with increasing error rate. The decrease in the sensitivities of all methods can be explained by the fact that the alignment tool fails to align increasingly noisy RF reads. Thus, as the error rate goes up, the effective coverage goes down, and the evidence for the duplications gets weaker, which results in fewer predictions and hence fewer true positives. To validate this claim, we computed the mean number of the read pairs supporting each tandem duplication as the base calling error increases (Additional file 1: Figure S1). As shown in this figure, the support for each tandem duplication significantly decreases due to lower effective coverage as we increase the noise in the data. To assess the overall accuracy of the methods, we present the F_{1}score performance in Figure 2C. As mentioned before, F_{1}score evaluates the precision and recall performance of each method by aggregating them into a single value for each error rate level. As seen in Figure 2C, our method outperforms all the presented methods in terms of F_{1}score at each error rate.
Overall, DB^{2} provides the best F_{1}score, which represents the aggregate of precision and recall, along with a very good mean breakpoint mismatch that is tolerable as the noise in the data increases.
Effect of depth coverage on performance
Varying levels of coverage directly impact the amount of data available to each method. As shown in the above analysis, DB^{2} consistently achieves the best F_{1}score and recall performance, but has slightly worse mean breakpoint mismatch performance than that of CREST, even when the data availability is low (i.e., lower coverage levels). Considering the CREST's much lower recall and precision performances, DB^{2}'s average mismatch of 15 base pairs when identifying the boundaries of a tandem duplication is quite tolerable.
Effect of duplication size on performance
For this set of experiments, we increased the size of the tandem duplications starting from 2 Kbp up to 10 Kbp in 2 Kbp increments for each experiment setting. Almost all of the methods have a stable performance in terms of all metrics as we increase the size of each duplication inserted into the donor genome (Additional file 2: Figure S2 and Additional file 3: Figure S3). This is an expected result for DB^{2}, since as long as the fusion point of a tandem duplication is straddled by a read pair, DB^{2} will use this information to identify its breakpoints regardless of duplication size.
Effect of changing properties of the read library on performance
There are multiple important factors during the read library preparation phase of any NGS experiment that can affect the performance of a structural variation identification method. These include (but are not limited to) the distribution of the lengths of the fragments, and the read length.
In order to see the effects of these factors, we conducted a series of experiments by changing the values of read length and fragment length mean/standard deviation during the simulation data preparation. With the exception of CREST, we observe no significant effect on any method’s Recall, Precision and F_{1}score performance (Additional file 4: Figure S4, Additional file 5: Figure S5 and Additional file 6: Figure S6, respectively). CREST performs poorly in terms of recall for a read length of 50 bp, but then improves for larger read lengths (Additional file 6: Figure S6). In contrast, the precision performance of CREST first deteriorates as we enlarge the reads, and then stabilizes around 70%.
Lastly, we observe a poor performance for GASV in terms of mean boundary mismatch for small reads (again for similar reasons), whereas DB^{2}'s performance is very stable for all read lengths (Figure 6C). Indeed, as the read length decreases, the area of each trapezoid induced by a discordantly aligned read pair increases. Again, we overcome this difficulty by calculating a probability value for each predicted loci pair using the empirical fragment length distribution and reporting the one with highest probability. As seen in the results of these experiments, our method is very resilient to negative effects of changing properties of the read library in terms of all metrics.
Runtime and memory consumption comparison
Average runtime and memory consumption for compared methods
DB^{2}  SVDetect  CNVer  Breakdancer  GASV  CREST  

Run time (seconds)  142.55  368.26  168.95  180.56  403.58  1625.092 
Peak memory Usage (kb)  8601184  5161536  4615120  144784  5309072  201024 
Tandem duplications identified in two ovarian cancer genomes
Previouslyreported tandem duplications identified by our method (in Hg 19 coordinates)
Sample  Chromosome  Start Bp (reported)  End Bp (reported)  Start Bp (by DB^{2})  End Bp (by DB^{2}) 

TCGA130723  2  28681251  29521634  28663242  29521603 
TCGA240980  2  28887883  28900892  28887881  28912909 
TCGA240980  2  122915488  122919330  122915490  122923325 
Tandem duplications identified in a melanoma genome
Colo829 tandem duplications identified by our method and PCR/Sanger validated or previously reported (Hg 18) coordinates
Chromosome  Reported*/Sequencing validated start  Reported*/Sequencingvalidated end  Predicted start Bp  Predicted end Bp  Previously reported?* 

1  222713226  222866743  222713222  222866796  Yes 
7  104272303  104399536  104272363  104399571  Yes 
7  114317959  114318185  114317896  114318193  No 
16  80356160  80356702  80356082  80356669  No 
Conclusions
Tandem duplications are an important class of structural variation whose identification requires specialized algorithms. The algorithm that we propose here can identify tandem duplications with a very low false positive rate and a very low mean breakpoint mismatch (approximately 1520 bp), even in very noisy NGS datasets, without compromising sensitivity. As shown by systematic computational experiments on simulated data, DB^{2} achieves a precision of 99.6% and a recall of 77% even for an unusually noisy data (base call error rate 0.07). These results indicate that our method is not very susceptible to the effects of base calling errors in terms of making false tandem duplication predictions and false boundary detections. One other important aspect of our algorithm is that its performance is stable even when the properties of the sequencing library or the size of tandem duplications in the target genome change. This shows the suitability of our method across NGS experiments with different characteristics.
The key to the success of DB^{2} in accurate breakpoint localization is the utilization of the empirical fragment length to predict the most feasible breakpoint for a tandem duplication. As shown in Additional file 10: Figure S8, the distribution of the fragment lengths is generally not uniform in NGS experiments. Thus, given an everted (RF) read pair as the evidence for a tandem duplication, breakpoints of this duplication that indicate a higher frequency fragment length (hence higher probability for this fragment length to be observed) for this RF read pair, should have a higher probability than the others to be the real breakpoints. DB^{2} uses this novel idea to precisely determine the breakpoints of the tandem duplications. Note that neither GASV, nor its extended version GASVPro employs empirical fragment length distribution to probabilistically score the potential breakpoints of structural variations. They instead assume that the lengths of all fragments are within a predefined range, and based on this assumption estimate a (rather broad) range of equally likely breakpoints for identified duplications. In contrast, we use the empirical length distribution obtained from the concordantly aligned reads to assign a probability score to each feasible breakpoint, thereby enabling ranking of candidate breakpoints in terms of their likelihood of being the correct breakpoint. As detailed in the Results and Discussion, the use of the fragment length distribution gives our method the stability for accurate boundary prediction performance.
Our method also achieves a very high precision and recall performance, substantially outperforming the SVDetect and CNVer in terms of these two measures. Although Breakdancer and GASV achieve the best precision performance among all the methods, they perform at most only 1% better than DB^{2}, and are substantially outperformed in terms of recall. In terms of F_{1}score, our method outperforms all the other methods with increasing error rate and data coverage, showing the superiority of our method in identifying the largest set of true positive tandem duplications with the least number of false positives. Finally, the duplications identified in the two TCGA ovarian cancer samples and the COLO829 cell line confirm the applicability of DB^{2} to real datasets.
DB^{2} is freely available at http://mendel.gene.cwru.edu/laframboiselab/software.php. Efforts are underway to extend the methodology to detecting nontandem duplications, deletions and inversions.
Methods
Our method uses the BAM files that are generated by BWA [24], which aligns the FASTQformatted read pair files generated by the sequencer from the donor genome’s (i.e. the genome under interrogation) DNA. Everted (RF) read pairs are considered to be indicative of tandem duplications [25]. The RF read pairs are those that map to the reference genome in such a way that the end with a lower mapping coordinate is aligned to the reverse strand on a chromosome, and the other end is aligned to the forward strand at a higher coordinate on the same chromosome.
Set of potential breakpoints implicated by a single discordant read pair
Suppose that there exists a tandem duplication of the segment delimited by genomic coordinates x_{0} and y_{0}, denoted here as t?=?(x_{0}, y_{0}). We refer to the coordinates x_{0} and y_{0} as respectively the start and end breakpoints of the tandem duplication t, hence (x_{0}, y_{0}) is called a breakpointpair. If the i^{th} fragment (i ∈ M) straddles the fusion point, then the corresponding pair is expected to have an RF discordant mapping (owing to aberrant orientation, as explained in [25]) to positions s_{ i } and e_{ i } on the reference genome as shown in Figure 8.
 (i)
y _{ 0 } ≥ x _{0}?+?e _{ i } – s _{ i } – r – 1?+?l _{min},
 (ii)
y _{ 0 } ≤ x _{0}?+?e _{ i } – s _{ i } – r – 1?+?l _{max},
 (iii)
x _{0} ≤ s _{ i } and
 (iv)
y _{0} ≥ e _{ i }?+?r – 1
As seen in Figure 8, l_{i} is equal to the sum of the lengths of two segments in the reference genome, one delimited by y_{0} and e_{i} and the other delimited by x_{0} and s_{ i }?+?r – 1 (i.e., l_{ i }?=?(y_{0} – e_{ i }?+?1)?+?(s_{ i }?+?r –1 – x_{0}?+?1)?=?y_{0} – x_{0} – e_{ i }?+?s_{ i }?+?r?+?1). Since fragment length is variable, we do not know the value of l_{i}, but do only know its minimum and maximum possible values. Thus, we obtain l_{min} ≤ y_{0} – e_{ i }?+?s_{ i } – x_{0}?+?r?+?1 ≤ l_{max} which yields to the inequalities (i) and (ii). Furthermore, the two reads will flank the fusion point but not contain it. These two restrictions are expressed by the inequalities (iii) and (iv).
Detecting distinct putative tandem duplications
A donor genome will often harbor multiple tandem duplications. Furthermore, as depth coverage for a typical experiment increases, one would expect that more than one read pair straddling the fusion point of each tandem duplication will be produced during the sequencing of a donor genome. This gives us the opportunity to use multiple read pairs to predict the breakpoints of the tandem duplications more precisely because we have more statistical power and more information as more RF read pairs are induced by the same tandem duplication. However, this also necessitates the identification of multiple read pairs that are induced by the same tandem duplication.
Given M, r, l_{min} and l_{max}, we can take advantage of the fact that, if two RF read pairs i and j are induced by the same tandem duplication (for ease of notation, we now denote each read pair by its corresponding index), then the real coordinates of that duplication should lie in the intersection of the corresponding trapezoids W_{ i } and W_{ j }. It follows that a tandem duplication in the donor genome can be identified by finding the maximum subset, denoted by S, of the set of all aligned RF read pairs such that ∩?_{i?∈?S}W_{ i }?≠?∅ (i.e. all trapezoids corresponding to read pairs in S intersect in at least one point). In this case, we say that the tandem duplication t induces the RF pair set S. Thus, the problem of discovering multiple tandem duplications can be framed as the problem of finding the set S?=?{S_{1}, S_{2}, …, S_{ n }} where each read pair set S_{ k } ∈ S is induced by a unique tandem duplication t_{ k }.
In an ideal setting, two trapezoids associated with distinct sets S_{ q } and S_{ p } (q?≠?p) should not overlap, since no read pair can straddle two tandem duplications simultaneously (assuming that the tandem duplications do not overlap). Thus S is ideally a partitioning of the set of all RF read pairs into disjoint subsets (i.e., ${\cup}_{{\mathit{S}}_{\mathit{k}}\in \mathbf{S}}{\mathit{S}}_{\mathit{k}}=\mathit{M}$ and S_{ q }?∩?S_{ k }?=?∅ for all q?≠?k) such that all read pairs in each S_{ k } have corresponding trapezoids intersecting at least one point, and trapezoids corresponding to read pairs from two different S’s do not intersect. However, noisy sequence data (e.g. base call or alignment errors) can lead to imperfect partitioning of the read pair set. As such, we relax the condition requiring that the trapezoids induced by the same tandem duplication contain the breakpoint coordinates of duplication. Instead, we require that there is a mutual intersection between the trapezoids induced by the same duplication. Formally, we require that each S_{ k } satisfies the condition: ∀i ∈ S_{ k }, ∃j ∈ S_{ k }, such that i ≠ j and W_{ i } ∩ W_{ j } ≠ ∅.
An important step in our method for finding the partitioning S involves determining which trapezoids intersect a given trapezoid. To perform this operation quickly, we implement an R* tree [28] data structure, which is a variant of the R tree data structure [29] used for indexing spatial information. Rtrees are hierarchical data structures, which are used for the dynamic organization of a set of multidimensional geometric objects by representing them with the minimum bounding multidimensional rectangles. DB^{2} builds an R* tree using the Java implementation freely available at [30] to index all of the trapezoids of M, and uses this data tree to identify the trapezoids that intersect a given trapezoid. In our experimental evaluation, we have observed that using R* trees for intersection identification is computationally more efficient compared to a naive method, which would check all the trapezoids in M for intersection.
To find the disjoint sets of intersecting trapezoids, we use a method similar to that used for finding the connected components of an undirected graph [31]. Namely, we implement a breadthfirst search (BFS) like algorithm, which starts with an arbitrary trapezoid, i, finds all trapezoids that intersect with i, and then iteratively finds all trapezoids that intersect with these trapezoids. This procedure discovers the entire connected trapezoid set containing i before it returns. Next, it assigns the newly found connected trapezoid set into a set S_{ k } (where initially k =1) and M is updated as M?=?M \ S_{ k } and k?=?k?+?1. Then the same procedure is repeated for the updated M until M becomes empty. The set of tandem duplications, T?=?{t_{1}, t_{2}, …, t_{ n }} corresponding to the set S?=?{S_{1}, S_{2}, …, S_{ n }} of connected trapezoids represents our algorithm’s final set of predicted tandem duplications. At this stage, the tandem duplication breakpoints are not yet precisely defined. Optimally determining these breakpoints is the next step.
Set of potential breakpoints implicated by multiple discordant read pairs
After we determine the set of distinct tandem duplications, T, and the set, S_{ k }, of RF read pairs induced by each tandem duplication, the next step is to estimate the start and end breakpoint sites of each t_{ k }. Ideally, the set of candidate breakpoints would be the intersection of all trapezoids corresponding to the read pairs in S_{ k } . However, due to sequencing and mapping errors, this intersection is often empty. For this reason, we consider the set of breakpoints that are supported by the maximum number of RF pairs as candidate breakpoints. In other words, we define Ω_{ k } as the set of all coordinates in the CxC plane that are contained by the maximum number of trapezoids corresponding to read pairs in S_{ k }. The set Ω_{ k } for each t_{ k } is the set of candidate breakpointpair coordinates for the corresponding tandem duplication.
Scoring candidate breakpoints based on the observed distribution of fragment length
Once we identify the set of candidate breakpointpairs for each tandem duplication, the final step is to score and rank these candidate breakpointpairs. For this purpose, we introduce a probabilistic model that makes use of the empirical distribution of fragment length.
In order to motivate the proposed approach, we first consider the case when only a single RF read pair, say the i^{th} pair, is induced by a tandem duplication. Recall that W_{ i } denotes the set of all possible genomic coordinates delimiting the tandem duplication that induces the i^{th} RF read pair. Now define P[(x, y)  i] (where (x, y) ∈ W_{ i }) as the probability of this tandemly duplicated segment being delimited by base positions x and y, given only the i^{th} RF read pair and the empirical fragment length distribution. If the distribution of fragment length, L, was uniform, then all the genomic coordinates in W_{ i } would have the same probability of being the true breakpointpairs. However, in practice, we know that fragment length is not uniformly distributed. This can be seen, for example, in the COLO829 cell line data [27] (Additional file 10: Figure S8).
where σ_{ i }(x, y)?=?P_{ L } [L?=?y − e_{ i }?+?s_{ i } − x?+?r?+?1] is based on the empirical fragment length distribution.
Conflict resolution among tandem duplications
After the set of all distinct tandem duplications, T, is identified along with their coordinates, it is possible that some of the predicted duplications overlap with each other in terms of their boundaries. In such a case, we say that the tandem duplications are conflicting with each other and the conflict is likely caused by false positive tandem duplications that are the results of the noisy data. Therefore, a conflict resolution procedure is needed to find the subset of the tandem duplications out of T, containing only nonoverlapping duplications that are possibly the true positives. Toward this end, we employ a simple idea based on the maximum parsimony principle. Namely, we assume that the true tandem duplications existing in a donor genome do not overlap; hence, the duplications that overlap with most of the other predicted duplications are falsely identified.
To obtain the true positive set, we use a greedy approach. Starting with T, we eliminate the tandem duplication that overlaps with most of the duplications in T to obtain a subset T′ of T. We then check if there is still any conflict in the new set of tandem duplications, T′. If there is no conflict, DB^{2} reports T′ as the final set of tandem duplications. Otherwise, the procedure is iterated until there is no conflict left.
Data generation for simulation experiments
We have implemented a freely available NGS data generator [32]. Our data generator first selects a userdefined number of base positions uniformly at random on the reference chromosome provided by the user. These randomly selected positions mark the starting point of each tandem duplication. Next, the size of each duplication is drawn from a normal distribution, whose mean and standard deviation are defined by the user. For our simulations, we have used 10 Kbp and 100 bp as the default mean and standard deviation, respectively, and simulate 1000 tandem duplications for each experiment. After determining the start and end breakpointpair for each duplication, our data generator inserts an exact copy of the genomic segment delimited by these two coordinates, right after the end breakpoint to spike in the tandem duplication.
We then select a userdefined number (which is computed according to the userdefined depth of coverage) of base positions v_{1}, v_{2}, …, v_{ u } on the genome as the start location of each read pair. Subsequently, left and right ends of the i^{th} read pair are generated as follows. A “read” of r bases (in the current study, we use r?=?75 as the default value of read length) starting from selected base position is extracted from the reference genome in the forward direction. This sequence forms the left end of the read pair. For generating the right end, our simulator first selects an l_{ i } value from a normal distribution L (with a default mean value of 200 and default standard deviation of 10). Note that the empirical length distribution of the pairedend reads obtained from the COLO829 cell line [27] is similar to this setting. The start locus of the right end on the reverse strand is determined as v_{ i }?+?l_{ i }. The right end read is formed by reading r bases of the reverse strand of the genome in the reverse direction (i.e., read direction is from right to left and the bases in the right end sequence are the complementary bases of the forward strand of the genome). During the read generation process, we replace the base at each locus with a randomly selected base with a userdefined probability value (i.e., base call error rate) to simulate the sequencing errors.
Abbreviations
 NGS:

Nextgeneration sequencing
 BAM:

Binary alignment/map
 PCR:

Polymerase chain reaction
 BFS:

Breadthfirst search.
Declarations
Acknowledgements
Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Awards R25CA094186, R01CA131341, the National Science Foundation under Award IIS0916102, and the American Cancer Society under award 123436RSG1215901DMC.
Authors’ Affiliations
References
 McBride DJ, Etemadmoghadam D, Cooke SL, Alsop K, George J, Butler A, Cho J, Galappaththige D, Greenman C, Howarth KD, Lau KW, Ng CK, Raine K, Teague J, Wedge DC, Caubit X, Stratton MR, Brenton JD, Campbell PJ, Futreal PA, Bowtell DD, Cancer Study Group AO: Tandem duplication of chromosomal segments is common in ovarian and breast cancer genomes. J Pathol. 2012, 227: 446455. 10.1002/path.4042.PubMed CentralPubMedView ArticleGoogle Scholar
 Nakao M, Yokota S, Iwai T, Kaneko H, Horiike S, Kashima K, Sonoda Y, Fujimoto T, Misawa S: Internal tandem duplication of the flt3 gene found in acute myeloid leukemia. Leukemia. 1996, 10: 19111918.PubMedGoogle Scholar
 Yokota S, Kiyoi H, Nakao M, Iwai T, Misawa S, Okuda T, Sonoda Y, Abe T, Kahsima K, Matsuo Y, Naoe T: Internal tandem duplication of the FLT3 gene is preferentially seen in acute myeloid leukemia and myelodysplastic syndrome among various hematological malignancies. A study on a large series of patients and cell lines. Leukemia. 1997, 11: 16051609. 10.1038/sj.leu.2400812.PubMedView ArticleGoogle Scholar
 Schichman SA, Caligiuri MA, Gu Y, Strout MP, Canaani E, Bloomfield CD, Croce CM: ALL1 partial duplication in acute leukemia. Proc Natl Acad Sci U S A. 1994, 91: 62366239. 10.1073/pnas.91.13.6236.PubMed CentralPubMedView ArticleGoogle Scholar
 YukselApak M, Bögershausen N, Pawlik B, Li Y, Apak S, Uyguner O, Milz E, Nürnberg G, Karaman B, Gülgören A, Grzeschik KH, Nürnberg P, Kayserili H, Wollnik B: A large duplication involving the IHH locus mimics acrocallosal syndrome. Eur J Hum Genet. 2012, 20: 639644. 10.1038/ejhg.2011.250.PubMed CentralPubMedView ArticleGoogle Scholar
 Naik S, Thomas NS, Davies JH, Lever M, Raponi M, Baralle D, Temple IK, Caliebe A: Novel tandem duplication in exon 1 of the SNURF/SNRPN gene in a child with transient excessive eating behaviour and weight gain. Mol Syndromol. 2012, 2: 7680.PubMed CentralPubMedGoogle Scholar
 Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders AC, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M: Pairedend mapping reveals extensive structural variation in the human genome. Science. 2007, 318: 420426. 10.1126/science.1149504.PubMed CentralPubMedView ArticleGoogle Scholar
 Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PA, Bignell GR, Stratton MR, Futreal PA: Identification of somatically acquired rearrangements in cancer using genomewide massively parallel pairedend sequencing. Nat Genet. 2008, 40: 722729. 10.1038/ng.128.PubMed CentralPubMedView ArticleGoogle Scholar
 Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456: 5359. 10.1038/nature07517.PubMed CentralPubMedView ArticleGoogle Scholar
 Korbel JO, Abyzov A, Mu XJ, Carriero N, Cayting P, Zhang Z, Snyder M, Gerstein MB: PEMer: a computational framework with simulationbased error models for inferring genomic structural variants from massive pairedend sequencing data. Genome Biol. 2009, 10: R2310.1186/gb2009102r23.PubMed CentralPubMedView ArticleGoogle Scholar
 Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER: BreakDancer: an algorithm for highresolution mapping of genomic structural variation. Nat Methods. 2009, 6: 677681. 10.1038/nmeth.1363.PubMed CentralPubMedView ArticleGoogle Scholar
 Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC: Combinatorial algorithms for structural variation detection in highthroughput sequenced genomes. Genome Res. 2009, 19: 12701278. 10.1101/gr.088633.108.PubMed CentralPubMedView ArticleGoogle Scholar
 Hormozdiari F, Hajirasouliha I, Dao P, Hach F, Yorukoglu D, Alkan C, Eichler EE, Sahinalp SC: Nextgeneration VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics. 2010, 26: 350357. 10.1093/bioinformatics/btq216.View ArticleGoogle Scholar
 Sindi S, Helman E, Bashir A, Raphael BJ: A geometric approach for classification and comparison of structural variants. Bioinformatics. 2009, 25: 222230. 10.1093/bioinformatics/btp208.View ArticleGoogle Scholar
 Sindi SS, Onal S, Peng LC, Wu HT, Raphael BJ: An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol. 2012, 13: R2210.1186/gb2012133r22.PubMed CentralPubMedView ArticleGoogle Scholar
 Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, Rusch MC, Chen K, Harris CC, Ding L, Holmfeldt L, PayneTurner D, Fan X, Wei L, Zhao D, Obenauer JC, Naeve C, Mardis ER, Wilson RK, Downing JR, Zhang J: CREST maps somatic structural variation in cancer genomes with basepair resolution. Nat Methods. 2011, 8: 652654. 10.1038/nmeth.1628.PubMed CentralPubMedView ArticleGoogle Scholar
 Hajirasouliha I, Hormozdiari F, Alkan C, Kidd JM, Birol I, Eichler EE, Sahinalp SC: Detection and characterization of novel sequence insertions using pairedend nextgeneration sequencing. Bioinformatics. 2010, 26: 12771283. 10.1093/bioinformatics/btq152.PubMed CentralPubMedView ArticleGoogle Scholar
 Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19: 15861592. 10.1101/gr.092981.109.PubMed CentralPubMedView ArticleGoogle Scholar
 Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: Highresolution mapping of copynumber alterations with massively parallel sequencing. Nat Methods. 2009, 6: 99103. 10.1038/nmeth.1276.PubMed CentralPubMedView ArticleGoogle Scholar
 Zeitouni B, Boeva V, JanoueixLerosey I, Loeillet S, Legoixné P, Nicolas A, Delattre O, Barillot E: SVDetect: a tool to identify genomic structural variations from pairedend and matepair sequencing data. Bioinformatics. 2010, 26: 18951896. 10.1093/bioinformatics/btq293.PubMed CentralPubMedView ArticleGoogle Scholar
 Medvedev P, Fiume M, Dzamba M, Smith T, Brudno M: Detecting copy number variation with mated short reads. Genome Res. 2010, 20: 16131622. 10.1101/gr.106344.110.PubMed CentralPubMedView ArticleGoogle Scholar
 1000 Genomes Project Consortium: A map of human genome variation from populationscale sequencing. Nature. 2010, 467: 10611073. 10.1038/nature09534.View ArticleGoogle Scholar
 Qi J, Zhao F: inGAPsv: a novel scheme to identify and visualize structural variation from paired end mapping data. Nucleic Acids Res. 2011, 39 (Web Server issue): W567W575.PubMed CentralPubMedView ArticleGoogle Scholar
 Li H, Durbin R: Fast and accurate longread alignment with BurrowsWheeler transform. Bioinformatics. 2010, 26: 589595. 10.1093/bioinformatics/btp698.PubMed CentralPubMedView ArticleGoogle Scholar
 Alkan C, Coe BP, Eichler EE: Genome structural variation discovery and genotyping. Nat Rev Genet. 2011, 12: 363376. 10.1038/nrg2958.PubMed CentralPubMedView ArticleGoogle Scholar
 Oesper L, Ritz A, Aerni SJ, Drebin R, Raphael BJ: Reconstructing cancer genomes from pairedend sequencing data. BMC Bioinforma. 2012, 13 (6): S10View ArticleGoogle Scholar
 Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, Greenman CD, Varela I, Lin ML, Ordóñez GR, Bignell GR, Ye K, Alipaz J, Bauer MJ, Beare D, Butler A, Carter RJ, Chen L, Cox AJ, Edkins S, KokkoGonzales PI, Gormley NA, Grocock RJ, Haudenschild CD, Hims MM, James T, Jia M, Kingsbury Z, Leroy C, Marshall J, Menzies A, et al: A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2010, 463: 191196. 10.1038/nature08658.PubMed CentralPubMedView ArticleGoogle Scholar
 Beckmann N, Kriegel HP, Schneider R, Seeger B: The R*tree: an efficient and robust access method for points and rectangles. Proceedings of the ACM SIGMOD: May 2325, 1990. Edited by: Hector GM, Jagadish HV. 1990, Atlantic City: ACM Press, 322331.Google Scholar
 Guttman A: RTrees: a dynamic index structure for spatial searching. Proceedings of the ACM SIGMOD. Edited by: Beatrice Yormark . 1984, Boston: ACM Press, 4757.Google Scholar
 R* tree source code download page. http://www.chorochronos.org/sites/default/files/algorithms/Rstarjava.zip,
 Hopcroft J, Tarjan R: Efficient algorithms for graph manipulation. Commun ACM. 1973, 16: 372378. 10.1145/362248.362272.View ArticleGoogle Scholar
 LaFramboise Laboratory Software Website. http://mendel.gene.cwru.edu/laframboiselab/software.php,
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.