Unraveling overlapping deletions by agglomerative clustering
- Roland Wittler^{1}Email author
Affiliated with
DOI: 10.1186/1471-2164-14-S1-S12
© Wittler; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
Advertisement
DOI: 10.1186/1471-2164-14-S1-S12
© Wittler; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
Structural variations in human genomes, such as deletions, play an important role in cancer development. Next-Generation Sequencing technologies have been central in providing ways to detect such variations. Methods like paired-end mapping allow to simultaneously analyze data from several samples in order to, e.g., distinguish tumor from patient specific variations. However, it has been shown that, especially in this setting, there is a need to explicitly take overlapping deletions into consideration. Existing tools have only minor capabilities to call overlapping deletions, unable to unravel complex signals to obtain consistent predictions.
We present a first approach specifically designed to cluster short-read paired-end data into possibly overlapping deletion predictions. The method does not make any assumptions on the composition of the data, such as the number of samples, heterogeneity, polyploidy, etc. Taking paired ends mapped to a reference genome as input, it iteratively merges mappings to clusters based on a similarity score that takes both the putative location and size of a deletion into account.
We demonstrate that agglomerative clustering is suitable to predict deletions. Analyzing real data from three samples of a cancer patient, we found putatively overlapping deletions and observed that, as a side-effect, erroneous mappings are mostly identified as singleton clusters. An evaluation on simulated data shows, compared to other methods which can output overlapping clusters, high accuracy in separating overlapping from single deletions.
It is well known that mutations in the human genome are associated to diseases such as cancer. Besides small scale mutations like single nucleotide variants, larger events such as deletions, insertions, inversions, or inter-chromosomal rearrangements can have a crucial impact on the initiation and development of cancer. The detection and analysis of these structural variations play an important role in understanding the underlying mechanisms of cancer, its diagnosis and treatment [1–4].
Current sequencing technologies allow to obtain high data volumes at low cost. It has now become affordable to sequence several samples of the same patient, enabling comparative analyses of, e.g., tumor cells versus healthy blood cells, or samples taken before versus after treatment, to distinguish tumor from patient specific variations, or to observe structural variations over time [5, 6].
In the analysis of such complex data, it is important to consider heterogeneity of various kinds [7]. Besides the differences between several tissues or time points, in cancer one always has to face heterozygosity (mutations only affecting one allele), loss of heterozygosity (mutation inactivating the second allele), aneploidy (different copy numbers for some chromosomes), copy number alterations (different copy numbers for parts of chromosomes), differentiation of tumor cell lines developing different variations, etc. An additional challenge is that a cancer sample is most likely a mixed sample, i.e., although taken from tumor tissue, it usually contains also normal cells [2, 8, 9].
For the detection of single nucleotide variants (SNVs), there exist several approaches, some of which address the above issues. For instance, SomaticSniper [10], JointSNVMix [11] and MutationSeq [12] call somatic SNVs from pairs of normal and tumor samples, where the first two methods follow a Bayesian approach to distinguish tumor from patient specific SNVs, and the latter builds on clustering by support vector machines. Strelka [13] explicitly models mixtures of tumor and normal cells and can also call small indels. Also, several tools exist to accurately detect SNVs in pooled data [14–17], even mutations of low abundance. Apart from analyzing single SNVs, also haplotype inference and assembly has been addressed [18–20]. For the analysis of gene expression data, also Bayesian approaches have been proposed, even considering subtypes of cancer [21].
In contrast to the analysis of SNVs, for the detection of somatic deletions and other larger structural variations, one usually has to process the different samples separately and to compare the results afterwards, e.g., subtract deletions found in the healthy sample from those found in the tumor sample. Or one can pool the data and afterwards only select those calls solely based on tumor data [22]. Only recently, joint analysis of several data sets have been proposed [5, 6].
As shown in a preliminary study [5] to detect deletions by a combined analysis of samples from tumor and healthy tissue, there were regions in the tumor genome for which existing tools predicted more deletions than there could actually be on a diploid genome. When instead two diploid sets of chromosomes were assumed, i.e., the tumor sample is actually a mixture of cancerous and healthy cells, almost all data could be explained consistently, by explicitly modeling heterozygous deletions on different alleles to be overlapping. These observations have been made particularly in regions where deletions were found in the healthy cells and additional deletions have been predicted in the tumor sample - thus in regions where it is especially difficult to distinguish cancer from patient specific mutations. However, the scope of the presented method has been to show that a consistent scenario of overlapping deletions can be found. It greedily constructs a "possible" solution, without raising the claim of reporting a "reasonable" result. Furthermore, the model is restricted to the very specific case of analyzing a mixture of cancer and normal cells, including some additional technical, combinatorial assumptions.
In this paper, we present a method to detect deletions that is particularly designed to handle overlapping deletions. For the sake of flexibility, no assumption on the composition of the data is made - whether it is just from one sample or pooled data, it contains different cell lines, aneploid cells, etc. Being aware of such heterozygosity and specially designed for such data, we refrain from predicting a deletion being "heterozygous or homozygous", or "tumor or patient specific". Instead, besides a tabular listing of the results, a rich visual output for each set of overlapping deletions is provided, allowing an easy inspection of the results.
The method takes as input paired ends that have been mapped to a reference genome and collects those mappings likely originating from the same deletion in clusters. Agglomerative clustering is utilized to cluster mappings by similarity. The similarity score is based on a three-dimensional representation of mappings and deletions similar to the two-dimensional representation introduced by Dew et al. [23] and used for structural variation detection by Sindi et al. [22] in the tool GASV. With this representation, we obtain a score that captures both the putative location and size of the deletion.
We applied our method to a data set from several samples taken from an acute lymphoblastic leukemia patient. Besides examples for predicted overlapping deletions, we find that single, putatively erroneous mappings not assigned to another cluster can nicely be identified as outliers. Since overlapping deletions are rare events and their verification is difficult, we performed an evaluation on simulated data showing good accuracy.
After providing the necessary background in the following section, we introduce our approach in Section "Method". In Section "Results and discussion", we present a simulation-based evaluation and results on real data, before we conclude our study.
We will now give a brief overview of existing approaches to identify structural variations and then introduce the technique of paired-end mapping, which our method is based on.
Besides different experimental techniques, there are many computational approaches for structural variation detection [24]. A straight forward idea to detect mutations would be to fully assemble the genome under consideration, the so-called donor genome, and to align it to a reference sequence. To save the time and cost intensive assembly and finishing steps which would be necessary to determine the full genome sequence, one usually follows other approaches. Instead of performing a full assembly, one can restrict the process to only reads from regions suspect to harbor a variation, as for instance done in [25]. The tool fermi [26] allows both full assembly and a pre-filtering for reads unique to one sample. Other recent methods simultaneously assemble several genomes into a single graph data structure allowing for variation calling [27, 28].
Another, more common approach is to omit any assembly of the donor genome and instead utilize the reads directly to detect differences by mapping them to the finished reference sequence. Basically, there are three classes of methods to identify structural variations from those mappings. (See [24] or [29] for reviews.) (1) Significant fluctuations of the coverage of the reference by mappings can indicate copy number changes. If part of a chromosome is lost or duplicated, the coverage drops or increases, respectively. (2) If a read has not been mapped completely, but in parts, this can indicate different types of mutations. E.g., if one half of the split read is mapped with some space to the remaining part, the segment in-between might be absent in the donor genome. If parts are mapped to different chromosomes, this can indicate inter-chromosomal rearrangements. (3) If the donor genome has been sequenced using a certain technology, pairs of reads (paired ends) are obtained. The orientation and the distance of the reads within a pair are known in the donor, and when the corresponding mappings on the reference do not agree with this pattern, structural variations can be called similar to the split read approach. This paired-end mapping technique, introduced by Korbel et al. [30], will be explained in detail in the following section.
Since all techniques have their advantages and disadvantages, in general, a combination of different techniques is advisable. Different tools can be applied separately and the results are combined in a post-processing step. Also tools exist which combine techniques already during the detection, e.g. inGAPsv [31], CNVer [32], GASVPro [33], SVseq2 [34], or the method by Nord et al. [35]. The method presented here is based on paired-end mapping. It is particularly designed for regions harboring several overlapping deletions. Due to the complexity of such regions, a partial assembly or coverage analysis is hardly possible. However, integrating a split read approach is planned for future work.
When genomic material from a sample is sequenced by short-read sequencing, many overlapping DNA-fragments are produced and a certain number of bases are read from both ends of each fragment, resulting in a pair of reads, so-called paired ends. The reading direction and the length of the reads are known. Further, since the fragment size is fixed - in practice distributed around the desired length - their approximate distance is known as well.
The paired ends from a newly sequenced donor genome can be mapped to a reference genome which is already assembled to a complete DNA-sequence. In a region where the two genomes do not differ, the mapped reads have the original direction and their distance coincides with the fragment length. Such a mapping is called concordant. If however a mapping is discordant, i.e., either the orientation is inconsistent or the distance differs significantly from the expected fragment size, this indicates a structural variation in the donor w.r.t. the reference.
In this paper, we only focus on deletions, i.e., a segment present in the reference is not present in - we say deleted from - the donor genome. Hence, paired ends spanning the deletion breakpoint in the donor genome will be mapped to the reference with a distance increased by the size of the deletion but with proper orientation. We call such a mapping stretched.
In a region that possibly harbors several deletions, we obtain paired ends from the different alleles which are then all mapped to the same reference sequence. Depending on how similar the deletions are in terms of location and size, it is difficult to separate the mappings and to recover the different deletions. Our goal is to partition the set of mappings into clusters of similar mappings probably belonging to the same deletion. To this end, we utilize the technique of agglomerative clustering, which in general works as follows. At the beginning, each object (in our case stretched mapping) is a singleton cluster, and a similarity score is computed for all pairs of clusters. Then, iteratively, a pair of clusters of maximum similarity is merged, the two original clusters are replaced by this new cluster, and the similarity between the new cluster and all others is recomputed. These merging steps are repeated until either only one cluster is left, or the maximum similarity is below a certain threshold.
In the following, we will first explain how we model deletion clusters, and then introduce our similarity score, which is crucial for accurate clustering. Finally, we give some details about our implementation.
For our similarity score, we build on this two-dimensional representation of deletion coordinates, as it nicely incorporates both the location and the size of a deletion. Again, we use the x- and y-axis for the deletion start and end coordinates, respectively. But to model the deletion size, instead of the sharply bounded interval, we use a continuous distribution. As mentioned before, assuming that the fragment length follows a normal distribution with mean μ and standard deviation σ, the expected deletion size implied by a mapping with distance d is normally distributed around d - μ with the same standard deviation σ. In the geometric interpretation, for each possible start coordinate for a deletion on the x-axis, this distribution describes the expected end coordinate on the y axis. Since the deletion has to be located in-between the mappings of the paired ends, these, together with the main diagonal, define a triangle of "allowed" deletion coordinates, which is covered in the third dimension by the normal distribution as shown in Figure 3(b). In this diagram and all following, the probability will be represented by shaded gray toning.
With our choice of l and r as the maximum and minimum, respectively, we follow an "average link" strategy as it corresponds to taking the intersection of the triangles. This is the most intuitive way since the new cluster should represent the "area" of deletion coordinates to those which are compatible to both of the original clusters. The equation for the joint standard deviation is borrowed from population-based statistics for aggregating non-overlapping sub-populations. If the two clusters agree in the location of the deletion, i.e., μ _{1} - μ _{2} = 0, the resulting is the mean of the two standard deviations. The more the two clusters disagree, the broader becomes the distribution. For reasons we will explain later, this way of joining the deviations turned out to be better suitable for our similarity score and the resulting performance in clustering than using the actual joint variance.
This score has the following properties.
The score is zero if and only if the triangles do not overlap, which means that there is no location for a deletion compatible with both clusters.
The score is one if and only if the two clusters are equal, because, only if all parameters of the two clusters are equal, the intersection volume equals the maximum.
It is more sensitive to differences in the deletion size than in the location. The reason for this is that a shift parallel to the ridge of the volume does not affect the intersection volume to such an extent as a shift perpendicular to it. This is a particularly desired behavior since we expect a deletion breakpoint being covered by several mappings. In some mappings the breakpoint lies more to the left and in others more to the right, which - even if all mappings are correct - corresponds to staggered triangles. Such triangles, even if staggered, would have to be clustered and should thus not be punished too hard by a low score. In contrast, a difference in the deletion size indicates a disagreement of the mappings.
We now come back to the issue how to combine the standard deviations of several mappings. As already mentioned, we did not choose the more intuitive combination by computing the standard deviation of the means, i.e. , where σ is the standard deviation of the fragment length. This results in narrower distributions for clusters containing more mappings. On the one hand, this would make sense since the prediction is more accurate, the more mappings support it. On the other hand, besides computational problems due to score values close to zero, this also yields an artifact: Clusters containing only a few mappings have a broader volume and are thus more likely to be clustered than larger clusters with a narrow volume. The order in which clusters are aggregated would be dominated by their cardinality, rather than by their similarity. To avoid this, we chose our definition of to be independent of the cardinality of the clusters.
It remains to describe the stopping criterion for the clustering process. Since the normal distribution is never exactly zero, any two clusters that overlap in their base (triangle) have a non-empty intersection volume, no matter how weak the overlap is. We thus set a minimum threshold S _{ min } for the similarity score. If no pair of clusters has similarity larger than S _{ min }, the clustering process is stopped. To avoid the introduction of a parameter which is arbitrarily fixed or has to be given by the user, we examine the data in a preprocessing step to determine a threshold that separates Significant similarity values from noise. For each mapping, we determine the smallest non-zero similarity score to any other (overlapping) mapping. Setting S _{ min } to the median of these minima showed robust accurate behavior in practice, also approved in our experiment as explained at the end of "Simulation of overlapping deletions".
In general, the run time complexity of agglomerative clustering depends on the similarity score. Using priority queues to store the scores, O(N ^{2} log N) pairwise score computations have to be performed to cluster N objects. Our score can be computed as follows. In each triangle, the height of the volume is constant on a line parallel to the main diagonal and thus parallel to the hypotenuse of the triangle. This allows us to compute the volume by traversing a triangle (or the intersection of two triangles, which is again a triangle) starting from its hypotenuse towards the upper left corner, and summing up the product of the length of the line and the height of the volume given by the normal distribution (or the minimum of two, respectively). Since this takes O (r - l) time and the score computation consists of a constant number of volume computations, the overall run time complexity of the agglomerative clustering is in O (L N ^{2} log N), where L is the maximum length of a mapping.
In practice, the run time is dominated by reading the input and thus approximately linear in the number of mappings.
We implemented the method in JAVA. The input consists of one or several BAM files and a simple tabular separated file listing the mean segment length and standard deviation for each file. Additionally, a color can be specified for each BAM file, which is used to visualize the mappings in the graphical output. The mappings are read from the BAM files using SAMtools [39] and applying several filters. Only mappings of high quality (quality value at least 20), without gaps, uncapped and without co-optimal mapping locations are used. Further, the mappings can be filtered by their length. Here, a minimum length of mean plus three times standard deviation showed good performance in practice.
Before the actual clustering, we partition the mappings into so-called regions, maximal subsets such that no mapping from one set overlaps with any mapping in another subset. These regions are then clustered independently.
The tool, including the source code and example data, is available from the Bielefeld University Bioinformatics Server: http://bibiserv.cebitec.uni-bielefeld.de/agglodel/
Before we present results on real data from three samples of a cancer patient, we first investigate the accuracy of the new clustering method in distinguishing overlapping from single deletions in simulated data sets.
Since, to the best of our knowledge, no other method exists that aims at identifying overlapping deletions, and because such instances are difficult to detect or verify in the wet lab, there is no gold standard available we could use for an evaluation. Instead, we created a simulated data set which is based on previously detected single deletions.
We used our agglomerative clustering approach with the filtering of mappings as explained in "Method", and also ran two other tools on the same set of filtered mappings. First of all, we included GASV [22] (release 2.01) into our evaluation, as our similarity score is based on an extension of the geometric interpretation used there. We set the minimum cluster cardinality to 2 and used both, the option to output all maximal sets of overlapping trapezoids (see "Method") which we will refer to as GASV_max, and the default to output merged clusters, referred to as GASV. Although a follow-up version GASVPro [33] has been published recently, the available software does not yet include all necessary preprocessing tools to process BAM files. Secondly, we also ran CLEVER [37] as one of the most recent and accurate variation detection tools. This method computes a probability for mappings arising from the same deletion, and builds a graph with the mappings as vertices and edges for pairs with Significantly high probability. Clusters are determined as maximal cliques in this graph, which in general allows overlapping deletions in the output. Clusters are determined as maximal cliques in this graph.
On average, CLEVER detected 92% of all deletions and made only about 1% wrong predictions. Even though maximal cliques output by CLEVER can be overlapping and the output can thus contain overlapping deletion predictions, this has to be understood as a technical consequence of the clustering procedure as CLEVER does not aim at predicting overlapping clusters. Nevertheless, we tried to distinguish between non-overlapping and overlapping deletions by simply considering non-overlapping cliques and pairs of overlapping cliques. The true and false positive rates were below 0.3 for all scenarios (not shown in the figure).
In all settings, agglomerative clustering with the proposed similarity score proves to be accurate. GASV performs well, where using the option to output all clusters before merging turned out to be slightly more effective in this setting.
We also investigated, when two overlapping deletions have been called correctly, how accurate the mappings originating from the two different alleles have been assigned to the two clusters. Ideally, one clusters should only contain mappings from the first deletion and the other cluster only those from the second deletion. On those pairs, correctly identified by all three methods, GASV clusters contained 0.041% mis-assigned mappings, GASV_max 0.019%, and our clusters contained 0.006%. In total, i.e., including the more difficult cases GASV did not detect as pairs at all, we observed 1.3% mis-assignments in our clusters and 4.0% in the GASV_max clusters (averaged over all four simulation settings).
The Department of Paediatric Oncology, Haematology and Immunology at the Düsseldorf University Hospital, Germany, provided sequencing data of an acute lymphoblastic leukemia patient. Three samples (before treatment, after treatment, and relapse) have been sequenced on an Illumina HiSeq 2000 with read length 51, segment length around 300, and sequence coverage of 6×, 8× and 8×, respectively. At the Institute of Medical Informatics at University of Münster, Germany, the reads have been mapped to hg19 with BWA [42] (version 0.5.9, at most 3 mismatches and standard parameters otherwise) and Picard [43] has been used to remove duplicates. Figure 1 exemplarily shows the fragment length distribution for chromosome 21 from one of the data sets estimated from the mapping distances. It approves that the assumption of a normal distribution is approximately met.
Number of overlapping deletions.
Size (# deletions) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ≥ 10 | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|
# regions | 631,769 | 97,869 | 1,768 | 68 | 25 | 8 | 2 | 2 | 3 | 4 | 4 | 731,522 |
# components | -- | 101,615 | 104 | 9 | 4 | 3 | 1 | 2 | 0 | 1 | 0 | 101,739 |
Many highly overlapping deletions were found in telomeric or centromeric regions, where mappings are not very reliable. Figure 4 exemplarily shows a region where we found overlapping clusters. The shown region is also an example for a general observation we made: Agglomerative clustering with our similarity score nicely identifies outliers among the discordant mappings, as those remain as singleton clusters. In our approach, such putatively erroneous mappings do not distort the main deletion prediction, whereas, for instance in GASV, a single mapping can drastically affect a cluster if the corresponding trapezoid overlaps it, or can result in many maximal sets of overlapping trapezoids.
An experimental validation of this and further findings is currently being performed. Here, we wanted to give an overview of the general performance of the presented method on a real world instance. In cooperation with the Düsseldorf University Hospital, further analyses of this and other data sets are planned.
It is well known that structural variations in the human genome play an important role in the development of diseases, especially cancer. In particular, the accurate detection of deletions still remains a challenging task. A preliminary study has motivated that detection methods should be capable of handling overlapping deletions to be able to draw a clearer picture of variations in heterogeneous samples, e.g., to distinguish cancer from patient specific mutations.
In the present study, we have demonstrated that agglomerative clustering is suitable for this task. We have introduced a similarity score that is based on a geometric and probabilistic interpretation of paired ends which have been mapped to a reference sequence. Taking into account both the location and the length of a deletion, this scoring allows to effectively cluster mappings into possibly overlapping clusters. The method has successfully been applied on real data, and has proven to be accurate according to a simulation-based evaluation.
Here, we have investigated the performance of an intuitive clustering approach. On the one hand, the simplicity of agglomerative clustering offered us insight into the clustering process and understanding of its behavior. On the other hand, computationally more sophisticated clustering strategies could perform even better, if they are computationally not too expensive for such large data sets.
We believe that there is still more potential to be explored in the detection of structural variants, especially deletions. Besides the combination with other approaches, such as split read or coverage analysis, the increasing availability of several samples from one patient offers a dimension that has to be investigated.
Another clustering based method to call deletions (and insertions) from paired-end mapping data from heterogeneous samples has been published very recently. The tool SVM^{2} by Chiara et al. [44] utilizes a support vector machine incorporating, similar to our tool, mapping location and mapping distance, but also coverage information. Including SVM^{2} into our simulation-based evaluation could not be finished during revision of the present manuscript.
The publication costs for this article were funded by the Deutsche Forschungsgemeinschaft and the Open Access Publication Funds of Bielefeld University Library.
This article has been published as part of BMC Genomics Volume 14 Supplement 1, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S1.
The author wishes to thank Tobias Marschall for many discussions and valuable support in the evaluation; Peter Husemann, Linda Sundermann, Cedric Chauve and Jens Stoye for fruitful discussions; the Department of Paediatric Oncology, Haematology and Immunology at the Düsseldorf University Hospital for providing the data; Christoph Bartenhagen for performing the preprocessing of the data; and the reviewers for their helpful comments.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.