Accurate indel prediction using paired-end short reads
© Grimm et al.; licensee BioMed Central Ltd. 2013
Received: 22 October 2012
Accepted: 6 February 2013
Published: 27 February 2013
One of the major open challenges in next generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an indel, such as the number of split read alignments spanning the boundaries of a deletion candidate or reads that map within a putative deletion. Candidates with a score above a manually defined threshold are then predicted to be true indels. As a consequence, structural variants detected in this manner contain many false positives.
Here, we present a machine learning-based method that distinguishes true from false indel candidates in order to reduce the false positive rate. Our method identifies indel candidates using a discriminative classifier based on features of split read alignment profiles, trained on true and false indel candidates that were validated by Sanger sequencing. We demonstrate the usefulness of our method with paired-end Illumina reads from 80 genomes of the first phase of the 1001 Genomes Project (http://www.1001genomes.org) in Arabidopsis thaliana.
In this work we show that indel classification is a necessary step to reduce the number of false positive candidates. We demonstrate that omitting this classification step may lead to spurious biological interpretations. The software is available at: http://agkb.is.tuebingen.mpg.de/Forschung/SV-M/.
The detection of genetic variation between individuals is a key challenge in current research in genome biology. This variation includes single nucleotide polymorphisms (SNPs), structural variants (SVs) and copy number variants (CNVs) such as deletions, insertions or duplications, as well as copy number invariant changes like translocations or inversions. SNPs are used extensively to link phenotypic traits with associated genotypes in genome-wide association studies (GWAS) and to infer relationships in evolutionary studies [2, 3]. SVs can provide additional insights into the genomic causes of phenotypic diversity [4, 5]. Moreover, it is assumed that the total number of nucleotides spanned by SVs greatly exceeds that of SNPs in humans and plants [6, 7]. Hence SVs are increasingly being included in such studies [6, 8]. Furthermore, SVs are associated with different types of human diseases [5, 9–13] and plant phenotypes [14, 15]. Compared to SNP identification, the detection of larger divergent sequences remains a challenging task. Here we present a machine learning approach to predict SVs based on NGS data.
Traditionally, structural variants, in particular deletions and duplications, have been identified using array-based technologies (arrayCGH or SNP arrays), but these strategies suffer from limited size and localization resolution, which depends on the density of probes or known markers. With the advent of NGS methods, whole-genome studies became feasible. Small insertions and deletions (hereafter called indels) up to a few base pairs in length were called by sensitive alignment tools in the routine re-sequencing process [2, 17, 18]. However, the detection of larger structural variants based on depth-of-coverage (DOC) or paired-end mapping (PEM) [20, 21] methods could not resolve SV size and location at the base-pair level. DOC algorithms detect regions with absent (deletion) or significantly elevated (duplication) coverage, but cannot determine the exact insertion location of the duplicated sequence at the base-pair level. PEM [20, 21] methods exploit the fact that the distance between the alignment locations of read pairs on a reference genome (the ‘insert size’ of the read pairs) usually follows a normal distribution. Clusters of read pairs mapping to the same genomic region whose distance is much shorter (longer) than expected can be explained by an insertion (deletion) in the newly sequenced individual compared to the reference genome. The standard deviation and the mean of the insert size distribution define the sensitivity of this method. Recently, so-called split-read mapping (SRM) approaches were introduced to pinpoint structural variants, and especially indels, correctly to the base-pair level [22–24]. These methods use mapped-unmapped read pairs (MURs) from a paired-end alignment performed by existing short read mapping tools.
The mapped partner serves as an anchor to realign the unmapped partner using alignment algorithms that allow for long-range gaps in either the reference sequence (deletion) or the read (insertion). Throughout this manuscript, we will refer to the initial mapping, which identifies MURs, as the first mapping pass and to the split read alignment as the second mapping pass.
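The PEM principle described above can be sketched in a few lines. This is an illustrative toy, not part of the pipeline presented here; the function name and the 3-standard-deviation cutoff are assumptions for the example.

```python
import statistics

def flag_discordant_pairs(library_inserts, observed_inserts, z_cut=3.0):
    """Flag read pairs whose insert size deviates strongly from the
    library's insert-size distribution: a span much longer than expected
    suggests a deletion, a much shorter span suggests an insertion."""
    mu = statistics.mean(library_inserts)
    sd = statistics.stdev(library_inserts)
    calls = []
    for span in observed_inserts:
        z = (span - mu) / sd
        if z > z_cut:
            calls.append((span, "deletion"))
        elif z < -z_cut:
            calls.append((span, "insertion"))
    return calls
```

As the text notes, the mean and standard deviation of the library's insert-size distribution directly set the sensitivity of this check.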
Deletions up to a few tens or hundreds of base pairs in length can be identified by array-based, DOC and PEM approaches, while conventional short read alignments are designed to find only deletions of a few base pairs. In contrast, SRM predictions can in principle span the whole range of deletion lengths. However, the size of detectable insertions is limited, and spurious alignments of the indel-flanking read parts might lead to multiple contradicting indel candidates. Both limitations depend directly on the read length and are expected to be mitigated by the longer reads of advancing technologies. Finally, deletions can be identified by limited de novo assembly methods, but these are not yet used routinely and require whole-genome alignments or closely related genomes for comparison.
Though a large number of software packages for indel prediction from NGS data have been developed, application of these methods to identical data sets reveals little overlap. This is caused by different indel identification strategies. To reduce the number of false positive indel candidates, SRM methods such as Pindel rely on conservative realignment strategies: only perfectly and uniquely mapped reads are considered for further analysis, and the realignment of the unmapped partner has to be mismatch-free as well. These constraints reduce the set of possible indel candidates drastically. Existing SRM programs report an indel as soon as two independent reads support the same SV and their partner reads lie within concordant insert sizes [22, 23]. Other methods use several alignment features as evidence of an indel. However, whether relying on logical rules [22, 24] or on generative probabilistic models, they require an empirically defined threshold above which a candidate is called a true SV. The most reliable way to verify indels is capillary sequencing, but this is unfeasible for a genome-wide scan. Thus, to identify a comprehensive set of indels, a non-conservative mapping strategy is needed that takes non-unique and non-perfect mapping reads into account. Furthermore, to rate their trustworthiness, an evaluation method is needed that collects information about numerous alignment features from different approaches and automatically weighs their contributions.
Here we introduce an extended realignment split read strategy to identify a comprehensive set of indel candidates. A machine learning method is applied to discriminate between ‘true’ and ‘false’ indel candidates based on more than 10 alignment features, which can be derived from any short read mapping tool. Its core is a support vector machine (SVM), a discriminative classifier that is trained on diverse alignment information from indel examples validated by reliable Sanger sequencing. Our SVM approach avoids the step of defining thresholds for each feature by learning them automatically from Sanger-validated training data. We show that a commonly used criterion, namely the number of split read pairs supporting the same indel, is not sufficient to distinguish true indels from false candidates, but that additional features can accurately predict bona fide SVs. Concomitantly, our method reports the contribution of each feature to this decision process. Our approach was applied to 80 genomes of Arabidopsis thaliana, and its validity was demonstrated by recovering, solely from positively classified indels, a population structure of the analyzed strains highly similar to that obtained from SNP data.
We applied a custom split read alignment method to retrieve indel candidates from the Arabidopsis thaliana strain ICE111 from phase I of the 1001 Genomes Project. The read lengths ranged from 36 to 64 bp with an average sequencing depth of 21x. All mapped-unmapped read pairs (MURs) were retrieved from the alignments available from Cao et al., which allowed 4 base pair differences between read and reference, of which at most 3 could be gaps. The mapped partner may have multiple alignment positions across the whole genome. Because of many ambiguous alignments due to the high repetitiveness of centromeric sequences, all MURs within centromeres were excluded. The unmapped partners of the MURs were mapped against Col-0 (TAIR8) in a 5,000 bp window downstream of the mapped partner using Gotoh’s alignment algorithm, allowing for long-range gaps as well as additional SNPs or few-base-pair-sized indels (Additional file 1). All best-scoring alignments were reported, and indels with a minimum support of two reads constituted the indel candidate set to be further evaluated.
The split read alignment approach identified 14,155 potential indel candidates for the Arabidopsis thaliana strain ICE111. We randomly selected 219 deletion and 43 insertion candidates across all chromosomes from this set and labeled them as true or false after Sanger sequencing. This yielded two training sets: the corpus for deletions consisted of 172 true and 47 false examples, and the corpus for insertions of 33 true and 10 false ones. These sets were used to train an SVM.
Applying our machine learning approach with the f13/f17 feature set to the indel candidates of strain ICE111 positively classified 10,256 out of 13,547 deletions (76%) and 373 out of 608 insertions (61%). The length of deletions ranged from 2 to 4,880 bp with a mean of 334 bp and a median of 12 bp. For insertions, the length ranged from 2 to 5 bp with a mean and median of 4 bp. Thus, the SVM was capable of extracting information from the defined features, allowing it to falsify indel candidates. ‘False’ SVs can be attributed to spurious mappings, dependent on the length of the split read fragments, or to multiple best-scoring alignments across the reference.
Next, we compared our indel predictions in strain ICE111 to those identified by two versions of Pindel (v0.1 and v0.24). The minimum length of deletions was set to 5 bp for all three sets, and the maximum deletion size was set to 5,000 bp due to the adjustable restriction of the alignment space. Pindel detected a total of 2,087 (v0.1) and 3,272 (v0.24) deletions larger than 5 bp. 99.8% (v0.1 and v0.24) of Pindel’s deletions were shared with the unclassified deletions of our approach. The SVM classification identified 220 (11%, v0.1) and 309 (10%, v0.24) false positive deletions among the Pindel candidates. Further, our Gotoh approach detected an additional set of 6,890 (v0.1) and 5,706 (v0.24) positively classified deletions. This can be explained by different mapping strategies. Pindel in version 0.1 requires uniquely and error-free mapped reads and allows only split read alignments where the read partners are aligned within two times the average insert size. In contrast, our SVM approach follows a less conservative re-alignment strategy by analyzing non-error-free and multiple best-scoring alignments of the same read. Moreover, we tolerated split read mappings anywhere within the alignment window, which results in a larger candidate set by allowing the detection of SVs in highly divergent regions. The subsequent classification compensates for these introduced levels of uncertainty. Indeed, Sanger sequencing revealed 14 (out of 219 validated) deletions where the read partners showed discordant insert sizes. Furthermore, a 61 bp deletion, which was included in the training set and falsified by Sanger sequencing, was reported by Pindel but correctly classified as false by our SVM. This deletion would have suggested a potential frameshift in a coding region.
Due to general alignment restrictions, the detection of insertions is limited in terms of their length: aligning a short sequence against a long one while introducing a series of gaps into the long sequence necessarily leads to an inferior alignment score. Thus, Pindel (v0.1) and our approach share merely 15% of all insertions. Abyzov et al. investigated exactly this problem and proposed an improved alignment algorithm called AGE. Applying this algorithm, we could increase the number of shared insertions to 35%.
Next, we detected and classified indels in 80 accessions of Arabidopsis thaliana from the first phase of the 1001 Genomes Project. The average coverage of the strains was 17x. Using a 5,000 bp alignment window, on average ∼2,116 positive deletions and ∼69 positive insertions were reported per strain, the largest being 4,916 bp long. Combining similarly localized indels of identical length in different strains revealed 169,246 non-redundant deletions and 5,500 insertions in this population. Altogether they span more than 25 Mb. Almost half, ∼44%, were shared among more than one strain (Additional file 3).
We found 829 long-range deletions spanning at least one complete gene sequence of the TAIR8 annotation, with an average allele frequency of ∼2.7 (ranging from 1 to 65). Of these deletions, 101 were classified as whole-gene losses if there was no unique read coverage within the deletion (at least 90% zero-covered positions) and sufficient coverage in the same-sized flanking regions (at least 90% non-zero-covered positions), as determined by the first mapping pass. Their average allele frequency was ∼4.4 (ranging up to 29). Spurious read mapping within the deletion, ambiguous split read alignments, gene translocations or heterozygous deletions could explain long-range deletions not meeting the aforementioned criteria.
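The whole-gene-loss criterion above can be expressed as a small predicate. This is a hypothetical re-implementation for illustration only (the function name and per-position coverage arrays are assumptions), not the code used in the study.

```python
def is_whole_gene_loss(deletion_cov, left_flank_cov, right_flank_cov):
    """Apply the coverage criterion described above: at least 90% of
    positions inside the deletion have zero unique read coverage, and
    at least 90% of positions in each same-sized flank are covered."""
    def zero_fraction(cov):
        return sum(1 for depth in cov if depth == 0) / len(cov)
    inside_empty = zero_fraction(deletion_cov) >= 0.9
    flanks_covered = all(zero_fraction(flank) <= 0.1
                         for flank in (left_flank_cov, right_flank_cov))
    return inside_empty and flanks_covered
```

The coverage arrays would come from the first mapping pass, counting only uniquely mapped reads per position.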
As expected, only a minority (< 10%) of indels overlap with coding regions and have a potential deleterious effect on proteins (Additional file 4). Indels that do not alter the open reading frame of a gene outnumber those that do by almost two to one (Additional file 4). However, the prediction of amino acid or frameshift changes was performed for each SV separately, without considering potential nearby SVs. It is known that additional nearby variants can compensate for frameshifts; thus, the number of protein-changing SVs reported here might still be an overestimate.
We present a discriminative machine learning-based approach for detecting true structural variants among indel candidates. The key benefit of using a discriminative model is to learn to distinguish between true and false candidates based on a Sanger validated ground truth, thereby reducing the false positive rate among predicted indels.
We applied our method to indel candidates generated via an exact Gotoh re-alignment of paired-end reads for which one partner could not be mapped. By considering reads mapped to multiple locations across the whole genome as well as non-error-free reads, and by accepting all mappings within the entire alignment window, we obtain a larger set of potential indel candidates. Consequently, this non-conservative procedure increases the chances of finding more true positives, but also tends to identify more false positives. It is therefore essential to accurately classify indel candidates using our machine learning approach.
Conceptually, our machine learning approach for true indel detection can be combined with any kind of alignment strategy and candidate generation scheme. Indeed, to detect more insertions, a different alignment method can be useful. The Gotoh approach preferentially calls shorter insertions, since longer insertions leave fewer nucleotide matches (i.e. positive scores) in the pairwise alignment. Abyzov et al. developed an alignment tool called AGE to better call long insertions. Their method was used on the ICE111 genome in our framework and improved the overlap of insertions between Pindel and the Gotoh approach from 15% to 35%.
Current methods for indel scoring, which either rely on logical rules [22, 24] or generative probabilistic models, have to manually define a threshold above which candidates are predicted to be true structural variants. Our machine learning approach avoids this step by automatically learning the threshold from the Sanger-validated training data. Furthermore, all non-discriminative methods for scoring indel candidates have to solve the difficult task of how to weight different types of evidence for the occurrence of an indel. Unlike these methods, our discriminative approach automatically learns the weights of different features. In addition, the automatic weighting of features indicates which features are relevant and which ones are less relevant for indel detection. From our results, we can confirm that our discriminative indel detection benefits from combining several features.
The features we selected contain information that can reliably distinguish between true indels and false candidates, as demonstrated by the consistent reconstruction of population structure based on true predicted indels. Furthermore, we showed that tools not relying on a classification step may lead to spurious biological interpretations. Here, Pindel identified a deletion candidate causing a potential frameshift, which our post-classification method predicted to be a false one; this prediction was confirmed by one of the Sanger sequences. The classifier was trained on a corpus of reads from a paired-end library sequenced to 21-fold coverage. Our method is robust to changes in fold coverage if the features derived from the read alignments scale linearly with, or are independent of, sequencing depth. The feature normalization we perform accounts for such fold changes. It is expected that longer reads will improve our strategy, since longer indel-flanking read sequences yield fewer ambiguous split read alignments. To apply our method to a different species, one would need to create a new Sanger-validated dataset to account for its particular genomic properties, such as the degree of heterozygosity or repetitiveness. However, to circumvent laborious Sanger sequencing, the increasing number of de novo assembled genomes or structural variant databases could serve as an alternative and extensive ground truth in future studies.
The software, Sanger validated training data and all annotated indels for the 80 genomes are available at http://agkb.is.tuebingen.mpg.de/Forschung/SV-M/.
We showed that accurate indel detection consists of two steps: the realignment of unmapped reads and the post-classification of detected candidates. Methods that rely predominantly on re-alignment strategies often report a large number of falsely detected indels. We used a non-conservative re-alignment strategy (e.g. allowing multiply mapped reads) to enrich the number of candidates and then applied a discriminative machine learning-based approach to classify indel candidates into true and false ones. We achieved a classification accuracy of 95.1% ± 1.3% for deletions and 93.5% ± 2.6% for insertions. Furthermore, we showed that indel classification significantly reduces the number of false candidates and that omitting classification may lead to spurious biological interpretations such as false frameshifts or gene losses.
A Support Vector Machine (SVM) is a classifier that uses a hyperplane for classification and addresses a binary classification problem. We assume that we are given a set of data points {(x_i, y_i)}, i = 1, …, n, where x_i ∈ ℝ^d and y_i ∈ {−1, +1}. The label y_i of a point x_i describes whether the point is in the negative (y_i = −1) or the positive class (y_i = +1). An SVM tries to separate the set into a positive and a negative class using a hyperplane. The process of finding the hyperplane is referred to as training. The hyperplane then defines a decision function to predict the unknown label y of a new data point x. We use a soft-margin variant of the SVM, the C-SVM, which maximizes the margin while minimizing the training error. The C-SVM uses a penalty factor C, which penalizes wrongly classified points in the training set. Our software is written in C/C++ and uses libsvm, a library for Support Vector Machines.
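Once trained, the hyperplane, described by a weight vector w and offset b, classifies a new candidate through its decision function. A minimal sketch (the function name is for illustration; libsvm performs this internally):

```python
def svm_decision(w, b, x):
    """Linear SVM decision function: sign(<w, x> + b).
    Returns +1 (predicted true indel) or -1 (predicted false candidate)."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if score >= 0 else -1
```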
To train a classifier using a C-SVM and a linear kernel, we labeled a set of detected indel candidates as true (positive class) or false (negative class) by reliable Sanger sequencing. We split the validated corpus into a training and a testing set. The training set was used to learn the discriminative model, whereas the testing set measured the accuracy of the model. We counted a correctly classified test point as a true positive (TP) if it corresponded to the positive class and as a true negative (TN) if it corresponded to the negative class. A false negative (FN) is a positive test point that is classified into the negative class; correspondingly, a false positive (FP) is a negative test point that is classified into the positive class. For the actual training we selected 13 features for insertions and 17 for deletions derived from the split read alignment profiles (Figure 1). To train the SVM with these features, we normalized the data to the interval [0,1] and used 10-fold cross-validation for deletions and 5-fold cross-validation for insertions (Additional file 2). We repeated each cross-validation experiment 100 times. After cross-validation, the best-performing soft-margin parameter was C = 10 for deletions and C = 0.10 for insertions. To measure the performance, we computed the area under the curve (AUC) of the receiver operating characteristic (ROC) and the specificity-sensitivity break-even point (Spec-Sens-BEP). The ROC curve plots the fraction of TP over all positives, TP/(TP + FN) (sensitivity, or true positive rate (TPR)), against the fraction of TN over all negatives, TN/(TN + FP) (specificity, or true negative rate (TNR)). The Spec-Sens-BEP is the point where the TPR is equal to the TNR.
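The normalization and evaluation measures described above can be sketched as follows. These are generic illustrative implementations, not the paper's code; the rank-based AUC is the standard Mann-Whitney formulation.

```python
def minmax_normalize(values):
    """Scale a feature column into [0, 1], as done before SVM training."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def confusion(y_true, y_pred):
    """Count TP, TN, FP, FN for labels in {+1, -1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    """True positive rate, TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate, TN / (TN + FP)."""
    return tn / (tn + fp)

def roc_auc(y_true, scores):
    """Rank-based AUC: probability that a positive outscores a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The Spec-Sens-BEP is then found by sweeping the decision threshold over the scores until sensitivity and specificity coincide.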
To detect indels, we used the short read alignments of 80 strains of Arabidopsis thaliana against the reference genome Col-0 provided by Cao et al. We parsed all read pairs from which only one partner could be mapped (MURs). The mapped partner of a MUR may contain mismatches and may have multiple mapping positions across the whole genome. The mapped partner served as an anchor point. We considered a sequence window of length 5,000 bp downstream of the anchor; depending on whether the mapped read is located on the forward or the reverse strand, the alignment window spans downstream or upstream of the anchor. Using an exact Gotoh alignment, we aligned the unmapped read within this window against the reference (Additional file 1). The Gotoh alignment is based on the Smith-Waterman algorithm and computes a local pairwise alignment between two sequences a and b using affine gap costs. Gotoh described in his work how to use dynamic programming to compute an alignment with affine gap costs in O(mn) time, where m and n are the lengths of sequences a and b. As scoring matrix we used NUC4.2 (ftp://ftp.ncbi.nih.gov/blast/matrices/), which scores a match with 5 and a mismatch with -4. A gap opening is scored with -10, whereas a gap extension scores zero. Alignments with a score less than the maximal alignment score minus 30 were discarded, allowing for up to 7 mismatches or 3 gap openings. If the alignment contained a sequence of at least two consecutive gaps either in the unmapped read or in the reference, a possible indel location was reported. For short reads, a single split read alignment is not a strong indicator of an indel due to repetitive regions and alignment errors. Hence, we considered a location as a possible indel candidate only if we found a second independent split read alignment that supported the same location. Furthermore, each fragment of the split read alignment (the parts of the read to the left and right of the indel) had to be at least 8 bp long.
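The affine-gap alignment at the heart of the second mapping pass can be sketched with the standard three-matrix Gotoh recurrences. The sketch below is illustrative, not the pipeline's implementation: it returns only the best local score (no traceback or indel reporting) and uses the scoring parameters stated above (match 5, mismatch -4, gap open -10, gap extension 0).

```python
def gotoh_local_score(ref, read, match=5, mismatch=-4, gap_open=-10, gap_ext=0):
    """Best local alignment score between ref and read with affine gap
    costs, computed in O(m*n) time with the three-matrix Gotoh scheme."""
    NEG = float("-inf")
    n, m = len(ref), len(read)
    # M: alignment ends in a (mis)match; X: gap in the read; Y: gap in the ref
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if ref[i - 1] == read[j - 1] else mismatch
            M[i][j] = max(0.0, s + max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]))
            X[i][j] = max(M[i-1][j] + gap_open, X[i-1][j] + gap_ext)
            Y[i][j] = max(M[i][j-1] + gap_open, Y[i][j-1] + gap_ext)
            best = max(best, M[i][j])
    return best
```

A long gap in the read (reference consumed without read characters) then signals a deletion candidate, subject to the score cutoff (within 30 of the maximum) and the minimum 8 bp fragment length described above.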
We then used a pre-trained SVM  to predict whether an indel candidate is a true indel.
Novembre et al. showed that the eigenvectors of the SNP covariance matrix reflect the population structure. Here we used an indel covariance matrix. For this purpose, we combined identical or few-base-pair-shifted indels of the same length among different strains into an M×N matrix, where M is the number of strains and N the number of indels. All deletions and insertions with an SV frequency of at least two among all strains were encoded with 1 and -1, respectively. The absence of an indel was encoded with zero. To compute the underlying population structure of all 80 genomes of the first phase of the 1001 Genomes Project for Arabidopsis thaliana, we conducted a principal component analysis (PCA) using a custom Matlab script.
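The PCA step can be sketched with NumPy (the original analysis used a custom Matlab script; the function name here is illustrative). Strain coordinates on the leading principal components then reflect population structure.

```python
import numpy as np

def indel_pcs(G, k=2):
    """Project strains onto the top-k principal components of an M x N
    strain-by-indel matrix G (1 = deletion, -1 = insertion, 0 = absent)."""
    G = np.asarray(G, dtype=float)
    G = G - G.mean(axis=0)                      # center each indel column
    # eigenvectors of the strain covariance matrix, obtained via SVD
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :k] * S[:k]                     # PC coordinates per strain
```

Strains carrying identical indel genotypes receive identical PC coordinates, so clusters in this projection correspond to groups of related accessions.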
Next Generation Sequencing
Support Vector Machine
mapped-unmapped read pair
single nucleotide polymorphism
genome-wide association study
uniquely mapped read
non-uniquely mapped read
single position variant
area under the curve
principal component analysis
receiver operating characteristic
true positive rate
true negative rate
Supported by Transnational Plant Alliance for Novel Technologies – Towards Implementing the Knowledge-based Bioeconomy in Europe (PLANT-KBBE) project Transcriptional Networks and Their Evolution in the Brassicaceae (TRANSNET), funded by the Bundesministerium für Bildung und Forschung, by a Gottfried Wilhelm Leibniz Award of the Deutsche Forschungsgemeinschaft, and the Max Planck Society (D.W.).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.