Improving alignment accuracy on homopolymer regions for semiconductor-based sequencing technologies
© The Author(s). 2016
Published: 22 August 2016
Ion Torrent and Ion Proton are semiconductor-based sequencing technologies that feature rapid sequencing speed and low upfront and operating costs, thanks to the avoidance of modified nucleotides and optical measurements. Despite of these advantages, however, Ion semiconductor sequencing technologies suffer much reduced sequencing accuracy at the genomic loci with homopolymer repeats of the same nucleotide. Such limitation significantly reduces its efficiency for the biological applications aiming at accurately identifying various genetic variants.
In this study, we propose a Bayesian inference-based method that takes the advantage of the signal distributions of the electrical voltages that are measured for all the homopolymers of a fixed length. By cross-referencing the length of homopolymers in the reference genome and the voltage signal distribution derived from the experiment, the proposed integrated model significantly improves the alignment accuracy around the homopolymer regions.
Besides improving alignment accuracy on homopolymer regions for semiconductor-based sequencing technologies with the proposed model, similar strategies can also be used on other high-throughput sequencing technologies that share similar limitations.
The rapid development of high-throughput sequencing technologies leads to appearances of many innovative sequencing platforms [1, 2]. Ion Torrent and Ion Proton are semiconductor-based sequencing platforms that are primarily designed for personal genome sequencing [3, 4]. Different from sequencing techniques enriched with substitution errors [5, 6], Ion semiconductor sequencing platforms suffer from the inaccuracy in detecting the length of homopolymers repeats of the same nucleotide [7, 8]. These homopolymer errors often lead to the inaccurate local alignment results, and become a critical barrier against accurate detection of genomic variations [9–11] (http://www.broadinstitute.org/gatk/media/docs/Samtools.pdf).
The sequencing chemistry for the Ion semiconductor-based technology is that the incorporation of a deoxyribonucleotide (dNTP) into a strand of DNA couples with the release of a hydrogen ion, which changes the pH of the solution and then leads to the electronic voltage pulse in the ion sensor. Multiple identical bases on the DNA strand often result in the detection of multiple times of the baseline voltage corresponding to the measurements at mononucleotide loci . The difficulty on the homopolymer length identification mainly results from the inaccurate measurement on the magnitude of the voltage pulse, which follows a signal distribution that can be dependent on multiple factors including the type of nucleotide, the length of homopolymer, and the relative position in the DNA template.
Thus far, several algorithms have been proposed in correcting the inaccurate homopolymer length identification, based on the raw data from the detected voltage signals for Ion semiconductor sequencing technologies. Lysholm designed a flow-space FAAST tool, where flowpeak information retrieved from detected voltage signals is utilized to improve accuracy of Smith-Waterman-Gotoh local alignment through correction of likely sequencing errors and thus obtain optimized homopolymer length . However, since dedicatedly designed for the naïve Smith-Waterman-Gotoh algorithm, the method undertakes a heavy computing burden and limits the further application with other alignment programs. In addition, the parameter selection in the algorithm is ad hoc, and was not designed for maximizing the performance. Zeng designed a PyroHMMsnp algorithm, where a hidden Markov model (HMM) is built to recognize overcall or undercall status of homopolymers in a realignment process, and is used to deduce the most possible homopolymer lengths . Similar with other refined alignment algorithms, this approach uses an EM-based strategy, which assumes the variant pattern on most of the reads at one specific loci follows the same distribution; this assumption maybe invalid for certain biological applications, such as the variant identification in cancer somatic tissues. In addition, PyroHMMsnp design does not have hidden state for mismatches, and therefore tends to mistakenly convert mismatches into INDELs. In this project, we aim to develop a simple computational strategy for improving the alignment accuracy by using the voltage signals, and relying only on the measurements of individual sequencing read.
In addition to the measured electrical voltage signals, it is evident that the reference genome contains significant amount of prior information that are not adequately considered by other methods. This is under the assumption that only a small percentage of nucleotides are different between two individuals; for human, it is about 1 % of whole genome nucleotides [13, 14]. Based on such consideration, we proposed a Bayesian-based integrated model to merge these two information sources to improve performance of homopolyer length identification. We demonstrate that our algorithm significantly outperformed Torrent Suite, the software package coupled with Ion Torrent and Proton Sequencers for accurately identifying the length of the homopolymer repeats, and therefore improved sequence alignment accuracy.
Ion Torrent sequencing
Different from imaging-based sequencing platforms, Ion semiconductor technology detects nucleotide composition using electronic sensors. During the sequencing process, the sensor detects released hydrogen ions when nucleotide incorporation occurs. The sensor then detects the pH change caused by hydrogen release, and translates such chemical signal to electrical voltage signal, which is proportional to the number of captured ions. Since one type of nucleotides is sequenced in one machine cycle, if homopolymer exists, the detected voltage level should reflect the length of homopolymer. Despite this simple principle, practically, however, the detected electrical voltage follows a distribution, and in many cases, may not accurately recapitulate the length of homopolymer. In order to design a bioinformatics strategy for correcting the length of homopolymers, we first systematically evaluate the signal distribution of the detected electrical voltage for all the nucleotide positions that share the same homopolymer length, same homopolymer nucleotide type (A, C, G, or T), and similar positions in the sequence reads. The original voltage signals for different nucleotides were extracted from the SFF file, which is exported from the Torrent Suite package.
Bayesian inference of homopolymer length
In the equation, P(V|N i ,P j ,L) is the prior possibility of occurrence of a specific voltage V if given homopolymer length L under situation of nucleotide N i and read position P j , while P(L) and P(V) respectively represent the probability of a specific homopolymer length L, and the probability of a specific voltage V. Both these two probabilities can be statistically derived from the entire sequencing data. In summary, P(L|N i ,P j ,V) is the probability of occurrence of a specific homopolymer length L if given sequencing voltage V under sequencing context of nucleotide N i and read position P j .
Integrated model to identify homopolymer length
For a specific assay, the weighting factor w will be determined by minimizing the identification error for the homopolymers whose length is known, such as samples also detected using other technologies.
Results and discussion
Illumina sequencing data from the same individual, NA11881, is downloaded from the 1000 Genomes database (http://www.1000genomes.org/data). Due to the chemistry differences, Illumina technology is more accurate in detecting homopolymer lengths. We therefore use the dataset from the Illumina platform as gold standard when refining the length of homopolymer repeats.
Distribution of detected voltage signals in homopolymer repeat regions
Bayesian inference of homopolymer length
Motivated by these observations, we develop a Bayesian-based model in inferring the length of homopolymer based on the homopolymer length, their relative positions in the reads, and the detected voltage signal. Since the nucleotide composition includes A, C, G, T and the homopolymer positions are classified into four zones (Z1, Z2, Z3, Z4), in total, 16 Bayesian inference models are built based on the aforementioned prior signal distributions. In each model, the homopolymer length is identified if given a specific voltage level under a particular nucleotide type and position in sequencing read.
In fact, after calculation of prior signal distributions of different kinds of homopolymer lengths, the length of homopolymer can be simply decided using naïve counting from the measured electrical voltage or the k-nearest neighbors algorithm. That is to identify the length of one homopolymer according to its nearest distance to the mean values of different prior signal distributions. In such way, the number of identification errors is 169,212, or 11.82 % of the whole 1, 430,986 homopolymers.
Comparing to k-nearest neighbors algorithm, with our designed Bayesian inference models, the number of identification errors decrease to 71, 460, or 4.99 % of the whole homopolymers.
However, our Bayesian inference result cannot outperform that from the Torrent Suite, where the number of identification errors is 29,623, or 2.07 % of the whole homopolymers. This is due to the fact that significant training has been included the Torrent Suite algorithm, which is proprietary, and uses a large amount of genomic features.
Identification of homopolymer length with Bayesian and reference genome information
Despite the superior performance of the Bayesian model comparing to naïve counting from the measured electrical voltage, both our model and output from the Torrent Suite, experience significant inconsistency based on our dataset with ground truth. Since genetic variants should only occur in a small percentage of genomic loci. We therefore hypothesize that using a combination of voltage signal with the guidance of the standard reference genome will significantly increase the detection accuracy.
Identification errors of homopolymer length with different methods
To show robustness of our proposed method, we also conducted analysis on one Ion Proton data(HapMap human dataset, NA12878) with the same pipeline and obtained the similar result (Additional file 1). Since more homopolymers retrieved in the Ion Proton data, their positions were classified into five zones depending on their distance from the beginning of the reads, Z1: 1–50 bp, Z2: 51–100 bp, Z3: 101–150 bp, Z4: 151–200 bp and Z5: 201–250 bp.
As an important category of sequencing platform, Ion semiconductor-based technology has been widely utilized due to its good performance of faster and cheaper sequencing. However, the technology is far from perfect and suffers from the problem of homopolymer uncertain length. With Bayesian inference and reference genome information, an integrated model was designed to resolve such a problem. Bayesian inference of homopolymer length was first calculated from detected voltage signals. Merged with reference genome sequences information, the homopolymer length was eventually deduced. Compared to several known algorithms, the proposed method presents a greatly improved performance.
It should be noted that the proposed method is designed for refining the sequencing alignment based on individual sequencing read information. This is different from other approaches that rely on the coordinated information from all the reads that align to the same genomic region. Our strategy enables mapping the reads that contain variants in only a small percentage of DNA fragments, such as cancer genome. The general framework of our method can also be used for other sequencing technologies that contain significant amount of sequencing error around homopolymer regions, such as nanopore technology.
This work was supported in part by grants from National Natural Science Foundation of China (61471139, 61403092), the National High-Tech Research and Development Program (863) of China 2015AA020101 and Fundamental Research Funds for the Central Universities (HEUCFT1302, HEUCFX41303).
The publication costs for this article were funded by the corresponding author.
This article has been published as part of BMC Genomics Volume 17 Supplement 7, 2016: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2015: genomics. The full contents of the supplement are available online at http://bmcgenomics.biomedcentral.com/articles/supplements/volume-17-supplement-7.
Availability of data and materials
The sequencing data used here can be found in 1000 Genomes database.
BH and YL designed the study. WF and YW designed the model. SZ, DX, FS, ZL, DC performed the computational coding and implementation. YH conducted data analysis. WF and YL drafted the manuscript. All the authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Schuster SC. Next-generation DNA, sequencing. Nat Methods. 2008;5(1):11–21.View ArticleGoogle Scholar
- Ansorge WJ. Next-generation DNA, sequencing techniques. N Biotechnol. 2009;25(4):195–203.View ArticlePubMedGoogle Scholar
- Merriman B, Ion Torrent R&D Team, Rothberg JM. Progress in Ion Torrent semiconductor chip based sequencing. Electrophoresis. 2012;33:3397–417.View ArticlePubMedGoogle Scholar
- Rothberg JM, Leamon JH. The development and impact of 454 sequencing. Nat Biotechnol. 2008;26(10):1117–24.View ArticlePubMedGoogle Scholar
- Dohm JC. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008;36(16):e105.View ArticlePubMedPubMed CentralGoogle Scholar
- Smith AD, Xuan Z, Zhang M. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinf. 2008;9:128.View ArticleGoogle Scholar
- Zeng F, Jiang R, Chen T. PyroHMMsnp: an SNP caller for Ion Torrent and 454 sequencing data. Nucleic Acids Res. 2013;41(13):e136.View ArticlePubMedPubMed CentralGoogle Scholar
- Lysholm F, Andersson B, Persson B. FAAST: Flow-space Assisted Alignment Search Tool. BMC Bioinformatics. 2011;12:293.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Gen Res. 2008;18(11):1851–8.View ArticleGoogle Scholar
- Whiteford N, Haslam N, Weber G, Bradley M, Neylon C. An analysis of the feasibility of short read sequencing. Nucleic Acids Res. 2005;33(19):171.View ArticleGoogle Scholar
- Ratan A. Comparison of sequencing platforms for single nucleotide variant calls in a human sample. PLoS One. 2013;8(2):e55089.View ArticlePubMedPubMed CentralGoogle Scholar
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in micro fabricated high-density picolitre reactors. Nature. 2005;437(7057):376–80.PubMedPubMed CentralGoogle Scholar
- Venter JC, et al. The sequence of the human genome. Science. 2001;291(5507):1304–51.View ArticlePubMedGoogle Scholar
- Wheeler DA. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872–6.View ArticlePubMedGoogle Scholar