STR-realigner: a realignment method for short tandem repeat regions

Kojima, Kaname; Kawai, Yosuke; Misawa, Kazuharu; Mimori, Takahiro; Nagasaki, Masao

doi:10.1186/s12864-016-3294-x

Research Article
Open access
Published: 03 December 2016


BMC Genomics volume 17, Article number: 991 (2016)


Abstract

Background

Two main strategies are used to estimate repeat numbers in a short tandem repeat (STR) region from high-throughput sequencing data: counting the repeat patterns contained in sequence reads that span the region, and estimating the difference between the actual insert size and the insert size inferred from paired-end reads. Alignment quality is crucial, especially for the former strategy, yet standard alignment methods have difficulty in STR regions because variations in repeat number appear as insertions and deletions.

Results

We propose a new dynamic programming based realignment method named STR-realigner that considers repeat patterns in STR regions as prior knowledge. By allowing size changes of repeat patterns with a low penalty in STR regions, accurate realignment is expected. For the performance evaluation, publicly available STR variant calling tools were applied to four types of aligned reads: synthetically generated sequencing reads aligned with BWA-MEM, those realigned with STR-realigner, those realigned with ReviSTER, and those realigned with GATK IndelRealigner. In a comparison of root mean squared errors between estimated and true STR region sizes, the results for the dataset realigned with STR-realigner are better than those for the other cases. For real data analysis, we used a sequencing dataset from Illumina HiSeq 2000 for a parent-offspring trio. RepeatSeq and lobSTR were applied to the sequence reads of these individuals aligned with BWA-MEM and to those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner. STR-realigner shows the best performance in terms of the Mendelian consistency of the estimated STR region sizes. Root mean squared error values were also calculated by comparing these estimates with STR region sizes obtained from high coverage PacBio sequencing data, and the results from the sequencing data realigned with STR-realigner showed the lowest (best) root mean squared error value.

Conclusions

The effectiveness of the proposed realignment method for STR regions was verified by comparison with existing methods on both simulated datasets and a real whole genome sequencing dataset.

Background

With the development of high-throughput sequencing (HTS) technologies, whole genome sequencing analysis has enabled detailed variant detection for each individual. For single nucleotide variants (SNVs), various variant calling methods have been proposed for HTS data [1–4], and accurate SNV detection has been achieved for more than a thousand individuals at genome-wide scale [5, 6]. However, unlike SNVs, accurate detection of structural variations such as insertions, deletions, short tandem repeat (STR) number polymorphisms, and copy number variations remains difficult, especially from data with low read coverage [7].

For repeat number polymorphisms, several studies have reported associations with disease phenotypes, such as CAG repeats in the Huntingtin gene with Huntington’s disease [8] and CAG repeats in the androgen receptor gene with spinal and bulbar muscular atrophy [9]. Several approaches, such as lobSTR [10], RepeatSeq [11], STRViper [12], and coalescentSTR [13], have been proposed for estimating repeat numbers in STR regions from HTS data. lobSTR and RepeatSeq estimate repeat numbers from the repeat patterns contained in sequence reads spanning the STR regions. STRViper and coalescentSTR, on the other hand, estimate repeat numbers from the difference between the actual insert size and the insert size inferred from paired-end reads aligned to the flanking regions of the target repeat. The alignment quality of sequence reads is important for accurate repeat number estimation, especially in the former approaches, yet standard alignment methods have difficulty in STR regions because of insertions and deletions caused by frequent changes in repeat number.

We propose a new dynamic programming based realignment method named STR-realigner in which repeat patterns in STR regions are given as prior knowledge and are used multiple times in the realignment process. A similar algorithm, based on a 3-stage modified Smith-Waterman, is adopted in a tool for detecting STR regions in PacBio reads [14]; unlike that tool, the proposed algorithm can handle consecutive STR regions. In addition, the proposed algorithm considers clipped fragments, which are an essential feature for realignment. By allowing insertions and deletions of repeat patterns in STR regions through repeated use of repeat units, accurate realignment of sequence reads is expected.

In a simulation study, we synthetically generated HTS data for artificial diploid genome sequences based on the phased genotypes of a sample in the 1000 Genomes Project dataset [5]. We showed the effectiveness of our model by evaluating root mean squared errors between the true repeat numbers and those estimated from the realignment results with RepeatSeq or allelotype, the STR calling software in the lobSTR package. For real data analysis, we applied STR-realigner, ReviSTER [15], and GATK IndelRealigner to HTS data from Illumina HiSeq 2000 for a HapMap CEU parent-offspring trio and showed the effectiveness of STR-realigner based on the Mendelian consistency of the estimated repeat numbers in the trio. Root mean squared error values were also calculated by comparison with gold standard STR region sizes obtained from high coverage PacBio sequencing data for one of the samples in the trio, and the results from the sequencing data realigned with STR-realigner showed the lowest (best) root mean squared error value.
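The two evaluation criteria named above, root mean squared error and Mendelian consistency, can be sketched as follows. This is only an illustration under simple assumptions: uncalled loci are represented as `None`, each genotype is a pair of STR sizes, and the function names `rmse` and `is_mendelian_consistent` are hypothetical rather than taken from any of the tools discussed.

```python
import math


def rmse(estimated, truth):
    """Root mean squared error over loci that have an estimate (None = not called)."""
    pairs = [(e, t) for e, t in zip(estimated, truth) if e is not None]
    if not pairs:
        raise ValueError("no called loci to compare")
    return math.sqrt(sum((e - t) ** 2 for e, t in pairs) / len(pairs))


def is_mendelian_consistent(child, father, mother):
    """True if one child allele can come from the father and the other from the mother.

    Each argument is a pair (allele1, allele2) of STR sizes.
    """
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)
```

For example, a child genotype of (10, 12) is consistent with parents (10, 14) and (12, 12), whereas (10, 16) is not.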

Method

Realignment algorithm considering repeat sequence as prior knowledge

We propose a dynamic programming based algorithm named STR-realigner that realigns a query read R to a genome sequence, taking into account the multiple use of repeat patterns for prespecified STR regions. We consider a genome sequence composed of a series of m subsequences G_1,…,G_m. Let B_j be a binary variable that takes one if G_j can be used repeatedly and zero otherwise, i.e., a subsequence G_j with B_j=1 is a repeat pattern in one of the prespecified STR regions. Figure 1 shows an example of a genome sequence composed of subsequences G_1,…,G_6, where G_2, G_3, and G_5 are repeat patterns of prespecified STR regions and are used repeatedly in the proposed realignment algorithm. In the description of the proposed algorithm, |R| and |G_j| denote the sizes of R and G_j, and R[k] and G_j[k] denote the bases at the kth position of R and G_j, respectively.

Since arbitrarily long deletions could otherwise arise from using the same subsequence with B_j=1 repeatedly, we limit the size of deletions to less than |G_j| for subsequences with B_j=1. We consider the following six types of states for the alignment of the ith position in query read R to the kth position of subsequence G_j.

1. s_M(i,j,k): a state representing a match or mismatch between the bases at the ith position of query read R and the kth position of subsequence G_j.
2. s_I(i,j,k): a state representing an insertion at the ith position of query read R right after the kth position of subsequence G_j.
3. s_D(i,j,k): a state representing a deletion of the kth position of subsequence G_j right after the ith position of query read R.
4. s_D(i,j,k,l): a state representing a deletion of the (k−l+1)th to kth positions of subsequence G_j right after the ith position of query read R. This state is considered only for subsequences with B_j=1; restricting l to the range from 2 to |G_j|−1 avoids deletions of |G_j| bases or longer. For l=1, s_D(i,j,k) is used, and consecutive deletions in the same subsequence are not considered for s_D(i,j,k) with B_j=1. If l is greater than k, the deletion starts from the (|G_j|−l+k+1)th position of the subsequence and wraps around from the tail of the subsequence to its head.
5. s_L(i): a state representing left clipping that ends at the ith position of query read R.
6. s_R(i): a state representing right clipping that starts at the ith position of query read R.
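The wrap-around behaviour of the multi-base deletion state can be made concrete with a small helper. It is a sketch only: it takes the definition literally (the deletion covers l bases ending at position k, using 1-based positions) and wraps from the tail of the subsequence back to its head when l exceeds k. The function name `deleted_positions` is hypothetical.

```python
def deleted_positions(k, l, unit_len):
    """1-based positions of a repeat unit removed by a deletion of l bases ending at k.

    When l > k the deleted stretch begins near the tail of the unit and wraps
    around to its head, so the positions are computed modulo the unit length.
    """
    if not (1 <= k <= unit_len and 1 <= l < unit_len):
        raise ValueError("require 1 <= k <= unit_len and 1 <= l < unit_len")
    # Convert to 0-based, step through l consecutive bases, convert back.
    return [((k - l + t) % unit_len) + 1 for t in range(l)]
```

With a repeat unit of length 3, a deletion of l=2 bases ending at k=1 removes position 3 followed by position 1.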

The following penalties are considered in the proposed realignment algorithm.

p_{m,j}: penalty for a match of bases between query read R and subsequence G_j. Usually, this penalty is set to a negative value, i.e., it acts as a reward.
p_{mis,j}: penalty for a mismatch of bases between query read R and subsequence G_j.
p_{io,j} and p_{ie,j}: penalties for opening and extending an insertion on subsequence G_j, respectively.
p_{do,j} and p_{de,j}: penalties for opening and extending a deletion on subsequence G_j, respectively.
p_c: penalty for clipping.

In the proposed dynamic programming algorithm, penalty and traceback information for state s are stored in functions P(s) and T(s), respectively. In the first step of the dynamic programming, the penalty and traceback information of states for the first position in query read R are initialized by Algorithm 1.

In Algorithm 2, the best penalties for the alignment up to the ith position of query read R are updated for each state by using the best penalties of states for the (i−1)th position of query read R; traceback information is updated at the same time. Algorithm 3 updates penalty and traceback information for states representing a match or mismatch. Algorithm 4 is used for obtaining states that are in preceding subsequences and can be traced from s_M(i,j,1). Algorithm 5 updates penalty and traceback information for states representing an insertion. Algorithm 6 updates penalty and traceback information for states representing a deletion.

For subsequence G_j with B_j=1, consecutive deletions in the same subsequence are handled with s_D(i,j,k,l), and hence s_D(i−1,j,k−1) is not considered at step 6 of Algorithm 6 for traceback. Procedures for updating penalty and traceback information for states representing consecutive deletions for subsequence G_j with B_j=1 are given as Algorithm 7. Algorithm 8 updates penalty and traceback information for s_R(i). Finally, an algorithm for traceback is given as Algorithm 9. By following the states from head to tail in Q obtained with Algorithm 9, the realignment result with the best penalty is obtained.
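The core idea, that a repeat pattern may be reused during dynamic programming, can be illustrated with a much-reduced sketch. It is not the algorithm of STR-realigner: it assumes a single repeat unit between two non-empty flanks, a linear gap penalty instead of separate open and extension penalties, no clipping states, no limit on deletion length, and it returns only the best penalty without traceback. The function name `realign_penalty` and its parameters are hypothetical.

```python
def realign_penalty(read, left, unit, right, match=-1, mismatch=4, gap=6):
    """Best global alignment penalty of `read` against left + unit* + right.

    The repeat unit may be used zero or more times. Simplifying assumptions:
    linear gap penalty, no clipping, one repeat unit only.
    """
    if not (left and unit and right):
        raise ValueError("flanks and unit must be non-empty in this sketch")
    ref = left + unit + right
    a, u, n = len(left), len(unit), len(ref)
    # Node 0 is a start sentinel; node p (1..n) stands for reference base ref[p-1].
    preds = {p: [p - 1] for p in range(1, n + 1)}
    preds[a + 1].append(a + u)   # tail of the unit loops back to its head
    preds[a + u + 1].append(a)   # the unit may be skipped entirely
    inf = float("inf")

    def relax_deletions(row):
        # Deletions move along the reference without consuming a read base.
        # Gaps cost > 0, so a best path uses the loop edge at most once and
        # two passes in linear order are enough.
        for _ in range(2):
            for p in range(1, n + 1):
                cand = min(row[q] for q in preds[p]) + gap
                if cand < row[p]:
                    row[p] = cand

    prev = [inf] * (n + 1)
    prev[0] = 0
    relax_deletions(prev)
    for base in read:
        cur = [inf] * (n + 1)
        cur[0] = prev[0] + gap  # insertion before any reference base
        for p in range(1, n + 1):
            sub = match if base == ref[p - 1] else mismatch
            diag = min(prev[q] for q in preds[p]) + sub  # match or mismatch
            ins = prev[p] + gap                          # insertion after node p
            cur[p] = min(diag, ins)
        relax_deletions(cur)
        prev = cur
    return prev[n]
```

With flanks `ACGT` and `GGTT` and unit `CA`, a read carrying five copies of the unit aligns with matches only, whatever copy number the reference holds, because the loop edge lets the unit be reused without opening a gap.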

Figure 2 summarizes the relationships among the above nine algorithms of STR-realigner as a flowchart. After the penalty and traceback information for the first query position is initialized with Algorithm 1, the information for the other query positions is updated with Algorithm 2 in a dynamic programming manner. Then, a realignment with the best penalty is obtained from the traceback information with Algorithm 9.

Time and space complexities of STR-realigner

Time complexity analysis

For each position i in query read R, updating penalty and traceback information takes O(1) time for s_M(i,j,k), s_I(i,j,k), and s_D(i,j,k) for k>1 and subsequence G_j with B_j=0. For k>1 and subsequence G_j with B_j=1, updating information for s_M(i,j,k) and s_I(i,j,k) requires O(|G_j|) time, while updating information for s_D(i,j,k) and s_D(i,j,k,l) requires O(1) time. For k=1, states for tail positions of preceding subsequences are additionally considered until reaching a subsequence G_j with B_j=0 or j=1, as in Algorithm 4. This process additionally requires \(O\left(\sum_{x=j'}^{j} |G_{x}|\right)\) time for s_M(i,j,1), where j′ is one or the index of the first subsequence G_{j′} with B_{j′}=0 reached from G_j. However, the best state and its corresponding penalty before G_{j−1} are already considered when updating information for s_M(i,j−1,1). By using this information, we need to newly consider only the states in subsequence G_{j−1}, and hence the additionally required time is reduced to O(|G_{j−1}|). Thus, with the algorithm modified according to the above argument, updating information for states s_M(i,1,1),…,s_M(i,m,1) requires \(O\left(\sum_{j} |G_{j}|\right)\) time in total. Since the same optimization can be applied to updating information for states representing insertion, updating information for states with k=1 requires \(O\left(\sum_{j} |G_{j}|\right)\) time in total as well. In addition, s_L(i) and s_R(i) require O(1) time and \(O\left(\sum_{j} |G_{j}|\right)\) time, respectively. Thus, updating penalties and traceback information for all the states requires \(O\left(\sum_{j} |G_{j}| + \sum_{j \in \{j' | B_{j'} = 1\}} |G_{j}|^{2}\right)\) time for each position in query read R, and hence the time complexity of the proposed algorithm is \(O\left(|R| \cdot \left(\sum_{j} |G_{j}| + \sum_{j \in \{j' | B_{j'} = 1\}} |G_{j}|^{2}\right)\right)\).
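To get a feel for the bound, the quantity inside the O(·) can be evaluated for a concrete layout of subsequences. The sketch below only computes that expression, with constant factors ignored; the function names and the example lengths are made up for illustration.

```python
def per_position_cost(subseq_lengths, repeat_flags):
    """Sum of |G_j| over all j plus sum of |G_j|^2 over subsequences with B_j = 1."""
    linear = sum(subseq_lengths)
    quadratic = sum(n * n for n, b in zip(subseq_lengths, repeat_flags) if b)
    return linear + quadratic


def total_cost(read_len, subseq_lengths, repeat_flags):
    """The per-position term multiplied by |R|."""
    return read_len * per_position_cost(subseq_lengths, repeat_flags)
```

For six subsequences of lengths 50, 2, 3, 40, 4, and 50, of which the second, third, and fifth are repeat patterns, the per-position term is 149 + 29 = 178. Because repeat units are short, the quadratic term stays small next to the linear one.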

Space complexity analysis

The order of the number of states for each position in query read R is \(O\left(\sum_{j} |G_{j}|\right)\) for s_M(i,j,k), s_I(i,j,k), and s_D(i,j,k). For s_D(i,j,k,l), the order is \(O\left(\sum_{j \in \{j' | B_{j'} = 1\}} |G_{j}|^{2}\right)\), and for s_L(i) and s_R(i), the order is O(1). Thus, storing values from functions P and T requires \(O\left(|R| \cdot \left(\sum_{j} |G_{j}| + \sum_{j \in \{j' | B_{j'} = 1\}} |G_{j}|^{2}\right)\right)\) space. However, P(s_D(i,j,k,l)) can be obtained by calculating P(s_D(i,j,k))+(l−1)·p_{de,j}, and T(s_D(i,j,k,l)) is given by s_D(i,j,k,l−1) for l>2 and s_D(i,j,k) for l=2. Thus, the order of the space required for functions P and T can be reduced to \(O\left(|R| \cdot \left(\sum_{j} |G_{j}|\right)\right)\) by calculating P and T for s_D(i,j,k,l) in O(1) time whenever their values are required. The working space needed to update each state is smaller than the order of the number of states and is negligible compared with the space required for P and T. Thus, with the above modification, the proposed algorithm requires \(O\left(|R| \cdot \left(\sum_{j} |G_{j}|\right)\right)\) space.
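The space saving amounts to computing the values for s_D(i,j,k,l) on demand from the stored values for s_D(i,j,k). A minimal sketch, assuming the stored penalties live in a dictionary keyed by (i, j, k) and states are represented as plain tuples (both are illustrative choices, not the data structures of STR-realigner):

```python
def lazy_deletion_state(p_single, i, j, k, l, p_de):
    """Return (penalty, traceback) for s_D(i,j,k,l) without storing it.

    p_single maps (i, j, k) to P(s_D(i,j,k)); p_de is the deletion extension penalty.
    """
    if l < 2:
        raise ValueError("s_D(i,j,k,l) is defined only for l >= 2")
    penalty = p_single[(i, j, k)] + (l - 1) * p_de
    # Traceback points to the state with one fewer deleted base.
    trace = ("D", i, j, k, l - 1) if l > 2 else ("D", i, j, k)
    return penalty, trace
```

Each lookup costs O(1) time, so the table only needs one entry per (i, j, k).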

Practical implementation

STR regions detected by bioinformatics tools [16, 17] are often contaminated with irregular repeat patterns, and the difference between the actual sequence and the assumed repeat pattern worsens the alignment quality of the proposed algorithm. To address this issue, we extract from the target STR region the maximal regions that consecutively contain the repeat pattern within a prespecified error rate. The extracted region is used as a new target STR region for STR-realigner.
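One simple way to picture this extraction step is a scan for the longest stretch that agrees with the periodic pattern up to a given mismatch fraction. The sketch below is an assumption-laden stand-in, not the procedure used by STR-realigner: it considers substitutions only, assumes the region starts in phase with the repeat unit, and uses a quadratic scan for clarity. The function name `longest_clean_repeat` is hypothetical.

```python
def longest_clean_repeat(region, unit, max_error_rate=0.1):
    """Half-open interval (start, end) of the longest stretch of `region` whose
    mismatch fraction against the periodic pattern of `unit` is <= max_error_rate.

    Ties are resolved in favour of the leftmost stretch.
    """
    if not unit:
        raise ValueError("unit must be non-empty")
    # 1 where the base disagrees with the expected periodic pattern.
    mism = [0 if base == unit[pos % len(unit)] else 1
            for pos, base in enumerate(region)]
    prefix = [0]
    for m in mism:
        prefix.append(prefix[-1] + m)
    best_start, best_end = 0, 0
    n = len(region)
    for start in range(n):
        for end in range(n, start, -1):
            if end - start <= best_end - best_start:
                break  # cannot beat the current best from this start
            if prefix[end] - prefix[start] <= max_error_rate * (end - start):
                best_start, best_end = start, end
                break
    return best_start, best_end
```

A realistic implementation would also need to handle indels inside the repeat, which shift the phase of the pattern.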

To use the realignment result from the proposed algorithm for resequenced data, the parts of the query read aligned to G_j with B_j=1 are realigned again to the corresponding STR region of the reference genome. However, the quality of this alignment is also worsened by irregular patterns in the STR region. Thus, we consider a subsequence for a repeat pattern right after the target STR region and set a lower deletion penalty for the target STR region. The following penalty settings were used in our study: p_{m,j}=−1, p_{mis,j}=4, p_{io,j}=6, p_{ie,j}=1, p_{do,j}=6, p_{de,j}=1, and p_c=5. These parameter values are the same as the default values in BWA-MEM. For subsequences corresponding to target STR regions, which receive the lower deletion penalty, p_{do,j} is set to 4.
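The penalty settings listed above can be collected in a small configuration object. The class and field names below are hypothetical and chosen only for readability; the numeric values are the ones stated in the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Penalties:
    """Per-subsequence penalties; names are illustrative, values from the text."""
    match: int = -1      # p_m (negative, so it acts as a reward)
    mismatch: int = 4    # p_mis
    ins_open: int = 6    # p_io
    ins_ext: int = 1     # p_ie
    del_open: int = 6    # p_do
    del_ext: int = 1     # p_de


CLIP_PENALTY = 5  # p_c, not tied to a particular subsequence

DEFAULT = Penalties()
# Target STR regions get a lower deletion open penalty.
STR_TARGET = replace(DEFAULT, del_open=4)
```

Keeping the penalties per subsequence makes it straightforward to assign `STR_TARGET` to the target STR region and `DEFAULT` everywhere else.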

In Illumina reads, bases at positions after homopolymer regions are highly error-prone because phasing errors accumulate within homopolymer regions during the sequencing-by-synthesis process. Figure 3 shows an example of erroneous bases around a homopolymer region, where many clippings occur around a long homopolymer of A bases in GRCh37 due to sequencing errors. Since sequence reads with such highly erroneous bases worsen the quality of realignment with STR-realigner, we additionally implemented an option that skips the realignment for homopolymer regions of at least a specified size, such as 15.
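A check of this kind can be sketched in a few lines. The threshold semantics (a run of at least `min_run` identical bases) and the function names are assumptions made for illustration; the actual option in STR-realigner may be defined differently.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Length of the longest run of a single base in seq (case-insensitive)."""
    return max((len(list(group)) for _, group in groupby(seq.upper())), default=0)


def skip_realignment(region_seq, min_run=15):
    """True if the region contains a homopolymer run of at least min_run bases."""
    return longest_homopolymer(region_seq) >= min_run
```

A region such as `ACG` followed by fifteen `A` bases would be skipped, while `ACGTACGT` would still be realigned.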

Each mapping tool has its own characteristics in how it aligns reads. For example, for the same sequence reads, one mapping tool may place a deletion at the start of an STR region while another places it at the end. The performance of variant calling is worsened if such characteristics are mixed in the alignment results. Thus, by default, all reads aligned to a target STR region are realigned with STR-realigner.

Results and discussion

Simulation analysis

From a list of STR regions provided in the RepeatSeq software package, we extracted STR regions for evaluation as follows:

STR regions not in chromosome 22 were filtered out.
STR regions with size longer than 100 bp were filtered out.

The maximum period, i.e., the size of the repeat pattern, in the list is six. Since the length of the sequence reads considered in the following experiments is 100 or 101 bp and such reads cannot span STR regions > 100 bp in most cases, STR regions > 100 bp were filtered out. We then prepared synthetically generated diploid genome sequences of chromosome 22 based on phased genotypes for a CEU individual, NA12286, in the phase 3 phased reference panel of the 1000 Genomes Project [18]. In the generation of these genome sequences, variants located in the extracted repeat regions were ignored. The total number of variants is 54,897. By randomly sampling repeat numbers, we generated two sets of repeat numbers for the extracted repeat regions and, for the evaluation, added STR variants to the diploid genome sequences based on these sets. Note that repeat numbers for which the size of the STR region is > 100 bp were avoided in the random sampling process. From the diploid genome sequences, we generated paired-end sequence reads in FASTQ format with a read length of 100 bp and an insert size normally distributed with a mean of 500 bp and a standard deviation of 50 bp. Substitution errors were added to the generated reads at a rate of 0.1%. Base quality scores in the FASTQ files were set to Q30, which corresponds to an error rate of 0.1%. The read coverage of the generated data is 40×. A BAM file for the dataset was obtained by mapping the sequence reads to the reference genome (GRCh37) with BWA-MEM (0.7.12-r1039) [19]. We applied our proposed realignment method, STR-realigner, as well as ReviSTER (0.1.7) and GATK IndelRealigner (GATK 3.4-0), to the BAM file independently and generated three types of realigned BAM files.
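The arithmetic behind these simulation settings can be checked with a few helper functions. They are illustrative only: the function names are hypothetical, the read simulator used in the study is not specified here, and rounding sampled insert sizes to whole bases is an assumption.

```python
import random


def phred_to_error(q):
    """Error probability for a Phred quality score: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def read_pairs_for_coverage(genome_len, coverage, read_len):
    """Number of read pairs giving the requested mean coverage with paired-end reads."""
    return round(genome_len * coverage / (2 * read_len))


def sample_insert_size(rng, mean=500, sd=50):
    """Draw one insert size from a normal distribution, rounded to whole bases."""
    return max(1, round(rng.gauss(mean, sd)))
```

Q30 gives 10^(−3) = 0.1%, which matches the substitution error rate used for the reads, and a 1 Mb sequence at 40× with 100 bp paired-end reads needs 200,000 read pairs.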

For GATK IndelRealigner, the --consensusDeterminationModel option was set to USE_READS. RepeatSeq (v0.8.2) was applied to the original BAM file and the three types of realigned BAM files, and the sizes of variants in the target STR regions were obtained. Table 1 shows the call rates of the results from RepeatSeq using the original BAM file and the three types of realigned BAM files. The call rate is the fraction of results in which the STR region size is estimated as a non-NA value. For all STR periods other than 1, the call rates for the BAM file realigned with STR-realigner are higher than those for the other BAM files.
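The call rate defined above is a simple fraction. A minimal sketch, assuming NA results are represented as `None` (the function name `call_rate` is hypothetical):

```python
def call_rate(sizes):
    """Fraction of loci whose estimated STR region size is not NA (None)."""
    if not sizes:
        raise ValueError("no loci given")
    called = sum(1 for s in sizes if s is not None)
    return called / len(sizes)
```

For four loci of which one is NA, the call rate is 0.75.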

Table 1 Call rate of STR calling results with RepeatSeq using the original BAM file at 40× coverage and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner. The best result is underlined
