STR-realigner: a realignment method for short tandem repeat regions

Background In the estimation of repeat numbers in a short tandem repeat (STR) region from high-throughput sequencing data, two types of strategies are mainly taken: a strategy based on counting repeat patterns included in sequence reads spanning the region and a strategy based on estimating the difference between the actual insert size and the insert size inferred from paired-end reads. The quality of sequence alignment is crucial, especially in the former approaches although usual alignment methods have difficulty in STR regions due to insertions and deletions caused by the variations of repeat numbers. Results We proposed a new dynamic programming based realignment method named STR-realigner that considers repeat patterns in STR regions as prior knowledge. By allowing the size change of repeat patterns with low penalty in STR regions, accurate realignment is expected. For the performance evaluation, publicly available STR variant calling tools were applied to three types of aligned reads: synthetically generated sequencing reads aligned with BWA-MEM, those realigned with STR-realigner, those realigned with ReviSTER, and those realigned with GATK IndelRealigner. From the comparison of root mean squared errors between estimated and true STR region size, the results for the dataset realigned with STR-realigner are better than those for other cases. For real data analysis, we used a real sequencing dataset from Illumina HiSeq 2000 for a parent-offspring trio. RepeatSeq and lobSTR were applied to the sequence reads for these individuals aligned with BWA-MEM, those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner. STR-realigner shows the best performance in terms of consistency of the size of estimated STR regions in Mendelian inheritance. Root mean squared error values were also calculated from the comparison of these estimated results with STR region sizes obtained from high coverage PacBio sequencing data, and the results from the realigned sequencing data with STR-realigner showed the least (the best) root mean squared error value. Conclusions The effectiveness of the proposed realignment method for STR regions was verified from the comparison with an existing method on both simulation datasets and real whole genome sequencing dataset.

such as genome insertion, genome deletion, short tandem repeat (STR) number polymorphisms, and copy number variations, especially from data with low read coverage [7].
For repeat number polymorphisms, several studies thus far reported associations with various disease phenotypes such as CAG repeats in the Huntingtin gene with Huntington's disease [8] and CAG repeats in the androgen receptor gene with spinal and bulbar muscular atrophy [9]. From HTS data, several approaches such as lobSTR [10], RepeatSeq [11], STRViper [12], and coalescentSTR [13] have been proposed for estimating repeat numbers in STR regions. In lobSTR and RepeatSeq, repeat patterns included in sequence reads spanning the STR regions are considered for the estimation of repeat numbers.
On the other hand, STRViper and coalescentSTR estimate repeat numbers by considering difference between the actual insert size and the insert size inferred from paired-end reads aligned to the flanking regions of the target repeat. The alignment quality of sequence reads is important for accurate repeat number estimation, especially in the former approaches although usual alignment methods have difficulty in STR regions due to insertions and deletions caused by the frequent change of repeat numbers.
We propose a new dynamic programming based realignment method named STR-realigner where repeat patterns in STR regions are given as prior knowledge, and repeat patterns are used multiple times in the realignment process. Although a similar algorithm is adopted in a tool for detecting STR regions in PacBio reads based on 3-stage modified Smith-Waterman [14], consecutive STR regions can be handled in the proposed algorithm unlike the tool. In addition, clipping fragments, which are an essential feature for the realignment, are also considered in the proposed algorithm. By allowing insertions and deletions of repeat patterns in STR regions with repeatedly use of repeat units, accurate realignment of sequence reads is expected.
In a simulation study with synthetically generated HTS data for artificial diploid genomes sequence based on phased genotypes of a sample in the dataset of 1000 Genomes Project [5], we showed the effectiveness of our model by evaluating root mean squared errors between true and estimated repeat numbers with RepeatSeq or allelotype, an STR calling software in the lobSTR package, from realignment results. For real data analysis, we applied STR-realigner, ReviSTER [15], and GATK IndelRealigner to HTS data from Illumina HiSeq 2000 for a HapMap CEU parent-offspring trio and show the effectiveness of STR-realigner based on consistency in Mendelian inheritance in the estimated repeat numbers in the parent-offspring trio. Root mean squared error values were also calculated from the comparison with the gold standard STR region size obtained from high coverage PacBio sequencing data for one of samples in the parent-offspring trio, and the results from the realigned sequencing data with STR-realigner showed the least (the best) root mean squared error value.

Realignment algorithm considering repeat sequence as prior knowledge
We propose a dynamic programming based algorithm named STR-realigner that realigns query read R to a genome sequence, taking into account the multiple use of repeat patterns for prespecified STR regions. We consider a genome sequence comprised of series of m subsequences G 1 , . . . , G m . Let B j be a binary variable that takes one if G j can be used repeatedly and zero otherwise, i.e., subsequence G j with B j = 1 is for a repeat pattern in one of prespecified STR regions. Figure 1 shows an example of a genome sequence comprised of subsequences G 1 , . . . , G 6 , where G 2 , G 3 , and G 5 are repeat patterns of prespecified STR regions and are repeatedly used in the proposed realignment algorithm. In the description of the proposed algorithm, |R| and |G j | denote the size of R and G j , and R[k] and G j [k] denote bases at the kth position of R and G j , respectively.
Since infinitely long deletions can be considered by using the same subsequence with B j = 1 repeatedly, we limit the size of deletions to less than |G j | for subsequences with B j = 1. We consider the following six types of states for the alignment of the ith position in query read R to the kth position of subsequence G j .
1. s M (i, j, k): a state representing match or mismatch between bases at the i th position of query read R and the k th position of subsequence G j . 2. s I (i, j, k): a state representing insertion at the i th position of query read R right after the k th position of subsequence G j . 3. s D (i, j, k): a state representing deletion of the k th position of subsequence G j right after the i th position of query read R. 4. s D (i, j, k, l): a state representing deletion from the k − l + 1 to k th positions of subsequence G j right after the i th position of query read R. This state is considered only for subsequences with B j = 1 in order to avoid deletions longer than |G j | by limiting the range of l from 2 to |G j | − 1. For l = 1, s D (i, j, k) is used, and consecutive deletions in the same subsequence are not considered for s D (i, j, k) with B j = 1. If l is longer than k, the deletion starts from the |G j | − l − k + 1st position on the subsequence Fig. 1 An example of notations in STR-realigner. Subsequence with B j = 1 can be used repeatedly in the realignment process and the deletion part rotates from tail to head of the subsequence. 5. s L (i): a state representing left clipping that ends at the i th position of query read R. 6. s R (i): a state representing right clipping that starts at the i th position of query read R.
The following penalties are considered in the proposed realignment algorithm.
• p m,j : penalty for match of bases between query read R and subsequence G j . Usually, the penalty is set to a minus value, i.e., the penalty is used for rewarding. • p mis,j : penalty for mismatch of bases between query read R and subsequence G j . • p io,j and p ie,j : penalties for open and extension of insertion on subsequence G j , respectively. • p do,j and p de,j : penalties for open and extension of deletion on subsequence G j , respectively. • p c : penalty for clipping.
In the proposed dynamic programming algorithm, penalty and traceback information for state s are stored in functions P(s) and T(s), respectively. In the first step of the dynamic programming, penalty and traceback information of states for the first position in query read R are initialized in the following algorithm.
Algorithm 1: Initialize penalty and traceback information of states for the first position in query read R 1. For subsequence ID j from 1 to m, perform the following steps: (a) For position k from 1 to |G j | in subsequence G j , perform the following steps: For each state s considered at steps i to iv, set T(s) to NULL.
The best penalties for the alignment up to the ith position of query read R for each state is updated by using the best penalties of states for the i − 1st position of query read R in Algorithm 2, where traceback information is also updated. Algorithm 3 given below updates Algorithm 2: Update penalties and traceback information of states for positions 2 to |R| in query read R 1. For position i from 2 to |R| in query read R, perform the following steps: (a) For subsequence ID j from 1 to m, perform the following steps: (i) For position in subsequence G j , k from 1 to |G j |, perform the following steps: penalty and traceback information for states representing match or mismatch. Algorithm 4 given below is used for obtaining states that are in preceding subsequences and can be traced from s M (i, j, 1). Algorithm 5 given below updates penalty and traceback information for states representing insertion. Algorithm 6 given below updates penalty and traceback information for states representing deletion.
For subsequence G j with B j = 1, consecutive deletions in the same subsequence are handled with s D (i, j, k, l), and hence s D (i − 1, j, k − 1) is not considered at step 6 of Algorithm 6 for traceback. Procedures for updating penalty and traceback information for states representing consecutive deletions for subsequence G j with B j = 1 is given as Algorithm 7. Algorithm 8 given below updates penalty and traceback information for s R (i). Finally, an algorithm for traceback is given as Algorithm 9. By following states from head to tail in Q obtained with the above algorithm, the realignment result with the best penalty is obtained.
Set P(s I (i, j, k)) to infinity and perform the following steps for each s ∈ S: (a) Set penalty p to P(s) + p ie,j if s is a state representing insertion, otherwise set p to P(s) + p io,j . (b) Set P(s I (i, j, k)) and T(s I (i, j, k)) to p and s, respectively if P(s I (i, j, k)) > p holds.
Add states obtained with Algorithm 4 to S if k = 1 holds. 8. Set P(s D (i, j, k)) to infinity and perform the following steps for each s ∈ S: (a) Set penalty p to P(s) + p de,j if s is a state representing deletion, otherwise set p to

Algorithm 7: Update penalty and traceback informa-
Algorithm 8: Update penalty and traceback information for s R (i) 1. Let S be an empty set for states. 2. For subsequence ID j from 1 to m, perform the following steps: (a) For position k from 1 to |G j | in subsequence G j , perform the following steps: 3. Getŝ = arg max s∈S P(s).

Set T(s R (i))
and P(s R (i)) toŝ and P(ŝ)+p c , respectively.

Algorithm 9:
Trace back states for the realignment result with the best penalty 1. Let S be a set of states representing match or mismatch for the |R|th position in query read R.
3. Getŝ = arg max s∈S P(s). 4. Let Q be an empty queue and addŝ to Q. 5. Setŝ to T(ŝ) 6. Ifŝ is not NULL, addŝ to Q as its head element and go back to step 5. Figure 2 summarize a relationship of the above nine algorithms considered in STR-realigner as a flowchart. After initialization of penalty and traceback information for first query position with Algorithm 1, penalty and traceback information are updated for other query positions with Algorihtm 2 in a dynamic programming manner. Then, a realignment with the best penalty is obtained from traceback information with Algorithm 9.

Time and space complexities of STR-realigner Time complexity analysis
For each position i in query read R, updating penalty and traceback information takes For k = 1, states for tail positions of preceding subsequences are additionally considered until reaching to subsequence G j with B j = 0 or j = 1 as in Algorithm 4. This process additionally requires O j x=j |G x | time for s M (i, j, 1), where j is one or the index for the first subsequence G j with B j = 0 reached from G j . However, since the best state and its corresponding penalty before G j−1 are already considered for updating information for s M (i, j − 1, 1), by using this information, we need to newly consider only states in subsequence G j−1 , and hence the additionally required time complexity is reduced to O(|G j−1 |). Thus, with the modification of the algorithm according to the above argument, updating information for states s M (i, 1, 1), . . . , s M (i, m, 1) requires O j |G j | time in total. Since the same optimization can be applied to updating information for states representing insertion, updating information for states with k = 1 requires O j |G j | time in total as well. In addition, for s L (i) and s R (i), O(1) time and O j |G j | time are required, respectively. Thus, updating penalties and traceback information for all the states requires O j |G j | + j∈{j |B j =1} |G j | 2 time for each position in query read R, and hence the time complexity of the proposed algorithm is

Space complexity analysis
The order of the number of states for each position in query read R is O j |G j | for s M (i, j, k), s I (i, j, k), and s D (i, j, k). For s D (i, j, k, l), the order is O j∈{j |B j =1} |G j | 2 , and for s L (i) and s R (i), the order is O(1). Thus, storing values from functions P and T requires O |R| · j |G j | + j∈{j |B j =1} |G j | 2 space. However, P(s D (i, j, k, l)) can be obtained by calculating P(s D (i, j, k)) + (l − 1) · p de,j , and T(s D (i, j, k, l)) is given by s D (i, j, k, l − 1) for l > 2 and s D (i, j, k) for l = 2. Thus, the order of the space required for functions P and T can be reduced to O |R| · j |G j | by calculating functions P and T for s D (i, j, k, l) with O(1) time when their values are required. The space required for updating for each state is less than the order of the number of states and is negligible, compared to spaces required for P and T. Thus, with the above modification, the proposed algorithm requires O |R| · j |G j | space.

Practical implementation
Irregular repeat patterns are often contaminated in the provided STR regions detected by some Bioinformatics tools [16,17], and those irregular repeat patterns worsen the quality of the alignment of the proposed algorithm due to the difference of the actual sequence and the assumed repeat pattern. In order to address this issue, we extract maximal regions containing repeat patterns consecutively with some pre-specified error rate from the target STR region. The extracted region is used for a new target STR region for STR-realigner.
In order to use the realignment result from the proposed algorithm for resequenced data, parts of the query read aligned to G j with B j = 1 are again realigned to the corresponding STR region of the reference genome. However, the quality of the alignment is also worsened due to irregular patterns in the STR region. Thus, we consider a subsequence for a repeat pattern right after the target STR region and set lower deletion penalty to the target STR region. For penalty, the following setting were used in our study: p m,i = −1, p mis,i = 4, p io,i = 6, p ie,i = 1, p do,i = 6, p de,i = 1, and p c = 5. These parameter values are the same as the default values in BWA-MEM. For subsequences corresponding to target STR regions for lower deletion penalty, p do,i is set to 4.
In Illumina reads, bases at positions after homopolymer regions are highly erroneous because the same phasing is accumulated in synthesis during the Illumina sequencing process in homopolymer regions. Figure 3 shows an example of erroneous bases around a homopolymer region where a lot of clippings occur around a long homopolymer comprised of A bases in GRCh37 due to sequencing errors. Since sequence reads with such highly erroneous bases worsen the quality of realignment with STR-realigner, we additionally implemented an option that skips the realignment with STR-realigner for homopolymer regions with some specified size such as 15. Each mapping tool has its specific characteristics in the aligned reads. For example, a deletion exists in the start position of an STR region in the reads aligned with some mapping tool while a deletion exists in the end position of the STR region in the alignment result of another mapping tool for the same sequence reads. The performance of variant calling is worsened if such characteristics are mixed in the alignment results. Thus, all the reads aligned to a target STR region are realigned with STR-realigner in the default condition.

Simulation analysis
From a list of STR regions provided in the RepeatSeq software package, we extracted STR regions for evaluation as follows: • STR regions not in chromosome 22 were filtered out.
• STR regions with size longer than 100 bp were filtered out.
The maximum period, the size of repeat pattern, in the list is six. Since the length of sequence reads considered in the following experiments is 100 or 101 bp and these sequence reads cannot span STR regions > 100 bp for most of the cases, STR regions > 100 bp were filtered out. We then prepared synthetically generated diploid genome sequences of chromosome 22 based on phased genotypes for a CEU individual, NA12286, in the phase3 phased reference panel by the 1000 Genomes Project [18]. In the generation of the above genome sequences, variants located in the extracted repeat regions were ignored. The number of variants in total is 54,897. By randomly sampling repeat numbers, we generated two sets of repeat numbers for the extracted repeat regions and added STR variants to the diploid genome sequences based on the sets of repeat numbers for the evaluation. Note that repeat numbers with which the size of STR region is > 100 bp were avoided in the random sampling process. From the diploid genome sequences, we generated paired-end sequence reads in FastQ format with the read length of 100 bp and the insert size normally distributed with mean of 500 bp and standard deviation of 50 bp. In the generated reads, substitution errors were added with rate of 0.1%. Base quality scores for bases in FastQ format were set to Q30, which corresponds to 0.1% error. The read coverage of the generated data is 40×. A BAM file for the dataset was obtained by mapping the sequence reads to the reference genome (GRCh37) with BWA-MEM (0.7.12-r1039) [19]. We applied our proposed realignment method, STRrealigner, ReviSTER (0.1.7), and GATK IndelRealigner (GATK 3.4-0) to the BAM independently and generated three types of BAM files.
For GATK IndelRealigner, USE_READS was used for --consensusDeterminationModel option. RepeatSeq (v0.8.2) was applied to the original BAM file and the three types of realigned BAM files, and sizes of variants in the target STR regions were obtained. Table 1 shows call rates of results from RepeatSeq using the original BAM file and the three types of realigned BAM files. The call rate indicates the rate of results with STR region size estimated as a non-NA value. For all the STR periods other than period of 1, call rates of results from the BAM file realigned with STR-realigner are higher than those from other BAM files. Table 2 summarizes the root mean squared errors (RMSE) between estimated and true STR region size for each BAM file for all the STR regions. In the calculation of RMSE, the size in the reference genome was assigned for the region size estimated as NA value. For all the STR periods other than period of 1, the results from the BAM file realigned with STR-realigner show the best RMSE value. The RMSE value from the results based on the BAM file realigned with GATK IndelRealigner is slightly better than that based on the original BAM file. In order to examine the performance excluding the results estimated as NA value, we summarized RMSE for STR regions where results were commonly estimated as a non-NA value on all the four types of BAM files in Table 3. Similarly to the results for all the STR regions, the results from the BAM file realigned with STR-realigner show the best RMSE value for all the STR periods other than period of 1. The RMSE value from the results based on the BAM file realigned with GATK IndelRealigner is slightly better than that based on the original BAM file. We also applied allelotype (4.0.0) [10], an STR calling software in lobSTR package, to the original BAM file and the three types of realigned BAM files. Tables 4, 5 and 6 show call rates, RMSE values averaged on all the regions, and RMSE values averaged on commonly called regions for results from allelotype, respectively. In these tables, the      In order to examine the performance on lower coverage data, we downsampled the original BAM file from 40× to 10× and estimated repeat numbers with Repeat-Seq and allelotype using the downsampled BAM file and BAM files obtained by applying the realignment methods to the downsampled BAM file. Table 7 shows call rates of results from RepeatSeq for the downsampled BAM files. For all the STR periods other than period of 1, call rates of results from the BAM file realigned with STR-realigner are higher than those from other BAM files. Tables 8 and 9 respectively summarize RMSE values for each BAM file for all the STR regions and the regions where results were commonly estimated as a non-NA value. In the calculation of RMSE, the size in the reference genome was set for the region size estimated as NA value for Table 8. In both  Tables 8 and 9, the results from the BAM file realigned with STR-realigner shows the best RMSE values for all the STR periods other than period of 1. The RMSE value from the results based on the BAM file realigned with GATK IndelRealigner is slightly better than that based on the original BAM file.
Tables 10, 11 and 12 show call rates, RMSE values averaged on all the regions, and RMSE values averaged on commonly called regions for results from allelotype, respectively. The results for the BAM file realigned with STR-realigner gave the highest call rate for all the STR periods other than period of 1. In total, STR-realigner gave the best results in both considering all the STR regions and commonly called regions.
The results for the sequencing data of 10× are always worse than those of 40× in all the cases.

Real data analysis
For real human sequencing data, we used 101 bp paired-end sequencing data of a CEU parent-offspring trio NA12878, NA12891, and NA12892 analyzed in the 1000 Genomes Project. NA12891 and NA12892 are parents of NA12878. The data was sequenced on Illumina HiSeq 2000 with the read coverage of 50× and the average insert size of 300 bp. Sequence reads were mapped to the reference genome (GRCh37) with BWA-MEM and stored in BAM format. The data was obtained from the Illumina Platinum Genomes Project through the European Nucleotide Archive under the      In Table 13, the number of regions with consistent sizes in terms of Mendelian inheritance (#CR) and the number of inconsistent estimation results (#IR) are summarized. Note that the larger number is better for consistent regions while the smaller number is better for inconsistent regions. The results for STR period of 1 without skipping homopolymer regions with size > 15 are in parentheses.
For STR periods of 3, 4, and 6, results from BAM files realigned with STR-realigner gave the best results in both consistent and inconsistent regions.
In total, results from BAM files realigned with STRrealigner gave the best results in both consistent and inconsistent regions. Results for GATK IndelRealigner are consistent in more regions than those for the original BAM files although the results for GATK IndelRealigner contains the most inconsistent regions.
In Table 14, the number of regions with consistent sizes in terms of Mendelian inheritance (#CR) and the number of inconsistent sizes (#IR) on results estimated with allelotype are summarized. For STR periods of 2, 3, 5, and 6, results from BAM files realigned with STR-realigner gave the best results in consistent regions. In addition, for STR periods of 2 and 5, results from BAM files realigned with STR-realigner also gave the best results in inconsistent regions. In total, the results for STR-realigner gave the best performance in both consistent and inconsistent regions. Figure 4 shows an IGV view where STR-realigner effectively works on realigning inserted repeat patterns in an STR region comprised of GGAT repeats located at   In the estimated result with RepeatSeq, the insertion was included in the estimated size of the STR region. We also evaluated root mean squared errors for NA12878 by using STR region sizes obtained from high coverage PacBio sequencing data as the gold standard. The PacBio sequencing data was obtained through Bio-Project PRJNA253696 with Sequence Read Archive accession numbers SRX627421 and SRX638310 [20] and its read coverage is 46× in total. Error-corrected reads with Falcon (https://github.com/PacificBiosciences/FALCON) in FASTA format were aligned to GRCh37 with BWA-MEM, and STR region size data was then obtained from variant calling results by applying allelotype to the aligned reads. Table 15 summarizes the RMSE values between estimated STR region size with RepeatSeq and the gold standard for each BAM file for all the STR regions. In the calculation of RMSE, the size in the reference genome was assigned for the region size estimated as NA value. RMSE values for estimated results with allelotype are also summarized in Table 16. In both cases using RepeatSeq and allelotype, the results from sequencing data realigned with STR-realigner showed the best performance in total. Table 17 shows the computational time of STR-realigner, ReviSTER, and GATK IndelRealigner for simulation and real data analyzed in Sections 1 and 1. For the real data, the computational time for realigning the BAM file for NA12878 was measured. Computation was conducted in Intel Xeon CPU E5-2670 processors with a single thread, and computational time for each case is for single process. STR-realigner is implemented in Java. In both simulation and real data, ReviSTER required the most computational time among these methods, and STR-realigner required more computational time than GATK IndelRealigner. For STR-realigner, by limiting alignment space within some window of the original alignment result, the drastic reduction of the computational time is expected while keeping its realignment quality. In addition, the computational time can be reduced by realigning sequence reads for each STR region in a parallel manner. For memory  consumption, STR-realigner requires less than 2GB in both simulation and real data.

Conclusion
We proposed a new realignment method for STR regions named STR-realigner that takes sequence reads aligned with other methods and realigns sequence reads by dynamic programing manner with the consideration of the corresponding STR repeat pattern as prior knowledge.
For the simulation data analysis, we prepared synthetically generated reads aligned with BWA-MEM, those realigned with STR-realigner, those realigned with ReviSTER, and those realigned with GATK IndelRealigner. In order to evaluate the effectiveness of our proposed method, we applied RepeatSeq and allelotype to these four types of aligned reads, and the results from sequence reads realigned with the proposed method give the best RMSE value among the results from these four types of aligned reads. From the comparison of root mean squared errors between estimated and true STR region size for these four types of aligned reads, the results for the dataset realigned with STR-realigner are better than those for other datasets for most of the cases. For real data analysis, we considered a real sequencing dataset from Illumina HiSeq 2000 for a parent-offspring trio, RepeatSeq was applied to an aligned sequencing dataset with BWA-MEM, that realigned with STR-realigner, that realigned with ReviS-TER, and that realigned with GATK IndelRealigner, and the results from the dataset realigned with STR-realigner shows the best performance in terms of consistency of the size of estimated STR regions in Mendelian inheritance. In addition, by using the size for STR regions obtained from high coverage PacBio sequencing data as the gold standard, the results for STR-realigner show the best RMSE values for the case considering all the periods. In both simulation and real data, ReviSTER required the most computational time among the realignment methods considered in this work, and the proposed method required more computational time than GATK IndelRealigner. However, the computational time of STRrealigner can be reduced drastically by parallel computing and limiting the search space for the realignment around the originally aligned results in some specified window size.