Perfect Hamming code with a hash table for faster genome mapping
 Yoichi Takenaka^{1},
 Shigeto Seno^{1} and
 Hideo Matsuda^{1}
https://doi.org/10.1186/1471-2164-12-S3-S8
© Takenaka et al; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
Abstract
Background
With the advent of next-generation sequencers, the growing demand to map short DNA sequences to a genome has promoted the development of fast algorithms and tools. The tools commonly used today are based on either a hash table or the suffix array/Burrows–Wheeler transform. These algorithms are best suited to finding the genome positions of exactly matching short reads, but they have limited capacity to handle mismatches: to find n mismatches, they require O(2^{n}) times the computation time of exact matching. Therefore, acceleration techniques are required.
Results
We propose a hash-based method for genome mapping that reduces the number of hash references needed to find mismatches without increasing the size of the hash table. The method regards DNA subsequences as words on the Galois extension field GF(2^{2}), and each word is encoded to a code word of a perfect Hamming code. The perfect Hamming code defines equivalence classes of DNA subsequences. Each equivalence class includes the subsequences whose corresponding words on GF(2^{2}) are encoded to the same code word. The code word is used as a hash key to store these subsequences in a hash table. Specifically, the method reduces by about 70% the number of hash keys necessary for searching the genome positions of all 2-mismatches of a 21-base-long DNA subsequence.
Conclusions
This paper shows that a perfect Hamming code can reduce the number of hash references for hash-based genome mapping. As the computation time to calculate code words is far shorter than a hash reference, our method is effective in reducing the computation time to map short DNA sequences to a genome. The amount of data that DNA sequencers generate continues to increase, and more accurate genome mappings are required. Thus, our method will be a key technology for developing faster genome mapping software.
Background
The history of bioinformatics has been dominated by the search for faster sequence alignment methods. Beginning with dynamic programming for protein and genome sequence alignment, many algorithms have been proposed. Hash tables are used in the series of FASTA programs [1], which calculate approximate alignments in shorter times than dynamic programming can. BLAST tools, which use automatons in their algorithms, are the most famous and most used alignment tools [2]. These tools are fast enough to align expressed sequence tags generated by capillary electrophoresis-based DNA sequencers to target genomes.
The emergence of next-generation sequencing technology has changed the demands for alignment speed. A so-called next-generation sequencer can read far more base pairs than a conventional sequencer: more than two billion short DNA sequences in a single run. For such a large number of sequences, BLAST tools are too slow to map the sequences to target genomes. Therefore, researchers have called for a faster approach that is focused on mapping short fragments.
To meet this demand, more than 25 software programs designed for mapping short DNA sequences onto genomes have been developed. These are classified into two categories according to their algorithms, which are either hash-based or suffix array/Burrows–Wheeler transform (BWT)-based [3]. MAQ [4] and SOAPv1 [5] are two hash-based algorithms. The former indexes short DNA sequences and the latter indexes genome sequences. mrsFAST is one of the newest algorithms and indexes both the short DNA sequences and the genome sequences [6]. The first genome-mapping algorithm based on suffix arrays was proposed in 2002 [7] and implemented as vmatch. Currently, BWT-based algorithms are the fastest and are used in four tools: bowtie [8], BWA [9], SOAPv2 [10], and segemehl [11].
These algorithms are effective for mapping short sequences to genome positions of perfect matches and one-base mismatches, but are inefficient for mapping to positions with mismatches of two or more bases. In general, they require O(2^{n}) computation time to calculate n-base mismatches. However, genome mapping that allows only one-base mismatches is inadequate: in practice, 20 to 40% of short sequences cannot be mapped to the genome. Therefore, “wet” researchers require faster algorithms for mapping short sequences to genome positions with mismatches of two or more bases. In this paper, we propose a method that can accelerate hash-based genome mapping by reducing the number of hash references without increasing the size of the hash table.
In the proposed method, DNA subsequences are divided into equivalence classes by using a perfect Hamming code. Each equivalence class includes the subsequences whose corresponding words on GF(2^{2}) are encoded to the same code word of the perfect Hamming code. The code word is used as a hash key to store these subsequences in a hash table. A perfect Hamming code is a special case of a Hamming code, known in the field of coding theory [12], that satisfies the Hamming bound with equality. Perfect Hamming codes have been applied to n-gram analysis of genome sequences [13] and to multiple alignment [14].
Hash-based genome-mapping algorithms use hash tables. A hash table is an array indexed by hash values generated from hash keys. Thus, a hash table is an implementation of an associative array. There are two methods for mapping short reads onto genomes using hash tables. One is to store subsequences of the genome and their positions in a hash table, and the other is to store subsequences of short reads. As there is no essential difference between their hash usages, we use the former method for the following explanation.
The hash-based methods prepare a hash table whose keys and values represent subsequences of length l cut from a target genome and the genome positions of those subsequences, respectively. To map a short sequence to the genome, the sequence is cut into subsequences of length l, and these are used as keys to refer to the hash table. The methods find the genome position of a perfect match if the hash table returns at least one entry. In general, when the short sequences are longer than the hash key, the methods expand the area of alignment from the genome position. The methods differ in how and when they refer to the hash table.
There are three methods to find the n-mismatch genome positions of a subsequence of length l with the hash table.
1. Refer to all n-mismatch subsequences.
Prepare a hash table whose key length is l, and use the subsequence and its n-mismatch subsequences as keys to refer to the table. This requires $\sum_{i=0}^{n} \binom{l}{i} 3^{i}$ hash references to find all the n-mismatch genome positions.
2. Store n-mismatch positions in the hash table.
For each position of the subsequence of the genome, store the position $\sum_{i=0}^{n} \binom{l}{i} 3^{i}$ times. The hash keys are the subsequence and its n-mismatch subsequences.
3. Use the pigeonhole principle; combine the hash table with another method.
Generate a hash table whose key length is ⌊l/n⌋. After getting the perfect-match genome positions of length ⌊l/n⌋ by referring to the hash table, find n-mismatch sequences by another method, such as dynamic programming or BWT.
These methods are effective when l is small and n equals 1. But they are difficult to use when n is 2 or more, because the number of 2-mismatch sequences of length l is l(l – 1)/2 × 9 and that of 3-mismatch sequences is l(l – 1)(l – 2)/2 × 9. The first and second methods require too many hash references and too big a hash table, respectively. The third method is the best, but as n becomes larger, its ability to narrow down the genome position becomes weaker, and so the load of the post-process to find n-mismatch sequences increases. To overcome these difficulties and improve the effectiveness of using hash tables for genome mapping, technical breakthroughs are needed.
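The counts above follow from choosing the mismatching positions and one of the three alternative nucleotides at each of them. The following minimal Python sketch reproduces them; the function names `mismatch_count` and `hash_references` are illustrative and do not come from the paper.

```python
from math import comb

def mismatch_count(l, n):
    """Number of length-l DNA words at Hamming distance exactly n from a given word."""
    return comb(l, n) * 3 ** n

def hash_references(l, n):
    """Keys needed when every word within n mismatches is looked up separately."""
    return sum(mismatch_count(l, i) for i in range(n + 1))

print(mismatch_count(21, 2))   # 1890
print(mismatch_count(21, 3))   # 35910
print(hash_references(21, 2))  # 1954
```

For l = 21 and n = 2 this gives 1954 look-ups per subsequence, which is the cost that the proposed method aims to reduce.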
We propose a method to reduce the number of hash references needed to find the genome positions of 2 or more mismatches without enlarging the hash table. To realize the method, a 4-ary perfect Hamming code is used.
Results
Perfect Hamming codes as hash keys
Idea
The features of this hash table are as follows: (1) The number of entries in the hash table does not increase because each subsequence is stored only once. (2) Using this hash table, we can reduce the number of hash references needed to find the genome positions of subsequences with 1 or more mismatches.
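Each equivalence class consists of a center word and its 3l 1-mismatch subsequences, so the 4^{l} subsequences of length l can be divided evenly among the classes only if 3l + 1 divides 4^{l}. This shows that 3l + 1 must be a power of 4. For example, l = 5 and l = 21 satisfy the condition.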
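As a quick check of this condition, the sketch below lists the lengths l for which 3l + 1 is a power of 4. It relies on the standard fact that a 4-ary Hamming code with r check digits has length (4^{r} – 1)/3; the function name `valid_lengths` is illustrative.

```python
def valid_lengths(max_l):
    """Lengths l <= max_l for which 3l + 1 is a power of 4."""
    lengths = []
    power = 4
    while (power - 1) // 3 <= max_l:
        lengths.append((power - 1) // 3)  # 4^r - 1 is always divisible by 3
        power *= 4
    return lengths

print(valid_lengths(100))  # [1, 5, 21, 85]
```

The length l = 1 is a degenerate case; l = 5 and l = 21 are the lengths used in this paper.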
It is not clear that this condition is sufficient for constructing equivalence classes. Even if it is, two problems remain: how to construct the equivalence classes, and how to calculate the center word from a given subsequence. Perfect Hamming codes provide solutions to both these problems.
Perfect Hamming code
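A perfect code satisfies the Hamming bound with equality,
$$|C| = \frac{q^{n}}{\sum_{i=0}^{t} \binom{n}{i} (q-1)^{i}},$$
where C is a q-ary block code of length n, d is the minimum Hamming distance between code words, and $t = \lfloor (d-1)/2 \rfloor$. In the PHC method, every received word is classified as either a code word or a code word with a 1-digit error. In other words, every word is decoded to a code word whose Hamming distance from it is 0 or 1.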
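Addition and multiplication on GF(2^{2})

| + | 0 | 1 | α | α^{2} |
|---|---|---|---|---|
| 0 | 0 | 1 | α | α^{2} |
| 1 | 1 | 0 | α^{2} | α |
| α | α | α^{2} | 0 | 1 |
| α^{2} | α^{2} | α | 1 | 0 |

| × | 0 | 1 | α | α^{2} |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | α | α^{2} |
| α | 0 | α | α^{2} | 1 |
| α^{2} | 0 | α^{2} | 1 | α |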
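The two operation tables can be realized with simple integer operations. The sketch below assumes that the elements 0, 1, α, and α^{2} are encoded as the integers 0, 1, 2, and 3 (two bits each, using α^{2} = α + 1); this encoding and the function names are illustrative choices, not taken from the paper.

```python
# Elements 0, 1, alpha, alpha^2 are encoded as the integers 0, 1, 2, 3.

def gf4_add(a, b):
    """Addition on GF(2^2) is a bitwise XOR of the two-bit encodings."""
    return a ^ b

def gf4_mul(a, b):
    """Multiplication on GF(2^2): nonzero elements are the powers of alpha."""
    if a == 0 or b == 0:
        return 0
    log = {1: 0, 2: 1, 3: 2}   # discrete logarithm to the base alpha
    exp = (1, 2, 3)            # alpha^0, alpha^1, alpha^2
    return exp[(log[a] + log[b]) % 3]

NAMES = ("0", "1", "a", "a^2")
for a in range(4):
    print(NAMES[a],
          [NAMES[gf4_add(a, b)] for b in range(4)],
          [NAMES[gf4_mul(a, b)] for b in range(4)])
```

Because every element is its own additive inverse, subtraction is the same XOR operation as addition.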
The code word is calculated from a received word as follows.
1. Calculate the syndrome s.
2. If the syndrome s is zero, then the received word is the code word.
3. Otherwise, find a column c of the parity-check matrix such that the syndrome is a constant factor t times that column.
4. Subtract t from the c-th digit of the received word; the result is the code word.
The code word of z is (ααα00).
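The four decoding steps can be sketched in Python as follows. The parity-check matrix `H` below is an assumption: it is one valid parity-check matrix of a 4-ary (5,3) Hamming code, chosen so that (ααα00) is a code word, and it is not necessarily the matrix used by the authors. Elements 0, 1, α, α^{2} are encoded as 0, 1, 2, 3, and the function names are illustrative.

```python
# GF(2^2) elements 0, 1, alpha, alpha^2 are encoded as the integers 0, 1, 2, 3.

def gf_mul(a, b):
    """Multiply two GF(2^2) elements; nonzero elements are powers of alpha."""
    if a == 0 or b == 0:
        return 0
    return (1, 2, 3)[((a - 1) + (b - 1)) % 3]

# ASSUMPTION: a parity-check matrix of a 4-ary (5,3) Hamming code (not from the paper).
H = ((1, 0, 1, 1, 1),
     (0, 1, 1, 2, 3))

def syndrome(word):
    s0 = s1 = 0
    for j, x in enumerate(word):
        s0 ^= gf_mul(H[0][j], x)   # addition on GF(2^2) is XOR
        s1 ^= gf_mul(H[1][j], x)
    return (s0, s1)

def code_word(word):
    """Correct at most one erroneous digit of a received word."""
    s = syndrome(word)                       # step 1
    if s == (0, 0):                          # step 2
        return tuple(word)
    for c in range(5):                       # step 3: find column c and factor t
        for t in (1, 2, 3):
            if (gf_mul(t, H[0][c]), gf_mul(t, H[1][c])) == s:
                fixed = list(word)
                fixed[c] ^= t                # step 4: subtraction equals XOR here
                return tuple(fixed)
    raise ValueError("unreachable: every nonzero syndrome matches one column")

print(code_word((2, 2, 2, 3, 0)))  # (2, 2, 2, 0, 0)
```

With this assumed matrix, the word (αααα^{2}0) is corrected in its fourth digit and decoded to (ααα00).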
The (n, k) Hamming codes are composed of information digits and check digits. The information digits are k arbitrary digits of the code and the other n – k digits are the check digits. The generator matrix G can reproduce the check digits from the information digits. Therefore, code words can be uniquely represented by the k information digits. In this paper, we call these representations short codes.
PHC and DNA subsequence
DNA sequences are composed of four nucleotides: adenine, cytosine, guanine, and thymine. Let these correspond one-to-one to the elements of the Galois field GF(2^{2}). Then, DNA sequences correspond to words on the Galois field. Without loss of generality, let (A, C, G, T) correspond to (0, 1, α, α^{2}). The sequence “GGGTA” is expressed as the word (αααα^{2}0), and the word (10000) represents the DNA sequence “CAAAA”.
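The correspondence can be written down directly. In the sketch below, the elements 0, 1, α, α^{2} are encoded as the integers 0, 1, 2, 3; this integer encoding, the two-bit packing, and the function names are illustrative assumptions.

```python
NUCLEOTIDES = "ACGT"   # string index = GF(2^2) element: A=0, C=1, G=alpha, T=alpha^2

def dna_to_word(seq):
    """Convert a DNA string into a word on GF(2^2)."""
    return tuple(NUCLEOTIDES.index(base) for base in seq)

def word_to_dna(word):
    """Convert a word on GF(2^2) back into a DNA string."""
    return "".join(NUCLEOTIDES[x] for x in word)

def pack(word):
    """Pack a word into one integer, two bits per nucleotide."""
    value = 0
    for x in word:
        value = (value << 2) | x
    return value

print(dna_to_word("GGGTA"))          # (2, 2, 2, 3, 0)
print(word_to_dna((1, 0, 0, 0, 0)))  # CAAAA
```

Packing two bits per nucleotide lets a 21-base subsequence fit into a single 64-bit integer.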
This correspondence and the PHC enable us to build the equivalence classes described in Section Idea. Each equivalence class is composed of a DNA subsequence that corresponds to a PHC code word and the DNA subsequences whose corresponding words are error-corrected to that code word. Figure 2 shows an equivalence class. The DNA subsequence “AAAAA” corresponds to the word (00000) on GF(2^{2}) and is a code word of the 4-ary (5,3)-PHC. From the properties of the PHC, all the words whose Hamming distance from the code word (00000) is 1 are error-corrected to the code word, and they are the adjacent nodes of “AAAAA”. Additional File 1 shows the correspondence table of 5-mer subsequences and code words of the 4-ary (5,3)-PHC. In the following, we regard DNA subsequences and their corresponding words on the Galois field as equivalent (i.e., aliases).
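The equivalence classes can be enumerated by decoding every 5-mer. The sketch below assumes the parity-check matrix `H` shown in the code, which is one valid choice for a 4-ary (5,3) Hamming code and not necessarily the authors' matrix, so class membership of individual 5-mers may differ from Additional File 1. The class of “AAAAA” is the same for any linear code.

```python
from collections import defaultdict
from itertools import product

def gf_mul(a, b):
    """Multiply GF(2^2) elements encoded as 0, 1, 2 (= alpha), 3 (= alpha^2)."""
    if a == 0 or b == 0:
        return 0
    return (1, 2, 3)[((a - 1) + (b - 1)) % 3]

# ASSUMPTION: a parity-check matrix of a 4-ary (5,3) Hamming code (not from the paper).
H = ((1, 0, 1, 1, 1),
     (0, 1, 1, 2, 3))

# Syndrome of every single-digit error t at digit j, mapped back to (j, t).
ERR = {}
for j in range(5):
    for t in (1, 2, 3):
        ERR[(gf_mul(t, H[0][j]), gf_mul(t, H[1][j]))] = (j, t)

def code_word(seq):
    """Decode a 5-mer (A, C, G, T = 0, 1, alpha, alpha^2) to its code word."""
    w = ["ACGT".index(b) for b in seq]
    s0 = s1 = 0
    for j, x in enumerate(w):
        s0 ^= gf_mul(H[0][j], x)
        s1 ^= gf_mul(H[1][j], x)
    if (s0, s1) != (0, 0):
        j, t = ERR[(s0, s1)]
        w[j] ^= t
    return "".join("ACGT"[x] for x in w)

classes = defaultdict(list)
for bases in product("ACGT", repeat=5):
    seq = "".join(bases)
    classes[code_word(seq)].append(seq)

print(len(classes), len(classes["AAAAA"]))  # 64 16
```

The 1024 subsequences of length 5 fall into 64 classes of 16 members each, and the class of “AAAAA” consists of “AAAAA” and its 15 1-mismatch subsequences.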
Algorithms
Preparing the hash table
There are two ways to construct hash tables for mapping short DNA sequences onto a genome. One uses subsequences of the genome as hash keys to store their genome positions in a hash table. The other uses subsequences of the short DNA sequences as hash keys. Because both of these use DNA subsequences as hash keys, our method can be applied to either. In the following, we use the former in the explanation.
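A minimal sketch of the former construction is given below: every length-5 subsequence of the genome is stored once, under the code word of its equivalence class. The entry layout (position together with the original subsequence), the function names, and the parity-check matrix `H` are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict

def gf_mul(a, b):
    """Multiply GF(2^2) elements encoded as 0, 1, 2 (= alpha), 3 (= alpha^2)."""
    if a == 0 or b == 0:
        return 0
    return (1, 2, 3)[((a - 1) + (b - 1)) % 3]

# ASSUMPTION: a parity-check matrix of a 4-ary (5,3) Hamming code (not from the paper).
H = ((1, 0, 1, 1, 1),
     (0, 1, 1, 2, 3))

ERR = {}
for j in range(5):
    for t in (1, 2, 3):
        ERR[(gf_mul(t, H[0][j]), gf_mul(t, H[1][j]))] = (j, t)

def code_word(seq):
    """Decode a 5-mer to the code word of its equivalence class."""
    w = ["ACGT".index(b) for b in seq]
    s0 = s1 = 0
    for j, x in enumerate(w):
        s0 ^= gf_mul(H[0][j], x)
        s1 ^= gf_mul(H[1][j], x)
    if (s0, s1) != (0, 0):
        j, t = ERR[(s0, s1)]
        w[j] ^= t
    return "".join("ACGT"[x] for x in w)

def build_table(genome, l=5):
    """Store each l-mer of the genome once, keyed by its code word."""
    table = defaultdict(list)
    for pos in range(len(genome) - l + 1):
        sub = genome[pos:pos + l]
        table[code_word(sub)].append((pos, sub))
    return table

def exact_positions(table, query):
    """Genome positions whose l-mer equals the query, using one hash reference."""
    return [pos for pos, sub in table.get(code_word(query), []) if sub == query]

table = build_table("AAAAACAAAAA")   # hypothetical toy genome
print(len(table), exact_positions(table, "AAAAA"))  # 1 [0, 6]
```

In this toy example all seven 5-mers are within one mismatch of “AAAAA”, so they share a single hash key while each position is still stored only once.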
Searching for n-mismatches
In this section we describe how to find the genome positions of the 1- and 2-mismatch subsequences of a given subsequence s, using the hash table prepared as described in Section Preparing the hash table. The efficiency of the method is also described.
Table 2. Summary of our methods for lengths 5, 21, and 10 to refer to 1- and 2-mismatch and 1- and 2-gap sequences

| length | condition | #keys | #words | ratio | f(s, K) when s = c(s) | f(s, K) when s ≠ c(s) |
|---|---|---|---|---|---|---|
| 5 | 1-mismatch | 6.625 | 16 | 41.4% | 1 + 15x | 1 + 15x + 42x^{2} + 54x^{3} |
| 5 | 2-mismatches | 27.25 | 106 | 25.7% | 1 + 15x + 90x^{2} + 210x^{3} + 180x^{4} | 1 + 15x + 90x^{2} + 170x^{3} + 156x^{4} |
| 5 | 1-gap | 3.25 | 4 | 81.3% | 4 + 12x | 4 + 60x |
| 5 | 2-gaps | 10 | 16 | 62.5% | 16 + 36x + 108x^{2} | – ∗^{1} |
| 21 | 1-mismatch | 30.53 | 64 | 47.7% | 1 + 63x | 1 + 63x + 210x^{2} + 1710x^{3} |
| 21 | 2-mismatches | 611.31 | 1954 | 31.3% | 1 + 63x + 1890x^{2} + 4410x^{3} + 34020x^{4} | 1 + 63x + 1890x^{2} + 5650x^{3} + 31500x^{4} |
| 21 | 1-gap | 3.81 | 4 | 95.3% | 4 + 60x | 4 + 252x |
| 21 | 2-gaps | 13.87 | 16 | 86.7% | 16 + 84x + 540x^{2} | 16 + 48x + 960x^{2} |
| 10: Serialize | 1-mismatch | 12.25 | 31 | 39.5% | 1 + 30x + 225x^{2} | 1 + 30x + 170x^{2} + 538x^{3} + 1089x^{4} + 1620x^{5} ∗^{2} |
| 10: Parallelize | 1-mismatch | 13.25 | 31 | 44.1% | 1 + 30x | 1 + 30x + 84x^{2} + 108x^{3} ∗^{3} |
First, we analyze K_{1}(s) for s of length 5. The subsequence s is classified into two cases according to whether s is a code word of the PHC.
Case 1: s is a code word. As s is a code word, all the 1-mismatch words of s are decoded to s by the PHC. That is to say, they belong to the same equivalence class E(s). Therefore, s is used as a hash key and it can refer to all the 1-mismatch subsequences.
Case 2: s is not a code word. Let i be the digit at which s differs from its code word c(s). The three 1-mismatch words that differ from s at the i-th digit, one of which is c(s) itself, belong to the same equivalence class as s, so the key c(s) refers to them. The remaining 12 words belong to six equivalence classes. Assume that a word t differs from s at the j-th digit (j ≠ i). Then t is not a code word, because d_{H}(t, c(s)) = 2 and the distance between code words must be at least 3. The code word c(t) and t differ at the k-th digit, where k ≠ i, j. There is a word u that differs from s at the k-th digit and from c(t) at the j-th digit. Because t and u both belong to the equivalence class of c(t), using c(t) as a hash key refers to the two words t and u. Finally, the number of keys in K_{1}(s) needed to refer to all the 1-mismatch subsequences is seven.
Because the proportions of Case 1 and Case 2 are 1/16 and 15/16, respectively, the expected number of keys in K_{1}(s) is 6.625 (= 1 × 1/16 + 7 × 15/16).
Next, we show an algorithm to calculate the set of hash keys K_{1}(s).
Input: s
 1. K_{1}(s) ← {c(s)}
 2. if s = c(s), return K_{1}(s)
 3. for each t in N_{1}(s), add c(t) to K_{1}(s)
 4. return K_{1}(s)
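The algorithm can be sketched in runnable form for length 5 as follows. Two assumptions are made: N_{1}(s) is taken to be the set of words at Hamming distance exactly 1 from s, and the parity-check matrix `H` is one valid choice for a 4-ary (5,3) Hamming code rather than the authors' matrix. The key counts do not depend on which perfect code is chosen.

```python
from itertools import product

def gf_mul(a, b):
    """Multiply GF(2^2) elements encoded as 0, 1, 2 (= alpha), 3 (= alpha^2)."""
    if a == 0 or b == 0:
        return 0
    return (1, 2, 3)[((a - 1) + (b - 1)) % 3]

# ASSUMPTION: a parity-check matrix of a 4-ary (5,3) Hamming code (not from the paper).
H = ((1, 0, 1, 1, 1),
     (0, 1, 1, 2, 3))

ERR = {}
for j in range(5):
    for t in (1, 2, 3):
        ERR[(gf_mul(t, H[0][j]), gf_mul(t, H[1][j]))] = (j, t)

def code_word(seq):
    """c(s): decode a 5-mer to the code word of its equivalence class."""
    w = ["ACGT".index(b) for b in seq]
    s0 = s1 = 0
    for j, x in enumerate(w):
        s0 ^= gf_mul(H[0][j], x)
        s1 ^= gf_mul(H[1][j], x)
    if (s0, s1) != (0, 0):
        j, t = ERR[(s0, s1)]
        w[j] ^= t
    return "".join("ACGT"[x] for x in w)

def neighbours_1(seq):
    """N_1(s): the words at Hamming distance exactly 1 from seq."""
    for i, base in enumerate(seq):
        for other in "ACGT":
            if other != base:
                yield seq[:i] + other + seq[i + 1:]

def keys_1(seq):
    """K_1(s): hash keys that reach s and all of its 1-mismatch words."""
    keys = {code_word(seq)}                # step 1
    if code_word(seq) == seq:              # step 2: s is itself a code word
        return keys
    for t in neighbours_1(seq):            # step 3
        keys.add(code_word(t))
    return keys                            # step 4

sizes = [len(keys_1("".join(b))) for b in product("ACGT", repeat=5)]
print(sum(sizes) / len(sizes))  # 6.625
```

Averaged over all 1024 subsequences of length 5, the number of keys is 6.625, in agreement with the analysis of Case 1 and Case 2.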
The algorithm calculates code words 16 times when s is not a code word, whereas the number of hash keys is seven, so there appears to be redundancy. There are various ways of reducing the computation time of the code words, but it is better not to use complicated algorithms. The calculation of code words is fast because the operations + and × on GF(2^{2}) can be calculated as binary operations.
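To evaluate a set of hash keys, we use the reference formula
$$f(s, K) = \sum_{b} a_{b} x^{b},$$
where s is a subsequence, K is a set of hash keys, and the coefficient a_{b} represents the number of subsequences that are referred to by K and whose distance from s is b.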
The reference formula shows that the proposed method also searches many 2- and 3-mismatch sequences. We discuss this feature in Section Discussion.
The above algorithm and analysis can be applied to the word length 21. The numbers of hash keys are 1 and 31 for Case 1 and Case 2, respectively. Using the rates of occurrence of Case 1 and Case 2, 1/64 and 63/64, respectively, the expected number of hash keys is 30.53. The reference formulas are shown in Table 2.
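The expected numbers of keys for both lengths follow from the same counting argument: the key c(s) covers s and the three words that differ from s at the digit where s differs from c(s), and each further key covers two of the remaining 1-mismatch words. The sketch below evaluates this with exact fractions; the function name is illustrative.

```python
from fractions import Fraction

def expected_keys_1_mismatch(l):
    """Expected size of K_1(s) for a perfect 4-ary Hamming code of length l."""
    class_size = 3 * l + 1                  # a code word plus its 3l neighbours
    p_code = Fraction(1, class_size)        # Case 1: s is a code word
    keys_case2 = 1 + (3 * l - 3) // 2       # c(s) plus one key per pair of remaining words
    return p_code * 1 + (1 - p_code) * keys_case2

print(float(expected_keys_1_mismatch(5)))   # 6.625
print(float(expected_keys_1_mismatch(21)))  # 30.53125
```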
Type 1: 2-mismatch subsequences belonging to this type are neighbors of c(s).
Type 2: for some t such that d_{H}(t, s) = 1 and c(t) ≠ c(s), the 2-mismatch subsequences are neighbors of t belonging to c(t).
Type 3: for some t such that d_{H}(t, s) = 1 and c(t) ≠ c(s), they are neighbors of t not belonging to c(t).
Type 4: for some u such that d_{H}(u, s) = 1 and c(u) = c(s), they are neighbors of u not belonging to c(s).
Searching for n-gaps
When aligning a DNA subsequence to a genome, there are three types of gaps: gaps in the short DNA sequence, gaps in the genome sequence, and gaps in both. Our method can reduce the number of hash keys needed to refer to gaps in short DNA sequences. Given a subsequence s with gaps, the hash keys used to refer to the genome positions are the code words of the subsequences obtained by substituting nucleotides for the gaps of s. The expected numbers of hash keys are 3.25 and 3.81 when the length of a subsequence with one gap is 5 and 21, respectively.
Let s be a subsequence with one gap, such as “AA-AA”, and S be the set of subsequences in which the gap in s is substituted with a nucleotide, in this case the set comprising “AAAAA”, “AACAA”, “AAGAA”, and “AATAA”. To find the genome positions that match s, we need to refer to the entries that correspond to S.
When a subsequence t ∈ S is a code word, all the subsequences in S belong to the same equivalence class. Therefore, by using c(t) as a hash key, all the entries that correspond to the gapped subsequence s can be referred to. In this case, the reference formula becomes 4 + 12x, where the exponent of x represents the minimum distance from the four subsequences in S.
If no t ∈ S is a code word, the four subsequences belong to different equivalence classes. Therefore, four hash keys are required and the reference formula is 4 + 60x. The proportion for which one member of S is a code word is |S| / (number of words in an equivalence class) = 4/16 = 1/4, and so the expected number of hash keys is 1 × 1/4 + 4 × 3/4 = 3.25.
In the same way, when the length is 21, the expected number of hash keys is 1 × 4/64 + 4 × 60/64 ≈ 3.81. The reference formulas are shown in Table 2.
Next, we consider a subsequence with two gaps. Let s be a subsequence of length 5 with two gaps, such as “A-A-A”, and S be the set of 16 sequences in which the gaps in s are replaced with nucleotides. The number of information digits in the (5,3)-PHC is 3, and so one of the 16 words in S is a code word t. The code word t is the only code word in S because the maximum Hamming distance among words in S is 2, which equals the number of gaps, and the minimum Hamming distance between code words is 3. The equivalence class E(t) includes seven subsequences of S; let U be the remaining subsequences (U = S – N_{1}(t)). Because u ∈ U is not a code word and d_{H}(u, t) = 2, the distance between t and the code word c(u) is 3 (d_{H}(c(u), t) = 3). This implies that c(u) and t differ at a non-gapped position in t. Therefore, for all u, v ∈ U with u ≠ v, d_{H}(c(u), v) = d_{H}(u, v) + 1 ≥ 2. Thus, all the subsequences in U belong to different equivalence classes, and nine (= |U|) hash keys are required to refer to them.
For example, if s = “A-A-A”, S includes the code word t = “AAAAA”, and t can refer to 7 sequences: “AAAAA”, “ACAAA”, “AGAAA”, “ATAAA”, “AAACA”, “AAAGA”, and “AAATA”. The set of remaining subsequences is U = {“ACACA”, “ACAGA”, “ACATA”, “AGACA”, “AGAGA”, “AGATA”, “ATACA”, “ATAGA”, “ATATA”}. Let u = “ACACA” ∈ U. The code word c(u) is “ACACC”, and its Hamming distances from the other members of U are two or three. Therefore, all subsequences in U belong to different equivalence classes.
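The key counts for gapped subsequences of length 5 can be checked by brute force. In the sketch below a gap is written as “-”; the parity-check matrix `H` is an assumed valid choice for a 4-ary (5,3) Hamming code, and the function names are illustrative.

```python
from itertools import product

def gf_mul(a, b):
    """Multiply GF(2^2) elements encoded as 0, 1, 2 (= alpha), 3 (= alpha^2)."""
    if a == 0 or b == 0:
        return 0
    return (1, 2, 3)[((a - 1) + (b - 1)) % 3]

# ASSUMPTION: a parity-check matrix of a 4-ary (5,3) Hamming code (not from the paper).
H = ((1, 0, 1, 1, 1),
     (0, 1, 1, 2, 3))

ERR = {}
for j in range(5):
    for t in (1, 2, 3):
        ERR[(gf_mul(t, H[0][j]), gf_mul(t, H[1][j]))] = (j, t)

def code_word(seq):
    """Decode a 5-mer to the code word of its equivalence class."""
    w = ["ACGT".index(b) for b in seq]
    s0 = s1 = 0
    for j, x in enumerate(w):
        s0 ^= gf_mul(H[0][j], x)
        s1 ^= gf_mul(H[1][j], x)
    if (s0, s1) != (0, 0):
        j, t = ERR[(s0, s1)]
        w[j] ^= t
    return "".join("ACGT"[x] for x in w)

def fill_gaps(pattern):
    """S: all sequences obtained by replacing each '-' with a nucleotide."""
    slots = [i for i, ch in enumerate(pattern) if ch == "-"]
    filled = []
    for bases in product("ACGT", repeat=len(slots)):
        seq = list(pattern)
        for i, b in zip(slots, bases):
            seq[i] = b
        filled.append("".join(seq))
    return filled

def gap_keys(pattern):
    """Hash keys needed to refer to every member of S."""
    return {code_word(t) for t in fill_gaps(pattern)}

def mean_keys_one_gap():
    """Average number of keys over all length-5 patterns with one gap."""
    counts = []
    for pos in range(5):
        for rest in product("ACGT", repeat=4):
            rest = "".join(rest)
            counts.append(len(gap_keys(rest[:pos] + "-" + rest[pos:])))
    return sum(counts) / len(counts)

print(len(gap_keys("AA-AA")), len(gap_keys("A-A-A")))  # 1 10
print(mean_keys_one_gap())                             # 3.25
```

The two-gap pattern needs ten keys: the code word “AAAAA” for seven members of S and one key for each of the nine remaining members.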
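Length of subsequence
For a subsequence s of length 10, regard s as the concatenation s_{1}s_{2} of two subsequences of length 5, and classify it by whether each half is a code word of the (5,3)-PHC.
Case 1: s_{1} and s_{2} are both code words.
Case 2: s_{1} is a code word, but s_{2} is not.
Case 3: s_{2} is a code word, but s_{1} is not.
Case 4: neither s_{1} nor s_{2} is a code word.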
In Case 1, use s_{1}s_{2} as a hash key; this can refer to all the 1-mismatches. The reference formula in this case is the square of the reference formula for length 5, (1 + 15x)(1 + 15x) = 1 + 30x + 225x^{2}.
and the reference formula is the same as that of Case 2.
The proportions of Cases 1 through 4 are 1/256, 15/256, 15/256, and 225/256, respectively, and so the expected number of hash keys is 12.25.
where each set corresponds to one of the two hash tables. The expected number of keys is 13.25.
Let us consider the four cases, which are the same as those for serialization. In Case 1, use s_{1}s_{2} as the hash key for the two hash tables, and all the 1-mismatch entries are referred to. In this case, the reference formula is (1 + 15x) + (1 + 15x) – 1 = 1 + 30x. As the subsequence s_{1}s_{2} is stored in both hash tables, one is subtracted in the formula. Though the number of hash keys appears to be one, the key is used twice. Thus the number of hash keys K_{1}(s) is two.
The proportions of the cases are the same as for serialization, and the expected number of hash keys is 13.25.
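Both expectations can be reproduced from the key counts of the two halves. The per-case combination rules in the sketch below are assumptions inferred from the stated totals: for serialization the two halves share the key c(s_{1})c(s_{2}), giving |K_{1}(s_{1})| + |K_{1}(s_{2})| – 1 keys, and for parallelization each hash table is referred to separately, giving |K_{1}(s_{1})| + |K_{1}(s_{2})| keys.

```python
from fractions import Fraction

P_CODE = Fraction(1, 16)      # probability that a 5-mer is a code word
KEYS = {True: 1, False: 7}    # |K_1| for a 5-mer that is / is not a code word

def expected_keys(combine):
    """Average the combined key count over the four cases."""
    total = Fraction(0)
    for s1_is_code in (True, False):
        for s2_is_code in (True, False):
            p1 = P_CODE if s1_is_code else 1 - P_CODE
            p2 = P_CODE if s2_is_code else 1 - P_CODE
            total += p1 * p2 * combine(KEYS[s1_is_code], KEYS[s2_is_code])
    return total

# ASSUMPTION: combination rules inferred from the expectations stated in the text.
serialized = expected_keys(lambda k1, k2: k1 + k2 - 1)
parallelized = expected_keys(lambda k1, k2: k1 + k2)

print(float(serialized), float(parallelized))  # 12.25 13.25
```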
Discussion
As Table 2 shows, when our method searches the genome positions of n-mismatches and n-gaps, it also searches some positions of (n+α)-mismatches. For example, for l = 21 our method searches not only the 1 perfect match and the 63 1-mismatch subsequences, but also 210 2-mismatches and 1710 3-mismatches as by-products: 11.1% (= 210/1890) and 4.8% (= 1710/35910) of all 2- and 3-mismatches, respectively. These proportions increase to 46.7% (= 42/90) and 20% (= 54/270) when l = 5. These by-products are, in fact, useful. One reason for the current low mapping ratios of short reads from DNA sequencers to a genome is the small number of mismatches and gaps that the employed mapping method can find. Therefore, increasing the numbers of mismatches and gaps found will contribute to increasing the mapping ratio and to subsequent biological analyses, even if the method is probabilistic.
The increasing demand to map massive amounts of short DNA sequences to genomes is inevitable. Because the number of short sequences is enormous, it is difficult to guarantee finding all genome positions of 1-mismatches in a practical computation time. Therefore, faster methods are required, and the proposed method is a step in that direction. We have shown that the proposed method can reduce the number of keys necessary to find the genome positions of n-mismatches. The main idea behind the method is to classify the subsequences into equivalence classes using the PHC. Because equivalence classes contain multiple subsequences, our method can increase the density of the hash table over that of the usual method. That is to say, our method can use longer subsequences.
For example, the human genome is about 3G bases long. When it is stored with subsequences of length 21 in the usual way, the density of the hash table is 3 × 10^{9}/4^{21} ≈ 0.07%. On the other hand, with the hash table of our proposed method using the (21,18)-PHC, the density is 3 × 10^{9}/[# of equivalence classes] = 3 × 10^{9}/4^{18} ≈ 4.4%. That is to say, our method can use longer subsequences. The efficiency of genome-mapping programs is sensitive to the length of the subsequence, and the longer the better for a given density of the hash table. Therefore, the proposed method has an advantage from this point of view.
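The density comparison is a simple ratio of the number of stored positions to the number of possible hash keys. The sketch below takes the genome size as 3 × 10^{9} bases (about 3G); the constant and function names are illustrative.

```python
GENOME_SIZE = 3 * 10**9   # assumed size of the human genome in bases (about 3G)

def density(num_keys):
    """Fraction of hash keys occupied if every genome position used a distinct key."""
    return GENOME_SIZE / num_keys

plain = density(4**21)    # usual method: one key per 21-mer
phc = density(4**18)      # (21,18)-PHC: one key per equivalence class

print(f"{plain:.3%} {phc:.2%}")
```

The ratio of the two densities is 4^{3} = 64, the number of subsequences in one equivalence class of the (21,18)-PHC.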
In describing the effectiveness of the proposed method, we assume that the computation time for code words is far shorter than that of a hash reference. In practice, the calculation of the syndrome using the parity-check matrix of the Hamming code takes very little time, even on GF(2^{2}), and so it is easy to calculate the code word from a subsequence. Also, the calculation is small enough to be executed within a CPU cache. On the other hand, the hash table is larger than the CPU caches, and some hash tables exceed the size of the memory, because the number of entries is almost equal to the length of the target genome. The hash reference is therefore clearly slower than the calculation of the code word, and the advantage of reducing the hash references exceeds the disadvantage of the additional work needed to calculate code words. With these advantages, our method will help to implement faster genome mapping programs.
Conclusions
This paper shows that a perfect Hamming code can reduce the number of hash references for hash-based genome mapping. The method encodes subsequences to code words of a perfect Hamming code on GF(2^{2}) and uses them as hash keys. It can reduce by about 70% the number of hash keys necessary for searching the genome positions of all 2-mismatches of a 21-base-long DNA subsequence. As the amount of data that DNA sequencers generate continues to increase and more accurate genome mappings are required, our method will help to develop faster genome mapping software.
Declarations
Acknowledgements
This work was partially supported by KAKENHI (22680023) and (22310125).
This article has been published as part of BMC Genomics Volume 12 Supplement 3, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S3.
References
 1. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA. 1988, 85: 2444-2448. 10.1073/pnas.85.8.2444.
 2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol. 1990, 215: 403-10.
 3. Burrows M, Wheeler DJ: A block-sorting lossless data compression algorithm. Palo Alto, CA, Digital Equipment Corporation. 1994, Technical report 124.
 4. Li H, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research. 2008, 18 (11): 1851-1858. 10.1101/gr.078212.108.
 5. Li R, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics. 2008, 24 (5): 713-714. 10.1093/bioinformatics/btn025.
 6. Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler EE, Sahinalp SC: mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature Methods. 2010, 7: 576-577. 10.1038/nmeth0810-576.
 7. Abouelhoda MI, Kurtz S, Ohlebusch E: The enhanced suffix array and its applications to genome analysis. Lecture Notes in Computer Science. 2002, 2542: 449-463.
 8. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10 (3): R25. 10.1186/gb-2009-10-3-r25.
 9. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.
 10. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25 (15): 1966-1967. 10.1093/bioinformatics/btp336.
 11. Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hackermüller J: Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol. 2009, 5 (11): e1000502.
 12. Encyclopedic Dictionary of Mathematics, second edition. Edited by: Ito K. 1996, MIT Press.
 13. Takenaka Y, Matsuda H: Frequency enumeration of DNA subsequences from large-scale sequences using linear code. Proceedings of the 11th Intl. Conf. ISMB. 2003, B59.
 14. Takenaka Y, Sakata M, Matsuda H: Nucleotide encoding according to perfect linear code and its application to multiple alignment. Information Processing Society of Japan, Special Interest Groups Technical Report. 2006, BIO5: 103-109.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.