- Proceedings
- Open Access
Similarity evaluation of DNA sequences based on frequent patterns and entropy
- Xiaojing Xie^{1},
- Jihong Guan^{2} and
- Shuigeng Zhou^{1}Email author
https://doi.org/10.1186/1471-2164-16-S3-S5
© Xie et al.; licensee BioMed Central Ltd. 2015
- Published: 29 January 2015
Abstract
Background
DNA sequence analysis is an important research topic in bioinformatics. Evaluating the similarity between sequences, which is crucial for sequence analysis, has attracted much research effort in the last two decades, and a dozen of algorithms and tools have been developed. These methods are based on alignment, word frequency and geometric representation respectively, each of which has its advantage and disadvantage.
Results
In this paper, for effectively computing the similarity between DNA sequences, we introduce a novel method based on frequency patterns and entropy to construct representative vectors of DNA sequences. Experiments are conducted to evaluate the proposed method, which is compared with two recently-developed alignment-free methods and the BLASTN tool. When testing on the β-globin genes of 11 species and using the results from MEGA as the baseline, our method achieves higher correlation coefficients than the two alignment-free methods and the BLASTN tool.
Conclusions
Our method is not only able to capture fine-granularity information (location and ordering) of DNA sequences via sequence blocking, but also insensitive to noise and sequence rearrangement due to considering only the maximal frequent patterns. It outperforms major existing methods or tools.
Keywords
- DNA sequence comparison
- Similarity analysis
- Frequent pattern mining
- Entropy
Background
The rapid development of DNA sequencing technologies has led to a huge number of DNA sequences. It is possible now to obtain large amounts of individual genomes sequenced in one week with less than US$10,000 [1], using the high-throughput sequencing technologies, such as single-molecule sequencing [2] and next-generation sequencing (NGS) [3]. Consequently, DNA sequence analysis faces serious computational challenges due to the huge amounts of data.
Similarity evaluation between sequences is a crucial starting point for analyzing genomic sequences and has a wide range of applications. One important application is to discover the evolutionary relationship between species. This is based on the assumption that two species having similar sequences are close in evolutionary relationship. Another popular application is to search similar sequences in databases. The databases may be huge in size, hence an effective and efficient method for defining and computing the similarity between sequences is badly in need. In addition, there are many reference-based tools and algorithms for sequence compression, such as GReEn [4], RLZ [5] and so on. For these algorithms, the choice of reference sequence has a significant impact on both compression ratio and compression time. Therefore, a preprocessing step that assesses the similarity between sequences and then selects the most similar one with the others as the reference is critical to compression performance.
Due to the importance of sequence similarity analysis, a dozen of algorithms have already been developed. These algorithms can be roughly classed into two categories. The first category is based on sequence alignment, which has been reviewed in [6]. Sequence alignment is powerful for comparison between two related genomes; BLAST [7], FASTA [8] and MEGA [9] are typical sequence alignment tools. However, sequence alignment depends on the orderings of the nucleotides and may be computationally prohibitive. And for comparing long sequences, alignment-based methods take too much time. For example, CUDAlign 3.0[10], one of the state-of-the-art parallel alignment methods, spends more than 5,000 seconds when comparing two sequences of size ~50 MB using 16 GPUs. Fortunately, alignment is not must for similarity analysis.
The second category is alignment-free methods, some of which are based on word (k-mer) frequency. The frequencies of words within a DNA sequence are calculated and compared between DNA sequences using statistical distances. A review on these algorithms can refer to [11]. Blaisdell [12] introduced the first word frequency based method and used the Euclidean distance for assessing sequence similarity. Wu et al. [13] proposed methods based on Kullback-Leibler discrepancy between frequencies of words, Mahalanobis distances and standardized Euclidean distances under Markov chain models of base composition.
Some other alignment-free methods use geometric representations for DNA sequences. With these methods, sequences are transformed to 2D [14], 3D [15], 4D [16] and 5D [17] spaces and so on. Such graphical representation techniques provide a way to visually measure similarity and dissimilarity between DNA sequences. However, they are time-consuming when comparing long sequences. Recently, there are also methods based on entropy. Li et al. [18] introduced the weighted pseudo-entropy, which is used for constructing representative vectors of DNA sequences. And Zhang et al. [19] converted DNA sequences into time sequences and then used the approximate entropy [20].
In this paper, we develop a novel method based on Frequent sequential Patterns and Entropy (FPE in short) to represent DNA sequences. Concretely, each sequence is first divided into blocks of the same length. Then, a modified PrefixSpan [21] algorithm is used to discover the maximal frequent patterns in each block. Finally, with the probabilities of these patterns, the entropy of each block is calculated. The resulting entropies of the blocks constitute the components of the vector of the sequence. Note that sequences are usually different in size, hence their vectors may have different dimensions.
We conduct extensive experiments to evaluate the proposed method on the β-globin genes of 11 species. By evaluating the correlation coefficient between the calculated similarities and the results from MEGA [9], the proposed method FPE achieves higher correlation coefficients than two recently-developed alignment-free methods [14, 18] and BLASTN [7]. Further comparison analysis also shows that FPE is more accurate than the two recently-developed methods, and our results agree well with the evolutionary fact.
Methods
Preprocessing
For each sequence or genome to be processed, before partitioning it into blocks, a preprocessing is performed, which includes: 1) converting all characters to uppercases; 2) discarding all non-base characters; 3) ignoring all line-breaks in each sequence file.
Sequence blocking
In our approach, each sequence is first split into several blocks, each of which consists of the same number of consecutive bases. The blocks are independent of each other and the block size can be changed in practice. Note that if the size of the last block is smaller than the specified block size, this block will be discarded. To be clearer, following is an example. In this example, there are two sequences. Assuming that the block size is 20, these two sequences are divided into 3 and 4 blocks, respectively. And the last blocks of size 4 are discarded.
Sequence blocking is an important step that brings two major advantages. First, the blocking strategy can capture fine-granularity information of sequences, including location and ordering information. Second, even for the long sequences, blocking can reduce memory and time consumption for sequence processing.
Mining maximal frequent patterns from sequences
Most of the previously proposed methods based on word frequency select some fixed-length words and then calculate their frequencies, such as [12] and [13]. As DNA sequences are strings generated from the alphabet {A, G, C, T}, there are totally 4^{ k } words of length k, which are also called k-mers in the literature. To describe a sequence well, the parameter k should be carefully selected, which possibly depends on the application domain. To avoid information loss, as many as possible patterns (or k-mers) should be considered. However, this will unavoidably introduce many low-information patterns (or noise). Keeping this in mind, our method in this paper tries to avoid manually determining the value of parameter k while taking all important patterns into account.
In addition, considering that subsequence rearrangements are normal during the biological evolution process, if there are too many rearrangements in the sequences, the results of alignment based methods may be unreliable.
Therefore, in this paper we adopt a modified PrefixSpan [21] algorithm to discover the frequent patterns in DNA sequences and consider only the maximal frequent patterns. This makes our method be considerably tolerant of subsequence rearrangement and noise. In the Experiments and Results section, we will present the experimental results that show the proposed method's tolerance of noise and subsequence rearrangements.
PrefixSpan [21] is an efficient algorithm for sequential pattern mining. However, our problem is a little different from traditional sequential pattern mining. Instead of mining a sequence database, we process a single sequence. And there is no gap between the items (subpatterns) in each pattern. We give a formal definition of our problem as follows:
Definition 1 (Mining maximal frequent patterns from a DNA sequence). Given a DNA sequence S that is a sequence of bases denoted by S = <s_{1}s_{2} ...s_{ n }> where n = |S| is the length of sequence and s_{ i }(1 ≤ i ≤ n) is a character from the charset Ω = {A, T, C, G}, and a predefined minimum support threshold s_{ min }, the support (denoted by sup) of a subsequence of S is the occurrences of the subsequence in S. A subsequence (or pattern) < s_{ k } s_{ k+1 }...s_{ m } >(1 ≤ k ≤ m ≤ n) is a frequent pattern if its support sup is no less than s_{ min }. A maximal frequent pattern is the one that none of its super-sequences are frequent. Our problem is to find all the maximal frequent patterns in the DNA sequence.
In frequent pattern mining, a close concept to maximal frequent pattern is closed frequent pattern. A closed frequent pattern is the one that none of its proper super-sequences have the same support as itself. So maximal frequent patterns must be closed frequent patterns. Actually, mining maximal frequent patterns is done by mining closed frequent patterns.
Example 2 Considering sequence 1 in Example 1. Let s_{ min } = 2 and check the sub-sequence 〈CT GA〉 in the first block of sequence 1, its sup is 2, so it is a frequent pattern in the first block of sequence 1. Furthermore, there is no any super-sequence of 〈CT GA〉 has a sup that is ≥ 2, so 〈CT GA〉 is a maximal frequent pattern in the first block of sequence 1.
Before describing the modified PrefixSpan algorithm, we give the definitions of prefix, suffix and projected database as Definition 2 and Definition 3. These definitions are a little different from those in [21]. Note that here we also perform pseudo-projection, instead of physical projection. That is, the projected database contains only the indexes of the suffixes, not the real suffixes. This technique is widely used in the area of frequent pattern mining. In the following algorithms and examples, we just use pseudo-projection for the α-projected database, S|_{ α }.
Definition 2 (Prefix and Suffix) Given a sequence S = <s_{1}s_{2} ...s_{ n }>, we say sequence δ = <s_{1}s_{2} ...s_{ m }> with m < n is a prefix of S, and sequence γ = <s_{ m+1 }s_{ m+2 }...s_{ n }> is the suffix of S with regard to prefix δ, which is denoted as γ = S/δ.
Definition 3 (Projected database) Let α be a sequential pattern in a sequence S, the α-projected database, denoted as S|α, is the collection of suffixes of S with regard to prefix α.
Algorithm 1 outlines the mining process. Assuming that the current pattern is frequent, the algorithm extends it by appending one base at a time (Line 3), and constructs the corresponding projected database (Line 4). If the extended pattern is frequent and closed (Algorithm 2), then the algorithm recursively calls itself with the extended pattern (Line 8). Therefore, the current pattern is always closed. If the current pattern cannot extend to any frequent pattern (Line 9), it is maximal according to Definition 1. And if the pattern is long enough, it will be saved with its projected database (Line 10).
To check whether a frequent pattern is closed, we adopt the method proposed in [22], which is outlined in Algorithm 2. First, we calculate the start positions of the pattern's occurrences in the input sequence (block), using its pseudo-projected database. That is, we subtract the length of the pattern from each value in the pseudo-projected database (Line 2-3). Then, we check whether the set of these positions is a subset of any single-item-projected database (Line 4). Here, a single item means any base in {A, G, C, T}. If the answer is "yes", the pattern is impossible to be a closed frequent pattern.
Algorithm 1: freqPattens(Ω, s_{ min }, l_{ min }, α, S|_{ α }, R) -- the modified PrefixSpan algorithm for mining maximal frequent patterns from DNA sequence.
Input : Ω - the charset of bases, namely, {A, T, C, G};
s_{ min } - the minimum support threshold;
l_{ min } - the minimum pattern length;
α - the current pattern;
S|_{ α } - the α-projected database;
Output: R - the set containing all the maximal frequent patterns and their corresponding projected databases;
1 isMaximal = true;
2 foreach s ∈ Ω do
3 Append s to α to form a new pattern β;
4 Construct the β-projected database S|_{ β } ;
5 if | S |_{ β }| ≥ s_{ min }then /* β is frequent */
6 isMaximal = false;
7 if isClosed( β , S | _{ β } ) then
8 Call freqPattens(Ω, s_{ min }, l_{ min }, β , S |_{ β }, R);
9 if isMaximal = true and | α | ≥ l_{ min }then /* α is maximal */
10 R = R ∪ {α, S|_{ α }};
11 return R;
To have a better understanding of the mining process, we give an example as follows:
Illustration of the mining process of the modified PrefixSpan algorithm.
current pattern | extended patterns |
---|---|
〈C〉: 7, 10, 14, 16, 17; | 〈CA〉: 8; 〈CC〉: 17; 〈CG〉: Empty; 〈CT 〉: 11, 15, 18; |
〈CT 〉: 11, 15, 18; | 〈CT A〉: Empty; 〈CT C〉: 16; 〈CT G〉: 12, 19; 〈CT T 〉: Empty; |
〈CT G〉: 12, 19; | 〈CTGA〉: 13, 20; 〈CT GC〉: Empty; 〈CT GG〉: Empty; 〈CT GT 〉: Empty; |
〈CTGA〉: 13, 20; | 〈CT GAA〉: Empty; 〈CT GAC〉: 14; 〈CT GAG〉: Empty; 〈CT GAT 〉: Empty; |
〈A〉: 1, 8, 13, 20; | 〈AA〉: Empty; 〈AC〉: 14; 〈AG〉: Empty; 〈AT〉: 2, 9; |
〈AT〉: 2, 9; | 〈AT A〉: Empty; 〈AT C〉: 10; 〈AT G〉: 3; 〈AT T 〉: Empty; |
〈G〉: 3, 4, 6, 12, 19; | 〈GA〉: 13, 20; 〈GC〉: 7; 〈GG〉: 4; 〈GT 〉: 5; |
〈T 〉: 2, 5, 9, 11, 15, 18; | 〈T A〉: Empty; 〈TC〉: 10, 16; 〈T G〉: 3, 6, 12, 19; 〈T T 〉: Empty; |
〈TC〉: 10, 16; | 〈T CA〉: Empty; 〈T CC〉: 17; 〈T CG〉: Empty; 〈T CT 〉: 11; |
〈T G〉: 3, 6, 12, 19; | 〈T GA〉: 13, 20; 〈T GC〉: 7; 〈T GG〉: 4; 〈T GT 〉: Empty; |
Algorithm 2: isClosed(α, S|_{ α }) -- determining whether a pattern is closed.
Input : α - the pattern;
S|_{ α } - the α-projected database;
Output: True - if the pattern is closed;
False - if the pattern is not closed;
1 P = ∅; /* initialize the position set */
2 foreach c ∈ S | _{ α } do
3 P = P ∪ {c − |α|};
4 if P ⊆S |_{ A } or P ⊆S | C or P ⊆S | G or P ⊆S | T then
5 return False;
6 else
7 return True;
Entropy calculation
where s_{ pat } is the support of the pattern, l_{ block } is the length of the block, and l_{ pat } is the length of the pattern. It is obvious that the probability of each pattern is positively correlated with its support and its length. When the length of pattern increases to the length of the block, its support will become 1 so that the probability equals 1. And the probability equals 0 when the support drops to 0. As we consider only maximal frequent patterns, and s_{ min } is usually >1, those two extreme cases will not happen.
where R is the set of maximal frequent patterns mined from the block. Finally, the entropies of all blocks constitute the dimensional components of the final vector of the sequence.
Probabilities of patterns.
pattern | s _{ pat } | l _{ pat } | p _{ pat } |
---|---|---|---|
〈AT 〉 | 2 | 2 | 2/(20 − 2 + 1) = 0.105263 |
〈CT GA〉 | 2 | 4 | 2/(20 − 4 + 1) = 0.117647 |
〈T C〉 | 2 | 2 | 2/(20 − 2 + 1) = 0.105263 |
Similarity calculation
In the above subsection, we have obtained a representative vector for each sequence.
However, as the vectors may differ in size, there should be a special way to measure
the similarity of two vectors of different sizes.
When n equals m, dist(V_{ 1 }, V_{ 2 }) is degenerated to Euclidean distance.
Experiments and results
Data used in this work
Details of β-Globin genes of 11 species.
No. | species | accession number | location | length (nt) |
---|---|---|---|---|
1 | Bovine | [GenBank:X00376] | 278-1741 | 1464 |
2 | Chimpanzee | [GenBank:X02345] | 4189-5532 | 1344 |
3 | Gallus | [GenBank:V00409] | 465-1810 | 1346 |
4 | Goat | [GenBank:M15387] | 279-1749 | 1471 |
5 | Gorilla | [GenBank:X61109] | 4538-5881 | 1344 |
6 | Human | [GenBank:U01317] | 62187-63610 | 1424 |
7 | Lemur | [GenBank:M15734] | 154-1595 | 1442 |
8 | Mouse | [GenBank:V00722] | 275-1462 | 1188 |
9 | Opossum | [GenBank:J03643] | 467-2488 | 2022 |
10 | Rabbit | [GenBank:V00882] | 277-1419 | 1143 |
11 | Rat | [GenBank:X06701] | 310-1505 | 1196 |
Experimental setting
We compare our method (FPE) with BLASTN [7] and two recently-developed alignment-free methods [14, 18]. The results of the MEGA [9] software are used as the ground truth. The experiments are done on a PC with Intel Xeon E5606 2.13GHz CPU and 8 GB memory. The operating system is Ubuntu 10.04. By default, we run our method with the minimum support threshold being 3 and the minimum pattern length being 2.
Experimental results and analysis
Preservation of fine-granularity information
Tolerance of sequence rearrangement
Tolerance of noise
Similarity analysis
Pairwise distance matrix of β-Globin genes of 11 species.
bovine | chimpanzee | gallus | goat | gorilla | human | lemur | mouse | opossum | rabbit | rat | |
---|---|---|---|---|---|---|---|---|---|---|---|
Bovine | 0.0000 | 0.0782 | 0.1112 | 0.0474 | 0.0782 | 0.0670 | 0.0824 | 0.0666 | 0.1202 | 0.0666 | 0.0663 |
Chimpanzee | 0.0782 | 0.0000 | 0.0679 | 0.0957 | 0.0000 | 0.0000 | 0.0558 | 0.0568 | 0.1579 | 0.0569 | 0.0569 |
Gallus | 0.1112 | 0.0679 | 0.0000 | 0.1239 | 0.0679 | 0.0792 | 0.0962 | 0.0806 | 0.1935 | 0.0805 | 0.0805 |
Goat | 0.0474 | 0.0957 | 0.1239 | 0.0000 | 0.0957 | 0.0820 | 0.0675 | 0.0939 | 0.0994 | 0.0939 | 0.0938 |
Gorilla | 0.0782 | 0.0000 | 0.0679 | 0.0957 | 0.0000 | 0.0000 | 0.0558 | 0.0568 | 0.1579 | 0.0569 | 0.0569 |
Human | 0.0670 | 0.0000 | 0.0792 | 0.0820 | 0.0000 | 0.0000 | 0.0478 | 0.0663 | 0.1537 | 0.0664 | 0.0664 |
Lemur | 0.0824 | 0.0558 | 0.0962 | 0.0675 | 0.0558 | 0.0478 | 0.0000 | 0.0942 | 0.1428 | 0.0945 | 0.0944 |
Mouse | 0.0666 | 0.0568 | 0.0806 | 0.0939 | 0.0568 | 0.0663 | 0.0942 | 0.0000 | 0.1643 | 0.0672 | 0.0671 |
Opossum | 0.1202 | 0.1579 | 0.1935 | 0.0994 | 0.1579 | 0.1537 | 0.1428 | 0.1643 | 0.0000 | 0.1643 | 0.1643 |
Rabbit | 0.0666 | 0.0569 | 0.0805 | 0.0939 | 0.0569 | 0.0664 | 0.0945 | 0.0672 | 0.1643 | 0.0000 | 0.0003 |
Rat | 0.0663 | 0.0569 | 0.0805 | 0.0938 | 0.0569 | 0.0664 | 0.0944 | 0.0671 | 0.1643 | 0.0003 | 0.0000 |
Comparison with existing methods
Comparison of the distances between human and the other tested species.
Bovine | Chimpanzee | Gallus | Goat | Gorilla | Lemur | Mouse | Opossum | Rabbit | Rat | Correlation coefficient | |
---|---|---|---|---|---|---|---|---|---|---|---|
MEGA 5.2 | 0.4485 | 0.0095 | 0.8456 | 0.4696 | 0.0117 | 0.2423 | 0.4815 | 0.8337 | 0.4083 | 0.4935 | - |
BLASTN 2.2.29+[7] | 0.8600 | 0.0896 | 0.9880 | 0.8765 | 0.0896 | 0.6643 | 0.9026 | 1.0000 | 0.8423 | 0.9182 | 0.8912 |
Method of [14] | 22.4257 | 5.3704 | 23.5869 | 26.8209 | 5.3704 | 25.2515 | 25.8007 | 25.9952 | 20.5706 | 27.0102 | 0.7569 |
Method of [18] | 0.1000 | 0.0100 | 0.2150 | 0.1050 | 0.0110 | 0.0550 | 0.0830 | 0.0890 | 0.0700 | 0.0620 | 0.8318 |
FPE | 0.0670 | 0.0000 | 0.0792 | 0.0820 | 0.0000 | 0.0478 | 0.0663 | 0.1537 | 0.0664 | 0.0664 | 0.8966 |
Discussion
Existing methods for measuring similarity/dissimilarity between DNA sequences roughly fall into two types: alignment-based methods and alignment-free methods. As alignment-based methods may be computationally expensive and not scalable to huge datasets, alignment-free methods have been extensively investigated recently.
For the alignment-free methods, on the one hand, algorithms based on word frequency are sensitive to the length of words used, and may include noise if taking all the words into account. On the other hand, algorithms based on geometric representation provide visual comparison among DNA sequences locally and globally. However, for long sequences, these methods may require too much computation overhead and memory. Recently, entropy-based methods provide simple representations for the sequences, but they are prone to lost some important information, for example, location and ordering information. As they treat all symbols or patterns equally, they may also include noise.
Our method is based on frequent patterns and entropy. As we consider only the maximal frequent patterns in a sequence, our method can considerably tolerate noise and sequence rearrangements. Furthermore, by using the blocking strategy, fine-granularity information of the sequences can be captured.
However, for more accurately evaluating the similarity between two sequences, some parameters have to be tuned. In our algorithm, the block size, the minimum support threshold and the minimum pattern length can be changed for different sequences. Note that to compare two sequences, the block size should be the same. And the larger the block size, the more information will be lost. Here, we present the following rule of thumb for parameter tuning:
1) By experiments, we found that it is better to set the block size less than 400, which can be also observed from Figure 3.
2) Once the block size is determined, there may be an optimum pair of the minimum support threshold and the minimum pattern length to be determined.
3) We observed a negative correlation between the minimum support threshold and the minimum pattern length. This means that if the minimum support threshold is large, the minimum pattern length should be set to relatively small. Otherwise, we may get no maximal frequent patterns in some blocks, and thus lose too much information.
Finally, we want to point out that different methods have their own application scenarios. For example, methods based on geometric representation are very powerful tools for visually and intuitively analyzing the sequences. As for our method, we provide an effective way to represent a DNA sequence to a vector, which is suitable for searching similar sequences in databases or acting as a preprocessing step for other applications. Considering that our method has been shown to provide more accurate distances, so it is also suitable for discovering evolutionary relationships.
Conclusion
This paper presents a novel method based on frequent patterns and entropy to represent the DNA sequences and evaluate their similarities. By using blocking technique, our method can capture fine-granularity information of sequences. Our method can also tolerate noise and sequence rearrangements because we take only the maximal frequent patterns into account. Experiments over the β-globin genes of 11 species show that our method achieves more accurate distances than two recently-developed alignment-free methods and the BLASTN tool.
Declarations
Acknowledgements
This work was firstly presented at the 10th International Symposium on Bioinformatics Research and Applications (ISBRA 2014), June 28-30, 2014, Zhangjiajie, China. And later a one-page abstract of this work with similar title and authors was included in the Proceedings of ISBRA 2014, Vol. 8492, LNCS, page 388, Springer, 2014.
Declarations
This work was supported by China 863 Program (grant No. 2012AA020403) and National Natural Science Foundation of China (NSFC) under grants No. 61173118 and No. 61272380.
This article has been published as part of BMC Genomics Volume 16 Supplement 3, 2015: Selected articles from the 10th International Symposium on Bioinformatics Research and Applications (ISBRA-14): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S3.
Authors’ Affiliations
References
- Loh P-R, Baym M, Berger B: Compressive genomics. Nature Biotechnology. 30 (7): 627-630.Google Scholar
- Pushkarev D, Neff NF, Quake SR: Single-molecule sequencing of an individual human genome. Nature Biotechnology. 2009, 27 (9): 847-850. 10.1038/nbt.1561.PubMed CentralView ArticlePubMedGoogle Scholar
- Horner D, Pavesi G, Castrignanò T, De Meo P, Liuni S, Sammeth M, Picardi E, Pesole G: Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Briefings in Bioinformatics. 2010, 11 (2): 181-197. 10.1093/bib/bbp046.View ArticlePubMedGoogle Scholar
- Pinho AJ, Pratas D, Garcia SP: Green: a tool for efficient compression of genome resequencing data. Nucleic Acids Research. 2012, 40 (4): 27-27. 10.1093/nar/gkr1124.View ArticleGoogle Scholar
- Kuruppu S, Puglisi SJ, Zobel J: Optimized relative lempel-ziv compression of genomes. Proceedings of the Thirty-Fourth Australasian Computer Science Conference. 2011, Australian Computer Society, Inc, 113: 91-98.Google Scholar
- Durbin R: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998, Cambridge University Press, UKView ArticleGoogle Scholar
- Altschul SF, Madden TL, Sch¨affer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Research. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
- Pearson WR: Rapid and sensitive sequence comparison with fastp and fasta. Methods in Enzymology. 1990, 183: 63-98.View ArticlePubMedGoogle Scholar
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: Mega5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular Biology and Evolution. 2011, 28 (10): 2731-2739. 10.1093/molbev/msr121.PubMed CentralView ArticlePubMedGoogle Scholar
- Sandes EFdO, Miranda G, de Melo AC, Martorell X, Ayguade E: Cudalign 3.0: Parallel biological sequence comparison in large gpu clusters. Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium On. 2014, IEEE, 160-169.Google Scholar
- Vinga S, Almeida J: Alignment-free sequence comparison--a review. Bioinformatics. 2003, 19 (4): 513-523. 10.1093/bioinformatics/btg005.View ArticlePubMedGoogle Scholar
- Blaisdell BE: A measure of the similarity of sets of sequences not requiring sequence alignment. Proceedings of the National Academy of Sciences. 1986, 83 (14): 5155-5159. 10.1073/pnas.83.14.5155.View ArticleGoogle Scholar
- Wu T-J, Hsieh Y-C, Li L-A: Statistical measures of dna sequence dissimilarity under markov chain models of base composition. Biometrics. 2001, 57 (2): 441-448. 10.1111/j.0006-341X.2001.00441.x.View ArticlePubMedGoogle Scholar
- Yu H-J, Huang D-S: Graphical representation for dna sequences via joint diagonalization of matrix pencil. IEEE Journal of Biomedical and Health Informatics. 2013, 17 (3): 503-511.View ArticlePubMedGoogle Scholar
- Jafarzadeh N, Iranmanesh A: C-curve: a novel 3d graphical representation of dna sequence based on codons. Mathematical Biosciences. 2013, 241 (2): 217-224. 10.1016/j.mbs.2012.11.009.View ArticlePubMedGoogle Scholar
- Chi R, Ding K: Novel 4d numerical representation of dna sequences. Chemical Physics Letters. 2005, 407 (1): 63-67.View ArticleGoogle Scholar
- Liao B, Li R, Zhu W, Xiang X: On the similarity of dna primary sequences based on 5-d representation. Journal of Mathematical Chemistry. 2007, 42 (1): 47-57. 10.1007/s10910-006-9091-z.View ArticleGoogle Scholar
- Li C, Ma H, Zhou Y, Wang X, Zheng X: Similarity analysis of dna sequences based on the weighted pseudo-entropy. Journal of Computational Chemistry. 2011, 32 (4): 675-680. 10.1002/jcc.21656.View ArticlePubMedGoogle Scholar
- Zhang X, Zhou XA, Yu YH: Similarity analysis of dna using improved approximate entropy. Biomedical Engineering and Biotechnology, International Conference. 2012, IEEE, 511-514.Google Scholar
- Pincus SM: Approximate entropy as a measure of system complexity. Proceedings of the National Academy of Sciences. 1991, 88 (6): 2297-2301. 10.1073/pnas.88.6.2297.View ArticleGoogle Scholar
- Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu M-C: Mining sequential patterns by pattern-growth: The prefixspan approach. Knowledge and Data Engineering, IEEE Transactions. 2004, 16 (11): 1424-1440. 10.1109/TKDE.2004.77.View ArticleGoogle Scholar
- Wang J, Han J: Bide: Efficient mining of frequent closed sequences. Data Engineering, 2004. Proceedings. 20th International Conference. 2004, IEEE, 79-90.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.