- Research
- Open Access

# Optimal choice of word length when comparing two Markov sequences using a *χ*
^{2}-statistic

- Xin Bai
^{1}, - Kujin Tang
^{2}, - Jie Ren
^{2}, - Michael Waterman
^{1, 2}and - Fengzhu Sun
^{1, 2}Email author

**Published:**3 October 2017

## Abstract

### Background

Alignment-free sequence comparison using counts of word patterns (grams, *k*-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate *χ*
^{2}-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length *k* in such studies.

### Results

We develop an optimal strategy to choose *k* by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be *r*
_{1} and *r*
_{2}, respectively. We show through both simulations and theoretical studies that the optimal *k*= max(*r*
_{1},*r*
_{2})+1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains.

### Conclusion

Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences.

## Keywords

- Markov chain
- Alignment-free genome comparison
- Statistical power
- NGS

## Background

The comparison of genome sequences is important for understanding their relationships. The most widely used methods are alignment based algorithms such as the Smith-Waterman algorithm [1], BLAST [2], BLAT [3], etc. In such studies, homologous genes among the genomes are identified, aligned, and then their relationships inferred using phylogenetic analysis tools to obtain gene trees. A consensus tree combining the gene trees from all the homologous genes is used to represent the relationship among the genomes. However, non-conserved regions form large fractions of most genomes and they also contain information about the relationships among the sequences. Most alignment based methods do not consider the non-conserved regions resulting in loss of information. Another drawback of the alignment based method is the extremely long time needed for the analysis, especially when the number of genome sequences is large.

With the development of new sequencing technologies, a large number of genome sequences are now available and many more will be generated. To overcome the challenges facing alignment based methods for the study of genome sequence relationships, several alignment-free sequence comparison methods have been developed as reviewed in [4, 5]. Most of the methods use the counts of word patterns within the sequences [6–12]. One important problem is the determination of word length used for the comparison of sequences. Several investigators addressed this issue using simulation studies or empirical data [13–15]. Wu et al. [15] investigated the performance of Euclidian distance, standardized Euclidian distance, and symmetric Kullback–Leibler discrepancy (SK-LD) for alignment free genome comparison. For a given dissimilarity measure, Wu et al. [15] simulated the evolution of two sequences with different mutation rates and chose the word length that yielded the highest Spearman correlation between the dissimilarity measure and the mutation rate. They showed that SK-LD performed well and the optimal word length increases with the sequence length. Using a similar approach, Forêt et al. [14] studied the optimal word length for *D*
_{2} that measures the number of shared words between two sequences [8]. Sims et al. [13] suggested a range for the optimal word length using alignment-free genome comparison with SK-LD.

Markov chains (MC) have been widely used to model molecular sequences to solve several problems including the enrichment and depletion of certain word patterns [16], prediction of occurrences of long word patterns from short patterns [17, 18], and the detecting of signals in introns [19]. Narlikar et al. [20] showed the importance of using appropriate Markov models on phylogenetic analysis, assignment of sequence fragments to different genomes in metagnomic studies, motif discovery, and functional classification of promoters.

In this paper, we consider the comparison of two sequences modelled using Markov chains [11, 12] as a hypothesis testing problem. The null hypothesis is that the two sequences are generated by the same Markov chain. The alternative hypothesis is that they are generated by different Markov chains. We investigate a log-likelihood ratio statistic for testing the hypotheses and its corresponding *χ*
^{2}-statistic based on the counts of word patterns in the sequences. The details of the statistics are given in “The likelihood ratio statistic and the *χ*
^{2}-statistic for comparing two Markov sequences” subsection. We use statistical power of the test statistic under the alternative hypothesis to evaluate its performance. We will study the following questions. a) What is the optimal word length *k* yielding the highest power of the *χ*
^{2}-statistic? b) How do the estimated orders of the Markov sequences, sequence length, word length, and sequencing error rate impact the power of the *χ*
^{2}-statistic? c) For NGS read data, what is the distribution of the *χ*
^{2}-statistic under the null hypothesis? (d) Do the conclusions from (a) and (b) still hold for NGS reads?

## Methods

### Alignment-free comparison of two long Markov sequences

We study alignment-free comparison of two long Markov sequences using counts of word patterns. We first introduce the likelihood ratio [11, 12] and corresponding *χ*
^{2}-statistic. We show theoretically and by simulations that the optimal word length is *k*= max{*r*
_{1},*r*
_{2}}+1, where *r*
_{1} and *r*
_{2} are the orders of the two Markov sequences. We then study the effects of sequence length, word length, and estimated orders of MCs on the power of the *χ*
^{2}-statistic.

### The likelihood ratio statistic and the *χ*
^{2}-statistic for comparing two Markov sequences

Given two Markov sequences **A**
_{1} and **A**
_{2}, we want to test if the two sequences follow the same MC, that is, if their transition probability matrices are the same. We formulate this as a hypothesis testing problem. The null hypothesis *H*
_{0} is that the two sequences are generated from the same MC. The alternative hypothesis *H*
_{1} is that the two sequences are generated from MCs with different transition probability matrices.

*k*(

*k*≥1) to test if the two sequences are from the same MC of order

*k*−1 as in [11]. The basic formulation of the problem can be described as follows. Let

*L*

_{ s }is the length of the

*s*-th sequence and

*A*

_{ s,i }, 1≤

*i*≤

*L*

_{ s }is the letter of the sequence at the

*i*-th position.

*k*−1. The probability of the

*s*-th sequence is

where **w**=*w*
_{1}
*w*
_{2}⋯*w*
_{
k
} is any word pattern of length *k*, **w**
^{−}=*w*
_{1}
*w*
_{2}⋯*w*
_{
k−1} (the last letter is removed), \(N^{(s)}_{\mathbf {w}}\) is the number of occurrences of word **w**, and *t*
^{(s)}(**w**
^{−},*w*
_{
k
}) is the (*k*−1)-th order transition probability from **w**
^{−} to *w*
_{
k
} in the *s*-th sequence, and *π*
^{(s)} is the initial distribution.

*t*

^{(s)}(

**w**

^{−},

*w*

_{ k }) is

*s*-th sequence \(\hat {P}(\mathbf {A}_{s})\) by replacing

*t*

^{(s)}(

**w**

^{−},

*w*

_{ k }) with \(\hat {t}^{(s)}(\mathbf {w}^{-}, w_{k})\) in equation (1). The likelihood of both sequences under the alternative hypothesis

*H*

_{1}is

*H*

_{0}, the transition matrices for the two sequences are the same. Using the same argument as above, we can show that the maximum likelihood estimate of the common transition probability

*t*(

**w**

^{−},

*w*

_{ k }) is given by

*P*

_{0}, of both sequences can be estimated similarly as in Eq. (2). The log-likelihood ratio statistic is given by (ignoring the first

*k*−1 bases in each sequence)

The above statistic has an approximate *χ*
^{2}-distribution as the lengths of both sequences become large [21, 22].

*χ*

^{2}-statistic [11] defined by

Since 2 log(*P*
_{1}/*P*
_{0}) and *S*
_{
k
} are approximately equal, in our study, we use the measure *S*
_{
k
} for sequence comparison.

*r*=0) have the same nucleotide frequencies, we set

*k*=1, \(N^{(s)}_{\mathbf {w}^{-}} = L_{s}, ~s = 1, 2\), \(N^{(-)}_{\mathbf {w}^{-}} = L_{1} + L_{2},\) and

*S*

_{1}is calculated by

where **w** is a nucleotide and the summation is over all the nucleotides, \(p_{\mathbf {w}}^{(s)} = N_{\mathbf {w}}^{(s)}/L_{s}\), and *L*
_{
s
} is the length of the *s*-th sequence.

### Estimating the order of a MC sequence

*r*, of the MC corresponding to each sequence and it needs to be estimated from the data. Several methods have been developed to estimate the order of a MC including those based on the Akaike information criterion (AIC) [23] and Bayesian information criterion (BIC) [24]. The AIC and BIC for a Markov sequence of length

*L*are defined by

*C*is the alphabet size. The estimators of the order of a Markov sequence based on AIC and BIC are given by

**w**estimated by a

*k*−2-th order MC.

*T*

_{ k }has an approximate

*χ*

^{2}-distribution with

*df*

_{ k }=(

*C*−1)

^{2}

*C*

^{ k−2}degrees of freedom when

*k*≥

*r*+2 [21, 22, 27, 28]. When

*k*<

*r*+2,

*T*

_{ k }will be large if the sequence is long, while

*T*

_{ k }should be moderate when

*k*≥

*r*+2. Based on this idea, we can estimate the order of the MC by

*T*

_{ k }directly, we can calculate the corresponding p-value

*t*

_{ k }is the observed value of

*T*

_{ k }based on the long sequence. Since

*t*

_{ k }is generally large when

*k*≤

*r*+1 and thus

*p*

_{ k }should be small, while

*p*

_{ k }is moderate when

*k*≥

*r*+2. Based on this idea, we can estimate the order of a MC by

**w**,

*Z*

_{ w }is approximately normally distributed when

*k*≥

*r*+2. When the sequence is long, we expect

*Z*

_{max}(

*k*)= max

**w**,|

**w**|=

*k*|

*Z*

_{ w }| to be large when

*k*≤

*r*+1, while it is moderate when

*k*≥

*r*+2. Similar to the ideas given above, we can estimate the order of the MC by

We are interested in knowing the power loss of the *χ*
^{2}-statistic when any of the estimated orders of the two sequences are used for the comparison of MC sequences.

### Alignment-free comparison of two Markov sequences based on NGS reads

We then investigate the comparison of sequences based on NGS reads. We first extend the *χ*
^{2}-statistic in Eq. (4) to be applicable to NGS reads. We then extend the methods for estimating the order of MC sequences for long sequences to be applicable to NGS reads. Finally, we study the optimal word length for genome comparison based on NGS reads and investigate the effect of sequence length, read length, distributions of reads along the genome, and sequencing errors on the power of the statistic.

### Alignment-free dissimilarity measures for comparing Markov sequences based on NGS reads

*χ*

^{2}-statistic in Eq. (4) depends only on word counts in the two sequences, and thus can be easily extended to NGS read data. We replace

*N*

_{ w }in Eq. (4) by \(N^{R}_{\mathbf {w}}\), the number of occurrences of word pattern

**w**among the NGS reads, to obtain a new statistic,

We will use \(S_{k}^{R}\) to measure the dissimilarity between the two sequences.

### Estimating the order of a Markov sequence based on NGS reads

We next extend the estimators of the order of a MC in “Estimating the order of a MC sequence” subsection to NGS reads. The estimators *r*
_{AIC} and *r*
_{BIC} cannot be directly calculated because the likelihood of the reads is hard to calculate due to the potential overlaps among the reads. On the other hand, the other remaining estimators in “Estimating the order of a MC sequence” subsection, *r*
_{
PS
}, *r*
_{
S
},*r*
_{
p
}, and *r*
_{
Z
}, depend only on the word counts and we can just replace *N*
_{
w
} in these Eqs. by \(N^{R}_{\mathbf {w}}\) for the NGS data. For simplicity of notation, we will continue to use the same notation as that in “Estimating the order of a MC sequence” subsection for the corresponding estimators. Similar to the study of long sequences, we investigate the power loss of the statistic \(S_{k}^{R}\) when the estimated orders of the sequences are used to compare the power of \(S_{k}^{R}\) when the true orders of the sequences are used.

## Results

### Optimal word length for the comparison of Markov sequences using the *χ*
^{2}-statistic

The following theorem gives the optimal word length for the comparison of two sequences using the *χ*
^{2}-statistics given in Eqs. 4 and (5). The theoretical proof is given in the Additional file 1.

###
**Theorem 1**

Consider two Markov sequences of orders *r*
_{1} and *r*
_{2}, respectively. We test the alternative hypothesis *H*
_{1}: the transition matrices of the two Markov sequences are different, versus the null hypothesis *H*
_{0}: the transition probability matrices are the same, using the *χ*
^{2}-statistic in Eqs. (4) and (5). Then the power of the *χ*
^{2}-statistic under the alternative hypothesis is maximized when the word length *k*= max{*r*
_{1},*r*
_{2}}+1.

*S*

_{ k }in Eqs. (4) and (5) for different values of sequence length and word pattern length. We simulated two Markov sequences

**A**

_{1}and

**A**

_{2}with different transition matrices and then calculated the distributions of the

*χ*

^{2}-statistic. We set the length of both sequences to be the same

*L*: 10, 20 and 30 kbps, respectively, and started the sequences from the stationary distribution. We simulated MCs of first order and second order, respectively. Tables 1 and 2 show the transition probability matrices of (a) the first and (b) the second order transition matrices we used in the simulations. Here we present simulation results based on transition matrices from Tables 1 and 2 for simplicity. We also tried other transition matrices and the conclusions were the same.

The transition probability matrix of the first order Markov chain in our simulation studies

A | C | G | T | |
---|---|---|---|---|

A | 0.1 | 0.2 | 0.3 | 0.4 |

C | 0.2 | 0.3 | 0.4 | 0.1 |

G | 0.3 | 0.4 | 0.1 | 0.2 |

T | 0.4 | 0.1 | 0.2 | 0.3 |

The transition probability matrix of the second order Markov chain

A | C | G | T | |
---|---|---|---|---|

AA | 0.1+ | 0.2- | 0.3+ | 0.4- |

AC | 0.2+ | 0.3- | 0.4+ | 0.1- |

AG | 0.3+ | 0.4- | 0.1+ | 0.2- |

AT | 0.4+ | 0.1- | 0.2+ | 0.3- |

CA | 0.1+ | 0.2- | 0.3+ | 0.4- |

CC | 0.2+ | 0.3- | 0.4+ | 0.1- |

CG | 0.3+ | 0.4- | 0.1+ | 0.2- |

CT | 0.4+ | 0.1- | 0.2+ | 0.3- |

GA | 0.1+ | 0.2- | 0.3+ | 0.4- |

GC | 0.2+ | 0.3- | 0.4+ | 0.1- |

GG | 0.3+ | 0.4- | 0.1+ | 0.2- |

GT | 0.4+ | 0.1- | 0.2+ | 0.3- |

TA | 0.1+ | 0.2- | 0.3+ | 0.4- |

TC | 0.2+ | 0.3- | 0.4+ | 0.1- |

TG | 0.3+ | 0.4- | 0.1+ | 0.2- |

TT | 0.4+ | 0.1- | 0.2+ | 0.3- |

The parameters *α*
_{
i
},*β*
_{
i
},*γ*
_{
i
},*δ*
_{
i
}, *i*=1,2, in Table 2 control the transition matrix of the second order MC. Note that if *α*
_{
i
}=*β*
_{
i
}=*γ*
_{
i
}=*δ*
_{
i
}, *i*=1,2, the MC will become a first order MC.

Under the null hypothesis, sequences **A**
_{1} and **A**
_{2} follow the same Markov model. So we set the transition matrices for both **A**
_{1} and **A**
_{2} to be Table 1. Under the alternative hypothesis, the two sequences are different and we set the transition matrix of sequence **A**
_{1} to be from Table 1 and the transition matrix of sequence **A**
_{2} to be from Table 2. We set the parameters of Table 2 to be (1) *α*
_{
i
}=*β*
_{
i
}=*γ*
_{
i
}=*δ*
_{
i
}=0.05, *i*=1,2, and (2) *α*
_{1}=*α*
_{2}=0.05,*β*
_{1}=*β*
_{2}=−0.05,*γ*
_{1}=*γ*
_{2}=0.03,*δ*
_{1}=*δ*
_{2}=−0.03. The former scenario corresponds to the situation that sequences **A**
_{1} and **A**
_{2} have different orders and the latter scenario corresponds to the situation that they both have first order but different transition matrices. We then calculated the dissimilarity measure between sequence **A**
_{1} and **A**
_{2} using the *χ*
^{2}-statistic in Eq. (4).

We repeated the above procedures 2000 times to obtain an approximate distribution of *S*
_{
k
} under the null hypothesis. We sorted the value of *S*
_{
k
} in ascending order and took the 95% percentile as a threshold. Under the alternative hypothesis, the power is approximated by the fraction of times that *S*
_{
k
} is above the threshold.

*k*and the power of

*S*

_{ k }for long sequences of different lengths. It can be seen from the figure that the power of

*S*

_{ k }is highest when the word length is

*k*

_{optimal}= max{

*r*

_{1},

*r*

_{2}}+1. When the word length is less than the optimal value, the power of

*S*

_{ k }can be significantly lower. On the other hand, when the word length is slightly higher than the optimal word length, the power of

*S*

_{ k }is still close to the optimal power. However, when the word length is too large, the power of

*S*

_{ k }can be much lower.

*S*

_{ k }changes when the estimated orders of the sequences are used compared to the power when the true orders of the sequences are known. Let \(\hat {r}_{1}\) and \(\hat {r}_{2}\) be the estimated orders of sequences

**A**

_{1}and

**A**

_{2}, respectively. We compared the power of \(S_{\hat {k}}\) where \(\hat {k} = \max \left \{ \hat {r}_{1}, \hat {r}_{2} \right \} + 1\) with that of

*S*

_{ k−optimal}where

*k*−optimal= max{

*r*

_{1},

*r*

_{2}}+1. The power loss is defined as the difference between the power of

*S*

_{ k−optimal}and that of \(S_{\hat {k}}\). When both sequences are of first order, there was no power loss in our simulations. Figure 2 shows the power loss using different methods to estimate the orders of the sequences described in Eqs. (6) to (11) when the first sequence is of first order and the second sequence is of second order. There are significant differences among the various estimators when the sequence length is below 20 kbps. The power loss is minimal based on

*r*

_{AIC},

*r*

_{BIC}, and

*r*

_{ p }for all three sequence lengths from 10 to 30 kbps, indicating their good performance in estimating the true Markov order of the sequence. When the sequence length is long, e.g 30kpbs, the power loss is minimal for all the estimators across the sequence lengths simulated.

### Optimal word length for \(S_{k}^{R}\) for the comparison of two Markov sequences with NGS data

The distribution of \(S_{k}^{R}\) was not known previously. In this paper, we have the following theorem whose proof is given in the Additional file 1.

###
**Theorem 2**

*L*and Markov orders of

*r*

_{1}and

*r*

_{2}, respectively. Suppose that they are sequenced using NGS with

*M*reads of length

*κ*for each sequence. Let \(S_{k}^{R}\) be defined as in Eqs. (12) and (13). Suppose that each sequence can be divided into (not necessarily contiguous) regions with constant coverage

*r*

_{ i }for the

*i*-th region, so that every base is covered exactly

*r*

_{ i }times. Let

*L*

_{ is }be the length of the

*i*-th region in the short read data for the

*s*-th sequence and \({\lim }_{L\rightarrow \infty } L_{is}/L=f_{i},~s=1,2\). Then

- 1.Under the null hypothesis that the two sequences follow the same Markov chain, as sequence length
*L*becomes large, \(S_{k}^{R}/d\) is approximately*χ*^{2}-distributed with degrees of freedom*df*_{ k }=(*C*−1)*C*^{ k−1}, where*C*is the alphabet size and$$ d=\frac{\sum_{i}r^{2}_{i}f_{i}}{\sum_{j}r_{j}f_{j}}. $$(14)In particular, under the Lander-Waterman model, the reads are randomly sampled from the long sequence so that the NGS reads follow a Poisson process with rate

*λ*=*M**κ*/*L*[29], for*r*_{ i }=*i*,*f*_{ i }=*λ*^{ i }exp(−*λ*)/*i*!,*d*=1+*λ*. - 2.
If we use \(S_{k}^{R}\) to test whether the two sequences follow the same MC, under the alternative hypothesis, the power of \(S_{k}^{R}\) is the highest when

*k*= max{*r*_{1},*r*_{2}}+1.

To illustrate the first part of Theorem 2, we simulated the distribution of \(S^{R}_{k}\) under the null hypothesis. We assumed that both sequences are of order 1 with the transition probability matrix from Table 1. First, we generated MCs with length of *L*=10 and 20 kbps, respectively. The simulations of long sequences were the same as in “Optimal word length for the comparison of Markov sequences using the *χ*
^{2}-statistic” subsection. Second, we simulated NGS reads by sampling a varying number of reads from each sequence. The sampling of the reads was simulated as in [26,30]. The length of the reads was assumed to be a constant *κ*=200 bps and the number of reads *M* = 100 and 200 bps, respectively. The coverage of reads is calculated as *λ*=*M*
*κ*/*L*. Two types of read distributions were simulated: (a) homogeneous sampling that the reads were sampled uniformly along the long sequence [29], and (b) heterogeneous sampling as in [31]. In heterogeneous sampling, we evenly divided the long genome sequences into 100 blocks. For each block, we sampled a random number independently from the gamma distribution *Γ*(1,20). The sampling probability for each position in the block is proportional to the chosen number.

Sequencing errors are present in NGS data. In order to see the effect of sequencing errors on the distribution of \(S^{R}_{k}\), we simulated sequencing errors such that each base was changed to other three bases with equal probability 0.005.

Once the reads are generated, we then calculated \(S^{R}_{k}\) between two NGS read data sets. In our simulation study, we fixed *k*=3 and the simulation process was repeated 2000 times for each combination of sequence length and number of reads (*L,M*) to obtain the approximate distribution of \(S^{R}_{3}/d\), where *d* is given in Eq. (14).

*χ*

^{2}distribution. The constant

*d*is 1+

*λ*where

*λ*denotes the coverage for homogeneous sampling; and

*d*is calculated from Eq. (14) for heterogeneous sampling. It can be seen from the figure that the Q-Q plots center around the line

*y*=

*x*for both homogeneous and heterogeneous sampling without sequencing errors. These observations are consistent with part 1 of the Theorem 2. However, when sequence errors are present, the distribution of \(S^{R}_{3}/d\) deviates slightly from \(\chi ^{2}_{48}\).

We next studied how the power of \(S_{k}^{R}\) changes with word length, sequence length, and sequencing errors. Here we show the results for the scenario that one sequence has first order and the other has second order. The results for the scenario that both sequences are of first order are given in the Additional file 1.

*k*and the power of \(S^{R}_{k}\) using NGS short reads for different sampling of the reads and with/without sequencing errors. Several conclusions can be derived. First, the power of \(S^{R}_{k}\) is the highest when the word length

*k*= max{

*r*

_{1},

*r*

_{2}}+1. This is consistent with the result with long sequences. Second, sequencing errors can decrease the power of \(S^{R}_{k}\). However, with the range of sequencing error rates of current technologies, the decrease in power is minimal. Third, the power of \(S^{R}_{k}\) based on heterogeneous sampling of the reads is lower than that based on homogeneous sampling of the reads. Fourth, the power of \(S^{R}_{k}\) increases with both sequence length

*L*and number of reads

*M*as expected.

*χ*

^{2}-statistic” subsection to study this problem except that we change long sequences to NGS reads. Figure 5 shows the results. It can be seen that the power loss is significant except when

*r*

_{ p }was used to estimate the order of the sequences. In all the simulated scenarios, the power loss is very small when

*r*

_{ p }is used to estimate the orders of Markov sequences. This result is consistent with the case of long sequences where

*r*

_{ p }also performs the best.

## Applications to real data

### Searching for homologs of the human protein HSLIPAS

We used *S*
_{
k
} to analyze the relationship of 40 sequences chosen from mammals, invertebrates, viruses, plants, etc. as in [32,33]. We used HSLIPAS human lipoprotein lipase (LPL) of length 1612 bps as the query sequence and searched for similar sequences from a library set containing 39 sequences with length from 322 to 14,121 bps. The relationships among all the 40 sequences are well understood. Among the 39 library sequences, 20 sequences are from the primate division of Genbank, classified as being related to HSLIPAS, and 19 sequences that are from the divisions other than the primate division of Genbank, classified as being not related.

Wu et al. [32] estimated the orders of the 40 sequences using Schwarz information criterion (SIC) [34] and found that 13 of them follow independent identically distributed (i.i.d) model (order = 0) and 27 of them follow a first order MC. We also used BIC and found the same results as SIC.

As in Wu et al. [32], we used *selectivity* and *sensitivity* to quantify the performance of the measure *S*
_{
k
} for different values of *k*. First, we calculated the dissimilarity between HSLIPAS and each of the 39 sequences using *S*
_{
k
} and then ranked the 39 sequences in ascending order according to the values of *S*
_{
k
}. The sequence closest to HSLIPAS is ranked as sequence 1, the sequence with the next shortest distance as sequence 2, etc. *Sensitivity* is defined as the number of HSLIPAS-related sequences found among the first 20 (1-20) library sequences. *Selectivity* is measured in terms of consecutive correct classifications [35], that is, starting from sequence 1, the total number of sequences are counted until the first non-HSLIPAS-related library sequence occurs. Thus, *selectivity* and *sensitivity* are scores from 0 to 20 and higher score means better performance on the real data set.

*S*

_{ k }for different values of

*k*from 1 to 6. It can be seen from Table 3 that

*k*=2 yields the best result for both selectivity and sensitivity. Since about two thirds of the sequences have estimated order 1 and one third of the sequences have estimated order 0, the results are consistent with our conclusion.

The selectivity and sensitivity of *S*
_{
k
} for different word length *k* based on the comparison of HSLIPAS with 39 library sequences

Word length | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|

selectivity | 7 | 11 | 10 | 7 | 3 | 1 |

sensitivity | 13 | 17 | 16 | 13 | 12 | 9 |

### Comparison of CRM sequences in four mouse tissues

We also used *S*
_{
k
} to analyze cis-regulatory module (CRM) sequences in four tissues from developing mouse embryo [36–38] as in Song et al. [4]. The four tissues we used are forebrain, heart, limb and midbrain, with the average sequence lengths to be 711, 688, 657, and 847 bps, respectively. For each tissue, we randomly chose 500 sequences from the CRM dataset to form the *positive* dataset. For each sequence in the positive dataset, we randomly selected a fragment from the mouse genome with the same length, ensuring a maximum of 30% repetitive sequences to form the *negative* dataset. Thus, we have a negative dataset containing another set of 500 sequences.

We calculated the pairwise dissimilarity of sequences within the positive and also the negative dataset using the *S*
_{
k
} statistic with word length from 1-7. Then we merged the pairwise dissimilarity from the positive and negative datasets together. Sequences within the positive dataset should be closer than sequences within the negative dataset because the positive sequences should share some common CRMs. Therefore, we ranked the pairwise dissimilarity in ascending order and then predicted sequence pairs with distance smaller than a threshold as from the positive sequence pairs and otherwise we predicted them as coming from the negative pairs. For each threshold, we calculated the false positive rate and the true positive rate. Thus, by changing the threshold, we plotted the receiver operating characteristic (ROC) curve and calculated the area under the curve (AUC). For each tissue and each word length *k*, we repeated the above procedures 30 times.

We used BIC to estimate the MC orders of the sequences. The estimated orders of positive sequences for all four tissues are given in the Additional file 1. Almost all positive sequences in the positive dataset have estimated orders of 0 or 1. The results are similar for the negative sequences (data not shown).

*k*and the AUC values in all four tissues using boxplot for the 30 replicates. It can be seen from the figure that the AUC values using word length 1-3 are much higher than that using word length 4-7. The AUC values when

*k*=1 are slightly higher than that when

*k*=2 and

*k*=3. However, the differences are relatively small. The results are consistent in all four tissues. These results show that when the word length is close to the optimal word length based on our theoretical results, the AUC is generally higher than that when the word length is far away from the optimal word length based on our theoretical results.

## Discussion

In this paper, we investigated only the *χ*
^{2}-statistic for alignment-free genome comparison and the optimality criterion is to maximize the power of the *χ*
^{2}-statistic under the alternative hypothesis. Many other alignment-free genome comparison statistics are available as reviewed in [4,5]. The optimal word length we derived in this study may not be applicable to other statistics.

We assumed that the sequences of interest are Markov chains. Real molecular sequences do not exactly follow Markov chains and the sequences are also highly related. The relationship between the true evolution distance between the sequences and the pairwise *χ*
^{2}-dissimilarity using the optimal word length needs to be further investigated. These are the topics for future studies.

## Conclusions

In this paper, we study the optimal word length when comparing two Markov sequences using word count statistics, in particular, the likelihood ratio statistic and the corresponding *χ*
^{2}-statistic defined in Eq. (4). We showed theoretically and by simulations that the optimal word length is *k*= max{*r*
_{1},*r*
_{2}}+1. When the orders of the sequences are not known and have to be estimated from the sequence data, we showed that the estimator *r*
_{
p
} defined in Eq. (10) and the estimator *r*
_{AIC} defined in Eq. (6) have the best performance, followed by *r*
_{BIC} defined in Eq. (7) based on long sequences. We then extended these studies to NGS read data and found that the conclusions about the optimal word length continue to hold. It was also shown that if we use *r*
_{
p
} defined in Eq. (10) to estimate the orders of the Markov sequences based on NGS reads \(\hat {r}_{p1}\) and \(\hat {r}_{p2}\), respectively, and then compare the sequences using \(S_{\hat {k}-\text {optimal}}\), with \(\hat {k}-\text {optimal} = \max \{ \hat {r}_{p1}, \hat {r}_{p2} \}+1 \), the power loss is minimal. These conclusions are not significantly changed by sequencing errors. Therefore, our studies provide guidelines on the optimal choice of word length for alignment-free genome comparison using the *χ*
^{2}-statistic.

## Declarations

### Acknowledgements

We would like to thank Prof. Minping Qian at Peking University (PKU) for suggestions on the proof of Theorem 1, and Yang Y Lu at USC for providing the software and suggestions to improve the paper.

### Funding

This research is partially supported by US NSF DMS-1518001 and OCE 1136818, Simons Institute for the Theory of Computing at UC Berkeley, and Fudan University, China. The publication costs of this paper were provided by Fudan University, Shanghai, China.

### Availability of data and materials

Data of the first real data application can be downloaded from [33]. Data of the second real data application can be downloaded from [37].

### About this supplement

This article has been published as part of *BMC Genomics* Volume 18 Supplement 6, 2017: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2016: genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-18-supplement-6.

### Authors’ contributions

FS and MSW conceived the study, designed the framework of the paper and finalized the manuscript. XB did the simulation studies, proved the theorems, and wrote the manuscript. KT and JR participated in the real data analysis. All authors read and approved the final manuscript.

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

- Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981; 147(1):195–7.View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ, et al. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10.View ArticlePubMedGoogle Scholar
- Kent WJ. BLAT, the BLAST-like alignment tool. Genome Res. 2002; 12(4):656–64.View ArticlePubMedPubMed CentralGoogle Scholar
- Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform. 2014; 15(3):343–53.View ArticlePubMedGoogle Scholar
- Vinga S, Almeida J. Alignment-free sequence comparison–a review. Bioinformatics. 2003; 19(4):513–23.View ArticlePubMedGoogle Scholar
- Qi J, Luo H, Hao B. CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res. 2004; 32(Web Server Issue):45.View ArticleGoogle Scholar
- Behnam E, Waterman MS, Smith AD. A geometric interpretation for local alignment-free sequence comparison. J Comput Biol. 2013; 20(7):471–85.View ArticlePubMedPubMed CentralGoogle Scholar
- Torney DC, Burks C, Davison D, Sirotkin KM. Computation of d2: A measure of sequence dissimilarity. Comput DNA. 1990; 7:109–25.Google Scholar
- Reinert G, Chew D, Sun FZ, Waterman MS. Alignment-free sequence comparison (I): Statistics and power. J Comput Biol. 2009; 16(12):1615–34.View ArticlePubMedPubMed CentralGoogle Scholar
- Karlin S, Burge C. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 1995; 11(7):283–90.View ArticlePubMedGoogle Scholar
- Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci USA. 1986; 83(14):5155–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Blaisdell BE. Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding. J Mol Evol. 1985; 21(3):278–88.View ArticleGoogle Scholar
- Sims GE, Jun SR, Wu GA, Kim SH. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci USA. 2009; 106(8):2677–82.View ArticlePubMedPubMed CentralGoogle Scholar
- Forêt S, Kantorovitz MR, Burden CJ. Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences. BMC Bioinforma. 2006; 7(5):1.Google Scholar
- Wu TJ, Huang YH, Li LA. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics. 2005; 21(22):4125–32.View ArticlePubMedGoogle Scholar
- Pevzner PA, Borodovsky MY, Mironov AA. Linguistics of nucleotide sequences i: the significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J Biomol Struct Dyn. 1989; 6(5):1013–26.View ArticlePubMedGoogle Scholar
- Hong J. Prediction of oligonucleotide frequencies based upon dinucleotide frequencies obtained from the nearest neighbor analysis. Nucleic Acids Res. 1990; 18(6):1625–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Arnold J, Cuticchia AJ, Newsome DA, Jennings WW, Ivarie R. Mono-through hexanucleotide composition of the sense strand of yeast DNA: a Markov chain analysis. Nucleic Acids Res. 1988; 16(14):7145–58.View ArticlePubMedPubMed CentralGoogle Scholar
- Avery PJ. The analysis of intron data and their use in the detection of short signals. J Mol Evol. 1987; 26(4):335–40.View ArticlePubMedGoogle Scholar
- Narlikar L, Mehta N, Galande S, Arjunwadkar M. One size does not fit all: On how Markov model order dictates performance of genomic sequence analyses. Nucleic Acids Res. 2013; 41(3):1416–24.View ArticlePubMedGoogle Scholar
- Anderson TW, Goodman LA. Statistical inference about Markov chains. Ann Math Stat. 1957; 28(4):89–110.View ArticleGoogle Scholar
- Billingsley P. Statistical methods in Markov chains. Ann Math Stat. 1961; 32(1):12–40.View ArticleGoogle Scholar
- Tong H. Determination of the order of a Markov chain by Akaike’s information criterion. J Appl Probab. 1975; 12:488–97.View ArticleGoogle Scholar
- Katz RW. On some criteria for estimating the order of a Markov chain. Technometrics. 1981; 23(3):243–9.View ArticleGoogle Scholar
- Peres Y, Shields P. Two new Markov order estimators. arXiv preprint math/0506080. 2005.Google Scholar
- Ren J, Song K, Deng M, Reinert G, Cannon CH, Sun F. Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics. 2016; 32(7):993–1000.View ArticlePubMedGoogle Scholar
- Hoel PG. A test for Markov chains. Biometrika. 1954; 41(3/4):430–3.View ArticleGoogle Scholar
- Billingsley P. Statistical Inference for Markov Processes, vol 2. Chicago: University of Chicago Press; 1961.Google Scholar
- Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988; 2(3):231–9.View ArticlePubMedGoogle Scholar
- Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F. Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol. 2013; 20(2):64–79.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhang ZD, Rozowsky J, Snyder M, Chang J, Gerstein M. Modeling chip sequencing in silico with applications. PLoS Comput Biol. 2008; 4(8):1000158.View ArticleGoogle Scholar
- Wu T, Hsieh Y, Li L. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition. Biometrics. 2001; 57:441–8.View ArticlePubMedGoogle Scholar
- Hide W, Burke J, Davison D. Biological evaluation of
*d*^{2}, an algorithm for high performance sequence comparison. J Comput Biol. 1994; 1:199–215.View ArticlePubMedGoogle Scholar - Schwarz G. Estimating the dimension of a model. Annals Stat. 1978; 6:461–4.View ArticleGoogle Scholar
- Wu T, Burke JP, Davison DB. A measure of dna sequence dissimilarity based on mahalanobis distance between frequencies of words. Biometrics. 1997; 53:1431–9.View ArticlePubMedGoogle Scholar
- Göke J, Schulz MH, Lasserre J, Vingron M. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. Bioinformatics. 2012; 28(5):656–63.View ArticlePubMedPubMed CentralGoogle Scholar
- Blow MJ, McCulley DJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F, et al. Chip-seq identification of weakly conserved heart enhancers. Nat Genet. 2010; 42(9):806–10.View ArticlePubMedPubMed CentralGoogle Scholar
- Visel A, Blow M, Li Z, et al. Chip-seq accurately predicts tissue-specific activity of enhancers. Nature. 2009; 457(7231):854–8.View ArticlePubMedPubMed CentralGoogle Scholar