To detect long regions of focal epigenetic states, we formulated this as a problem of finding an optimal set of disjoint (nonoverlapping) regions in a sequence that maximizes the sum of similarity scores in all regions. Our method calculates a similarity score between a vector of epigenetic modifications at each position and the feature vector of a focal epigenetic state, such as K27HMD, and outputs the set of regions with the highest sum of similarity scores.
Calculating a similarity vector
We need to generate a modification vector at each position from epigenomic signal data. For example, to create benchmark datasets in this study, we binarized the modification signal level at each position using BinarizeBed in ChromHMM [15], which classified the signal at each position into 0 or 1 according to a Poisson background model. Subsequently, we defined a modification vector as the vector with binary scores of modifications at each position.
Definition 1. Let w_{l},w_{2}...,w_{
n
} be nonoverlapping windows of the same length (e.g., 200 bp in this study) in a DNA sequence. Let {s}_{i}^{1},\dots ,{s}_{i}^{k} be binary or realvalued signals of k modifications in window i. The modification vector of w_{
i
} is defined as {M}_{i}=\left({s}_{i}^{1},\dots ,{s}_{i}^{k}\right). Let F denote the feature vector of a focal modification pattern with k elements. The similarity score of M_{
i
} and F is defined as their inner product minus a given threshold τ.
Example. Suppose that k = 3, τ = 1.3, F = (1,1,0), M_{1} = (1,1,0), M_{2} = (1,0,1) and M_{3} = (0,0,1). Similarity scores of F and M_{
i
} are 0.7, 0.3, and 1.3 for i = 1,2,3
When the inner product of M_{
i
} and F is positive for all i = 1,...,n, the optimal set of regions that maximizes the sum of similarity scores in the regions becomes the entire region, [1,n], which may not be informative. If we want to select a set of regions whose modification vectors are closer to the feature vector F, we can set the threshold τ to an appropriate positive value to yield a negative similarity score for the inner product that is lower than τ. Positions with negative similarity scores are less likely to be included in the optimal set of regions. A higher threshold is likely to divide the entire genome into smaller regions with a higher precision, while a lower threshold yields an opposite trend. In this manner, for a series of windows w_{l},w_{2}...,w_{
n
} in a DNA sequence, we generate a series of similarity scores.
Detecting an optimal set of disjoint regions
To detect regions of focal epigenetic states such as K27HMD, we present an algorithm for calculating an optimal set of disjoint regions in a sequence that maximizes the sum of similarity scores for all regions. In addition, to identify sufficiently long regions, we define a minimum length threshold of regions such that each region is longer than or equal to the minimum length. The problem can be defined as follows.
Definition 2. Let L = {L_{
i
}i = 1,2,...,n} be a series of real valued weights L_{
i
} (e.g., similarity scores). Let C be a series of nonoverlapping regions I_{
j
} (j = 1,...k) of L such that the length of each I_{
j
} is greater than or equal to a given minimum threshold m_{1}, and the length of the interval between I_{j1 }and I_{
j
}is greater than or equal to another given minimum threshold m_{0}. That is, C is a series of regions of the form \left\{\left[{a}_{1},{b}_{1}\right],\cdots \left[{a}_{k},{b}_{k}\right]\right\}\left(1\le {a}_{1}<{b}_{1}<{a}_{2}<{b}_{2}\cdots <{a}_{k}<{b}_{k}\le \text{n}\right) such that

1.
a_{
t
} + m_{1}  1 ≤ b_{
t
} for t = 1,...,k (the minimum length constraint on regions),

2.
b_{t1}+m_{0} <a_{
t
} for t = 2,...,k (the minimum length constraint on intervals between regions), and

3.
a_{1} = 1 or a_{1} >m_{0} (the first region start at position 1 or at position larger than m_{0}).
Readers may find the last condition strange because it appears to disallow the situation that the first region starts at position a_{1} >m_{0}. We used the condition to simplify the presentation of our lineartime algorithm, which is described later. To obtain such an optimal series of regions that the first region starts at a_{1} >m_{0}, for example, you can temporarily add m_{0} negative weights in front of L, calculate the optimal series, and restore the coordinate.
To calculate a C that maximizes the sum of weights in C, \sum _{i\in I\in C}{L}_{i}, we used a dynamic programming algorithm developed by Csuros [18]. Here, we outline the algorithm.
Definition 3. We assume that all series meet the conditions given in Definition 2. Let w(C) denote the sum of weights in C, \sum _{i\in I\in C}{L}_{i}. We consider two cases: that in which the last region of C ends at I and that in which it does not. When the last region does not end at i, let {C}_{i,m}^{0} denote a series of regions that maximizes w(C) among all series, such that the last region ends at position b_{
k
} ≤ i  m, where m ≥ 1. When the last region ends at i, let {C}_{i,m}^{1} denote a series of regions that maximizes w(C) among all series, such that the last region is of length ≥ m (≥1); specifically, a_{
k
} + m  1 ≤ i (= b_{
k
}).
Example. When i = 12, and L = (1,1,3,1,1,3,1,1,1,1,2,1), we have
\begin{array}{c}\hfill {C}_{12,1}^{0}=\left\{\left[1,2\right],\left[4.5\right],\left[7,10\right]\right\},\phantom{\rule{1em}{0ex}}{C}_{12,4}^{0}=\left\{\left[1,2\right],\left[4,5\right],\left[7,8\right]\right\},\hfill \\ \hfill {C}_{12,7}^{1}=\left\{\left[1,2\right],\left[4,12\right]\right\},\phantom{\rule{1em}{0ex}}{C}_{12,12}^{1}=\left\{\left[1,12\right]\right\}.\hfill \end{array}
According to this definition, C maximizing w(C) is either {C}_{n,1}^{0} or {C}_{n,{m}_{1}}^{1}. For calculating these two series, we define here w\left({C}_{i,m}^{0}\right) and w\left({C}_{i,m}^{1}\right) recursively for i = 1,...,n and m ≥ 1.
Definition 4. We define the following four types of weight sums, {W}_{short}^{0}\left(i\right), {W}_{long}^{0}\left(i\right), {W}_{short}^{1}\left(i\right), and {W}_{long}^{1}\left(i\right), depending on whether the last region ends at i or not (denoted as 1 or 0, respectively) and whether the minimum length constraint is satisfied or not (denoted as long or short, respectively):
\begin{array}{c}{W}_{short}^{0}\left(1\right)=0,\phantom{\rule{1em}{0ex}}{W}_{short}^{1}\left(1\right)={L}_{1},\hfill \\ {W}_{short}^{0}\left(i\right)=w\left({C}_{i,1}^{0}\right),\phantom{\rule{1em}{0ex}}{W}_{long}^{0}\left(i\right)=w\left({C}_{i,{m}_{0}}^{0}\right),\hfill \\ {W}_{short}^{1}\left(i\right)=w\left({C}_{i,1}^{1}\right),\phantom{\rule{1em}{0ex}}{W}_{long}^{1}\left(i\right)=w\left({C}_{i,{m}_{1}}^{1}\right)\hfill \end{array}
Csuros showed that these four types of weight sums can be calculated recursively as follows [18]:
\begin{array}{c}{W}_{short}^{0}\left(i\right)=max\left\{{W}_{short}^{0}\left(i1\right),{W}_{long}^{1}\left(i1\right)\right\}\phantom{\rule{2.77695pt}{0ex}}\text{for}\phantom{\rule{2.77695pt}{0ex}}i\in \left[2,n\right]\hfill \\ {W}_{short}^{1}\left(i\right)={L}_{i}+max\left\{{W}_{long}^{0}\left(i1\right),{W}_{short}^{1}\left(i1\right)\right\}\phantom{\rule{2.77695pt}{0ex}}\text{for}\phantom{\rule{2.77695pt}{0ex}}i\in \left[2,n\right]\hfill \\ {W}_{long}^{0}\left(i\right)={W}_{short}^{0}\left(i{m}_{0}+1\right)\phantom{\rule{2.77695pt}{0ex}}\text{for}\phantom{\rule{2.77695pt}{0ex}}i\in \left[{m}_{0},n\right]\hfill \\ {W}_{long}^{1}\left(i\right)={W}_{short}^{1}\left(i{m}_{1}1\right)+\sum _{j=i{m}_{1}+2}^{i}{L}_{j}\phantom{\rule{2.77695pt}{0ex}}\text{for}\phantom{\rule{2.77695pt}{0ex}}i\in \left[{m}_{1},n\right]\hfill \end{array}
Recall that C maximizing w(C) is either {C}_{n,1}^{0} or {C}_{n,{m}_{1}}^{1}. From {W}_{long}^{1}\left(n\right), we can build the series of regions, {C}_{n,{m}_{1}}^{1}, by tracing back the process of calculating {W}_{long}^{1}\left(n\right). Similarly, from {W}_{short}^{0}\left(n\right), we can obtain {C}_{n,1}^{0}.
We implemented the above idea. We call the program CSMinfinder.