To detect long regions of focal epigenetic states, we formulated this as a problem of finding an optimal set of disjoint (non-overlapping) regions in a sequence that maximizes the sum of similarity scores in all regions. Our method calculates a similarity score between a vector of epigenetic modifications at each position and the feature vector of a focal epigenetic state, such as K27HMD, and outputs the set of regions with the highest sum of similarity scores.
Calculating a similarity vector
We need to generate a modification vector at each position from epigenomic signal data. For example, to create benchmark datasets in this study, we binarized the modification signal level at each position using BinarizeBed in ChromHMM [15], which classified the signal at each position into 0 or 1 according to a Poisson background model. Subsequently, we defined a modification vector as the vector with binary scores of modifications at each position.
Definition 1. Let wl,w2...,w
n
be non-overlapping windows of the same length (e.g., 200 bp in this study) in a DNA sequence. Let be binary or real-valued signals of k modifications in window i. The modification vector of w
i
is defined as . Let F denote the feature vector of a focal modification pattern with k elements. The similarity score of M
i
and F is defined as their inner product minus a given threshold τ.
Example. Suppose that k = 3, τ = 1.3, F = (1,1,0), M1 = (1,1,0), M2 = (1,0,1) and M3 = (0,0,1). Similarity scores of F and M
i
are 0.7, -0.3, and -1.3 for i = 1,2,3
When the inner product of M
i
and F is positive for all i = 1,...,n, the optimal set of regions that maximizes the sum of similarity scores in the regions becomes the entire region, [1,n], which may not be informative. If we want to select a set of regions whose modification vectors are closer to the feature vector F, we can set the threshold τ to an appropriate positive value to yield a negative similarity score for the inner product that is lower than τ. Positions with negative similarity scores are less likely to be included in the optimal set of regions. A higher threshold is likely to divide the entire genome into smaller regions with a higher precision, while a lower threshold yields an opposite trend. In this manner, for a series of windows wl,w2...,w
n
in a DNA sequence, we generate a series of similarity scores.
Detecting an optimal set of disjoint regions
To detect regions of focal epigenetic states such as K27HMD, we present an algorithm for calculating an optimal set of disjoint regions in a sequence that maximizes the sum of similarity scores for all regions. In addition, to identify sufficiently long regions, we define a minimum length threshold of regions such that each region is longer than or equal to the minimum length. The problem can be defined as follows.
Definition 2. Let L = {L
i
|i = 1,2,...,n} be a series of real valued weights L
i
(e.g., similarity scores). Let C be a series of non-overlapping regions I
j
(j = 1,...k) of L such that the length of each I
j
is greater than or equal to a given minimum threshold m1, and the length of the interval between Ij-1 and I
j
is greater than or equal to another given minimum threshold m0. That is, C is a series of regions of the form such that
-
1.
a
t
+ m1 - 1 ≤ b
t
for t = 1,...,k (the minimum length constraint on regions),
-
2.
bt-1+m0 <a
t
for t = 2,...,k (the minimum length constraint on intervals between regions), and
-
3.
a1 = 1 or a1 >m0 (the first region start at position 1 or at position larger than m0).
Readers may find the last condition strange because it appears to disallow the situation that the first region starts at position a1 >m0. We used the condition to simplify the presentation of our linear-time algorithm, which is described later. To obtain such an optimal series of regions that the first region starts at a1 >m0, for example, you can temporarily add m0 negative weights in front of L, calculate the optimal series, and restore the coordinate.
To calculate a C that maximizes the sum of weights in C, , we used a dynamic programming algorithm developed by Csuros [18]. Here, we outline the algorithm.
Definition 3. We assume that all series meet the conditions given in Definition 2. Let w(C) denote the sum of weights in C, . We consider two cases: that in which the last region of C ends at I and that in which it does not. When the last region does not end at i, let denote a series of regions that maximizes w(C) among all series, such that the last region ends at position b
k
≤ i - m, where m ≥ 1. When the last region ends at i, let denote a series of regions that maximizes w(C) among all series, such that the last region is of length ≥ m (≥1); specifically, a
k
+ m - 1 ≤ i (= b
k
).
Example. When i = 12, and L = (1,1,-3,1,1,-3,1,1,1,1,-2,1), we have
According to this definition, C maximizing w(C) is either or . For calculating these two series, we define here and recursively for i = 1,...,n and m ≥ 1.
Definition 4. We define the following four types of weight sums, , , , and , depending on whether the last region ends at i or not (denoted as 1 or 0, respectively) and whether the minimum length constraint is satisfied or not (denoted as long or short, respectively):
Csuros showed that these four types of weight sums can be calculated recursively as follows [18]:
Recall that C maximizing w(C) is either or . From , we can build the series of regions, , by tracing back the process of calculating . Similarly, from , we can obtain .
We implemented the above idea. We call the program CSMinfinder.