In addition to gene sequence-driven gene regulatory mechanisms, epigenetic modifications, such as DNA methylation, also participate in the regulation of gene expression induced by signals from the environment. Here, based on genome-wide DNA methylation profiling in gene promoter regions, we present a method to investigate transcription factors that are affected by DNA methylation and that are functionally active in gene expression.

### Transcription factor match score

As a functional protein, a transcription factor has the intrinsic tendency to combine with specific DNA sequences, and we define a value termed ‘transcription factor match score’ to evaluate such binding ability for each transcription factor in the promoter region of each gene. In the TRANSFAC database produced by BIOBASE, position weight matrices (PWMs) for every transcription factor are provided. In these matrices, each row consists of four weights representing different capabilities to combine with nucleotides A, C, G and T, respectively. Using these PWMs, each gene promoter region can be scanned nucleotide by nucleotide with a smooth window to compute transcription factor match scores.

For the *ith* transcription factor with a motif length of *L*, a match value at the *kth* putative binding site in promoter region of the *jth* gene can be calculated as *A*_{
ijk
}.

{A}_{\mathit{\text{ijk}}}={\displaystyle \sum _{l=1}^{L}{a}_{\mathit{\text{jkl}}}{w}_{\mathit{\text{il}}}^{T}}

(1)

where *a*_{
jkl
} is the nucleotide (A, C, G, T) at the *lth* position in the *kth* possible binding site in promoter region of the *jth* gene, and *w*_{
il
} is the *lth* nucleotide in the match weights row vector in the PWM of the *ith* transcription factor. So, suppose the length of the promoter region of the *jth* gene is *N*, *N*-*L*+*1* match values can be calculated and the maximal value is adopted as the match score, *S*_{
ij
}, to reflect the binding ability of the *ith* transcription factor in the *jth* gene promoter region.

{S}_{\mathit{\text{ij}}}=max{A}_{\mathit{\text{ijk}}}

(2)

Hence, for one transcription factor, a collection of match scores can be calculated with respect to every gene promoter region.

Although match scores can approximate the opportunity for a transcription factor to bind to a gene promoter region, it is also meaningful to determine a threshold for match scores to evaluate whether the transcription factor binds and regulates the transcription of specific genes.

### Transcription factor match score threshold

As described in the method proposed by Hertzberg [14], for a given transcription factor, a *Z*-score, which considers the relationship between transcription factor match scores and gene expression levels, was calculated to infer the match score threshold.

Suppose there are *n* genes in a cell. Let *g*_{
1
}, …,*g*_{
n
} be their log expression values, which would follow normal distribution with an average of *μ* and a standard deviation of *σ*. If a threshold, *t*, is set for match scores of a transcription factor, there will be a subgroup of *k* genes, *G*_{
i
}, whose match scores are greater than the threshold *t*, and *G*_{
i
} are assumed to be targets of the transcription factor. The log expression values of these selected genes, *G*_{
i
}, also approximately follow normal distribution for a large number of elements in *G*_{
i
}. The *Z*-score can then be calculated.

Z\left(TF,{G}^{i}\right)=\left(\frac{1}{k}{\displaystyle \sum _{j=1}^{k}{g}_{\mathit{\text{ij}}}-\mu}\right)/\left(\sigma /\sqrt{k}\right)

(3)

The *Z*-score reflects the extent to which average expression of the selected target genes differ from average expression of all genes. In other words, a larger absolute *Z*-score value means a higher relationship between transcription factor match scores and expression of selected genes, and that these genes are more likely to be regulated by the same transcription factor. Therefore, with different thresholds for transcription factor match scores, we can obtain different groups of transcription factor target genes and subsequently different *Z*-score values. Finally, the best threshold can be determined when the maximal *Z*-score value (if positive) or the minimal *Z*-score value (if negative) is found, where the corresponding *Z*-score for the particular transcription factor is called *Z*_{
m
}.

However, without considering the effects of epigenetic modifications, the match score defined above only considers DNA sequences to decide whether a transcription factor binds and regulates the expression of certain genes. This undoubtedly makes subsequent Z-score values inaccurate in the evaluation of transcription factor binding status in gene promoter regions. Hence, we have improved the calculation of the transcription factor match score by adding the effect of epigenetic modifications. However, among the many epigenetic modifications, only DNA methylation was considered because of the requirement for high precision and high resolution data.

### General model of DNA methylation effect

DNA methylation is known to repress transcription factor binding ability [15–23]; therefore, we designed a general model to describe such an effect, where a nonlinear *S*-function is adopted to normalize the effect between 0 and 1. The model consists of two parts. The first sense part uses an inverse *S*-function (Equation 4) to accurately depict the DNA methylation effect. In the equation, *M*_{
jk
} is the methylation level at the *kth* putative transcription factor binding site in the promoter region of the *jth* gene, and *C*_{
i
} and *S*_{
i
} are parameters of the model for the *ith* transcription factor.

{E}_{ijk1}=\frac{{e}^{-\left(\left({M}_{\mathit{\text{jk}}}-{C}_{i}\right)/{S}_{i}\right)}}{1+{e}^{-\left(\left({M}_{\mathit{\text{jk}}}-{C}_{i}\right)/{S}_{i}\right)}}

(4)

Two biological observations are considered here. A larger methylation level results in reduced transcription factor binding ability in a monotonic way and *vice versa*. Next, the sensitivity of the methylation effect on transcription factor binding ability is not the same at different methylation levels. When the methylation level is quite large or small, the effect tends to be saturating to *0* or *1* and insensitive to a change in methylation level. In contrast, the effect will be sensitive to change when the methylation level is around a median value. Here, an inverse *S*-function is capable of fitting such a relationship, which is shown in Figure 1 (solid line, the parameters are assumed as *C*=*0* and *S*=*0.1*). In Figure 1, the methylation level is on the *X* axis and the suppression of transcription factor binding ability is on the *Y* axis. In the general model, two parameters, *C* and *S*, of the inverse *S*-function are adjustable and can be tuned for different transcription factors in a specific cell.

To increase sensitivity of the method, we also propose a second part to the general model to depict the effect of DNA methylation in an antisense way (Figure 1, dashed line), where a normal *S*-function was used (Equation 5). In the model, a large methylation level was assumed to impact less on transcription factor binding ability and *vice versa*.

{E}_{ijk2}=\frac{1}{1+{e}^{-\left(\left({M}_{\mathit{jk}}+{C}_{i}\right)/{S}_{i}\right)}}

(5)

### Transcription factor binding score

With consideration of the DNA methylation effect, the binding ability of the *ith* transcription factor in the *jth* gene promoter region can be modified as binding score *B*_{
ij
} from match score *M*_{
ij
}.

{B}_{\mathit{\text{ij}}}=\underset{k}{max}\left({A}_{\mathit{\text{ijk}}}\times {E}_{\mathit{\text{ijk}}}\right)

(6)

where *A*_{
ijk
} is the sequences match value of the *ith* transcription factor and *E*_{
ijk
} is the effect of DNA methylation on the binding ability of the *ith* transcription factor at the *kth* putative binding site in the promoter region of the *jth* gene.

Similar to the transcription factor match score, by threshold analysis of the transcription factor binding score, a maximal *Z*-score value (if positive) or a minimal *Z*-score value (if negative), known as *Z*_{
m
}, can also be calculated based on the relative analysis of transcription factor binding scores and gene expression profiles. However, in contrast to only one *Z*_{
m
} value based on the match score, there are many *Z*_{
m
} values for a transcription factor when different compositions of parameters *C* and *S* are selected in the model to calculate different binding scores. Then, when parameters *C* and *S* of the model are fixed to obtain an optimized *Z*_{
m
} value, the effect of methylation on transcription factor binding ability can be quantitatively determined.

### Functionally active transcription factors

According to different ways of describing DNA methylation effects on transcription factor binding ability, three *Z*_{
m
} values can be calculated to investigate functionally active transcription factors. Without considering a DNA methylation effect, *Z*_{
m-o
} is computed when transcription factor match scores are adopted. In contrast, with the consideration of a DNA methylation effect using our proposed model, *Z*_{
m-p
} is analyzed from transcription factor binding scores from the sense orientation and *Z*_{
m-q
} is calculated from transcription factor binding scores from the antisense orientation. Furthermore, with different compositions of model parameters, a group of *Z*_{
m-p
} and *Z*_{
m-q
} values can be calculated for each transcription factor. Then, if absolute *Z*_{
m-p
} values are found to be obviously larger than the absolute *Z*_{
m-o
} value and absolute *Z*_{
m-q
} values are always less than the absolute Z_{m-o} value, the transcription factor is considered to be affected by DNA methylation and functionally active.