Detecting and correcting the binding-affinity bias in ChIP-seq data using inter-species information

Background: Transcriptional gene regulation is a fundamental process in nature, and the experimental and computational investigation of DNA binding motifs and their binding sites is a prerequisite for elucidating this process. ChIP-seq has become the major technology to uncover genomic regions containing those binding sites, but motifs predicted by traditional computational approaches using these data are distorted by a ubiquitous binding-affinity bias. Here, we present an approach for detecting and correcting this bias using inter-species information.

Results: We find that the binding-affinity bias caused by the ChIP-seq experiment in the reference species is stronger than the indirect binding-affinity bias in orthologous regions from phylogenetically related species. We use this difference to develop a phylogenetic footprinting model that is capable of detecting and correcting the binding-affinity bias. We find that this model improves motif prediction and that the corrected motifs are typically softer than those predicted by traditional approaches.

Conclusions: These findings indicate that motifs published in databases and in the literature are artificially sharpened compared to the native motifs. These findings also indicate that our current understanding of transcriptional gene regulation might be blurred, but that it is possible to advance this understanding by taking into account inter-species information available today and even more in the future.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2682-6) contains supplementary material, which is available to authorized users.

Here we parameterize $p(Y_n^u \mid M_n = 0, \theta)$ and $p(X_n^{u,o} \mid Y_n^u, M_n = 0, \theta)$ as

$$p(Y_n^u \mid M_n = 0, \theta) = \pi_0^{Y_n^u}$$

$$p(X_n^{u,o} \mid Y_n^u, M_n = 0, \theta) = \gamma_o \cdot \pi_0^{X_n^{u,o}} + (1 - \gamma_o) \cdot \delta_{X_n^{u,o} = Y_n^u}$$

according to the F81 model, where the base distribution of each position of the background sequence is denoted by $\pi_0$, the probability of a nucleotide $a$ in the background sequence is denoted by $\pi_0^a$, and the substitution probability from the primordial species to species $o$ is denoted by $\gamma_o$.
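To make the background model concrete, the following minimal Python sketch evaluates the F81 probabilities for a single alignment column. It assumes a star phylogeny in which every observed species descends directly from the primordial species, and it marginalizes over the unobserved primordial base. The function names and the dictionary representation of the base distribution are illustrative choices, not taken from the original work.

```python
from typing import Dict, Sequence

NUCLEOTIDES = "ACGT"


def f81_transition(ancestral: str, observed: str, gamma: float,
                   pi0: Dict[str, float]) -> float:
    """p(X = observed | Y = ancestral) under F81.

    With probability gamma the base is redrawn from pi0,
    otherwise the ancestral base is copied unchanged.
    """
    copied = 1.0 if observed == ancestral else 0.0
    return gamma * pi0[observed] + (1.0 - gamma) * copied


def background_column_likelihood(column: Sequence[str],
                                 gammas: Sequence[float],
                                 pi0: Dict[str, float]) -> float:
    """Probability of one background alignment column.

    Assumption: star phylogeny, so the observed bases are independent
    given the primordial base y, which is summed out.
    """
    total = 0.0
    for y in NUCLEOTIDES:
        term = pi0[y]  # p(Y = y | M = 0)
        for x, gamma in zip(column, gammas):
            term *= f81_transition(y, x, gamma, pi0)
        total += term
    return total
```

With all substitution probabilities equal to 1 the species become independent draws from the background distribution, and with all of them equal to 0 every species must carry the primordial base.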
Likelihood of a motif-bearing alignment

In the data-generating process for motif-bearing alignments we sample alignments until one of them is accepted. Mapping this into a likelihood requires Felsenstein's pulley principle [1], which allows us to select any particular species as the root of the tree. In this case it is convenient to select the reference species as the root. Thus, the likelihood can be expressed as

$$p(X_n \mid M_n = 1, \theta) = \sum_{\ell_n = 1}^{L_n - W + 1} p(X_n^{\cdot,1} \mid M_n = 1, \ell_n) \cdot [\ldots]$$

where the base distributions of the positions $1, \ldots, W$ of the binding sites are denoted by $\pi_1, \ldots, \pi_W$ and the probability of a nucleotide $a$ in the binding site at position $w$ is denoted by $\pi_w^a$. Given $\pi$, the binding-site start position $\ell_n \in \{1, \ldots, L_n - W + 1\}$, and $M_n = 1$, each single-nucleotide alignment is independent of any other single-nucleotide alignment, and we obtain

$$p(X_n \mid M_n = 1, \theta) = \sum_{\ell_n = 1}^{L_n - W + 1} [\ldots] \, p(X_n^{u,o} \mid Y_n, M_n = 1, \ell_n) \, p(\ell_n \mid M_n = 1, \theta).$$

We need to determine the probability of a particular nucleotide in a specific position of the reference species after selection, that is $p(X_n^{u,1} \mid M_n = 1, \ell_n)$. On the one hand, selection does not affect the probability distribution of the nucleotides outside the binding site. Thus, for $u < \ell_n$ or $u \geq \ell_n + W$ we have $p(X_n^{u,1} = a \mid M_n = 1, \ell_n) = \pi_0^a$. On the other hand, for nucleotides in the binding site, the distribution after filtering is $p(X_n^{u,1} \mid M_n = 1, \ell_n) = [\ldots]$

The probabilities for the nucleotides in the ancestral sequence and in the non-reference species are given by the F81 model. In particular, for the ancestral sequence and for the non-reference species [...]

Finally, since we assume binding sites to be uniformly distributed, we have $p(\ell_n \mid M_n = 1, \theta) = \frac{1}{L_n - W + 1}$. This completes the specification of the likelihood function.
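The role of the uniform prior over binding-site positions can be illustrated for the reference species alone. The sketch below sums over all admissible start positions, uses the background distribution outside the site, and uses a per-position site distribution inside it. It is a simplified illustration, not the full alignment likelihood: the orthologous species and the ancestral sequence are omitted, positions are 0-based, and `site_dists` is a hypothetical input standing for whatever post-selection distribution applies inside the binding site.

```python
from typing import Dict, Sequence


def reference_sequence_likelihood(seq: str,
                                  site_dists: Sequence[Dict[str, float]],
                                  pi0: Dict[str, float]) -> float:
    """Likelihood of a motif-bearing reference sequence.

    Marginalizes over the binding-site start position with a uniform
    prior 1 / (L - W + 1), where L = len(seq) and W = len(site_dists).
    """
    width = len(site_dists)
    n_positions = len(seq) - width + 1
    if n_positions < 1:
        raise ValueError("sequence is shorter than the binding site")
    prior = 1.0 / n_positions
    total = 0.0
    for start in range(n_positions):
        lik = 1.0
        for u, base in enumerate(seq):
            if start <= u < start + width:
                lik *= site_dists[u - start][base]  # inside the site
            else:
                lik *= pi0[base]                    # background position
        total += prior * lik
    return total
```

If the site distributions equal the background distribution, the sum collapses to the plain background likelihood, which is a convenient sanity check.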

Example interpretation of difference logos
We give an example interpretation of the difference logos in row 1 of Figure S5. Here, the motif inferred by M − − (column 1) is used as the reference and is compared to the motifs inferred by M − BA (column 2), M C − (column 3), and M C BA (column 4). As indicated by the background colors, the smallest difference is observed for M C − and the largest difference is observed for M − BA . None of the difference logos shows notable differences at motif positions 1 to 3 and 17 to 21.
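The per-position comparison behind a difference logo can be sketched as follows. The code computes the Jensen-Shannon divergence between corresponding motif columns together with the signed change of each base probability, where positive values correspond to bases gained relative to the reference (above the abscissa) and negative values to bases lost (below the abscissa). The function names are illustrative, and the exact scaling used to draw the logos in the figures is not reproduced here.

```python
import math
from typing import Dict, List, Sequence, Tuple

NUCLEOTIDES = "ACGT"


def kl_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits, between 0 and 1."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)


def compare_motifs(reference: Sequence[Sequence[float]],
                   other: Sequence[Sequence[float]]
                   ) -> List[Tuple[float, Dict[str, float]]]:
    """Per position: (JS divergence, signed probability change per base)."""
    result = []
    for p, q in zip(reference, other):
        change = {b: qb - pb for b, pb, qb in zip(NUCLEOTIDES, p, q)}
        result.append((js_divergence(p, q), change))
    return result
```

Positions with a divergence close to zero would appear as nearly empty stacks, which matches the description of positions without notable differences.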
In the difference logos for M − BA and M C BA we observe a decrease of the most abundant bases (below the abscissa) and a gain of the remaining bases (above the abscissa). In contrast, the difference logo for M C − shows mainly a gain of cytosine at motif position 6 and a loss of the remaining bases. All other motif positions show much smaller Jensen-Shannon divergences. We observe the opposite behaviour in the difference logo for M − BA at motif position 6. This discrepancy seems to be compensated in the difference logo for M C BA . This compensation cannot be observed for the other motif positions. Here, the differences are similar to those in the difference logo for M − BA .

[...] Table S2), indicating that, despite the oversimplified assumption of the Boltzmann distribution, the new models that take into account the binding-affinity bias are always more realistic than the traditional models that neglect this bias.

[...] Table S3), indicating that, despite the oversimplified assumption of the Boltzmann distribution, the new models that take into account the binding-affinity bias are always more realistic than the traditional models that neglect this bias.

[...] Table S4), indicating an enrichment of high-affinity binding sites in all cases. In addition, the information contents of motifs inferred by M − − are higher than the information contents of motifs inferred by M C BA in all cases, indicating that the superposition of the contamination bias and the binding-affinity bias leads to a sharpening of the motifs.

[...] We perform a stratified repeated random sub-sampling validation and show the means and their standard errors for each of the four models and each of the five data sets. The classification performance decreases in particular for the models with contamination bias (M C − and M C BA ) compared to the results achieved for five species (see Figure S1). We find that M C BA yields a higher classification performance than M C − and that M − BA yields a higher classification performance than M − − for all five data sets, as we also find in the case of five species.

[...] Figure S2). We find that M C BA yields a higher classification performance than M C − and that M − BA yields a higher classification performance than M − − for all five data sets, as we also find in the case of five species.

Figure S17: Information contents of the inferred motifs from data sets of pair-wise alignments of human and monkey. We compute the information contents of the motifs inferred by M − − , M − BA , M C − , and M C BA for the data sets of each of the transcription factors CTCF, GABP, NRSF, SRF, and STAT1, each consisting of alignments of human and monkey only. We perform a stratified repeated random sub-sampling validation and show the means and their standard errors for each of the four models and each of the five data sets. The information contents estimated on data from human and monkey range from 0.5 to 0.95 and are higher than the results achieved for five species (0.25 to 0.65, see Figure S3). This indicates that the binding-affinity bias is corrected to a lesser extent because the phylogenetic distance between human and monkey is not sufficient. However, as found in the case of five species, the information contents of motifs inferred by M − BA and M C BA are significantly smaller than the information contents of motifs inferred by M − − and M C − in each of the five data sets, indicating an enrichment of high-affinity binding sites in all cases. In addition, the information contents of motifs inferred by M − − are higher than the information contents of motifs inferred by M C BA in all cases, indicating that the superposition of the contamination bias and the binding-affinity bias leads to a sharpening of the motifs.

[...] The averages of the estimated inverse temperature β range from 1.3 to 3.3 and are typically higher than the results achieved for five species (1.3 < β < 2.1, see Figure S4). However, as found in the case of five species, we find that β is significantly greater than 1 in both cases, indicating an enrichment of high-affinity binding sites in all five data sets.

Table S1: P-values for the differences of the information contents of the motifs in human, monkey, dog, cow, and horse calculated by the Wilcoxon Signed-Rank Test.
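The link between sharpening and information content can be illustrated with a small sketch. It computes the information content of a motif as the mean over positions of 2 minus the Shannon entropy in bits, and it sharpens or softens a column by raising its probabilities to a power β and renormalizing. Both choices are assumptions made for illustration: the normalization of the information content used in the figures is not stated in this section, and the power transform is only one simple reading of a Boltzmann-type inverse temperature.

```python
import math
from typing import List, Sequence


def column_information(column: Sequence[float]) -> float:
    """Information content of one column in bits: log2(4) - entropy."""
    entropy = -sum(p * math.log2(p) for p in column if p > 0)
    return math.log2(len(column)) - entropy


def mean_information_content(motif: Sequence[Sequence[float]]) -> float:
    """Average per-position information content of a motif (assumed scale)."""
    return sum(column_information(c) for c in motif) / len(motif)


def boltzmann_transform(column: Sequence[float], beta: float) -> List[float]:
    """Assumed form: q(a) proportional to p(a) ** beta.

    beta > 1 sharpens the column, beta < 1 softens it, beta = 1 keeps it.
    """
    weights = [p ** beta for p in column]
    z = sum(weights)
    return [w / z for w in weights]
```

Under these assumptions, a column transformed with β greater than 1 has a higher information content than the original column, which is the direction of the sharpening described in the caption.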
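For readers unfamiliar with the test named in the caption of Table S1, the following sketch computes an exact two-sided Wilcoxon signed-rank p-value for paired samples. It is a minimal version that rejects zero differences and tied absolute differences, since the tie handling and software used for the published p-values are not stated here. The function name is illustrative.

```python
from typing import Sequence, Tuple


def wilcoxon_signed_rank_exact(x: Sequence[float],
                               y: Sequence[float]) -> Tuple[int, float]:
    """Return (W+, exact two-sided p-value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    abs_diffs = [abs(d) for d in diffs]
    if 0 in abs_diffs or len(set(abs_diffs)) != len(abs_diffs):
        raise ValueError("this sketch handles neither zero differences nor ties")
    n = len(diffs)
    # Rank the absolute differences (1 = smallest) and sum ranks of positives.
    order = sorted(range(n), key=lambda i: abs_diffs[i])
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    # counts[s] = number of subsets of {1, ..., n} whose ranks sum to s,
    # which gives the null distribution of W+ when all signs are equally likely.
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    total = 2 ** n
    lower = sum(counts[:w_plus + 1]) / total   # P(W+ <= observed)
    upper = sum(counts[w_plus:]) / total       # P(W+ >= observed)
    return w_plus, min(1.0, 2.0 * min(lower, upper))
```

With five pairs that all differ in the same direction, the smallest attainable two-sided p-value is 2/32, so small sample sizes limit the significance such a test can reach.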