Comprehensive genome-wide transcription factor analysis reveals that a combination of high affinity and low affinity DNA binding is needed for human gene regulation
© Wang et al.; licensee BioMed Central Ltd. 2015
Published: 11 June 2015
High-throughput in vivo protein-DNA interaction experiments are currently widely used in gene regulation studies. Hitherto, comprehensive data analysis remains a challenge and for that reason most computational methods only consider the top few hundred or thousand strongest protein binding sites whereas weak protein binding sites are completely ignored.
A new biophysical model of protein-DNA interactions, BayesPI2+, was developed to address the above-mentioned challenges. BayesPI2+ can be run in either a serial computation model or a parallel ensemble learning framework. BayesPI2+ allowed us to analyze all binding sites of the transcription factors, including weak binding that cannot be analyzed by other models. It is evaluated in both synthetic and real in vivo protein-DNA binding experiments. Analysing ESR1 and SPIB in breast carcinoma and activated B cell-like diffuse large B-cell lymphoma cell lines, respectively, revealed that the concerted binding to high and low affinity sites correlates best with gene expression.
BayesPI2+ allows us to analyze transcription factor binding on a larger scale than hitherto achieved. By this analysis, we were able to demonstrate that genes are regulated by concerted binding to high and low affinity binding sites. The program and output results are publicly available at: http://folk.uio.no/junbaiw/BayesPI2Plus.
High-throughput in vivo protein-DNA binding experiments such as ChIP-chip and ChIP-seq are currently widely used to study gene regulation. Identification of transcription factor (TF) binding sites is an essential step to understand TF function and gene regulatory networks . In such analyses, raw reads of ChIP-seq experiments are mapped to a human reference genome. Subsequently a peak-calling program is used to detect putative TF binding sites. However, two problems arise by doing so. First, the identified TF binding sites are dependent of the threshold value used by the peak-calling program. If stringent criteria are applied, many potentially functional TF binding sites with weak binding (i.e. low binding affinities or low ChIP-seq tag count) may be eliminated . If on the other hand non stringent cutoff values are chosen, many false binding sites are identified. Second, ChIP-seq experiment may not necessarily identify the direct TF-DNA interactions due to the inherent inability of current ChIP-seq or ChIP-chip technology to distinguish direct versus indirect protein-DNA interactions [3, 4]. So far, the first problem has been ignored . Several studies have addressed the second problem only partially. Vallania et al.  focused on identifying only functional direct TF binding sites by using a computational method based on weight matrices, comparative genomics, and gene expression profiles; Gordan et al  developed a method to separate direct TF binding from indirect TF binding in yeast ChIP-chip data but it has not been used to analyze human ChIP-seq experiments. In addition, the method requires that both in vivo binding data and in vitro DNA binding motifs are available; more recently, Bailey et al  proposed an interesting statistical method "Central Motif Enrichment Analysis" (CentriMo) to predict direct DNA binding sites from ChIP-seq data. Unfortunately, the program only works with equal-length genomic sequences and does not provide information about functional indirect TF binding.
In the present work, a more comprehensive computational approach, BayesPI2+, is developed for analyzing in vivo high-throughput protein-DNA interaction data, where the above-mentioned problems are solved. Especially, non stringent peak calling cutoff values can be used allowing the inclusion of many weak protein binding sites in the data analysis. The newly developed BayesPI2+ is a C program that is based on a biophysical model of protein-DNA interactions [8, 9] and that can be run in both a serial computation model and in a parallel ensemble learning framework. BayesPI2+ estimates protein binding energy matrix (PBEM), protein concentration (or chemical potential) in a solution , and differential binding affinity (dbA) of protein-DNA interactions, through in vivo protein-DNA interaction experiments. Based on these predicted features, BayesPI2+ allowed to distinguish high and low affinity protein binding sites called here for reasons of simplicity, type I and type II TF binding . The novel method was first tested in both synthetic ChIP-seq data and several real in vivo protein-DNA binding data sets. Then, gene regulatory difference between the predicted type I and type II TF binding sites was investigated in two human ChIP-seq experiments, the estrogen receptor α (ESR1) using the MCF7 breast cancer cell line and the SPIB TF using the HBL1 activated B cell-like diffuse large B-cell lymphoma (ABC DLBCL) cell line. ESR1 is a member of the nuclear receptor family of ligand-activated TFs and is involved in the development and progression of breast cancer . SPIB belongs to ETS-family of TFs and is required for the survival of ABC DLBCL cells . Putative SPIB type I and type II binding sites were verified by in vitro protein-DNA interaction experiments. Subsequently, we tested whether these binding sites have an effect on gene expression. Of interest, our analysis showed that the binding of TF to both type I and type II binding sites is important for gene expression.
Distinguishing type I versus type II TF binding sites using synthetic ChIP-seq data sets
In Additional file 1: Figure S1, test error rates by using a parallel ensemble BayesPI2+ learning are shown. In this analysis, both meta-PBEM and meta-chemical-potential (mean of PBEMs from multiple predictions and the corresponding chemical potential) were computed by five times random splitting training and testing (i.e. 50%) data. These predictions were completed in five to eight minutes. The overall prediction accuracy with this method is almost the same as what was obtained by a serial computation. Of interest, the longer the binding motif, the better the test error rates. TP and TN of ACE2 with a binding motif of 6bp are between 70% and 90% whereas the TP and TN of the other TFs with binding motifs between 8bp and 12bp are approximately 95%. Thus, both serial and parallel ensemble BayesPI2+ learning are able to distinguish type I versus type II TF binding sites in synthetic ChIP-seq datasets.
Distinguishing type I versus type II TF binding sites by using in vivo protein-DNA binding data
Prediction of PBEM in five data sets of various sizes
Distinguishing type I versus type II TF binding sites in human ChIP-seq data sets
Distinguishing type I versus type II TF binding sites in yeast ChIP-chip data sets
ChIP-seq is a high resolution experiment that identifies putative protein binding site sequences of short and equal-length, approximately 200 bp. We therefore also tested whether BayesPI2+ can predict type I TF binding sites by using unequal-length genomic sequences. Here, a series computation of BayesPI2+ was applied to four yeast ChIP-chip experiments with TFs ACE2, SWI4, INO4, and XBP1 in rich medium conditions . The putative protein binding sites are positioned on ~6725 yeast intergenic regions of unequal-length varying between 50 bp to ~ 2700 bp, with a median length of ~360 bp. The results indicate that only three of four yeast TFs (ACE2, SWI4 and INO4) obtained good PBEM by using BayesPI2+ (i.e. the motif similarity scores [8, 17] between the best PBEM and the SGD consensus sequences was >0.8; Additional file 1: Figure S7). A plot of P-values, reflecting the confidence level of detecting a binding site,  against the number of binding sites with p values below the defined p-value can be made for the four yeast ChIP-chip experiments (Additional file 1: Figure S8). The plot shows that there are more than 100 binding sites with confidence level P < 0.005 for ACE2, SWI4 and INO4, respectively. However, none of the XPB1 binding targets passed the same level of significance of binding. The poor binding prediction for XBP1 can be explained either by true weak protein binding or poor quality of the ChIP-chip experimental data. Therefore, only ACE2, SWI4 and INO4 were considered in the subsequent data analysis.
Comparison of predicted type I TF binding sites identified by BayesPI2+ and CentriMo in mouse ChIP-seq data sets
First, position specific weight matrices (PSWM) of three mouse TFs (MYC, STAT3, and OCT4) were obtained from JASPAR database. Based on the known PSWM of each mouse TF, type I (direct) TF binding sites were predicted by applying BayesPI2+ and CentriMo  on all called peaks (i.e. 500 bp genomic sequence centered on each peak) of three mouse ChIP-seq data sets, respectively. Here, default parameter settings were used in both programs. Results suggest that around 83%, 88%, and 92% of CentriMo predicted type I MYC, STAT3, and OCT4 binding sites are recovered by BayesPI2+, respectively. Thus, for predicting type I (direct) TF binding sites in in vivo protein-DNA interaction experiment, there is a good agreement between the new biophysical model BayesPI2+ and the published statistical method CentriMo.
Differences in expression of genes with type I and type II TF binding sites for ESR1 and SPIB transcription factors
Functional ESR1 target genes are regulated by both type I and type II TF binding sites
Robustness of the predicted ESR1 type I and type II TF binding site analysis
To test the robustness of the binding site prediction, a repeated analysis of gene expression profiles, ChIP-seq tag density, and histone modifications for the three classes of ESR1 target genes was performed, but this time with removal of the binding sites with low affinity. ESR1 binding sites of which both ChIP-seq tag count was less than 1.5 fold of the minimum tag count for type II TF binding sites and of which the dbA level was smaller than the maximum dbA in type II TF binding sites, were considered low affinity binding sites and removed. After filtering, 9355 and 2786 ESR1 type I and type II TF binding sites were remaining, respectively. Compared to the results before changing the thresholds, few ESR1 type I TF binding sites (~300) but more than half of the type II TF binding sites were lost. GREAT was then also used to find putative target genes for the new set of ESR1 binding sites. This time, 922 'C' genes, 3552 'B' genes, and 2450 'A' genes were found. Gene expression profile analysis, ChIP-seq tag density enrichment, and histone modifications analysis for these classes of genes are given in Additional file 1: Figures S12, S13, and S14, respectively. All results are consistent with those for the previous analyses, with the exception of a similar tag density for 'B' and 'C' genes after removal of low binding affinity sites. Functional annotation of the new three classes of ESR1 target genes by DAVID tool revealed that 'A' genes are highly enriched in pathways active in cancer, including the MAPK signaling pathway (Additional file 1: Table S1a and S1b).
Functional SPIB target genes are regulated by both type I and type II TF binding sites
Robustness of the predicted SPIB type I and type II TF binding site analysis
This analysis was performed following the same filtering criteria as for removing low affinity ESR1 binding sites. After filtering, 29378 and 4020 SPIB binding sites were remaining for type I and type II TF binding, respectively. 'C' genes are increased almost two-fold (1453 genes), 'A' genes were slightly reduced (6740 genes remaining), and 'B' genes were decreased by about one-forth (4742 genes remaining). Further analysis of lenalidomide treated ABC DLBCL microarray gene expression data and SPIB knockdown experiments in ABC DLBCL cells show that 'A' genes have the most significant response. These genes but not 'B' or 'C' genes, showed increased expression in lenalidomide-treated cells and decreased expression in knockdown experiments (Additional file 1: Figures S15 and S16). Of interest, 'A' genes were highly enriched in the same pathways (Additional file 1: Table S2b) as identified in the previous section.
The nature of type II ESR1 and SPIB binding
Verification of predicted type I and type II SPIB binding sites by EMSA
In order to verify SPIB binding sites, a two-step filtering procedure was applied to 6911 putative SPIB target genes ('A' genes) with SPIB binding motifs. For each putative SPIB target gene, a two-sample Kolmogorov-Smirnov goodness-of-fit hypothesis test was performed between time-series microarray gene expression profiles (lenalidomide treated Oci-ly10 or TMD8 cells) and the gene expression profile (HBL1 cells) after down-regulation of SPIB by shRNA [11, 12]. The assumptions were that genes controlled by SPIB are differentially expressed (i.e. P-value of KSTest < = 0.05) in ABC DLBCL cells treated as the above-mentioned two conditions, and that SPIB regulated genes are those genes with either a type I or a type II SPIB binding sites located between 5 kb upstream and 1 kb downstream to the transcription start site. In total, 1687 genes fulfilled these criteria (Additional file 1: Table S3). To test whether those genes contained SPIB binding sites and whether type II binding sites represented alternative binding sites, as predicted by our analysis, EMSA was performed on 10 randomly selected type I SPIB binding sites and 10 type II SPIB binding sites (Additional file 1: Tables S4 and S5). For this analysis, 50bp DNA sequences were selected of which the center sequence corresponded with the identified peak in our analysis.
A new biophysical model, named BayesPI2+ was designed to analyze more extensively TF binding sites in in vivo protein-DNA interaction data than hitherto achieved with existing models. To achieve this BayesPI2+ was designed to analyze a large number of called ChIP-seq peaks (i.e. hundreds of thousands) simultaneously, including weak binding sites. Additionally, the differential binding affinity (dbA) for each called peak was computed to identify high affinity (type I) and low affinity (type II) TF binding sites. In this work, the strength of type I (direct) and type II (indirect/alternative) binding sites is not in proportion to the number of measured ChIP-seq tag counts. In other words, TF acts independently in type I (high affinity) binding but it requires a co-factor to stabilize the protein-DNA interaction in type II (low affinity) binding. An initial test was performed on synthetic ChIP-seq datasets with success, yielding a true negative rate of >90%. For the test, a serial version and a parallel ensemble learning of BayesPI2+ was used with similar prediction accuracies. However, in the test, the parallel version of BayesPI2+ is at least three times faster than the serial one in completing the prediction. Thus, for very large input data (i.e. more than ten thousands of called ChIP-seq peaks), a parallel ensemble learning of BayesPI2+ is recommended because it saves the overall waiting time considerably by splitting the job to multiple computer processors.
In a subsequent analysis of five human ChIP-seq datasets, the predicted PBEMs between a serial BayesPI2+ computation and a parallel ensemble learning of BayesPI2+ were compared. Though the best PBEM identified by serial computation and the meta-PBEM estimated by parallel ensemble learning were similar, a positional variation in information content of the predicted meta-PBEMs were observed across different sizes of input data (Figure 2 and Additional file 1: Figure S3). The positional variation in information content for PBEM reaches saturation point when the number of input peaks exceeds 25% of all called peaks. This indicates that various binding positions within a PBEM are important in the formation of the protein-DNA complex, during in vivo protein-DNA interactions [23, 24]. By analyzing all possible binding sites with our method, two types could be discerned. Particularly, the dbA level of each TF, estimated by using the corresponding meta-PBEM, allowed to recognize low affinity TF binding, called type II TF binding sites (Figure 3 and Additional file 1: Figure S6). Additionally, the inferred meta-PBEMs from ChIP-chip data and the binding consensus sequence motifs from online databases  are significantly enriched in the type I TF binding sites when analyzing yeast ChIP-chip experiments for TFs ACE2, SWI4, and INO4. Of interest, the latter analyses were performed on ChIP-chip data, thus on genomic sequences of unequal length. Therefore, our proposed new method to distinguish two types of TF binding sites is applicable to both equal-length and unequal-length genomic sequences. Of note, BayesPI2 has a very good agreement (> 80% overlap) with CentriMo concerning the analysis of type I (direct) TF binding.
To test the function of predicted type I and type II TF binding sites, we analyzed their effect on gene regulation. This was done for two human TFs, ESR1 and SPIB in breast cancer and diffuse large B cell lymphoma, respectively. ESR1 encodes the estrogen receptor alpha and contributes to cell growth in breast carcinoma . SPIB is a B cell transcription factor that is highly expressed in a clinically aggressive subtype of DLBCL, with activated B cell immunophenotype . Genes with the putative TF binding sites were classified into three groups: 'A' genes, containing both type I and type II TF binding sites; 'B' genes containing only type I TF binding sites; and 'C' genes, containing only type II TF binding sites. Of interest, 'A' genes show the strongest expression in the breast cancer cell line MCF7 and the highest ESR1 ChIP-seq tag density. In addition, we found lower nucleosome density, higher RNA polymerase II expression and histone acetylation for 'A' genes than for 'B' or 'C' genes. It is known that decreased nucleosome density and increased RNA Pol-II binding indicate gene transcription [25, 26] whereas acetylation of histones (i.e. H3K14ac, and H3K9ac) indicate functional TF binding  and are important for activation of promoters and enhancers . Functional annotation of the three types of ESR1 target genes also suggests that, only 'A' genes are highly enriched in pathways of importance to breast cancer such as the MAPK signaling pathway . Analogous to what is observed for ESR1 in breast carcinoma, SPIB 'A' genes show the highest gene expression in diffuse large B cell lymphoma, and the strongest response after silencing of SPIB by RNA interference in ABC DLBCL cell lines. Functional annotation of SPIB 'A' genes suggests that they are significantly involved in pathways of importance for DLBCL lymphoma biology, such as the B cell receptor signaling pathway. The B cell receptor signaling pathway is activated in diffuse large B-cell lymphoma and contributes to tumor cell survival and proliferation . Thus, our results indicate that ESR1 and SPIB regulation of gene expression requires both type I and type II binding. A repeat analysis of genes controlled by either ESR1 or SPIB binding sites, after filtering out the lowest binding affinity sites, yielded similar results: 'A' genes are the likely functional targets of these TFs.
Though the effect of TF on genes requires direct protein-DNA interaction, it has been well-known that additional and different interactions of the TF with the gene are needed for regulation . We show for the first time the scale of this effect by our genome-wide analysis [1, 31]. Type II TF binding sites for ESR1 and SPIB were further investigated to find out whether these constituted indirect binding sites or alternative binding sites with different affinity. No ESR1 similar motif was found for type II ESR1 binding sites. This indicates that type II ESR1 likely represents indirect binding to DNA. By contrast, a similar consensus motif was found for both type I and type II SPIB binding sites, except for one nucleotide difference in the area adjacent to core SPIB motif GGAA (Figure 9). These results indicate that the type II SPIB binding site is an alternative low affinity SPIB binding site. The findings were also confirmed by EMSA: about 90% and 50% of tested sequences with putative type I and type II SPIB binding sites, respectively, resulted in band shift in EMSA (Figure 10). These results also confirm that type II TF binding is rather weak, which requires co-factors to enhance the binding to DNA sequence. Notwithstanding, our analysis shows that type I and type II binding sites act concertedly in gene regulation.
In summary, we developed a new program BayesPI2+ to analyze all potential TF binding sites in the genome, without filtering. By doing so, we found two types of functional TF binding sites, both indicate to be important for gene regulation. BayesPI2+ can be used in a serial computation or a parallel ensemble approach. Hitherto, other methods only used the top few hundred or thousand high affinity binding sites which we showed results in loss of valuable information.
Synthetic ChIP-seq datasets
Four synthetic ChIP-seq datasets were generated by the Monte Carlo sampling method . Each dataset has 500 putative target sites with DNA sequence length 500 bp. One of four yeast TFs (i.e. ACE2, SWI4, INO4 and XBP1 with binding motif length of 6 bp, 8 bp, 10 bp and 12 bp, respectively) was randomly positioned in a DNA sequence with a synthetic Z-score greater than 0.25. The synthetic Z-scores were produced by the MATLAB build-in random number generator. The synthetic ChIP-seq datasets were used to evaluate prediction accuracy of BayesPI2+ because type I (direct) binding targets can be easily recovered later. The final prediction accuracy of each synthetic ChIP-seq data is shown in box plots, where the error rates are estimated by randomly splitting the data to training (400 target sites) and test sets (100 target sites) five times.
Human ChIP-seq datasets
Called peaks of ChIP-seq experiments for human CTCF in CD4+ T cell, NRSF in Jurkat T lymphoblast cell, and STAT1 in interferon y-stimulated Hela S3 cell were obtained from previous publications , where raw reads were mapped to hg18 reference genome. There are around 5814, 26815 and 73957 called peaks by SISSRS  for NRSF, CTCF, and STAT1, respectively. Raw ChIP-seq reads of human ERα in breast cancer cells (MCF-7) under both E2 treated and control conditions, and human SPIB in ABC DLBCL cell line HBL1 and control condition were downloaded from earlier publications [12, 14]. These raw reads were aligned to hg19 reference genome by BWA program , and the peaks were called by SISSRS (i.e. 16720 and 43135 peaks for ERα and SPIB, respectively). For a particular TF ChIP-seq experiment under a specific condition or cell lines, both DNA sequences of putative protein binding sites and the corresponding tag densities were used by BayesPI2+ to infer protein binding energy matrix (PBEM) and the chemical potential. For each binding site, a 200 bp DNA sequence centered on the genomic coordinate of called peak was extracted from reference genome. GREAT  tool was used to assign protein binding sites to the putative target genes. The assignment of TF binding sites with nearby genes is a two step process: first, every gene is assigned a basal regulatory domain (i.e. 5kb upstream and 1kb downstream of the transcription start site); then, the gene regulatory domain is extended in both directions to the nearest gene's basal domain but no more than the maximum extension (i.e. 1000 kb) in one direction.
Mouse ChIP-seq data sets
Three mouse embryonic stem cell (ES) TF ChIP-seq data sets (STAT3, MYC, and OCT4/POU5F1) were downloaded from a previous publication . DNA sequence centered on each called peak (500 bp) and the normalized tag counts are obtained from papers  and , respectively.
Human histone modification and microarray gene expression datasets
Histone modification datasets (e.g., H3K14ac, H3K27me3, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me2 and H3K9me3), DNA accessible regions (FAIRE), and Pol-II data under both E2 treated and control conditions in MCF-7 cells were obtained from previous publications [14, 19, 20]. Pre-processing of the datasets and enrichment tests (t-test and Mann-Whitney U test) between E2 treated and control conditions were described in detail in previous studies [25, 26]. Microarray gene expression profiles for E2 treated MCF-7 cells, lenalidomide induced ABC DLBCL cells and shRNA transduction treated SPIB in HBL1 cells are taken from published works [11, 12]. The DAVID tool  was used to perform functional analysis of identified protein target genes.
Yeast ChIP-chip datasets
Genome-wide in vivo protein-DNA interaction datasets of 4 yeast S. cerevisiae TFs (ACE2, SWI4, INO4 and XBP1) in rich medium conditions, the corresponding intergenic DNA sequences, and the P-values of significance of intergenic DNA sequences bound by each given TF were obtained from the work of Harbison et al. . In the original paper, p < 0.001 was used to determine DNA regions were bound by a TF. Here, a confidence level p < 0.1 is used to select putative TF binding targets (i.e. intergenic DNA sequences with either low or high affinity TF binding sites) that need to be classified into type I and type II binding targets for each given TF, respectively.
Biophysical model to distinguish type I versus type II TF-DNA interaction
Protein-DNA binding probability
Following previous descriptions of biophysical modeling of protein-DNA interactions [8, 36], the probability of a DNA sequence S to be bound by a protein is , where E represents PBEM (i.e. the estimated binding energy at each nucleotide; protein binding energy matrix) and μ is the concentration of proteins (or chemical potential) in a solution. This is a Femi-Dirac form of protein binding probability. If the concentration of protein is very low then the Fermi-Dirac function can be approximated by a Maxwell-Boltzmann protein binding function  . In earlier works, BayesPI  and MatrixREDUCE  had used Femi-Dirac form and Maxwell-Boltzmann function to model genome-wide in vivo protein-DNA interaction experiments, respectively, by combining measured TF occupancy data and PBEM prediction for each TF.
Computational implementation of BayesPI2+
BayesPI2+ is a pure C program with several build in functions: for example, various normalization options for the input data, DNA strand specific calculation, DNA binding affinity calculation based on PBEM, chemical potential estimation by using known PBEM, and prediction of PBEM with multiple motif length in parallel etc. In this work, both a serial computation (similar to original BayesPI MATLAB program) and a parallel ensemble learning approach (estimating meta-PBEM; mean of multiple predictions) are developed. Prediction of PBEM from measured high-throughput sequencing data is achieved by a Bayesian Hierarchical nonlinear regression model , where a Fermi-Dirac form of protein binding probability was adopted. Detailed description of the algorithm is available in the earlier works [8, 39], which is further improved in the serial version of BayesPI2+. The new parallel ensemble learning of BayesPI2+ is accomplished by a combination of C and Perl programs, where the C program is used to perform nonlinear parameter fitting and Perl scrip files are utilized to perform random selection of input data, split data to multiple computer processes, calculate motif similarity scores , and obtain a meta-PBEM by aligning multiple predicted PBEMs.
Estimation of meta-chemical potential
In BayesPI2+, a new function is added to estimate TF chemical potential (TF concentration μ) for known PBEM, where the PBEM of protein binding probability P(S) is fixed but the TF chemical potential μ is estimated from in vivo protein-DNA binding data by Bayesian nonlinear parameter fitting. In this way, TF meta-chemical potential is calculated based on a given meta-PBEM. Generally, an ensemble learning of meta-PBEM, by randomly drawing a portion of input data multiple times in parallel, is more robust than a serial prediction of PBEM from one data set. That is because the input data quality (outliers) may reduce the robustness of nonlinear parameter fitting for both PBEM and chemical-potential . In addition, if the input data size is large then the nonlinear parameter fitting will suffer significantly from a long serial computation. Thus, the parallel ensemble learning will not only speed up the calculation, but also improve the prediction accuracy.
Calculation of TF binding affinity with either Femi-Dirac or Maxwell-Boltzmann function
The second new function in BayesPI2+ is TF binding affinity estimation, which is given by , represents a DNA sequence, N is the length of the sequence, m is the length of TF binding site, w is a weight coefficient and equals one, and is the protein-DNA binding probability. If TF chemical potential is included in the calculation then Femi-Dirac form of protein binding probability will be used by . Otherwise, Maxwell-Boltzmann function will be chosen in if information of TF chemical potential μ is not available. The latter implementation is equivalent to the previous work of matrixREDUCE . In this work, if both PBEM and chemical potential are inferred from high-throughput protein-DNA interaction data, then the Femi-Dirac function is used to compute TF binding affinity. For consensus sequence motifs that collected from on-line databases, Maxwell-Boltzmann function will be used to estimate the TF binding affinity.
Computation of differential binding affinity - dbA
For each TF, its protein binding affinity to a DNA sequence S i (i.e. 200bp DNA sequence centered on the called peak) is computed based on the above-mentioned description. Differential binding affinity (dbA) to the DNA sequence S i is defined as , where represents estimated protein binding affinity at rth randomly mutated DNA sequence S i , and R is the total number of random shuffling of DNA sequence S i ,. Expected P-value of dbA i to a sequence S i is the ratio between the number of and the total number of random shuffling. Usually, if a DNA sequence S i contains a protein binding motif (type I TF binding) then . On the contrary, if there is not a direct protein binding motif (type II TF binding) in the DNA sequence S i then . In the latter case, the type II protein-DNA interaction is treated as a protein interacts to randomly mutated DNA sequences (i.e. DNA sequence of each protein binding site is randomly shuffled R times). Generally, the smaller the P-value the better the type I TF-DNA binding, the larger the P-value the better the type II TF-DNA binding. In this work, these calculations are split to multiple computer processes and run in parallel, which significantly reduces the overall waiting time.
A serial computation. Using a serial version of BayesPI2+ to distinguish type I versus type II protein-DNA interactions: 1) to predict the best representative PBEMs with various lengths for all called peaks; 2) to calculate motif similarity scores  between the predicted PBEMs and a golden standard one (i.e. a position specific weight matrix (PSWM) from either JASPAR  or TRANSFAC ), and a PBEM with the highest motif similarity score is selected; 3) to compute protein binding affinities for all called peaks by using the above-chosen PBEM and its chemical potential; 4) to calculate dbA for all called peaks based on the same PBEM, where 200bp DNA sequences that centered on each peak are randomly shuffled 2000 times; 5) to compute expected P-value of dbA for all called peaks (the expected chance of type I binding at a target site); 6) to classify all called peaks to two groups (i.e. type I and type II protein-DNA interactions) by applying fuzzy neural gas algorithm [31, 43] (Additional file 1: supplementary methods) on the dbA, where the classification between type I and II TF binding can be further improved by adding expected P-values (i.e. for human TFs, type I TF bindings with expected p < 0.09). It is worth noting that dbA may reveal the true protein binding pattern in different genomic regions because the effect of variation of background binding is removed.
A parallel ensemble learning framework
Though BayesPI2+ can handle a large number of called peaks in one run, the computational cost is increased significantly when the number of input peaks reaches hundreds or thousands. In order to avoid such hindrance by the big data, a parallel ensemble learning version of BayesPI2+ is built: 1) to randomly select a subset of all called peaks (i.e. 25%), the random selection is repeated multiple times (i.e. 10 times); 2) to estimate PBEM and the corresponding parameters based on each randomly selected subset, for example, ten computer processes control ten randomly selected subsets and run step 1 of the serial BayesPI2+ computation in parallel; 3) to compute motif similarity scores between all predicted PBEMs and a golden standard one (i.e. a PSWM from JASPAR), and to obtain a meta-PBEM by aligning good PBEMs (i.e. motif similarity scores >0.7) against the golden standard one; 4) to infer meta-chemical-potential for the new meta-PBEM based on all called peaks; 5) to calculate expected P-value and dbA for all called peaks based on inferred meta-PBEM and meta-chemical potential (i.e. steps 3, 4, and 5 of the serial BayesPI2+ computation); 6) to classify all called peaks to two groups (i.e. type I and type II protein-DNA interactions) by applying fuzzy neural gas algorithm on dbA, and the classification between type I and II TF binding can be further improved by using expected P-values (i.e. for human TFs, type I TF bindings with expected p < 0.09). All calculations were done on the Linux cluster, where computer nodes have a minimum 64 GB RAM, and 16 physical CPU cores and are connected by FDR (56Gps) InfiniBand.
Electrophoretic mobility shift assay (EMSA) for detecting protein-DNA interactions
EMSA  was performed with the BioRad Mini Protean gel system (BioRad, USA) at 90V for 1 hour. The binding reactions were performed for 30 minutes with the Odyssey™ EMSA Buffer Kit (LI-COR, USA) according to the manufacturer recommendations with some modifications. Binding reaction: 1x binding buffer, 2.5mM DTT/0.25% Tween20, 2.5% glycerol, 8ng/μl of SPIB protein, 250nM of probe and a total incubation volume of 20 μl. Products were resolved by polyacrylamide gel electrophoresis using a 10% Mini-PROTEAN® TBE Precast Gel (BioRad), and 0.5 × TBE buffer, then analyzed by staining with GelRed Nucleic Acid Gel Stain, (Biotium, USA) and visualized by UV lamp. Purified SPIB (Human) Recombinant Protein (P01) was purchased from Abnova (Taiwan). Double stranded (ds) 50bp DNA probes were annealed from ssDNA oligonucleotides in 1xTE buffer pH 8 with addition of 50mM NaCl, final probe concentration was 1.5 μM. Probes were heated to 90oC for 5 min, then slowly cooled down to RT in 2h. The 10% TBE gels were prerun in 0.5x TBE buffer for 1h at 90V. Specificity of binding was demonstrated by mutation of the putative SPIB core binding motif in two probes, GG was changed to TT (GGAA→TTAA).
JBW is supported by Norwegian Cancer Society (DNK 2192630-2012-33376 and DNK 2192630-2013) and Norwegian Research Council NOTUR project (nn4605k). JBW would like to thank Radiumhospital Legater for the support of attending ICIBM 2014 conference.
The publication costs for this article were funded by Norwegian Cancer Society (DNK 2192630).
This article has been published as part of BMC Genomics Volume 16 Supplement 7, 2015: Selected articles from The International Conference on Intelligent Biology and Medicine (ICIBM) 2014: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S7.
- Farnham PJ: Insights from genomic profiling of transcription factors. Nat Rev Genet. 2009, 10 (9): 605-616. 10.1038/nrg2636.PubMed CentralView ArticlePubMedGoogle Scholar
- Tanay A: Extensive low-affinity transcriptional interactions in the yeast genome. Genome Research. 2006, 16 (8): 962-972. 10.1101/gr.5113606.PubMed CentralView ArticlePubMedGoogle Scholar
- Gordan R, Hartemink AJ, Bulyk ML: Distinguishing direct versus indirect transcription factor-DNA interactions. Genome Research. 2009, 19 (11): 2090-2100. 10.1101/gr.094144.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009, 10 (10): 669-680. 10.1038/nrg2641.PubMed CentralView ArticlePubMedGoogle Scholar
- Weirauch MT, Cote A, Norel R, Annala M, Zhao Y, Riley TR, et al: Evaluation of methods for modeling transcription factor sequence specificity. Nature Biotechnology. 31 (2): 126-134.Google Scholar
- Vallania F, Schiavone D, Dewilde S, Pupo E, Garbay S, Calogero R, et al: Genome-wide discovery of functional transcription factor binding sites by comparative genomics: the case of Stat3. Proc Natl Acad Sci U S A. 2009, 106 (13): 5117-5122. 10.1073/pnas.0900473106.PubMed CentralView ArticlePubMedGoogle Scholar
- Bailey TL, Machanick P: Inferring direct DNA binding from ChIP-seq. Nucleic Acids Res. 40 (17): e128-Google Scholar
- Wang J Morigen: BayesPI - a new model to study protein-DNA interactions: a case study of condition-specific protein binding parameters for Yeast transcription factors. BMC bioinformatics. 2009, 10: 345-10.1186/1471-2105-10-345.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang J: Quality versus accuracy: result of a reanalysis of protein-binding microarrays from the DREAM5 challenge by using BayesPI2 including dinucleotide interdependence. BMC bioinformatics. 15 (1): 289-Google Scholar
- Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, et al: Diversity and complexity in DNA recognition by transcription factors. Science. 2009, 324 (5935): 1720-1723. 10.1126/science.1162327.PubMed CentralView ArticlePubMedGoogle Scholar
- Cicatiello L, Mutarelli M, Grober OM, Paris O, Ferraro L, Ravo M, et al: Estrogen receptor alpha controls a gene network in luminal-like breast cancer cells comprising multiple transcription factors and microRNAs. Am J Pathol. 2010, 176 (5): 2113-2130. 10.2353/ajpath.2010.090837.PubMed CentralView ArticlePubMedGoogle Scholar
- Yang Y, Shaffer AL, Emre NC, Ceribelli M, Zhang M, Wright G, et al: Exploiting synthetic lethality for the therapy of ABC diffuse large B cell lymphoma. Cancer Cell. 21 (6): 723-737.Google Scholar
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008, 36 (16): 5221-5231. 10.1093/nar/gkn488.PubMed CentralView ArticlePubMedGoogle Scholar
- Welboren WJ, van Driel MA, Janssen-Megens EM, van Heeringen SJ, Sweep FC, Span PN, Stunnenberg HG: ChIP-Seq of ERalpha and RNA polymerase II defines genes differentially responding to ligands. EMBO J. 2009, 28 (10): 1418-1428. 10.1038/emboj.2009.88.PubMed CentralView ArticlePubMedGoogle Scholar
- Moorman C, Sun LV, Wang J, de Wit E, Talhout W, Ward LD, et al: Hotspots of transcription factor colocalization in the genome of Drosophila melanogaster. Proc Natl Acad Sci U S A. 2006, 103 (32): 12027-12032. 10.1073/pnas.0605003103.PubMed CentralView ArticlePubMedGoogle Scholar
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, et al: Transcriptional regulatory code of a eukaryotic genome. Nature. 2004, 431 (7004): 99-104. 10.1038/nature02800.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen CY, Tsai HK, Hsu CM, May Chen MJ, Hung HG, Huang GT, Li WH: Discovering gapped binding sites of yeast transcription factors. Proc Natl Acad Sci U S A. 2008, 105 (7): 2527-2532. 10.1073/pnas.0712188105.PubMed CentralView ArticlePubMedGoogle Scholar
- Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, Mira NP, et al: The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Res. 2006, 34 (Database): D446-D451.PubMed CentralView ArticlePubMedGoogle Scholar
- Vermeulen M, Eberl HC, Matarese F, Marks H, Denissov S, Butter F, et al: Quantitative interaction proteomics and genome-wide profiling of epigenetic histone marks and their readers. Cell. 2010, 142 (6): 967-980. 10.1016/j.cell.2010.08.020.View ArticlePubMedGoogle Scholar
- Joseph R, Orlov YL, Huss M, Sun W, Kong SL, Ukil L, et al: Integrative model of genomic factors for determining binding site selection by estrogen receptor-alpha. Molecular Syst Biol. 2010, 6: 456-View ArticleGoogle Scholar
- Zhao Y, Stormo GD: Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nature Biotechnology. 2011, 29 (6): 480-483.PubMed CentralView ArticlePubMedGoogle Scholar
- Morris Q, Bulyk ML, Hughes TR: Jury remains out on simple models of transcription factor specificity. Nature Biotechnology. 2011, 29 (6): 483-484. 10.1038/nbt.1892.PubMed CentralView ArticlePubMedGoogle Scholar
- Kechris KJ, van Zwet E, Bickel PJ, Eisen MB: Detecting DNA regulatory motifs by incorporating positional trends in information content. Genome Biology. 2004, 5 (7): R50-10.1186/gb-2004-5-7-r50.PubMed CentralView ArticlePubMedGoogle Scholar
- Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB: Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evolutionary Biology. 2003, 3: 19-10.1186/1471-2148-3-19.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang J: Computational study of associations between histone modification and protein-DNA binding in yeast genome by integrating diverse information. BMC Genomics. 2011, 12: 172-10.1186/1471-2164-12-172.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang J, Lan X, Hsu PY, Hsu HK, Huang K, Parvin J, et al: Genome-wide analysis uncovers high frequency, strong differential chromosomal interactions and their associated epigenetic patterns in E2-mediated gene regulation. BMC Genomics. 2011, 14: 70-10.1186/1471-2164-14-70.View ArticleGoogle Scholar
- Karmodiya K, Krebs AR, Oulad-Abdelghani M, Kimura H, Tora L: H3K9 and H3K14 acetylation co-occur at many gene regulatory elements, while H3K14ac marks a subset of inactive inducible promoters in mouse embryonic stem cells. BMC Genomics. 2011, 13: 424-View ArticleGoogle Scholar
- Izrailit J, Berman HK, Datti A, Wrana JL, Reedijk M: High throughput kinase inhibitor screens reveal TRB3 and MAPK-ERK/TGFβ pathways as fundamental Notch regulators in breast cancer. Proc Natl Acad Sci U S A. 2013, 110 (5): 1714-1719. 10.1073/pnas.1214014110.PubMed CentralView ArticlePubMedGoogle Scholar
- Bogusz AM, Baxter RH, Currie T, Sinha P, Sohani AR, Kutok JL, Rodig SJ: Quantitative immunofluorescence reveals the signature of active B-cell receptor signaling in diffuse large B-cell lymphoma. Clin Cancer Res. 2012, 18 (22): 6122-6135. 10.1158/1078-0432.CCR-12-0397.View ArticlePubMedGoogle Scholar
- von Hippel PH, Berg OG: On the specificity of DNA-protein interactions. Proc Natl Acad Sci U S A. 1986, 83 (6): 1608-1612. 10.1073/pnas.83.6.1608.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang J: A new framework for identifying combinatorial regulation of transcription factors: a case study of the yeast cell cycle. J Biomed Inform. 2007, 40 (6): 707-725. 10.1016/j.jbi.2007.02.003.View ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMedGoogle Scholar
- McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, et al: GREAT improves functional interpretation of cis-regulatory regions. Nature Biotechnology. 2010, 28 (5): 495-501. 10.1038/nbt.1630.View ArticlePubMedGoogle Scholar
- Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, et al: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008, 133 (6): 1106-1117. 10.1016/j.cell.2008.04.043.View ArticlePubMedGoogle Scholar
- Huang da W, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, et al: DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Research. 2007, 35 (Web Server issue): W169-W175.PubMed CentralView ArticlePubMedGoogle Scholar
- Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. Journal of Molecular Biology. 1987, 193 (4): 723-750. 10.1016/0022-2836(87)90354-8.View ArticlePubMedGoogle Scholar
- Dekoninck A, Calomme C, Nizet S, de Launoit Y, Burny A, Ghysdael J, Van Lint C: Identification and characterization of a PU.1/Spi-B binding site in the bovine leukemia virus long terminal repeat. Oncogene. 2003, 22 (19): 2882-2896. 10.1038/sj.onc.1206392.View ArticlePubMedGoogle Scholar
- Foat BC, Morozov AV, Bussemaker HJ: Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics. 2006, 22 (14): e141-e149. 10.1093/bioinformatics/btl223.View ArticlePubMedGoogle Scholar
- Wang J: The effect of prior assumptions over the weights in BayesPI with application to study protein-DNA interactions from ChIP-based high-throughput data. BMC Bioinformatics. 2010, 11: 412-10.1186/1471-2105-11-412.PubMed CentralView ArticlePubMedGoogle Scholar
- Ward LD, Bussemaker HJ: Predicting functional transcription factor binding through alignment-free and affinity-based analysis of orthologous promoter sequences. Bioinformatics. 2008, 24 (13): i165-i171. 10.1093/bioinformatics/btn154.PubMed CentralView ArticlePubMedGoogle Scholar
- Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research. 2004, 32 (Database issue): D91-D94.PubMed CentralView ArticlePubMedGoogle Scholar
- Wingender E, Dietze P, Karas H, Knuppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Research. 1996, 24 (1): 238-241. 10.1093/nar/24.1.238.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang J, Bo TH, Jonassen I, Myklebost O, Hovig E: Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data. BMC Bioinformatics. 2003, 4: 60-10.1186/1471-2105-4-60.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.