An improved ChIP-seq peak detection system for simultaneously identifying post-translational modified transcription factors by combinatorial fusion, using SUMOylation as an example

Background Post-translational modification (PTM) of transcriptional factors and chromatin remodelling proteins is recognized as a major mechanism by which transcriptional regulation occurs. Chromatin immunoprecipitation (ChIP) in combination with high-throughput sequencing (ChIP-seq) is being applied as a gold standard when studying the genome-wide binding sites of transcription factor (TFs). This has greatly improved our understanding of protein-DNA interactions on a genomic-wide scale. However, current ChIP-seq peak calling tools are not sufficiently sensitive and are unable to simultaneously identify post-translational modified TFs based on ChIP-seq analysis; this is largely due to the wide-spread presence of multiple modified TFs. Using SUMO-1 modification as an example; we describe here an improved approach that allows the simultaneous identification of the particular genomic binding regions of all TFs with SUMO-1 modification. Results Traditional peak calling methods are inadequate when identifying multiple TF binding sites that involve long genomic regions and therefore we designed a ChIP-seq processing pipeline for the detection of peaks via a combinatorial fusion method. Then, we annotate the peaks with known transcription factor binding sites (TFBS) using the Transfac Matrix Database (v7.0), which predicts potential SUMOylated TFs. Next, the peak calling result was further analyzed based on the promoter proximity, TFBS annotation, a literature review, and was validated by ChIP-real-time quantitative PCR (qPCR) and ChIP-reChIP real-time qPCR. The results show clearly that SUMOylated TFs are able to be pinpointed using our pipeline. Conclusion A methodology is presented that analyzes SUMO-1 ChIP-seq patterns and predicts related TFs. Our analysis uses three peak calling tools. The fusion of these different tools increases the precision of the peak calling results. TFBS annotation method is able to predict potential SUMOylated TFs. Here, we offer a new approach that enhances ChIP-seq data analysis and allows the identification of multiple SUMOylated TF binding sites simultaneously, which can then be utilized for other functional PTM binding site prediction in future.


Introduction
SUMOylation was initially identified as a reversible posttranslational modification that controls a variety of cellular processes including replication, chromosome segregation, and DNA repair [1][2][3]. The growing list of SUMO substrates includes various transcription factors (TFs) and chromatin remodeling molecules, which, upon SUMOylation, are often associated with transcriptional repression [4], and the maintenance of heterochromatin silencing [5,6]. The deregulation of SUMOylation has been associated with a number of diseases including cancer [7][8][9][10]. SUMO has been found in all eukaryotes, but not in prokaryotes. Furthermore, the global regulatory role of SUMO in gene expression and protein interactions has been shown to be richly exploited in lower eukaryotes such as yeast [11,12]. While numerous studies have provided considerable insight into the regulation of SUMOylated proteins in higher eukaryotes, their scope usually has been limited to a single host factor. The underlying complexity of SUMOylation has been extended by identifying the downstream consequences of these non-covalent interactions with effectors via SUMO interaction motifs (SIMs) [13], with the SIMs being critical to both SUMO conjugation and SUMO-mediated effects. Exploring the functions of SUMO conjugation and interaction during epigenetic regulation in mammalian cells will considerably enhance our knowledge of transcriptional regulation of SUMOylation in higher eukaryotes.
SUMOylation of transcriptional regulators results in alterations to the transcription regulation of individual genes, while the SUMOylation of epigenetic regulators brings about long-range chromatin remodeling, and hence global changes in expression. When chromatin structures are regulated by SUMO, it has been found to involve SUMO targeting of histone deacetylases and this then results in histone deacetylation, chromosome condensation, and transcriptional repression. At the same time, numerous transcription factors have been reported to be SUMO substrates, including Elk-1 [14], SP1 [15], AP2 [16], and many others. The study of epigenetic regulation when there is PTM of regulatory transcription factors is still in its infancy and there remains a need for new and improved screening tools as well as the development of assay pipelines.
Recently, chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq), has become a powerful and high resolution method that allows the study of the impact of TFs and their co-regulators in higher eukaryotes in a genome-wide manner [17,18]. During the ChIP process, DNA is initially cross-linked in a specific sample to the protein that binds to it. This crosslinked DNA is then broken into fragments and immunoprecipitation with a specific antibody for the DNA-binding protein follows; finally, the associated DNA is identified after de-crosslinking. High-throughput sequencing of short tags (reads) can be achieved using the resulting DNA population. ChIP-seq involves the short read (30~100 bp) sequencing of ChIP-enriched DNA fragments. These reads are subsequently aligned to a reference genome such as hg19. The first step of all ChIP-seq analyses is peak detection. Peaks are regions that are markedly enriched in terms of read density based on the ChIP-seq data. Potential transcription factor binding sites (TFBS) can only be precisely identified when the true peaks are detected first by "peak calling" tools.
Many peak calling algorithms have been developed for identifying ChIP-enriched regions from ChIP-seq experiments from a single TF [19]. In addition to commercial software, there are more than 30 open source programs available. Many reviews of the major steps in ChIP-seq analysis are available in literature [20][21][22]. These offer a variety of strategies that allow evaluation of each system and answer questions as to how to choose the most appropriate software from among the many available peak calling tools. Although current software is well established and can find the TFBS of single TFs, annotation of multiple functional TFBSs using the same PTM remains challenging [23]. TFs are known to recognize more than one motif and similar motifs can be recognized by different TFs. Simultaneously detecting the binding sites of multiple TFs, including SUMOylated TFs, is therefore a difficult task. Another big challenging is that the SUMO enriched sites represent not only directly SUMO modified TFs but also SUMOylated cofactors that are able to bind to the chromatin bound TFs (Figure 1). Therefore there is a wide range of discordance among the peaks identified by different software systems. This paper attempts to address the problem of predicting potential chromatin bound SUMOylated TFs and identifying their binding sites. To overcome the difficulty of simultaneously identifying SUMOylated TFs in ChIP-seq experiments, we investigated and compared the peak detection results of various different software approaches [24]. We selected four mainstream tools, Model-base Analysis of ChIP-seq (MACS) [25], T-PIC [26] , BayesPeak [27], and CisGenome [28]. MACS models uses the shift size of ChIP-seq tags to identify peaks and utilizes a dynamic Poisson distribution to highlight local biases in the genome. The "shift size" strategy of MACS helps to identify board and blunt peaks. However, this strategy may loss many sharp ones. T-PIC identifies significant peaks using topological data analysis and provides a non-parametric approach that is statistically sound and robust in relation to experimental noise.
The T-PIC strategy is therefore able to identify most sharp peaks. Combine these two methods help us identifying most potential chromatin binding peaks. However, these two approaches may also identify some false positive peaks. The false positive peaks can be eliminated by combing peak detection methods, such as BayesPeak and CisGenome, that is specifically designed for identify the false positive peaks. BayesPeak was developed to model data structure using Bayesian statistical techniques. CisGenome was developed to model data structure using conditional binomial model. Combining BayesPeak or CisGenome with MACS and T-PIC using combinatorial fusion analysis [29], the results show that MACS*CisGenome*T-PIC (M*C*T) is superiors over MACS*T-PIC*-BayesPeak (M*T*B). The M*C*T pipeline is able to improve peak detection in ChIP-seq data significantly. This approach should help produce great advances in our understanding of how SUMO modifications contribute to important biological processes.

Results
Global identification of SUMO-1 peaks in a primary effusion lymphoma (PEL) cell line, BCBL-1 We used a ChIP-verified polyclone antibody specifically against SUMO-1 to immunoprecipitate SUMO-1 from a B lymphoma cell line, BCBL-1. Size-selected (400 bp) DNA fragments were excised and short reads (100 bp) obtained from both ends (paired-end reads) via high throughput sequencing-by-synthesis on an Illumina ® Genome Analyzer IIx System. Analysing and interpreting ChIP-seq data typically involves pre-treating the raw reads using multiple applications, which can include mapping of sequences to the human genome, filtering and quality control. Around 63 million reads were mapped to the human genome sequence, hg19. Details of the number of reads that underwent data pre-process are presented in Table 1. After the density profiles were generated, the focus shifted to localizing the potential peaks. Here, we selected MACS, T-PIC, BayesPeak and CisGenome to localize the potential binding sites for delineated SUMO-1 targeting TFBSs. As shown in Table 2, the peaks calling results were found to be very different when the four different methods were compared. Specifically, MACS (M) detected 53,972 peaks with the longest regions (average 810 bp). T-PIC (T) detected the shortest peaks (average 442 bp). BayesPeak (B) and CisGenome (C) that were primarily designed to identify false positive peaks can be used to eliminate untrue peaks. Peaks sets identified by different methods were annotated using TFBSs (see materials and methods). T-PIC detected the greatest number of TFBS (477,353) in the whole genome, while MACS found the highest number of TFBS (27,615) in promoter regions. An example of the peaks identified by individual methods and their annotation by TFBS is presented in Figure 2. Consistent with other SUMO-1 ChIP-seq datasets (GEO ACCESSION: GSM1012775), we identified peaks in the promoter region of the NOSIP gene.

Intersection of different peak calling tools represents positive results
To evaluate the various individual systems and different combinations, we used four indexes: P promoter , P TFBS , P tp_p , and P tp_t (see Methods section). The higher value of each index means that meaningful peaks were detected either in the promoter region or annotated TFBS. The results of the four individual tools are recorded in Tables Table 3). We choice the top three tools (C, M, and T) to do the following steps. All four combinations of intersection (*) and union (+) are recorded in Table 4 and 5, respectively. When we used the union and the intersection strategies to analysis the peaks of two or three tools, the average precision of intersection (M*C*T) was found to be the most effective method with highest average precision (45.8%) ( Table 3 and 4). Three pools of SUMO-1 putative peaks in the promoter region were intersected to give 4,834 peaks. Among them, 3,604 peaks contain TFBSs. In total, 3,860 SUMO-1 related TFBSs were identified from these 3,604 peaks. This result indicates that the intersection method is able to extract functional peaks from a massive range of peaks.
Validation the data from ChIP-seq for ELK-1 binding sites with SUMO-1 enrichment by real-time qPCR To confirm the SUMO-1 enrichment at the ELK-1 binding sites, we randomly pick up three ELK-1 binding regions where the SUMO-1 peaks had been identified by the ChIP-seq assay and design primers for qPCR assay. The SUMO-1 enrichment in promoter regions of TARS2, NDUFB7 and ADAMTS10 was then validated using a ChIP sample and real-time qPCR. Consistent with the ChIP-seq results, the three ELK-1 binding regions tested here showed significant enrichment for SUMO-1 compares to IgG control ( Figure 3A to 3C). ChIP-reChIP analysis was used to further confirm the colocalization of SUMO-1 and ELK-1 on ELK-1 binding sits of TARS2, NDUFB7 and ADAMTS10 promoter region with SUMO-1 enrichment. Control rabbit IgG and SUMO-1 antibody were used in the first ChIP, followed by reChIP using antibody for ELK-1. Real-time qPCR analyses of the first ChIP and reChIP product with TARS2 and NDUFB7 promoter-specific primers indicates that the SUMO-1 and ELK-1 are co-localized in the TARS2 and NDUFB7 promoter region ( Figure 4A and 4B). Maybe due to the low PCR efficacy of ADAMTS10 promoter-specific primers, qPCR data show low detection value in the input of ADAMTS10 promoter region and no signal in ChIP and reChIP samples.
Potential SUMO-1 targeting TF identification that relies on SUMO-1ChIP peak height scores and can be validated via a literature review A score function, considering peak heights, frequency of TFBS on SUMO-1 peaks, and number of TFBS, was designed to predict SUMO-1 targeted TFs. Table 6 lists the 19 potential SUMO-1 targeting TF candidates predicted by the M*C*T method with Z-score using a cutoff value of 2.24. Literature-verified SUMOylation of the 19 potential SUMO-1 targeting TFs are presented in Table 6. The top five potential SUMO TFs, ELK-1 [30], E2F [31], NFY [32], and CREB [33], have all been confirmed to be SUMO substrates by literature review and the percentage of SUMO-verified TFs shown in Figure 6 indicates that the most favorable result is obtained when using the M*C*T combination. Among the 19 potential SUMO TFs, 17 of them have been previously identified as SUMO substrates. For example, Elk-1, the top 1 SUMOylated TF candidate in our analysis, can be SUMO modified at its R motif [30]. Overall, 30% of the SUMO peaks (149/482) containing the Elk-1 TFBS that were identified in the present study are also found in another Elk-1 ChIP-seq data (GEO ACCESSION: GSM608163). Although no previous study   reports have indicated that hRFX1 and NSCL1 are SUMOylated, we cannot rule out the potential of these two proteins to form a SUMO complex and/or to bind a SUMOylated cofactor.
Validation of SUMO-1 enrichment in ELK-1 binding site identified in HeLa cells Recently, a ChIP-seq report has pinpointed the global chromatin localization of ELK-1 in human HeLa cells  (GEO ACCESSION: GSM608163). We selected three high quality ELK-1 binding sites identified in HeLa cells overlapping with our SUMO-1 enriched regions and validated by ChIP-qPCR. As shown in Figure 3D to 3F, there is a significant increase of SUMO-1 enrichment in ELK-1 binding sites of SNRPE, INO80B and LYSMD1 promoter identified from previous study of others in HeLa cells. However, ChIP-reChIP data shows no co-localization of ELK-1 and SUMO-1 in the promoter region of SNRPE, INO80B and LYSMD1 genes ( Figure 4C to 4E). Consistent with the ChIP-reChIP data, the transcription of all three genes showed no changes during K-Rta induced KSHV reactivation after SUMO-1 knockdown comparing to the control cells ( Figure 5D to 5F). The results were similar to the negative control genes, MCL-1 and IRF-3, which have ELK-1 binding site but not SUMO-1 enrichment in their promoter regions ( Figure 5G and 5H). The inconsistency between our results and the findings in HeLa cells may be due to the cell type specificity. Together, taking ELK-1 as an example, the result here suggests that our pipeline is able to identify the potential chromatin region bound by SUMO modified transcription factors successfully. The biological role of SUMOylation in regulating the function of ELK-1 was also identified in a cell type-specific manner.

Discussion
Comparisons of the different methods available for the global identification of SUMO-1 peaks As revealed in Figure 7, different algorithms returned disjointed sets of peaks, which indicate that these divergent approaches and algorithms recognize distinct peaks. This finding creates a challenge as to how to integrate the results from different tools. Pepke et al. [20] classified the density profile of ChIP-seq result into three categories: (1) punctate regions; (2) broader regions; and (3) broad regions. Different strategies should be employed when delineating different profiles. Interestingly, evidence shows that SUMO-mediated transcription regulation not only involves covalent SUMO modifying transcription regulatory proteins, but also non-covalent associated co-regulatory proteins that contain the SUMO interacting motif (SIM). In most cases, SUMO formed complexes seems to result in regions that extend beyond a single TFBS, therefore rendering traditional peak calling methods inadequate when studying the binding sites for SUMOylation within long regions. An accurate characterization of the SUMOylation binding patterns will be of real significance to the study of SUMO binding sites across the entire genome, as well as any analysis of their correlation with transcriptional regulation.

Evaluation of system fusion result
We performed two kinds of combination, intersection and union, with the four mainstream peak detection tools, namely MACS, T-PIC, BayesPeak and CisGenome (see Methods section). Intersection of two systems is expected to improve specificity, while union is expected to improve sensitivity. When evaluating each system or combination, we viewed the results with respect to combinatorial peaks using four percentage indices, P promoter , P TFBS , P tp_p , and P tp_t (see Methods section). To evaluate these four indices, we created an average precision (AP). The results are shown in Tables 3 to 5. Table 3 lists the four indices from the four individual tools and each of thefour tools has its own strengths. MACS, T-PIC, Baye-sPeak and CisGenome detected the highest percentages of P TFBS , P tp_p , P promoter and P tp_t , respectively. Table 4 showed that all combinations by union are negative cases with respect to the individual methods, due to an abundance of un-annotated peaks and intergenic peaks. As highlighted in Table 5, all combinations by intersection are positive cases, especially the M*C*T method. Collectively, each type of tool providing information beneficial to identify SUMO-1 peaks and the pipeline design here pinpoints potential SUMO-1 targeting TFs from others according to the scoring step. As shown in Figure 6, though the top 10 SUMO-1 targeting TF candidates are identified by M*C*T, the C+T, C*T, C and T methods provide similar SUMO verification rates. The verification rate for the following groups, namely top 15 to top 35, became lower compared to the M*C*T rate, C rate (the   CisGenome, together with the numbers of peak presented. The numbers for the union and intersection of the peaks, and the mapped genes as obtained by the software can also be found in Table 4. best individual method), C+T rate (the best union method) and M+C+T rate (the worst method of all). The results suggest that while all the methods are able to predict potential SUMO-1 targeted TFs when there is a strong peak score, the M*C*T method predicts SUMOylated TFs with a lower peak score in a more effective manner. In addition, we also compare the combination of all four methods of MACS, T-PIC, BayesPeak and CisGenome. As shown in Figure 8, combinational methods of M*C*T *B is not superior than M*C*T. The result indicates that the choice of peak calling tool is important. Using intersection strategy can filter the false positive peaks, however intersecting too many peak calling tools may let the unfit tool filter out the good peaks.

Conclusions
Decoding how PTMs impacts the TF regulatory system that governs diverse cellular responses remains challenging. Taking SUMO modification as an example, we have developed a computational pipeline for predicting putative SUMOylated TFs from a group of TFBS. Using the combinatorial fusion methods described here, there is no need to depend on a single "best" algorithm. The merge detection method is able to find peaks with greater accuracy than any other peak calling software alone using ChIP-seq data retrieved from targeted PTMs. SUMO-1 target TFs are predicted well using our pipeline. In total, 89% of the 19 potential SUMOylated TFs were found to be SUMOylated after confirmation by literature review.
In summary, our observations includes: (1) based on the criteria and performance evaluations used, there are no single answer to the selection of the best available method for ChIP-seq peak detection when identifying PTMs; (2) combinations of different tools are able to improve results in many cases; and (3) M*C*T is the superior prediction method when detecting putative SUMOylated TFs. More than 60% of the peaks identified in this study have not been annotated. One of the reasons for this is that the human cell contains thousands of TFs, and many of them are able to be SUMOylated. The TFBS data set from the UCSC genome browser only includes binding sites for 258 TFs out of these thousands of TFs.
In the future, our work should help researchers to achieve a greater understanding of SUMOylated TFs once a better TFBS database become available. Moreover, we intend to explore the non-TFBS-annotated SUMO peaks in order to identify chromatin remodeling molecules that are not TFs. Most importantly, our pipeline here provides a platform for all researchers who want to study the relation between PTM and epigenetic regulation using their own chromatin binding data.

Experiment design and sample preparation
The epigenetic study underling this paper's aim is an investigation of the impact of SUMO/TF interaction in a primary effusion lymphoma (PEL) cell line, BCBL-1. To this end, we generated ChIP-seq data using anti-SUMO-1 antibodies and BCBL-1. In general, the results of a SUMO-1 ChIP-seq experiment were anticipated to reflect indirectly the SUMO regulatory genome via SUMOylated TF binding and chromatin. In parallel to this, another scenario is that SUMO-1 antibody identifies SUMOylated cofactors that are recruited to TFs and TF-occupied DNA sequences. The cross-linked SUMO-TF-DNA complexes were extracted and contained the DNA that interacts with either the SUMOylated-TFs or the SUMOylated transcription complexes. After purification of ChIP-enriched DNA, a library was developed to allow sequencing on a NGS platform (Figure 1).
Chromatin immunoprecipitation-sequencing (ChIP-Seq), ChIP-reChIP assay and real-time quantitative PCR (qPCR) 1 × 10 7 cells were harvested and ChIP assays were performed following the protocol described by the Farnham laboratory (provided at http://genomics.ucdavis.edu/ farnham). ChIP-reChIP assays were performed by Re-ChIP-IT kit (Active Motif, Carlsbad, CA) following the manufacturer's instruction. ChIP-verified rabbit polyclone antibodies specific against SUMO-1 (Abcam, Cambridge, MA) and rabbit non-immune serum IgG (Alpha Diagnostic International) were used for the ChIP and ChIP-reChIP assays.
ChIP-seq library construction was carried out following the sample preparation protocol from Illumina. Short reads (100 bp) from both ends (paired-end sequencing) were sequenced on an Illumina ® Genome Analyzer IIX System. The binding sites were verified by SYBR ® Green Based qPCR using a CFX connect™ real-time PCR detection system (Bio-Rad, Richmond, CA). Specific primer sets were designed around the identified binding sites for this purpose. Enrichment of SUMO-1 and IgG samples were normalized with the input.

Data analysis Input datasets
The reads within the SUMO-1 ChIP-seq data sets were aligned by BWA with default parameters [37]. hg19 was used as the reference genome, having been downloaded from the UCSC genome browser [38]. The Ensembl database was used to define gene regions [34]. Promoter regions are defined as the area that stretches from 5 kb upstream to 1kb downstream of a transcription start site (TSS).

Scoring system for TFBS in SUMO peaks
Peak calling was the last, perhaps most pivotal and dynamic step in the process of the ChIP-seq pipeline after fragmentation, immunoprecipitation, sequencing and aligning. Figure 9 describes our pipeline for the SUMO-1 ChIP-seq experiment and the analytical workflow. The initial stage of peak detection was to identify the enriched regions with a large number of mapped reads. Subsequently, the peak calling tools had to determine if these regions were significant enough to represent a protein-DNA binding site across various peak heights and/or directionality score. This approach ensured that the peak heights are a scoring function in which the system assigned a number to each possible region. We propose that the peak detection for each of the binding sites be viewed as a scoring system on sets of all possible SUMO binding site regions, and the UCSC TFBS data set be viewed as known TFBS regions when annotating the SUMO binding site regions. The TFBS dataset was downloaded from the UCSC genome browser database, and includes a total of 5,797,266 TFBS for 258 TFs in Track TFBS [35].
Let T = [t 1 , t 2 ... t 258 ] be the set of TFs, and TB i , i = 1 258 be the set of TFBSs of t i . A range of SUMO peak detection scoring systems were developed using different algorithms. Using multiple scoring systems that were defined by the set of possible TFBS regions on the set of SUMO possible peaks, we were able to study the reproducibility of peak calls among different replicate. Multiple scoring systems were also used to develop and design new pipelines that had greater accuracy, efficiency and scalability when detecting SUMOylated protein binding sites during ChIP-seq data analysis. We drew from recent research on combinatorial fusion and applied this to ChIP-seq data analysis. Since peak heights were found to be the most consistent and best performing feature of peak calling methods, peak heights was selected as the score function to represent each method's scoring of the region identified. Let D x be the set of peak regions identified by tool X, and D i x be the intersection of D x and TB i . The score function is defined as It means the sum of the peak heights D i x weighting with the percentage of D i x in all TB i . Let the rank function R(D i x ) be the function from 1 to 258 that is obtained by sorting the values in S(D i x ) into descending order and converting the function S(TB x ) into the function R(TB x ) using the rank as its function value.

Combined two peak detection systems
Union In the union of two systems, x and y, D x+y is the set of regions that contains all peaks identified by X and all peaks identified by Y, where the overlapping regions between the two tools are merged to gather and form new compound regions, and non-overlapping peaks are allowed to maintain their genome position. Let D i x+y be the intersection of D x+y and TB i . The score function is S(D Intersection The intersection of two system, X and Y, D x*y is the set of SUMO TFBS that are detected both by X and Y. Figure 9 Diagram of the SUMO-1 ChIP-seq analysis workflow. Scheme used for the modified high-resolution ChIP-seq method and its validation. The literature was used to verify 17 of the top 19 SUMO-1-TF candidates. The SUMO-1-TF candidates were predicted by the following steps: (1) filtering poor and repeat reads out, and aligning reads to the human genome (hg19); (2) calling peaks using three tools MACS, T-PIC and CisGenome; (3) combining three peak sets; (4) annotating peaks using TFBS; (5) scoring and ranking SUMO-1 TF candidates; and finally (6) verifying SUMO-1 TF candidates via the literature. Identifying potential SUMO-1 target TFs using the Hampel Identifier Hampel identifier is a measure for the robustness of an estimator against outliers. It is regarded as one of the most robust and efficient outlier identifiers [36,37]. The higher value of Hampel identifier means much more different from the main part of the data. We use Hampel Identifier [38] to identify the potential SUMO-1 targeting TFs. We apply Hampel Identifier on the score function S(D i x * y ) which are estimates of and treat any observation as a potential SUMO-1 targeting TF for which the following is true:

Performance evaluation methods
For many TFs, the majority of binding sites can be found near the TSS of expressed genes. Therefore, whether or not the peak is in the promoter region (promoter peak) can be an index when evaluating ChIP-seq software systems, and different combination methods. Thus, when, a peak overlaps with a TFBS, as a TFBS peak, this indicates that this is a functional peak. Thus, potentially, there is a percentage of TFBS peak found for all peaks and for promoter peaks, both of which represent evaluation indices. In this evaluation, we defined four indexes to compare the peaks identified by a particular tool and by combination of the three tools. P promoter = Peak # in promoter Total peak # P TFBS = TFBS peak# Total peak# P tp p = TFBS peak # in promoter Peak # in promoter P tp t = TFBS peak # in promoter TFBS peak# Meanwhile, average precision (AP) for a system is defined as AP = P promoter + P TFBS + P tp p + P tp t 4

Conflicts of interests
The authors declare that they have no competing interests.
Authors' contributions Experiments conceived and designed by HJK, WWC and CYT, and performed by HWH, CYC and PCC. Algorithm designed by CYC and FRS. Data analyzed by CYC and CHC. The article is written by CYC, CHC and PCC.