Combining multiple ChIP-seq peak detection systems using combinatorial fusion

Background Due to the recent rapid development in ChIP-seq technologies, which uses high-throughput next-generation DNA sequencing to identify the targets of Chromatin Immunoprecipitation, there is an increasing amount of sequencing data being generated that provides us with greater opportunity to analyze genome-wide protein-DNA interactions. In particular, we are interested in evaluating and enhancing computational and statistical techniques for locating protein binding sites. Many peak detection systems have been developed; in this study, we utilize the following six: CisGenome, MACS, PeakSeq, QuEST, SISSRs, and TRLocator. Results We define two methods to merge and rescore the regions of two peak detection systems and analyze the performance based on average precision and coverage of transcription start sites. The results indicate that ChIP-seq peak detection can be improved by fusion using score or rank combination. Conclusion Our method of combination and fusion analysis would provide a means for generic assessment of available technologies and systems and assist researchers in choosing an appropriate system (or fusion method) for analyzing ChIP-seq data. This analysis offers an alternate approach for increasing true positive rates, while decreasing false positive rates and hence improving the ChIP-seq peak identification process.


Introduction
One of the most important biotechnologies developed in the 20 th century is the Sanger method for the sequencing of DNA [1]. Recently developed next-generation DNA sequencing (NGS) technologies have increased DNA sequencing capacity by many orders of magnitude, making entirely new applications possible [2,3]. Chromatin Immu-noPrecipitation (ChIP) is a biochemical method to identify binding sites on DNA that interact with proteins. It involves cross-linking proteins to DNA with a reagent such as formaldehyde, randomly shearing the DNA into small fragments (200-500 base pairs) (fragmentation), then using an antibody specific for a known DNA-interacting protein to isolate DNA fragments bound to the target protein [4] (immunoprecipitation).
The combination of the ChIP process and microarray DNA chip technologies lead to the method of Chip-onchip [5] or ChIP-chip [6] that can identify DNA fragments isolated by ChIP using a DNA microarray containing large numbers of probes of known genomic sequences. ChIP-seq [7] uses next-generation sequencing (NGS) to identify the DNA fragments isolated by ChIP. Nextgeneration DNA sequencing machines are capable of simultaneously determining the sequences of millions of DNA fragments in a single sample with a high degree of accuracy (high-throughput sequencing). The sequence reads (known as tags) obtained from ends of ChIP-selected DNA fragments are typically 25-50 base pairs long. These short reads can then be mapped to a reference genome by a stringent DNA sequence alignment algorithm such as ELAND (Illumina Inc.), MAQ [8], or Bowtie [9] (mapping). Sequence reads that do not map to a unique position on the genome (with 2 or fewer mismatches) are generally discarded. The final product of such a mapping procedure is a set of positions on the reference genome indicating the start and end of each short sequence read. Once the reads are mapped to the genome, the tag positions can then be analyzed for clusters of tags or "peaks", which indicate (predict) protein binding (or histone modification) positions enriched by the ChIP (peak detection). The results of ChIP-seq studies can provide an unbiased genome-wide profile of DNA regulatory regions targeted by transcription factors as well as the signatures of modified histone proteins associated with epigenetic changes in chromatin. Figure 1 shows a framework for a ChIP-seq experiment and analytic workflow [10].
Peak detection is the last and probably most crucial and dynamic step in the process of the ChIP-seq method and system after fragmentation, immunoprecipitation, sequencing, and mapping. Along the pipeline, the set of mapped sequence tags can easily acquire noise from background contamination, co-precipitation of unbound DNA fragments, non-specific interactions of the ChIP target protein with DNA, and a variety of sources such as replication and amplification artifacts (e.g. PCR artifacts). A useful ChIP-seq peak detection technique or tool has to be robust and reliable. With the rising popularity and increasing importance of ChIP-seq, there has been a proliferation of new analytical and computational methods to find peaks in ChIP-seq data. At the last count, there are over 30 open source programs, in addition to many commercial software applications, available to the research community [11].
The first step in the peak detection process is to identify those genomic regions with a large number of mapped sequence tags (enriched regions) [12][13][14][15][16][17][18][19][20][21][22][23][24]. Then the peak detection and identification system must determine the number of tags (peak heights) or directionality score (tag count) that constitutes enrichment "significant" enough to represent a protein-DNA binding site. In this way, tag count (T) is a scoring function in which the system assigns a number to each possible region. Often, a tag count threshold is chosen to define a peak [24]. One way to set this threshold is to compare the distribution of tags in enriched regions to tags that are placed randomly on the genome. The outcome is a significance value (p-value) of the sequence tag enrichment. This value (P) is also a scoring function used to select peaks [16,17,20,22]. Some methods use sequence data from a control dataset and then use the control tag densities to assess the significance of peaks in the ChIP sample set. In this case, a fold enrichment (F) ratio of ChIP tag count over the normalized control tags in the candidate regions is calculated to give another scoring function [7,14,17,25]. Different methods use various statistical models to assess the significance of ChIP peaks or assign a false discovery rate (FDR) to each peak with respect to control data [12,16,[18][19][20][21][22][23][24][25][26].
Recently, Pepke et al [27] gave a review of the major steps in ChIP-seq analysis and described the algorithmic approaches of twelve existing programs for detecting peaks. Laajala et al [28] gave some metrics for evaluating various methods of peak detection. Wilbanks and Facciotti [11] compared the performance of eleven different peak calling programs on common empirical, transcription factor data sets. Their work offers a variety of ways to assess the performance of each algorithm and address the questions as to how to select the most suitable among several available methods of ChIP-seq analysis for peak detection. In our study, we evaluate six methods: CisGenome, MACS, PeakSeq, QuEST, SISSRs, and TRLocator [16,17,23,24,26,29] based on the three attributes: tag count, p-value, and fold change, and their combinations. We then analyze the merged results of all twomethod combinations. In particular, a recently developed information fusion method, Combinatorial Fusion Analysis [30], is utilized to demonstrate that ChIP-seq peak detection can be improved by fusion using score or rank combination. Our study offers an alternate approach to select a suitable method for ChIP-seq analysis. This study also offers ways to improve existing methods by combining them in an appropriate way using Combinatorial Fusion Analysis.
Based on preliminary experiments, we have observed that the peak-detection abilities of available ChIP-seq methods and systems vary greatly depending on the type of protein that is targeted by the antibody used in the ChIP. We have identified three types of protein-DNA interactions that generate very different results when the same peak detection system is used to analyze the ChIPseq data [10]. The first observation is that transcription factors, such as E2F4, bind strongly to a single highly specific DNA sequence (a motif) near the transcription start site (TSS) of a gene, and are characterized by distinct ChIP-seq peaks~500 bases wide, with oriented tags that approximately follow a normal distribution. A second observed pattern is with transcription factors, such as Sin3a, that bind weakly to DNA together with co-factors, yielding wider ChIP-seq peaks (800-1600 bases) with a flat distribution of lower tag density and un-oriented tags. A third kind of ChIP-seq target, modified histone proteins, such as tri-methylated H3K4, produce much wider peaks (~4000 bases) and un-oriented tags [10]. In this study, we use a trimethylated H3K4 (H3K4me3) data set [31].

Previous work
Similar to the analysis of microarray gene expression data, many computational methods have recently been developed for the analysis of ChIP-seq data. In both cases, the proliferation of software and systems was an indication that it is difficult to find a single well-validated method that performs well in a variety of domain applications. It also depends on what criteria one uses to evaluate the systems. In this study, we use the following six methods and systems to analyze their intra-and intersystem properties and improvement by combination. They are (A) CisGenome [16], (B) MACS [24], (C) Peak-Seq [26], (D) QuEST [23], (E) SISSRs [17], and (F) TRLocator [29].
CisGenome [16] uses a two-pass algorithm for peak detection to ensure adjustment for DNA fragmentation length. It can analyze both ChIP-seq and ChIP-chip data, or combine the two. In order to correct many types of systemic bias created by sample preparation, amplification, sequencing (or hybridization), and alignment, it uses both a ChIP sample and a negative control sample (input DNA or mock-ChIP with IGG) to compute FDR at each specific location. It also provides methods to detect binding regions, peak localization, and filtering.
QuEST [23] provides a data-driven statistical analysis model to generate peak calls by leveraging the key attributes of the sequenced and aligned DNA reads, such as directionality (strand orientation) and the original size of ChIP-isolated DNA fragments. The statistical framework used is the kernel density probability estimation approach, which facilitates the aggregation of signals originated from densely packed sequence reads at protein interaction sites.
MACS (Model-based Analysis of ChIP-Seq) [24] empirically models the shift size of ChIP-seq tags to enhance peak identification by taking advantage of the bimodal pattern of forward and reverse tags. MACS also utilizes a dynamic Poisson distribution to identify local biases in the genome.
Site Identification from Short Sequence Reads (SISSRs) [17] estimates high read counts using Poisson probabilities and calls regions where the peaks shift from the forward to the reverse strand. The SISSRS method is attractive because it explicitly makes use of information from the orientation of tags around a protein binding site -where it is expected that forward strand tags will be found upstream of the true binding site and reverse strand tags downstream. This allows for very precise prediction of the actual binding site. However, for regions of low tag density or for histone methylation ChIP, where tags are not neatly oriented, it tends to create many different peaks across enriched regions, which may not be reproducible across replicates.
PeakSeq [26] utilizes input-DNA control data to refine the selection and scoring of peak regions in Chip-seq experiments to improve the identification of transcription factor binding sites. Since it has been observed that signal peaks in the control data are highly correlated with potential binding sites, PeakSeq compensates for this signal, caused by open chromatin structure, with a two-pass strategy. PeakSeq first identifies enriched peaks in the Chip-seq data as candidate regions. These putative regions are then compared to the normalized control and the regions that are significantly enriched with mapped sequence tags relative to the control are identified as binding sites.
TRLocator [29] is a peak detection method that has been developed at NYU-CHIBI. The algorithm utilizes the distribution of the background data to compute pvalues for putative peaks in the ChIP-seq data. Putative peak regions are generated based on a variable merging window size that can be adjusted according to the kind of data set being analyzed. Custom filters for finding qualified peak regions include: p-value, minimum number of tags within each putative peak, balance between the number of tags aligned to the positive strand and the number of tags aligned to the negative strand, and the log2 ratio between ChIP tags and background tags.

Combining peak detection systems Multiple scoring systems
We propose that the peak detection for each of the binding sites be viewed as a scoring system on the set of all possible binding site regions. Different scoring systems for peak detection can represent different features/cues/ attributes or different algorithms/methods/systems. They can also represent different technical replicates or different biological replicates using each of the same set of features or cues/attributes or the same algorithm or method/system. By using multiple scoring systems defined on the set of possible binding site regions to detect peaks for each of the binding sites, we can study the reproducibility of peak calls among different replicates. We also use multiple scoring systems to develop and design new algorithms with greater accuracy, efficiency, and scalability for detecting protein binding sites in ChIP-seq data alignment. We draw from recent research in combinatorial fusion [32,33]. Using a rankscore characteristic graph to measure the scoring diversity [34], combinatorial fusion has been an active research area in the past ten years in a variety of application domains such as microarray gene expression analysis [35], motif finding [36], protein structure prediction [37], virtual screening [38], information retrieval [39,33], and target tracking [40].
In our preliminary work, we analyzed the six individual systems according to three features, which include: tag count, p-value, and fold change (enrichment of ChIP tags compared to background control tags at the same genomic locus) [41]. We analyzed these features and their combinations according to average precision and observed that, in most cases, the tag count feature outperformed other features and combinations of features.
Since tag count was the most consistent and best performing feature between the methods, we choose to use the ChIP tag count as the score function to represent each method's scoring of the regions identified. Let D = {d 1 , d 2 ,..., d n } be the set of regions identified by system x and the score function s x (d) be the tag count of that region (number of ChIP tags in the data set that are located within that chromosomal region). Let the rank function r x (d) be the function from D to N = [1, n] = {1,2,..., n} which is obtained by sorting the values in s x (d) into descending order and converting the function s x (d) into the function r x (d) using the rank as its function value.

Combining two peak detection systems Union
The union of two systems, x and y, U(x, y) is the set of regions that contains all regions identified by x and all regions identified by y, where overlapping regions between the two methods are merged together to form new merged regions. All non-overlapping regions that belong to either x or y will maintain their genomic positions (chromosome, start and end bp coordinates). Each merged region will have a start position that is the minimum of all start positions of its overlapping regions from x and y, and an end position that is the maximum of all end positions of those overlapping regions. This new set of regions, U(x, y) = {d 1 , d 2 , ..., d p }, is scored based on the tag counts of systems x and y, as follows. Systems x and y have new score functions based on the regions in this union: s x' (d) and s y' (d). s x' (d) is obtained according to the following: Single regions -if the region was identified by system x, the score is the tag count given by x; otherwise the score is 0.
Merged regions -the score is the sum of the tag counts for the regions (that are part of this merged region) that were identified by x.
s y' (d) is obtained in the same manner. The score functions are then scaled from 0 to 1 by the following normalization: score function s x' (d): U(x, y) R is transformed to s max = max{ s x' (d): d U(x, y)}, and s min = min{ s x' (d): d U(x, y)}. s y' (d) is also normalized accordingly. The rank functions r x' (d) and r y' (d) from U(x, y) to N = {1, 2, ..., p} assign a rank to each region after sorting the scores given by s x' (d) and s y' (d) in descending order, respectively. In order to provide a single score and rank for each region in U(x, y) that is based on combined information from systems x and y, we perform score and rank combinations. The score combination for the union of systems x and y is defined as: s x * (d) + s y * (d) and the rank combination is computed according to:

Intersection
The intersection of two systems, x and y, I(x, y) is the set of the merged regions formed by overlapping  Figure 3 demonstrates the intersection and union of the PeakSeq and QuEST methods in the area depicted above. The intersection contains the merged regions that are formed by overlapping regions between the two methods. The union contains these merged regions and all non-overlapping regions of the individual methods.

Average precision
For many transcription factors, DNA polymerase II, and some modified histones such as tri-methylated H3K4, the majority of binding sites are located near the transcription start sites (TSS) of expressed genes. Therefore, it is possible to evaluate ChIP-seq software systems, and different combination methods, by their ability to locate peaks at a TSS. While not all true peaks are located at a TSS, not all TSS are correctly annotated in the reference genome, and not all true TSS have such a peak, the ratio of peaks located at an annotated TSS vs. those located elsewhere on the genome is a measure of precision of the peak finding method. We have validated this concept by visualizing all aligned tags on the genome without first identifying peaks. Peaks can be observed in the vicinity of most TSS annotated in the RefSeq database. An average peak can be visualized by superimposing the coverage depth of sequence reads for DNA regions within 1000 bases flanking all annotated RefSeq TSS ( Figure 4). No TSS peak is found in control DNA.
In this evaluation, we compare the peaks identified by a particular system (or combination of two systems) against the set of RefSeq TSS in the human genome. Average precision is used to evaluate the performance of systems and the result of fusion. A region is considered relevant if it overlaps with a TSS in the annotated set. We define the following overlap function for a region at rank i:

0, otherwise
Precision at rank r is computed as: Average precision for a system that identifies n regions is defined as:  Systems that have more regions in lower ranks that overlap with a TSS will have higher average precision.

Coverage
When evaluating the performance of a peak finding system, it is also important to consider its coverage, in terms of the breadth of TSS covered by its peaks (regions). Given a set of regions identified by a system or combination method, we generate the set of TSS that overlap with these regions; the coverage (C) is the number of unique TSS reached by the system: C = TSS that overlap with region (s) of system .

System fusion and evaluation
The ChIP tag count for a region is used as a score function to create all 2-combinations of the six systems:  CisGenome, MACS, PeakSeq, QuEST, SISSRs, and TRLocator. We perform two kinds of combination: intersection and union (see Methods section). The intersection of two systems is expected to improve specificity (detected by both systems) while the union is expected to improve sensitivity (detected by either system). When evaluating each system or combination of two systems, we use average precision and coverage (see Methods section). These results are listed in Tables 1, 2 Table 2, it can be observed that all 2-combinations by intersection are positive cases, which means its performance is better than or equal to the best of the two individual systems. Each of the two combinations: A*E between TRLocator (A) (ranked #1) and CisGenome (E) (ranked #5) and A* F between TRLocator (A) (ranked #1) and SISSRs (F) (ranked #6) is better than the three 2-combinations A*B, A*C, and A*D that involve TRLocator (A) (ranked #1), MACS (B) (ranked #2), PeakSeq (C) (ranked #3), and QuEST (D) (ranked #4). Moreover, each of the 2-combinations B*E and B*F is better than the other 2-combinations B*C and B*D. This phenomenon is quite interesting -individual systems such as CisGenome (E) (ranked #5) and SISSRs (F) (ranked #6), which are lesser preferred, can be combined with other systems (in this case, with TRLocator (A) (ranked #1)) to outperform other system combinations. Almost all of the 2-combinations by union (+) in Table 3 are negative cases -the performance of the 2-combination is less than the best performance of the single cases -except for the 2-combination of B+C (MACS and PeakSeq). It is also interesting to note that the three 2-combinations A+C, A+D, and A+E are better than the 2-combination A+B, reflecting the same phenomenon observed in Table 2.
Tables 1, 2, 3, 4, 5, 6 list the four cases of inter-system fusion: average precision for 2-combinations by intersection and by union; and coverage for 2-combinations by intersection and by union. The huge difference between average precision of the intersection and union is that the former has all the positive 2-combination cases, while the latter has all (but one) negative cases. Comparing Table 5 and Table 6 we find that each of the unions of two systems in Table 6 has higher coverage than those of the intersections of two systems in Table  5. Another difference is that in Table 5, 2-combinations C*D, C*E, and D*E move up to the second, third, and first ranks, while in Table 6, the 2-combinations involving CisGenome (C) (ranked #3) and SISSRs (F) (ranked #6), such as A+C, C+F, B+C, A+F, and B+F move up to the top five rankings.

Discussion
Evaluation of peak detection systems involves analyzing the regions identified as peaks according to criteria such as the average precision and TSS coverage.
Average precision measures the performance of a system according to higher scoring regions overlapping with a TSS. The intersection of two methods refers to the set of regions formed by extracting overlapping regions between two methods and merging them to form new regions. This set of regions represents the common peaks detected by both systems. The average precision of all 15 2-combinations improved when the intersection was evaluated. Combination by union only produced one result that improved average precision, MACS and PeakSeq.
When evaluating system combination according to TSS coverage, we refer to the number of unique TSS regions reached. When using the method of union to combine, all 15 2-combinations show improvement from both original systems. The result of combining two methods by union includes all overlapping regions that are then merged (intersection), in addition to all other regions belonging to each individual method. Some combinations show more improvement than others, which indicates that regions generated by those 2 systems are more diverse in terms of region location. For example, the regions identified by CisGenome overlap with 14010 unique TSS, and the coverage of PeakSeq is 15611. The combination of CisGenome and PeakSeq by union yields results that have a coverage of 21738, which means the combined result reaches many more TSS. Another example is for MACS and TRLocator, which individually have similar performance for coverage, 11804 and 11850, respectively. However, the combination of MACS and TRLocator by union greatly improves the performance and now reaches 20127 unique TSS; this demonstrates the diversity of the two systems. When using the method of intersection for system combination, 4 out of 15 combinations outperformed their component individual systems. Since the intersection consists of the merged,

Conclusions
This study entails the evaluation of and selection among multiple detection systems for ChIP-seq peak identification. In order to do so, we use six well-known methods A = CisGenome, B = MACS, C = PeakSeq, D = QuEST, E = SISSRs, and F = TRLocator and obtain the regions   In summary, we have the following observations resulting from the above experiments: • There is no single answer as to the selection of available methods (and systems) for ChIP-seq peak detection. It depends on the criteria (e.g. features) and performance evaluation (e.g. average precision or TSS coverage).
• Combinations of different methods (systems) do improve results in many cases (average precision of intersection, coverage of union, some for coverage of intersection). Some combinations of lesser preferred systems may outperform all other system combinations.
• Average precision improved more when combining two systems by intersection and coverage improved more when two methods are combined by union.
• The use of the rank function in our evaluation of multiple detection systems provides a generic framework to study the preference and relative preference for the method (or system) selection process.
In our future work, we will explore conditions such as diversity between or performance ratio of two methods (systems) of which two or more systems should be combined to obtain a better system (positive cases). Future work also involves application of method combination to other proteins and transcription factors. As not all TSS may be annotated in the reference genome, identifying  high-scoring regions among multiple methods can also be used to suggest potential TSS.