Chromatin compaction is one of the critical factors regulating gene expression. The basic unit of chromatin, the nucleosome, consists of 147 base pairs (bp) of DNA wrapped around an octamer of histone proteins (H2A, H2B, H3, H4). Many histone post-translational modifications contribute to establishing compacted, transcriptionally repressed heterochromatin (e.g., histone H3 lysine 9 trimethylation (H3K9me3)) or open, transcriptionally poised euchromatin (e.g., H3K4me3) [1, 2]. However, it is currently unknown why so many modifications — on at least 60 histone residues — are necessary [3, 4]. One possibility is that individual modifications have specialized properties, such as “indexing” classes of genomic elements. Nevertheless, such discriminating properties remain largely unknown, as redundancy and enzyme promiscuity for non-histone targets have limited the amenability of histone modifications to genetic experimentation.
A potential solution to this problem is to apply supervised learning/classification techniques to high-throughput epigenomic data, such as chromatin immunoprecipitation sequencing (ChIP-Seq) data, for histone modifications. Encouragingly, these approaches have had success in the related task of predicting the nucleosome occupancy of DNA sequences: they have elicited predictive features with biological (e.g., Rap1 transcription factor binding sites [7, 8]) and biophysical (e.g., GC content, DNA propeller twist [7, 9, 10]) interpretations. Nevertheless, attempts to apply classification techniques to histone modifications have been less forthcoming. This is, in part, because such analyses require less readily available datasets, comprising many ChIP-Seq experiments in the same cell type. As notable exceptions, Barski et al. have generated a ChIP-Seq dataset for 20 histone methylations and the histone variant H2A.Z in human CD4+ T cells, and Wang et al., of the same research group, have generated a similar dataset for 18 histone acetylations. A recent study by Gervais and Gaudreau applied classification techniques to histone modifications using these datasets.
In particular, Gervais and Gaudreau attempted to predict whether a nucleosome contains histone H2A.Z or H2A/H4 arginine 3 symmetric dimethylation (H2A/H4R3me2s; the authors refer to this as just “H2A”, though it is a methylated form). Importantly, these two classes are likely mutually exclusive: H2A.Z lacks the R3 methylation site and localizes near active transcription start sites, while H2A/H4R3me2s localizes with repressed heterochromatin. The authors first performed classification with histone modification features (co-localization with 37 other modifications from ChIP-Seq) and then with DNA sequence features (frequency of 6-mers in 147 bp nucleosome-bound DNA sequences). However, these analyses only achieved modest prediction accuracies of 68.3% and <60%, respectively (here, a trivial classifier would have an accuracy of 50%). Furthermore, there was limited biological interpretation for histone modification features and no interpretation for DNA sequence features.
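For concreteness, the 6-mer frequency representation described above can be sketched as follows (a minimal illustration, not the authors' implementation; the function name and the choice to normalize counts to frequencies are ours):

```python
from collections import Counter
from itertools import product

def kmer_features(seq, k=6):
    """Represent a nucleosome-bound DNA sequence as k-mer frequencies."""
    seq = seq.upper()
    # Count all overlapping k-length windows in the sequence
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Use a fixed ordering over all 4**k possible k-mers as the feature space
    alphabet = [''.join(p) for p in product('ACGT', repeat=k)]
    total = max(sum(counts[m] for m in alphabet), 1)
    return [counts[m] / total for m in alphabet]

# Example: a 147 bp sequence yields 147 - 6 + 1 = 142 overlapping 6-mers,
# mapped to a 4,096-dimensional frequency vector
toy = ('ACGT' * 37)[:147]
vec = kmer_features(toy)
```

With k fixed at 6 this mirrors the restrictive feature-length prescription discussed later: every nucleosome becomes a 4,096-dimensional vector, and any discriminatory motif longer than 6 bp can only be captured indirectly.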
A partial explanation for this modest performance may be insufficient data pre-processing. Gervais and Gaudreau used raw, aligned (25 bp) ChIP-Seq reads, and simply extended these to 147 bp to generate what they consider to be nucleosome-bound DNA sequences. However, this approach is problematic. Because ChIP-Seq is only a slight enrichment (not a purification) for sequences bound to the protein of interest, it is notoriously noisy. The majority (estimates upward of 90%) of ChIP-Seq reads are instead from the background. Therefore, we, and others [15, 17, 18], advocate using position- (peak-) finding algorithms, such as Nucleosome Positioning from Sequencing (NPS) (see Methods), that identify stable nucleosome positions with statistically significant enrichment over background, prior to analysis. Here, stable nucleosomes can be defined as those that are located at roughly the same chromosomal position across a population of cells and can therefore generate a signal peak when ChIP-Seq reads are aligned. Such nucleosomes are also referred to as being relatively well positioned or phased, and there is evidence for their regulatory importance [1, 16]. While using stable nucleosome positions might limit the analysis to a subset of nucleosomes (and thus influence interpretation), we still believe this approach is preferable to using raw, aligned reads — of which only a small minority were likely even bound to the nucleosomes of interest. This approach of using stable nucleosomes was also utilized in a recent study.
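The intuition behind position-finding can be conveyed with a deliberately simplified sketch (this is not the NPS algorithm itself; the uniform Poisson background model, fixed 147 bp bins, and significance threshold below are illustrative assumptions):

```python
from collections import Counter
from math import exp, factorial

def poisson_sf(k, lam):
    """P(X >= k) under a Poisson(lam) background model."""
    return 1.0 - sum(exp(-lam) * lam ** i / factorial(i) for i in range(k))

def stable_positions(read_starts, genome_len, bin_size=147, alpha=1e-3):
    """Call bins whose read count is significantly enriched over a
    uniform background, as a stand-in for stable nucleosome positions."""
    counts = Counter(start // bin_size for start in read_starts)
    lam = len(read_starts) * bin_size / genome_len  # expected reads per bin
    return sorted(b * bin_size for b, c in counts.items()
                  if poisson_sf(c, lam) < alpha)
```

A pileup of reads at roughly the same chromosomal position is called as a peak, while scattered background reads fall below the threshold; real position-finders additionally model strand structure, local background, and nucleosome spacing.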
Aside from the handling of signal and background, the approach of Gervais and Gaudreau might not adequately control for systematic biases present in ChIP-Seq data. First, because of PCR amplification bias, it may be advisable to collapse duplicate reads prior to analysis [15, 18]. This is especially the case for datasets such as those of Barski et al. and Wang et al., where sequencing depth is relatively low, such that there is a lower likelihood of sequencing independently precipitated fragments with the same start site (as future datasets begin to have much higher sequencing depth, more refined alternatives to read deduplication will be valuable). Indeed, even for stable nucleosomes, the positioning is often blurry, with nucleosomes not having precisely the same start site across cells. In addition, because coverage and the ability to detect peaks vary with sequencing depth, ChIP-Seq experiments need to be normalized for the number of reads. Refined normalization approaches are emerging for ChIP-Seq datasets that contain a mock immunoprecipitation (IP) sample; however, for otherwise rich ChIP-Seq datasets that lack such a mock IP, including those of Barski et al. and Wang et al., we believe data should still be normalized for the number of reads, in the absence of a more delicate approach for this type of data (see Discussion).
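The two corrections described here, collapsing duplicate reads and depth normalization, are simple to state precisely. Below is a minimal sketch, assuming reads are represented as (chromosome, start, strand) tuples and that scaling to a common target depth is an acceptable form of read-count normalization:

```python
def deduplicate(reads):
    """Collapse presumed PCR duplicates: keep one read per
    (chromosome, start, strand) combination, preserving order."""
    return list(dict.fromkeys(reads))

def normalization_factor(n_reads, target_depth=10_000_000):
    """Scale factor that makes samples of different sequencing
    depth comparable (target depth is an arbitrary common value)."""
    return target_depth / n_reads

reads = [('chr1', 100, '+'), ('chr1', 100, '+'), ('chr1', 250, '-')]
unique = deduplicate(reads)  # the repeated read is collapsed to one
```

The trade-off noted in the text is visible here: at low depth, two reads with identical coordinates are almost certainly PCR duplicates, whereas at very high depth this rule begins to discard independently precipitated fragments.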
Here, we employ appropriate ChIP-Seq data pre-processing and sequence-customized, or otherwise advanced, algorithms to investigate their impact on the accuracy and interpretability of classifying nucleosomes containing H2A/H4R3me2s or H2A.Z. For data pre-processing, we perform deduplication, normalization, and position-finding. Further, for DNA sequence-based classification, we utilize the recently developed Discriminatory Motif Feature Selection (DMFS), which, in addition to achieving impressive accuracy, emphasizes interpretability, unlike so-called “black-box” classifiers. Specifically, DMFS elicits a small set of a priori discriminatory features (motifs) on a subsequently withheld data partition. This eliminates many noise features, which can compromise prediction and interpretation, and loosens restrictive feature length prescriptions (e.g., the 6-mers used by Gervais and Gaudreau), which could otherwise fail to generate key, longer features. For classification based on histone modification features, we utilize an ensemble method, random forests, which has been widely demonstrated to improve on individual classification trees [24, 25], as were deployed by Gervais and Gaudreau. Finally, we perform extensive downstream analysis. Importantly, in addition to achieving dramatically improved accuracies, our classification algorithms elicit predictive, interpretable features that are consistent with recent biochemical findings.
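To make the ensemble idea concrete, the following is a minimal, self-contained sketch of a random forest over binary co-localization features: bagged one-level trees ("stumps") with random feature subsets and majority voting. This illustrates the principle only (it is not the implementation used in our analysis; in practice one would use a mature library):

```python
import random
from collections import Counter

def best_stump(X, y, feat_indices):
    """Pick the binary feature whose value best splits the labels."""
    best_feat, best_acc = feat_indices[0], 0.0
    for j in feat_indices:
        correct = sum(1 for xi, yi in zip(X, y) if xi[j] == yi)
        acc = max(correct, len(y) - correct) / len(y)
        if acc > best_acc:
            best_feat, best_acc = j, acc
    # Orient the stump so its prediction tracks the majority relationship
    correct = sum(1 for xi, yi in zip(X, y) if xi[best_feat] == yi)
    flip = correct < len(y) - correct
    return best_feat, flip

def fit_forest(X, y, n_trees=25, seed=0):
    """Random forest of stumps: bootstrap the rows, subsample the features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        feats = rng.sample(range(p), max(1, int(p ** 0.5)))  # feature subset
        Xb, yb = [X[i] for i in rows], [y[i] for i in rows]
        forest.append(best_stump(Xb, yb, feats))
    return forest

def predict(forest, x):
    """Aggregate the stumps' predictions by majority vote."""
    votes = Counter(x[j] ^ flip for j, flip in forest)
    return votes.most_common(1)[0][0]
```

Each tree sees a bootstrap sample of nucleosomes and a random subset of roughly sqrt(p) of the modification features; aggregating many such decorrelated trees by majority vote is what reduces the variance of individual classification trees.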