Dataset
In order to resolve the nucleosome variation in response to the mutation on the individual modifiable histone residue, we determined the nucleosome occupancy in twenty two histones mutants in Saccharomyces cerevisiae [6]. In those mutants, the modifiable lysine (K) and arginine (R) of four core histones were respectively substituted with the non-modifiable residue alanine (A). In the present study, the dataset is used to validate Dimnp. The dataset of nucleosome occupancy is available at website http://bioinfo.seu.edu.cn/Nu_dynamics_data_public/.
Identification of the differential nucleosome regions
Our approach has three steps, including calculating nucleosome occupancy, calling differential significance (P-value) and identifying the DNRs. In the first step, with mapped reads data, nucleosome occupancy is estimated and normalized for each sample so that the resulting signal is suitable for comparison. Firstly, the mapped reads are extended to 75 bp in 3′ direction and then shifted 37 bp to 3′ direction. Secondly, nucleosome occupancy profile is calculated by summating the reads count at each genomic locus. Thirdly, the profile is normalized by dividing it by the average reads count of the whole chromosome, i.e. a global background correction. Through the procedure, the nucleosome occupancy profile (reads count) of each locus is represented with a fold change to the average reads count, which eliminates the effect caused by different sequencing depth in the samples. We also provided an option to perform a local background correction, in which the profile is normalized by dividing the profile by the average of reads count of a 10-kbp sliding window at each genomic locus.
In the second step, a Chi-squared test-based model is applied to identify the DNRs. For n samples, given a
i
, representing the normalized reads count at locus k in sample i, and b
i
, representing the background of the normalized reads count at locus k in sample i, the test statistic χ2 at locus k is calculated with Eq. 1.
$$ {\chi}^2={\displaystyle \sum_{i=1}^n\left[\frac{{\left({a}_i-{A}_i\right)}^2}{A_i}+\frac{{\left({b}_i-{B}_i\right)}^2}{B_i}\right]}, v= n-1; i=1,\ 2, \dots, n, $$
(1)
Here, A
i
and B
i
represent the expected values for a
i
and b
i
, respectively. The background b
i
is an average normalized reads count of the chromosome. Then, significance P-value is calculated using Eq. 2.
$$ \mathrm{P}\hbox{-} \mathrm{value}=1- chicdf\left({\chi}^2, n-1\right) $$
(2)
Here, chicdf represents the Chi-square cumulative distribution for χ2 with the degree of freedom n-1.
In the third step, the DNRs are identified. Firstly, given a cutoff for the P-values, the regions with a P-value less than the cutoff are regarded as candidate DNRs. Then the candidate DNRs less than 10 bp are neglected, and any two adjacent DNRs with a distance less than 5 bp are merged into one DNR. We named the method with Dimnp, representing “differential identification in multiple (n ≥ 3) samples”.
Validation of Dimnp
In order to test the capacity of Dimnp, we manually identified the DNRs in three cell types, and compared them with the DNRs identified by Dimnp. The three cell types include wild type (H4WT), mutants H4R3A (arginine (R) 3 of histone H4 was mutated into alanine (A)) and H4K20A. In manual identification, we firstly performed a pairwise comparison by calculating a ratio of the normalized reads counts in any two samples. Those regions with a ratio of more than 1/0.6 or less than 0.6 are defined as DNRs. Then, we pooled the pairwise comparison DNRs together, forming the DNRs data for the three cell types. We marked the DNRs as “the manual”. Actually, the manual identification is a computational algorithm. At last, receiver operating characteristic (ROC) curves are employed in evaluating the performance. If a DNR by Dimnp overlaps a DNR by the manual, the former DNR is regarded as a positive DNR.
We also called the DNRs with literature tool Danpos in two cell types with default settings except parameter “-t 10−5” (P-value cutoff). By pooling the two-cell type DNRs together, we generated the DNRs data for the multiple cell types and compared them with the DNRs identified by Dimnp. A matching percentage was used to evaluate the extent of the matching DNRs between Dimnp and Danpos. The matching percentage is a ratio of the number of matched DNRs between Dimnp and Danpos relative to the total number of the DNRs by Dimnp under a specified deviation. The deviation is from 1 bp to 100 bp.
Software package of Dimnp
We developed two packages for Dimnp (Additional file 1). One is based on Python (version 2.7) and for Linux operation environment. This package needs the third-party Python package rpy2 and R environment (version 3.0.2). The other is written with Matlab (R2009) and can be used in Windows operating systems. The input file is the mapped reads data in bowtie or bed format. The Python-based package contains two modules: NuclPreprocess.py and CalDiffPoints.py. Function NuclPreprocess.py is used to calculate normalized reads count for each locus of genome. Usage: NuclPreprocess.py [−h] [−o output_path] [−f (bed, bowtie)] [−p paired] input_path. For argument -p, 1 is for paired-end reads data and 0 for single-end reads data. Function CalDiffPoints.py is used to call the DNRs. Usage: CalDiffPoints.py [−h] [−c CUTOFF] [−o output_path] file_names [file_name …]. Argument -c is for setting cutoff for P-value. In Matlab language (R2009), the core function of Dimnp is Statistical_test_forN.m. Usage: [Pval_lg, FDR_cutoff, region_filtered] = Statistical_test_forN (file_out, cutoff, N_flag, varargin), where varargin are the variables of nucleosome occupancy profile, cutoff is P-value cutoff and N_flag indicates the normalization way. Information of the DNRs (region_filtered) will be written into a file (file_out). It also estimates the false discovery ratio (FDR) for each DNR.
Both the program (Additional file 1) and the dataset are available at website: http://bioinfo.seu.edu.cn/Nu_dynamics_data_public/.
An enrichment analysis was carried out for nucleosome-dynamic genes with tool DAVID (http://david.abcc.ncifcrf.gov/).