Overview
HiCHap is consisted of four modules: read mapping, contact maps construction, identification of diploid chromatin 3D organization (compartment, topological domain and chromatin loop) and allele-specific analysis (Fig. 1). Specifically, the paired-end reads are first mapped to maternal and paternal genomes which are built by using phased single nucleotide polymorphisms (SNPs). The unmapped end of a read is split into several parts according to the scanned ligation junction sites, and then all split parts are re-mapped to keep all available SNP information in the reads. The noisy reads are filtered and the kept ones are assigned to maternal or paternal reads according to their SNP information. Then the maternal and paternal contact matrices are constructed by using our proposed strategy of bias correction. The compartments, topological domains and chromatin loops are identified by using principal component analysis [1], directionality index based hidden markov model [5, 10] and HiCCUPS [2] with some modifications for diploid Hi-C contact matrices. Then the allele-specific compartments, topological boundaries and chromatin loops are tested. In following statement, haploid and diploid data processing refer to the processing procedure without and with distinguishing two homologous chromosomes in diploid cells.
Hi-C read mapping
For diploid read mapping, all paired-end Hi-C reads were aligned to maternal and paternal genomes respectively by using Bowtie2 [24]. If one end of a Hi-C read was not mapped to maternal and paternal genomes, this end was scanned by the ligation junction sites. If no junction site existed in this unmapped end, the corresponding Hi-C read was discarded. Otherwise, this end was split in the junction sites and each part was mapped to maternal and paternal genomes again. The restriction-enzyme sites were used to determine the contact relationship among these split parts. Specifically, if two parts were in the same restriction fragment, they were considered to be in the same contact anchor (case 1 and case 2 in mapping strategy, Fig. 1). Occasionally, the split parts did not localize in the same restriction fragment due to complicated ligation procedure, and then two contacts were generated from one paired-end Hi-C read (case 3 in mapping strategy, Fig. 1). The mapped reads next underwent filtering by using the procedure in hiclib [25]. Finally, the obtained end was assigned to be maternal one if the number of maternally matched SNPs was larger or equal to the two times of the number of paternally matched SNPs in this end, and vice versa. If only one matched SNP existed in the end, this end was assigned to its matched parental genome. If no SNP was matched in this end, this end was considered to be unassigned. The haploid Hi-C read mapping was same as the diploid Hi-C read mapping except for two steps. First, the two ends of a Hi-C read were mapped to reference genome instead of maternal and paternal genomes. Second, no allele assignments were performed in haploid Hi-C read mapping.
Matrix construction and bias correction
The haploid contact matrices were constructed by following previous pipeline [25]. The vanilla coverage (VC) normalization [1, 2] was used to correct biases to keep consistent with diploid contact matrix normalization in this work. However, the iterative correction strategy [25] was also provided in HiCHap.
Two types of diploid contact matrices were constructed. The first type of diploid contact matrices was constructed by using the pipeline of haploid matrix construction, except that only the two-end assigned contacts were used to construct maternal and paternal contact matrices. Since only a small proportion of contacts could be simultaneously assigned at two ends, these two-end-assigned contact matrices were very sparse (Supplementary figure 1A). The second type of diploid contact matrices was constructed by imputing the one-end-assigned contacts to improve data utilization. If two ends of a contact were in the same chromosome number, this contact was directly considered to be intra-haplotype contact due to the high proportion of intra-haplotype contacts calculated from two-end-assigned contact matrices (Supplementary figure 1B). If the two ends of a contact were in different chromosome numbers, this contact was imputed by following a previously proposed procedure [23].
To correct the allele-assignment bias caused by variable SNP density, the one-end-assigned contacts were placed in an asymmetrical way, in which the assigned end and unassigned end were placed in the row and column respectively. Then the symmetrical two-end-assigned contact matrices and the asymmetrical one-end-assigned contact matrices were summed (contact map construction, Fig. 1) to make use of all available data. Intuitively, some chromosomal bins were poor in the mapped contacts due to limited SNPs, and they were called gap bins in this work (Supplementary figure 2A). Let \( {\left({O}_{ij}^{C_H}\right)}_{N_c\times {N}_c} \) denote the summed contact matrix for chromosome C, where the symbol CH represents the maternal (CM) or paternal (CP) contact matrix and Nc denotes the number of chromosomal bins. The non-zero contact ratio for chromosomal bin k was defined as rk = K/NC, where K is the number of non-zero contacts in the row vector \( {O}_k^{C_H}=\left({O}_{k1}^{C_H},{O}_{k2}^{C_H},\cdots, {O}_{k{N}_C}^{C_H}\right) \). Let rt denote the lower 25 percentile (Supplementary figure 3A) of all non-zero contact ratios \( \left\{{r}_1,{r}_2,\cdots, {r}_{N_C}\right\} \), and the threshold of rt was defined as \( \mathrm{t}=\left\{\begin{array}{c}{r}^t,{r}^t\le 0.2\\ {}0.2,{r}^t>0.2\end{array}\right. \) (Supplementary figure 3B). Then the bin k was defined as gap bin if rk ≤ t in either maternal or paternal summed contact matrix. If chromosomal bin k was not gap bin, let \( {f}_k^{C_M}={\sum}_{i=1}^{N_C}{O}_{ki}^{C_M} \) and \( {f}_k^{C_P}={\sum}_{i=1}^{N_C}{O}_{ki}^{C_P} \) denote the summations of maternal and paternal contact frequencies respectively, and \( {f}_k^C={\sum}_{i=0}^{N_C}{O}_{ki}^C \) denote the corresponding summation from haploid contact matrix \( {\left({O}_{ij}^C\right)}_{N_c\times {N}_c} \). The allele-assignment ratio led by SNP density was defined as \( {\alpha}_k^C=\left({f}_k^{C_M}+{f}_k^{C_P}\right)/{f}_k^C \). Then the SNP-bias correction factor for bin k was defined as \( {\beta}_k^C=\left\{\begin{array}{cc}1,&\ if\ {\alpha}_k^c\ge {\alpha}_{max}^c\\ {}{\alpha}_k^c/{\alpha}_{max}^c,& if\ {\alpha}_k^c<{\alpha}_{max}^c\end{array}\right. \), where \( {\alpha}_{max}^C \) denotes the upper 80 percentile (Supplementary figure 3C) of all available allele-assignment ratios in chromosome C by excluding gap bins. The row vector k in the summed contact matrix \( {\left({O}_{ij}^{C_H}\right)}_{N_c\times {N}_c} \) was corrected by \( {Q}_k^{C_H}={O}_k^{C_H}/{\beta}_k^C \). The inter-haplotype contact frequencies for chromosomal bin k were corrected in the same way by using the SNP-bias correction factor \( {\beta}_k^C \).
Next the corrected matrix was symmetrized by: \( {M}_{ij}^{C_H}=\left\{\begin{array}{cc}{Q}_{ij}^{C_H},& i=j\\ {}\max \left({Q}_{ij}^{C_H},{Q}_{ji}^{C_H}\right),& i\ or\ j\in gap\ bins\\ {}\left({Q}_{ij}^{C_H}+{Q}_{ji}^{C_H}\right)/2,& others\end{array}\right. \). Since many entries in this symmetric matrix exhibited zero or nearly zero values, we primarily adopted VC normalization [1, 2] to robustly reduce the biases caused by other sources: \( {M}_{ij}^{\prime }={M}_{ij}^{C_H}/\left(\sqrt[3]{W_i^2}\times \sqrt[3]{L_j^2}\right) \), where Wi and Lj are the summations of ith row and jth column in the symmetric matrix \( \left({M}_{ij}^{C_H}\right) \). However, the iterative correction [25] was also provided for diploid contact matrices in HiCHap. Finally, the average frequency of bias-corrected contact matrix was recalibrated to the average frequency of original contact matrix by: \( {f}_{ij}=\left( ave\left({O}_{ij}^{C_H}\right)/ ave\left({M}_{ij}^{\prime}\right)\right)\times {M}_{ij}^{\prime } \), where ave() denotes the average value of all contact frequencies in the matrix. The inter-haplotype contact matrices were symmetrized, normalized and recalibrated in the similar way by using inter-haplotype contact frequencies instead of intra-haplotype contact frequencies.
Identification of chromatin 3D organization
For the corrected haploid contact matrices, compartments were identified by using the principal component analysis [1]. The first, second and third principal components were selected as candidates to determine compartment A and B. For principal component i, let \( {\mu}_A^i \) and \( {\mu}_B^i \) denote the average strengths among the same kind of compartmental bins (A and B) respectively, in which the correlation matrices were used for strength calculations. Similarly, let \( {\mu}_{AB}^i \) denote the average strength among compartment A and compartment B. Then the principal component with the maximum value of \( \left({\mu}_A^i+{\mu}_B^i\right)/2-{\mu}_{AB}^i \) was selected as the final one for compartment identification. The topological domains were called by using the hidden markov model with modified directionality index (DI) [9, 10], in which three Gaussian mixed distributions were used in this work. The chromatin loops were called by using the modified HiCCUPS algorithm proposed previously [15]. The minor modification in this work was that the called significant contacts were removed from loop calling if their contact frequencies were below the corresponding median frequencies calculated from all contacts with the same genomic distance.
As for the corrected diploid contact matrices, the maternal and paternal compartments, topological domains and chromatin loops were identified in similar procedures to the haploid ones. However, different kinds of gaps in these sparse contact matrices should be taken into consideration to reduce the false-positive results. For a given chromosomal bin, if less than 5% intra-haplotype contacts showed non-zero contact frequencies in either maternal or paternal contact matrices, the row and column of this bin, called compartment gap in this work, were removed in compartment calculation (Supplementary figure 2B). For a given maternal (and paternal) contact matrix, the first, second and third principal components were used to calculate the Pearson Correlation Coefficients (PCCs) to the PC1 derived from haploid contact matrix respectively. The principal component with largest absolute PCC was selected as final one, and it was multiplied by the corresponding PCC sign to keep the identified compartment consistent. For a selected chromosome, if all PCCs were smaller than given threshold (0.7 in this work), the exception was reported. In boundary calling, if less than 80% local intra-haplotype contacts were non-zero contacts in the given window size, this chromosomal bin, called boundary gap in this work, was removed in calculation (Supplementary figure 2C). For an initially called boundary, if there existed 3 or more boundary gaps in upstream 7 bins, this boundary was not considered to be the end of topological domain. If there existed 3 or more boundary gaps in downstream 7 bins, this boundary was not considered to be the start of topological domain. If more than one third bins between two neighbor boundaries were boundary gaps, the chromatin region between these two boundaries was not considered to be a domain due to the potential existence of another boundary in these gaps. For a given contact, if both anchors were localized in the gap bins, this contact was removed in loop calling. If any one of its four neighbor contacts showed zero value (called loop gap in this work), this contact was also removed in loop calling (Supplementary figure 2D). To further remove the incredible loops, the initially called loops were weighted by \( {S}_l={f}_{ij}^l\times \left(-{\log}_{10}q\right) \), where \( {f}_{ij}^l \) denotes the contact frequencies of the initially called loops and q denotes the weighted q-values generated in loop calling. The loops with 15% lowest scores were further removed. Altogether, these empirical parameters in removing different types of gaps can be optionally set in HiCHap.
Testing on allele-specific chromatin 3D organization
For maternal and paternal compartments, the chromosomal bins with changed compartmental signs were selected as initial set. Since these compartmental transitions could be generated from biological variations (Supplementary figure 4), the permutation test was used to select reliable compartmental transitions. Specifically, for the transitioned bins, the entry values in both maternal and paternal eigenvectors were selected to build the maternal and paternal entry sets respectively. Randomly selected one value from maternal entry set and one value from paternal entry set, and calculated the value difference by maternal entry value minus paternal entry value. Repeated this calculation to obtain enough values to draw the empirical distribution. For each transitioned bin in initial set, the matched entry-value difference between maternal and paternal eigenvectors was used to calculate the p-value by using the empirical distribution.
The topological boundaries from haploid, maternal and paternal contact matrices were merged in testing allele-specific boundaries. The boundaries from three kinds of contact matrices were aligned by using the threshold of 3 bins. For the exactly aligned boundary, the insulation score was calculated for each contact in the lower triangle of a 10 × 10 square window (Allele-specific analysis in Fig. 1). If more than 70% contacts exhibited zero values in either maternal or paternal window, the boundary was removed from calculation. Then the contact insulation scores between maternal and paternal windows were paired to calculate the p-value by using the paired t-test. If maternal and paternal boundaries were aligned but without exactly same position, the genomic positions of maternal and paternal boundaries were used for p-value calculations respectively. The lower p-value in the two calculations was used for the aligned boundary. If only maternal or paternal boundary was called, the genomic position of the called boundary was used in the calculation. If both maternal and paternal boundaries were not called due to algorithmic sensitivity, the position of haploid boundaries was used for calculation.
The haploid chromatin loops anchored in the aforementioned gap bins were removed in allele-specific testing. The haploid, maternal and paternal chromatin loops were aligned by using previously proposed method [15]. The binomial distribution B(n, p) was used to calculate the p-value for each aligned chromatin loop, where \( \mathrm{p}=\sum {f}_{loop}^m/\left(\sum {f}_{loop}^m+\sum {f}_{loop}^p\right) \), and \( {f}_{loop}^m \) and \( {f}_{loop}^p \) denote the contact frequencies of maternal and paternal chromatin loops respectively. If the maternal and paternal chromatin loops were aligned but not matched exactly in position, the contact frequencies in their own positions were used for calculation. If only maternal or paternal chromatin loop was called, the two contact frequencies in the position of called chromatin loop were used for calculation. If both maternal and paternal chromatin loops were not called due to algorithmic sensitivity, the two contact frequencies in the position of haploid chromatin loop were used for calculation.
Other calculations
To evaluate the parameter robustness in SNP-bias correction in HiCHap (Supplementary figure 3), the matrix similarities among different parameter values were calculated by using the reproducibility analysis from HiCRep [26]. The relationship between allelic contact number and the SNP number was calculated by using kernel density estimation with Gaussian kernel function. The PCCs between maternal (or paternal) principal components and haploid principal components were calculated chromosome-by-chromosome, and the maternal and paternal PCCs were combined in presentation. The PCCs for DI were calculated and analyzed in the similar way. The boundary insulation scores were calculated by following a previous method [27]. The minor modification was that the zero-value contacts were excluded from calculations to reduce the negative impact of different kinds of gaps. The averaged contact maps of maternally biased and paternally biased boundaries were also calculated by excluding zero-value contacts. The loop strength was calculated by following the previous pipeline [15] modified from aggregate peak analysis [2]. The averaged contact maps of maternally biased and paternally biased chromatin loops were calculated by excluding loop gaps. Since the constructed diploid contact matrices were sparse at high resolutions, the maternal and paternal chromatin loops were mainly called and analyzed at the 40 kb resolution if there was no explicit statement. Occasionally, the 20 kb resolution was used for comparisons in this work. To obtain enough number of allele-specific chromatin 3D organizations for functional analyses, the allele-specific compartments with p-values smaller than 0.05 and the maternal/paternal PC entry ratios (or vice versa) larger than 1.5 were selected, and the allele-specific chromatin loops with p-values smaller than 0.05 and the maternal/paternal contact ratios (or vice versa) larger than 1.5 were selected. The allele-specific boundaries with adjusted p-values smaller than 0.01 were selected for statistics.
In the ChIP-Seq data processing, the reads were aligned to maternal and paternal genomes respectively. The obtained read was assigned to be maternal one if the maternally matched SNPs outnumber the paternally matched SNPs, and vice versa. In the enrichment analysis on the allele-specific compartments, the maternally and paternally mapped ChIP-Seq reads were processed by using deepTools with RPKM normalization under 1 kb bin size [28], and the averaged RPKM was subsequently calculated by excluding zeros in each given compartmental bin at the 200 kb resolution. Then the read difference between maternal and paternal bins were calculated by using Wilcoxon signed rank test. In the correlation analysis, the CTCF peak with largest allelic ratio in the given chromatin loop (40 kb resolution) was selected for the correlation calculation, and the same procedure was used in the Rad21 peak analysis.