GAM and hi-C data
We downloaded the raw co-segregation GAM data for 408 nuclear profiles (NPs) from GEO (GSE64881) at two resolutions (i.e., 1 Mb and 30 kb). The raw intra-chromosomal (cis) GAM contact matrices were generated using the co-segregation data as input.
We downloaded the raw Hi-C reads for mouse embryonic stem (mES) cells from GEO (GSE35156) and combined two HindIII replicates to generate the raw Hi-C contact matrices at 1 Mb resolution.
Definition of the systematics biases
GC content bias and mappability bias
The GC content and mappability biases in GAM are the same as in Hi-C [17, 18]. Therefore, detailed definitions will not be shown here. The GC content bias of a certain bin is the GC content of its DNA sequence. The mappability bias of a certain bin is generated in the same way as in scHiCNorm [19].
Window detection frequency bias
The existence of window detection frequency (WDF) bias in raw GAM data has been mentioned in [3] and is intuitively apparent because GAM cannot ensure that all of the bins are detected with the same frequency. Our analysis has proved the existence of this bias (data shown later), which indicates that the random orientations of the GAM cryosections cannot make all the bins to have the same chance to be detected in the NPs. For example, bin A1 is detected three times in Fig. 1a, A2 is detected two times, and A3 two times. Accordingly, we define their WDF biases as 3, 2, and 2.
Fragment length bias
The fragment length bias is based on similar idea as the window detection frequency bias. The window detection frequency bias is caused by the inconsistent frequencies for bins to be detected in the NPs, whereas the fragment length bias is also a type of bias for each bin but is caused by the inconsistent frequencies for fragments to be detected in the NPs. Our evaluation results will later prove that both two biases exist in raw GAM contact matrices.
For example, in Fig. 1a the fragment length for the blue chromatin segment is three bins based on NP1. For NP2, it is one bin, and three bins in terms of NP3. Therefore, we define the average fragment length of bin A1 as (3 + 1 + 3)/3 based on all of the three NPs. Similarly, we define the average fragment length of bin A2 as (3 + 0 + 3)/3 according to the three NPs. We use the average fragment lengths from all NPs as the fragment length bias of the bin. In our software, we omit the division of the number of NPs because this value is the same for all the bins. In this way, the fragment length biases of A1, A2, and A3 are 7, 6, and 6 in the example in Fig. 1.
The WDF bias and the newly found fragment length bias in the raw GAM data can be quantified by the ld_nld_cis function in our normGAM package. However, when a user is using the package to normalize raw GAM data, he/she does not need to quantify these biases in order to run the normalization algorithms. Instead, the algorithm can directly remove all the biases. The quantification of these biases is to prove the existence of the biases and to evaluate the performance of our bias-removing algorithms. If GC content and mappability biases at different resolutions are of interest, users can obtain the corresponding bias data for different reference genomes from our previous normalization method scHiCNorm [19].
Normalization methods
The normalization methods developed for raw Hi-C contact matrices (e.g., KR [2, 22] and HiCNorm [18]) may not be directly used for raw GAM data because (1) the input of KR algorithm needs to be a non-negative symmetric matrix, but the raw GAM contact matrices may contain negative entries; (2) HiCNorm [18] and scHiCNorm [19] were designed for count data, but the raw GAM contacts were composed of real numbers. To address the first problem, we developed a new method KR2, the 2-norm of KR, which was able to handle negative contact matrices.
The Knight-Ruiz (KR) [22] algorithm was designed to balance a matrix. Given a non-negative symmetric matrix D, the algorithm tries to find a vector x such that
$$ \mathit{\operatorname{diag}}(x) Dx=e, $$
(1)
where diag(x) is a diagonal matrix converted from x, and e represents a vector of all ones.
Eq. (1) can be turned into:
$$ f\left({x}_{\ast}\right)=\mathit{\operatorname{diag}}\left({x}_{\ast}\right)D{x}_{\ast }-e=0. $$
(2)
We can obtain the iterative from eq. (2):
$$ {x}_{k+1}=\mathit{\operatorname{diag}}{\left(D{x}_k\right)}^{-1}e. $$
(3)
We can also use Newton’s method to get an alternative to eq. (3):
$$ {x}_{k+1}={x}_k-f\left({x}_k\right)/{f}^{\prime}\left({x}_k\right)={x}_k-{\left(\mathit{\operatorname{diag}}\left({x}_k\right)D+\mathit{\operatorname{diag}}\left(D{x}_k\right)\right)}^{-1}\left(\mathit{\operatorname{diag}}\left({x}_k\right)D{x}_k-e\right), $$
(4)
which can be rearranged as:
$$ \left(D+\mathit{\operatorname{diag}}{\left({x}_k\right)}^{-1}\mathit{\operatorname{diag}}\left(D{x}_k\right)\right){x}_{k+1}=D{x}_k+\mathit{\operatorname{diag}}{\left({x}_k\right)}^{-1}e. $$
(5)
Let Dk = D + diag (xk)−1 diag (Dxk), which is symmetric as D. The next step is to use inner-outer iteration schemes with conjugate gradient method to solve eq. (5). When both sides of eq. (5) are premultiplied by diag(xk), we can get:
$$ \left({B}_k+\mathit{\operatorname{diag}}\left({B}_ke\right)\right){y}_{k+1}=\left({B}_k+I\right)e, $$
(6)
where Bk = diag (xk)Ddiag(xk) and yk + 1 = diag (xk)−1xk + 1. Eq. (6) can be solved using conjugate gradient iteration. More details about the initial values and stopping criteria can be found in [22]. The final balanced matrix can be obtained by diag(x)Ddiag(x). In summary, the original KR was designed for balancing a matrix in the 1-norm. Here, we present KR2 for balancing a GAM contact matrix in the 2-norm by taking the following three steps. We first conduct element-wise product by D2 = D · D and then run the original KR algorithm on D2 to get balanced matrix D3 = KR(D2). Finally, we obtain the normalized matrix Dnorm = D3 · sign (D), where sign(D) denotes signed all-ones matrix; and the entries’ signs are consistent with the corresponding entries in D.
The vanilla coverage (VC) [1], sequential component (SCN) [20], and iterative correction and eigenvector decomposition (ICE) [21] normalization methods share similarities in their methodologies. The details of the three methods are given below. Each normalized entry is calculated by Dij/(eiej), where Dij is the entry in the raw contact matrix D.
As for VC, ei is 1-norm of the ith row; and ej is 1-norm of the jth column. As for SCN, ei is 2-norm of the ith row; and ej is 2-norm of the jth column. The original SCN also uses maximum iterations to reduce errors, but we have found that SCN performs better when conducting only one iteration on GAM data. Therefore, the normalization results from SCN in this work were all using one iteration.
As for ICE, ei is 1-norm of the ith row divided by its mean value over non-zero bins; and ej is 1-norm of the jth column divided by its mean value over non-zero bins. There are two stopping criteria in ICE including the maximum iteration and error tolerance. Based on our experience, it is difficult to achieve a pre-defined error tolerance (e.g., 1e-6) for all chromosomes; and when we set a large value (e.g., 6000) to the maximum iteration the normalized values for several chromosomes are extremely irregular. Therefore, we set it to 100 in this work. If the error is less than 1e-3, the algorithm will be terminated.
Normalized Linkage Disequilibrium (NLD) was implemented in the same way as described in [3]. The normalized DAB is calculated by DAB/Dmax, and Dmax is the theoretical maximum and defined as:
$$ {D}_{max}=\left\{\begin{array}{c}\mathit{\min}\left\langle {f}_A{f}_B,\left(1-{f}_A\right)\left(1-{f}_B\right)\right\rangle\ if\ {D}_{AB}<0\\ {}\ \mathit{\min}\left\langle {f}_B\left(1-{f}_A\right),{f}_A\left(1-{f}_B\right)\right\rangle\ if\ {D}_{AB}>0.\end{array}\right. $$
(7)
Evaluation of the normalization methods
For each of the four biases (WDF, fragment length, mappability, and GC content), we sorted the chromatin bins in the whole genome (chr1 to chr19) based on their biases, stratified the whole bins of a genome into 20 sets, and then calculated the average number of GAM contacts between each pair of the 20 sets, resulting in a 20-by-20 heat map. We did the same operations on both the raw and normalized GAM contact matrices. A good normalization method can make the 20-by-20 heat map as smooth as possible, that is, the normalized 20-by-20 heat map is mostly occupied by one color.
Another evaluation measure is the Pearson’s correlation between the GAM data (both before and after normalization) and the four known biases. A good normalization method can achieve a very low correlation (i.e., close to zero) in terms of each of the four biases.