This section presents our method in detail. First, we model the chromosome as a continuous linear polymer. We then show how the contacts from Hi-C can be transformed and represented succinctly in the form of a distance matrix. The reconstruction works by assigning coordinates to chromosome loci progressively. After all the coordinates are assigned, we refine the coordinates iteratively through local search.
Structure modeling
In our algorithm, the chromosome is modeled as a continuous linear polymer composed of many chromosomal loci. For example, human chromosome 1 can be modeled as ∼25,000 bins at a resolution (number of base pairs per bin) of 10 kbp (kilobase pairs). Each bin represents a locus within the chromosome. During the structure reconstruction step, determining each locus’s location is simplified to determining the 3D position of the locus’s centroid. Although this model omits the local structure at each locus, it is the most widely accepted representation and is considered the most accurate model achievable at the resolution of current 3C-based techniques [35, 36, 41]. The 3D positions of all the loci are collectively referred to as a structure or a configuration. In this work, we let n denote the number of loci.
Converting contact frequency to distance
To infer 3D structures from Hi-C experimental data, the pairwise distances between loci are first computed. Most Hi-C protocols are cell population-based experiments; they provide the average contact frequencies across different cells. Each pair of loci i and j is thus associated with contacts accumulated over m replicates with different structures, from which a normalized contact frequency fij can be inferred from the Hi-C dataset. Many structure inference methods assume a power-law relationship between contact frequency and 3D distance, allowing a contact frequency (fij) to be converted into a corresponding distance (dij) through the equation \(d_{ij}=1/f_{ij}^{\alpha }\). The power-law coefficient α varies across datasets and needs to be estimated using other techniques. The special case where α=1 is called inverse frequency (IF). Given a fixed α and an inferred structure X, the goodness of fit is calculated as \(\sum _{f_{ij}>0}\left (f^{\prime }_{ij}-f_{ij}\right)^{2}\), where \(f_{ij}^{\prime }=1/d(X)_{ij}^{1/\alpha }\). Similar to ChromSDE, we assume that 0.1≤α≤3 (a range covering most values of α in previous studies) and use a golden-section search to find the correct α by minimizing this goodness-of-fit function. This approach faces several challenges. First, Hi-C captures only about 2.5% of the contacting locus pairs, with large variations in the captured frequencies; moreover, these contacts are reliable only for physically close loci. The power-law relation \(d_{ij}=1/f_{ij}^{\alpha }\) also yields extremely large (or, for zero-frequency pairs, infinite) distances for low-frequency contact pairs. While this can be remedied by enforcing an upper bound on the distances, a criterion for choosing the upper bound is difficult to derive. Furthermore, the distances converted with a power-law relationship do not form a metric, rendering unusable many computations that require metric distances.
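For illustration, the following sketch (in Python, with our own helper names; not the authors' implementation) shows the power-law conversion and a golden-section search over α, where `reconstruct` is a placeholder for the structure-inference pipeline developed in the subsections below.

```python
# Illustrative sketch only: freq_to_dist, goodness and fit_alpha are our
# own names; `reconstruct` is a placeholder for the MDS pipeline below.
import numpy as np

def freq_to_dist(F, alpha):
    # d_ij = 1 / f_ij^alpha; zero-frequency pairs become +inf and are
    # later replaced by shortest-path distances (next subsection).
    with np.errstate(divide="ignore"):
        return 1.0 / np.power(F, alpha)

def goodness(F, X, alpha):
    # sum over f_ij > 0 of (f'_ij - f_ij)^2, with f'_ij = 1/d(X)_ij^(1/alpha).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    mask = (F > 0) & (D > 0)                  # skip the diagonal (d = 0)
    return float(np.sum((1.0 / D[mask] ** (1.0 / alpha) - F[mask]) ** 2))

def fit_alpha(F, reconstruct, lo=0.1, hi=3.0, tol=1e-3):
    # Golden-section search for alpha in [0.1, 3].
    phi = (np.sqrt(5.0) - 1.0) / 2.0
    f = lambda a: goodness(F, reconstruct(freq_to_dist(F, a)), a)
    a, b = lo, hi
    while b - a > tol:
        c, d = b - phi * (b - a), a + phi * (b - a)
        if f(c) < f(d):
            b = d                             # minimum lies in [a, d]
        else:
            a = c                             # minimum lies in [c, b]
    return 0.5 * (a + b)
```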
Recently, Lesne et al. solved these problems elegantly using the shortest-path method from graph theory [36]. They modeled the contact matrix as a connected graph, in which each vertex represents a locus and each edge is weighted by the inverse contact frequency of the corresponding locus pair. The final distance between loci i and j is then taken as the shortest-path distance within the graph. Not only is the shortest-path distance a metric that provides a tighter estimate of locus distance, the approach also mitigates the problem caused by low-frequency contact pairs.
Computing shortest-path distances is, however, both time- and space-consuming, since it requires the computation of all-pairs shortest paths. Hence we propose a method to approximate these distances. We randomly choose ℓ (ℓ≪n) loci as pivots and denote this set of loci as P. We then compute the single-source shortest-path distances from each pivot locus to all n loci. Denote the shortest distance from pivot p to a vertex v as dp(v). With these shortest-path distances from the pivots, we approximate the remaining shortest-path distances. Given a pair of loci i, j: if i or j is a pivot, we obtain their shortest distance directly from the computed shortest-path distances; otherwise, we use \({\min _{p\in P}}d_{p}(i)+d_{p}(j)\) to approximate the shortest distance between i and j. By increasing the number of pivots, the approximate shortest-path distance can be made arbitrarily close to the true shortest-path distance.
To reduce space consumption while computing the shortest-path distances, we adopt an adjacency-list representation for the distance matrix derived from the Hi-C dataset. Also, the approximated distances are not stored; we merely store the ℓ sets of shortest-path distances from the pivots and compute the approximated distances on the fly. This data structure reduces the space complexity of storing an n×n distance matrix from \(O(n^{2})\) to \(O(\ell n+e)\), where e is the number of significant Hi-C contacts. See Additional file 1: Figure S8 for a comparison of run-time memory usage.
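A minimal sketch of the pivot scheme follows, assuming the contact graph is stored as an adjacency list `adj[u] = [(v, w), ...]` with edge weights 1/fuv; all function names here are ours.

```python
# Sketch of the pivot-based shortest-path approximation; only the l
# pivot distance arrays are stored, giving O(l*n + e) space.
import heapq, random

def dijkstra(adj, src, n):
    # Single-source shortest paths over the sparse contact graph.
    dist = [float("inf")] * n
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue                          # stale queue entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

def pivot_distances(adj, n, num_pivots):
    pivots = random.sample(range(n), num_pivots)
    return {p: dijkstra(adj, p, n) for p in pivots}

def approx_dist(dp, i, j):
    # Exact if i or j is a pivot; otherwise the on-the-fly upper bound
    # min over pivots p of d_p(i) + d_p(j).
    if i in dp:
        return dp[i][j]
    if j in dp:
        return dp[j][i]
    return min(d[i] + d[j] for d in dp.values())
```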
Assigning coordinates progressively
The structure reconstruction problem is often formulated as follows: given the pairwise distances (with errors) of all loci, find a 3D configuration X for those loci that satisfies the distance constraints. Denote the distance between loci i and j inferred from contact information as \(\hat {d}_{ij}\), and denote the Euclidean distance between loci i and j in a configuration X as dij(X). The problem can be solved by minimizing the following objective function:
$$ \min_{X}\sum_{1\le i\le j\le n}\left(d_{ij}(X)-\hat{d}_{ij}\right)^{2} $$
(1)
which can be solved by multidimensional scaling (MDS) [45]. Such an approach has been utilized by several groups to reconstruct chromatin structures [18, 24, 36, 42]. It performs well on Euclidean distances with small errors. However, the distances inferred from contact information suffer from large errors, which are especially significant for the larger distances due to the underlying mechanism of Hi-C. This prompts us to adapt the formulation to a weighted one:
$$ \min_{X}\sum_{1\le i\le j\le n}w_{ij}\left(d_{ij}(X)-\hat{d}_{ij}\right)^{2} $$
(2)
where a weight wij is assigned to each locus pair i and j according to their distance \(\hat {d}_{ij}\), allowing us to give higher weights to closer pairs. Like the earlier problem, this problem can be solved through WMDS (weighted multidimensional scaling) [46].
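For concreteness, the weighted objective in Eq. 2 can be evaluated directly as follows (a sketch with our own variable names, not part of the authors' solver):

```python
# Direct evaluation of the weighted stress of Eq. 2 for a configuration X.
import numpy as np

def weighted_stress(X, D_hat, W):
    # sum over i < j of w_ij * (d_ij(X) - d_hat_ij)^2
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)
    return float(np.sum(W[iu] * (D[iu] - D_hat[iu]) ** 2))
```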
However, the use of MDS and WMDS remains time- and space-consuming and will not scale to larger problems. To solve this, we propose progressive solutions for the MDS and WMDS problems, namely iMDS (iterative MDS) and sMDS (scalable MDS). Both methods rely on conducting MDS on subsets of loci.
Our proposed approaches are based on the following insight: a distance matrix of size n×n contains \(n \choose 2\) variables, whereas the inferred structure contains only 3n−6 free variables. That is, the distance matrix contains information that may be considered redundant and can potentially be discarded without affecting the quality of the inferred structure. Hence, our approach reduces these redundant values, which we expect to lower the chances of errors. Our approach also naturally allows the assignment of larger weights to distances that are more reliable. Both points will become apparent in the subsequent subsections.
Iterative MDS (iMDS) for structure reconstruction
As in SC-MDS [47], we randomly split the set of loci into k overlapping subsets, s1,…,si,…,sk, with a small intersection Ii between si and si+1. Classical multidimensional scaling [48] is then performed on each subset to obtain the local coordinates of each locus. The subsets of loci are combined by first selecting s1 as the reference and then merging each si+1 into si iteratively until all the subsets are combined. Our combination method differs from SC-MDS in that it accounts for the reflection of objects, which we now describe.
Recombining local structures
To combine si+1 into si in iMDS, we use Ii as a set of anchors. Denote the local coordinates of Ii in si and si+1 as PI and \(P^{\prime }_{I}\), respectively. We want to superimpose \(P^{\prime }_{I}\) onto PI with a rigid transformation (a translation T and a rotation R) that minimizes the RMSD; this problem is known to be solvable in linear time [49]. To handle the reflection problem, both \(P^{\prime }_{I}\) and its mirror image are superimposed onto PI. If the RMSD between the mirror of \(P^{\prime }_{I}\) and PI is smaller than that between \(P^{\prime }_{I}\) and PI, the coordinates in si+1 are replaced with those of the mirror of si+1.
$$ \mathrm{RMSD}=\min_{R,T}\sqrt{\frac{1}{N}\sum\limits_{i=1}^{N}\left\| p_{i}^{\prime}-(Rp_{i}+T)\right\|^{2}} $$
(3)
After obtaining T and R, we apply them to the coordinates of si+1 so that si+1 and si share the same frame of reference; this integrates si+1 into si. In addition, we update the coordinates of Ii by averaging PI with the coordinates of \(P^{\prime }_{I}\) transformed by R and T.
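This combination step can be illustrated with the standard SVD-based (Kabsch) superposition plus the mirror check; the sketch below uses our own function names and is an illustration, not the paper's implementation.

```python
# Sketch: rigid superposition (Kabsch) with an explicit mirror check.
import numpy as np

def kabsch(P, Q):
    # Rotation R and translation T minimising the RMSD of R*q + T vs p.
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (Q - cQ).T @ (P - cP)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # enforce a proper rotation
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cP - R @ cQ

def rmsd(P, Q):
    return np.sqrt(np.mean(np.sum((P - Q) ** 2, axis=1)))

def combine(anchors_ref, anchors_new, coords_new):
    # Superimpose both s_{i+1} and its mirror image onto the reference
    # via the shared anchors I_i; keep whichever fits better.
    best = None
    for sign in (1.0, -1.0):                  # -1.0 is the mirror image
        A, Q = sign * anchors_new, sign * coords_new
        R, T = kabsch(anchors_ref, A)
        err = rmsd(anchors_ref, (R @ A.T).T + T)
        if best is None or err < best[0]:
            best = (err, (R @ Q.T).T + T)
    return best[1]                            # s_{i+1} in the reference frame
```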
Successive subsetting and combination As discussed in [47], grouping only neighboring loci is not as beneficial as grouping both close and distant loci. Hence, we randomly split all loci into k overlapping subsets. To combine two 3D point sets successfully, we need to select at least 4 points that do not lie on the same plane (points on the same plane cannot uniquely determine a structure in 3D space); additionally, too small a number of loci may lead to a poor estimation of the rotation and translation. Hence, we need to choose enough loci for Ii. In practice, we set the number of loci in Ii to 50, which is large enough for successive combinations in three-dimensional space. The performance of random subsetting and intersection is described in Additional file 1: S2.
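One simple way to generate such subsets is sketched below (assuming consecutive subsets share an intersection Ii of 50 loci; the construction details are our own):

```python
# Sketch: random overlapping subsets, each sharing `overlap` anchor loci
# with its predecessor.
import numpy as np

def overlapping_subsets(n, k, overlap=50, seed=0):
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(n), k)
    subsets = [chunks[0]]
    for c in chunks[1:]:
        anchors = subsets[-1][-overlap:]      # intersection I_i
        subsets.append(np.concatenate([anchors, c]))
    return subsets
```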
Due to Hi-C’s mechanism, the measured distances \(\hat {d}\) have different degrees of reliability, which may adversely affect MDS and iMDS. This situation worsens when noise is elevated. To address this, we propose a scalable MDS to polish the structure obtained by iMDS.
Incorporating different distances
The structure from iMDS is further refined to better agree with the data from Hi-C experiments. In Eq. 2, a weight wij was introduced for each distance \(\hat {d}_{ij}\) from Hi-C. Since shorter distances are more reliable than longer ones, we set \(w_{ij}=\hat {d}_{ij}^{-2}\) [42] to decrease the influence of the longer distances. We found this weighting scheme to be more robust than other schemes such as wij=1 and \(w_{ij}=1/\hat {d}_{ij}\) [35, 36] (Additional file 1: S3).
However, computing WMDS within our framework remains infeasible for large datasets due to its time and space complexity. We propose a scalable MDS approach to address this issue. To the best of our knowledge, our approach is among the very few currently available for large-scale WMDS.
Scalable MDS
Our proposed scalable MDS is an iterative procedure that employs WMDS as a subroutine. At each iteration, we permute the loci randomly and partition them, in the permuted order, into sets of size k. Then we apply WMDS to each resulting set S. Denote the coordinates of the loci in S before and after executing WMDS as PS and \(P_{S}^{\prime }\), respectively. We compute the optimal (minimum-RMSD) superposition that maps each locus in \(P_{S}^{\prime }\) onto its counterpart in PS. We then update each locus’s coordinates to the average of its values in PS and in \(P_{S}^{\prime }\) after the superposition. After we iterate through every set S, we restart the process for another round; this is repeated until the coordinates of the loci converge. The performance of random subsetting and iteration is described in Additional file 1: S2, and the sensitivity analysis of parameter settings is described in Additional file 1: S4.
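The outer loop of sMDS might be sketched as follows, with `wmds` standing in for any weighted-MDS solver applied to a k-locus submatrix and `kabsch` as in the earlier sketch; the names and the convergence test are our assumptions.

```python
# Sketch of the sMDS outer loop; wmds(D, W, init) is a placeholder for a
# weighted-MDS subroutine and kabsch() is defined in the earlier sketch.
import numpy as np

def smds(X, D_hat, W, wmds, k=200, rounds=50, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(rounds):
        X_prev = X.copy()
        order = rng.permutation(n)
        for start in range(0, n, k):
            S = order[start:start + k]        # one cell of the random partition
            X_new = wmds(D_hat[np.ix_(S, S)], W[np.ix_(S, S)], init=X[S])
            R, T = kabsch(X[S], X_new)        # map back to the current frame
            X[S] = 0.5 * (X[S] + (R @ X_new.T).T + T)   # average old and new
        if np.max(np.abs(X - X_prev)) < tol:  # stop once coordinates converge
            break
    return X
```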
WMDS is an iterative algorithm whose time complexity is \(O(n^{3})\) per iteration. In comparison, the time complexity of sMDS for updating all loci once is \(O(nk^{2})\). Although sMDS needs more iterations than WMDS (about 10 times more in practice), sMDS is still faster since k≪n.
Simulated contact maps
Binary contact map
We used known 3D structures as the basis for our simulation. For a given 3D structure, we constructed a binary matrix to store the pairwise contact information, with the top k nearest pairs of loci set to 1 and others set to 0.
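A minimal sketch of this construction (function name ours):

```python
# Sketch: binary contact map with the k nearest locus pairs set to 1.
import numpy as np

def binary_contact_map(X, k):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(n, k=1)
    nearest = np.argsort(D[iu])[:k]           # the k closest pairs
    M = np.zeros((n, n), dtype=int)
    M[iu[0][nearest], iu[1][nearest]] = 1
    return M | M.T                            # symmetrise
```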
Poisson distribution model
For a given 3D structure with pairwise distances dij, we generated the M(i,j) entry of a simulated contact map M as a Poisson-distributed random number nij. The Poisson parameter λij is defined as \(\lambda _{ij}=\beta /d_{ij}^{\alpha }\), based on the power-law conversion between distance and contact frequency; β is a tunable parameter that controls the signal coverage of the contact map.
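A minimal sketch of this model (parameter defaults and the symmetrisation are our assumptions):

```python
# Sketch: Poisson contact map with lambda_ij = beta / d_ij^alpha.
import numpy as np

def poisson_contact_map(X, alpha=1.0, beta=100.0, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)               # no self-contacts (lambda = 0)
    upper = np.triu(rng.poisson(beta / D ** alpha), 1)
    return upper + upper.T                    # symmetric counts n_ij
```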
Data preparation
In silico genome structure
The in silico nuclei with 1 and 16 chromosomes (Additional file 1: Table S1) used to test our software were generated using a polymer model. The coordinates were taken from a Langevin dynamics simulation after it reached thermal equilibrium. We then constructed Poisson distribution-based contact maps and binary contact maps for the in silico structures. In addition to the contact matrices corresponding to the in silico chromosomes, we also included contact maps generated from the Poisson distribution model, at different signal coverage levels, for the regular helix structure used as a benchmark dataset by Zou et al. [37], as well as for the structure of chromosome 2 reconstructed from a real Hi-C dataset (mESC).
Hi-C experimental dataset
In this study, we used Hi-C data from two cell lines, mESC and GM12878. The raw sequence data were transformed into contact maps with Juicer [50]. The matrix balancing method described in [51] was used to normalize the contact matrices in order to remove biases in the Hi-C datasets.