 Methodology article
 Open access
 Published:
A maximum likelihood algorithm for reconstructing 3D structures of human chromosomes from chromosomal contact data
BMC Genomics volume 19, Article number: 161 (2018)
Abstract
Background
The development of chromosomal conformation capture techniques, particularly, the HiC technique, has made the analysis and study of the spatial conformation of a genome an important topic in bioinformatics and computational biology. Aided by highthroughput next generation sequencing techniques, the HiC technique can generate genomewide, largescale intra and interchromosomal interaction data capable of describing in details the spatial interactions within a genome. These data can be used to reconstruct 3D structures of chromosomes that can be used to study DNA replication, gene regulation, genome interaction, genome folding, and genome function.
Results
Here, we introduce a maximum likelihood algorithm called 3DMax to construct the 3D structure of a chromosome from HiC data. 3DMax employs a maximum likelihood approach to infer the 3D structures of a chromosome, while automatically reestimating the conversion factor (α) for converting Interaction Frequency (IF) to distance. Our results show that the models generated by 3DMax from a simulated HiC dataset match the true models better than most of the existing methods. 3DMax is more robust to structural variability and noise. Compared on a real HiC dataset, 3DMax constructs chromosomal models that fit the data better than most methods, and it is faster than all other methods. The models reconstructed by 3DMax were consistent with fluorescent in situ hybridization (FISH) experiments and existing knowledge about the organization of human chromosomes, such as chromosome compartmentalization.
Conclusions
3DMax is an effective approach to reconstructing 3D chromosomal models. The results, and the models generated for the simulated and real HiC datasets are available here: http://sysbio.rnet.missouri.edu/bdm_download/3DMax/. The source code is available here: https://github.com/BDMLab/3DMax. A short video demonstrating how to use 3DMax can be found here: https://youtu.be/ehQUFWoHwfo.
Background
A set of all chromosomes within the nucleus of a eukaryotic cell constitutes its genome. Studies of the organization of chromosomes and genomes reveal that they are structurally organized within a cell [1,2,3]. Studies find that this organization influences many biological mechanisms such as DNA replication, DNA repair, DNA translocation, gene regulation, transcription efficiency, genome interpretation, epigenetic modification, and genome stability maintenance [1,2,3,4]. The fluorescent in situ hybridization (FISH) [5] was often used in the investigation of the threedimensional (3D) organization of a genome, but it cannot produce the layout of the genome structure at a large scale. The chromosome conformation capturing techniques such as 3C [6], 4C [7], 5C [8, 9], and HiC [10, 11] were developed to analyze the spatial organization of chromatin in a cell at a larger scale. The HiC technique can use next generation DNA sequencing to determine genomewide spatial chromosomal interactions.
Much progress has been made in recent years on the study of chromosome and genome 3D structure modeling. Several methods have been proposed to construct the structure of an individual chromosome or an entire genome from chromosome conformation capturing data [12,13,14,15,16,17,18,19,20,21,22,23,24]. Some of these methods perform chromosome/genome 3D structure modeling in a twostep process, which involves converting interaction frequencies (IF) between fragment pairs in HiC data to distances between them, and then inferring the 3D structures that best satisfies the distances. Methods that implement this twostep process are known as distance restraintbased methods. Several of such methods have been proposed, each of which varies in restraint representation and optimization methods adopted [14,15,16,17,18,19,20,21,22,23,24].
In [12], Duan et al. considered the genome 3D structure prediction problem as a constrained nonconvex optimization problem, and hence used an optimization solver (opensource software) IPOPT [25] to solve it. Bau et al. [13] also treated the 3D modeling problem as an optimization problem, and used the Integrated Modeling Platform (IMP) [26] to construct 3D structure models. The MCMC5C [14] method designed a probabilistic model for the interaction frequency (contact) data, and thereafter used a Markov Chain Monte Carlo (MCMC) approach to generate a representative structure from the data. ChromSDE [16] formulated the 3D structure modeling problem as a nonconvex nonlinear optimization problem, but then relaxed it as a semidefinite programming (SDP) problem. Bayesian 3D constructor (BACH) [17] is another method that employs MCMC to infer the 3D structure by maximizing the likelihood of the observed HiC data following a Poisson regression approach. MOGEN [19, 22] is a contact based method that is different from the rest, because it does not require the conversion of interaction frequencies to distances before structure construction. ShRec3D [24] is a twostep algorithm that uses the shortest path algorithm to realize chromosome structure construction. LorDG [27] uses a Lorentzian objective function to construct the 3D model of a chromosome or genome. Despite the significant progress made over the years, some of the distancebased chromosome structure modeling methods have several limitations: they may simply assume that the parameters used for converting interaction frequencies to distances are independent of input data and therefore are fixed for different datasets [8, 19], they may converge slowly (common for Markov chain Monte Carlo (MCMC) approach [28, 29]), and they sometimes require to adjust quite a few parameters [19, 22], making it difficult to use.
In this paper, we introduce a new method called 3DMax that uses a maximum likelihood approach to infer the 3D structures of a chromosome from HiC data. In the 3DMax algorithm, the conversion factor (α) parameter to convert IF to its distance equivalent is determined automatically from the data. We show that 3DMax is relatively faster than most of the existing methods, and it only depends on optimizing the structural coordinate of predicted models through the least square residuals. 3DMax is capable of translating contact data of a chromosome, or genome into an ensemble of probable 3D conformations to approximate the dynamic 3D genome structures of a population of cells of the same type. Our experiment also demonstrates how parameters such as the learning rate and the convergence constant (epsilon) can impact the performance of a constructed model. We also demonstrate the effect of using different normalization method on the different chromosome 3D structure prediction algorithms. We benchmarked 3DMax with several popular methods [13,14,15,16, 19], and the result showed that our method performed robustly in the presence of noise and structural variability. We applied our method to a synthetic chromosomal interaction dataset, and two experimentally generated HiC datasets: a karyotypically normal human lymphoblastic cell line (GM06990) and a malignant Bcell. We used the data from FISH experiments available for the cell lines as independent validations of the reconstructed 3D chromatin structures. We performed a comparative analysis of the performance of 3DMax and several existing 3D reconstruction methods on the HiC datasets normalized by three commonly used methods [30,31,32]. These experiments show that 3DMax is an effective method for reconstructing 3D chromosomal structures from HiC data.
Methods
Generally, before HiC data [10, 11] are used for model construction, they are converted to a matrix form known as a contact matrix or a contact map.
Chromosome contact map
A chromosome contact map is a N * N matrix, extracted from a HiC data, showing the number of interactions between chromosomal regions. The size of the matrix (N) is the number of equalsize regions of a chromosome. The length of equalsize regions (e.g. 1 Mb base pair) is called resolution. Each entry in the matrix contains a count of read pairs that connect two corresponding chromosome regions in a HiC experiment. Therefore, the chromosome contact matrix represents all the observed interactions between the regions (or bins) in a chromosome. The 3DMax algorithm takes as input a contact map to build the 3D structure of a chromosome.
Structure initialization
To structurally represent a chromosome, each of its regions (or bins) is represented by three coordinates (x, y, z) in 3D space. In 3DMax, the structure construction starts with a random initialization of the coordinates of all the regions such that they are in the range [− 0.5, 0.5] as in [19].
Maximum likelihood objective function of a chromosome structure
We used a log likelihood function as an objective function to compute chromosome structures from a contact map.
Let S stand for a 3D chromosome structure, and D represent the contact matrix data derived from a HiC dataset. The likelihood of S, P(DS), can be expressed as the product of the probabilities of individual data points (interaction frequencies or distances) in D conditioned on the structure S, if the data points are conditionally independent of each other given a S. In 3DMax structure modeling, the input contact matrix is converted to spatial distances based on the assumption that the IF and the distance have an inverse relationship [14,15,16,17,18]. The conversion method is explained in the Subsection “conversion of interaction frequency to spatial distance” later. By assuming that data points D_{ i } in D are conditionally independent given a structure S, we defined the likelihood (L(S)) in Eq. (1) as:
where n represents the total number of data points to be considered, and D_{i} represents the i^{th} data point (i.e., the distance between a pair of chromosomal regions derived from the contact matrix). Assumed that each data point i obeys the normal distribution, the probability of data point D_{ i } can be described as:
where D_{ i }^{s} which is the actual Euclidean distance of the pair of regions corresponding to D_{ i }, computed from (x,y,z) coordinates of the two regions in 3D structure S as in [33]. σ^{2} is the variance of the distance. By combining Eqs. (1) and (2), we obtain the likelihood estimate of a structure S:
By taking the logarithm of both sides of the Eq. (3), we obtain the log likelihood objective function in Eq. (4) for 3DMax chromosome structure reconstruction. Our goal is to find a structure S* that maximizes the likelihood function: L(SD).
With the assumption that the data is normally distributed according to Eq. (2), σ is calculated as in Eq. (5):
We eliminated the dependence of the objective function on σ parameter by plugging Eq. (5) into the log likelihood objective function in Eq. (4). Hence, the resulting objective function L(S) can be represented as in Eq. (6). The objective function in Eq. (6) depends only on the (x,y,z) coordinates of regions in the structure.
Gradient ascent optimization algorithm
We used the gradient ascent method to optimize the objective function iteratively until the 3DMax algorithm converges. 3DMax algorithm is considered converged, if the difference between the newly calculated log likelihood L(S) function value obtained with updated (x, y, z) coordinates and old L(S) function value of the previous step is less than a small constant value (epsilon). The determination of the epsilon value is described in the Results section.
Gradient ascent is an iterative optimization algorithm that moves in the direction of the function gradient. Using Eq. (6) as the base equation, we calculated the partial derivative of the log likelihood function with respect to a region’s x, y, and z coordinates in a 3D structure.
Once the partial derivative for each coordinate was obtained, we used the gradient ascent optimization method to adjust each coordinate to get a new structure S* that increases the likelihood. Equation (7) shows how the update was done, where λ is the learning rate, and S is the (x, y, z) coordinate vector in 3D space. If the learning rate is too small, it can result in a slow convergence to an optimal solution. But, if a larger learning rate is defined, the algorithm might oscillate around an optimal solution. There is no standard approach to choose λ value, but it is common to set a larger learning rate at the beginning of the optimization, and reduce it as the optimization progresses. The result of using the different types of learning rate is described in the Subsection “choice of the learning rate” in the Result section.
where t is an iteration index, S^{(t)} is the structure coordinate at an iteration index t, λ^{(t)} is a learning rate at t that may vary as the iteration proceeds, and ∇L(S^{(t)}) is the partial derivative of the log likelihood with respect to the coordinates in the structure.
In this work, we also implemented a variant of the 3DMax algorithm above, called 3DMax1, which performs an extra preprocessing and filtering of the input contact matrix when the input is noisy (e.g. having low IFs). Moreover, 3DMax1 uses a stochastic gradient ascent algorithm with perparameter learning rate, which is called the adaptive Gradient algorithm (AdaGrad). The AdaGrad [34] is a gradientbased optimization that can adapt the learning rate to each parameter, it performs larger updates for infrequent or sparse parameters and smaller updates for frequent or less sparse parameters. And it often improves convergence performance over standard stochastic gradient ascent when dealing with sparse parameters [35]. Different from 3DMax that updates the values of all the structure parameters in S at once with the same learning rate λ, AdaGrad in 3DMax1 uses a different learning rate for every parameter in S at every time t . Let Eq. (8) represent the gradient of the log likelihood for a parameter S_{ i } at a time step t. Hence, the stochastic Gradient ascent in Eq. (7) can be written as in Eq. (9) for a parameter S_{ i } in S.
In the update rule for AdaGrad, it modifies the learning rate λ at each time step for every parameter S_{ i } based on the previously computed gradient for the parameter S_{ i }. according to Eq. (10)
Here, G_{ t } is a diagonal matrix where each diagonal element i, i is the sum of the squares of the gradients w.r.t. S_{ i } up to time step t according to Eq. (11). While ε is a smoothing term that avoids division by zero (usually on the order of 1e − 6).
In essence, G_{ t } contains the sum of the squares of the past gradients for all the parameters in S along its diagonal. One of AdaGrad’s main benefits is that it eliminates the need to manually tune the learning rate at each iteration.
Normalization of HiC data
Data normalization is necessary for HiC datasets, because there is a lot of noise in them. In this study, we used the iterative correction and eigenvector decomposition (ICE) technique [31] as the default technique to normalize the HiC data. The ICE technique was used to normalize the contact map derived from both the synthetic data and the experimental HiC data. The GM06990 HiC data was also normalized using the Yaffe and Tanay normalization technique [30]. The Yaffe and Tanay normalization technique normalizes the observed read counts by the expected read counts between the regions in a contact matrix. The other technique used to normalize the GM06990 HiC data is the Sequential Component Normalization (SCN) technique [32]. The results obtained by the three methods above are presented in the Results section.
Conversion of interaction frequency to spatial distance
An important aspect of most distance restraintbased modeling approaches including 3DMax is to convert the interaction frequency (IF_{ij}) between two regions (i, j) in a contact matrix to a hypothetical Euclidean distance. An inverse relationship is assumed to exist between them. The relationship is usually defined as 1/IF^{α}, where IF is the interaction frequency, and α is called the conversion factor. According to [16], α cannot be too small because the spatial distance becomes independent of the interaction frequency as α approaches zero. And α also cannot be too large because in this situation a small change in interaction frequency could produce a significant difference in the spatial distances. Therefore, choosing a conversion factor that correctly represents the relationship between distance and interaction frequency (IF) is important. For 3DMax, we assume that the optimal α will be in the range [0.1, 2], which is consistent with the previous study [14, 16].
Measurement of model similarity and accuracy
We used the Pearson correlation coefficient (PCC), the Spearman’s correlation coefficient (SCC), and the root mean square error (RMSE) to measure the similarities between chromosomal structures, and assess the accuracy of the constructed structures as in the previous studies [12,13,14,15,16,17,18,19,20]. When these assessment methods are applied on a distance representation of a model, or a distance representation of HiC data, they are sometimes called the distance Pearson Correlation Coefficient (dPCC), the distance Spearman Correlation Coefficient (dSCC), and the distance Root Mean Square error (dRMSE), respectively. For instance, if we have two pairwise distance dataset from two models, {d_{i}, …, d_{n}} containing n values, and another pairwise distance dataset {D_{i}, …, D_{n}} containing n values, the dPCC, the dSCC and the dRMSE can be computed using the formulas given below.

(1)
The distance Pearson correlation coefficient (dPCC) is defined as,
$$ \mathrm{dPCC}=\frac{\sum_{i=1}^n\left({d}_i\overline{d}\right)\left({D}_i\overline{D}\right)}{\sqrt{\sum_{i=1}^n{\left({d}_i\overline{d}\right)}^2{\sum}_{i=1}^n{\left({D}_i\overline{D}\right)}^2}} $$
where:

d_{ i } and D_{ i } are single distance samples indexed with i,

n is the number of pairwise distance.

\( \overline{d} \) and \( \overline{D} \)represent sample means. \( \overline{d}=\frac{1}{n}\sum \limits_{i=1}^n{d}_i \), \( \overline{D}=\frac{1}{n}\sum \limits_{i=1}^n{D}_i \) .

(2)
The distance Spearman’s correlation coefficient (dSCC) is defined as
$$ \mathrm{dSCC}=\frac{\sum_{i=1}^n\left({X}_i\overline{X}\right)\left({Y}_i\overline{Y}\right)}{\sqrt{\sum_{i=1}^n{\left({X}_i\overline{X}\right)}^2{\sum}_{i=1}^n{\left({Y}_i\overline{Y}\right)}^2}} $$
dSCC is calculated by converting distance variable d_{ i } and D_{ i } into ranked variables X_{ i } and Y_{ i }, and then, computing the dPCC between the ranked variables.
where:

X_{ i } and Y_{ i } is the rank of two distance d_{ i } and D_{ i } respectively. Hence, X and Y is a vector of distance rank of the distance vector d and D respectively.

\( \overline{X} \) and \( \overline{Y} \) represent sample means of rank. \( \overline{X}=\frac{1}{n}\sum \limits_{i=1}^n{X}_i \), \( \overline{Y}=\frac{1}{n}\sum \limits_{i=1}^n{Y}_i \) .

(3)
The distance Root Mean Square Error (dRMSE) is defined as,
$$ \mathrm{dRMSE}=\sqrt{\frac{1}{n}\sum {\left({d}_{ij}{D}_{ij}\right)}^2} $$

where d_{ ij } and D_{ ij } are the distance vector between regions i and j for the first model, and second model respectively.

n is the number of pairwise distance.
The dSCC measures the similarity of the distance profiles of two 3D structures. The dSCC value varies between − 1.0 and 1.0; the higher the dSCC value is, the more similar the two structures are. It is worth noting that, to determine the dRMSE of two structures, the structures must be compared at the same scale. For instance, assuming two structures are represented with coordinates S′ and S ∈ R ^{n × 3}, where S′ is the model constructed by 3DMax, S is the known model from a simulated data, and n is the number of regions representing a chromosome. To calculate the dRMSE value, we performed linear transformations that includes translation, orthogonal rotation, and rescaling of the points in the matrix R^{3 x n} of structure S′ in order to best match them with the points in matrix R^{3 x n} of structure S. The Procrustes function library defined in MATLAB [36,37,38,39] is used to do the transformation of the dimensions. After the transformation, the dRMSE value between the scaled structure S″ and the original structure S is calculated.
Datasets
The synthetic dataset from Trussart et al., 2015 [15] is a series of simulated HiC contact matrices where the genomic architectures are predefined and the noise level and structural variability (SV) are both simulated. The contact maps, the original models and their reconstructed models used in this study were downloaded from http://sgt.cnag.cat/3dg/datasets/.
The real HiC data used in this study is from a normal GM06990 cell line and a malignant Bcell line. The normal GM06990 dataset was downloaded from the Gene Expression Omnibus (GEO) repository under the accession number GSE18199. Its raw and normalized interaction frequency matrices at 1MB resolution [10] were downloaded from [40]. We used the normalization pipeline described in [31, 32] to obtain normalized contact matrices. The raw contact matrices of the malignant Bcell 1MB resolution were obtained from [41]. We used the pipeline [31] to normalize them. The fluorescence insitu hybridization (FISH) data of the GM06990 cell line is from [10]. Its FISH distances and contact maps were obtained from [21].
Results
We evaluated our method using a synthetic dataset (Trussart et al., 2015) [15] and two real HiC datasets of the two cell lines: a karyotypically normal human lymphoblastic cell line (GM06990) [10] and the malignant Bcell of an acute lymphoblastic leukemia patient [41].
Parameter estimation
To use 3DMax, the conversion faction (α) needs to be defined. As the default, we set the α value to be in the range [0.1, 2] as explained in the Methods section. Another parameter we defined in 3DMax is the convergence constant called epsilon. To estimate the best epsilon value to use, we experimented on the GM06990_HindIII cell line dataset using six epsilon values, i.e., 1, 0.5, 0.1, 0.01, 0.0001, and 0.00001(Table 1). According to our experiment, although the different epsilons produced comparable dSCC average, the epsilon = 0.0001 has the highest average dSCC score. Hence, we set it as the default epsilon value for 3DMax. The number of ensemble structures (N) to generate per conversion factor is another parameter to be tuned. Table 2 shows the performance changes by setting different numbers of ensemble structures (NUM_STR). It is observed that a higher N value does not guarantee a significant increase in the accuracy. We set the default N to 5 in our implementation.
We executed all the other methods following the directions for parameter settings by their authors. All the parameters used to produce all the results are made available in the “parameters” directory of each method in the 3DMax website (http://sysbio.rnet.missouri.edu/bdm_download/3DMax/). For instance, to evaluate the MOGEN program, we used the parameters that produced the best result after trying multiple settings for the parameters required by the algorithm. The different parameters used to generate the MOGEN models, the input data, and the outputs for the three normalization methods for the GM06990 cell line are all available at the 3DMax website.
Choice of the learning rate
As mentioned in the Methods section, the choice of the best learning rate can sometimes be a difficult task. However, it is common practice to use either a preferable constant learning rate, or a decreasing learning rate.
The constant learning rate uses a constant λ value through all the epoch steps for an algorithm. By experimenting with a range of learning rates in our work, Fig. 1 shows the model accuracy for different constant learning rates for GM06990_HindIII cell chromosome 1 to 22 datasets. The result shows the impact of using the different learning rates for structure modeling. We observed that λ = 0.0001, 0.001, and 0.005 shows a consistent better performance than the other λ values across all the chromosomes. As observed in the Figure, the larger learning rate (λ =0.01) had the advantage of faster convergence in some chromosomes, but suffered fluctuations or even decreased performance at some point (Chromosome 5,11, and 20). The smaller learning rates resulted in slow convergence and sometimes does not converge with a good model accuracy as in the case of λ = 0.00001 (Chromosome 3,11,13,15,1618, and 21).
Conversely, for the decreasing learning rate, a typical way to implement it is to choose a starting learning rate, and drop the learning rate by half every 70 epochs (in our algorithm). This approach is termed the step based learning rate decay schedule. It takes the mathematical form below:
In this work, we compared the result obtained by using the constant learning rate (λ =0.0001), and the decreasing learning rate methods in Fig. 2. Interestingly, the results show that both methods achieved a comparable accuracy for all the chromosomes. However, in terms of the computing speed, 3DMax is faster when the decreasing learning rate is used than when the constant learning rate is used. The running time and accuracy of the two methods of setting learning rates are reported in Table 2. In 3DMax, we made the decreasing learning rate approach the default because it converges faster.
Assessment on simulated datasets
The synthetic dataset includes a series of HiC matrices simulated from the predefined chromosome structures with different noise levels and structural variability (SV) level. Each worm like chain chromosome structure has ~ 1 Mb base pairs and is represented by 202 regions of 5 Kb base pairs each. The simulated data can be classified into two categories based on the different architectures of the chromosome structures: Topological Associated Domains (TAD)like architecture and NonTopological Associated Domains (NonTAD)like architecture [42,43,44,45]. Each of these architectures has three structural density levels (40 bp/nm, 75 bp/nm and 150 bp/nm), resulting in six densityarchitecture combinations. The entire synthetic dataset contains 168 simulated HiC matrices in total, i.e., six different combinations of density and architectures times seven levels of structural variability (SV) (denoted as 0, 1, 2, 3, 4, 5, 6) times four noise levels (i.e. 50, 100, 150 and 200). There are 28 simulated HiC contact matrices for each of the six densityarchitecture combinations. According to [14], the most difficult architecture to reconstruct is the 150 bp/nm density with no TADlike features because of its higher resolution and lack of regular TAD substructures.
We evaluated 3DMax on the 28 contact matrices (7 levels of structural variability with four noise levels each) of the synthetic dataset with resolution 150 bp/nm for both TAD and nonTAD like feature architecture, respectively. The matrices were normalized with the ICE technique before they were used as input for 3DMax. To determine the best conversion factor (α) for model reconstruction, the dSCC value between the distance matrix generated from the input contact matrix and the Euclidean distance of the representative chromosomal model is computed. To determine the representative structure for an input matrix, we generated an ensemble of 50 structures and calculated the similarity between each structure in the ensemble with the input distance matrix. The structure with the highest dSCC value in the ensemble was chosen as the representative structure for the input contact matrix. We then computed the average dSCC value across the 28 contact matrices of the simulated data, with resolution 150 bp/nm and TAD like feature architecture, for the conversion factor (α) in the range [0.1, 2] (Table 3). The result shows that α value 0.3 has the highest average dSCC value. We computed the average dSCC value between the models reconstructed by 3DMax and the true structures (i.e., a set of 100 true structures for each structural variability level in the simulated dataset) for the α values in the range [0.1, 2] for the simulated data with resolution 150 bp/nm and TAD like feature architecture (Table 4). The result also shows that the structures generated at α = 0.3 have the higher similarity to the true structures from simulated dataset than other α values. To compute the accuracy of 3DMax, we compared each structure in the generated ensemble with the true structures (i.e., a set of 100 true structures for each structural variability level) by using the spearman correlation coefficient. We thereafter selected the reconstructed structure closest to a true structure from the ensemble. The spearman correlation coefficient of the selected structure and the true structure was averaged and used as the dSCC accuracy for the ensemble of generated 3DMax structures. The reconstruction accuracy (dSCC) for 3DMax at different levels of noise and structural variability (SV) for α = 0.3 shows that the accuracy of reconstructed models decreased as the structural variability level increased for each noise level (Fig. 3). The reconstruction accuracy of structures generated by 3DMax is relatively high for different noise levels when the structural variability (SV) is low, while the average accuracy of structures decreases noticeably as the level of SV increases.
Similarly, we evaluated 3DMax on 28 contact matrices of the synthetic dataset with resolution 150 bp/nm and nonTAD like feature architecture. Table 5 shows the performance of 3DMax for different α values.
Comparison with existing methods on the simulated data
We compared 3DMax with three existing methods: MCMC5C [14], MOGEN [19], and ShRec3D [24]. We used each method to generate an ensemble of 50 structures for each input matrix. We compared each structure in the ensemble with the true structures (i.e., a set of 100 true structures for each structural variability level) using spearman correlation coefficient to select the reconstructed structure closest to a true structure from the ensemble. The spearman correlation coefficient of the selected structure and the true structures is averaged and used as the dSCC accuracy for the method. For clarity, the comparison is grouped based on the noise level of the simulated data from 50 to 200. For the different noise levels, 3DMax is comparable to the top method  MOGEN when structural variability (sets 01) is low. And as the variability increases (especially sets 36), it outperforms all the other methods (Fig. 4) most time. Table 6 shows a tabular representation of the dSCC values visualized in Fig. 4, to show the dSCC values generated by all the algorithms.
Assessment on real HiC data
We applied 3DMax to a 1 MB resolution HiC dataset of GM06990 cell line [10]. The HiC data for this cell line was generated with two different restriction enzymes: Ncol and HindIII. For comparison, we applied seven structure prediction methods 3DMax, 3DMax1 based on AdaGrad optimization algorithm, ShRec3D, ChromSDE, MCMC5C, MOGEN, and LorDG [27] to predict the 3D structure of chromosomes of this cell line. All the methods take as input an interaction frequency matrix normalized by using the normalization pipeline in [29]. We used the distance Spearman Correlation Coefficient (dSCC) and the distance Pearson Correlation Coefficient (dPCC) to assess the accuracy of these methods. The accuracy is determined by computing the dSCC value between the distance matrix of the normalized frequency input matrix and the Euclidean distance calculated from the predicted 3D structures. Figure 5(a) shows that 3DMax outperforms the other methods by at least 4% across 22 pairs of nonsex chromosomes of the cell line. 3DMax obtained an average spearman correlation coefficient of 0.85 across all the chromosomes while the second highest among the other methods has the coefficient of 0.82. Figure 5(b) shows the Pearson correlation coefficient on the GM06990_HindIII cell. 3DMax obtained the highest average Pearson correlation coefficient of 0.795, which is better than the other methods.
In Fig. 5(c) we compared the spearman correlation values of ShRec3D, ChromSDE, 3DMax, and 3DMax1 for the contact maps of GM06990 cell line with NcoI and HindIII restriction enzymes. 3DMax has the highest average dSCC value of 0.88 across the chromosomes of the cell line. Table 7 shows a tabular representation of the model accuracy comparison visualized in Fig. 5.
On average, 3DMax’s accuracy is at least 3% higher than the other methods. In addition, since each HiC data obtained with a restriction enzyme is an independent observation of the GM06690 cell, we checked the robustness of our method by comparing the predicted structure from Ncol with one from the HindIII enzyme. We compared the predicted structure of chromosome 19 of HindIII data and NcoI replicate data. The dSCC and dRMSE value of the comparison were 0.9 and 0.0064 respectively, suggesting the two models are very similar.
Consistency checking of models in ensembles
To assess the consistency of the structures generated by 3DMax, we compared 50 structures generated at the optimal α value for each chromosome for the GM06990_HindIII cell and the malignant Bcell, respectively. We used the dSCC value to measure the similarity between these structures. Figure 6 shows the average dSCC for each chromosome for HiC data of the GM06990_HindIII cell and the malignant Bcell respectively. The average dSCC between the models is > 0.9 for all the chromosomes, indicating chromosomal models generated by 3DMax are quite similar to each other.
Comparative analysis of the performance of 3DMax, 3DMax1, MOGEN, ChromSDE, ShRec3D, MCMC5C, and LorDG on HiC data normalized with three popular normalization methods
Due to biases in HiC experiments, HiC data is generally noisy. Some of these biases are associated with cutting frequencies of restriction enzymes, GC content and sequence uniqueness [11, 30,31,32]. In order reduce the effects of these biases, the HiC data contact matrix is normalized to reflect the strength of the underlying chromosomal interactions more accurately.
We performed a comparative study of the performance of different 3D modeling methods when each of the three commonly used normalization techniques: Yaffe and Tanay [30] normalization technique, ICE (iterative correction and eigenvector decomposition) technique [31], and Sequential Component Normalization (SCN) technique [32] is applied. Figure 5(a) shows the result obtained by using the Yaffe and Tanay normalization technique, where 3DMax outperformed the other methods. Table 8 shows the average dSCC value for different chromosomes for each of the normalization technique. 3DMax and 3DMax1 produces the best performance when the Yaffe and Tanay normalization technique is used, and the 3DMax1 produces the best performances when the ICE and SCN normalization method are used respectively. It is evident from the results that the normalization techniques have a significant impact on the performance of some 3D modeling methods.
Discussion
Comparison of the computing performance of the different methods
To improve the computing performance and the usability of our algorithm, we also implemented the 3DMax algorithm in the Java programming language (available via https://github.com/BDMLab/3DMax/releases). The performance comparison of the MATLAB and the Java programming versions for a GM06990_HindIII cell line dataset is shown in Fig. 7. As shown in the Figure, the result produced by two separate Java implementation runs is consistent with those of the MATLAB implementation. We tested 3DMax and all other methods on an Intel Core i52400 3.10GHz computer with 8GB RAM.
We compared 3DMax algorithm with the other algorithms mentioned above in terms of computation speed, and the memory cost. To do this, we benchmarked them against the chromosomes of GM06990_HindIII cell data. It takes 3DMax java implementation about 13 s to predict the structure for all the chromosomes of the entire genome when it uses a single conversion factor (α), while it generates a single structure for each chromosome. 3DMax uses about 20 min to generate the representative structures for the entire cell when it estimates the optimal conversion factor (α) in the range [0.1, 2].
Though ChromSDE produced one of the best results, it was memory intensive and slow to generate large structures. ChromSDE could not handle efficiently input data with > 400 bins on our machine with 8 GB RAM. We were only able to use ChromSDE to create structure on our server machine with 65GB RAM. It takes ChromSDE 2025 h to generate structure for the entire GM06990_HindIII cell data. MOGEN uses over 2 h to generate the models for the cell line. It takes LorDG about 1 h and 7 min to process the whole cell line. MCMC5C with the default parameters uses 1 h and 19 min to generate the models. But to obtain better accuracy by increasing the number of iterations and the number of structures generated, the MCMC5C algorithm could run for > 18 h before it converges.
Validation using FISH data
We validated the model of Chromosome 22 reconstructed by 3DMax with an independent FISH data for GM06990_HindIII cell. Four 3D FISH probes for four loci (L5, L6, L7, L8) of the consecutive positions alternate between two chromosome compartments (A and B) [10]. That is, locus L5 and locus L7 are in Compartment A, and locus L6 and locus L8 are in Compartment B. According to the FISH data, L7 is spatially closer to L5 than to L6, though L6 lies between L5 and L7 on the chromosome sequence. Likewise, L6 is spatially closer to L8 than to L7. To check if this holds in the reconstructed 3D model, we measured the distance between these loci on the predicted structure. Figure 8 shows a model constructed by 3DMax with the four probes L5, L6 L7, L8 colored green, blue, yellow, magenta respectively. The distances between these loci: L5 – L6, L5 – L7, L6 – L7, L6L8 are reported. Indeed, the distance L5 – L7 was shorter than L5 – L6 and the distance L6  L8 was shorter than L6 – L7. The 3D structure was visualized with Pymol [46].
Conclusions
We developed a new method (3DMax) based on the maximum likelihood inference to reconstruct the 3D structure of chromosomes from HiC data. 3DMax combines a maximum likelihood algorithm and a gradient ascent method to generate optimized structures for chromosomes. The results on synthetic datasets show that the method performs robustly in the presence of noise and structural variability. This method provides a way to automatically determine the best conversion factor (α) for any HiC contact data. The results on the real HiC datasets reveals that 3DMax can effectively reconstruct chromosomal models from HiC contact matrices normalized by different methods. We also show that a major strength of the 3DMax algorithm is that it is faster and has a low memory requirement compared to some other methods.
Abbreviations
 3C:

Chromosome Conformation Capture
 3DMax:

Three dimensional structure modeling using a maximum likelihood approach.
 3DMax1:

A variant of 3DMax
 ICE:

Iterative Correction and Eigenvector decomposition
 IF:

Interaction Frequency
 PCC:

Pearson Correlation Coefficient
 RMSE:

Root Mean Square Error
 SCC:

Spearman’s correlation coefficient
 SCN:

Sequential Component Normalization
 SV:

Structural Variability
 TAD:

Topological Associating Domain
References
Dekker J. Gene regulation in the third dimension. Science. 2008;319:1793–4.
Fraser P, Bickmore W. Nuclear organization of the genome and the potential for gene regulation. Nature. 2007;447:413–7.
Miele A, Dekker J. Longrange chromosomal interactions and gene regulation. Mol BioSyst. 2008;4:1046–57.
Misteli T. Beyond the sequence: cellular organization of genome function. Cell. 2007;128:787–800.
Van Steensel B, and Job Dekker. "Genomics tools for unraveling chromosome architecture." Nat Biotechnol 28.10 (2010): 10891095.
Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science. 2002;295(5558):1306–11.
Simonis M, Klous P, Splinter E, Moshkin Y, Willemsen R, de Wit E, van Steensel B, de Laat W. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation captureonChIP (4C). Nat Genet. 2006;38:1348–54.
Zhao Z, Tavoosidana G, Sjölinder M, Göndör A, Mariano P, Wang S, Kanduri C, Lezcano M, Sandhu KS, Singh U, Pant V, Tiwari V, Kurukuti S, Ohlsson R. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra and interchromosomal interactions. Nat Genet. 2006;38:1341–7.
Dostie J, Dekker J. Mapping networks of physical interactions between genomic elements using 5C technology. Nat Protoc. 2007;2:988–1002.
LiebermanAiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R. Comprehensive mapping of longrange interactions reveals folding principles of the human genome. Science. 2009;326(5950):289–93.
Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159(7):166580.
Duan Z, Andronescu M, Schutz K, McIlwain S, Kim YJ, Lee C, Shendure J, Fields S, Blau CA, Noble WS. A threedimensional model of the yeast genome. Nature. 2010;465(7296):363–7.
Baù D, MartiRenom MA. Genome structure determination via 3Cbased data integration by the integrative modeling platform. Methods. 2012;58(3):300–6.
Rousseau M, Fraser J, Ferraiuolo MA, Dostie J, Blanchette M. Threedimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling. BMC Bioinformatics. 2011;12(1):414.
Trussart M, Serra F, Baù D, Junier I, Serrano L, MartiRenom MA. Assessing the limits of restraintbased 3D modeling of genomes and genomic domains. Nucleic Acids Res. 2015;43(7):3465–77.
Zhang Z, Li G, Toh KC, Sung WK. Inference of spatial organizations of chromosomes using semidefinite embedding approach and hiC data. In: Annual international conference on research in computational molecular biology. Berlin Heidelberg: Springer; 2013. p. 317–32.
Hu M, Deng K, Qin Z, Dixon J, Selvaraj S, Fang J, Ren B, Liu JS. Bayesian inference of spatial organizations of chromosomes. PLoS Comput Biol. 2013;9(1):e1002893.
Varoquaux N, Ay F, Noble WS, Vert JP. A statistical approach for inferring the 3D structure of the genome. Bioinformatics. 2014;30(12):i26–33.
Trieu T, Cheng J. MOGEN: a tool for reconstructing 3D models of genomes from chromosomal conformation capturing data. Bioinformatics. 2016;32(9):1286–92.
Wang S, Xu J, Zeng J. Inferential modeling of 3D chromatin structure. Nucleic Acids Res. 2015;43(8):e54.
Zou C, Zhang Y, Ouyang Z. HSA: integrating multitrack hiC data for genomescale reconstruction of 3D chromatin structure. Genome Biol. 2016;17(1):40.
Trieu T, Cheng J. Largescale reconstruction of 3D structures of human chromosomes from chromosomal contact data. Nucleic Acids Res. 2014; https://doi.org/10.1093/nar/gkt1411.
Nowotny J, Ahmed S, Xu L, Oluwadare O, Chen H, Hensley N, Trieu T, Cao R, Cheng J. Iterative reconstruction of threedimensional models of human chromosomes from chromosomal contact data. BMC Bioinformatics. 2015;16(1):1.
Lesne, Annick, et al. "3D genome reconstruction from chromosomal contacts." Nat Methods 11.11 (2014): 11411143.
Wachter A, Biegler LT. On the implementation of an interiorpoint filter linesearch algorithm for largescale nonlinear programming. Math Program. 2006;106:25–7.
Russel D, Lasker K, Webb B, VelázquezMuriel J, Tjioe E, SchneidmanDuhovny D, Peterson B, Sali A. Putting the pieces together: integrative modeling platform software for structure determination of macromolecular assemblies. PLoS Biol. 2012;10(1):e1001244.
Trieu, Tuan, and Jianlin Cheng. "3D genome structure modeling by Lorentzian objective function." Nucleic Acids Res 45.3 (2017): 10491058.
Mossel E, Vigoda E. Limitations of Markov chain Monte Carlo algorithms for Bayesian inference of phylogeny. Ann Appl Probab. 2006:2215–34.
Cole SR, Chu H, Greenland S, Hamra G, Richardson DB. Bayesian posterior distributions without Markov chains. Am J Epidemiol. 2012;175(5):368–75.
Yaffe E, Tanay A. Probabilistic modeling of hiC contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat Genet. 2011;43:1059–65.
Imakaev M, Fudenberg G, McCord RP, Naumova N, Goloborodko A, Lajoie BR, et al. Iterative correction of hiC data reveals hallmarks of chromosome organization. Nat Methods. 2012;9(10):999–1003.
Cournac A, MarieNelly H, Marbouty M, Koszul R, Mozziconacci J. Normalization of a chromosomal contact map. BMC Genomics. 2012;13(1):436.
Deza MM, Deza E. Encyclopedia of distances. Encyclopedia of distances. Berlin Heidelberg: Springer; 2009. p. 1–583.
Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research. 2011;12:212159.
Dean J, Corrado G, Monga R, Chen K, Devin M, Mao M, Senior A, Tucker P, Yang K, Le QV, Ng AY. Large scale distributed deep networks. In Advances in neural information processing systems. 2012;122331.
Kendall DG. A survey of the statistical theory of shape. Stat Sci. 1989:87–99.
Bookstein FL. Morphometric tools for landmark data. Cambridge, UK: Cambridge University Press; 1991.
Seber GAF. Multivariate observations. Hoboken, NJ: John Wiley & Sons, Inc.; 1984.
MATLAB version 7.10.0. Natick, Massachusetts: The MathWorks Inc.; 2010.
GM, 06990 Normalized HiC Data. http://compgenomics.weizmann.ac.il/tanay/?page_id=283. Accessed 17 Feb 2018.
Wang Z, Cao R, Taylor K, Briley A, Caldwell C, Cheng J. The properties of genome conformation and spatial gene interaction and regulation networks of normal and malignant human cell types. PLoS One. 2013;8(3):e58793. 1–7
Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485(7398):376–80.
LévyLeduc C, Delattre M, MaryHuard T, Robin S. Twodimensional segmentation for analyzing hiC data. Bioinformatics. 2014;30(17):i386–92.
Wang Y, Li Y, Gao J, Zhang MQ. A novel method to identify topological domains using hiC data. Quant Biol. 2015;3(2):81–9.
Shin H, Shi Y, Dai C, Tjong H, Gong K, Alber F, Zhou XJ. TopDom: an efficient and deterministic method for identifying topological domains in genomes. Nucleic Acids Res. 2015:gkv1505.
Schrodinger, LLC. The PyMol molecular graphics system, version 1.3. 2010.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Science Foundation (NSF) CAREER award (grant no: DBI1149224) to JC.
Availability of data and materials
The GM06990 cell datasets, the Malignant Bcell datasets, the models generated, and the datasets generated during and analysed during the current study are available at http://sysbio.rnet.missouri.edu/bdm_download/3DMax/.
The synthetic datasets used in this study can be downloaded from here: http://sgt.cnag.cat/3dg/datasets/.
The MATLAB, and the Java source codes for 3DMax are available at https://github.com/BDMLab/3DMax.
A video demonstration of how to use 3DMax is available at https://youtu.be/ehQUFWoHwfo.
Author information
Authors and Affiliations
Contributions
JC conceived the project. YZ and OO designed the algorithm. OO implemented the algorithm. OO performed the statistical and simulation analyses. YZ, OO and JC evaluated the results, and wrote the manuscripts. All authors reviewed the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Oluwadare, O., Zhang, Y. & Cheng, J. A maximum likelihood algorithm for reconstructing 3D structures of human chromosomes from chromosomal contact data. BMC Genomics 19, 161 (2018). https://doi.org/10.1186/s1286401845468
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286401845468