TM3C library generation
Approximately six million NHEK and ten million KBM7 cells were fixed in 1.5% formaldehyde at room temperature for 10 minutes. The fixed cells were washed with TN buffer (10 mM Tris, 40 mM NaCl, pH 7.5) and collected by centrifugation at 600 g for 3 minutes. To increase digestion efficiency, fixed cells (6 or 10 million / 122 ul) were treated with SDS (add 3.8 ul of 10% SDS to a final of 0.30% SDS) at 64°C for 10 minutes and then at 37°C overnight (15 hours). The SDS concentration was reduced gradually to 0.10% by adding five times of 50 ul (1 x DpnII digestion buffer or NEB buffer 4 for multiple enzymes) with mixing. Triton X-100 (38 ul of 20% Triton X) was added to 1.8% concentration and the sample was incubated at 37°C for 1 hour. Sample volume was adjusted to 600 ul by adding 1 X restriction buffer, ATP (0.2 mM final) and BSA (100 ug/ml final). Digestion with appropriate restriction enzymes (300 units each) was carried out on a rotate shaker at 37°C for 15 hours. We used high concentration NEB enzymes to keep the final volume of the enzyme mixture less than 60 ul (1/10 reaction volume).
The digested samples were deactivated at 65°C for 15 minutes and then centrifuged at 15,000 g for 5 min. We recovered ∼95% of cellular DNA in the pellet fraction. The pellet fraction was re-suspended with T4 ligation buffer (15 ul 10 x buffer, 65 ul total) heated at 65°C and mixed with 100 ul of melted 2.5% low-melting agarose. We used 200 ul pipette to deliver the hot agarose sample to ice-cold ligation buffer (800 ul of 1 x ligation buffer containing T4 ligase (4000 units, NEB) in a steady fashion within ∼5 seconds, on melted ice. Strings of gel bead appeared instantly at 0°C. We sealed the tube with parafilm and perform ligation at RT (23°C.) overnight on top of a shaker (∼300 rpm), then transfer the tube to a iced water bath.
The sample pellet was recovered by centrifugation at 20,000 g for 2 minutes, then 10 ul of 1% SDS (0.05% final) was added and heated at 80°C for 1 hour. Cross-links were reversed by treatment with Proteinase K (200 ug/ml) at 65°C and 300 rpm overnight (12 hours). Melted TM3C-agarose sample was incubated with RNase A (10 ug / 210 ul) at 55°C for 15 minutes and then purified by QIAquick gel extraction protocol (QUIAGEN Inc., CA). Purified TM3C DNA was quantified using both a NanoDrop spectrophotometer (Thermo Scientific) and a Qubit 2.0 Fluorometer. The Qubit quantification represents the more accurate DNA concentration.
First phase mapping of sequence data
We mapped the paired-end reads to the human reference genome (hg19) using the short read alignment mode of BWA (v0.5.9) with default parameter settings. Each end of the paired reads was mapped individually. We post-processed the alignment results to extract the reads that satisfy the following three criteria: (i) mapped uniquely to one location in the reference genome, (ii) mapped with an alignment quality score of at least 30, (iii) mapped with an edit distance of at most 3. Reads that satisfy these criteria are named fully-mapped (F), and the rest of the mapped reads that did not satisfy these criteria are discarded from further analysis. We identified pairs of fully-mapped reads that share a common identifier to generate the set of contacts that we denote as F-F (fully-mapped - fully-mapped). The reads that did not map to any location in this phase of mapping are named non-mapped and are analyzed further.
Second phase mapping of non-mapped reads
Re-mapping the reads that are deemed non-mapped in the initial mapping is necessary to avoid discarding a significant number of informative reads for an assay such as TM3C that uses a frequently cutting restriction enzyme (or enzymes) for digestion. Due to the high frequency of cleavage sites in the genome, TM3C is highly likely to capture ligations between DNA fragments from two different loci in a single end of a read. We call each such read chimeric because the sequences do not come from a continuous piece of DNA but instead from two loci that are in proximity in the three-dimensional space. Therefore, for these chimeric ends, after splitting into smaller fragments from the cleavage sites of the restriction enzymes used in the digestion step, we applied a second phase of mapping.
Within each non-mapped read, we first counted the number of cleavage sites, taking into account all the restriction enzymes that are used in the digestion step for that specific library. We discarded reads that contain more than two cleavage sites. We also discarded reads that contain no cleavage sites because such reads surely are not chimeric. We split the remaining reads that contain only one cleavage site into two smaller fragments, preserving the entire cleavage site on both adjacent fragments. We mapped the two resulting fragments to the genome using BWA with default parameter settings. The 3-point filtering criteria mentioned in the previous section are applied to the aligned reads, but allowing an edit distance of at most 1 to make sure we only extract the unique and high quality mappings. The reads that are extracted from this phase of mapping are named partially-mapped(P) because they did not map as a whole, but their constituent fragments were successfully mapped to different loci. The two classes of mapped reads (fully-mapped (F) and partially-mapped (P)) yield three possible types of contacts, namely F-F, F-P and P-P. The first set (F-F) is extracted after the initial mapping in which each paired-end read can contribute at most one interaction between two loci. The second set (P-F) consists of paired-end reads with one end fully mapped and the other end having either one or two smaller fragments that mapped to the genome. If the latter contains only one mapped fragment, then the only interaction is between this fragment and the fully-mapped end. However, if the end has two mapped fragments, then this paired-end read produces three contacts: one between the two mapped fragments on the partially-mapped end and two others that have one side from a fragment from the partially-mapped end and the other side from the fully-mapped end. In addition, the same paired-end read produces one triple (i.e., interaction among three loci) of type P-F. For the contacts of the third type (P-P), each paired-end can produce either one, three or six pairwise contacts, depending on whether one or two fragments from each end are successfully mapped. If only one fragment from one end and two from the other is mapped, then, similar to the case of P-F, three pairwise contacts and one triple is produced. If both ends have two mapped fragments, then six pairwise contacts, four triples (of type P-P) and one quadruple (i.e., contact among four loci) are produced.
Normalization of contact maps
For each possible pair of 1 Mb loci, we refer to the total number of read pairs that link the two loci as the contact count, and we refer to the two-dimensional matrix containing these contact counts as the raw contact map. To normalize the 3113×3113 raw contact maps, we extended the iterative correction procedure, ICE [34], for a nearly haploid genome. First, we corrected for the bias caused by the partial diploidy of the genome. For that, we constructed a “deduplicated” contact counts matrix, where contact counts associated with diploid loci are divided into two equal parts, each of which is associated with one of the homologous chromosomes. Contact counts between two different copies of diploid chromosomes/regions are set to 0. The deduplicated matrix is akin to an artificially created allele-specific contact counts matrix, where homologous chromosomes interact in identical ways and do not interact with each other. As a preprocessing step, we ranked loci by their percentage of intrachromosomal contacts with zero counts and filter out the top 10% of this list. This filtering removes all loci for which the signal to noise ratio is too low (typically, regions of low mappability). Last, we applied ICE, a method that attempts to eliminate systematic biases in Hi-C data. ICE assumes that the bias for each entry can be decomposed as the product of the biases associated with each locus, and estimates a bias vector β under the equal visibility hypothesis: the coverage of counts should be uniform. The tensor product β⊗β generates a bias matrix that can be used to convert the raw contact map into a normalized contact map. To generate a contact count matrix of the original size, we summed all counts from homologous chromosomes associated with the same loci. This procedure yields a (3113×3113) contact counts matrix for which diploid loci interact twice as much as haploid loci.
Eigenvalue decomposition
We carried out eigenvalue decomposition on the normalized contact maps of KBM7 and NHEK TM3C datasets as described in [4]. For each chromosome we used the intrachromosomal contact matrices at 1 Mb resolution. We calculated the Pearson correlation between each pair of rows of the contact matrix and apply eigenvalue decomposition (using the eig function in MATLAB) to the correlation matrix. The sign of either the first or the second eigenvector defines chromosome compartments for each chromosome. Similar to [4], we used the second eigenvector in cases where the first eigenvector values are either all positive or all negative. To map signs of eigenvectors to open/closed compartment labels we used GC content as a marker. For each chromosome the sign with higher GC content is selected as open chromatin. We then compared the percentage of 1 Mb bins that are assigned the same compartment label by TM3C data versus previously published Hi-C data in four human cell lines (H1-hESC, IMR90 [5]; K562, GM06990 [4]).
Topological domain analysis
We identified topological domains using a previously described hidden Markov model-based software tool [5]. To facilitate direct comparison with the previously published topological domains in human cell lines, we carried out the domain calling for these published datasets using the human GRCh36/hg19 assembly. We applied the topological domain calling on normalized contact maps of our TM3C data at 40 kb resolution. To measure the consistency between the topological domains inferred from TM3C and those from published Hi-C data, we calculated the overlap of domain boundaries obtained between these two assays. We deemed two boundaries, one from each assay, as overlapping if they overlap by at least 1 bp or are adjacent to each other, as described in [5].
Contacts among regions with the same compartment label
We used compartment labels assigned by the eigenvalue decomposition as described above and computed the number of read pairs that define double and triple contacts between two or among three regions all with the same compartment label (all open or all closed) or at least two with opposite labels (mixed). We used only interchromosomal doubles and interchromosomal triples (linking three different chromosomes) for this analysis and eliminated regions that have less than 50% uniquely mappable bases. We then computed the number of all possible pairs and triples of 1 Mb windows and segregated this number into three groups (all open, all closed, mixed) giving us the expected percentages of contacts that should fall into each group. With exactly equal numbers of open and closed compartments for each chromosome, these percentages would be 25%, 25%, 50% for pairs of compartments and 12.5%, 12.5%, 75% for triples of compartments for the groups of all open, all closed and mixed, respectively. We then reported the ratio between the percentage of observed double and triple contacts to expected percentages within each of these three groups. A ratio >1 represents an enrichment for the observed contacts for that compartment label group.
Contacts among regions with similar numbers of DHSs
We performed an analysis similar to the compartment label analysis described above using joint (UW–Duke) DNase hypersensitivity peak calls for the six Tier 1 cell lines (GM12878, H1-hESC, HeLa-S3, HepG2, HUVEC, K562) downloaded from http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/openchrom/jan2011/fdrPeaks. Since there is no DNase data for KBM7 we reported results for only K562 which is also a leukemia cell line. We computed for each 1 Mb window with mappability of at least 50% the number of DHS peaks that overlap with this window. We sorted all these windows by decreasing number of DHSs and labeled the top 50% as “high” and bottom 50% as “low” DNase sensitivity. We then calculated and reported the expected over observed percentage of doubles and triples as described for compartment labels.
Contacts within the same topological domain
After carrying out the topological domain calling using our KBM7-TM3C-1 data, we computed the percentage of intrachromosomal doubles and triples that link loci within the same topological domain. To estimate the significance of the observed percentages, we randomly shuffled topological domains by preserving the distribution of the domain lengths for each chromosome arm as described in Ay et al. [35]. We reported the mean and the standard deviation for the percentage of within domain doubles and triples across 100 randomized shufflings.
Inference of the 3D structure
We modeled each chromosome as a series of beads on a string, spaced approximately 1 Mb apart. We denote by \(\textbf {X} = (x_{1},\ldots,x_{n}) \in \mathbb {R}^{3 \times n}\) the coordinate matrix of the structure, where n denotes the total number of beads in the genome including the newly introduced chromosomes 8B and 15B (n=3289 for the KBM7 genome), and \(x_{i}\in \mathbb {R}^{3}\) represents the 3D coordinates of the i-th bead. Contacts from TM3C data can be summarized as an m×m matrix c, where each entry c
kl
corresponds to the observed contact count between loci k and l. Because contact information does not distinguish between homologous chromosomes, m only includes one copy of each chromosome and m<n. For loci in diploid regions, the contact counts are the sum of contact counts due to each copy of the region. If we denote by Φ:[1,n]→[1,m] the mapping that associates a bead i to a locus Φ(i) of the contact count matrix, this means that the contact count c
kl
between loci k and l is the sum of counts due to interactions between beads in Φ
−1(k) and Φ
−1(l). For any two beads i and j mapping respectively to loci k=Φ(i) and l=Φ(j), let us denote by 0≤μ
ij
≤1 the proportion of counts in c
kl
due to interactions between beads i and j. Since all contact counts must be accounted for by interactions between beads, we must have for any loci k and l:
$$\sum\limits_{i\in\Phi^{-1}(k)\,,j\in\Phi^{-1}(l)} \mu_{ij}=1\,. $$
We propose to jointly infer the structure X and the distributions of contact counts μ
ij
’s by maximizing the likelihood of the observed contact counts. For that purpose, we modeled the contact frequencies \((\mu _{\textit {ij}}c_{\Phi (i) \Phi (j)})_{(i,j) \in \mathcal {D}}\) (
is the set of non-zero contact counts) as independent Poisson random variables, where the Poisson parameter of μ
ij
c
Φ(i)Φ(j) is a decreasing function of the Euclidean distance d
ij
(X) between beads i and j. Our and others’ previous work suggested that the relationship between μ
ij
c
Φ(i)Φ(j) and d
ij
is approximately of the form d
ij
(X)α, with α=−3 [4,27,36]. We can then express the likelihood of the model as:
$$ \ell(\mathbf{X}, \mu) = \prod\limits_{i, j} \frac{\left(d_{ij}^{\alpha}\right)^{\mu_{ij} c_{\Phi(i)\Phi(j)}}} {\left(\mu_{ij} c_{\Phi(i)\Phi(j)}\right)!} \exp \left(- d_{ij}^{\alpha}\right) \,. $$
((1))
To infer the position of each bead, we maximized the log likelihood of the model which is:
$${} {\fontsize{9pt}{9.6pt}\selectfont{\begin{aligned} \mathcal{L}(\mathbf{X}, \mu) = \sum\limits_{i, j} \mu_{ij} c_{\Phi(i)\Phi(j)} \alpha \log(d_{ij}) - d_{ij}^{\alpha} - \log\left(\mu_{ij} c_{\Phi(i)\Phi(j)}!\right)\!. \end{aligned}}} $$
((2))
In practice, we solved the following relaxation since μ
ij
c
Φ(i)Φ(j) may not have integer values
$$ \begin{aligned} \mathcal{L}(\mathbf{X}, \mu) =& \sum\limits_{i, j} \mu_{ij} c_{\Phi(i)\Phi(j)} \alpha \log\left(d_{ij}\right) - d_{ij}^{\alpha}\\ &- \log\left(\Gamma\left(\mu_{ij} c_{\Phi(i)\Phi(j)} + 1\right)\right) \,, \end{aligned} $$
((3))
with the following constraints:
-
d
ij
≤d
max
. To find a suitable d
max
, we first computed the expected distances \(c_{i, i+1}^{-1 / 3}\) for adjacent beads of haploid chromosomes. We set d
max
to the 97% quantile, thus excluding outliers values arising in the normalization procedure.
-
0.3≤μ
ij
≤0.7, where i and j corresponds to loci from the same copy of a diploid chromosomes.
To optimize this non-convex function, we iterated between two steps: (1) infer the 3D structure X; (2) re-estimate the distribution of contact counts μ
ij
between diploid chromosomes. The first step is solved using an interior point method, as described in [27]. For the second step, the optimization problem can be performed with respect to each pair of loci k and l independently. Thus we perform a grid search on {μ
ij
|Φ(i)=k,Φ(j)=l}, with a step size of 0.01.
We ran the optimization 1000 times varying the initialization of the distribution of the contact counts, and another 1000 times varying the initial structure X. We then selected the top 100 structures with the highest log likelihoods.