Quantification of the ligation products
During the ligation step, one can envision to recover different types of products (Figure1A, step 3). Firstly, a RF can simply be circularized on itself (step 3i), resulting in a loop. Secondly, two consecutive RF on the genome can be re-assembled together (step 3ii). This type of event will be designated as a religation event. Note that religation events are virtually indistinguishable from non-digested restriction site (RS) given the original sequence is then restored. A third type of product can be recovered at this step, especially if the digestion is partially incomplete which will always be the case: longer DNA fragments formed out of two continuous RFs can be circularized during the ligation step (step 3iii). Finally, two RFs that are not consecutive on the genome can be ligated together (step 3iv). These products are the nuggets the experiment is digging for, and will be termed here as long-range interactions. Long-range interactions can either be intra or inter-chromosomal. Although inter-chromosomal events are easily identified through mapping of the pair-end reads along the genome, intra-chromosomal events necessitate a more careful examination of the positions of the sequences. A convenient way to identify the type of an intra-chromosomal ligation product is to use the orientation of the sequences obtained from the pair-end sequencing run. Each RF exhibits two extremities. The one with the highest coordinate according to the yeast genome conventional representation is labeled “+” and the other one “-”. Every ligation event therefore falls within one of these four categories: -/-, +/+, -/+ and +/- (see Additional file1: Figure S1A). Whereas long range interactions should not happen with any preferential orientation of the fragment extremities, a circularized RF will always connect its – extremity with its + extremity (Additional file1: Figure S1A). The distribution of interaction types (+/+, -/-, -/+ and +/-) can be plotted for self-interacting fragments as well as for contiguous fragments (i.e. separated by only one RS), and then separated by two, and more RSs. For the later category no preferential orientations are distinguishable (Additional file1: Figure S1B). A strong enrichment in +/- interactions is observed for pairs of collinear RFs. This enrichment is due to the presence of religation events (ii) as well as detection of sites which escaped the digestion step. The formation of type (iii) products is revealed by the fact that interactions between contiguous fragments on the genome are more often found in the -/+ configuration, which corresponds to a loop, than in a -/- or +/+ configuration. The relative number of those different products can be represented with a pie chart (Figure1B). Loops and religation appear to be very frequent events (about 80% of the original data). Those inevitable byproducts were removed from all subsequent analysis. In addition, fragments with no restriction site for the secondary enzyme and therefore that should not be detected according to the experimental protocol were also discarded. Similarly, fragments whose extremities align ambiguously along the reference genome were removed as well (see Methods for details). In total, more than 80% of the initial raw reads were removed for subsequent analysis, which is consistent with other experiments in the field, and leaves room for a lot of improvement.
Identification of major biases in the experimental protocol
Complex protocols involving a large number of steps are likely to generate biases in the data that has to be careful sought for. What we call biases here is a variability which is larger than the expected noise and can be explained primarily by properties of the fragment itself. In the following, three major biases likely to affect the number of detected interactions between fragment pairs were identified: the length of RFs, GC content of the paired-end reads, and the length of DNA segments at the circularization of steps 6 and 9.
The distribution of the number of reads per fragment as a function of the fragment size L is presented on Figure1C. Given the number of positions accessible to fixating agents along a RF increases with its size, one would expect the interaction probability to increase linearly with RF size. For RF under 800 bp, the number of reads per fragment increases, suggesting that indeed the probability for a cross-linking event to occur depends on the length of the fragment. However, for longer RFs, a plateau is reached, suggesting that the maximum probability for at least one cross-linking event to occur along that length is reached. In other words, the probability of longer fragments not to be cross-linked at least once is constant and very small (Methods).
Formaldehyde fixation, which is the first step of 3C based protocols, therefore introduces a length bias for sizes under 800bp. In this range, the longer a RF is, the more likely it will be cross-linked with other RF during the fixation step.
The distribution of reads per possible interaction between two RF extremities was plotted as a function of the GC content of these extremities (Figure1D). From this figure one can see that extreme GC content extremities tend to be under represented in the final interaction reads. Therefore, the PCR reaction or/and the deep-sequencing steps can introduce additional biases, notably by favoring reads with a GC content of about 45%. The bias of GC content in short reads data from high-throughput DNA sequencing has indeed been reported (see Figure2 in[19]). However, such biases do not appear to affect many interactions (see Figure1D).
Quite surprisingly we also identified an original, but retrospectively not unexpected, bias in the two steps involving circularization of DNA segments (Figure1A, step 6 and 9). It is known that the mechanical properties of DNA are such that the length of a fragment can strongly influence the efficiency of a circularization reaction. If the fragment is too small, the bending persistence of DNA is such that both ends cannot be ligated. If the fragment is two long, the entropic contribution to the free energy will also disfavor ligation. Here indeed, the distribution of the sum of the sizes (d
A
+ d
B
) of two interacting RF A and B presents a typical circularization efficiency profile, including an optimal circularization length close to 500 bp (Figure1E,[20]).
Intriguingly, a 10.5 bp periodicity of the circularization efficiency could be observed for the average number of circularization events for which d
A
+ d
B
< 500 bp, overall (i.e for the HindIII-MspI experiment, about 15% of the interactions fall into this category). Such a periodicity is actually predicted by polymer physics and results from the natural twist of the double helix which is 10.5 bp[21]. Here, the phenomenon can be observed at an unprecedented resolution (see inset of Figure1E) and consists in a bias that could affect any experimental procedure involving a circularization through ligation step.
Due to those various biases, some RFs will be involved in more interactions than expected, whereas others will be underrepresented in the final bank (see Additional file1: Figure S2). Since this variability results from the experimental protocol rather than the biological reality, it is worth minimizing theses effects by either correcting or normalizing the observed frequencies of interactions[17, 18]. These correspond to two different approaches: in order to correct the data, one needs to quantify the biases and then to divide each interaction frequency by its expected value, knowing the bias. On the other hand, no prior knowledge of the bias is needed to normalize the data: the procedure consists in dividing each interaction frequency between two fragments by the product of the sums, or the norms, of the total interaction reads involving those fragments (see below).
Generation of a normalized contact map through the “Sequential Component Normalization” (SCN) methodology
The correction method developed for the human Hi-C dataset is not readily adaptable to the yeast dataset since there is an additional circularization bias to the RF length and GC content bias[17]. A important issue with the circularization bias is that it is highly non monotonous: for example, it favors circularization lengths of 261 bp, but disfavors circularization length of 266 bp and again favors circularization lengths of 271 bp and so on and so forth (see inset in Figure1D). A similar methodology that was previously described in[17] was first applied in order to correct for this bias. However, the nature of the bias did not allow reaching a satisfying solution because of the non-monotonous specificity. In the following, instead of correcting each of the interactions frequencies individually, contact maps were normalized globally through what we called the SCN approach, which can be applied to any genomic contact map and independently from the protocol that was used to generate it. The normalization described below is based on the interactions exhibited by the entire restriction fragments, before the second digestion, in order to remain as broadly generalizable as possible to other experimental protocols. The reason why we applied normalization on the fragment instead on the extremities is that for each pair of fragment there are four possibilities to make religation event. Each of those four possibilities will exhibit a different GC content and a different dA+dB and therefore the biases described in Figure1D and 1E, that depends on the extremities, will be smoothen out when aggregating the combinations together. This point was also discussed in the original paper[13]. The advantage of this method is that it smoothens out all the biases described above and therefore provides a cleaner view of the frequency of interaction between any pair of restriction segments in the genome.
Intra- and inter-chromosomal interactions were treated separately but using the same procedure. Firstly, normalization will give an equal weight to each fragment in the contact map. Therefore, RF with very low number of reads, corresponding to RF that could not be properly detected, are likely to introduce noise in the normalized contact map and have to be removed (see Additional file1: Figure S3). In order to identify these fragments, we computed the distribution of reads in the contact map (see Additional file1: Figure S2B). This distribution is roughly gaussian, with a long tail corresponding to low interaction fragments. Based on this distribution, we cut the tail of the distribution (see Methods for further information).
Once low interacting fragments are removed, we wish to normalize all rows and columns of the contact map to one so that the matrix remains symmetric. This was done through the following simple procedure. Firstly, each column vector was normalized to one, using the euclidian norm. Then each line vector of the resulting matrix was normalized to one. The whole process was repeated sequentially until the matrix become symmetric again with each row and each column normalized to one (Additional file1: Figure S4 and Methods). Usually, two or three iterations are sufficient to insure convergence. Since it involves a sequential normalization of column and line vectors of the matrix, this method was named Sequential Component Normalization (SCN). This normalization can be viewed as a sequence of extensions and shrinking of interaction vectors so that they tend to reach the sphere of radius one in the interactions space. A similar and faster approach is to divide all the matrix elements c
ij
by the product of the norms of row i and column j :. This method yields to a normalized contact map overall very similar to SCN (Additional file1: Figures S5 and S6). However since the sum of each component is not necessarily equal using this method, it may bias further analysis such as assessing the 3D colocalization of genomic elements (see below). An alternative normalization method has been used so far by other groups[9], that use the sum of the components instead of the euclidian norm :. We noticed that this method yields to a contact map with lower contrast than the SCN (Additional file1: Figure S5 and S6) and therefore recommend SCN use in further works. The normalization using the sum will give more weight to fragments wich makes fewer interactions whereas our normalization will give more weight to fragments interacting moderately with many fragments. Intra and inter-chromosomal interactions were separated in two datasets and the corresponding normalized contact matrices between RFs were plotted as a function of their position along chromosomes (Figure2A and3A, respectively).
S. cerevisiae contact maps after SCN
The normalized maps overall are similar to those observed before[13]. Since the probability of interaction between monomers along a polymer is decreasing with the linear distance between them, the diagonal which represents neighboring RFs presents the highest interactions score[4]. In order to increase the contrast and observe interactions between non-adjacent intra-chromosomal RF we then divided the number of interactions between fragments separated by a genomic distance D
g
by the average interaction count between fragments separated by the same distance D
g
(see Methods). Some features appear more contrasted with respect to the original analysis, with a typical X shape pattern centered on the centromere for each chromosome (Figure2B). This pattern reflects the fact that the centromere does not interact much with the chromosome arms whereas both arms can interact together. In addition, interactions between RF located on both arms appear clearly more constrained when at symmetrical distances from the centromere and within its vicinity (Figure2C). In addition, the bipartite structure of chromosome 12 due to the insulating presence of the nucleolar rDNA repeats remains clearly apparent[13]. The corrected contact maps for inter-chromosomal interactions also reveal striking features (Figure3A). Centromere clustering is clearly apparent and results in all the centromeres interacting with each other’s on the map, as in[13]. The interactions between two chromosome arms along their length are also extremely clear. The X shaped patterns at inter-centromeric interactions observed in the matrix indicate that centromeres are somehow isolated from the rest of the chromosomal arm sequence (see for instance chromosome VII and chromosome XVI on Figure3B). This feature is even more striking when the correlation matrix is drawn similarly to[8] (Additional file1: Figure S7). In this matrix, each element c
ij
is the Pearson coefficient between the vectors i and j.
In addition, telomeres are also found to have enriched contact frequencies (for instance chromosome XIII and chromosome IV on Figure3C). To investigate the role of the chromosomal arm length in the inter-chromosomal interaction frequencies, all chromosomal arms were ranked with respect to their length and the corresponding contact maps were drawn (Figure4A). This layout conveniently reveals global interaction patterns in respect to chromosomal arm size: shorter arms tend to interact with shorter arms whereas longer arms tend to interact with longer arms (from the upper left corner to the lower right corner). On the contrary, shorter arms tend to make very few contacts with longer ones (upper right and lower left corners on Figure4A). Zooming on the five shorter arms on the contact map reveals that the interaction frequencies between subtelomeres from shorter arms are important, sometimes even more than centromeres (e.g arms III-L and IX-R, see Figure4B). To investigate the arm length relationship with sub-telomere interactions, we computed the mean interaction frequencies between all sub-telomere pairs for both the normalized and original data. The normalized data exhibit two types of preferred subtelomeric interactions, one for short and one for long chromosome arms, whereas the orginal analysis mostly emphasized short arms interactions (see Additional file1: Figure S8). Given that the measurements reflect a population average, it is impossible to know from this data if all the telomeres interact preferentially in a similar ways in all cells taken individually. However, similar preferred interactions have been observed in single cells using fluorescent microscopy approaches[22, 23] as well as in recent modeling approaches[24]. In addition, the rDNA now appears not only as an intra-chromosomal insulator region, but also modifies the interacting properties of the two DNA segments it delimits. Whereas a gradual shift in interaction frequencies from centromere to telomere is observed for long arms, for chromosome 12 the DNA segment located between the rDNA and the telomere seems less constrained that the one before the rDNA cluster (Figure4C).
Re-assessing the 3D colocalization of genomic elements
The influence of this normalization procedure on the preferential interactions detected previously was addressed. In the original analysis, receiver operating curve (ROC) confirmed an expected enrichment of interactions for centromeres and telomeres resulting from the Rabl configuration[13, 23]. More interestingly, early replication origins[25] were also shown to interact preferentially, a result experimentally supported[3]. Finally, two preferential interactions regions where identified for tRNA genes, one around the spindle pole body (SPB) and one in the vicinity of thenucleolus[13].
In this paper, we used a different method than the originally published ROC analysis. The initial ROC analysis asked the question: among the pool of strong interactions, is there an enrichment in interactions between two fragments which both carry the genomic object of interest. We ask the question: among the pool of strong interactions carrying one feature of interest, is there an enrichment for interactions with a fragment carrying the same feature (for details about the implementation, see Methods). ROC analysis on the normalized data confirmed the expected centromeres and telomeres preferential interactions (see Figure5A). In addition, enrichment in interactions between early replication origins was also observed. However, the frequencies of interactions between restriction fragments containing tRNA genes did not exhibit significant increase when using the normalized data (Figure5B, compare the right panel with the left panel). This was found to be true for all RFs containing tRNAs or for RFs containing only tRNAs previously found to interact preferentially with the SPB or with the nucleolus (see Figure5B).
The previously described preferential interaction between tRNA genes was lost because it resulted from the fact that, without normalization, two fragments interacting overall more with the whole genome will interact together more frequently than other fragments. This is actually the case for tRNA fragments (see Additional file1: Figure S9). The reason why tRNA bearing RF interact more frequently than others with all other fragments does not depend on their size, and remain open. A local improvement in cross-linking efficiency resulting from the chromatin state and/or presence of protein complexes is a possibility. Of course, we do not exclude the possibility of actual preferential interactions between tRNA as observed experimentally[26, 27] and suggested by other approaches[24]. However, more experiments and higher resolution will be needed to detect those through genomic 3C approaches.
Normalization of the human genome contact map using SCN
In order to test how the SCN approach can be applied to the interaction map of a larger genome, we used the human genome-wide dataset published in 2009 by Lieberman et al.[8]. The restriction enzyme used in this dataset cuts the human genome over 830,000 times. Therefore, the number of potential interaction in the experiment is higher than 340 billion. Since the typical number of reads obtained in such experiment hardly reaches one billion[11], the resulting genome wide contact matrix is very sparsed. In order to get enough information to build a contact map, one can bin the matrix by adding the contacts over several fragments along the genome together. For intra-chromosomal interactions, a typical bin size of about ten fragments is adequate since most of the interaction detected in such an experiment are intra-chromosomal and since the number of possible intra-chromosomal interactions is much lower than the number of possible inter-chromosomal interactions. For inter-chromosomal interaction the bin size has to be increased considerably. We used a bin of one hundred fragments to build the corresponding contact map for the human genome and normalized it through the SCN method. The resulting map clearly shows preferential interactions between small chromosomes and between the long arm of long chromosomes (Additional file1: Figure S10). Importantly, ROC curves which are used to determine the genomic elements enriched at high interaction hotspot strongly depend to whether or not the data were normalized. We performed ROC analysis on the binding sites of the CCCTC-binding factor (CTCF), a zinc finger protein that plays an important role in the organization of chromatin by mediating inter and intra-chromosomal contacts between distant loci[28, 29], PolII, the centromeres and the telomeres. The results for both raw and normalized data clearly show that the preferential interactions of CTCF, PolII and centromeres are only seen on the properly normalized data (Figure6).