Chromatin Loop Anchors Are Core Structural Components of the Gene Expression Machinery in Maize

Background Three-dimensional chromatin loop structures connect regulatory elements to their target genes in regions known as anchors. In complex plant genomes, such as maize, it has been proposed that loops span heterochromatic regions marked by higher repeat content, but little is known on their spatial organization and genome-wide occurrence in relation to transcriptional activity. Results Here, ultra-deep Hi-C sequencing of maize B73 leaf tissue was combined with gene expression and open chromatin sequencing for chromatin loop discovery and correlation with transcriptional activity. Chromatin loops, made of two “anchors” anking a loop “interior”, overlap with up to 90% of high-resolution interaction domains from a previous public maize interactome dataset. A majority of all anchors are shared between multiple loops, suggesting a highly dynamic environment, with a conserved set of anchors involved in multiple interaction networks. Chromatin loop interiors are marked by higher repeat contents than the anchors anking them. A small fraction of high-resolution interaction anchors, fully embedded in larger chromatin loops, co-locate with active genes and putative protein-binding sites. Combinatorial analysis indicate that all anchors studied here co-locate with at least 81.5% of expressed genes and 74% of open chromatin regions. Up to 63% of all unique variants derived from a prior public maize eQTL datasets overlap with Hi-C loop anchors. Anchor annotation suggests that <7% of all loops detected from one Hi-C library are potentially devoid of any genes or regulatory elements. The overall conservation and organization of chromatin loop anchors in the maize genome suggest a loop modeling system hypothesized to resemble phase separation of repeat-rich regions. Conclusions A majority of expressed genes and open chromatin regions co-locate with a conserved set of chromatin loop anchors. The results presented here will be a useful reference to further investigate the function of chromatin loop anchors and of the formation of interaction regions in the regulation of gene expression in maize.

In plants, the understanding of chromatin organization is mainly derived from a relatively small number of studies in species that include Arabidopsis [6], rice [8] and maize [9]. Large plant genomes can be partitioned into TAD-like domains, which are in fact compartment domains [9]. Some domains are enriched in active genes, open chromatin and active histone marks while others are enriched in epigenetics signatures typical of repressive domains (including DNA methylation) [9]. In maize, chromatin loops can be formed between active chromatin domains [9], forming a rich and complex molecular interaction network linking distal and proximal regulatory elements [10], suggesting that the presence of chromatin loops in repeat-rich plant genomes could be a mechanism allowing distal regulatory elements to activate, or repress, genes separated from those elements by condensed heterochromatin [11]. Speci c variants in regulatory elements also may contribute to variations in gene expression through long-range chromatin interactions, linking eQTLs to their associated genes via chromatin loop interactions [12].
While long-range loop formation between chromatin domains has been shown to link together gene-rich and distal regulatory regions in complex plant genomes, no study yet has been performed to determine the extent of such mechanism and its genome-wide correlation to gene transcription in maize. In addition, while Hi-C is known as a "low-resolution" method capturing mainly long-range chromatin loops [9], its relationship to higher resolution loops detected with methods such as Chromatin Interaction Analysis by Paired-End Tag Sequencing ("ChIA-PET") [10] still remains to be determined, along with the predisposition of expressed genes and regulatory elements to co-locate primarily with speci c loop types. In this study, the overall structure of chromatin loops in maize and their prevalence as a putative mode of action associated with the regulation of gene transcription were evaluated. Structural relationships between chromatin loops, gene expression and open chromatin regions (used as indirect signals for protein binding to DNA) were systematically assessed through ultra-deep sequencing of Hi-C libraries, combined with the generation of RNA-Seq and Assay for Transposase-Accessible Chromatin using sequencing ("ATAC-Seq") datasets from the same maize tissue, and further compared to public maize high-resolution interactome and eQTL functional datasets [10,13]. Results showed substantial overlaps between those features, revealing chromatin loops as biological components of the gene regulation machinery in maize, with a restricted number of chromatin loop anchors as its core structural unit.

Results
Whole maize B73 leaf tissue was collected at development stage v04. The same batch of plants, at the same stage (v04) and divided into four biological replicates (four plants per replicate) was used for gene expression pro ling (RNA-Seq; four replicates), three-dimensional chromatin pro ling (Hi-C; two replicates) and accessible chromatin (ATAC-Seq; three replicates).
The ultra-deep sequencing of two Hi-C biological replicates led to a total number of 3,435,596,872 and 3,424,795,714 raw paired reads, for replicates 1 and 2, respectively. Filtering of the raw data and mapping to the B73 AGPv4 reference genome sequence led to the detection of 392,396,275 and 449,828,472 Hi-C contact pairs, after combining inter-chromosomal and intra-chromosomal contacts for replicates 1 and 2, respectively (Table 1). Table 1 interaction metrics for the chromosomal distribution of Hi-C contact pairs. Intra-chromosomal contacts are further divided into short range (< 20Kbps) and long-range (> 20Kbps) contacts. Hi-C interaction matrices showed strong interactions between neighboring loci on euchromatic arms, accompanied by cis and trans interactions between centromeric and telomeric regions and cis interactions between chromosomal arms (Fig. 1a). The chromosomes could be further partitioned into A/B compartments by eigenvector analysis of the Hi-C interaction matrices (Fig. 1b). Further analysis showed evidence of inter-nested TAD-like regions, with prevalent cis interactions within those regions.
Increased interactions at their borders suggested the existence of chromatin loops (Fig. 1c).
A total of 17,176 and 25,917 chromatin loops were detected for Hi-C replicates 1 (See Additional File 1) and 2 (See Additional File 2), respectively. Distances between anchors within a loop varied from 30Kbps to > 1Mbp (Fig. 1d). Repeat element density analysis between anchors and the regions located between two anchors ("loop interiors") showed that repeats were more prevalent in loop interiors (Fig. 1e). There were 7,917 loops present in both replicates, with both anchors overlapping by at least 1 bp, and only Sequence coverage peaks from ATAC-Seq libraries are generally seen as proxy for binding sites and gene regulatory elements in genomic DNA [14]. Sequencing of three ATAC-Seq biological replicates led to the detection of 20,955 to 39,584 open chromatin peaks (see Additional le 3). For each replicate, peaks were classi ed as located between two nucleosomes ("nucleosome-free" or "NF") or overlapping one or more nucleosome ("multi-nucleosome" or "MN"), based on the distance between ATAC-Seq paired sequencing reads. While NF peaks tended to be discrete peaks (~ 100bps) located immediately upstream or downstream of genes (proximal peaks), or in intergenic regions (distal peaks), MN peaks tended to be broad peaks primarily covered entire gene regions. A list of 32,009 "consensus" NF and MN peaks was generated, where a peak had to be present in at least two individual replicates to be conserved, out of which only the 19,532 NF peaks were kept for further analysis (See Additional File 4).
Co-location analysis of ATAC-Seq NF peaks with chromatin loops was assessed, based on the following criteria. As many chromatin loop anchors were shared between multiple loops, in a signi cant number of cases, a peak could align to an anchor for one loop and a loop interior for another loop. Therefore, peak overlaps to loop regions were determined rst based on their potential overlap to at least one loop anchor.
If no overlap was detected, peaks were assessed based on their potential overlaps to loop interiors, then to genomic regions located outside of chromatin loops. Using this approach, up to 13,026 peaks, out of 19,532, overlapped primarily with anchors, while up to 4,649 peaks overlapped primarily with loop interiors (Fig. 2b). On the other hand, only 33% of all anchors overlapped with open chromatin peaks, suggesting technical constraints that limited the total number of peaks detected here, but also the possibility for distinct functions, or lack thereof, for some anchors not overlapping with peaks. To further explore potential relationships between gene expression and loop formation, gene content for anchors overlapping with at least one NF peak was assessed. In that analysis, anchors were annotated based on the activity of overlapping genes located next to NF peaks. The data show that 62% of all NF peaks overlapping with chromatin loop anchors either overlapped or were located < 2Kbps away from expressed genes, suggesting those peaks may correspond to proximal regulatory regions, including promoters. Conversely, analysis of gene activity in loop anchors (including genes present in multiple anchors) showed that, for 71% of anchors harboring genes, at least one of those genes was expressed (See Additional File 6). Expression was associated with the absence of inactive genes within the anchor for 88% of those anchors, and this trend was generally accentuated by the presence of proximal open chromatin peaks. 74% of the remaining anchors carried distal peaks, possibly representing long-range regulatory elements.
Anchor annotation using replicate 1 as an example (See Additional File 7) showed that, of all 18,296 anchors with at least one expressed gene, 15,058, or 82%, had one expressed gene only. A total of 3,684 anchors contained at least one peak and no expressed genes (including 2,285 containing one peak only).
Out of those, 2,897 were part of a loop where the opposite anchor contained at least one expressed gene and at least one peak, suggesting functional interactions between proximal and distal elements facilitated by loop formation. In addition, there was a total of 5,488 loops containing at least one expressed gene in both anchors, including 1,058 harboring at least one peak in each anchor, indicating potential interactions between proximal elements. Interestingly, out of the 11,066 anchors from replicate 1 also overlapped with the high-resolution chromatin interactions described above [10].

Discussion
A majority of all expressed genes and discrete open chromatin regions (located primarily at the 5' and 3' ends of genes, along with other intergenic regions, that may correspond to proximal and distal regulatory element binding sites) co-locate with a conserved set of chromatin loop anchors. A vast majority of the chromatin loops detected via Hi-C share anchors with other loops, including high-resolution interactions detected in the same genotype (B73), but with a more accurate technique (ChIA-PET) and on a relatively different tissue type (whole shoot). A majority of anchors detected here overlap with one or more expressed genes and open chromatin regions. Many are shared between multiple loops, including some that are fully embedded within larger ones. Taken together, those data indicate that gene transcription and its regulation are closely linked to loop formation in maize through a highly dynamic threedimensional mechanism reminiscent of enhancer/promoter interaction complexes described elsewhere [10].
Interestingly However, it is likely that, due to unexpected technical constraints, the 19,532 consensus NF peaks used here represent an underestimate of the expected total peak counts. Therefore, many of the "empty" loop anchors listed here may harbor putative open chromatin regions that, simply, have remain undetected in the present study.
When combined with the observation that repeat contents are denser in loop interiors than in the loop anchors anking them, the results shown here are in line with a phase separation model of chromatin dynamics in maize [15]. In this model, chromatin compartmentalization in transcribed regions is marked by the exclusion of adjacent repeat-rich regions and the formation of supramolecular condensates driving the regulation of gene expression through the active interactions between molecules that include proteins and nucleic acids. Such interaction potentially could be marked by the presence of ATAC-Seq peaks in loop anchors. Interestingly, some of the loops analyzed here exhibit multiple peaks in their anchors (see Additional le 7), suggesting the possibility of multi-protein assemblies at the vicinity of potential DNA binding sites. As the work done here was performed on whole tissue, it is possible that some of the signals observed here are related to speci c cell types. As a result, more cell type-speci c studies will be required to assess the mechanistic aspects of tri-dimensional regulation of gene transcription in maize. Of interest will be the need to determine whether loop formation facilitates gene expression or instead derives from it.

Conclusions
The results shown here indicate that the scope of the functional maize genome may be narrowed down to a restricted number of discrete loop anchors. Data suggest that those regions anchor regulatory complexes interacting with a majority of the genes expressed in maize, through three-dimensional interactions hypothesized to be in line with a phase separation model of chromatin dynamics. While further investigation will be required to con rm such model, the various sequencing datasets generated for this study narrow down the functional maize genome to a relatively conserved set of chromatin loop anchors, therefore facilitating the systematic discovery of new motifs, such as enhancers, and of eQTL variants, making them both prime candidates for future functional genomics studies through their direct associations with the transcription of speci c gene sets. In addition, the current datasets may be used to facilitate prediction modeling for breeding and genomic selection applications, through the ranking of speci c variants in loop anchors and of their predicted functional impact in genotypes of interest.

Plant material
Whole maize B73 v04 leaf tissue, grown in a greenhouse, was collected and stored at -80°C after grinding.
Plants were divided into four biological replicates (four plants per replicate). The same replicates were used for gene expression pro ling (RNA-Seq), accessible chromatin (ATAC-Seq), and three-dimensional chromatin pro ling (Hi-C).

Hi-C library construction
Grinded tissue from two biological replicates (replicates 1 and 2), kept at -80°C and each containing v04 leaves from four plants, were used for preparing two Hi-C replicate libraries. Libraries were constructed with the Dovetail Hi-C Library preparation Kit (Dovetail), according to the manufacturer's instructions, including protocol modi cations recommended by Dovetail for plant tissue. Since tissues had previously been grinded (in liquid nitrogen) before long-term storage at -80°C, 2ml 1X PBS and 81μl 37% formaldehyde were added directly to 250mg of frozen and grinded tissue in a 15-ml tube then incubated at room temperature for 15min, prior to completing the protocol.

RNA-Seq library construction
Total RNA was extracted from grinded v04 leaf tissue previously stored at -80°C using the RNeasy 96 kit (Qiagen). Four biological replicates, each containing v04 leaves from four plants, were used for total RNA extraction and subsequent poly(A) RNA library construction. Poly (A) RNA-Seq libraries were built using the TruSeq stranded mRNA library preparation kit (Illumina) according to the manufacturer's instructions.

ATAC-Seq library construction
Grinded tissue from three biological replicates, kept at -80°C and each containing v04 leaves from four plants, were used for nuclei extraction and ATAC-Seq library preparation. Prior to library construction, 1L of 1X Nuclei Isolation Buffer (NIB) solution (10mM Tris-HCl, pH 8.0; 10mM EDTA pH 8.0; 7.45g/L KCl; 171.2g/L sucrose; 1g/L spermidine trihydrochloride; 0.35g/L spermine tetrahydrochloride) was prepared and sterilized by ltration. Approximately 1g of frozen leaf tissue was mixed with 25ml NIBM (0.001% beta-mercapto ethanol in 1X NIB) and incubated on ice for 15 minutes. After ltering through a 100μm lter, the pellet was washed with 15ml NIBM and ltered again with a 40μm lter. 2ml of NIBT (10% Triton X-100 in 1X NIB) were added to the 40ml solution and incubated on ice for 15min. After centrifugation at 2,400g for 15min at 6°C, the pellet was resuspended on ice and 25-35ml of 1X NIB were added. After another centrifugation at 2,400g for 15min at 6°C, the supernatant was removed, and the nuclei pellet was kept on ice. Nuclei were counted after mixing 20μl Trypan Blue with 10μl 1X NIB and 10μl nuclei suspension and loading 10μl of the resulting mix on a hemocytometer. After counting, 40,000 nuclei were mixed with 1X Tagment DNA (TD) Buffer (Illumina) and 2.5μl Tagment DNA Enzyme 1 (TDE1) (Illumina) in a 50μl reaction. After incubation at 37°C for 30min, mixing up and down a few times, the tagmented DNA was cleaned-up in 15μl EB buffer with a MinElute PCR Puri cation Kit column (Qiagen) and stored at -20°C. PCR reaction then was performed after mixing 15μl of tagmented DNA with 2.5μl each of 25μM i5 (AATGATACGGCGACCACCGAGATCTACACNNNNNNNNTCGTCGGCAGCGTC) and i7 (CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTCTCGTGGGCTCGG) PCR primers (where N indicates the presence of an 8-bp barcode) in 1X Q5 HotStart Master Mix (NEB), and incubating 5min at 72°C, 30sec at 98°C, followed by 5 cycles of: 10sec at 98°C; 30sec at 63°C; 1min at 72°C, and hold at 4°C. After this initial PCR, 5μl of ampli ed DNA was quanti ed via qPCR (using the same PCR primers) to determine the optimum number of PCR cycles for the remaining 45μl before over-amplifying the sample (characterized by ¼ of the maximum uorescent intensity on the qPCR plot). Typically, PCR reactions were completed after adding 6-8 ampli cation cycles. The DNA then was cleaned-up with a MinElute column and resuspended in 15μl EB buffer (Qiagen).

Sequencing and data analysis
Sequencing of the two Hi-C biological replicates (replicates 1 and 2) was performed on Illumina HiSeq 2500 and NovaSeq 6000 systems, at 2x150bps. Filtering of the raw read data and mapping to the B73 AGPv4 reference genome sequence were performed using the Juicer package [16]. Frequency of contacts were plotted on a 2D matrix using the Juicebox utility [17]. Chromatin loops were detected using the HICCUPS software package [4] (juicer_tools versions 1.9.9 and 1.14.08, for replicates 1 and 2, respectively). Default values for chromatin loop anchor lengths vary from 5 to 25Kbps, with a majority at 10Kbps.
For repeat density analysis, repeat elements for the B73 AGPv4 reference genome sequence were obtained from MaizeGDB. The manner in which these elements overlap the loops predicted from the replicates 1 and 2 data sets was examined. Each loop was treated as the union of two anchor regions and an interior region. These three components were compared to the set of repeat elements using "bedtools -coverage" [18] which reported the number of positions in each covered by at least a single repeat element. With this and the known sizes, the fraction of anchor and interior positions covered in each loop was computed.
For eQTL analysis, 61,188 eQTL records (see Supplemental Table 6 in [13]) provided the genomic location of each eQTL, its lead SNP, and the associated gene relative to the maize B73 RefGen v2 reference. Coordinates of the eQTLs and the SNPs were translated to coordinates in Zm-B73-REFERENCE-GRAMENE-4.0 (AGPv4) using the EnsemblPlants Assembly Converter. Gene names were translated to the B73 v4 Zm00001d.2 gene model set names using cross-reference information provided by MaizeGDB. A number of eQTLs could not be translated directly, as, for example, when the associated gene in RefGen v2 no longer existed in AGPv4 or the converter split it into several segments. In the former case the original record was ignored; in the latter, the segment containing the SNP was retained as the AGPv4 eQTL. Each AGPv4 eQTL record (eQTL region, lead SNP, associated gene) then was examined in conjunction with loop structures and coordinates predicted from replicate 2.
RNA-Seq libraries derived from all four biological replicates were sequenced on an Illumina HiSeq 2500 system at 1x50bps (single 3' end sequenced), targeting ~20MM single reads per replicate. Sequences were trimmed based on quality scores and mapped to the B73 AGPv4 reference genome using a number of aligners, including BOWTIE2 [19], BWA -MEM [20] and the RSEM pipeline [21]. Expression analysis was performed using Kallisto [22] and the Sleuth expression analysis suite in R [23] was used to determine abundance at the gene level. Counts then were normalized and converted to Transcript Per Million, or TPM [23].
Gramene gene models for build 4.0 of B73 were obtained from the MaizeGDB repository, speci cally from the GFF3 le Zm-B73-REFERENCE-GRAMENE-4.0_Zm00001d.2.gff3. Feature types "gene," "lincRNA_gene," "miRNA_gene," and "tRNA_gene" were selected, yielding 44,474 genes. Of these, 38,847 were reported to be associated with chromosomes 1-10. Individual expression levels (TPM counts) reported for each transcript of a gene in the four biological replicates were averaged to provide a global average expression level for each gene. Genes with average leaf expression level greater than or equal to one were taken as expressing. With this criterion, 18,700 genes associated with chromosomes 1-10 were listed as "expressed genes" while the remaining 20,147 were listed as "unexpressed". Finally, genes were mapped to Hi-C chromatin loops using BEDTools [18].
Sequencing of three ATAC-Seq biological replicates was performed on an Illumina NovaSeq 6000 system at 2x50bps. After removing contaminant adapter and organellar sequences, reads were mapped to the AGPv4 reference genome. After ltering for proper read pair alignments and removing duplicated reads, resulting BAM les were separated based on the distance between read pairs into "NF" ("nucleosomefree") regions and "MN" ("multi-nucleosome") regions. Reads mapping to speci c regions known to exhibit high coverage depth (for example, homologs to organellar genes) were removed using "bedtoolsintersect" [18]. NF and MN peaks were called separately with the MACS2 software tool [24]. Consensus peaks were generated by searching for physical overlap (>1bp) between peaks generated for each replicate.
ATAC-Seq NF consensus peaks were classi ed in relation to their proximity to a particular expressing gene. The location of each expressing gene was compared to the set of 19,532 NF consensus peaks on chromosomes 1-10. For each, the closest peak and its distance in bp was determined using "bedtoolsclosest" [18], comparing the genomic coordinates (start-end) of the gene with the genomic coordinates (start-end) of each peak. Genomic coordinates and strand information for genes were taken from the Gramene gene models GFF3 le. The nearest peak was declared "overlapping" if any part of its span overlapped (at least 1 bp) any part of the gene's span, in which case "bedtools -closest" reported a genomic distance of 0 bp. Otherwise, a signed distance was reported, negative if the peak was located "left" of the gene, positive when it was located "right" of it. The distance indicated the smallest absolute difference between the endpoints of the gene and the peak. This signed distance and the known strand of the gene allowed determination of the peak as lying 5' or 3' of the gene. Peaks then were mapped to gene models using BEDTools [18] according to the following four categories: 1) peaks overlapping a particular gene model; 2) peaks located <2Kbps downstream from a gene model; 3) peaks located <2Kbps upstream from a gene model; and 4) peaks located >2Kbps away from a gene model (upstream or downstream).
Peaks were mapped to chromatin loops using BEDTools [18]. Prioritization of peak overlaps was made necessary by the fact that a large fraction of chromatin loops shared one anchor. It was performed according to the following order: 1) peaks overlapping an anchor region; 2) peaks overlapping a loop interior region; 3) peaks not overlapping any loop regions. Following that order, peaks mapping to category 1 were removed from the list of peaks mapped to category 2 (and from category 3 when mapping to category 2). Figure 1 Interaction matrices of maize leaf v04 Hi-C replicate 1 library. a Genome-wide interaction matrix. Each interaction is represented by a "pixel" on the map and the frequency of interactions within a particular region is proportional to the number of pixels. Chromosomes are labeled by numbers (1 to 10, starting with Chr10 at the top left). b Interaction matrix for Chr08. Two TAD-like chromatin domains are marked by black circles indicating a higher frequency of interactions within a particular chromosomal region. Chromosome-wide eigenvectors are shown on top and left of the interaction matrix. A and B chromosomal compartments are colored blue and red, respectively. c Interaction matrix for a ~3Mbps region located on Chr03. The Hi-C contact matrix shows evidence of TAD-like domains, with increased cisinteractions at their borders, suggesting the formation of chromatin loops. Seven chromatin loops, marked by solid arrows on both side of the diagonal, are shown as examples. d Distribution of chromatin loop lengths (e.g., distance between anchors) for replicate 1 and replicate 2 Hi-C datasets. Lengths are shown in Kbps (x-axis). The origin of the size oscillation pattern shown for replicate 1 remains to be determined. e Repeat density analysis in anchor and loop interior regions for all loops detected in replicate 1 and replicate 2. Each dot in the graph represents an individual loop. X-axis: fraction of bases within whole anchor regions (0 to 1) that are occupied by conserved elements; Y-axis: fraction of bases within whole interior regions (0 to 1) that are occupied by conserved elements.