Spatial features for Escherichia coli genome organization
- Ting Xie†1,
- Liang-Yu Fu†1,
- Qing-Yong Yang†1,
- Heng Xiong†1,
- Hongrui Xu1,
- Bin-Guang Ma1 (corresponding author) and
- Hong-Yu Zhang1 (corresponding author)
© Xie et al.; licensee BioMed Central. 2015
Received: 20 July 2014
Accepted: 19 January 2015
Published: 5 February 2015
In bacterial genomes, the compactly encoded genes and operons are well organized, with genes in the same biological pathway or operons in the same regulon close to each other on the genome sequence. In addition, the linearly close genes have a higher probability of co-expression and their protein products tend to form protein–protein interactions. However, the organization features of bacterial genomes in a three-dimensional space remain elusive. The DNA interaction data of Escherichia coli, measured by the genome conformation capture (GCC) technique, have recently become available, which allowed us to investigate the spatial features of bacterial genome organization.
By renormalizing the GCC data, we compared the interaction frequency of operon pairs in the same regulon with that of random operon pairs. The results showed that the arrangement of operons in the E. coli genome tends to minimize the spatial distance between operons in the same regulon. A similar global organization feature exists for genes in the biological pathways of E. coli. In addition, genes that are spatially close to each other (even if they are far apart on the genome sequence) tend to be co-expressed and their products tend to form protein–protein interactions. These results provide new insights into the organization principles of bacterial genomes and support the notion of transcription factories.
This study revealed the organization features of Escherichia coli genomic functional units in the 3D space and furthered our understanding of the link between the three-dimensional structure of chromosomes and biological function.
Thousands of genes are compactly encoded in bacterial genomes and orchestrate life activities such as DNA replication, RNA transcription and protein translation. The genes need to be well organized in the genome for effective regulation of different biological processes. Bacterial genes are not randomly distributed on the genomic sequence, but are organized in sequential functional units called operons. The genes in an operon tend to be co-expressed [1,2] and their protein products are more likely to interact with each other [3,4]. Operons participating in the same biological pathway or regulon (a group of transcriptionally co-regulated operons) also tend to be close to each other on the genome sequence and are present in one or multiple clusters [5,6]. However, numerous large regulons comprise multiple operon clusters that are far apart on the genome sequence. The organization of these long-range regulons has been suggested to be related to the three-dimensional packing of the chromosome, but this remains to be examined.
In the past decade, the chromosome conformation capture (3C) technique and its derivatives, such as 4C, 5C, Hi-C, and TCC, have been developed to detect DNA–DNA interactions and thereby infer the spatial organization of chromosomes. The application of these techniques in eukaryotes has allowed contact patterns between regulatory elements to be interpreted in 3D space [8,9] and has provided substantial information about the principles of chromosomal organization [10,11]. However, the application of 3C techniques in prokaryotes is still in its infancy. Recently, Cagliero and co-workers determined the chromosome conformation of Escherichia coli growing at the exponential (L) and starvation (S) phases using the genome conformation capture (GCC) technique. In this study, we used these valuable datasets to investigate the spatial features of bacterial genome organization.
Results and discussion
Renormalization and profile of the GCC data
We renormalized the GCC datasets in the following steps. First, high-quality reads were mapped onto the reference genome (E. coli K12 MG1655) using bowtie2 (version 2.1.0). The resulting contact counts were then refined by applying a distance threshold between the contact fragments to remove self-ligation, non-ligation and random breaks (Additional file 1: Table S1). Noise was removed by requiring a minimum number of contacts, chosen by controlling the false discovery rate (FDR; Additional file 1: Table S2, see Additional file 2). We divided the genome into 10-kilobase (kb) bins to derive the DNA interaction information. At 10 kb resolution, 84.05% of the operons and 90.86% of the genes were inside (not across) the bins. Because the uneven distribution of restriction enzyme sites (RESs) can bias the interaction frequencies, we normalized the interaction frequencies by dividing by the number of HhaI RESs in each bin to remove this bias (Additional file 1: Figure S1).
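The statement that most operons and genes lie inside (not across) a 10-kb bin can be checked with a few lines of code. The sketch below is illustrative only: the function names and the toy coordinates are hypothetical, and 0-based inclusive coordinates are assumed.

```python
BIN_SIZE = 10_000  # 10-kb bins, as stated in the text


def within_one_bin(start, end, bin_size=BIN_SIZE):
    """True if a feature with 0-based inclusive coordinates [start, end]
    starts and ends in the same bin (assumed coordinate convention)."""
    return start // bin_size == end // bin_size


def fraction_within_bins(features, bin_size=BIN_SIZE):
    """Fraction of (start, end) features that do not cross a bin boundary."""
    if not features:
        return 0.0
    inside = sum(within_one_bin(s, e, bin_size) for s, e in features)
    return inside / len(features)


# Toy coordinates (hypothetical, not real E. coli annotations).
toy_operons = [(100, 2_500), (9_500, 10_400), (12_000, 15_000), (19_999, 20_000)]
print(fraction_within_bins(toy_operons))
```

Applied to real operon and gene coordinates, the same function would give the kind of percentages quoted above, provided the coordinate convention matches the annotation source.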
Spatial features for E. coli genome organization
By considering the individual operons in each DNA bin, the interaction frequencies between operons were derived from the interaction information of DNA fragments, and their connections to the operon organization were investigated. The interaction frequencies between operon pairs within a regulon were calculated and compared with those of randomly sampled operon pairs at similar sequence distances (with the same number of operons in between), excluding zero interaction counts. The interaction frequency of an operon pair belonging to the same regulon was significantly higher than that of a random pair for both the L and S phases (Additional file 1: Figure S2a). Furthermore, the remote operon pairs, whose sequences were separated by at least 100 operons, were also compared with random samples (Additional file 1: Figure S2b). Notably, these remote operons still showed higher interaction frequencies than the randomly sampled operon pairs (with distance >100 operons) from the entire E. coli genome. This finding indicated that the DNA interaction-based genome architecture does contribute to the organization of operons into regulons. It also explains the frequent occurrence of large regulons composed of multiple operons that are far from each other on the sequence, thus confirming the suggested role of 3D chromosome packing in the global organization of operons. We found a similar phenomenon for genes in biological pathways. The interaction frequency between gene pairs in the same biological pathway was significantly higher than that of gene pairs obtained randomly from the genome for both phases (Additional file 1: Figures S2c, d). This phenomenon was observed not only for the overall gene pairs, but also for the remote gene pairs with a sequence separation of at least 100 genes. Taken together, the results suggested that not only operons in a regulon but also genes in a biological pathway are likely to co-localize in the 3D E. coli genome.
To examine the spatial features for E. coli genome organization quantitatively, the C value was defined based on the DNA interaction frequency to measure the organizational compactness of the 3D genome at two levels: the compactness of regulons in terms of the interaction between operon pairs, and the compactness of biological pathways in terms of the interaction between gene pairs. A lower C value indicated that the operons/genes are more spread out and less compact in the 3D space globally.
Implications for E. coli biology
The qualitative and quantitative results both indicated that the previously reported organization principle of E. coli genome on the linear sequence [5,6] could be extended to the 3D space. The non-random organization of the linear genome has several effects. For example, neighboring genes on the genome have higher probability of co-expression and their protein products tend to form protein–protein interactions (PPIs) [5,17-21]. Here, we investigated if these effects persist in the 3D space.
For bacteria, the processes of transcription, translation, and PPI formation cannot be entirely separated because the cells lack a nuclear membrane. Thus, the connections observed in this study among spatial DNA interactions, gene co-expression and protein interactions are partially interpretable in terms of cellular structure. These connections reflect the global genome organization features and the unity of transcription and its downstream processes for E. coli in the 3D space, which supports the notion of the transcription factory, a model proposed for all genomes.
In summary, starting from the GCC data for E. coli, the present analysis revealed certain spatial features of the E. coli genome organization: i) the operons/genes are not randomly distributed in the 3D space, but are constrained by regulons/bio-pathways to maximize spatial compactness; ii) the genes close to each other in the 3D space (even if far from each other on the genome sequence) exhibit trends of co-expression and formation of PPIs. These findings are helpful in elucidating the fundamental biology of bacteria, and support the concept of transcription factories.
Renormalization of the GCC data
The GCC sequencing data for E. coli MG1655 at the L (exponential sample, WT) and S (serine hydroxamate-treated sample, SHX) growth phases were obtained from the NCBI SRA database. Only the first 70 bp of the high-quality reads were mapped onto the E. coli reference genome (Accession: NC_000913) using bowtie2 with the default parameters. Unique matches with score > 30 were used for further analysis. The genome was then divided into 32,802 HhaI restriction fragments. The matched RESs in their 500-bp-long flanking sequences were removed as random breaks. The read pairs were further refined by setting a contact distance threshold (>800 bp) between the contact fragment pairs to remove self-ligation and non-ligation. The basic interaction information on the remaining DNA fragments is presented in Additional file 1: Table S1. To differentiate real contacts from background noise, the FDR was controlled. With the FDR controlled at < 0.1, fragment pairs with at least two contacts were considered non-random and were used for the analysis (see Additional file 2).
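The two filters described above (a contact distance above 800 bp and at least two contacts per fragment pair) can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the tuple layout of `read_pairs`, the function names, the toy genome length and the use of circular distance on the chromosome are all assumptions.

```python
from collections import Counter

MIN_DISTANCE = 800  # bp; contact distance threshold given in the text
MIN_CONTACTS = 2    # minimum contacts per fragment pair given in the text


def circular_distance(a, b, genome_length):
    """Shortest distance between two positions on a circular chromosome
    (treating the distance as circular is an assumption)."""
    d = abs(a - b)
    return min(d, genome_length - d)


def filter_contacts(read_pairs, genome_length):
    """Count contacts per fragment pair and keep the non-random ones.

    read_pairs: iterable of (frag_i, pos_i, frag_j, pos_j) tuples
    (hypothetical layout: fragment id and mapped position of each mate).
    """
    counts = Counter()
    for frag_i, pos_i, frag_j, pos_j in read_pairs:
        # Drop short-range pairs (self-ligation / non-ligation products).
        if circular_distance(pos_i, pos_j, genome_length) <= MIN_DISTANCE:
            continue
        counts[tuple(sorted((frag_i, frag_j)))] += 1
    # Keep fragment pairs supported by at least MIN_CONTACTS read pairs.
    return {pair: n for pair, n in counts.items() if n >= MIN_CONTACTS}


# Toy data on a hypothetical 1-Mb circular genome.
toy_pairs = [
    (1, 1_000, 50, 90_000),
    (50, 90_100, 1, 1_200),
    (2, 3_000, 3, 3_500),     # only 500 bp apart: removed
    (4, 7_000, 90, 200_000),  # single contact: removed
]
print(filter_contacts(toy_pairs, genome_length=1_000_000))
```

The minimum contact number is hard-coded here because the text reports the value that resulted from FDR control; the FDR estimation itself is described in Additional file 2 and is not reproduced.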
Considering the size of the operons and genes, the genome was divided into 10-kb bins, and the interaction frequencies for restriction fragments were assigned to the corresponding bins. Here, $f_{ij}$ denotes the interaction frequency between bin $i$ and bin $j$. For each bin, the interaction score is defined as the sum of the interaction frequencies in that bin to reflect the interaction potential. We observed a significant, positive correlation between the interaction score and the number of HhaI RESs for the GCC dataset (Additional file 1: Figure S1). Therefore, we normalized the interaction frequencies by dividing by the number of HhaI RESs in each bin to remove this bias, following the method of a previous report. The interaction matrix after normalization is presented in Additional file 4.
The peaks in the genomic interaction profile were identified using a previously published algorithm. In the algorithm, the read distribution along the genome is modeled by a Poisson distribution in which the parameter λ captures both the mean and the variance of the distribution. Across the genome, we searched for candidate peaks with a significant tag enrichment (Poisson distribution P-value based on λ, $P = 10^{-3}$ in this work).
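The binning, interaction score and RES normalization described above might be sketched as below. Several details are assumptions made for illustration and are not stated in the text: fragments are assigned to bins by their midpoint, and each pairwise frequency is divided by the product of the RES counts of the two bins. All function and variable names are hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 10_000  # 10-kb bins, as stated in the text


def bin_contacts(fragment_contacts, fragment_midpoints, bin_size=BIN_SIZE):
    """Sum fragment-level contact counts into bin-level frequencies f_ij.

    Assigning a fragment to the bin containing its midpoint is an assumption.
    """
    f = defaultdict(float)
    for (a, b), n in fragment_contacts.items():
        i = fragment_midpoints[a] // bin_size
        j = fragment_midpoints[b] // bin_size
        f[(min(i, j), max(i, j))] += n
    return dict(f)


def interaction_score(f, k):
    """Sum of all interaction frequencies that involve bin k."""
    return sum(v for (i, j), v in f.items() if k in (i, j))


def normalize_by_res(f, res_per_bin):
    """Divide f_ij by the RES counts of bins i and j (assumed form of the
    normalization); pairs involving a bin without RESs are skipped."""
    out = {}
    for (i, j), v in f.items():
        denom = res_per_bin.get(i, 0) * res_per_bin.get(j, 0)
        if denom:
            out[(i, j)] = v / denom
    return out


# Toy input (hypothetical fragment ids, midpoints and RES counts).
contacts = {(1, 50): 2, (2, 51): 3, (1, 7): 4}
midpoints = {1: 1_500, 2: 4_000, 7: 25_000, 50: 90_500, 51: 93_000}
f = bin_contacts(contacts, midpoints)
print(f, interaction_score(f, 0))
print(normalize_by_res(f, {0: 10, 2: 5, 9: 20}))
```

If the original normalization divides by a different function of the RES counts, only `normalize_by_res` would need to change.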
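The Poisson enrichment test can be illustrated with a short sketch. It assumes a single genome-wide λ equal to the mean count per position, which is a simplification; the published peak-calling algorithm cited above may estimate λ differently. The function names are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Upper-tail probability P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for x in range(1, k):
        term *= lam / x    # P(X = x) from P(X = x - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(counts, p_threshold=1e-3):
    """Indices whose count is significantly enriched over the background.

    lam is taken as the mean of `counts` (an assumption for this sketch).
    """
    lam = sum(counts) / len(counts)
    return [i for i, c in enumerate(counts) if poisson_sf(c, lam) < p_threshold]


# Toy interaction profile with one obvious peak at index 4.
profile = [2, 3, 1, 2, 20, 2, 3, 1, 2, 2]
print(call_peaks(profile))
```

Because the tail probability is computed as one minus a cumulative sum, it loses precision for extremely small P-values, which does not matter at the $10^{-3}$ threshold used here.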
Derivation and handling of pathway and regulon data
The genome sequence and 4,639 annotated genes for E. coli were obtained from NCBI RefSeq. The 319 biological pathways of E. coli that involve at least two genes were obtained from the EcoCyc database. A total of 2,647 operons and 193 regulons were obtained from the RegulonDB database, and the 146 regulons containing at least two operons were used in our analysis.
For each regulon, the interaction frequencies between operon pairs within it were calculated (excluding zero interaction counts). The random background was estimated by randomly sampling operon pairs from the genome, keeping the number of intervening operons the same as for the real interacting operon pairs. Using the Wilcoxon rank sum test, the significance of the deviation of the real interactions from the random background was estimated and is shown in Additional file 1: Figure S2. Similarly, the remote operon pairs with a sequence separation of at least 100 operons were also compared with the random background.
The genome was then randomly shuffled (1,000,000 times in total) at different degrees (percentage X = 10, 20, 30, …, 100), following a procedure similar to that previously reported, to determine whether the actual genome organization (in terms of interactions between operons/genes in regulons/pathways) in the 3D space is coordinated compared with random arrangements (Figure 4). The comparisons were performed for both the overall operon/gene pairs and the remote ones with a sequence separation of at least 100 operons/genes.
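The comparison against a distance-matched random background could look roughly like the sketch below. It samples a random operon pair with the same number of intervening operons as a real pair and compares two samples with a rank-sum test. The test is a plain normal approximation with average ranks for ties and no tie or continuity correction, so it stands in for whatever implementation was actually used; all names are hypothetical and operons are indexed by their order on a linearized genome.

```python
import math
import random


def sample_matched_pair(n_operons, separation, rng):
    """Random pair of operon indices (a, b) with b - a == separation."""
    a = rng.randrange(n_operons - separation)
    return a, a + separation


def rank_sum_z(x, y):
    """Wilcoxon rank-sum z statistic for sample x versus sample y."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd


def two_sided_p(z):
    """Two-sided P-value under the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


# Toy interaction frequencies (hypothetical values).
regulon_pairs = [5, 6, 7, 8]
random_pairs = [1, 2, 3, 4]
z = rank_sum_z(regulon_pairs, random_pairs)
print(z, two_sided_p(z))
print(sample_matched_pair(2_647, 100, random.Random(0)))
```

For small samples an exact test would be preferable to the normal approximation; the approximation is used here only to keep the example dependency-free.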
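One way to shuffle a genome "at degree X" is to pick X% of the positions at random and permute the elements at those positions among themselves, leaving the rest in place. The sketch below implements that reading; it is an assumption, since the text only refers to a previously reported procedure, and the function name is hypothetical.

```python
import random


def partial_shuffle(order, percent, rng):
    """Return a copy of `order` in which `percent`% of the positions,
    chosen at random, have their elements permuted among themselves."""
    order = list(order)
    k = round(len(order) * percent / 100)
    positions = rng.sample(range(len(order)), k)
    values = [order[p] for p in positions]
    rng.shuffle(values)
    for p, v in zip(positions, values):
        order[p] = v
    return order


# Toy genome of 20 operons shuffled at X = 30%.
rng = random.Random(0)
genome = list(range(20))
shuffled = partial_shuffle(genome, 30, rng)
print(shuffled)
```

Repeating this for each X and recomputing the compactness measure on every shuffled genome would give the random distributions against which the real arrangement is compared.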
Derivation and handling of gene co-expression data
The gene expression data for E. coli (E_coli_v4_Build_6; 466 experiments for 4,297 genes) were obtained from the M3D database, and Pearson correlation coefficients (PCCs) were used to measure gene co-expression. The interacting gene pairs that were separated on the genome sequence by at least 100 kb and had the top 10% highest interaction frequencies were used in the co-expression analysis. Two genes were regarded as co-expressed if the PCC between their expression data was above 0.5 [32,33]. Wilcoxon rank sum tests were used to compare the distribution of correlation coefficients (of co-expressed genes) between highly interacting gene pairs and randomly sampled gene pairs that were at least 100 kb from each other on the genome sequence (Figure 5).
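The co-expression criterion can be written down directly. The sketch below computes the Pearson correlation coefficient between two expression profiles and applies the 0.5 cutoff from the text; the function names and the toy profiles are hypothetical, and a constant profile is treated as having zero correlation, which is a convention chosen for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation undefined for a constant profile; 0 by convention here
    return sxy / math.sqrt(sxx * syy)


def is_coexpressed(x, y, threshold=0.5):
    """Apply the PCC > 0.5 criterion described in the text."""
    return pearson(x, y) > threshold


# Toy expression profiles across four experiments (hypothetical values).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
gene_c = [4.0, 3.0, 2.0, 1.0]
print(pearson(gene_a, gene_b), is_coexpressed(gene_a, gene_c))
```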
Derivation and handling of protein–protein interaction data
The protein interaction data for E. coli were downloaded from the DIP database (Release 2013.10.31). Of the 12,726 interacting protein pairs obtained from DIP, 8,691 have protein information in the UniProt database (Release 2013_11). After removing duplicates, 7,345 interacting protein pairs were obtained. Interactions between proteins whose genes are located less than 100 kb apart on the genome sequence were removed. Finally, 6,714 protein interactions were used in the analysis. The interacting gene pairs were sorted in ascending order of DNA-interaction frequency and then classified into four groups (corresponding to the four quartiles). Together with a “non-contact” group (interaction frequency = 0), five groups of gene pairs were thus used in the comparison of PPI frequency between their protein products. For the 6,714 analyzed protein interactions in E. coli, the fractions of these PPIs in the five groups of DNA-interacting gene pairs were calculated and plotted in Figure 6 (magnified 1 million times). The differences between the proportions of PPIs in the five groups were compared using Wilcoxon rank sum tests (Additional file 1: Table S3).
This work was financially supported by the National Basic Research Program of China (973 project, grant 2012CB721000), the National Natural Science Foundation of China (grant 31100602), the Natural Science Foundation of Hubei Province (grant 2013CFA016) and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry of China.
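The five-group comparison can be sketched as follows. Gene pairs with zero interaction frequency form the "non-contact" group, and the remaining pairs are sorted by frequency and split into four equal-sized groups. Reading "fraction" as the number of PPI pairs divided by the group size is an assumption, as are the function names and the toy data.

```python
def quartile_groups(pair_freq):
    """Return five groups of gene pairs: [non-contact, Q1, Q2, Q3, Q4],
    with Q1..Q4 in ascending order of interaction frequency."""
    non_contact = [p for p, f in pair_freq.items() if f == 0]
    contact = sorted(
        (p for p, f in pair_freq.items() if f > 0), key=lambda p: pair_freq[p]
    )
    n = len(contact)
    groups = [non_contact]
    for q in range(4):
        groups.append(contact[q * n // 4:(q + 1) * n // 4])
    return groups


def ppi_fraction(groups, ppi_pairs):
    """Fraction of gene pairs in each group whose products interact
    (assumed definition: PPI pairs in the group / group size)."""
    return [
        sum(p in ppi_pairs for p in g) / len(g) if g else 0.0 for g in groups
    ]


# Toy data: gene pairs as sorted tuples (hypothetical genes and values).
pair_freq = {
    ("a", "b"): 0, ("a", "c"): 0,
    ("a", "d"): 1, ("b", "c"): 2, ("b", "d"): 3, ("c", "d"): 4,
    ("a", "e"): 5, ("b", "e"): 6, ("c", "e"): 7, ("d", "e"): 8,
}
ppi_pairs = {("a", "e"), ("c", "e"), ("d", "e")}
groups = quartile_groups(pair_freq)
print(ppi_fraction(groups, ppi_pairs))
```

With real data the fractions are very small, which is presumably why the published figure magnifies them by a factor of one million.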
- Jacob F, Perrin D, Sánchez C, Monod J. Operon: a group of genes with the expression coordinated by an operator. CR Hebd Seances Acad Sci. 1960;250:1727–9.
- Lercher MJ, Hurst LD. Co-expressed yeast genes cluster over a long range but are not regularly spaced. J Mol Biol. 2006;359(3):825–31.
- Dossena S, Nofziger C, Bernardinelli E, Soyal S, Patsch W, Paulmichl M. Use of the operon structure of the C. elegans genome as a tool to identify functionally related proteins. Cell Physiol Biochem. 2013;32(7):41–56.
- Wolf YI, Rogozin IB, Kondrashov AS, Koonin EV. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res. 2001;11(3):356–72.
- Yin Y, Zhang H, Olman V, Xu Y. Genomic arrangement of bacterial operons is constrained by biological pathways encoded in the genome. Proc Natl Acad Sci USA. 2010;107(14):6310–5.
- Zhang H, Yin Y, Olman V, Xu Y. Genomic arrangement of regulons in bacterial genomes. PLoS One. 2012;7(1):e29496.
- Dekker J, Marti-Renom MA, Mirny LA. Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat Rev Genet. 2013;14(6):390–403.
- Hakim O, Sung M-H, Voss TC, Splinter E, John S, Sabo PJ, et al. Diverse gene reprogramming events occur in the same spatial clusters of distal regulatory elements. Genome Res. 2011;21(5):697–706.
- Krivega I, Dean A. Enhancer and promoter interactions—long distance calls. Curr Opin Genet Dev. 2012;22(2):79–85.
- Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485(7398):376–80.
- Sexton T, Yaffe E, Kenigsberg E, Bantignies F, Leblanc B, Hoichman M, et al. Three-dimensional folding and functional organization principles of the Drosophila genome. Cell. 2012;148(3):458–72.
- Le TB, Imakaev MV, Mirny LA, Laub MT. High-resolution mapping of the spatial organization of a bacterial chromosome. Science. 2013;342(6159):731–4.
- Cagliero C, Grand RS, Jones MB, Jin DJ, O’Sullivan JM. Genome conformation capture reveals that the Escherichia coli chromosome is organized by replication and transcription. Nucleic Acids Res. 2013;41(12):6058–71.
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.
- Bardet AF, He Q, Zeitlinger J, Stark A. A computational pipeline for comparative ChIP-seq analyses. Nat Protoc. 2011;7(1):45–61.
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):1–9.
- Rogozin IB, Makarova KS, Murvai J, Czabarka E, Wolf YI, Tatusov RL, et al. Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids Res. 2002;30(10):2212–23.
- Williams EJ, Bowles DJ. Coexpression of neighboring genes in the genome of Arabidopsis thaliana. Genome Res. 2004;14(6):1060–7.
- Bhardwaj N, Lu H. Correlation between gene expression profiles and protein–protein interactions within and across genomes. Bioinformatics. 2005;21(11):2730–8.
- Overbeek R, Fonstein M, D’souza M, Pusch GD, Maltsev N. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA. 1999;96(6):2896–901.
- Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J. Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci USA. 2000;97(12):6652–7.
- Cook PR. A model for all genomes: the role of transcription factories. J Mol Biol. 2010;395(1):1–10.
- Yaffe E, Tanay A. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat Genet. 2011;43(11):1059–65.
- Ay F, Bunnik EM, Varoquaux N, Bol SM, Prudhomme J, Vert J-P. Three-dimensional modeling of the P. falciparum genome during the erythrocytic cycle reveals a strong connection between genome architecture and gene expression. Genome Res. 2014;24(6):974–88.
- Rodley CD, Grand RS, Gehlen LR, Greyling G, Jones MB, O'Sullivan JM. Mitochondrial-nuclear DNA interactions contribute to the regulation of nuclear transcript levels as part of the inter-organelle communication system. PLoS One. 2012;7(1):e30943.
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491–8.
- Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007;448(7153):553–60.
- Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM, et al. The EcoCyc database. Nucleic Acids Res. 2002;30(1):56–8.
- Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muñiz-Rascado L, García-Sotelo JS, et al. RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res. 2013;41(D1):D203–13.
- Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, et al. Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008;36 suppl 1:D866–70.
- Korbel JO, Jensen LJ, Von Mering C, Bork P. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat Biotechnol. 2004;22(7):911–7.
- Krom N, Ramakrishna W. Conservation, rearrangement, and deletion of gene pairs during the evolution of four grass genomes. DNA Res. 2010;17(6):343–52.
- Wagner A. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc Natl Acad Sci USA. 2000;97(12):6579–84.
- Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D. DIP: the database of interacting proteins. Nucleic Acids Res. 2000;28(1):289–91.
- UniProt Consortium. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013;41(D1):D43–7.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.