Separate base usages of genes located on the leading and lagging strands in Chlamydia muridarum revealed by the Z curve method
© Guo and Yu; licensee BioMed Central Ltd. 2007
Received: 11 April 2007
Accepted: 10 October 2007
Published: 10 October 2007
The nucleotide compositional asymmetry between the leading and lagging strands in bacterial genomes has been the subject of intensive study in the past few years. Almost all bacterial genomes exhibit the same kind of base asymmetry. This work investigates the strand biases in the Chlamydia muridarum genome and shows the potential of the Z curve method for quantitatively differentiating genes on the leading and lagging strands.
The occurrence frequencies of bases in the protein-coding genes of the C. muridarum genome were analyzed by the Z curve method. Genes located on the two strands of replication were found to have distinct base usages. According to their positions in the 9-D space spanned by the variables u1 – u9 of the Z curve method, the K-means clustering algorithm assigns about 94% of the genes to the correct strands, which is a few percent higher than the proportion correctly classified by K-means based on relative synonymous codon usage (RSCU). The base usage and codon usage analyses show that genes on the leading strand have more G than C and more T than A, particularly at the third codon position. For genes on the lagging strand the biases are reversed. The y components of the Z curves for complete chromosome sequences show that the excesses of G over C and of T over A are more pronounced in the C. muridarum genome than in bacterial genomes that lack separate base and/or codon usages. Furthermore, the genomes of Borrelia burgdorferi, Treponema pallidum, Chlamydia muridarum and Chlamydia trachomatis, in which distinct base and/or codon usages have been observed, are phylogenetically closer to one another than to other bacterial genomes.
The nature of the strand biases of base composition in C. muridarum is similar to that in most other bacterial genomes. However, the base composition asymmetry between the leading and lagging strands in C. muridarum is more pronounced than in other bacteria. The remarkable strand biases of G/C and T/A are presumed to be responsible for the appearance of separate base or codon usages in C. muridarum. In addition, the close phylogenetic distance among the four bacterial genomes with separate base and/or codon usages appears to be inherent rather than accidental. The results also suggest that the Z curve method may be more sensitive than RSCU for quantitative analysis of DNA sequences.
The compositional asymmetry between the leading and lagging strands in bacterial genomes has been the subject of intensive study in the past few years [1–11]. Almost all bacterial genomes exhibit the same kind of asymmetry [3, 5, 8, 9], i.e., there is an excess of G relative to C in the leading strand and of C relative to G in the lagging strand, which is frequently accompanied by an excess of T over A in the leading strand [3, 8–11]. There is no relationship between base composition biases and genomic G+C content [1, 3, 8]. The excesses of G relative to C and of T relative to A are generally measured by the GC skew and the AT skew, which are given by (G-C)/(G+C) and (A-T)/(A+T), respectively. GC skew and AT skew have been used to map or relocate the replication origins of many bacteria, such as Mycoplasma genitalium, Treponema pallidum and Borrelia burgdorferi. Several plausible explanations have been proposed to account for the biases in base composition; they have been partly summarized in four recent papers [5, 7–9]. Among the theories aimed at explaining strand biases, the cytosine deamination theory has received the most attention [6, 9]. The deamination of cytosine leads to the formation of uracil. Under normal circumstances in vivo, cytosine is effectively protected against deamination by Watson-Crick base pairing. However, the rate of cytosine deamination increases 140-fold when the DNA is single-stranded [15, 16]. If the resulting uracil is not replaced with cytosine, a C to T mutation occurs. During replication, the leading strand is more exposed in the single-stranded state. Therefore, C to T mutations occur more frequently in the leading strand than in the lagging strand, producing excesses of G over C and of T over A in the leading strand, and of C over G and of A over T in the lagging strand.
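The two skew formulas above are simple enough to check by hand. The following is a minimal sketch, not code from the study: the function names `skew`, `gc_skew` and `at_skew` and the example sequence are made up for illustration.

```python
def skew(seq, plus, minus):
    """Return (plus - minus) / (plus + minus) for two bases counted in seq."""
    seq = seq.upper()
    p, m = seq.count(plus), seq.count(minus)
    # A sequence containing neither base has no defined skew; report 0.0.
    return (p - m) / (p + m) if (p + m) else 0.0


def gc_skew(seq):
    """GC skew, (G - C) / (G + C)."""
    return skew(seq, "G", "C")


def at_skew(seq):
    """AT skew, (A - T) / (A + T)."""
    return skew(seq, "A", "T")


# Made-up sequence with 4 G, 1 C, 1 A and 4 T.
example = "GGGTTTCAGT"
print(gc_skew(example), at_skew(example))
```

A leading-strand-like sequence, with more G than C and more T than A, gives a positive GC skew and a negative AT skew under these definitions.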
The single base biases may propagate into higher-order biases in a correlated way, thereby changing the relative frequencies of codons and even amino acids of genes and encoded proteins in each of the replicating strands. Salzberg et al. observed oligomer skews in about a dozen bacterial genomes. Rocha et al. observed compositional asymmetry between the genes located on the leading and lagging strands at the level of codons and amino acids. Mcinerney showed that genes in B. burgdorferi have two significantly different codon usages, depending on whether the gene is transcribed on the leading or lagging strand of replication. Strand-specific codon usage bias was not a new observation, but this was the first time that the codon usages of the genes on the two strands of replication were shown to be separate. Frank and Lobry suggested that the separate codon usages observed in B. burgdorferi may be an exceptional case. Interestingly, separate codon usages of the genes transcribed on the two strands were also observed in T. pallidum and C. trachomatis, respectively [20, 21].
The complete genome sequence of Chlamydia muridarum has been reported. In this paper, we use the Z curve method and correspondence analysis (CA) to show that C. muridarum genes have two separate base usages (or nucleotide compositions at the three codon positions), depending on whether the gene is transcribed on the leading or lagging strand. This differs from [19–21], where CA was carried out on relative synonymous codon usage (RSCU) values. There is also a significant difference between the codon usages of the genes encoded on the two strands. The difference between the y components of the Z curves for the C. muridarum and E. coli chromosome sequences may help explain the appearance of the separate base usages in C. muridarum.
Results and discussion
(1) Strand-specific base usage biases revealed using CA of u1 – u9
Base usage for genes located on the leading and lagging strands in the C. muridarum genome (table columns: 1st, 2nd and 3rd codon positions)
(2) Using K-means clustering of u1 – u9 to quantitatively differentiate genes on the two strands of replication
Results of K-means clustering based on the variables u1 – u9 defined in equation (3) (table columns: No. of genes, Clustered correctly; rows: B. burgdorferi, T. pallidum, C. trachomatis, C. muridarum)
(3) Codon usage bias
Codon usage for genes located on the leading and lagging strands in the C. muridarum genome
If K-means clustering is applied to the RSCU values, 842 C. muridarum genes (842/909 = 92.6%) are assigned to the correct strands. This ratio is slightly lower than that obtained with the variables u1 – u9 of the Z curve method. Similar results are observed for the B. burgdorferi, T. pallidum and C. trachomatis genomes. This may indicate that the Z curve method is more sensitive than RSCU for quantitative analysis of DNA sequences.
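For readers who want to see what the RSCU variables look like, here is a minimal sketch. It assumes the usual definition of RSCU (observed codon count divided by the count expected if all synonymous codons of that amino acid were used equally) and the standard genetic code. The name `rscu` and the choice to report 0.0 for amino acids absent from a gene are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, with codons enumerated in TCAG order.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codon families, excluding stop codons, Met (ATG) and Trp (TGG).
FAMILIES = {}
for codon, aa in CODE.items():
    if aa not in "*MW":
        FAMILIES.setdefault(aa, []).append(codon)


def rscu(cds):
    """Map each of the 59 remaining codons to its RSCU value in one gene."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    values = {}
    for aa, codons in FAMILIES.items():
        total = sum(counts[c] for c in codons)
        for c in codons:
            # Observed count over the count expected under equal usage.
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values


# Toy gene fragment: alanine encoded twice by GCT and once by GCC.
toy = rscu("GCTGCTGCC")
print(len(toy), round(toy["GCT"], 3), round(toy["GCC"], 3))
```

Each gene then becomes one point in a 59-dimensional space, which is the input to the RSCU-based clustering discussed above.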
(4) Why do separate base usages (or codon usages) appear in some genomes but not in others?
To our knowledge, separate base or codon usages of the genes located on the leading and lagging strands have been observed in four bacterial genomes: B. burgdorferi, T. pallidum, C. trachomatis and C. muridarum. In this section, we do not focus on the underlying mechanism of the strand mutation biases that lead to the compositional asymmetry. We only aim to give a possible clue as to why separate codon or base usages appear in these genomes and not in others. The nature of the strand biases at the level of base composition is the same in almost all bacterial genomes, but separate base or codon usages have been reported for only a few of them. These four species have very different genomic G+C contents: 28.6% (Borrelia burgdorferi), 40.3% (Chlamydia muridarum), 41.3% (Chlamydia trachomatis) and 52.8% (Treponema pallidum). The common features of these four genomes are discussed in the following.
Furthermore, from the phylogenetic point of view, the four genomes are closer to one another than to genomes of other classes (or orders). Borrelia burgdorferi and Treponema pallidum are spirochaetes, whereas Chlamydia muridarum and Chlamydia trachomatis belong to the genus Chlamydia. According to Wolf et al. and Qi et al., spirochaetes and chlamydiae are grouped together in the evolutionary tree. We are inclined to believe that the close phylogenetic distance among the four genomes is inherent rather than accidental.
Protein-coding genes in the C. muridarum genome have two separate and significantly different base (and codon) usages, depending on whether the gene is transcribed on the leading or lagging strand of replication. According to their positions in the 9-D space spanned by the variables u1 – u9 of the Z curve method, the K-means clustering algorithm classifies about 94% of the genes into the correct strands, which is a few percent higher than the proportion correctly classified by K-means on RSCU. The remarkable strand biases of G/C and T/A are presumed to be responsible for the appearance of separate base or codon usages in C. muridarum. Furthermore, B. burgdorferi, T. pallidum, C. muridarum and C. trachomatis, the genomes in which separate base and/or codon usages have been observed, are phylogenetically closer to one another than to genomes of other classes (or orders).
The C. muridarum genome sequence and its annotation were downloaded from the GenBank ftp site. The annotation lists 909 protein-coding genes. GC skew computed with a non-overlapping sliding window of 1000 bp was used to determine the origin and terminus of replication, as described by Read et al. Consequently, the origin was assumed to lie before gene TC0001 and the terminus between genes TC0438 and TC0439.
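The windowed GC skew calculation can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of Read et al.: the helper names `windowed_gc_skew` and `sign_changes` are hypothetical, the toy sequence is made up, and taking a sign change of the skew as a candidate origin or terminus is the common heuristic, assumed here.

```python
def windowed_gc_skew(seq, window=1000):
    """GC skew (G-C)/(G+C) in consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


def sign_changes(skews):
    """Indices i where the skew sign flips between window i-1 and window i."""
    return [i for i in range(1, len(skews)) if skews[i - 1] * skews[i] < 0]


# Toy chromosome: a G-rich half followed by a C-rich half (4000 bp).
toy = "GGGCA" * 400 + "CCCGT" * 400
skews = windowed_gc_skew(toy, window=1000)
print(skews, sign_changes(skews))
```

On a real circular chromosome two such switches are expected, one near the origin and one near the terminus. Windows whose skew is exactly zero are not flagged by this simple test.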
The Z curve
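The Z curve represents a DNA sequence of length N as a series of nodes P_n = (x_n, y_n, z_n), n = 0, 1, 2, ..., N, in three-dimensional space. The display below is a reconstruction in the standard form of the Z-transform, assuming that A_n, C_n, G_n and T_n denote the cumulative numbers of the bases A, C, G and T in the first n positions of the sequence:

```latex
\begin{aligned}
x_n &= (A_n + G_n) - (C_n + T_n) \equiv R_n - Y_n,\\
y_n &= (A_n + C_n) - (G_n + T_n) \equiv M_n - K_n,\\
z_n &= (A_n + T_n) - (C_n + G_n) \equiv W_n - S_n,
\end{aligned}
\qquad n = 0, 1, 2, \ldots, N,
```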
where A0 = C0 = G0 = T0 = 0 and hence x0 = y0 = z0 = 0. Here R, Y, M, K, W and S represent puRine, pYrimidine, aMino, Keto, Weak hydrogen bond and Strong hydrogen bond bases, respectively, according to the 1984 Recommendations of the NC-IUB. Connecting the nodes P0 (P0 = 0), P1, P2, ..., PN sequentially by straight lines yields the Z curve of the DNA sequence under study.
The Z curve offers an intuitive and convenient approach to studying DNA sequences. By viewing the Z curve, overall and local features of a sequence can be recognized directly. Furthermore, the phase-specific Z curve derived from the Z curve can be used for studying the nucleotide compositions of DNA fragments [30, 31] or for distinguishing coding regions from non-coding ones [32–34].
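A short sketch may make the construction of the nodes concrete. It assumes the standard Z-transform, in which each base moves the curve by one unit along each axis: x separates purines (A, G) from pyrimidines (C, T), y separates amino (A, C) from keto (G, T) bases, and z separates weak (A, T) from strong (G, C) hydrogen-bond bases. The function name `z_curve` and the handling of ambiguous bases are illustrative choices.

```python
def z_curve(seq):
    """Return the nodes P0..PN of the Z curve as (x, y, z) tuples."""
    # Contribution of each base to (x, y, z) under the standard transform.
    step = {"A": (1, 1, 1), "C": (-1, 1, -1),
            "G": (1, -1, -1), "T": (-1, -1, 1)}
    x = y = z = 0
    nodes = [(0, 0, 0)]  # P0 is the origin
    for base in seq.upper():
        dx, dy, dz = step.get(base, (0, 0, 0))  # ambiguous bases add nothing
        x, y, z = x + dx, y + dy, z + dz
        nodes.append((x, y, z))
    return nodes


# A stretch rich in G and T drives the y component downward.
print(z_curve("GGTT")[-1])  # (0, -4, 0)
```

With this sign convention, an excess of G over C together with an excess of T over A appears as a steadily decreasing y component, which is why the y component is the one examined for strand bias.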
The phase-specific Z curve
It is obvious that the nine variables u1 – u9 represent the base usage for a gene.
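As a sketch of how the nine variables can be computed for one gene, the code below assumes the usual phase-specific form: for each codon position i = 1, 2, 3, with base frequencies a_i, c_i, g_i and t_i, x_i = (a_i + g_i) - (c_i + t_i), y_i = (a_i + c_i) - (g_i + t_i) and z_i = (a_i + t_i) - (g_i + c_i). The ordering u1 = x1, u2 = y1, u3 = z1, u4 = x2, ..., u9 = z3 is an assumption, as is the function name `phase_specific_u`.

```python
def phase_specific_u(cds):
    """Return [u1, ..., u9] for a coding sequence read in frame."""
    cds = cds.upper()
    u = []
    for phase in range(3):  # codon positions 1, 2, 3
        column = cds[phase::3]  # bases found at this codon position
        n = len(column) or 1    # avoid division by zero on empty input
        a, c, g, t = (column.count(b) / n for b in "ACGT")
        u += [(a + g) - (c + t),   # x_i: purine vs pyrimidine
              (a + c) - (g + t),   # y_i: amino vs keto
              (a + t) - (g + c)]   # z_i: weak vs strong hydrogen bonds
    return u


# Toy gene made only of GGT codons: G at positions 1 and 2, T at position 3.
print(phase_specific_u("GGT" * 4))
```

Each variable lies between -1 and 1, so every gene maps to one point in a bounded 9-D space.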
Correspondence analysis (CA) is a classical technique for reducing the dimensionality of a dataset by transforming it to a new set of variables (the principal components) that summarize the features of the data. The new variables are linear combinations of the original variables. The first principal axis is chosen to maximize the standard deviation of the derived variable; the second principal axis is the direction that maximizes the standard deviation among directions uncorrelated with the first, and so forth. For details about this method, refer to .
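The choice of axes described above can be written compactly. As a sketch, assume X is the centered data matrix with one row per gene and one column per variable, and w_k is the unit vector defining the k-th principal axis; these symbols are introduced here only for illustration:

```latex
w_1 = \arg\max_{\|w\| = 1} \operatorname{Var}(Xw), \qquad
w_k = \arg\max_{\substack{\|w\| = 1 \\ \operatorname{Cov}(Xw,\, Xw_j) = 0,\ j < k}} \operatorname{Var}(Xw), \quad k = 2, 3, \ldots
```

Maximizing the variance is equivalent to maximizing the standard deviation. Correspondence analysis proper applies the same idea after a chi-square scaling of the data table, a step this sketch omits.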
K-means Clustering method
The K-means clustering method is adopted to quantitatively differentiate the leading and lagging strand genes according to their positions in the 9-D space (spanned by the variables u1 – u9) and in the 59-D space (spanned by the RSCU values of codons, excluding the three stop codons and the codons encoding Met and Trp). K-means is a statistical method that partitions a data set into a given number K of classes based on the similarity of the elements. The idea is to find a clustering (or grouping) of the observations that minimizes the total within-cluster sum of squares. The algorithm sequentially processes each observation and reassigns it to another cluster if doing so decreases the total within-cluster sum of squares (referring to  for the details).
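The clustering step can be illustrated with a small sketch. Note that this is the simpler batch (Lloyd-style) variant of K-means, not the sequential reassignment scheme described above. The function names `kmeans` and `within_ss`, the explicit initial centers and the toy points are all assumptions made for illustration.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Batch (Lloyd-style) K-means; returns (labels, centers)."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center (Euclidean distance).
        new_labels = [min(range(len(centers)),
                          key=lambda k: math.dist(p, centers[k]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def within_ss(points, labels, centers):
    """Total within-cluster sum of squares, the quantity being minimized."""
    return sum(math.dist(p, centers[lab]) ** 2
               for p, lab in zip(points, labels))


# Toy data: two well-separated groups, with K = 2 as for the two strands.
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels, centers = kmeans(pts, centers=[pts[0], pts[3]])
print(labels, round(within_ss(pts, labels, centers), 3))
```

In the setting of this paper, each point would be a gene's vector of u1 – u9 (or of 59 RSCU values), and the resulting labels would be compared with the strand on which each gene is actually located.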
We thank the anonymous reviewers for valuable suggestions. We also thank Professor Chun-Ting Zhang for carefully reading the manuscript and giving his comments. Discussions with Dr. Mcinerney and Dr. Koonin via e-mail are gratefully acknowledged. The present study was supported by the Youth Research Foundation at UESTC (grant JX0643).
- Lobry JR: Asymmetric substitution patterns in the two DNA strands of bacteria. Mol Biol Evol. 1996, 13: 660-665.
- Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y: The complete genome sequence of Escherichia coli K-12. Science. 1997, 277: 1453-1474. 10.1126/science.277.5331.1453.
- Mclean M, Wolfe KH, Devine KM: Base composition skews, replication orientation and gene orientation in 12 prokaryotic genomes. J Mol Evol. 1998, 47: 691-696. 10.1007/PL00006428.
- Salzberg SL, Salzberg AJ, Kerlavage AR, Tomb J: Skewed oligomers and origins of replication. Gene. 1998, 217: 57-67. 10.1016/S0378-1119(98)00374-6.
- Mrazek J, Karlin S: Strand compositional asymmetry in bacterial and large viral genomes. Proc Natl Acad Sci USA. 1998, 95: 3720-3725. 10.1073/pnas.95.7.3720.
- Frank AC, Lobry JR: Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms. Gene. 1999, 238: 65-77. 10.1016/S0378-1119(99)00297-8.
- Karlin S: Bacterial DNA strand compositional asymmetry. Trends Microbiol. 1999, 7: 305-308. 10.1016/S0966-842X(99)01541-3.
- Tillier ER, Collins RA: The contributions of replication orientation, gene direction, and signal sequences to base-composition asymmetries in bacterial genomes. J Mol Evol. 2000, 50: 249-257.
- Rocha EP, Danchin A: Ongoing evolution of strand composition in bacterial genomes. Mol Biol Evol. 2001, 18: 1789-1799.
- Lobry JR, Sueoka N: Asymmetric directional mutation pressures in bacteria. Genome Biol. 2002, 3: RESEARCH0058. 10.1186/gb-2002-3-10-research0058.
- Rocha EP: Is there a role for replication fork asymmetry in the distribution of genes in bacterial genomes?. Trends Microbiol. 2002, 10: 393-395. 10.1016/S0966-842X(02)02420-4.
- Lobry JR: Origin of replication of Mycoplasma genitalium. Science. 1996, 272: 745-746. 10.1126/science.272.5262.745.
- Fraser CM, Norris SJ, Weinstock GM, White O, Sutton GG, Dodson R, Gwinn M, Hickey EK, Clayton R, Ketchum KA, Sodergren E, Hardham JM, McLeod MP, Salzberg S, Peterson J, Khalak H, Richardson D, Howell JK, Chidambaram M, Utterback T, McDonald L, Artiach P, Bowman C, Cotton MD, Fujii C, Garland S, Hatch B, Horst K, Roberts K, Sandusky M, Weidman J, Smith HO, Venter JC: Complete genome sequence of Treponema pallidum, the Syphilis Spirochete. Science. 1998, 281: 375-388. 10.1126/science.281.5375.375.
- Fraser CM, Casjens S, Huang WM, Sutton GG, Clayton R, Lathigra R, White O, Ketchum KA, Dodson R, Hickey EK, Gwinn M, Dougherty B, Tomb JF, Fleischmann RD, Richardson D, Peterson J, Kerlavage AR, Quackenbush J, Salzberg S, Hanson M, Vugt RV, Palmer N, Adams MD, Gocayne J, Weidman J, Utterback T, Watthey L, McDonald L, Artiach P, Bowman C, Garland S, Fujii C, Cotton MD, Horst K, Roberts K, Hatch B, Smith HO, Venter JC: Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature. 1997, 390: 580-586.
- Frederico LA, Kunkel TA, Shaw BR: A sensitive genetic assay for the detection of cytosine deamination: determination of rate constants and the activation energy. Biochemistry. 1990, 29: 2532-2537. 10.1021/bi00462a015.
- Beletskii A, Bhagwat AS: Transcription-induced mutations: increase in C to T mutations in the nontranscribed strand during transcription in Escherichia coli. Proc Natl Acad Sci USA. 1996, 93: 13919-13924. 10.1073/pnas.93.24.13919.
- Marians KJ: Prokaryotic DNA replication. Annu Rev Biochem. 1992, 61: 673-719. 10.1146/annurev.bi.61.070192.003325.
- Rocha EP, Danchin A, Viari A: Universal replication biases in bacteria. Mol Microbiol. 1999, 32: 11-16. 10.1046/j.1365-2958.1999.01334.x.
- Mcinerney JO: Replicational and transcriptional selection on codon usage in Borrelia burgdorferi. Proc Natl Acad Sci USA. 1998, 95: 10698-10703. 10.1073/pnas.95.18.10698.
- Lafay B, Lloyd AT, McLean MJ, Devine KM, Sharp PM, Wolfe KH: Proteome composition and codon usage in spirochaetes: species-specific and DNA strand-specific mutational biases. Nucleic Acids Res. 1999, 27: 1642-1649. 10.1093/nar/27.7.1642.
- Romero H, Zavala A, Musto H: Codon usage in Chlamydia trachomatis is the result of strand-specific mutational biases and a complex pattern of selective forces. Nucleic Acids Res. 2000, 28: 2084-2090. 10.1093/nar/28.10.2084.
- Read TD, Brunham RC, Shen C, Gill SR, Heidelberg JF, White O, Hickey EK, Peterson J, Utterback T, Berry K, Bass S, Linher K, Weidman J, Khouri H, Craven B, Bowman C, Dodson R, Gwinn M, Nelson W, DeBoy R, Kolonay J, McClarty G, Salzberg SL, Eisen J, Fraser CM: Genome sequence of Chlamydia trachomatis MoPn and Chlamydia pneumoniae AR39. Nucleic Acids Res. 2000, 28: 1397-1406. 10.1093/nar/28.6.1397.
- Trifonov EN: Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16 S rRNA nucleotide sequences. J Mol Biol. 1987, 194: 643-52. 10.1016/0022-2836(87)90241-5.
- Zhang R, Zhang CT: Single replication origin of the archaeon Methanosarcina mazei revealed by the Z curve method. Biochem Biophys Res Commun. 2002, 297: 396-400. 10.1016/S0006-291X(02)02214-3.
- Wolf YI, Rogozin IB, Grishin NV, Koonin EV: Genome trees and the tree of life. Trends Genet. 2002, 18: 472-479. 10.1016/S0168-9525(02)02744-0.
- Hao B, Qi J: Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance. J Bioinform Comput Biol. 2004, 2: 1-19. 10.1142/S0219720004000442.
- GenBank ftp site. [ftp://ftp.ncbi.nih.gov/genbank/]
- Zhang CT: A symmetrical theory of DNA sequences and its applications. J Theor Biol. 1997, 187: 297-306. 10.1006/jtbi.1997.0401.
- Cornish-Bowden A: Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. 1985, 13: 3021-3030. 10.1093/nar/13.9.3021.
- Zhang CT, Chou KC: A graphic approach to analyzing codon usage in 1562 Escherichia coli protein coding sequences. J Mol Biol. 1994, 238: 1-8. 10.1006/jmbi.1994.1263.
- Ou HY, Guo FB, Zhang CT: Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Lett. 2003, 540: 188-94. 10.1016/S0014-5793(03)00263-1.
- Guo FB, Ou HY, Zhang CT: ZCURVE: A new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Res. 2003, 31: 1780-1789. 10.1093/nar/gkg254.
- Guo FB, Wang J, Zhang CT: Gene recognition based on nucleotide distribution of ORFs in a hyper-thermophilic crenarchaeon, Aeropyrum pernix K1. DNA Res. 2004, 11: 361-370. 10.1093/dnares/11.6.361.
- Guo FB, Zhang CT: ZCURVE_V: a new self-training system for recognizing protein-coding genes in viral and phage genomes. BMC Bioinformatics. 2006, 7: 9. 10.1186/1471-2105-7-9.
- Perriere G, Thioulouse J: Use and misuse of correspondence analysis in codon usage studies. Nucleic Acids Res. 2002, 30: 4548-55. 10.1093/nar/gkf565.
- Dillon WR, Goldstein M: Multivariate analysis, method and application. Wiley, New York, USA. 1984.
- Hartigan JA, Wong MA: A k-means clustering algorithm. Applied Statistics. 1979, 28: 100-108. 10.2307/2346830.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.