Evolutionary implications of inversions that have caused intra-strand parity in DNA

Background: Chargaff's rule of DNA base composition, stating that DNA comprises equal amounts of adenine and thymine (%A = %T) and of guanine and cytosine (%C = %G), is well known because it was fundamental to the conception of the Watson-Crick model of DNA structure. His second parity rule, stating that the base proportions of double-stranded DNA are also reflected in single-stranded DNA (%A = %T, %C = %G), is more obscure, likely because its biological basis and significance are still unresolved. Within each strand, the symmetry of single nucleotide composition extends even further, being demonstrated in the balance of di-, tri-, and multi-nucleotides with their respective complementary oligonucleotides.
Results: Here, we propose that inversions are sufficient to account for the symmetry within each single-stranded DNA. Human mitochondrial DNA does not demonstrate such intra-strand parity, and we consider how its different functional drivers may relate to our theory. This concept is supported by the recent observation that inversions occur frequently.
Conclusion: Along with chromosomal duplications, inversions must have been shaping the architecture of genomes since the origin of life.


Background
The most famous of Chargaff's rules is that in DNA, the proportion of A equals that of T, and C that of G [1]. This nucleotide balance is governed by complementary basepairing rules fundamental to the structure of the double helix [2]. Astonishingly, the nucleotides retain almost the same equality balance in either of the two single strands of DNA [3] and this phenomenon is sometimes named Chargaff's second parity rule [4][5][6][7][8][9][10]. Table 1 provides an illustration, with analysis of large contiguous segments from each human chromosome.
When there is no bias in mutation and selection between complementary strands, base substitution may explain the parity phenomenon [11,12]. In fact, strand bias has been demonstrated in the form of mutational skews between the two strands, which cause deviation from parity [13,15]. Bacterial origins of replication were successfully identified by the distribution of such skews [16,17]. The strand bias of mutations, which can be associated with the direction of transcription, is also found in mammalian genomes [18,19]. In spite of these anomalies, any violation of the second parity rule is generally small in magnitude [8,20].
Although different explanations for this parity phenomenon have been put forth, such as intra-strand base pairing [6], a simpler explanation for the rule may be DNA duplication and inversion [4,8,10]. If double-stranded DNA of any composition undergoes duplication followed by an inversion of the duplicated region, then each strand of the resulting DNA molecule would precisely satisfy Chargaff's second parity rule, so that %A = %T and %C = %G (Fig. 1A).
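The duplication-plus-inversion argument can be checked with a short script. The following is a minimal sketch, not the analysis used in this paper: the function names and the skewed random test sequence are illustrative assumptions. It models the upper strand only, and treats inversion of a double-stranded segment as replacement of that segment by its reverse complement.

```python
import random
from collections import Counter

# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def duplicate_and_invert(seq: str) -> str:
    """Upper strand after duplicating `seq` and inverting the duplicated copy."""
    return seq + reverse_complement(seq)


# Hypothetical starting strand with a deliberately skewed composition.
rng = random.Random(0)
original = "".join(rng.choices("ACGT", weights=[5, 1, 3, 1], k=1000))

product = duplicate_and_invert(original)
before = Counter(original)
after = Counter(product)

print("before:", dict(before))
print("after: ", dict(after))
print("parity holds:", after["A"] == after["T"] and after["C"] == after["G"])
```

Every A in the original contributes a T to the inverted copy and vice versa, so the counts match exactly whatever the starting composition.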
Not only single nucleotides but also oligonucleotides up to 30 nucleotides (nt) in length can demonstrate the parity phenomenon within strands [5,7,8]. In other words, the frequency of a particular oligonucleotide is approximately equal to that of its reverse complementary sequence in the same strand. Since DNA strands are complementary, the frequency of a particular oligonucleotide in one strand approximates that in the opposite strand. Hence, this double-stranded DNA characteristic can also be called "symmetry of complementary DNA strands" [5,8]. Chargaff's second parity rule ordinarily considers only mononucleotides, which have been extensively studied. However, since a single nucleotide could be deemed a one-nt oligonucleotide, it is plausible that addressing the symmetry of oligonucleotides (high-order strand symmetry) is a more general way of assessing biological meaning. Hereafter, we designate this comprehensive symmetry as "intra-strand parity" and attempt to explain it based on the mechanism of chromosomal inversion. Single nucleotide mutations may be considered to explain mononucleotide parity within strands [11,12] but have not been effective in explaining the extended parity of oligonucleotides [8].
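To make the notion of oligonucleotide parity concrete, the sketch below counts overlapping k-mers in one strand and compares each count with that of its reverse complement. It is an illustration only: `parity_gap` is a hypothetical summary statistic introduced here, not a measure defined in this paper, and the toy sequences are assumptions.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers in a single strand."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def parity_gap(seq: str, k: int) -> float:
    """Hypothetical asymmetry score in [0, 1]; 0 means exact intra-strand parity.

    Sums |count(w) - count(revcomp(w))| over all observed words and their
    reverse complements. Each complementary pair is visited twice, hence
    the division by 2 * total.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    words = set(counts) | {reverse_complement(w) for w in counts}
    gap = sum(abs(counts[w] - counts[reverse_complement(w)]) for w in words)
    return gap / (2 * total)


skewed = "AAAAACCCCC"
balanced = skewed + reverse_complement(skewed)

print(parity_gap(skewed, 3))    # maximally asymmetric toy strand
print(parity_gap(balanced, 3))  # strand that equals its own reverse complement
```

The second sequence is its own reverse complement, so every k-mer count matches that of its complementary word for any k, including words that span the junction.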

Results
We propose that inversion events (with or without underlying duplications) might be a sufficient mechanism to explain the phenomenon. To test this, we consider a double-stranded DNA molecule that lacks intra-strand parity but is long enough to undergo various (stochastic) inversions (Fig. 1B). $A_n$ and $T_n$ are defined as the frequencies of any particular oligonucleotide sequence and of its reverse complementary sequence, respectively, in the same strand after $n$ inversions ($n > 0$). $A_0$ ($0 < A_0 < 1$) is the initial frequency of any particular oligonucleotide sequence (which can also be a mononucleotide) in the upper strand. $T_0$ ($0 < T_0 < 1$) is the initial frequency of its reverse complementary sequence in the same strand. If we define $r_n$ ($0 < r_n \ll 1$) as the relative length of the $n$th inversion (Fig. 1B), we obtain the following two equations.
$$A_n = A_{n-1} - r_n (A_{n-1} - T_{n-1}) \qquad (1)$$
$$T_n = T_{n-1} - r_n (T_{n-1} - A_{n-1}) \qquad (2)$$
Equations (1) and (2) mean that each inversion shifts $A_n$ toward $T_{n-1}$ and $T_n$ toward $A_{n-1}$. When the whole sequence is long enough, $r_n$ is close to 0. Nevertheless, whatever the size of the inverted region examined, any oligonucleotide sequence will eventually be homogenized between the two strands. In other words, $A_n$ and $T_n$ ultimately converge to be equal to each other, regardless of $r_n$, as long as $r_n$ is stochastic (see mathematical derivation in Methods).
$$\lim_{n \to \infty} A_n = \lim_{n \to \infty} T_n = \frac{A_0 + T_0}{2} \qquad (3)$$
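The behaviour of the two recurrences can be explored numerically. The sketch below iterates equations (1) and (2) with randomly drawn relative inversion lengths. The uniform distribution for $r_n$, its upper bound `max_r`, the starting frequencies, and the function name are all assumptions chosen for illustration; the text only requires that $r_n$ be small and stochastic.

```python
import random


def simulate_inversions(a0, t0, n_inversions, max_r=0.05, seed=0):
    """Iterate recurrences (1) and (2) and return the (A_n, T_n) trajectory.

    Each r_n is drawn uniformly from (0, max_r); this distribution is an
    illustrative assumption, not something specified in the paper.
    """
    rng = random.Random(seed)
    a, t = a0, t0
    history = [(a, t)]
    for _ in range(n_inversions):
        r = rng.uniform(0.0, max_r)
        # Simultaneous update: both right-hand sides use A_{n-1} and T_{n-1}.
        a, t = a - r * (a - t), t - r * (t - a)
        history.append((a, t))
    return history


history = simulate_inversions(a0=0.30, t0=0.10, n_inversions=500)
for n in (0, 10, 50, 100, 500):
    a_n, t_n = history[n]
    print(f"n={n:3d}  A_n={a_n:.4f}  T_n={t_n:.4f}  sum={a_n + t_n:.4f}")
```

Subtracting (2) from (1) gives $A_n - T_n = (1 - 2 r_n)(A_{n-1} - T_{n-1})$, so the gap shrinks by a factor of $(1 - 2 r_n)$ at every step while the sum stays fixed. This is why both values approach the mean of the starting frequencies in the simulation.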
Equation (3) is a mathematical explanation of intra-strand parity based on our hypothesis that inversions are sufficient to cause any DNA segment to conform to parity. In this way, the vast majority of naturally occurring DNA molecules (chromosomes) will evolve to intra-strand parity via many inversions. Those few that deviate, such as mitochondrial DNA (mtDNA) [8,9,17], will have special properties (see below). We presume that any DNA can be made to evolve to intra-strand parity through a process of inversions, and that deviations from parity have been rare in evolution. Inversions must have been occurring as genomes of ancestral organisms were growing in complexity with the acquisition or creation of new genes.
The insertion of repetitive sequences was proposed to be a possible source underlying parity [8,10]. However, removing apparent repeats from the human and other genomes prior to analysis (see Methods) did not alter the symmetry characteristics of the remaining sequences. (An example of a 28.6-Mb contig from human chromosome 21 is shown in Table 2). Therefore, it is unlikely that insertion of such sequences accounts for the intra-strand parity, either in humans or organisms that have fewer repetitive sequences in their genomes.
We employ radar charts to allow simple visual perception of the high-order symmetry and asymmetry of exemplary DNAs (Fig. 2). Mitochondria are thought to have been derived from bacteria [21]. Mammalian mtDNA (Fig. 2C) is an exception that does not demonstrate intra-strand parity [8,9,17] whereas mtDNAs from plants and lower eukaryotes do. Mammalian mtDNA may have gradually deviated from its ancestral form [9]. The small circular size, its unique replication mechanism [22], and extranuclear localization could introduce different selective pressures against tolerance of inversions and thus deviation from the more general observation of intra-strand parity.
The mammalian mtDNA offers a natural source of sequence sufficiently deviating from parity to allow us to further test our mathematical explanation. We produced in silico semi-random inversions in human mtDNA. As few as eight 1-kb regularly-distributed inversions (see Methods) would be sufficient to homogenize the two strands of the 16.6-kb mtDNA and create intra-strand parity (Fig. 2D). We also depict a hypothetical inversion in the mtDNA to show the potential for rapid homogenization (Fig. 1C).
Although the lack of intra-strand parity in mammalian mtDNA could be ascribed to its small length, other loci of comparable length (e.g. the TP53 gene, Fig. 2B) do adhere to parity. Unlike other mtDNAs, those of mammals have no intergenic segments and have only one regulatory region per strand. Moreover, unlike among nuclear genomes, the order and direction of genes, as well as the biased gene density between the two strands, are strictly conserved among mammalian species [23]. Therefore, it seems that the configuration is already fixed, and that inversions are not tolerated in mammalian mtDNA.
Figure 1. Inversions as an explanation for intra-strand parity. A, Duplication followed by inversion. If a double-stranded DNA, shown in gray, undergoes duplication and inversion, then the resulting molecule precisely demonstrates the strand parity (both within and between strands). B, A mathematical explanation of intra-strand parity. The $n$th inversion is illustrated by a box with crossed bars and $r_n$ is the relative length of the inversion within a total fragment of length 1. Ultimately both $A_n$ and $T_n$ converge to the average of their initial frequencies. See Methods for details. Although a linear double-stranded DNA is shown, this could also be circular. C, A small number of inversions can cause DNA to follow the intra-strand parity. A 40-bp double-stranded DNA fragment in the human mtDNA (position 1875-1914 in accession number NC_001807) is shown, along with the outcome of a single artificial inversion, which has homogenized the contents of the two strands.

Discussion
The ubiquity of inversions suggests that they had some advantage in natural selection. Duplications are thought to play an important role in creating genetic variety [24]; however, some duplications are deleterious for organisms because of sudden increases in gene dosage. To avoid being negatively selected, one of the duplicated copies could undergo a mutation such as a deletion. Inversions or interchromosomal rearrangements could render the duplicated gene nonfunctional by releasing it from interaction with its promoter or other regulatory elements. This may be one reason why many inverted and interchromosomal segmental duplications are found in the human genome [25,26]. An approximately symmetrical gene distribution between the two strands may have been brought about by these rearrangements [27].
In some cases, a rearranged genome might confer a selective advantage. Although we can find syntenic regions among vertebrates, chromosomal organizations can be quite different among species. This suggests an advantage for evolution or speciation. Recently, the importance of gene order and gene position in the three-dimensional nucleus has been suggested [28]. It is likely that genomes continually undergo rearrangement toward optimal positions for each gene and each gene cluster. Our group showed an unexpectedly large number of inversions (from 23 bp to 62 Mb in size) between human and chimpanzee genomes [29], species which diverged only six million years ago. Although most may be selectively neutral, some likely were selected for, and contributed to the speciation. Many more inversions may also have occurred and may have been negatively selected. Inversions can also give rise to new transcripts, some of which will be selected for and become new genes. We identified hybrid transcripts of the AZGP1 and GJE1 genes on human chromosome 7 (manuscript in preparation) and are intrigued that the orthologues of these genes in non-primate mammals reside in a head-to-head manner. It is likely that the common ancestor of primates underwent inversion of the AZGP1 gene to produce the hybrid transcripts, creating an opportunity for primate diversity.

Conclusion
In summary, we propose that the relatively frequent occurrence and accumulation of inversions in genomes may be a major contributor to the phenomenon of intra-strand parity. Whereas single base substitutions might explain Chargaff's second parity rule at the level of mononucleotides, they can explain neither the high-order intra-strand parity nor the exceptional deviation of mammalian mtDNAs. In contrast, inversion events are not limited by size and can involve millions of bases of sequence. Other mechanisms may have contributed to some extent; nevertheless, they are not necessary to account for intra-strand parity if inversions are considered.
Inversions are one process contributing to genome evolution that allows for rearrangement toward optimal position, order, and orientation of genes and regulatory elements, and for escape from deleterious effects caused, for example, by some duplications. Although we acknowledge the possibility of preferential sites, inversions are treated as occurring randomly in our mathematical explanation. Many of these are expected to be deleterious and would presumably be selected against, but others should be neutral or positively selected and could therefore become fixed in the genome [30]. Quantitative estimation of inversion using genomic sequences of extant organisms is unfortunately meaningless, as it cannot account for those events lost to natural selection. Further, inversions must have contributed to the basic character of DNA sequences since the origin of life. There are now substantial data supporting the frequency of inversions within genomes of a variety of organisms, including plants, insects and primates [29][30][31][32][33], and these observable events are but the tip of the iceberg. Chromosomal rearrangements such as inversions reduce the rate of meiotic recombination between homologous chromosomes, with subsequent reproductive isolation [34]. Moreover, in these regions, mutations tend to be positively selected to give rise to speciation [35]. Ohno's seminal work [24] and that of others have emphasized the importance of duplications in evolution. Our suppositions further these ideas, in particular suggesting how inversions and duplications can complement each other to yield the properties of extant genomes.
Figure 2. Intra-strand parity visually represented by radar charts. The high frequencies of poly-A and poly-T, which might be, in part, traces of retrotranspositions of poly-A+ mRNA, and the deficiencies of trinucleotides that contain the CpG dinucleotide make the stalk and four grooves, respectively, of the "maple leaf" shape. (The shapes vary slightly based on the genome sequence analyzed, but the general symmetry is maintained.) B, The genomic sequence of the p53 (TP53) locus (U94788, 20,303 bp). The symmetry is roughly retained in sequences as short as 20 kb in length. The protein-coding sequences occupy 5.8% of this locus. This chart also suggests that transcriptional asymmetry is small in magnitude. C, Human mtDNA. The asymmetry illustrates that this DNA does not show intra-strand parity. D, Human mtDNA after inversion in silico. It becomes symmetrical, demonstrating that inversions can change a sequence to create the parity. In this case, each $r_n$ is approximately 1/16.6. This also demonstrates that only $1/(2 r_{\mathrm{ave}})$ inversions (eight inversions in this case) are enough to make a sequence conform to parity. E, The difference of frequencies of GGG and CCC ([GGG] - [CCC]) in human mtDNA approaches 0 by in silico random inversions. In this analysis, for simplicity, the size of each inversion was fixed to 100 bp. In human mtDNA, GGG and CCC have the largest difference of frequencies among all trinucleotides (see Fig. 2C).

Calculation of frequencies of oligonucleotides
The genomic sequences (human contigs, the TP53 gene, and the mtDNA sequence) were downloaded from NCBI (Build 36). Calculation of frequencies of oligonucleotides (including mononucleotides) was performed using Perl scripts, which are available upon request. The "plus" strand, which is stored in the database, was analyzed. We generated sequence free of repetitive elements using RepeatMasker, which masked 46.4% of the 28,617,429 nucleotides. The coordinates of the eight 1-kb regularly-scattered in silico inversions were 1001-2000, 3001-4000, 5001-6000, 7001-8000, 9001-10000, 11001-12000, 13001-14000, and 15001-16000 in NC_001807.
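The in silico inversion step can be sketched in a few lines. The example below is not the Perl code used for this study, and it does not use the NC_001807 sequence: it applies the eight coordinate pairs listed above to a synthetic, compositionally skewed stand-in sequence of about 16.6 kb. The function names, the base weights, and the use of the mononucleotide difference G minus C as a readout are all assumptions made for illustration.

```python
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert(seq: str, start: int, end: int) -> str:
    """Invert the 1-based, inclusive interval [start, end] of the upper strand.

    Inverting a double-stranded segment replaces that stretch of the upper
    strand with its reverse complement.
    """
    segment = seq[start - 1:end]
    return seq[:start - 1] + segment.translate(COMPLEMENT)[::-1] + seq[end:]


def g_minus_c(seq: str) -> int:
    """Count of G minus count of C in one strand (0 under parity)."""
    counts = Counter(seq)
    return counts["G"] - counts["C"]


# The eight 1-kb intervals given in the text: 1001-2000, 3001-4000, ..., 15001-16000.
INVERSIONS = [(start, start + 999) for start in range(1001, 16000, 2000)]

# Synthetic stand-in sequence (NOT real mtDNA), skewed so that C outnumbers G.
rng = random.Random(1)
synthetic = "".join(rng.choices("ACGT", weights=[3, 3, 1, 2], k=16600))

inverted = synthetic
for start, end in INVERSIONS:
    inverted = invert(inverted, start, end)

print("G - C before:", g_minus_c(synthetic))
print("G - C after: ", g_minus_c(inverted))
```

Because the eight intervals cover roughly half of the sequence and do not overlap, the skew carried by the inverted half largely cancels the skew of the untouched half. The same loop could be run on a downloaded sequence and combined with a trinucleotide count to mirror the [GGG] - [CCC] readout.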

Mathematical derivation
For the frequency $A_n$ of a particular oligonucleotide ($n > 0$), the $n$th inversion leaves $(1 - r_n) A_{n-1}$ unchanged, removes $r_n A_{n-1}$, and adds $r_n T_{n-1}$, if we suppose that the contents are evenly distributed over the whole sequence. In this way, the two recurrence formulas (1) and (2) are derived (see text).
The following equations are obtained by adding equations (1) and (2).
$$A_n + T_n = A_{n-1} + T_{n-1} \qquad (4)$$
$$A_n + T_n = A_0 + T_0 \qquad (5)$$
These mean that inversions do not change the sum of the two frequencies. Using (5), other forms of (1) and (2) are derived.