 Research article
 Open Access
 Published:
Inversion symmetry of DNA kmer counts: validity and deviations
BMC Genomics volume 17, Article number: 696 (2016)
Abstract
Background
The generalization of the second Chargaff rule states that counts of any string of nucleotides of length k on a single chromosomal strand equal the counts of its inverse (reversecomplement) kmer. This Inversion Symmetry (IS) holds for many species, both eukaryotes and prokaryotes, for ranges of k which may vary from 7 to 10 as chromosomal lengths vary from 2Mbp to 200 Mbp. The existence of IS has been demonstrated in the literature, and other pairwise candidate symmetries (e.g. reverse or complement) have been ruled out.
Results
Studying IS in the human genome, we find that IS holds up to k = 10. It holds for complete chromosomes, also after applying the low complexity mask. We introduce a numerical IS criterion, and define the klimit, KL, as the highest k for which this criterion is valid. We demonstrate that chromosomes of different species, as well as different human chromosomal sections, follow a universal logarithmic dependence of KL ~ 0.7 ln(L), where L is the length of the chromosome.
We introduce a statistical ISPoisson model that allows us to apply confidence measures to our numerical findings. We find good agreement for large k, where the variance of the Poisson distribution determines the outcome of the analysis. This model predicts the observed logarithmic increase of KL with length. The model allows us to conclude that for low k, e.g. k = 1 where IS becomes the 2^{nd} Chargaff rule, IS violation, although extremely small, is significant. Studying this violation we come up with an unexpected observation for human chromosomes, finding a meaningful correlation with the excess of genes on particular strands.
Conclusions
Our ISPoisson model agrees well with genomic data, and accounts for the universal behavior of klimits. For low k we point out minute, yet significant, deviations from the model, including excess of counts of nucleotides T vs A and G vs C on positive strands of human chromosomes. Interestingly, this correlates with a significant (but small) excess of genes on the same positive strands.
Background
Erwin Chargaff has made, in 1950, the important observation that the numbers of nucleotides in DNA satisfy #A = #T and #G = #C [1, 2]. This statement, made on the basis of experimental observations with fairly large errors, played a crucial role in realizing that DNA has an underlying basepair grouping, as subsequently proposed by Crick and Watson [3] in their doublehelix structure.
The second Chargaff rule [4] states that the same sets of identities of nucleotide pairs hold for each long enough single DNA strand. This rule has been tested [5] for genome assemblies of many species, and found to be globally valid for eukaryotic chromosomes, as well as for bacterial and archaeal chromosomes. It fails for mitochondria, plasmids, singlestranded DNA viruses and RNA viruses.
The validity of the second Chargaff rule was unexpected. Obviously it should be regarded as a global rule, i.e. applicable to large sections of chromosomes. Nonetheless, not being derived from a compelling principle, such as the one underlying the first rule, it remains a mystery. This is even more so, when one studies extended versions of Chargaff’s second rule. Indeed, AlbrechtBuehler [6] observed that for triplet oligonucleotides, or 3mers, it remains true that their chromosomewide frequencies are almost equal to those of their reversecomplement 3mers. Prabhu [7] has shown that this symmetry holds up to 5mers in various species. This has been reviewed by Baldi and Brunak [8] who have argued that such symmetry rules have to be incorporated in Markov models of genomic sequences.
We refer to the symmetry between counts of kmers and their reverse complements as
Inversion Symmetry (IS) : the counts of a kmer of nucleotides on a chromosomal strand are almost equal to those of its inverse (reversecomplement) string.
Note that this implies that the number of times a string of nucleotides of length k is observed on a strand, when read from 5’ to 3’, is almost equal to the number of times it is observed on the other strand when the latter is read from its 5’ end to 3’ end.
Recent analyses of inversion symmetry include the following: Qi and Cuttichia [9] who have shown that inversion symmetry exists while reverse symmetry fails, i.e. kmers and their reverses do not appear with equal rates; Baisnee, Hampson and Baldi [10], who introduced a measure S1 to analyze inversion symmetry in a systematic fashion; Kong et al. [11], who established the validity of IS on 786 chromosomes of many species and showed that reverse or complement symmetry do not hold, and argued that IS may be due to segmental or wholegenome inverse duplications; Wang et al. [12] who argued that values of k for which kmer IS is valid increase with organismal complexity; and Afreixo et al. [13] who applied various criteria to demonstrate the statistical significance of IS up to k = 10. Studies of symmetries related to IS appear in [14, 15].
We introduce an IS measure which is different from S1 of [10], albeit the numerical results of both measures are correlated (see section 4 in Methods). Our measure is based on the ratio between differences of counts of inverse kmer pairs and their sum. We propose the criterion that if the average of this normalized measure (over all strings of length k) is less than about 0.1, IS will be regarded as a valid approximate symmetry. The average is taken over all M_{k} ≤ 4^{k} strings of length k which exist at least once on the chromosome. The value of k, for which the IS measure is closest to 0.1, is defined as the klimit (KL). This turns out to be KL = 10 on long human chromosomes (see Additional file 1), and KL = 7 or KL = 8 for bacteria.
Using this measure, one can readily demonstrate the existence of inversion symmetry, and the absence of analog symmetries between reverse pairs or complement pairs, as well as compare between different species. We will show that the klimit of inversion symmetry, KL, is logarithmically dependent on the length L of the chromosome, or of a chromosomal section on which it is measured. Moreover, this dependence is universal, i.e. it is valid for most species.
To analyze all these observations on a rigorous statistical basis, we introduce a Poisson model for the random occurrence of counts, regarding N(S) as a stochastic variable for any string S of length k. We define X(S,S^{*}) = N(S)N(S^{*})/(N(S) + N(S^{*})), which is a stochastic variable having positive values 0 ≤ X ≤ 1. In general S^{*} is some permutation of S over the set of all strings of length k. When S^{*} ≡ S^{inv}, i.e. where S^{*} is the inverse of S, IS implies that X < <1. For strict IS, X = 0. In practice we may observe small deviations when checking for its realization on a chromosome. The important question we address is whether these deviations mean that the IS rule is not valid, or that the data are consistent with IS yet the observed values of X reflect statistical fluctuations.
To answer this question we introduce the stochastic variable Z(S,S^{*}) = (N(S)N(S^{*}))/(N(S) + N(S^{*}))^{½}. The symmetry assumption means that N(S) = N(S*), i.e. these two stochastic variables have the same distribution. If we further assume that N(S) ~ Poisson then Z should be approximately distributed as a standard normal (see Methods).
We will demonstrate on genomic data that for inversion symmetry we empirically observe that Z ~ Standard Normal, but for other pairings of S and S* (e.g. reverse or complement) it is not. Continuing with the analysis of IS we show that

For small k, E_{k}[X] is extremely small, yet E_{k}[Z] = E_{k}[X/σ_{X}] > 2, where σ_{X} is the (theoretically estimated) standard deviation of X(S,S^{inv}). Therefore we conclude that there exists a systematic small breaking of IS, observed for k < 4 on human chromosomes.

For large k (k > 5 on human chromosomes) E_{k}[Z] = E_{k}[X/σ_{X}] < 1 hence, due to the large variance, one may state that the observed X values are consistent with IS. Moreover, the data are consistent with Z ~ Standard Normal for large k.

The empirical values of X(S,S^{inv}) for large k are of the order of magnitude of (N(S) + N(S^{inv}))^{½}.

The logarithmic variation of the klimit, KL, as function of chromosomal length L, is correctly predicted by our ISPoisson model.
We use the italicized notation N, X, Z, for the stochastic variables of our model, and employ N, X, Z for their empirical counts on chromosomes. KL is defined as the value of k for which E_{k}[X] is closest to 0.1.
Results
Inversion symmetry (Generalized 2^{nd} Chargaff Rule)
Let S and S^{*} be two strings of nucleotides of same length k, i.e. two kmers. Suppose they appear N(S) and N(S^{*}) times respectively on a particular chromosome. We denote by X(S,S^{*}) the normalized difference X(S,S^{*}) = N(S)N(S^{*})/(N(S) + N(S^{*})) where S is one of the M_{k} different kmers over the 4 nucleotides, which are being counted on the chromosome at least once, i.e. N(S) > 0 and/or N(S^{*}) > 0. If both N(S) = N(S^{*}) = 0, X(S,S^{*}) is defined to be 0. In general 0 ≤ X(S,S^{*}) ≤ 1.
We use E_{k}[X] = ∑_{S} X(S,S^{*})/M_{k}, where M_{k} is the number of different kmers encountered empirically, as a measure to demonstrate and quantify the studied symmetry. For low and moderate k, we find that M_{k} = 4^{k}, but for large kvalues, such as k > 10 in the human genome, many of the kmers may not be realized empirically, leading to lower M_{k}. In the following we will look at values of X(S,S^{*}) over various possible choices of string pairs S and S^{*}, and demonstrate that for inverse pairs they are distributed differently than for other types of kmer pairs.
Let us start by computing E_{k}[X] for inverse pairs (i.e. S and S^{*} ≡ S^{inv} are reversecomplements of each other) for different k, on various chromosomes of the human genome assembly HG38. Data were downloaded from the UCSC genome browser http://genome.uscs.edu. The values of E_{k}[X] for several human chromosomes are displayed as function of k in Fig. 1. Inversion Symmetry (IS) is seen to hold quite well for kmers with large kvalues for all the displayed chromosomes. Chr Y, which is the shortest among the 24 chromosomes, has the least inversion symmetry. IS holds also for all other chromosomes (Additional file 1). It fails for the mitochondrial chromosome, which is a wellknown exception to the 2^{nd} Chargaff rule.
Repetitive structures are wellknown to constitute major fractions of eukaryotic chromosomes, hence one may wonder to what extent they are responsible for the observed inversion symmetry. To resolve this issue, we employed the same operations on the masked output of the UCSC genome browser, after filtering chromosomes for interspersed repeats and low complexity sequences. The results keep displaying the same behavior, with negligible differences for high values of k. Even chrY, which is well known for containing numerous repeats, with only 36 % of it surviving the masking filter, keeps showing the same qualitative behavior as in Fig. 1. In Additional file 1 we provide a list of the highest kvalues for which E_{k}[X] ≈ 0.1, both before and after masking (which removes repetitive and lowcomplexity stretches of the chromosome). We define the klimit (KL) of IS, as the value of k for which E_{k}[X] is closest to 0.1. The observed reduction in KL from 10 to 9 for the largest chromosomes, is due to the fact that masking shortens the effective chromosome length. The dependence of KL on length is an issue to which we will return below.
We have performed the same analysis on the older genome assembly HG18, leading to very similar results (see Additional file 2). We find similar IS results for mouse, frog, fly, worm, and yeast. Moreover, we find that inversion symmetry holds also for bacteria, but it is valid for a lower range of kmers, only up to KL = 6 or 7.
Outstanding features of inverse kmer pairs
In order to demonstrate how Inversion Symmetry, observed for frequencies of inverse pairs, differs from other natural pairings, we compare different choices of pairings of kmers,

a.
Inverse pairs (e.g. CGA vs TCG)

b.
Random pairs

c.
Reverse pairs (e.g. CGA vs AGC)

d.
Complement pairs (e.g. CGA vs GCT)
We have evaluated histograms of X(S,S^{*}) = N(S)N(S^{*})/(N(S) + N(S^{*})) for all pairings, and computed their averages E_{k}[X] = 4^{k}∑_{S} X(S,S^{*}) for different k. Calculations were performed both for human chromosomes as well as for many other species.
Figure 2a depicts the distributions of X values for inverse pairs on human chr 1 of HG38, evaluated for k = 4 to 10. These distributions are very narrow, leading to very low E_{k}[X] values, consistent with the results displayed in Fig. 1. As k increases they widen, leading to increasing E_{k}[X] values, which will be discussed below and are quoted in Table 1. In Fig. 2b and c we plot the corresponding distributions for the cases of random pairs (b), where for each S a random choice of S^{*} is being made, without repetition, and reverse pairs (c) on chr 1. Distributions of complement pairs (d) are identical to those of reverse pairs and are therefore not displayed as an additional figure. Note that the distributions in 2b and 2c are completely different from 2a: they possess a rugged wavy behavior, stretching over the whole range of 0 < X < 1. Since kmer distributions on the human genome are known to be different for strings containg CG dimers [16], we studied the same problems removing all such kmers. It turns out that, for the resulting kmer strings, the second peak in (b) and (c) disappears. But, even then, cases b and c continue to be very different from case a, displaying long tail distributions. Such characteristic differences occur also for all other species that we have tested, and also for masked chromosomes in human.
We can further use these distributions to establish that a symmetry relation holds only for inverse pairs, leading to very low E_{k}[X] values, and not for any other pairing. Table 1 lists the values of μ_{k} = E_{k}[X] for the three cases a, b and c, making it quite evident that IS holds and other symmetries do not. We will not dwell on it further, since Kong et al. [11] have already established the validity of IS (albeit using different measures) on 786 chromosomes of various species, and showed that reverse or complement symmetries do not hold.
Statistical analysis of inversion symmetry
In the Methods section we point out that, for large enough counts, if the counts N(S) and N(S*) are drawn from the same distribution, then the variable Y = (N(S)  N(S*))/(N(S) + N(S*)) should have an approximately Gaussian distribution with mean 0 and standard deviation σ_{G.} Moreover, the distribution of X = Y will have an expectation value E_{k}[X] = 0.8 σ_{G} and standard deviation σ_{X} =0.6 σ_{G}. If the counts N(S) are drawn from a Poisson model, we expect for each pair to find σ_{G} = (N(S) + N(S*))^{½}. Hence Z = (N(S)  N(S*))/(N(S) + N(S*))^{½} should follow a standard normal distribution, i.e. a Gaussian with mean = 0 and variance = 1. Hence the ISPoisson model predicts E_{k}(Z) = 0.8 and σ_{k}(Z) = 0.6, when rounded up to first decimal point.
We have tested this model by evaluating results for inverse pairs of kmers on chr1 of HG38. The results are displayed in Table 2.
For low kvalues, where E_{k}[Z] = E_{k}[X/σ_{X}] ≥ 2, one may say that a mathematical hypothesis of strict IS is invalid, since the peak of the Zdistribution lies outside the allowed confidence interval. On the other hand, clearly for all k < 4, E_{k}[X] < <0.01. Although the violation of IS is very small numerically, it is still statistically significant.
For large k > 5 we see that the data tend toward the prediction of our ISPoisson model, approaching the limit of E_{k}(Z) ± σ_{k}(Z) = 0.8 ± 0.6. This means that, due to the large variance, arising from relatively small values of N(S) + N(S^{inv}), the mathematical IS hypothesis cannot be refuted. It also means that variance plays a dominant role leading to the observed values of X(S,S^{inv}) which are of the order of magnitude of (N(S) + N(S^{inv}))^{½}. This implies that we should be able to deduce the behavior of the klimit, which is indeed the case as will be shown below.
To get a visual confirmation of the Gaussian nature of the Zdistribution we plot in Fig. 3a the results for k = 8 on human chr 1. The ensemble of Zvalues contributed by all kmers makes up the Gaussian distribution which is displayed here. The variance calculated from these data is 1.27, quite close to the value 1 expected from a standard normal distibution. For comparison, we display in Fig. 3b the analogous distributon of reverse pairs, which has variance of 1600. Note the different scales and shapes, which reflect the large difference between inverse and reverse distributions. The complementpair distribution (not shown here) is essentially identical to the reverse one.
Considering the peak at Z = 0 displayed in Fig. 3b, it is important to note that there exists a subset of kmers which obey S = S^{rev}, i.e. they are palindroms. They will contribute to the peak at Z = 0, with small variations of palindroms contributing to the region around this peak. Nonetheless, their numbers are small compared to all 8mers: only 3224 out of 65536 8mers lie within Z < 1.65, which is where 90 % of a standard normal distribution are expected to reside. Hence the variance of the reversepair Zditribution is very large.
Inversion symmetry for chromosomal sections
We next test to what degree IS is valid within various sections of human chromosomes. In Additional file 3, we display a characteristic distribution of inverse pairs drawn from a section of length 10Mbp, and in Additional file 4 we show an analogous distribution for length of 1Mbp. The IS quality, as determined by our convention, deteriorates leading to lower klimits as the length of the section decreases, but it remains valid. The distributions in Additional file 4 are evidently noisier than their analogs in Additional file 3; however they are much narrower than those of the reverse and random pairs (not shown here).
To study systematically different sections of chromosomes, we evaluate the E_{k}[X] values of inverse, random and reverse pairs, on nonoverlapping windows of given lengths L. In general, inversepairs lead to smaller E_{k}[X] than the other pairing choices. To determine the klimit we impose the condition E_{k}[X] <0.1 on the average over all chromosomal sections. The example displayed in Additional file 5 is of chr1, which is being tested with windows of length L = 5Kbp for inversepairs of k = 2. Although the average value is 0.07, obeying our criterion for IS validity, it is quite obvious that on many 5 K windows the values are higher. The value KL = 2 is chosen as the klimit of IS validity in this case. Reducing the section length further down to L = 1Kbp, in Additional file 6, we find that IS fails even at order k = 1, i.e. the second Chargaff rule does not hold for such short sectors.
Similar evaluations for different chromosomes, on both HG18 and HG38 assemblies, lead in a consistent manner to the klimits of “human sections” displayed in Table 3, where they are compared with results obtained for various other species, both eukaryotes and prokaryotes. They all follow a logarithmic increase of KL as function of the length of the chromosomal section, as is quite evident from their display in Fig. 4.
The logarithmic increase is modelled well by our ISPoisson model. To prove it let us define N(S) = f(S) L /4^{k}, and let us assume that E[Z] reaches its asymptotic value 0.8. We may then rewrite
The expression in {} is the expectation value of (f(S))^{½} within the Zdistribution. Let us denote it by 0.8c_{k} since E_{k}(Z) = 0.8. It follows that E_{k}[X] < 0.1 means that 0.8 c_{k} (4^{k}/2L)^{½} < 0.1. If c_{k} is a slowly varying function of k then k < lnL/ln4 + const = 0.72lnL + const.
This result is evidently borne out by the experimental fit k < 0.73 ln(L) + const. in Fig. 4. Furthermore, studying human chr1 we find that the experimental averages of (f(S))^{½} (without weighting by Z) obtain the values 1.24, 1.45, 1.47 for k = 4, 6, 8 respectively. This verifies that c_{k} is indeed a slowly varying function.
An early observation of inversion symmetry measures increasing logarithmically with sequence size has been made by [10] for various DNA and RNA sequences (see Fig. 1 in [10]).
Modeling inversion symmetry
If IS holds exactly for some k = k_{0}, it will hold also for all k < k_{0}, since the latter are substrings of the former and, therefore, all the frequencies of the k inversepair substrings will be matched (since the frequencies of their k_{0} hosts are being matched). In Methods we show that this statement is also true when IS is approximately true, i.e. when E_{k} [X] < <1. In practice we find it to hold when we apply our criterion E_{k} [X] ≤0.1 (see Table 4). One may wonder to what extent the opposite may hold within, e.g., low order Markov models: will a Markov model, constructed such that it satisfies IS for some k induce IS at the level k + 1? The answer is negative. Even for low values of k, a Markov model based on a lower statistic cannot generate the higher statistic [8]. This issue has been discussed in [10] where the difference between the two has been termed “residual symmetry”.
The simplest random model is that of a uniform distribution, which is generated on the basis of the second Chargaff rule (i.e. #A = #T and different from #C = #G). Such a distribution will trivially account for low E_{k} [X] values for inverse pairs at large values of k, limited by the length of the model chromosome. However it will also give rise to very low values for reverse pairs at a similar range of k, because any comparison of kmers with one of their permutations will lead to similar E_{k}[X]. In other words, this random independent (but not IID) model satisfies additional symmetries that are not observed in genomic data. Therefore it is not a realistic model of inversion symmetry.
A plausible explanation of the observed IS can be based on the fact that genomes evolve through rearrangement processes. By comparing synteny blocks in human and mouse, Pevzner and Tesler [17] have argued that rearrangements occur on many scales in the genome, and intrachromosomal rearrangements are more frequent than interchromosomal ones. Rearrangements may be viewed as inversions of sections between two breakpoints on the chromosome, and they may even follow one another in a nested fashion. Their study [17] demonstrated that human and mouse chr X share 281 synteny blocks of size >1 Mb, and at least 245 rearrangements occurred since the divergence of the two species.
Building on this intuition, derived from comparative genomics, it seems reasonable to assume that a series of such rearrangements on different scales may lead to IS. This mechanism has already been suggested by [6], and has been studied by [18] and by [11]. We have tested it on a simple model, starting from the human mitochondrial chromosome, which does not satisfy the second Chargaff rule. Since the mitochondrial chromosome is only 16Kbp long, we first construct out of it an enlarged model chromosome with length L = 100Mbp, by concatenating random selections of subsequences of chr M. We then apply to it rearrangements at various scales. We found that 5000 rearrangements at scales of 100 K have led to good IS effects, but best results were obtained for 50,000 rearrangements, whose breakpoints were randomly chosen, and their section lengths befit a uniform random distribution of length < 10 K. These results exhibit a high degree of IS, as displayed in Additional file 7.
Inversion symmetry: validation and deviations
Figure 4 provides an experimental validation of our ISPoisson model, in so far that it predicts correctly the behavior of KL as function of ln(L).
To look further into it, we present in Table 4 a comparative analysis of chr1 of different species, in both its unmasked and masked formats, as well as the analysis of the E coli genome. Shown are E[Z] values, E[X] values, the fraction of unrealized pairs (where both N(S) = N(S^{inv}) = 0), and the relevant length of the studied chromosomes. Highlighted are all E[X] values which are closest to 0.1 (defining the relevant KL) for the different chromosomes. By comparing with the upper part of the table one realizes that for k = KL, E[Z] is indeed close to 0.8, hence the success of the KL formula. By comparing with the 3^{rd} part of the table, we see that for these values of k, only for a very small fraction of all possible kmers, both S or S^{inv} are not realized on the studied chromosomes.
Our criterion for approximate IS, E_{k}[X] ≤ 0.1, was introduced as an intuitive but somewhat arbitrary decision. From Table 4 we learn that this is where the Z distribution approximates very well the data. For larger k we observe that some of the kmers do not appear, and their fraction increases rapidly with k. Hence our criterion selects also the range where almost all kmer strings are being realized. This serves as a posterior justification of our IS criterion.
Table 4 carries also the message that, for small k, a strict validity of IS cannot be guaranteed. This may be interpreted by stating that the breaking of IS is small, but it is statistically significant. In particular, testing the 2^{nd} Chargaff rule, one finds a systematic deviation from N(T) = N(A) and N(G) = N(C) for all human chromosomes, as displayed in Table 5. For most human chromosomes, we find an excess of T over A and of G over C on the positive strand. Only chr 8 and chr 22 display opposite trends. As shown, these results are statistically significant, when compared with an assumption that the counts of complement nucleotides are derived from the same Poisson distribution. Moreover, the same is true for both bare and masked versions of the chromosome. The difference between the bare and masked regions of the chromosome defined the lowcomplexity chromosomal regions. The asymmetry seems to be quite significant in all three regions (bare,masked,low complexity) as can be seen in Additional files 8 and 9.
It is wellknown that there exist local violations of the 2^{nd}Chargaff rule; in particular, there exists an excess of #G over #C and #T over #A on the coding strand within most genes. Green et al. [19] have argued that mutational asymmetry has acted over long periods of time to produce such a compositional asymmetry, and discontinuities of such asymmetries are associated with loci of replication origin. These questions have also been studied by Huvet et al. [20]. Could it be that the asymmetry that we have encountered is somehow connected to these findings? Since the gene coding strand may be either the plus (P) or the minus (M) strand of the conventional genomic notation, this may seem to be unrelated, assuming there is equal probability for genes to occur on each strand.
The convention which is being used in the UCSC genome browser is that the “plus” strand refers to the linear 5’ to 3’ order of encountering the parm before the centromere, which is followed by the qarm of the chromosome. This convention is consistent with NCBI “top” assignment. Counting protein coding genes and RNA genes on human chromosomes as recorded by GeneCards (http://www.genecards.org) we are led to the conclusion that they display a clear excess of genes on the plus (P) strands. The results are displayed in Table 6. Their relation to preferences of #T > #A and #G > #C on the P strand looks statistically significant. Clearly genes occupy only a small fraction of human chromosomes but they could still be the cause for the very small deviation from the Chargaff rule. It may also be that some other mechanism leads to a builtin excess on the chromosomes, and the latter affects the preference of gene allocations within the two strands. A notable exception to the observed general trend is chr 11.
There are two different issues which are noteworthy in Table 6. One is the correlation of the preference of #T > #A and #G > #C with the positive labeling of the strand. The other is the correlation of #T > #A and #G > #C with the preference for gene counts. Whereas the first may be coincidental (although it could be related to the labelling convention whose sources we were unable to trace), we believe that the second can be meaningful.
Next we looked for the violation of the 2^{nd} Chargaff rule on mouse and yeast, with the purpose of characterizing the asymmetries and looking for correlations with gene occurrences. The gene counts were obtained from MGI (MRK_list2 in ftp://ftp.informatics.jax.org/pub/reports/index.html) for mouse, and from SGD snapshot (http://www.yeastgenome.org/genomesnapshot) for yeast. While asymmetries of nucleotide occurrences are evident and significant in both, gene data are quite smaller than in human and no conclusive correlations can be deduced. The analyses are listed in Additional files 10 and 11. Finally we test the 2^{nd} Chargaff rule on C elegans and E coli in Additional file 12. While the former shows some significant inconsistencies, the latter is completely consistent. This behavior correlates well with the trends already noted in the first raw of Table 4, indicating large values of E[Z] for k = 1.
Discussion
Inversion symmetry may be stated as the equality N(S) = N(S^{inv}) where N are counts and S is some arbitrary string existing on a chromosome. Conventionally one studies such equalities over the space of all S which are kmers of some given length k. In addition to this equality one requires that, if S^{inv} is replaced by other permutations over the space of all kmers, analog rules will not hold.
After reinvestigating these questions on various genomic data, with special attention devoted to human data, we turned to a rigorous statistical study. For this purpose we defined the normalized differences Y = (N(S)N(S^{inv}))/(N(S) + N(S^{inv})) and X = Y. If the equality N(S) = N(S^{inv}) holds for stochastic variables N, we expect the variable Y to have approximately Gaussian behavior. If, moreover, N is a Poisson distribution, then Z = (N(S)N(S^{inv}))/(N(S) + N(S^{inv}))^{½} should have approximately a standard normal distribution. The stochastic variable Z is the appropriate one to be used for a ztest, characterizing the significance of IS values displayed by Y or X, under the IPPoisson model.
In order to characterize approximate IS we have employed E_{k}[X] ≤ 0.1 as a convenient measure. We saw in Table 4 that it captures the region for which significant results are obtained, and almost all kmers appear on the chromosome. Defining the klimit KL as the kvalue for which E_{k}[X] turns out to be closest to 0.1, we uncover a logarithmic increase of KL with chromosomal length. It turns out that this behavior is accounted for by our ISPoisson model.
Our original definition of IS regarded it as an approximate symmetry. As such it was seen to be valid for all ranges of k up to KL. With the advent of the IPPoisson model, we may investigate to what extent it can serve as an exact symmetry. It turns out that, for very low k, E_{k}[X] though extremely low, is significantly different from 0. In other words, the confidence intervals derived from ISPoisson, exclude a peak at Z = 0. This has lead us to investigate the violation of the 2^{nd} Chargaff rule, i.e. deviations from the relations N(T) = N(A) and N(G) = N(C). We find that deviations are very significant in human and in mouse, and quite significant on chromosomes of other eukaryotes. Moreover, in human we observe that, for most chromosomes, N(T) > N(A) and N(G) > N(C), i.e. these excesses are observed to occur on chromosomal plus (P) strands. Investigating the occurrences of genes on both strands, we find a similar excess with significant slight preference for the P strand. These results, for nucleotide excess and gene excess, are displayed in Table 6, and are seen to hold for a large majority of chromosomes.^{Footnote 1} Still, there exist also some counterexamples. Could it be that the known asymmetries of complement nucleotides on gene coding strands are related to the observed correlation of the two effects in the human genome? This remains an interesting question for future studies.
Conclusions
Inversion symmetry is valid for almost all chromosomes, even after filtering out their lowcomplexity regions. We have defined an empirical criterion of IS, and a corresponding klimit (KL), which is the highest k for which all kmer distributions abide by the symmetry. Analyzing the IS behavior using rigorous statistical methods, and comparing empirical results with our ISPoisson model, we account for the universal increase of KL with respect to the chromosomal length.
For low k we find minute, yet significant, deviations from strict IS. This includes excess of counts of nucleotides T vs A and G vs C on positive strands of human chromosomes. We point out that this finding correlates with a significant (but small) excess of genes on the same positive strands.
Methods
For a string S of length k, and its symmetryrelated S*, we introduce the stochastic variables N(S) and N(S*), and through them the following variables X, Y and Z (using an italicized notation):
All three variables can be evaluated for each specific kmer, hence they carry implicit indices of the space of 4^{k} kmers. The empirical values of the different variables will be denoted by N, X, Y and Z. Measures similar to Y appear in the literature for k = 1 and k = 2 and are known as skews.
The symmetryrelation means that N(S) and N(S*) are drawn from the same distribution, N(S) = N(S*). Furthermore, it is reasonable to assume that kmer appearances on a long chromosome resemble a Poisson process. This has been verified by us by investigating counts for all nonoverlapping windows of some size L (e.g. L = 100 K on human chr 1). If the expectation of the Poisson is large enough (a typical quoted number is 30), we can safely assume by the Central Limit Theorem that the distribution of the statistic Z we define is well approximated by a standard normal distribution.

1.
The semiGaussian distribution of inversepair differences.
Let us consider a pair of kmers, with counts N(S) and N(S*) respectively, for which we evaluate the ratios Y = (N(S)  N(S*))/(N(S) + N(S*)) and X = Y. Moreover, we assume that these counts are due to two random variables drawn from the same distribution, thus having the same average, E[N(S)N(S*)] = 0, and follow a Gaussian distribution G = exp(−Y^{2}/2 σ_{G} ^{2})/ σ_{G}(2π)^{½}. The counts of the distribution of X = Y will then follow a semiGaussian distribution P = 2 exp(−X ^{2}/2 σ_{G} ^{2})/ σ_{G}(2π)^{½}, defined for positive X only. The mean and variance of this semiGaussian are E[X] = σ_{G} (2/π) ^{½} ≈ 0.798σ_{G} and V[X] = σ_{G} ^{2} (12/π). Hence σ_{X} = 0.603σ_{G} = 0.755E_{X}.
Empirical verification of inverse pair distributions can be carried out by choosing counts for all nonoverlapping windows of some size L (e.g. L = 100 K on human chr 1). Testing the X and Y distributions for inverse pairs of k = 8 we find the above description to be valid.

2.
Poisson distributions of counts.
Let us now assume that the counts N(S) observed on a chromosome, are realizations of stochastic variables which follow Poisson distributions, each with its own mean = variance. In the IS limit the distribution of the inverse N(S^{inv}) coincides with that of N(S). Their difference should have a mean of 0, and variance which is the sum of the variances. Thus, for each inverse pair of kmers, we expect Y = (N(S)  N(S^{inv}))/(N(S) + N(S^{inv})) to become approximately Gaussian with mean 0 and standarddeviation σ_{G} =1/(N(S) + N(S^{inv}))^{½}. Alternatively we can state that Z = (N(S)  N(S^{inv}))/(N(S) + N(S^{inv}))^{½} should approximately follow a standard normal distribution with mean = 0 and variance = 1. It follows then, from the previous paragraph, that in this regime we should obtain, after averaging over all kmers, the results E_{k}(Z) = 0.8 and σ_{k}(Z) = 0.6.
Note that E_{k}(Z) may also be viewed as E_{k}(X/σ_{X}), where σ_{X} = 1/(N(S) + N(S^{inv}))^{½} for every particular kmer under consideration. Tables 2 and 4 demonstrate that the experimental results for large k are close to the predicted theoretical expectation.

3.
Monotonic increase of E_{k}[X] as function of k in the IS limit.
For perfect IS it is trivial to prove that E_{k}[X] = 0 implies that this equality holds for lower k, i.e. E_{k1}[X] = 0. Here we study the case of approximate inversion symmetry, with the purpose of proving that small E_{k}[X] < <1 implies even smaller E_{k1}[X]. For simplicity we assume that all 4^{k} kmers are being realized on the chromosomal strings.
Let {S_{j}, j = 1…4^{k}} be the set of all kmers, and {S’_{i}, i = 1…4^{k1}}be the set of all (k1)mers. Each (k1)mer can be extended to the right by one nucleotide, resulting in four kmers. Let us refer to this extension of S’_{i} as the set S_{jϵI}. Corresponding relations will hold for their inverse partners, extended by nucleotides to their left. It follows then that the counts of these sets can be related N(S’_{i}) = ∑_{jϵI} N(S_{j}). Hence
$$ \begin{array}{l}{\mathrm{X}}_{\mathrm{k}\hbox{} 1}\left(\mathrm{i}\right) = \left\mathrm{N}\left(\mathrm{S}{'}_{\mathrm{i}}\right)\ \hbox{}\ \mathrm{N}\left(\mathrm{S}{'_{\mathrm{i}}}^{\mathrm{i}\mathrm{nv}}\right)\right/\left(\mathrm{N}\left(\mathrm{S}{'}_{\mathrm{i}}\right) + \mathrm{N}\left(\mathrm{S}{'_{\mathrm{i}}}^{\mathrm{i}\mathrm{nv}}\right)\right) = \left\left({\displaystyle {\sum}_{\mathrm{j}\in \mathrm{I}}\mathrm{N}\left({\mathrm{S}}_{\mathrm{j}}\right)}\hbox{}\ {\displaystyle {\sum}_{\mathrm{j}\in \mathrm{I}}\mathrm{N}\left({{\mathrm{S}}_{\mathrm{j}}}^{\mathrm{i}\mathrm{nv}}\right)}\right)\right\ \\ {}/\ \left({\displaystyle {\sum}_{\mathrm{j}\in \mathrm{I}}\mathrm{N}\left({\mathrm{S}}_{\mathrm{j}}\right)} + {\displaystyle {\sum}_{\mathrm{j}\in \mathrm{I}}\mathrm{N}\left({{\mathrm{S}}_{\mathrm{j}}}^{\mathrm{i}\mathrm{nv}}\right)}\right).\end{array} $$Using the notation N(S_{j}) – N(S_{j} ^{inv}) = ∆N(S_{j}), we note that the numerator on the right obeys ∑_{jϵI} ∆N(S_{j}) ≤ ∑_{jϵI} ∆N(S_{j}). Because of varying signs this inequality may imply a strong decrease.
We may now compare the expressions of E_{k1}[X] =4^{k+1}∑_{i}X_{k1}(i) and E_{k}[X] = 4^{k}∑_{j}X_{k1}(j). Using the results of the previous paragraph we conclude that the numerators of X_{k1}(i) in E_{k1}[X] are smaller (or equal) than the numerators of X_{k}(j), where jϵI, in E_{k}[X]. Note however that the denominators of X_{k1}(i) and X_{k}(j), where jϵI, are different. To the extent that all N(S_{jϵI}) have similar values within the group jϵI when we approach the IS limit, this leads to X_{k1}(i) ≤ ∑_{jϵI} X_{k}(j)/4, which implies that
$$ {\mathrm{E}}_{\mathrm{k}\hbox{} 1}\left[\mathrm{X}\right]\ \le {\mathrm{E}}_{\mathrm{k}}\left[\mathrm{X}\right]. $$In practice, for large k, we find in Tables 2 and 4 that E_{k1} [X] ≈ E_{k}[X]/2.
It should be emphasized that the monotonic increase holds in the IS limit, i.e. when E_{k}[X] < <1, but it is not a general property of kmers on any chromosomal section. Synthetic counter examples can be constructed.

4.
Comparison of E_{k}[X] with the S1 measure.
The measure S1, introduced by Baisnee Hampson and Baldi [10], comparing counts of all kmers with their inverses, is defined by
$$ \mathrm{S}1=1{\textstyle \hbox{} }{\displaystyle {\sum}_{\mathrm{i}}}\left\mathrm{N}\left({\mathrm{S}}_{\mathrm{i}}\right){\textstyle \hbox{}}\mathrm{N}\left({{\mathrm{S}}_{\mathrm{i}}}^{\mathrm{i}\mathrm{nv}}\right)\right/{\displaystyle {\sum}_{\mathrm{i}}}\left(\mathrm{N}\left({\mathrm{S}}_{\mathrm{i}}\right)+\mathrm{N}\left({{\mathrm{S}}_{\mathrm{i}}}^{\mathrm{i}\mathrm{nv}}\right)\right). $$The denominator in this expression equals twice the length of the chromosome. The numerator may be regarded as an L1 distance between two sets of sequences.
Note the difference from our measure E_{k}[X], which may be written as
$$ {\mathrm{E}}_{\mathrm{k}}\left[\mathrm{X}\right]={{\mathrm{M}}_{\mathrm{k}}}^{{\textstyle \hbox{} }1}{\displaystyle {\sum}_{\mathrm{i}}}\left\mathrm{N}\left({\mathrm{S}}_{\mathrm{i}}\right){\textstyle \hbox{}}\mathrm{N}\left({{\mathrm{S}}_{\mathrm{i}}}^{\mathrm{i}\mathrm{nv}}\right)\right/\left(\mathrm{N}\left({\mathrm{S}}_{\mathrm{i}}\right)+\mathrm{N}\left({{\mathrm{S}}_{\mathrm{i}}}^{\mathrm{i}\mathrm{nv}}\right)\right). $$E_{k}[X] averages the relative difference of all kmers on equal footing, whereas S1 sums all absolute differences.
A comparison of the two different measures on human chr1 is presented in Table 7. We find that E_{k}[X] is roughly twice (1S1)_{k}, and the latter is approximately equal to E_{k1}[X].
Notes
Skew analyses, i.e. nonvanishing Y(A,T) and Y(C,G), have been carried out before. One example is table 2 in [21]. The correlation with gene numbers on human chromosomes observed in Table 6 is new, to the best of our knowledge. Forsdyke et al. [22] have investigated the correlation of conventional positive (or “top”) strands with the difference of #A#T in chromosomes of C elegans and D melanogaster. These organisms, which have low numbers of chromosomes, do not exhibit a clear preference for excess of either #A or #T.
References
Chargaff E. Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia. 1950;6(6):201–9.
Chargaff E. Structure and function of nucleic acids as cell constituents. Federal Proc. 1951;10:654–9.
Crick F, Watson JD. Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature. 1953;171:737–8.
Rudner R, Karkas JD, Chargaff E. Separation of B. subtilis DNA into complementary strands. III. Direct Analysis. Proc Natl Acad Sci U S A. 1968;60:921–2.
Mitchell D, Bridge R. A test of Chargaff’s second rule. Biochem Biophys Res Commun. 2006;340(1):90–4.
AlbrechtBuehler G. Asymptotically increasing compliance of genomes with Chargaff’s second parity rules through inversions and inverse transpositions. Proc Natl Acad Sci U S A. 2006;103(47):17828–33.
Prabhu VV. Symmetry observations in long nucleotide sequences. Nuc. Acids Res. 1993;21(12):2797–800.
Baldi P, Brunak S. Bioinformatics, the machine learning approach. MIT Press. 2001
Qi D, Cuticchia AJ. Compositional symmetries in complete genomes. Bioinformatics. 2001;17:557–9.
Baisnee PF, Hampson S, Baldi P. Why are reverseary DNA strands symmetric? Bioinformatics. 2002;18:1021–33.
Kong SG, Fan WL, Chen HD, Hsu ZT, Zhou N, Zheng B, Lee HC. Inverse symmetry in complete genomes and wholegenome inverse duplication. PlosOne. 2009;4:e7553.
Wang S, Tu J, Jia Z, Lu Z. High order intrastrand partial symmetry increases with organismal complexity in animal evolution. Sci Rep. 2014;4:6400.
Afreixo V, Bastos CAC, Garcia SP, Rodrigues JMOS, Pinho AJ, Ferreira PJSG. The breakdown of the word symmetry in the human genome. J Theor Biol. 2013;335:153–9.
Powdel BR, Satapathy SS, Kumar A, Jha PK, Buragohan AK, Borah M, Ray SK. A Study in Entire Chromosomes of Violations of the Intrastrand Parity of Complementary Nucleotides (Chargaff’s Second Parity Rule). DNA Res. 2009;16:325–43.
Afreixo V, Rodrigues JMOS, Bastos CAC. Analysis of singlestrand exceptional word symmetry in the human genome: new measures. Biostatistics. 2015;16(2):209–21.
Chor B, Horn D, Goldman N, Levy Y, Massingham T. Genomic DNA kmer spectra: models and modalities. Genome Biol. 2009;10:R108.
Pevzner P, Tesler G. Genome rearrangements in Mammalian Evolution: Lessons from Human and Mouse Genomes. Genome Res. 2003;13:37–45.
Okamura K, Wei J, Scherer SW. Evolutionary implications of inversions that have caused intrastrand parity in DNA. BMC Genomics. 2007;8:160.
Green P, Ewing B, Miller W, Thomas PJ. NISC Comparative Sequencing Program & Green ED. Transcriptionassociated mutational asymmetry in mammalian evolution. Nat Gen. 2003;33:514–7.
Huvet M, Nicolay S, Touchon M, Audit B, d’AubentonCarafa Y, Arneodo A, Thermes C. Human gene organization driven by the coordination of replication and transcription. Gen Res. 2007;17:1278–85.
Mascher M, Schubert I, Scholz U, Friedel S. Patterns of nucleotide asymmetries in plant and animal genomes. BioSystems. 2013;111:181–9.
Forsdyke DR, Zhang C, Wei JF. chromosomes as interdependent accounting units. J Biol Syst. 2010;18:1–16.
Acknowledgments
We thank Martin Kupiec, Doron Lancet, Isaco Melijson, Erez Persi and Tal Pupko for helpful discussions.
Funding
This research was partially supported by the research fund of the Blavatnik School of Computer Science.
Availability of data and materials
The code we have developed for kmer counting on genomic sequences is available at https://genomeutils.codeplex.com.
Data were obtained from public depositories listed below under Data deposition.
Authors’ contributions
BC and DH initiated the study and contributed to its design. SR contributed to the design of the statistical analysis. SS carried out the numerical data analysis. BC and DH prepared the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare they have no competing interests.
Consent for publication
The authors consent to publication by BMC genomics.
Ethics approval and consent to participate
Not applicable.
Data deposition
Eukaryote genomes have been downloaded from http://hgdownload.cse.ucsc.edu
These include Human HG18 & Masked, Human HG38 & Masked, Mouse MM10 & Masked, Yeast sacSer3 & Masked, C. Elegans CE10 & Masked, Chimpanzee panTro2, Zebrafish danRer6, D. melanogaster dm3.
Prokaryote genomes were downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/
These include B. subtilis AL009126 UID76, M. avium paratubercolosis CP005928 MAP4 UID168471, P. furyosus AE009950 UID287, T. maritima AE000512 UID111.
E.coli.K12. mg1655.U00096.2 genome was downloaded from
https://www.genome.wisc.edu/pub/sequence/U00096.2.fas
Information regarding gene counts on plus and minus strands for human has been received from the GeneCards administrator at http://www.genecards.org
Mouse gene counts were obtained from MGI MRK_list2 in ftp://ftp.informatics.jax.org/pub/reports/index.html.
Yeast gene counts were obtained from SGD snapshot at http://www.yeastgenome.org/genomesnapshot
Author information
Authors and Affiliations
Corresponding author
Additional files
Additional file 1:
The variation of klimits (defined by largest k for which E_{k}[X] ≈ 0.1) as function of chromosome length in HG38 both before and after masking has been applied. (DOCX 17 kb)
Additional file 2:
The variation of klimits (defined by largest k for which E_{k}[X] ≈ 0.1) as function of chromosome length in HG18 both before and after masking has been applied. (DOCX 17 kb)
Additional file 3:
Distribution of inverse pairs in a chromosomal section of length 10Mbp, drawn from chr1. Range of X < 0.1. (DOCX 100 kb)
Additional file 4:
Distribution of inverse pairs in a chromosomal section of length 1Mbp drawn from chr 1. Range of X < 0.3. Smoother distributions are obtained when kmers containing CG dimers are excluded (not shown). (DOCX 165 kb)
Additional file 5:
Values of E_{2}[X] for inverse pairs of k = 2, evaluated over nonoverlapping windows (the ordinate specifies the serial number of the window) of length 5 K on chr1. Average value is 0.07. (DOCX 69 kb)
Additional file 6:
Jittery behavior of E_{1}[X] for inverse pairs on nonoverlapping windows of 1Kbp on chr 1 indicates semilocal violations of the second Chargaff rule. The ordinate specifies the serial number of the window. (DOCX 62 kb)
Additional file 7:
Histograms of inverse pairs for different kmers evaluated on the model based on rearrangements applied to an artificial chromosome of length 1 M constructed out of the mitochondrial chromosome, as described in the text. Note that the distributions are confined within the very short range of X < 0.04. (DOCX 102 kb)
Additional file 8:
Z values for comparison of T and A counts on HG38. (DOCX 21 kb)
Additional file 9:
Z values for comparison of G and C counts on HG38. (DOCX 15 kb)
Additional file 10:
Mouse data: ratios of #T/#A, #G/#C, and numbers of genes on Plus and Minus strands, together with their Z values. Most gene ratios have insignificant Z values, i.e. they are consistent with equality. Most #T/#A and #G/#C display significant violation of strict Chargaff rule. ChrX is exceptional: here both Z values are small and Chargaff violation is insignificant. Gene data are derived from MRK_list2 in ftp://ftp.informatics.jax.org/pub/reports/index.html. (DOCX 18 kb)
Additional file 11:
Yeast data: ratios of #T/#A, #G/#C, and numbers of genes on Plus and Minus strands, together with their Z values. All gene ratios have low Z values, i.e. they are consistent with equality. Many #T/#A and #G/#C display significant violation of strict Chargaff rule. Gene data are derived from http://www.yeastgenome.org/genomesnapshot. (DOCX 17 kb)
Additional file 12:
Ratios of #T/#A and #G/#C and their Z values for C elegans and E coli. While for C elegans one observes some significant violations of the 2^{nd} Chargaff rule, the E coli data are completely consistent with this rule. (DOCX 15 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Shporer, S., Chor, B., Rosset, S. et al. Inversion symmetry of DNA kmer counts: validity and deviations. BMC Genomics 17, 696 (2016). https://doi.org/10.1186/s1286401630128
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286401630128
Keywords
 Generalized Chargaff rules
 Chromosome kmer distributions
 Inversion symmetry