Interchromosomal kmer distances
BMC Genomics volume 22, Article number: 644 (2021)
Abstract
Background
Inversion Symmetry is a generalization of the second Chargaff rule, stating that the count of a string of k nucleotides on a single chromosomal strand equals the count of its inverse (reverse-complement) kmer. It holds for many species, both eukaryotes and prokaryotes, for ranges of k which may vary from 7 to 10 as chromosomal lengths vary from 2 Mbp to 200 Mbp. Building on this formalism we introduce the concept of kmer distances between chromosomes. We formulate two kmer distance measures, D_{1} and D_{2}, which depend on k. D_{1} takes into account all kmers (for a single k) appearing on single strands of the two compared chromosomes, whereas D_{2} takes into account both strands of each chromosome. Both measures reflect dissimilarities in global chromosomal structures.
Results
After defining the various distance measures and summarizing their properties, we also define proximities that rely on the existence of synteny blocks between chromosomes of different bacterial strains. Comparing pairs of strains of bacteria, we find negative correlations between synteny proximities and kmer distances, thus establishing the meaning of the latter as measures of evolutionary distances among bacterial strains. The synteny measures we use are appropriate for closely related bacterial strains, where considerable sections of chromosomes demonstrate high direct or reversed equality. These measures are not appropriate for comparing different bacteria or eukaryotes.
Kmer structural distances can be defined for all species. Because of the arbitrariness of strand choices, we employ only the D_{2} measure when comparing chromosomes of different species. The results for comparisons of various eukaryotes display interesting behavior which is partially consistent with conventional understanding of evolutionary genomics. In particular, we define ratios of minimal kmer distances (KDR) between unmasked and masked chromosomes of two species, which correlate with both short and long evolutionary scales.
Conclusions
Kmer distances reflect dissimilarities among global chromosomal structures. They carry information which aggregates all mutations. As such they can complement traditional evolution studies, which mainly concentrate on coding regions.
Background
The phenomenon of Inversion Symmetry (IS) has recently been reevaluated and established in [1]. This generalization of the second Chargaff rule [2] implies that the number of occurrences of any sequence m of length k on a chromosomal strand S is equal to the number of occurrences of its inverse (reverse-complement) sequence m^{inv} on the same strand. Another way of stating the same fact is that the number of occurrences of m on one chromosomal strand is equal to the number of occurrences of m on the other strand provided both are being read along their own 5′ to 3′ directions.
The accuracy of such statements depends on the length k of the nucleotide sequences which are being employed. It turns out to have a monotonic dependence on k, i.e. as k increases the symmetry worsens. If one sets the required accuracy at 10% one finds [1] that it holds for k ≤ KL where KL grows logarithmically with the length L of the chromosome. KL values for mammals are 9 or 10, while for bacteria they are 7 or 8. These choices of KL guarantee that all possible kmers of a particular k-value will be found on the chromosome in question.
Inversion symmetry can be restated as the demonstration of a low kmer distance between the two strands of the same chromosome [3], with exact symmetry implying zero distance. The notion of kmer distances between different chromosomes, within and between species, is a simple extension of the same basic idea: comparing frequencies of all strings of nucleotides of the same length k on different chromosomes, summing over one or over both strands of each chromosome.
Short kmer distances can be interpreted as large structural similarities between chromosomes. In bacteria we establish correlations of short kmer distances between bacterial strains with large synteny proximities. Both concepts are explained in the Methods section. For bacterial strains, they also serve as good measures of evolutionary distances.
The synteny proximities which we employ are valid measures between bacterial strains which are very close evolutionary relatives. Otherwise one cannot find large genomic sections with high identities among them. Therefore, conventional synteny measures which are used in genomic evolutionary studies [4] are very different from our synteny proximities and are mostly concentrated on coding regions.
Kmer distances, which are global measures, can be used to compare any two chromosomes. When studying eukaryotes, the compared chromosomes are dominated by noncoding regions. Comparing minimal kmer distances between various genomes, we find interesting results. In particular, ratios of unmasked to masked minimal genome distances correlate with evolutionary distances among different species.
Methods
Definitions and properties of kmer distances between chromosomes
The term kmer refers (in the genomic context) to all possible nucleotide substrings of length k that are contained in a given chromosomal strand of length L, uncovered by a sliding-window search. The total number of their occurrences is N = L − k + 1. We define the empirical frequency of a specific kmer, e.g. m_{1}, in the strand S as the number of occurrences of this kmer in S, denoted n_{S}(m_{1}), divided by N:
\[ f_{S}\left(m_1\right)=\frac{n_{S}\left(m_1\right)}{N} \]
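Let us define the kmer distance D_{1} as the L1-norm of the difference between the 4^{k}-dimensional vectors containing the frequencies of all kmers, when comparing two chromosomal strands (e.g. positive strands of two chromosomes) S_{1} and S_{2}:
\[ D_1^k\left(S_1,S_2\right)=\sum_{i=1}^{4^k}\left|f_{S_1}\left(m_i\right)-f_{S_2}\left(m_i\right)\right| \]
where f_{S}(m_{i}) is the empirical frequency of the kmer m_{i} on the strand S.
The index 1 in D_{1} refers to the fact that we use only one strand on each chromosome in this comparison of two chromosomes.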
Similarly, we may define a distance measure D_{2} by taking into account both strands of the two chromosomes, reading them along their own 5′ to 3′ directions. Since each specific kmer on the negative strand is accompanied by its inverse (reverse-complement) on the positive strand, we may define D_{2} as
\[ D_2^k\left(S_1,S_2\right)=\frac{1}{2}\sum_{i=1}^{4^k}\left|f_{S_1}\left(m_i\right)+f_{S_1}\left(m_i^{inv}\right)-f_{S_2}\left(m_i\right)-f_{S_2}\left(m_i^{inv}\right)\right| \]
where we use a single strand on each chromosome, denote by m_{i}^{inv} the inverse (reverse complement) of every kmer m_{i}, and sum over all kmers along a single strand of each of the two chromosomes. Division by 2 is introduced in the definition of D_{2} because the effective number of counts on each chromosome becomes 2N.
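The triangular inequality implies that
\[ \left|f_{S_1}\left(m_i\right)+f_{S_1}\left(m_i^{inv}\right)-f_{S_2}\left(m_i\right)-f_{S_2}\left(m_i^{inv}\right)\right|\le \left|f_{S_1}\left(m_i\right)-f_{S_2}\left(m_i\right)\right|+\left|f_{S_1}\left(m_i^{inv}\right)-f_{S_2}\left(m_i^{inv}\right)\right| \]
for every single kmer. When this is summed over all kmers, each of the two terms on the right-hand side sums to D_{1}, because m_{i}^{inv} runs over all kmers as m_{i} does. It follows then that
\[ D_2^k\left(S_1,S_2\right)\le D_1^k\left(S_1,S_2\right) \qquad (5) \]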
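The two measures can be illustrated with a minimal Python sketch. It takes D_{1} to be the L1 distance between the kmer frequency vectors of two single strands, and D_{2} to be half the L1 distance between frequencies summed over each kmer and its reverse complement. This is not the authors' program: the function names (`kmer_frequencies`, `d1`, `d2`) are chosen here for illustration, and skipping windows that contain characters other than A, C, G, T is an assumption of the sketch.

```python
from collections import Counter

ALPHABET = set("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the inverse (reverse complement) of a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_frequencies(strand: str, k: int) -> dict:
    """Empirical kmer frequencies from a sliding window of width k.

    Assumption of this sketch: windows with non-ACGT characters (e.g. N)
    are dropped, and counts are normalized by the number of kept windows.
    """
    strand = strand.upper()
    counts = Counter(strand[i:i + k] for i in range(len(strand) - k + 1))
    counts = {m: n for m, n in counts.items() if set(m) <= ALPHABET}
    total = sum(counts.values())
    return {m: n / total for m, n in counts.items()}


def d1(s1: str, s2: str, k: int) -> float:
    """Single-strand distance: L1 norm of the frequency difference."""
    f1, f2 = kmer_frequencies(s1, k), kmer_frequencies(s2, k)
    return sum(abs(f1.get(m, 0.0) - f2.get(m, 0.0))
               for m in f1.keys() | f2.keys())


def d2(s1: str, s2: str, k: int) -> float:
    """Two-strand distance: each kmer is pooled with its reverse complement."""
    f1, f2 = kmer_frequencies(s1, k), kmer_frequencies(s2, k)
    # Close the set of observed kmers under reverse complementation;
    # kmers outside this set contribute zero to the sum.
    kmers = set()
    for m in f1.keys() | f2.keys():
        kmers.add(m)
        kmers.add(reverse_complement(m))
    total = 0.0
    for m in kmers:
        inv = reverse_complement(m)
        a = f1.get(m, 0.0) + f1.get(inv, 0.0)
        b = f2.get(m, 0.0) + f2.get(inv, 0.0)
        total += abs(a - b)
    return total / 2
```

For example, `d1("AAAA", "TTTT", 2)` gives the maximal single-strand value 2, while `d2("AAAA", "TTTT", 2)` vanishes because the two strands are inverses of each other.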
Using the above definitions we summarize the properties of kmer distances:

1.
Positivity. By definition all distances are nonnegative.

2.
If \( {D}_{1,2}^k\left({S}_1,{S}_2\right)=0 \) then S_{1} and S_{2} are equivalent, in the sense that both chromosomes have the same frequencies of all kmers. This does not necessarily imply that the two chromosomes are equal to each other, because they may differ in length.

3.
Symmetry. By definition, \( {D}_{1,2}^k\left({S}_1,{S}_2\right)={D}_{1,2}^k\left({S}_2,{S}_1\right) \).

4.
Inequality: \( {D}_2^k\left({S}_1,{S}_2\right)\le {D}_1^k\left({S}_1,{S}_2\right) \), as proved above in Eq. 5.

5.
Triangular inequalities of distances:
\[ D_{1,2}^k\left(S_1,S_3\right)\le D_{1,2}^k\left(S_1,S_2\right)+D_{1,2}^k\left(S_2,S_3\right) \]
This can be proved in an analogous fashion to property 4.

6.
Inversion symmetry [1] implies that \( {D}_1^k\left({S}_1,{S}_2\right)=0 \) if S_{2} is the inverse of S_{1} (or equivalent to it in the sense of property 2). Otherwise this distance will be positive. Such a definition of inversion symmetry has been introduced by [3]. \( {D}_2^k\left({S}_1,{S}_2\right)=0 \) is a trivial statement for two strands which are inverses of each other.

7.
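Monotonic increase with k:
\[ D_1^{k-1}\left(S_1,S_2\right)\le D_1^{k}\left(S_1,S_2\right) \]
To prove this property note that a kmer m_{i}^{k} can be generated from a corresponding m_{j}^{k−1}, which coincides with the first k−1 entries of m_{i}^{k}, by adding to it one of the four nucleotides {A, C, G, T}. Let us define this set as {j,i} for a given m_{j}^{k−1} and its four corresponding m_{i}^{k}. Neglecting an end-of-strand correction of order 1/N, the frequency of m_{j}^{k−1} is the sum of the frequencies of its four extensions. It follows then that
\[ D_1^{k-1}=\sum_j\left|f_{S_1}\left(m_j^{k-1}\right)-f_{S_2}\left(m_j^{k-1}\right)\right|=\sum_j\left|\sum_{i\in \{j,i\}}\left[f_{S_1}\left(m_i^{k}\right)-f_{S_2}\left(m_i^{k}\right)\right]\right|\le \sum_j\sum_{i\in \{j,i\}}\left|f_{S_1}\left(m_i^{k}\right)-f_{S_2}\left(m_i^{k}\right)\right|=D_1^{k} \]
by summing over the indices using the {j,i} association, and applying the extended triangular inequality to each set of four f_{i} whose kmers m_{i}^{k} begin with the same (k−1)-mer m_{j}^{k−1} with index j.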
This proof can be trivially extended to D_{2}.
One condition for these inequalities to hold is that all kmers are realized on the chromosomal strands which are being investigated, i.e. all \( n\left({m}_i^k\right)>0 \).
Finally we touch upon the question of the range of k-values for which the distance measures can be applied.
Shporer et al. [1] have introduced the notion of the KL limit. This is the k-value for which Inversion Symmetry fails at the rate of 10%. They demonstrated that chromosomes of different species, as well as different human chromosomal sections, follow a universal logarithmic slope of KL ~ 0.7 ln(L), where L is the length of the chromosome. This limit can also be derived from the assumption that L ≫ 4^{k}, allowing for all kmers to be expressed on the chromosome.
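The logarithmic slope can be checked from the requirement that L be much larger than 4^{k}. The short derivation below is only an order-of-magnitude sketch; the safety factor c > 1, which sets how many times larger than 4^{k} the chromosome must be, is a symbol introduced here for illustration and is not taken from [1].

```latex
\begin{align}
4^{k} \le \frac{L}{c}
\;\Longleftrightarrow\;
k \le \frac{\ln L - \ln c}{\ln 4}
\approx 0.72\,\ln L - 0.72\,\ln c .
\end{align}
```

The slope 1/ln 4 ≈ 0.72 is close to the quoted value of 0.7, while the constant offset depends on the chosen margin c.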
As an example of relevant statistics we display in Fig. 1 the percentage of missing kmers, i.e. those which do not appear on the strand, and the distance between two close strains of E. coli as a function of k, demonstrating that good results are obtained for k ≤ KL = 7.
When evaluating distances between two chromosomal strands with different lengths, L_{1} and L_{2}, one should limit oneself to KL where L = min(L_{1}, L_{2}), guaranteeing that the same k is valid for both chromosomal strands which are being compared.
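The fraction of missing kmers on a strand can be computed directly, as in the following sketch. The function name `missing_kmer_fraction` is hypothetical, and ignoring windows with non-ACGT characters is an assumption; the sketch does not compute the KL limit itself, which in [1] is defined through the accuracy of Inversion Symmetry.

```python
def missing_kmer_fraction(strand: str, k: int) -> float:
    """Fraction of the 4**k possible kmers that never occur on the strand."""
    strand = strand.upper()
    alphabet = set("ACGT")
    # Collect the distinct kmers seen by a sliding window of width k,
    # ignoring windows that contain ambiguous characters such as N.
    seen = {strand[i:i + k] for i in range(len(strand) - k + 1)}
    seen = {m for m in seen if set(m) <= alphabet}
    return 1.0 - len(seen) / 4 ** k


def missing_fraction_profile(strand: str, k_values) -> dict:
    """Missing-kmer fraction for each requested k."""
    return {k: missing_kmer_fraction(strand, k) for k in k_values}
```

When two strands of different lengths are compared, the profile of the shorter strand is the relevant one for choosing k.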
We provide a Python program for calculating kmer distances between two chromosomes, given as FASTA files, at https://github.com/akafri/kmerdistances.
Definition of synteny distances
Synteny blocks are genetic sequences in genomes of two species which consist of aligned homologous genes. A recent example of their importance was demonstrated by [5, 6]. Here we introduce definitions of synteny distances, which will be used to compare with kmer distances. This comparison will be carried out using different strains of the same bacterium, where large synteny blocks with identity percentages higher than 90% exist. The threshold of 90% is arbitrary. It was chosen to guarantee high similarity between the relevant chromosomes. For bacteria, where the selection of a positive strand is well defined, we differentiate between Direct Synteny Blocks (DSB), appearing along the same strand in both genomes, and Inverse Synteny Blocks (ISB), lying on opposite strands. An example is shown in Fig. 2.
To search for synteny blocks, BLAST was first used to identify local alignments between the two full sequences. The R package OmicCircos [7] was used to visualize results. From the BLAST output, we extract synteny blocks that have identity percentages higher than 90%, and calculate the overall sequence lengths of DSB and ISB (L_{DSB} and L_{ISB}, respectively).
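The extraction of L_{DSB} and L_{ISB} from BLAST hits can be sketched as follows. The sketch assumes the standard 12-column tabular BLAST output (`-outfmt 6`), which the text does not specify, and it treats a hit whose subject coordinates run backwards (sstart > send) as lying on the opposite strand. The function name `synteny_block_lengths` is hypothetical, and overlapping alignments are not merged, so the sums are upper estimates.

```python
import csv


def synteny_block_lengths(blast_lines, min_identity: float = 90.0):
    """Sum alignment lengths of direct and inverse blocks above a threshold.

    blast_lines: iterable of tab-separated lines with the columns
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore (assumed format).
    Returns (l_dsb, l_isb).
    """
    l_dsb = 0
    l_isb = 0
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment headers
        identity = float(row[2])
        length = int(row[3])
        sstart, send = int(row[8]), int(row[9])
        if identity <= min_identity:
            continue  # keep only blocks with identity higher than the threshold
        if sstart <= send:
            l_dsb += length  # same strand in both genomes
        else:
            l_isb += length  # reversed subject coordinates: opposite strand
    return l_dsb, l_isb
```

The function accepts any iterable of lines, so an open file handle on the BLAST output can be passed in directly.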
We then define direct synteny proximity
and overall synteny proximity as
where L_{1} and L_{2} are the lengths of the chromosomes S_{1} and S_{2} which are being compared.
The matched-pair algorithm for kmer distances between two species
To define distances between two eukaryote genomes we started by evaluating a distance matrix between all chromosomes of the two species. We then constructed a graph whose vertices are the chromosomes of the two species and its edges (lines connecting the vertices) represent the distance value of each pair. We proceeded along the following algorithmic steps:

1.
Eliminate edges with distances > 1 from the graph.

2.
Define an empty distance vector.

3.
Find the edge of the graph with the lowest distance value.

4.
Add this value as an entry to the distance vector.

5.
Remove this edge from the graph and repeat from step 3 until the graph is exhausted.

6.
Inspect the resulting distance vector and report its minimum (the first edge considered by the matching algorithm) and its median.
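The steps above can be sketched in Python as follows. With the default settings the function follows the steps literally, removing only the selected edge at each iteration. The optional `exclusive` flag, which also removes every other edge touching the two matched chromosomes, is an assumption suggested by the name of the algorithm and is not stated in the text. All names (`matched_pair_distances`, `summarize`, the chromosome labels) are hypothetical.

```python
from statistics import median


def matched_pair_distances(dist: dict, threshold: float = 1.0,
                           exclusive: bool = False) -> list:
    """Greedy edge selection on the bipartite chromosome graph.

    dist maps (chromosome_of_species_A, chromosome_of_species_B) to a
    kmer distance. Returns the distance vector in order of selection.
    """
    # Step 1: eliminate edges with distances above the threshold.
    edges = {pair: d for pair, d in dist.items() if d <= threshold}
    # Step 2: start from an empty distance vector.
    vector = []
    while edges:
        # Step 3: find the edge with the lowest distance value.
        pair = min(edges, key=edges.get)
        # Steps 4 and 5: record its value and remove it from the graph.
        vector.append(edges.pop(pair))
        if exclusive:
            # Assumed variant: drop all edges sharing a matched chromosome.
            a, b = pair
            edges = {(x, y): d for (x, y), d in edges.items()
                     if x != a and y != b}
    return vector


def summarize(vector: list) -> tuple:
    """Step 6: report the minimum and the median of the distance vector."""
    return min(vector), median(vector)
```

In the literal reading the distance vector is simply the sorted list of all retained edges, whereas the exclusive variant yields at most one distance per chromosome.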
Results
Distance measures in bacteria
We compared genomes of 23 strains of E. coli and 14 strains of Salmonella enterica. They are listed in Tables 1 and 2.
In Fig. 3 we present correlations of P_{DSYN} with D_{1} for (a) E. coli and for (b) S. enterica strains. In each of the two data sets we have looked into all pairs of strains. The data are presented for k = 7. We report only results between strains of the same bacterium since no significant correlation was found between any two strains of the two different bacteria. The larger number of E. coli strains leads to a clearer observation of the correlations.
Next we turn to correlations of overall synteny with D_{2}^{k = 7}. This is presented in Fig. 4. Once again we note the strong correlations in the data. The strong negative correlation is particularly significant for the E. coli strains where we have many more pairs of strains which can be compared with one another. Hence we limit our further analysis to just E. coli strains.
In order to appreciate the variation with k we display in Fig. 5 the Pearson correlation coefficients of D_{1} and D_{2} for all E. coli pairs of strains, as a function of k, for the two classes of synteny measures. Clearly k = 7, the choice made in Figs. 3 and 4, leads to a strong correlation. The relevant Pearson correlation p-values turn out to be minuscule, with the highest one being of order 10^{−7} for k = 1 for both D_{1} and D_{2}, and others of order 10^{−22} and smaller.
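For readers who wish to reproduce this kind of analysis, the Pearson coefficient between a list of synteny proximities and the matching list of kmer distances (one entry per pair of strains) can be computed as in the sketch below. It is a plain textbook implementation with a hypothetical function name, it does not compute p-values, and it is not taken from the authors' code.

```python
from math import sqrt


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums of products and squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)
```

A value close to −1 corresponds to the strong negative correlation between proximities and distances discussed here.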
We find different correlations of the two measures with P_{DSYN}. Whereas D_{1} displays the expected negative correlation, for all relevant k, D_{2} is less sensitive to the direct synteny measure. This may be expected since D_{2} is a measure sensitive to both strands whereas P_{DSYN} is sensitive to only one strand in each chromosome.
In order to appreciate this result let us dwell on the question of why inversion symmetry [1] holds up to large k-values of order KL. The plausible explanation is that genomes evolve through rearrangement processes. These rearrangements are inversions of sections between two breakpoints on the same chromosome. They may follow one another in a nested fashion. This scenario can explain the observed inversion symmetry, as demonstrated in [1]. Pevzner and Tesler [5] have argued that such phenomena are the basis of chromosomal evolution for single chromosomes and, with lower probability, also between different chromosomes. Here we observed that D_{1} between two strains of bacteria correlates strongly with both P_{DSYN} and P_{SYN} for all k ≤ 7, both reflecting chromosomal evolution at the short evolutionary scale appropriate to different strains of the same bacterium.
Distance measures between different species
In the previous section we have analyzed kmer distances between closely related bacterial strains, where the synteny distances that we have defined can be easily observed. When evolutionary genomics is applied to different eukaryotes one often limits oneself to similarity between homologous proteins rather than accurate duplications or inversions of large sections of the DNA. The use of kmer distances can indicate similarities between full chromosomes, which is the study we propose. From Inversion Symmetry we learn the powerful effect of rearrangement within a single chromosome. Rearrangements may also occur between chromosomes and kmer distances reflect their effects.
Evaluating minimal D_{2} distances according to the matched-pair algorithm (see Methods) we obtain the results displayed in Tables 3 and 4. The genome inputs, both unmasked (Table 3) and masked (Table 4), are taken from the UCSC server (see data supplementary file). Clearly, there is quite a difference between the two choices: masking reduces the distance values considerably. We use k = 8 which is a choice appropriate for all displayed species in Tables 3, 4, 5 and 6.
There are several striking results in Tables 3 and 4. One important result is the closeness of minimal and median distance values. This implies that similar kmer distances are observed for many chromosomal pairs of the two genomes, and are not limited to a single particular pair of chromosomes. In other words, homology spreads out between different chromosomal sections of the two compared species.
Another important result is the huge difference between minimal kmer distances of unmasked and masked genomes. Conventional understanding regards the low complexity components of the unmasked regions as unprotected by evolution. Hence ratios of unmasked to masked minimal D_{2}^{k = 8} distances measure the aggregated effect of different strengths of mutations when the low complexity sections of genomes are taken into account.
The results for these ratios are presented in Table 5. They seem to be correlated with evolutionary time lapses among primates and rodents, where the separation between human and chimpanzee is dated at 6.7 MYA (million years ago), between mouse and rat at 20 MYA and between rodents and primates at 90 MYA. However, this correlation does not extend to comparisons of these four species with dog and cow. The separation between primates and dog or cow is estimated at 96 MYA, and that between dog and cow at 78 MYA. All the evolutionary estimates are derived from the timetree website (http://www.timetree.org/).
A major tool employed in genomic evolutionary studies is Reversal (or inversion) Distance (RD) [5, 6]. Concentrating on the orders and details of genes or other markers, the idea is to work out how many inversions take place along the evolutionary path from one species to another. RD is the minimum number of reversals required to transform one genome into the other. The webtool of (http://www.timetree.org/) can be used to evaluate such distances. They fit the evolutionary time estimates much better, which is somewhat of a tautology because the estimates of (http://www.timetree.org/) take the RD methodology into account. However, RD is problematic when very large evolutionary distances are concerned, because of the shortage of genes which can be compared between distant organisms. Kmer distances are not subject to such constraints. Hence they can be applied to such problems. In Table 6 we compare human with the nematode (C. elegans) and the fruit fly (D. melanogaster), using the same methods as in Table 5. Obviously these results are satisfactory.
Interestingly, kmer distances are immune to large inversion events. In fact, this was the reason we use them to begin with, starting with the lessons drawn from Inversion Symmetry of chromosomes. On the other hand, kmer distances are sensitive to all other mutations that occur along an evolutionary path. In this sense, Kmer minimal Distance Ratios among genomes (KDR) can serve as a complement to RD. Moreover, it is applicable to all eukaryotes.
The full potential of KDR has still to be investigated and explained. Evolutionary genomic tools deal extensively with substitution rates, in particular the non-synonymous ones affecting amino-acid changes in proteins. The analogous investigation of substitution rates in low-complexity and high-complexity genomic regions is needed to explain how KDR, or the various minimal or median kmer distances among genomes, can be used for meaningful evolutionary conclusions.
Conclusions
We have introduced measures of kmer distances, and applied them to bacteria and to eukaryotes. The two measures D_{1} and D_{2} were compared to synteny measures in bacteria, tracing large identical sections of chromosomes between two strains of the same species. We identified a strong correlation between D_{1} and direct syntenic regions and a strong correlation between D_{2} and both direct and inverse syntenies, which indicates evolutionary similarity between two strains. We argue therefore that kmer distances are validated as good measures for evolutionary distances within bacteria.
D_{2} measures are also adequate for estimating distances between any two genomes which may have very ancient common ancestors. We exemplify this fact by demonstrating such distance measures between several eukaryotes. We find that there exists considerable difference between masked and unmasked distances, as expected from common evolutionary understanding of rapid variation in low complexity regions, being less protected by evolution. Moreover, we exploit this difference to establish minimal Kmer Distance Ratios (KDR), which correlate with evolutionary time scales of primates and rodents, as well as very large time scales such as between human, nematode and fruit fly.
Whereas conventional evolutionary studies continue to use traditional methods following changes within and throughout homologous genes, our kmer distances take into account the full chromosomes, involving both coding and noncoding sections. As such, they carry novel information which complements traditional investigations.
Availability of data and materials
All data analyzed during this study are included in the data supplementary information file.
References
Shporer S, Chor B, Rosset S, Horn D. Inversion symmetry of DNA kmer counts: validity and deviations. BMC Genomics. 2016;17(1):696. https://doi.org/10.1186/s12864-016-3012-8.
Rudner R, Karkas JD, Chargaff E. Separation of B. subtilis DNA into complementary strands. III. Direct analysis. Proc Natl Acad Sci U S A. 1968;60(3):921–2. https://doi.org/10.1073/pnas.60.3.921.
Baisnee PF, Hampson S, Baldi P. Why are complementary DNA strands symmetric? Bioinformatics. 2002;18(8):1021–33. https://doi.org/10.1093/bioinformatics/18.8.1021.
Sinha AU, Meller J. Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinformatics. 2007;8:82. Webserver: https://cinteny.cchmc.org/.
Pevzner P, Tesler G. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res. 2003;13(1):37–45. https://doi.org/10.1101/gr.757503.
Pham SK, Pevzner PA. DRIMM-Synteny: decomposing genomes into evolutionary conserved segments. Bioinformatics. 2010;26(20):2509–16. https://doi.org/10.1093/bioinformatics/btq465.
Hu Y, Yan C, Hsu CH, Chen QR, Niu K, Komatsoulis GA, et al. OmicCircos: a simple-to-use R package for the circular visualization of multidimensional omics data. Cancer Informat. 2014;13:13–20. https://doi.org/10.4137/CIN.S13495.
Lukjancenko O, Wassenaar TM, Ussery DW. Comparison of 61 sequenced Escherichia coli genomes. Microbial Ecol. 2010;60(4):708â€“20.
NCBI browser at https://www.ncbi.nlm.nih.gov/genbank.
Acknowledgements
We thank Uri Gophna and Erez Persi for helpful discussions.
Funding
This research was partially supported by the research fund of the Blavatnik School of Computer Science.
Author information
Contributions
BC and DH initiated the study and contributed to its design. AK carried out the numerical data analysis. DH prepared the manuscript. All authors read and approved the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
AK and DH dedicate this work to the memory of Benny Chor, a dear mentor and colleague.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Kafri, A., Chor, B. & Horn, D. Interchromosomal kmer distances. BMC Genomics 22, 644 (2021). https://doi.org/10.1186/s12864-021-07952-0