The over-representation of binary DNA tracts in seven sequenced chromosomes

Background: DNA tracts composed of only two bases are possible in six combinations: A+G (purines, R), C+T (pyrimidines, Y), G+T (Keto, K), A+C (Imino, M), A+T (Weak, W) and G+C (Strong, S). It has long been known that all-pyrimidine tracts, complemented by all-purine tracts ("R.Y tracts"), are excessively present in analyzed DNA. We have previously shown that R.Y tracts are in vast excess in yeast promoters, and presented evidence for their role in gene regulation. Here we report the systematic mapping of all six binary combinations on the level of complete sequenced chromosomes, as well as in their different subregions.

Results: DNA tracts composed of the above binary base combinations have been mapped in seven sequenced chromosomes: Human chromosomes 21 and 22 (the major contigs); Drosophila melanogaster chr. 2R; Caenorhabditis elegans chr. I; Arabidopsis thaliana chr. II; Saccharomyces cerevisiae chr. IV and M. jannaschii. A huge over-representation, reaching million-fold values, has been found for very long tracts of all binary motifs except S, in each of the seven organisms. Long R.Y tracts are the most excessive, except in D. melanogaster, where the K.M motif predominates. S (G, C rich) tracts are in excess mainly in CpG islands; the W motif predominates in bacteria. Many excessively long W tracts are nevertheless also found in the archaeon and in the eukaryotes. The survey of complete chromosomes enables us, for the first time, to map the intergenic regions systematically. In human and other chromosomes we find the highest over-representation of the binary DNA tracts in the intergenic regions. These over-representations are only partly explainable by the presence of interspersed elements.

Conclusions: The over-representation of long DNA tracts composed of five of the above motifs is the largest deviation from randomness so far established for DNA, and this in a wide range of eukaryotic and archaeal chromosomes. A propensity for ready DNA unwinding is proposed as the functional role, explaining the evolutionary conservation of the huge excesses observed.


Background
In 1952, Erwin Chargaff published a paper in which he presented evidence that runs of pyrimidines are highly over-represented in eukaryotic DNA [1]. DNA was "depurinated" in formic acid and the remaining pyrimidines were subsequently size separated by the then novel technique of paper chromatography [2], see also [3]. An unexpectedly high number of pyrimidine and purine tracts ("isostichs"), 9 bases and longer, was found in human, calf, salmon and rye DNA [4,5]. These findings were subsequently corroborated by a number of techniques, including molecular hybridization rates [6][7][8]. The over-representation of long purine and pyrimidine runs could be analyzed exactly once sequences of many genes became available [9,10]. The phenomenon discovered by Chargaff turned out to be a very significant one: over-representation of the longer tracts reaches values of many ten-folds, as will be demonstrated on a genome-wide basis in this paper. Homopurine (R) and homopyrimidine (Y) tracts will be referred to jointly as "R.Y tracts", because whenever a run of pyrimidines is present on one strand, it is complemented by a run of purines on the opposite strand (the dot separates complementary strands, in accordance with IUBMB rules). It should be stressed that alternating A and G (poly A-G) are only one component of R tracts, and any combination of A's and G's can make an R tract (see Additional file: 7).
Examining an increasing number of genes revealed that R.Y tracts are not the only over-represented binary DNA motif. Three additional combinations of two bases are possible [11], namely: A, T only ("W tracts"); G, C only ("S tracts"); and tracts which are G, T on one strand complemented by A, C on the opposite strand (jointly: "K.M tracts"). The S tracts, found in high concentrations in certain regions, are well studied as CpG islands [12]. The abundance of these combinations was previously established in an assortment of mammalian genes [13] and in a yeast chromosome [14]. In bacteria, the W motif, rather than the R.Y motif, was found to be the predominating binary motif [15,16].
In this paper, we shall map the occurrence of binary tracts in seven recently sequenced chromosomes, representing the major currently studied eukaryotic and archaeal phyla (previous studies encompassed mainly incidentally selected gene regions). These chromosomes, especially the human and plant ones, also represent a large selection of intergenic regions not previously mapped. It will be shown that the huge over-representation is prevalent in all the selected chromosomes, in particular in their intronic and intergenic subregions. A functional significance of this remarkable departure of real DNA from random DNA has yet to be established. We have previously suggested, based on our experimental findings [17], that a DNA unwinding role, necessary for initiation of transcription, replication and other DNA directed processes, could be involved, as will be detailed in the Discussion.

R.Y tracts in chromosome 22
The chromosomes selected and their basic data are given in Table 1. Program TRACTS was applied to map the occurrence of binary DNA tracts in these chromosomes (see Methods). The occurrence of R.Y tracts of different lengths in "contig 23", the main contig of human chromosome 22 (66.6% of the chromosome), is shown in Table 2. Columns 2 and 3 of the table list the number of R and Y tracts of each length found in the GenBank-listed strand. Opposite each Y tract there is of course an R tract, and vice versa. The number of R tracts of each length can be seen to be roughly equal to the number of Y tracts. This justifies the joint consideration of the R and Y tracts as a pair (R.Y) at this stage.
Every tract length up to 78 nt is represented, and many longer tracts are present. The longest tract found is 367 nt long, an R tract (second column). Column 5 shows the number of R.Y tracts expected in random DNA of the same length and base composition as the analyzed contig (see Methods). It is seen that the number of tracts expected decreases much more rapidly than the number of tracts observed (column 4). In fact, for all tracts longer than 23 nt not even a single tract is expected in randomized DNA (see column 5), while 644 such tracts are found at that length alone (column 4)! This enormous over-representation certainly calls for a biological explanation. The extent of over-representation is listed in column 9, which gives the ratio between the number of tracts (or bases) observed and the number of tracts (or bases) expected in random DNA. This ratio is below unity for the first three rows, namely for single pyrimidines (purines) flanked by two purines (pyrimidines), their doublets and triplets. The low ratio for the short tracts compensates for the over-representation of the longer tracts, which increases steadily up to enormous figures for the higher lengths (column 9). The increase in ratios is relatively smooth, as can also be seen in Fig. 1a, which indicates that a property special to a particular length or length group is not responsible for the high excesses found. We shall use the found/expected ratio values ("f/e ratios") as the main measure for the extent of binary tract over-representation in the coming Tables. In the last column, the f/e ratio is listed for all tracts longer than or equal to (also "Greater or Equal", or "GE") the length given in the first column (calculated as GE bases found divided by GE bases expected, eq. (4)). The GE value is more meaningful for the longer tract lengths, where only few tracts are encountered, so that individual f/e ratios lose their significance.
Tables similar to Table 2 have been constructed for a series of other genomes as well as for the other binary DNA motifs. The data for human chr. 21 and the Drosophila chromosome are shown in Figs. 1b and 1c as well as in Additional files: 1 and 2, and also at the authors' web site, http://www.weizmann.ac.il/~lcyagil. The found/expected ratio values (f/e ratios) will be shown in most of the following tables, as the criterion for over-representation. [...] (20 Mb) where the longest tract is just 70 nt. It will be seen that the Drosophila chromosome is exceptional in other respects as well. A correlation between the size of the longest tract up to which every length is present and the length of the input DNA sequence is also observed ( [...] somes not far behind; the short M. jannaschii leads occasionally for 7-11 nt tracts. These conclusions can also be seen in Fig. 2a. Between the two human chromosomes, the gene poor chromosome 21 takes the lead.

K.M tracts in seven chromosomes
It was noted earlier that not only R.Y tracts are over-represented, but, at first a bit counter-intuitively, the other three binary DNA combinations as well. Thus, K.M tracts were found in large excess in the human β globin complex and in organelle DNA [13], as well as in yeast chromosome 3 [14]. The data in Tables 4 and 5 show that these findings can be extended to the wider range of phyla studied here. In Table 4, the f/e ratios for K.M tracts are shown. As for the R.Y pair, the detailed outputs for each chromosome (see Additional files: 1, 2 and 3) show that roughly equal numbers of K tracts and M tracts are present in the analyzed strand, and justifies their joint consideration. Overall, it is clear that K.M tracts are also highly over-represented, in all seven chromosomes, even if to a lesser extent than the R.Y tracts. In humans, contig 23 of chromosome 22 shows the highest over-representations but beyond 67 nt many lengths are missing, the longest tract being just 91 nt long. In contig 28 many K.M lengths beyond 62 nt are missing; there are only two K.M tracts longer than 100 nt (101 nt, 268 nt). The f/e ratios for K.M tracts are sometimes even higher than for R.Y tracts (in chr. 21 there are 9 cases between 32 and 51 nt and 5 cases in chr. 22). Beyond 52 nt, f/e ratios are always higher for R.Y than for K.M tracts.
The interesting genome is again Drosophila: Here the overrepresentation of K.M tracts is eventually 2-3 times higher than for the R.Y tracts (Fig. 1c) and is sometimes higher than in the human chromosomes (between 10-20 nt it is as high as in chr. 21 and not much lower at other lengths, Table 4). Whatever the function of the binary tracts may be, in Drosophila that function seems to be taken over, at least partly, by the K.M tracts. All K.M tract lengths are represented up to 45 nt, the longest K.M tract being just 74 nt. In Arabidopsis the K.M tracts are again in high excess, but to a lesser extent than the R.Y motif -there are only two tracts longer than 48 nt (50 and 58 nt). The excess of K.M tracts in elegans and in yeast is less by an order of magnitude compared to humans ( Fig. 2b; except for the yeast telomere), and is marginal but still significant in the archeon. Control runs with the same 5 × 1 Mb random sequences, but for the K.M motif, remain close to unity as expected (Table 4); the longest tract in this case is 24 nt long, present in a single random 1 Mb sequence.
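The control runs mentioned above can be illustrated with a short Python sketch. This is not the TRACTS program: the function names, the shortened sequence length (200 kb instead of 1 Mb, to keep the run fast) and the seed are assumptions made for illustration only. A random sequence with exactly 25% of each base is produced by shuffling, maximal K and M runs are counted by length, and the counts are compared with the expectation of eq. (1) in Methods.

```python
import random
from collections import Counter
from itertools import groupby

def balanced_random_dna(length, seed):
    """Random sequence with exactly 25% of each base (length divisible by 4)."""
    bases = list("ACGT" * (length // 4))
    random.Random(seed).shuffle(bases)
    return "".join(bases)

def km_tract_counts(seq):
    """Count maximal K (G,T) and M (A,C) runs, keyed by run length."""
    counts = Counter()
    for _, run in groupby(seq, key=lambda b: b in "GT"):
        counts[sum(1 for _ in run)] += 1
    return counts

def expected_count(total_len, p, l):
    """Eq. (1): expected number of tracts of exactly length l (both members)."""
    q = 1.0 - p
    return total_len * (p**l * q**2 + q**l * p**2)

seq = balanced_random_dna(200_000, seed=1)   # hypothetical length and seed
found = km_tract_counts(seq)
fe_ratios = {l: found[l] / expected_count(len(seq), 0.5, l) for l in range(1, 11)}
for l in range(1, 11):
    print(l, found[l], round(fe_ratios[l], 2))
```

For short lengths, where many tracts are counted, the printed f/e ratios should scatter around unity; at longer lengths the few tracts present make individual ratios noisy, which is the reason the text averages several control sequences.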

W and S tracts in the seven chromosomes
Figure 1. Binary tract over-representation in three chromosomes: The log ratios of found to expected number of binary tracts (f/e ratios) are plotted against tract length. Control runs are average values of five randomized DNA tracts of 1 Mb each, see Table 2, Additional Table 4 and text. a) Contig 23 of human chromosome 22, see Table 1. b) Contig 28 of human chromosome 21. c) Chromosome 2 of D. melanogaster, right arm. Tracts were plotted up to just 40 nt, to enhance resolution and to make visible the under-representation of very short tracts (f/e ratio below unity).

W and S tracts are each autocomplementary, rather than complementing one another. W and S tracts are therefore compared separately. The f/e ratios for W tracts are shown in Table 5 and Fig. 2c. It is seen immediately that W tracts are also over-represented, but to a more variable degree than R.Y or K.M tracts. A difference of more than 100 fold is evident between the two human chromosomes for W tracts longer than 32 nt: At that length, f/e = 18,990 in contig 23, vs. f/e = 227 in contig 28. This large difference is partly due to the sensitivity of the calculated value to the percentage of AT, which is 60.9% in contig 28 vs. 52.6% in contig 23 (%AC and %AG are always close to 50%, "the second Chargaff parity rule", see end of Discussion). A far higher number of W tracts is thus expected in chr. 21 by eq. (1), simply due to different p and q values. In addition, the 60.9% AT of contig 28 is an average between a very gene poor half with a high %AT (~64% between coordinates 0-7 Mb, see Additional file: 8) and a gene richer half with 56% AT (towards the telomere of the chromosome). The actual f/e ratio in the gene rich domain is much closer to that of contig 23.
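The sensitivity of the expected W tract number to %AT can be made concrete with a small calculation. The sketch below is an illustration only: it uses the single-member form of eq. (1) (one term of the sum, as described in Methods), the two AT fractions quoted above, and a hypothetical common sequence length of 1 Mb so that only the effect of base composition is compared. The function name is an assumption.

```python
def expected_w_tracts(total_len, at_fraction, l):
    """Single-member form of eq. (1): expected W tracts of exactly length l."""
    p = at_fraction          # fraction of A+T
    q = 1.0 - p              # fraction of G+C
    return total_len * p**l * q**2

# Same hypothetical length for both, so only p and q differ.
per_mb_contig28 = expected_w_tracts(1_000_000, 0.609, 32)   # chr. 21, 60.9% AT
per_mb_contig23 = expected_w_tracts(1_000_000, 0.526, 32)   # chr. 22, 52.6% AT
expected_ratio = per_mb_contig28 / per_mb_contig23
print(f"expected W tracts of 32 nt, contig 28 / contig 23: {expected_ratio:.1f}")
```

Because the expected count scales as p raised to the tract length, a difference of a few percent in %AT compounds strongly at 32 nt, so a large part of the difference in f/e ratios can arise from the denominator alone.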
In yeast, Arabidopsis and jannaschii (68.5 %AT!), W tracts are under-represented up to 15 nt, but then are increasingly over-represented, reaching an excess of hundred-folds for 30-40 nt tracts. The C. elegans chromosome contains few very long W tracts, up to 96 nt. Again -the relatively low excess of W is partly due to the high percent AT. The actual number of tracts, not f/e ratios, is closer to that of the R.Y or K.M motifs (Additional files: 1, 2 and 3). It should be added that the high % AT can be explained only very partly by the mere presence of many long W tracts, because more than 89% of the A's and T's reside in the majority of short, underrepresented tracts, up to 10 nt; a certain compensation may be in place for strict quantitative comparison. Still, it can be concluded that the W motif in eukaryotes is also an extensively over-represented binary motif, in similarity to the situation in bacteria [15].
Finally, S tracts. There are many fewer long S tracts in all the chromosomes studied (data in Additional file: 5). S tracts are often concentrated near transcription start sites, as part of the well studied CpG islands [12,25]. Thus, in contig 23 (47.4% G,C) only five S tracts longer than 37 nt are found (56 nt being the longest). Nevertheless, in the 12-37 nt range, over-representations increase from 1.12 up to 480,000 fold. In Arabidopsis (only 35.9% G,C) the longest S tract is 20 nt, but over-representation still increases steadily up to 200 fold, at length 20. S tracts can thus be considered as another member of the over-represented class. Program TRACTS can be a convenient tool for detecting CpG islands, especially in its web version [26].

Distribution in genic subregions
In which genic subregions do the excessive tracts reside? Subprogram ANEX distributes the tracts between exon, intron and intercoding or intergenic classes. The term intergenic is appropriate when mRNA entries are parsed; in that case, UTR regions are evaluated as exons. The distribution between exons, introns and intergenic regions of all tracts 15 nt and longer (GE 15) is shown in Table 6. W and [...]

The over-representation of binary tracts in chromosomes of six organisms

The lower excess in exons can be expected, since, for instance, an oligopurine on the coding strand imposes mostly polar amino acids on the coded protein (all-purine codons code for lys, arg, glu, also for gly).
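The constraint that an oligopurine run places on the coded protein can be checked by listing the eight codons that consist only of A and G. The sketch below hard-codes their assignments from the standard genetic code; the variable names are illustrative.

```python
from itertools import product

# Standard genetic code, restricted to the eight all-purine codons.
PURINE_CODONS = {
    "AAA": "Lys", "AAG": "Lys",
    "AGA": "Arg", "AGG": "Arg",
    "GAA": "Glu", "GAG": "Glu",
    "GGA": "Gly", "GGG": "Gly",
}

# Every triplet that can be built from purines only.
all_purine = ["".join(c) for c in product("AG", repeat=3)]
amino_acids = sorted({PURINE_CODONS[c] for c in all_purine})
print(amino_acids)   # ['Arg', 'Glu', 'Gly', 'Lys']
```

An in-frame purine tract on the coding strand can therefore specify only these four residues, three of which are charged.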

Longer tracts and Summary
Introns are the subregion in which K.M tracts are the most excessive, except for C. elegans and M. jannaschii. In the fly, introns have more excessive K.M tracts than R.Y tracts. The introns are the subregion richest in R.Y tracts in the fly, C. elegans and chr. 21 by the criterion used (≥ 15 nt). The well-known oligopyrimidine close to the 3' splice site contributes to the excess of Y tracts in introns. We also observe, in the full sequence outputs, many long binary tracts in the UTR regions, particularly in the 3' UTR. An example can be seen in reference [26]: The three R.Y tracts above position 19,000 of p53 listed there are in the 3' UTR of the gene. A suggested RNA stability signal of 9 W bases [27] may explain some of the W tracts, but many other long tracts, of all motifs, are found in the 3' UTR region, often appearing in blocks, and call for an explanation. A reviewer inquired how over-representation varies along a chromosome. The data in Additional file: 8 show that for contig 28, f/e ratios for R.Y tracts decrease somewhat from the A,T rich, gene poor "desert" in the first half to the gene rich second half. The f/e ratio of the W tracts increases even more strongly in the same direction, but that may be due to the fact that expected values increase strongly with % A,T while the tracts actually found increase much less, if at all.

Interspersed elements
A major finding of the human sequencing project was that a very high portion of the human chromosomes consists of various interspersed elements introduced into the genome. To what extent can these elements be responsible for the over-represented binary tracts? For instance, most Alu elements contain, at their end, 20-30 consecutive A's partly incorporated into the genome. To answer this question, several genes and chromosomal contigs were run by TRACTS after having been "masked" (interspersed elements taken out). This was done with program RepeatMasker, with parameter -nolow; this means that "simple" repeats and certain other low complexity tracts are not taken out; only LTR, MER, LINE and SINE elements were masked out (mainly Alu runs, Additional file: 6). The longest sequence we could run was contig "3.45" of Chr. 21, which is the q-most contig of the chromosome, a relatively gene rich contig with 51.5% GC. After masking, 2,125,818 bases out of the original 3,450,347 bases remained (61%). The masked sequence was subjected to TRACTS. The results (Table 7) show that over-representation of all three binary pairs is reduced, but only to a limited extent; over-representation remains high for all three binary compositions. The most reduced motif is the W motif, possibly because of the last bases of the Alu element. This means that a certain share of the long tracts does indeed reside in the inserted elements, but that many long tracts do reside in the non-masked fraction. This was true even when the "simple" and the "low complexity" elements were also masked out. It is clear that over-representation [...]

A - (hyphen) means that no tract of that length is present.

Discussion
The main finding reported here is that DNA tracts consisting of only two of the bases are in vast excess all over the animal and plant kingdoms, reaching mega-fold values. The highest excesses are found for R.Y tracts in humans and in other mammals, as observed originally in the pioneering work of Erwin Chargaff and coworkers [29]. In certain organisms, as in Drosophila, K.M tracts prevail. In bacteria, W tracts are the most over-represented binary motif [15], a finding also anticipated by Chargaff and coworkers [29]. One caveat: only one chromosome or contig, from a single species in a particular phylum, is discussed here, except for the two human contigs. Two yeast chromosomes and one Drosophila segment were previously reported, and all show similar abundances [14,30].

[...] (Table 2). Long tracts are thus not necessarily the major factor determining the dinucleotide signature. It is worth noting that the D. melanogaster chromosome, besides the high K.M ratios, also manifests the highest excess of long W tracts (Table 5), along with E. coli and H. influenzae; a closer relationship between these organisms has also been noted when dinucleotide signatures of E. coli and Drosophila were compared [31].

[...] For instance, we have seen that purines avoid being flanked by pyrimidines, and prefer to be flanked by purines. Specifically, a single A or a single G prefers an A or a G base next to it. This effect is formally a first order Markov effect, but we prefer the biological viewpoint that a particular function with selective advantage, rather than an inherent neighbor effect, drives the bases together to form binary tracts. A neutral, nonfunctional driving force towards excess of purine.pyrimidine caused by different substitution mutation rates has indeed been noted [32]. The substitution rates in the direction of all-purine or all-pyrimidine tracts were however the lowest [32] and are therefore unlikely to explain the massive excesses of R.Y tracts observed.
The vast excess of long binary tracts raises two questions: Is an essential structural and/or functional role responsible for the high numbers of binary tracts in the range of species studied? And if so, has that property been conserved throughout evolution, or have convergent processes been responsible for their wide spread presence? As to the second question, the reappearance of massive W tracts in Drosophila can be quoted in favor of independent (convergent) evolution, while if conservation would be the answer, an early progenitor with only a binary code could be suspected. A previous suggestion of an early RNY or YRN progenote is not in line with an all purine or all pyrimidine progenote [33]. More comparative binary DNA mapping will be required to answer this question.
This leaves the question of what can the essential function be. We, and others, have proposed that a special propensity of the binary tracts to unwind and be strand separated may be responsible. Ready unwinding is certainly expected for W tracts, based on their established melting properties. As to R.Y tracts, Weintraub and Larsen showed, in their seminal work [34], that certain purine/pyrimidine rich sequences in the 5' promoter region of the chicken beta globin gene complex are sensitive to single-strand DNA specific nucleases. Sensitivity to single-stranded specific nucleases means that these binary DNA regions have to be strand separated, at least temporarily. Since 1982, R.Y tracts in promoters of many genes (reviewed in [35]) have been found to be attacked by single-strand specific nucleases and hence are likely to undergo a transition into a strand separated state, at least temporarily. The list of these promoters includes a number of yeast and bacterial sequences characterized as AT rich [36,37]. The singlestrand nuclease sensitive regions have been called by Umek and Kowalski DNA Unwinding Elements, or DUE's [38]. Evidence from modification by chemical reagents attacking only unpaired bases, like permanganate [39,40], chloroacetaldehyde [41] and osmium tetroxide [42] support at least intermittent conversion of the attacked strands into an unwound state [43,44]. We have previously found that in yeast chromosomes III and XI [14] the highest binary tract concentrations are in the 5' promoter regions. This intriguing observation deserves a separate analysis of the promoter regions, which is in progress.
In our experimental work [17] we studied two yeast promoters that contain long oligopyrimidine tracts, namely the promoter regions of gene cyc1, which has an oligopyrimidine tract of 40 nt, and of gene ded1, with a 32 nt pyrimidine tract (interrupted by a TATA box). These oligo Y regions, and their complementary R tracts, were found to be sensitive to the single-strand specific nuclease P1 when under normal cellular superhelical stress. Topological analysis was consistent with the opening of six turns of the primary helix. These findings strongly support the idea that binary tracts in critical regions can readily unwind and thus facilitate the transcription initiation process, possibly helped by single strand specific proteins. The notion that binary DNA tracts can lead to transitional strand opening can also apply to other DNA directed processes, including recombination, replication and segregation. We found evidence that a long W tract (78 nt) in the centromere of yeast chromosome IV has a strong propensity to form an unwound structure [40]. A role in these processes can provide an explanation for the massive presence of the binary tracts in intergenic regions, far from transcriptional start sites. As said, the early melting of W tracts is a well-established fact, while for S tracts the propensity to be methylated may be involved. It is somewhat harder to understand why R.Y or K.M tracts should readily unwind and form paranemic, unwound DNA structures [35] (also known as local supercoil-stabilized structures [43]), especially when G or C rich. It is possible that the contribution of the different dinucleotides to stability [45] changes under superhelical stress and at ambient temperatures. Experiments to clarify this possibility have yet to be carried out.
It should be noted that the bulk of the binary tracts observed here do not have the internal symmetries associated with paranemic structures such as DNA triplexes, cruciforms or even B-Z junctions. The DUE's are more likely to separate into single-strands and be stabilized by cellular proteins.
Are the observed binary tracts "simple" sequences in the usual sense, i.e. are the observed tracts composed of one or few nucleotides repeated many times, like oligo (C-T)? A detailed inspection of the tracts listed by the program demonstrates that for most tracts this is not the case: To get an idea, the last 15 longer tracts of contig 23, located beyond the last gene of the contig, are shown in Additional file: 7. The list contains a few simple sequences, for instance a 17 nt tract with GA repeated 8 times, ((GA) 8 G). Some longer tracts may also show simple repeats within their sequence. For example, the long 367 nt R tract has a number of GGGAGGAGAGA repeats in it (see Additional file: 7). This repeat covers however only part of the tract and the other parts are much more random. A slippage mechanism [46] would need too many "slippages" to explain this tract, or many other tracts in the lists, as generated by TRACTS. Oligo A or Oligo T tracts are partial components of quite a number of R tracts (as well as of W and M tracts) and for these an additional mechanism may be operative. Nevertheless, the bulk of the binary tracts are just as random a mixture of two nucleotide bases as can be, and cannot be regarded as simple or even cryptic elements [46]. A full quantitative analysis has yet to be undertaken.
Finally, Table 6 documents another intriguing finding connected with the name of Erwin Chargaff, namely that in single strands the percentage of purines is equal to the percentage of pyrimidines. The same equality was found for A+C bases being equal to G+T bases, again in single strands [47,48]. This phenomenon has been termed "the second parity rule of Chargaff". The percentages of A+G and A+C shown in Tables 3 and 6 demonstrate that their closeness to 50% is quite convincing. I have not encountered serious departures from that rule down to the length of individual genes, in all phyla studied, including bacteria. Two explanations have been raised: One explanation is that random inversion of homopurines during evolution caused this equality [49][50][51]. An alternative possibility is that there is a lot of potential or actual secondary structure in genomic DNA [52]. A definite explanation has yet to be provided and is beyond the scope of this paper.
In conclusion, it thus seems that the analytical findings of the late E. Chargaff will keep us busy for a while to come.

Conclusions
This paper documents one of the more significant departures of DNA from randomness, namely that genomes exhibit an enormous excess of DNA tracts composed of only two bases. This phenomenon is conserved throughout evolution, and is therefore likely to reflect a specific DNA function. A most likely function is a propensity of these binary tracts (and possibly additional base combinations) to adopt, under suitable conditions, an alternative, paranemic conformation. This notion is supported by a range of experimental evidence, detailed in the Discussion. We are presently examining whether a particularly high excess of the binary tracts is present in human promoters, as already found in yeast (R.Y tracts) and E. coli (W tracts), supporting a role for the binary DNA tracts in the regulation of transcription and other DNA directed processes.

Methods
Program TRACTS identifies all binary tracts in a given DNA sequence. The program was run in its original FORTRAN version [9,15] on a UNIX platform. A CGI web server version, in Perl, is now available [26] at URL: http://www.weizmann.ac.il/~lcyagil/binaries_refs.html. The program calculates overall binary tract frequencies (see Table 2) as well as distributions in genic subregions: exons, introns and intergenic regions. A further output of TRACTS shows the sequence entered, with each exon and intron indicated and each binary tract beyond a given length shown in color on or below the line. For more details, see [26]. A preprogram, ANEX, parses GenBank/DDBJ/EMBL annotation files (flat format) and produces a file with a one-line entry for each gene, which includes a short comment on the product/function of the gene. When only a single gene or a few genes are examined, a list of all exons and introns is produced.
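The core task described above, finding all maximal runs made of only two bases, can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the FORTRAN or Perl code of TRACTS: the function names and the 0-based coordinates are hypothetical, and the choice to let any character other than A, C, G or T interrupt a tract is an assumption that the source does not specify.

```python
from itertools import groupby

# Each motif pair partitions the four bases into two complementary classes.
MOTIFS = {
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "KM": {"G": "K", "T": "K", "A": "M", "C": "M"},
    "WS": {"A": "W", "T": "W", "G": "S", "C": "S"},
}

def find_tracts(seq, motif):
    """Return (start, length, label) for every maximal binary tract.

    Assumption: characters other than A, C, G, T (e.g. N) break a tract
    and are not reported.
    """
    classes = MOTIFS[motif]
    tracts = []
    pos = 0
    for label, run in groupby(seq.upper(), key=lambda b: classes.get(b)):
        length = sum(1 for _ in run)
        if label is not None:
            tracts.append((pos, length, label))
        pos += length
    return tracts

def long_tracts(tracts, min_len):
    """Keep only tracts of at least min_len bases (the 'GE' selection)."""
    return [t for t in tracts if t[1] >= min_len]

example = find_tracts("AAGCTTANGGT", "RY")
print(example)
print(long_tracts(example, 3))
```

Because every base belongs to exactly one class of a pair, a sequence is fully partitioned into alternating tracts, including tracts of length 1; this is why single bases flanked by the opposite class appear in the first row of Table 2.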
In GenBank files that contain both mRNA and CDS entries, the mRNA entries were parsed. Consequently, UTR regions are mostly part of the exonic sub regions. In yeast, C. elegans and M. jannaschii, where no mRNA data are yet available, the UTR regions are necessarily counted as intergenic (intercoding, to be strict). The accession and version numbers of the genomes analyzed are shown in Table 1. In humans, two large contigs, making up most of chromosomes 21 and 22, were analyzed: The "28" contig of chr. 21 which goes from the centromere through most of the q arm (28,515,322 nt) and makes up 85% of the sequenced chromosome; and the "23" contig of chr. 22 (22,998,450 nt), which makes up 66.6% of the sequenced chromosome.
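How tracts might be distributed between subregions once the annotation has been parsed can be sketched as follows. The source does not describe how ANEX or TRACTS assign a tract that spans a boundary, so the rule used here (assignment by the tract midpoint) is purely an assumption, as are the function names, the 0-based half-open coordinates and the toy gene record. Whether UTRs fall into the exon class depends only on whether the exon intervals come from mRNA or from CDS entries, as explained above.

```python
from collections import Counter

def classify_position(pos, genes):
    """Return 'exon', 'intron' or 'intergenic' for a 0-based position.

    genes: list of (gene_start, gene_end, exons), exons being a list of
    (start, end) half-open intervals taken from mRNA or CDS entries.
    """
    for gene_start, gene_end, exons in genes:
        if gene_start <= pos < gene_end:
            if any(start <= pos < end for start, end in exons):
                return "exon"
            return "intron"
    return "intergenic"

def tally_tracts(tracts, genes, min_len=15):
    """Count tracts of at least min_len per subregion (midpoint rule, an assumption)."""
    tally = Counter()
    for start, length in tracts:
        if length >= min_len:
            midpoint = start + length // 2
            tally[classify_position(midpoint, genes)] += 1
    return tally

# Toy data: one gene with two exons, four tracts given as (start, length).
genes = [(100, 500, [(100, 200), (400, 500)])]
tracts = [(20, 18), (120, 16), (250, 30), (410, 5)]
subregion_tally = tally_tracts(tracts, genes)
print(dict(subregion_tally))
```

The 5 nt tract in the toy data is skipped by the GE 15 threshold, mirroring the selection used for Table 6.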

Expected binary tract frequencies
Frequencies of binary tracts expected in random DNA are calculated as follows. The number of tracts of length l, N(l), expected in random DNA of length L and of fractional base composition p is given by:

$$N(l) = L\,(p^{l} q^{2} + q^{l} p^{2}) \qquad (1)$$

where p, q are the fractions of the participating base pairs, p + q = 1 (p is, e.g., the fraction of A+G). To calculate expected values for only one member of a pair, only one term of the above sum is to be used. The number of bases expected in tracts of length l, n(l), is simply:

$$n(l) = l\,N(l) \qquad (2)$$

The expected number of tracts equal to or greater (GE) than a given length l, N(≥l), can be shown to be [9]:

$$N(\geq l) = L\,(p\,q^{l} + q\,p^{l}) \qquad (3)$$

The expected number of bases in these tracts, n(≥l), is:

$$n(\geq l) = L\,\{(p + ql)\,p^{l} + (q + pl)\,q^{l}\} \qquad (4)$$

The validity of these expressions was tested by generating random DNA sequences and running them by TRACTS. For this paper, five 1 Mb random sequences with exactly 25% of each nucleotide base were generated and run for each binary composition, so that standard deviations could be calculated; these are listed in Additional file: 4. The percentage of W and S bases in the analyzed chromosomes is not 50%, but a control run with 62.5% AT was previously run for H. influenzae, giving the same picture [15].
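The expected-value expressions can be written out as a short Python sketch, which also checks numerically that the cumulative forms, eqs. (3) and (4), agree with summing the per-length form, eq. (1), over all lengths from l upwards. The function names, the example composition (p = 0.6) and the example length are assumptions chosen for illustration; the number of bases in tracts of a single length is taken as l times the number of tracts.

```python
import math

def expected_tracts(L, p, l):
    """Eq. (1): expected number of tracts of exactly length l (both members)."""
    q = 1.0 - p
    return L * (p**l * q**2 + q**l * p**2)

def expected_bases(L, p, l):
    """Expected bases in tracts of exactly length l: l times eq. (1)."""
    return l * expected_tracts(L, p, l)

def expected_tracts_ge(L, p, l):
    """Eq. (3): expected number of tracts of length >= l."""
    q = 1.0 - p
    return L * (p * q**l + q * p**l)

def expected_bases_ge(L, p, l):
    """Eq. (4): expected number of bases in tracts of length >= l."""
    q = 1.0 - p
    return L * ((p + q * l) * p**l + (q + p * l) * q**l)

def fe_ratio(found, expected):
    """Found/expected ratio used throughout the tables."""
    return found / expected

L, p, l = 1_000_000, 0.6, 10          # hypothetical example values
summed_tracts = sum(expected_tracts(L, p, k) for k in range(l, 2000))
summed_bases = sum(expected_bases(L, p, k) for k in range(l, 2000))
print(math.isclose(summed_tracts, expected_tracts_ge(L, p, l)))
print(math.isclose(summed_bases, expected_bases_ge(L, p, l)))
print(math.isclose(expected_bases_ge(L, p, 1), L))
```

The last check reflects that every base belongs to some tract of length 1 or more, so eq. (4) evaluated at l = 1 returns the sequence length L.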